Story Corpus
As part of my thesis I developed a system for automatically identifying blog posts that were primarily a first-person description of events in the life of the author (a personal story). Over 5,000 weblog posts were manually annotated according to this definition. A statistical classifier was trained on the data to distinguish posts containing a story from those that do not. About 5% of posts were labeled as positive examples.
The classifier was applied to 44 million weblog posts covering a two-month span in 2008, and 1.4 million stories were identified. This corpus can be recreated with the following steps:
- Downloading the Spinn3r dataset from the ICWSM dataset challenge website
- Extracting the relevant posts using this index. The index is a tab-delimited file with 4 columns:
  - the relative path to the Spinn3r dataset file
  - the classifier score (from a linear classifier)
  - the starting line number of the Spinn3r item
  - the final line number of the Spinn3r item
If you make use of this corpus, please cite one of the relevant publications.
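The extraction step can be sketched in Python roughly as follows. This is a minimal illustration rather than the tooling used to build the corpus: the names `IndexEntry`, `read_index`, and `extract_item` are hypothetical, and the code assumes that the line numbers in the index are 1-based and inclusive and that the dataset files can be read as UTF-8 text. Both assumptions should be checked against the actual data.

```python
import csv
from pathlib import Path
from typing import Iterator, NamedTuple


class IndexEntry(NamedTuple):
    """One row of the index (hypothetical name, not part of the corpus)."""
    path: str     # relative path to the Spinn3r dataset file
    score: float  # score from the linear classifier
    start: int    # starting line number of the Spinn3r item
    end: int      # final line number of the Spinn3r item


def read_index(index_file: str) -> Iterator[IndexEntry]:
    """Yield one IndexEntry per well-formed row of the tab-delimited index."""
    with open(index_file, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) != 4:
                continue  # skip blank or malformed rows
            path, score, start, end = row
            yield IndexEntry(path, float(score), int(start), int(end))


def extract_item(dataset_root: str, entry: IndexEntry) -> str:
    """Return lines entry.start..entry.end of the referenced dataset file.

    Assumption: line numbers are 1-based and inclusive. Adjust the
    comparisons below if the index turns out to use another convention.
    """
    selected = []
    source = Path(dataset_root) / entry.path
    with open(source, encoding="utf-8", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            if number > entry.end:
                break  # past the item, no need to read further
            if number >= entry.start:
                selected.append(line)
    return "".join(selected)
```

Given a local copy of the dataset, one would iterate over `read_index(...)` and call `extract_item(...)` for each entry, optionally keeping only entries whose score exceeds a chosen threshold. Because many index rows are likely to point into the same dataset file, grouping entries by path before reading would avoid rescanning each file.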
Relevant Publications
Andrew S. Gordon and Reid Swanson. 2008b. StoryUpgrade: Finding Stories in Internet Weblogs. In International Conference on Weblogs and Social Media, Seattle, Washington, March.
Andrew S. Gordon, Qun Cao, and Reid Swanson. 2007. Automated story capture from internet weblogs. In Proceedings of the 4th international conference on Knowledge capture, pages 167–168, Whistler, BC, Canada, October. ACM.
Reid Swanson. 2007. First Person Narrative Story Extraction and Retrieval. Master's thesis, University of Southern California.