As part of my thesis, I developed a system for automatically identifying blog posts that are primarily a first-person description of events in the author's life (a personal story). Over 5,000 weblog posts were manually annotated according to this definition (about 5% were positive examples), and a statistical classifier was trained on this data. The classifier was then applied to 44 million weblog posts spanning two months in 2008, and it identified 1.4 million stories. This corpus can be recreated by following these steps:
- Download the Spinn3r dataset from the ICWSM dataset challenge website.
- Extract the relevant posts using this index. The index is a tab-delimited file with 4 columns:
  - the relative path to the Spinn3r dataset file
  - the classifier score (from a linear classifier)
  - the starting line number of the Spinn3r item
  - the final line number of the Spinn3r item
If you make use of this corpus, please cite one of the relevant publications.
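The extraction step can be sketched in Python as follows. This is a minimal illustration and not the original tooling. It assumes that line numbers in the index are 1-based and inclusive, and that the dataset files are uncompressed UTF-8 text; neither is confirmed by the description above, so adjust as needed. The function names `read_index` and `extract_item` and the optional `min_score` threshold are hypothetical.

```python
from itertools import islice
from pathlib import Path


def read_index(index_path, min_score=None):
    """Yield (relative_path, score, start_line, end_line) for each index row.

    min_score is a hypothetical filter: rows scoring below it are skipped.
    """
    with open(index_path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 4:
                continue  # skip blank or malformed rows
            rel_path = fields[0]
            score = float(fields[1])
            start, end = int(fields[2]), int(fields[3])
            if min_score is None or score >= min_score:
                yield rel_path, score, start, end


def extract_item(dataset_root, rel_path, start, end):
    """Return lines start..end of one dataset file as a single string.

    Assumes 1-based, inclusive line numbers (an assumption, see above).
    """
    path = Path(dataset_root) / rel_path
    with open(path, encoding="utf-8", errors="replace") as f:
        return "".join(islice(f, start - 1, end))
```

A typical use would loop over `read_index("index.tsv")` and call `extract_item` with the directory where the Spinn3r dataset was unpacked; both paths here are placeholders. Reading each file once per item is simple but slow at this scale, so grouping index rows by file before extraction may be worthwhile.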