Story Corpus

As part of my thesis, I developed a system for automatically identifying blog posts that are primarily first-person descriptions of events in the author's life (personal stories). Over 5,000 weblog posts were manually annotated according to this definition (about 5% were positive examples), and a statistical classifier was trained on the annotated data. The classifier was then applied to 44 million weblog posts spanning two months in 2008, and it identified 1.4 million stories. This corpus can be recreated using the following steps:

Relevant Publications

