Story Corpus

As part of my thesis, I developed a system for automatically identifying blog posts that are primarily a first-person description of events in the life of the author (a personal story). Over 5,000 weblog posts were manually annotated according to our definition (about 5% were positive examples), and a statistical classifier was trained on this data. The classifier was applied to 44 million weblog posts from a two-month span in 2008, and 1.4 million stories were identified. This corpus can be recreated by using the following steps:
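
As an aside on the classification stage (this is not one of the recreation steps), here is a minimal sketch of how a story classifier of this general kind could look in Python. It is an illustration under assumptions only: the description above says just that a statistical classifier was trained, so the bag-of-words Naive Bayes model, the `train`, `log_score`, and `is_story` helpers, and the six toy posts are all hypothetical and are not the thesis system or its annotated data.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train(labeled_posts):
    """Count words per class from (text, is_story_label) pairs."""
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, label in labeled_posts:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, doc_counts, vocab


def log_score(text, label, model):
    """Log prior plus add-one-smoothed log likelihood of the text under one class."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    total_words = sum(word_counts[label].values())
    score = math.log(doc_counts[label] / total_docs)
    for token in tokenize(text):
        if token in vocab:  # words never seen in training are ignored
            smoothed = (word_counts[label][token] + 1) / (total_words + len(vocab))
            score += math.log(smoothed)
    return score


def is_story(text, model):
    """Label a post as a story when the story class scores higher."""
    return log_score(text, True, model) > log_score(text, False, model)


# Hypothetical toy annotations (assumption): stories are the minority class,
# loosely echoing the skew in the real annotations.
TOY_POSTS = [
    ("Yesterday I went to the lake and my brother fell in the water", True),
    ("Last night we drove home and I lost my keys", True),
    ("Top ten gadgets to buy this season", False),
    ("The senate will vote on the new budget bill", False),
    ("Download the latest driver update for the new release", False),
    ("Review of the new phone and its camera", False),
]
MODEL = train(TOY_POSTS)

print(is_story("I went home and lost my keys last night", MODEL))
print(is_story("The new budget review", MODEL))
```

Applied at scale, a classifier along these lines would be run over each post in turn, and the posts it labels as stories would be kept as the corpus.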

Relevant Publications

Andrew S. Gordon and Reid Swanson. 2008. StoryUpgrade: Finding Stories in Internet Weblogs. In International Conference on Weblogs and Social Media, Seattle, Washington, March.
Andrew S. Gordon, Qun Cao, and Reid Swanson. 2007. Automated Story Capture from Internet Weblogs. In Proceedings of the 4th International Conference on Knowledge Capture, pages 167–168, Whistler, BC, Canada. ACM.
Reid Swanson. 2007. First Person Narrative Story Extraction and Retrieval. Master's thesis, University of Southern California.