Story Corpus

As part of my thesis, I developed a system for automatically identifying blog posts that are primarily a first-person description of events in the life of the author (a personal story). Over 5,000 weblog posts were manually annotated according to this definition, and about 5% of them were labeled as positive examples. A statistical classifier was then trained on the data to distinguish posts containing a story from those that do not.
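
As a rough illustration of this kind of story classifier, here is a minimal sketch of a bag-of-words Naive Bayes model in Python. The choice of model, the function names (tokenize, train, story_log_odds, is_story), and the six toy posts are all assumptions made for illustration; they are not the features, model, or annotated data used in the thesis.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train(examples):
    """Fit a multinomial Naive Bayes model from (text, is_story) pairs."""
    doc_counts = Counter()                             # posts per label
    word_counts = {True: Counter(), False: Counter()}  # token counts per label
    for text, label in examples:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts[True]) | set(word_counts[False])
    return doc_counts, word_counts, vocab


def story_log_odds(model, text):
    """Return log P(story | text) minus log P(non-story | text)."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in (True, False):
        # Class prior: with few positive examples it favours "not a story".
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token in vocab:  # skip words never seen in training
                # Laplace (add-one) smoothing avoids zero probabilities.
                count = word_counts[label][token] + 1
                score += math.log(count / (total_words + len(vocab)))
        scores[label] = score
    return scores[True] - scores[False]


def is_story(model, text):
    """Label a post as a story when the log-odds are positive."""
    return story_log_odds(model, text) > 0


# Hypothetical toy posts, invented for this sketch (not from the corpus).
TOY_DATA = [
    ("yesterday i went to the lake with my brother and we got lost", True),
    ("last week my car broke down and i walked home in the rain", True),
    ("the new phone has a faster processor and a better camera", False),
    ("ten tips for improving the speed of your website", False),
    ("the election results show a shift in voter turnout", False),
    ("this recipe needs flour sugar eggs and butter", False),
]

model = train(TOY_DATA)
print(is_story(model, "i went home and my brother walked with me"))
print(is_story(model, "the camera has a better processor"))
```

With only about 5% positive examples, a real classifier's prior leans heavily toward "not a story", so the decision threshold (here fixed at zero log-odds) would normally be tuned on held-out annotated posts to balance precision and recall.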

The classifier was applied to 44 million weblog posts covering a two-month span in 2008, and 1.4 million stories were identified. This corpus can be recreated by using the following steps:

Relevant Publications

Andrew S. Gordon and Reid Swanson. 2008b. StoryUpgrade: Finding Stories in Internet Weblogs. In International Conference on Weblogs and Social Media, Seattle, Washington, March.

Andrew S. Gordon, Qun Cao, and Reid Swanson. 2007. Automated Story Capture from Internet Weblogs. In Proceedings of the 4th International Conference on Knowledge Capture, pages 167–168, Whistler, BC, Canada, October. ACM.

Reid Swanson. 2007. First Person Narrative Story Extraction and Retrieval. Master's thesis, University of Southern California.