Topic-Based Novelty Detection

Computers are increasingly used to manage the large volumes of news and information now available in electronic form. The computer's task is to organize incoming data into related segments or stories and to index them so that they are easier for the user to digest.

A key problem in digesting new data is deciding which parts contain redundant information, so that attention can be focused on the new material. This project proposes to investigate the analysis of newly arrived news stories for two purposes: (1) to decide whether a story discusses an event or topic that has not been seen earlier (first story detection); and (2) to identify, within a sequence of stories on the same pre-defined topic, which portions of subsequent stories contain new information, and to determine the new named entities that are central to the topic (within-topic novelty detection). The project will focus on extending and combining Information Retrieval and Natural Language Processing extraction techniques to address these questions. Specifically, the team will look at identifying who/where/when entities and at how to use them in Information Retrieval and other language modeling approaches to the problem. An important component of the proposed project is investigating how detection results are affected when the input is (degraded) text produced by a speech recognition system. The evaluation of the project’s results will be based on established measures from the Topic Detection and Tracking (TDT) initiative in the case of first story detection, and on the accuracy of aligning predicted new text with actual new information (as identified by human experts prior to the workshop) in the case of within-topic novelty detection.
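To make the first story detection task concrete, the following is a minimal sketch, not the project's method. It assumes a simple baseline in which each arriving story is compared with all earlier stories by cosine similarity over raw term counts, and is flagged as a first story when its best match falls below a threshold. The function names (term_vector, cosine, detect_first_stories) and the threshold value of 0.2 are hypothetical choices made only for illustration.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words term counts (assumption: no stemming or stopword removal)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def detect_first_stories(stories, threshold=0.2):
    """Process stories in arrival order.

    Returns one (is_first_story, best_similarity) pair per story.
    The threshold is an illustrative value, not a tuned one.
    """
    seen = []       # term vectors of all earlier stories
    decisions = []
    for text in stories:
        vec = term_vector(text)
        # Similarity to the closest earlier story; 0.0 if none seen yet.
        best = max((cosine(vec, old) for old in seen), default=0.0)
        decisions.append((best < threshold, best))
        seen.append(vec)
    return decisions


stories = [
    "earthquake strikes coastal city",
    "earthquake strikes coastal city again rescue continues",
    "senate passes budget bill",
]
decisions = detect_first_stories(stories)
flags = [is_first for is_first, _ in decisions]
```

In this toy stream the first and third stories open new topics, while the second overlaps heavily with the first. A system along the lines the proposal describes would presumably replace raw term counts with weighted terms and named-entity features, and would be scored with the TDT measures rather than by inspection.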

Final Report


Team Members 
Senior Members
James Allan, UMass
Hubert Jin, BBN
Martin Rajman, EDFL
Charles Wayne, DoD
Graduate Students
Daniel Gildea, ICSI
Victor Lavrenko, UMass
Undergraduate Students
David Caputo, Yale
Rose Hoberman, UTexas

Center for Language and Speech Processing