Strategies for Coreference from the Perspective of Information Exploitation – Breck Baldwin (Alias-i, Inc.)

December 6, 2005 all-day

Coreference entices us with the promise of radically improved information exploitation via data mining, search, and information extraction. Coreference in its canonical form involves equating text mentions of Abu Musab al-Zarqawi with mentions in Arabic, phone calls that reference him, and images that contain him. Once such a foundation of coreference is established over a body of information, questions like “get me all individuals with some relation to al-Zarqawi” become feasible. It is also a dynamite research problem. Progress has been made in text media, with apparently excellent results in named entity recognition, pronoun resolution, cross-document individual resolution, and database linking. This suggests that some sort of Uber-search/indexing engine should fall out the bottom of a series of 90% F-measure results in these key areas. Unfortunately, this is not the case, and for good reasons. In this talk I will argue that there are fundamental flaws in how we think about coreference in the context of information access. The argument ranges from basic philosophical issues about what an entity or an ontology is to an analysis of why first-best approaches to entity detection hobble performance in significant ways. As a proposed strategy for approaching the problem, I will discuss our own efforts in two directions: 1. targeting known entities using match filtering as well as n-best driven analysis with character language models, and 2. targeting unknown entities with n-best chunking approaches to named entity extraction, as opposed to the first-best approaches commonly used.
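The contrast between first-best and n-best entity detection can be illustrated with a minimal Python sketch. It is not the speaker's implementation: the n-best list, its probabilities, the span format, and the function names first_best and span_confidences are all hypothetical, chosen only to show how committing to the single top analysis can discard a mention that lower-ranked analyses still support.

```python
from collections import defaultdict

# Hypothetical n-best output of a chunker for one sentence. Each analysis is
# (probability, set of (start, end, type) spans). All numbers are invented.
N_BEST = [
    (0.50, {(0, 1, "PERSON")}),
    (0.30, {(0, 3, "PERSON")}),
    (0.20, {(0, 3, "PERSON"), (5, 6, "LOCATION")}),
]


def first_best(analyses):
    """Return only the spans of the single highest-probability analysis."""
    _, spans = max(analyses, key=lambda a: a[0])
    return spans


def span_confidences(analyses):
    """Sum the normalized probability of every analysis containing each span."""
    total = sum(p for p, _ in analyses)
    conf = defaultdict(float)
    for p, spans in analyses:
        for span in spans:
            conf[span] += p / total
    return dict(conf)


top = first_best(N_BEST)
conf = span_confidences(N_BEST)
```

In this toy data the first-best analysis keeps only the short PERSON span, while the longer span (0, 3) appears in two lower-ranked analyses whose combined weight matches the top one. A downstream matcher for a known entity could still consider that span if it is given the n-best confidences rather than the single top analysis.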

