This page contains lists of questions, large and small, that we are
hoping to investigate. When we develop an approach to answering a
question, we will describe what we did to get there as well as the
answer we believe we found.
The questions under investigation are listed first, in no
particular order, followed by open questions that we
have not tackled yet and, finally, the questions that we have
already answered.
Questions being investigated
- Do named entity co-occurrences indicate anything interesting with
regard to first stories? For example, is "Clinton and Yeltsin"
different from "Clinton" and "Yeltsin"? (Dan
Gildea)
- Can we find some term weighting schemes that might be more
effective than simple tf-idf measures? And should we remove some
features (e.g., numbers or one-character "words") from the set we're
considering? (Hubert Jin)
- How can we build confidence measures that are a hybrid of multiple
systems? (Martin Rajman)
- Do named entities occur in a different way in early stories of a
topic than they do in later stories? For example, do the early
stories tend to introduce more new named entities? (Rose Hoberman)
- Do the similarity functions that we're using (e.g., cosine and a
weighted average) successfully capture relationships between stories
in a topic? That is, are within-topic scores higher than across-topic
scores? (Victor Lavrenko)
- For the named entity work that we're doing, does the fact that
named entities are not normalized have any impact? For example, "Bill
Clinton" and "William Jefferson Clinton" have been treated as
different names so far. (Dave Caputo)
- To what degree does speech recognition destroy named entity
recognition (or even existence)? What about other words? Do the word
stems have the same problem? This is a form of word error rate
analysis, but at a cruder level. (Charles Wayne)
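The within-topic versus across-topic comparison raised in the similarity-function question above can be sketched roughly as follows. This is only a minimal illustration, not the systems actually in use: the tf-idf weighting (raw term frequency times log(N/df)), the helper names, and the toy stories and topic labels are all assumptions made for the example.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_vectors(docs):
    """Build sparse tf-idf vectors (raw tf * log(N/df)) for tokenized docs."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {term: tf * math.log(n / df[term]) for term, tf in Counter(doc).items()}
        for doc in docs
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mean_pair_scores(docs, topics):
    """Return (mean within-topic score, mean across-topic score)."""
    vecs = tfidf_vectors(docs)
    within, across = [], []
    for i, j in combinations(range(len(docs)), 2):
        bucket = within if topics[i] == topics[j] else across
        bucket.append(cosine(vecs[i], vecs[j]))

    def mean(xs):
        return sum(xs) / len(xs) if xs else 0.0

    return mean(within), mean(across)


# Hypothetical toy stories and topic labels, for illustration only.
docs = [
    "clinton meets yeltsin in moscow".split(),
    "yeltsin and clinton discuss summit".split(),
    "earthquake strikes coastal city".split(),
    "rescue teams reach earthquake city".split(),
]
topics = ["summit", "summit", "quake", "quake"]

within, across = mean_pair_scores(docs, topics)
print("within-topic mean:", within)
print("across-topic mean:", across)
```

If a similarity function is capturing topical relationships, the within-topic mean should sit clearly above the across-topic mean; on real data the two score distributions, not just their means, would be worth comparing.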
Open questions not being looked at yet
- Extending one of the active questions, what do the score
distributions look like for known first stories and for known
non-first stories?
- How do all/some of the stories on a topic score against one
another and against the/their topic mean?
- Can a human differentiate first stories from non-first stories?
- How useful is it for an algorithm to have prior stories
(correctly) clustered by topic?
- How are named entities distributed across stories within a topic?
- Does the (likely) loss of named entities account for (most of) the
difference between ASR results and non-ASR results?
- What is the effect of evaluating results in terms of a "weighted
depth" of the story detected as the first story? That is, what if a
system gets partial credit for missing the first story but calling the
second story the first story?
- We have been using named entities for several things and have
found them to be effective. Are we confused at all by reporters'
names? They clearly have no strong relationship to the content of
the topic. What about the names of witnesses or bystanders, i.e.,
people who are mentioned essentially in passing?
- Does the position of a named entity in a story mean anything about
its significance? Do named entities mentioned at the start have more
significance than those that do not appear until late in the story?
- What is the impact of ASR errors on effectiveness?
- What makes first stories different? Or is it early stories in general that differ?
- Can POS information be useful for weighting or selecting features?
(Martin Rajman)
- Can we leverage the output of a chunking parser to help us select
features and/or relations? That is, can bigrams with syntactic
relationships improve over simple bigrams? We'll use Charniak's
parser for this.
- Can we use "distributional semantics" [Rajman et al., 1995] as a
method for smoothing sentences or documents for comparison? This
would be an alternative to approaches such as LSI or LCA.
- Does the KL divergence measure capture something more robustly
than the ad hoc "cosine similarity" approach?
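As a starting point for the KL divergence question above, one way to score a pair of stories is to treat each as a smoothed unigram distribution and compute the divergence between them. The sketch below is an assumption-laden illustration, not a description of any existing system: the additive smoothing with `alpha`, the choice of the two stories' joint vocabulary, and the function names are all hypothetical.

```python
import math
from collections import Counter


def smoothed_dist(tokens, vocab, alpha=0.5):
    """Unigram distribution over vocab with additive smoothing (assumed)."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {term: (counts[term] + alpha) / total for term in vocab}


def kl_divergence(p, q):
    """KL(p || q) for two distributions defined over the same vocabulary."""
    return sum(p[term] * math.log(p[term] / q[term]) for term in p)


def story_kl(a, b, alpha=0.5):
    """KL divergence from story a to story b over their joint vocabulary."""
    vocab = set(a) | set(b)
    return kl_divergence(
        smoothed_dist(a, vocab, alpha), smoothed_dist(b, vocab, alpha)
    )


# Hypothetical toy stories, for illustration only.
a = "clinton meets yeltsin in moscow".split()
b = "yeltsin and clinton discuss summit".split()
c = "earthquake strikes coastal city".split()

print("related stories:  ", story_kl(a, b))
print("unrelated stories:", story_kl(a, c))
```

Unlike cosine similarity, this is a dissimilarity (lower means closer) and is not symmetric, so comparing the two approaches would require choosing a direction or a symmetrized variant. The result also depends on the smoothing, which is itself a design choice worth investigating.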
Questions that we have already addressed