WS 99 TDT Resources
Where to find things we have installed locally:
Data
All corpora are located under /export/tdt/ws99/data:
NID data we have so far:
tdt2-novtag-train/
.nov files are SGML data from the LDC; .vecs files are the input format for the NID software.
The entire TDT2 corpus:
/export/tdt/ws99/data/tdt2/
Text version:
eng_nat_man.text
Vectors of word frequencies:
eng_nat_man.vecs
This is the input format for Victor's FSD software. The features can be anything, not just word counts.
Judgments of topic relevance for 100 topics from TDT2:
/export/tdt/ws99/data/tdt2.rel1
Named Entity tagging for TDT2 from BBN:
/export/tdt/ws99/data/tdt2-NEtag
Other corpora that might be useful as additional data: LA-Times98, SDR99-Newswire
Tools
Software is located under /export/tdt/ws99/tools/:
Charniak statistical parser:
CharniakParser_v4
Preliminary FSD program:
tdt/nid
The binary to run: /export/tdt/ws99/tools/tdt/nid/BIN-linux/nid. This takes -task fsd or -task nid.
A wrapper script that runs the NIST eval program on the output of the nid program:
/export/tdt/ws99/misc/tdt-lab/fsdeval
NIST FSD evaluation software:
TDT3eval_v1.2
UMass FSD evaluation software:
umassEval
Miscellany:
Stop word list in
/export/tdt/ws99/tools/etc/stoplist
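As a rough illustration of how the stop word list and word-frequency features fit together, here is a minimal Python sketch. It assumes the stoplist holds one word per line, which is not confirmed here, and the function names parse_stoplist and word_counts are hypothetical. It does not write the .vecs input format, whose layout is not described on this page.

```python
import re
from collections import Counter


def parse_stoplist(lines):
    """Build a set of stop words from an iterable of lines.

    Assumption: one word per line; blank lines are ignored.
    """
    return {line.strip().lower() for line in lines if line.strip()}


def word_counts(text, stopwords):
    """Count lowercase alphabetic tokens in text, skipping stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(token for token in tokens if token not in stopwords)


# Hypothetical usage with the locally installed list:
#   with open("/export/tdt/ws99/tools/etc/stoplist") as handle:
#       stopwords = parse_stoplist(handle)
#   counts = word_counts(story_text, stopwords)
```

The resulting counts are only one possible feature set; any other per-story features could be substituted.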
Useful Links
1999 Topic Detection and Tracking Evaluation Project (TDT3) from NIST
Linguistic Data Consortium's Instructions for tagging new information.
LDC's TDT3 info, including rules of interpretation
Eugene Charniak's statistical parser: Statistical parsing with a context-free grammar and word statistics, Proceedings of the Fourteenth National Conference on Artificial Intelligence, AAAI Press/MIT Press, Menlo Park (1997).
Probabilistic Latent Semantic Indexing
For Information Retrieval: Probabilistic Latent Semantic Indexing, Thomas Hofmann, Proceedings of the 22nd International Conference on Research and Development in Information Retrieval (SIGIR'99).
For Language Modeling: Probabilistic Topic Analysis for Language Modeling, Gildea, Daniel and Thomas Hofmann, Eurospeech 1999.
Perl Manpage