In this lab assignment, you will install an existing IR system (MG) and run it on a standard English document collection. Then you will do some experimentation with cross-language information retrieval, using the formal evaluation methods of the information retrieval community to assess experimental alternatives.
mkdir ~/mgdata
setenv MGDATA ~/mgdata
setenv MGSAMPLE ~edrabek/ws02/mg/mg-1.2.1/SampleData
set path = ($path /home/ws01/edrabek/ws02/mg/bin/)
rehash
mgbuild alice
mgquery alice
Based on these specs, MG assumes all of the documents are in a single file with ctrl-B separating each document. MG strips SGML markup by default, which is appropriate for these documents, but it indexes everything else.
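As a minimal sketch of the single-file layout just described, the following joins a handful of documents with ctrl-B (byte value 2) between them. The helper name build_collection and the two sample documents are hypothetical, and whether MG wants the ctrl-B between documents or after every document should be checked against the MG documentation.

```python
CTRL_B = "\x02"  # ctrl-B, the document separator described above


def build_collection(documents):
    """Join documents into one string with ctrl-B between consecutive documents."""
    return CTRL_B.join(documents)


# Hypothetical sample documents, for illustration only.
docs = [
    "<DOC><DOCNO>D1</DOCNO>first document text</DOC>",
    "<DOC><DOCNO>D2</DOCNO>second document text</DOC>",
]
collection = build_collection(docs)
```

Writing the resulting string to one file gives the kind of input this passage describes.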
To actually build the index, invoke mgbuild trec.
mg_get trec | mg_passes -S -f $MGDATA/trec/trec
mv $MGDATA/trec/trec.docnums $MGDATA/trec/trec.DOCIDS
The first of these commands invokes the code in
src/text/mg.special.c, which simply writes anything between
start and end DOCNO tags to a file.
That is, put this into a file:
.set query ranked
.set mode docnums
india and pakistan negotiations over kashmir
.quit
And then run:
cat $myfile | mgquery trec > $myotherfile
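If you need to run many queries, it may help to generate these command files from a script. The sketch below builds the same four lines shown above for an arbitrary query string; the function name make_query_script is hypothetical and not part of MG.

```python
def make_query_script(query):
    """Return the text of an mgquery command file for one ranked query."""
    lines = [
        ".set query ranked",   # ranked retrieval, as in the file above
        ".set mode docnums",   # print document numbers only
        query,
        ".quit",
    ]
    return "\n".join(lines) + "\n"


script = make_query_script("india and pakistan negotiations over kashmir")
```

The returned text can be written to a file and piped into mgquery trec exactly as in the command above.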
/home/ws01/edrabek/ws02/utils/trec_eval. As you know, evaluating an IR
system means comparing its output on a set of queries with a set of
true relevance judgments. In TREC parlance, each query corresponds to
a "topic" (a statement of the information being sought, containing a
title, a description, and a longer narrative characterization of
relevant documents), each document is identified by a "docno"
(document number), and true relevance judgments appear in "qrels"
files (qrels is short for "query relevance judgments").
The directory /home/ws01/edrabek/ws02/data/trec_eval_example contains an example of the relevant
files.
You will see that the format of the system results (SampleResults.dat) contains 6 columns, for example:
030 Q0 ZF08-175-870 0 4238 prise1
qid iter docno rank sim run_id
The fields that really matter are qid (integer query ID), docno (document number, a string), rank (integer starting from 0), and run_id (identifying the system name). The iter (string) and sim (float) fields are ignored, though the sim field might be useful in identifying when document rankings are arbitrary because of tie scores.
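To turn your own system's output into this format, something like the following sketch may be useful. It formats and parses one whitespace-separated results line with the fields qid, iter, docno, rank, sim, and run_id. Both function names are hypothetical, and the default iter value of "Q0" is simply copied from the example line.

```python
def format_result_line(qid, docno, rank, sim, run_id, iteration="Q0"):
    """Build one results line: qid iter docno rank sim run_id."""
    return f"{qid} {iteration} {docno} {rank} {sim} {run_id}"


def parse_result_line(line):
    """Split one results line into a dict of typed fields."""
    qid, iteration, docno, rank, sim, run_id = line.split()
    return {
        "qid": qid,
        "iter": iteration,
        "docno": docno,
        "rank": int(rank),
        "sim": float(sim),
        "run_id": run_id,
    }


line = format_result_line("030", "ZF08-175-870", 0, 4238, "prise1")
fields = parse_result_line(line)
```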
Relevance of each docno to a qid is determined from the qrels file. You will see that the format of the query relevance judgments (qrels.txt) contains 4 columns, for example:
353 0 FR940314-0-00049 1
qid iter docno rel
The qid, iter, and docno fields are as just described, and rel is a Boolean (0 or 1) indicating whether the document is a relevant response for the query. If you run
trec_eval -q qrel.txt SampleResults.dat
you will see a variety of evaluation statistics. See trec_eval_example/trec_eval_desc.txt for documentation of how to run trec_eval and what's reported.
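To make the comparison between a ranked list and the relevance judgments concrete, here is a minimal sketch of uninterpolated average precision for a single query, one standard ranked-retrieval measure. It is an illustration only, not trec_eval's own code, and the docnos in the example are made up.

```python
def average_precision(ranked_docnos, relevant):
    """Mean of the precision values at each rank where a relevant document appears.

    ranked_docnos: docnos in rank order for one query.
    relevant: set of docnos judged relevant (rel == 1) for that query.
    Relevant documents that are never retrieved contribute zero.
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for position, docno in enumerate(ranked_docnos, start=1):
        if docno in relevant:
            hits += 1
            precision_sum += hits / position
    return precision_sum / len(relevant)


# Made-up example: relevant documents retrieved at positions 1 and 3.
ap = average_precision(["A", "B", "C", "D"], {"A", "C"})
```

In this example the precision is 1/1 at the first relevant document and 2/3 at the second, so the average is 5/6.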
The goal of this exercise is to experience the process of trying out several different cross-language IR variations and evaluating them formally. The full group should break up into several teams, each trying something different. Some obvious approaches to try include:
Each group should do its development using topics 1-28, and at the end of the lab all groups should show their evaluation results using topics 29-54.
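One way to keep the two topic sets apart is to filter by topic number before evaluating. The sketch below assumes topic numbers are plain integers; the names DEV_TOPICS, TEST_TOPICS, and topics_in_split are hypothetical.

```python
DEV_TOPICS = range(1, 29)    # topics 1-28, for development
TEST_TOPICS = range(29, 55)  # topics 29-54, for the final evaluation


def topics_in_split(qids, split):
    """Keep only the query IDs that belong to the given split."""
    return [qid for qid in qids if int(qid) in split]


dev = topics_in_split(["1", "28", "29", "54"], DEV_TOPICS)
test = topics_in_split(["1", "28", "29", "54"], TEST_TOPICS)
```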
Note that you'll almost certainly need to use Chinese segmentation (see tokenize, below) and you may want to experiment with English stemming. If you're feeling particularly ambitious you might find some interesting way to use the WordNet data discussed below.
Some tools you might want to work with: