Note: the instructions contained on this page were written for use specifically in the CLSP WS02 Lab. They may refer to applications, files or other materials that are not accessible from other locations.

Information Retrieval Evaluation Lab

In this lab assignment, you will install an existing IR system (MG) and run it on a standard English document collection. Then you will do some experimentation with cross-language information retrieval, using the formal evaluation methods of the information retrieval community to assess experimental alternatives.

Getting an IR system running

Doing IR evaluation

To do evaluation, you will use the trec_eval software, located on the system at /home/ws01/edrabek/ws02/utils/trec_eval. As you know, evaluating an IR system means comparing its output on a set of queries with a set of true relevance judgments. In TREC parlance, each query corresponds to a "topic" (a statement of the information being sought, containing a title, a description, and a longer narrative characterization of relevant documents), each document is identified by a "docno" (document number), and true relevance judgments appear in "qrels" files (qrels is short for "query relevance judgments").

The directory /home/ws01/edrabek/ws02/data/trec_eval_example contains an example of the relevant files.

    You will see that the system results file (SampleResults.dat) contains 6 columns, for example:

      030  Q0    ZF08-175-870  0     4238  prise1
      qid  iter  docno         rank  sim   run_id
    
    The fields that really matter are qid (integer query ID), docno (document number, a string), rank (integer starting from 0), and run_id (identifying the system name). The iter (string) and sim (float) fields are ignored, though the sim field can help identify when document rankings are arbitrary because of tied scores.
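
    As a rough illustration of the result format described above, the following Python sketch splits a result line into its fields and flags queries in which two or more documents share a sim score. The names ResultLine, parse_result_line, and queries_with_ties are hypothetical, chosen for this example; they are not part of trec_eval.

```python
from collections import defaultdict
from typing import NamedTuple


class ResultLine(NamedTuple):
    qid: int
    iteration: str
    docno: str
    rank: int
    sim: float
    run_id: str


def parse_result_line(line: str) -> ResultLine:
    """Split one whitespace-separated result line into typed fields."""
    qid, iteration, docno, rank, sim, run_id = line.split()
    return ResultLine(int(qid), iteration, docno, int(rank), float(sim), run_id)


def queries_with_ties(results):
    """Return the set of qids in which two or more documents share a sim score."""
    seen = defaultdict(set)
    tied = set()
    for r in results:
        if r.sim in seen[r.qid]:
            tied.add(r.qid)
        seen[r.qid].add(r.sim)
    return tied


# The sample line from the lab text.
example = parse_result_line("030 Q0 ZF08-175-870 0 4238 prise1")
print(example.qid, example.docno, example.rank, example.run_id)
```

    A query that appears in the set returned by queries_with_ties has at least one pair of documents whose relative order may be arbitrary.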

    Relevance of each docno to each qid is determined from the query relevance judgments file (qrels.txt), which consists of text tuples in 4 columns, for example:

           
       353  0     FR940314-0-00049  1
       qid  iter  docno             rel
    
    The qid, iter, and docno fields are as just described, and rel is a Boolean (0 or 1) indicating whether the document is a relevant response for the query. If you run
      trec_eval -q qrels.txt SampleResults.dat 
    
    you will see a variety of evaluation statistics. See trec_eval_example/trec_eval_desc.txt for documentation of how to run trec_eval and what's reported.
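
    To make the comparison between system output and relevance judgments concrete, here is a minimal Python sketch of one commonly used statistic, non-interpolated average precision for a single query. It is a simplified illustration, not trec_eval's own implementation; the function name average_precision and the sample docnos are made up for this example. Consult trec_eval_desc.txt for what trec_eval actually reports.

```python
def average_precision(ranked_docnos, relevant_docnos):
    """Non-interpolated average precision for one query.

    ranked_docnos: docnos in the order the system returned them.
    relevant_docnos: set of docnos judged relevant (rel = 1) in the qrels.
    """
    if not relevant_docnos:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for position, docno in enumerate(ranked_docnos, start=1):
        if docno in relevant_docnos:
            hits += 1
            # Precision at this cutoff: relevant seen so far / documents seen so far.
            precision_sum += hits / position
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant_docnos)


# Hypothetical data: two of three relevant documents retrieved, at positions 1 and 3.
ranking = ["d1", "d7", "d3", "d9"]
relevant = {"d1", "d3", "d5"}
print(average_precision(ranking, relevant))
```

    Averaging this value over all queries gives a single summary number for a run, which is convenient when comparing experimental alternatives.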

    Evaluating IR runs

    Now that you've seen how to run an IR system and how to evaluate an IR system, it's time to put the two together. First, let's do a simple example using the same data you just worked with.

    A more substantial experiment

    Directory data/clir contains a large (>24K documents) collection of Chinese articles, together with TREC-style topics (in both Chinese and English) and relevance judgments. The topics and judgments are broken up into two sets, 1-28 and 29-54 -- this will make it easy for you to use one set for development and the other set for testing.

    The goal of this exercise is to experience the process of trying out several different cross-language IR variations and evaluating them formally. The full group should break up into several teams, each trying a different approach. Some obvious approaches to try include:

    Each group should do its development using topics 1-28, and at the end of the lab all groups should show their evaluation results using topics 29-54.
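
    One way to keep the development and test topics apart is to filter the judgments by qid before evaluating. The sketch below assumes qrels lines in the four-column format shown earlier, with integer qids; the function name split_qrels and the sample lines are hypothetical.

```python
DEV_TOPICS = range(1, 29)    # topics 1-28, for development
TEST_TOPICS = range(29, 55)  # topics 29-54, for final testing


def split_qrels(lines):
    """Partition qrels lines (qid iter docno rel) into development and test lists."""
    dev, test = [], []
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        qid = int(fields[0])
        if qid in DEV_TOPICS:
            dev.append(line)
        elif qid in TEST_TOPICS:
            test.append(line)
    return dev, test


# Made-up judgments covering both ends of each topic range.
sample = ["1 0 docA 1", "28 0 docB 0", "29 0 docC 1", "54 0 docD 0"]
dev, test = split_qrels(sample)
print(len(dev), len(test))
```

    Writing each list to its own file lets trec_eval be run on the development topics during tuning and on the test topics only once, at the end.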

    Note that you'll almost certainly need to use Chinese segmentation (see tokenize, below) and you may want to experiment with English stemming. If you're feeling particularly ambitious you might find some interesting way to use the WordNet data discussed below.

    Some tools you might want to work with:

    Some data sources you have available that you might want to work with: ~edrabek/ws02/data/dicts/


    Philip Resnik
    Last modified: Mon Jul 1 15:03:21 EDT 2002