Learning Noun Phrase Query Segmentation: Query and Feature Data used in Experiments

If you use any of this data in your work, please cite Bergsma and Wang (2007), "Learning Noun Phrase Query Segmentation".

See the Bergsma and Wang (2007) paper for all details on how the queries were collected, how the segmentations were annotated, and how the feature information was used. The queries were adapted from the AOL query dataset, available online (Greg Pass, Abdur Chowdhury, and Cayley Torgeson. 2006. A Picture of Search. In The First International Conference on Scalable Information Systems).

Please send an e-mail to sbergsma@jhu.edu if you use the query segmentation data. We'd also be happy to help if you need any assistance.

Query Data

The query data used in the experiments:

Train
Development
Test

The annotations of the test set by the two additional annotators:

Test-2
Test-3

Feature Data

The following frequency information was collected from the Google SOAP search API in March 2007. Each line in each file is an exact query (searched with quotation marks around it) and the corresponding page count. The feature that each file corresponds to should be fairly self-explanatory after reading the paper and looking at the contents of the file. The order of the features in Table 2 of the paper roughly corresponds to the order listed here, with counts.ngrams providing the web-count and pair-count (and trigram-count, etc.) information.

ngrams
the
collapsed
ands
genitive
anywhereQueryDB
exactQueryDB
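
As a rough illustration of the file format described above (one exact query and its page count per line), here is a minimal Python sketch for loading a feature file into a dictionary. It is not the code used in the paper. It assumes that the count is the last whitespace-separated token on each line and that any surrounding quotation marks can be stripped from the query; check both assumptions against the actual files. The function name parse_counts is hypothetical.

```python
from typing import Dict, Iterable


def parse_counts(lines: Iterable[str]) -> Dict[str, int]:
    """Map each query string to its page count.

    Assumption (not confirmed by the data description): the count is the
    last whitespace-separated token on the line, and everything before it
    is the query.
    """
    counts: Dict[str, int] = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        parts = line.rsplit(None, 1)  # split once, from the right, on whitespace
        if len(parts) != 2:
            continue  # no separate count field on this line
        query, count_text = parts
        try:
            count = int(count_text)
        except ValueError:
            continue  # last token is not an integer count
        # Assumption: quotation marks around the query are not part of the key.
        counts[query.strip('"')] = count
    return counts
```

To use it on a downloaded file, pass an open file handle, for example `parse_counts(open("counts.ngrams", encoding="utf-8"))`, adjusting the path to wherever the file was saved. Lines that do not match the assumed layout are skipped silently, so compare the size of the resulting dictionary with the number of lines in the file before relying on it.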

Thanks and good luck!