Alignment-Based Discriminative String Similarity: Cognate Data used in Experiments

If you use any of this data in your work, please cite Bergsma and Kondrak (2007), "Alignment-Based Discriminative String Similarity".

Please send an e-mail to sbergsma@jhu.edu if you use the data. We'd also be happy to help if you need any assistance.

Bitext Data

The bitext data was generated from materials provided for the Shared Task at the NAACL 2006 Workshop on Statistical Machine Translation (Philipp Koehn and Christof Monz, 2006. Manual and automatic evaluation of machine translation between European languages. In NAACL Workshop on Statistical Machine Translation, pages 102-121). These shared task materials were in turn generated from the Europarl corpus. Please see Section 5.1 of Bergsma and Kondrak (2007) for further details on the processing of the bitext data.

[German Training Set]
[German Development Set]
[German Test Set]

[French Training Set]
[French Development Set]
[French Test Set]

[Spanish Training Set]
[Spanish Development Set]
[Spanish Test Set]

Dictionary Data

The file dictionaryProcessing.tgz contains scripts to generate LCSR-pair data from translation dictionary pairs, with accompanying instructions. Dictionary translation pairs with an LCSR above 0.58 are labelled as positives; all other word pairs with an LCSR above 0.58 are labelled as negatives. Extract the archive with "tar -xzvf dictionaryProcessing.tgz".
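
As a rough illustration of the labelling rule above, here is a minimal Python sketch. It is not the code shipped in dictionaryProcessing.tgz. It assumes the standard definition of LCSR (the length of the longest common subsequence divided by the length of the longer word), and the function names lcs_length, lcsr and label_pair are hypothetical names chosen for this example.

```python
from typing import Optional


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a: str, b: str) -> float:
    """Longest common subsequence ratio: LCS length over the longer word's length."""
    longer = max(len(a), len(b))
    if longer == 0:
        return 0.0
    return lcs_length(a, b) / longer


def label_pair(a: str, b: str, is_translation: bool,
               threshold: float = 0.58) -> Optional[str]:
    """Label a word pair following the rule described above (assumed reading).

    Pairs at or below the threshold are discarded (None). Above it,
    dictionary translation pairs are positives and all others negatives.
    """
    if lcsr(a, b) <= threshold:
        return None
    return "positive" if is_translation else "negative"


if __name__ == "__main__":
    print(lcsr("colour", "couleur"))
    print(label_pair("colour", "couleur", is_translation=True))
```

Under this reading, pairs whose LCSR does not exceed 0.58 are not labelled at all, so both the positive and negative examples are orthographically similar pairs.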

For our experiments, dictionary pairs were used from the Freelang program. The Freelang program and its accompanying translation dictionaries are freely available for download from the Freelang website.

Manually-Annotated Cognate Data

In the first paragraph of the results in Bergsma and Kondrak (2007), we describe testing the LCSR of a set of "known French-English cognates". These cognates are available here:

French-English Cognates

These pairs were originally used in the work: Diana Inkpen, Oana Frunza and Grzegorz Kondrak. Automatic Identification of Cognates and False Friends in French and English. Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2005), pp. 251-257, Borovets, Bulgaria, September 2005.

The pairs were generated as follows: A thousand word pairs were taken from a dictionary. All pairs are possible translations. Dr. Kondrak then manually went through all the pairs and marked those that are cognate. He did not distinguish between genetic cognates and other cognates. For a pair to be judged cognate, the roots of both words have to be related; having related prefixes or suffixes is not sufficient. Compound words count as cognate if any of the roots are related. Of the thousand pairs, 636 are cognate (and provided here), including 140 that are identical. The remaining pairs are unrelated and not included.

Thanks and good luck!