Distributional Clustering of N-grams

[CLUSTER DATA]
- split in 10 parts, gzipped, 2.7 GB uncompressed

If you use this data in your work, please cite as:

Please send an e-mail to shane.a.bergsma@gmail.com if you need any assistance.

This data was generated from a web-scale N-gram corpus, generously donated by Google, Inc. (www.google.com). We gratefully acknowledge Google's assistance with this project.

Format

The file contains an alphabetical listing of phrases and their cluster memberships. Each line is tab-separated: the phrase comes first, followed by pairs of a cluster ID and the phrase's similarity to that cluster's centroid. Up to twenty clusters, those with the highest centroid similarities, are included for each phrase. For example:
AutoCAD LT      401     0.260717        736     0.196809        783     0.183296        525     0.177165        808     0.165705        218     0.162994        815     0.141620        97      0.140854        812     0.139386        244     0.127848        111     0.124953        163     0.124443        55      0.123724        324     0.117858        832     0.115209        70      0.113849        197     0.111080        5       0.109667        357     0.107718        833     0.107164
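
The following Python sketch shows one way to read this format. It is not part of the original distribution. The function names parse_cluster_line and iter_cluster_file are made up for illustration, and the UTF-8 encoding is an assumption; change it if the files turn out to be encoded differently. The code relies only on the layout described above: a phrase, then tab-separated (cluster ID, similarity) pairs.

```python
import gzip
from typing import Iterator, List, Tuple

Membership = Tuple[int, float]  # (cluster ID, similarity to that cluster's centroid)


def parse_cluster_line(line: str) -> Tuple[str, List[Membership]]:
    """Split one tab-separated line into (phrase, [(cluster_id, similarity), ...])."""
    fields = line.rstrip("\r\n").split("\t")
    phrase, rest = fields[0], fields[1:]
    if len(rest) % 2 != 0:
        raise ValueError(f"unpaired cluster/similarity field for phrase {phrase!r}")
    memberships = [(int(rest[i]), float(rest[i + 1])) for i in range(0, len(rest), 2)]
    return phrase, memberships


def iter_cluster_file(path: str, encoding: str = "utf-8") -> Iterator[Tuple[str, List[Membership]]]:
    """Stream (phrase, memberships) from one gzipped part without unpacking it to disk.

    The encoding default is an assumption, not something the data description states.
    """
    with gzip.open(path, "rt", encoding=encoding) as handle:
        for line in handle:
            if line.strip():  # skip blank lines, if any
                yield parse_cluster_line(line)


if __name__ == "__main__":
    # A shortened version of the example line above, with real tab characters.
    sample = "AutoCAD LT\t401\t0.260717\t736\t0.196809\t783\t0.183296"
    phrase, memberships = parse_cluster_line(sample)
    best_cluster, best_similarity = max(memberships, key=lambda pair: pair[1])
    print(phrase, best_cluster, best_similarity)
```

Streaming each part line by line keeps memory use small, which matters because the uncompressed data is about 2.7 GB.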

There are 9,901,505 phrases in total.

Shane Bergsma
June 12, 2010