The first name, last name, and location clusters are available in the
File Format: Files are gzip'd. Each line in each file gives a string followed by its cluster memberships (the closest centroids) in the 50-centroid, 200-centroid, and 1000-centroid clusterings, along with the string's similarity to each of those centroids. See the paper below for full details.
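As a rough illustration of reading such a file, here is a minimal Python sketch. The exact field layout is not specified above, so the sketch assumes a hypothetical tab-separated layout of string, then a (cluster id, similarity) pair for each of the 50-, 200-, and 1000-centroid clusterings. The names parse_cluster_line and load_clusters are invented for this example. Check a few lines of the real files and adjust the parsing before relying on it.

```python
import gzip

# ASSUMED layout (not confirmed by the description above): tab-separated
#   string, id50, sim50, id200, sim200, id1000, sim1000
K_VALUES = (50, 200, 1000)


def parse_cluster_line(line):
    """Split one line into (string, {K: (cluster_id, similarity)})."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 1 + 2 * len(K_VALUES):
        raise ValueError(f"unexpected field count: {len(fields)}")
    string = fields[0]
    memberships = {}
    for i, k in enumerate(K_VALUES):
        cluster_id = fields[1 + 2 * i]           # kept as text, format unknown
        similarity = float(fields[2 + 2 * i])    # similarity to that centroid
        memberships[k] = (cluster_id, similarity)
    return string, memberships


def load_clusters(source):
    """Read a gzip'd cluster file (path or binary file object) into a dict."""
    table = {}
    with gzip.open(source, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # skip blank lines
                string, memberships = parse_cluster_line(line)
                table[string] = memberships
    return table
```

Under these assumptions, load_clusters(path)[name][200] would return the cluster id and similarity for that string in the 200-centroid clustering.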
String Processing: To look up strings in our clusters, you will need to normalize your strings using the same processing that we used (e.g. lower-casing and removal of honorifics, symbols, punctuation, and numbers). To facilitate this, we share here the Perl scripts that we used to generate the clusters. One script formats a name (you would then extract the first or last token as desired) and one formats a location. The scripts are in a tar'd and gzip'd file: processAttributeScripts.tgz.
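To give a sense of what this kind of normalization involves, here is a minimal Python sketch of the steps named above. It is not a port of the Perl scripts: the honorific list and the function names normalize_name and first_and_last are assumptions made for illustration, and its output may differ from the real processing. For actual lookups, use the shared scripts.

```python
import re

# Hypothetical honorific list, for illustration only; the Perl scripts in
# processAttributeScripts.tgz define the processing actually used.
HONORIFICS = {"mr", "mrs", "ms", "miss", "dr", "prof", "sir"}


def normalize_name(raw):
    """Lower-case a name and drop symbols, punctuation, numbers, honorifics."""
    lowered = raw.lower()
    # Replace punctuation/symbols, digits, and underscores with spaces.
    letters_only = re.sub(r"[^\w\s]|[\d_]", " ", lowered)
    # Split on whitespace and discard honorific tokens.
    return [tok for tok in letters_only.split() if tok not in HONORIFICS]


def first_and_last(raw):
    """Return the (first, last) tokens of a normalized name, or (None, None)."""
    tokens = normalize_name(raw)
    if not tokens:
        return None, None
    return tokens[0], tokens[-1]
```

The first token would then be looked up in the first-name clusters and the last token in the last-name clusters.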
The gold-standard training, development, and test data used for the seven tasks explored in our experiments is also available as a tar'd and zipped file: gold.exper.twitclusters.tgz.
If you use this data in your work, please cite as:
Please send an e-mail to email@example.com if you have any questions.
Shane Bergsma