Broadly Improving User Classification via Communication-Based Name and Location Clustering on Twitter

This page provides the data associated with this publication.

Cluster Data

The first name, last name, and location clusters are available in the following folder:
Clusters/

File Format: Files are gzip'd. Each line in each file gives a string followed by its cluster memberships (closest centroids) in the K-centroid clusterings for K = 50, 200, and 1000, along with its similarity to each of those centroids. See the paper below for full details.
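As a starting point for reading these files, here is a minimal Python sketch. It only decompresses a file and splits each line into the string and the remaining fields. The tab separator and the function name read_cluster_lines are assumptions for illustration, not something this page specifies, so inspect a few lines of a real file and adjust the separator before relying on it.

```python
import gzip


def read_cluster_lines(path, sep="\t"):
    """Yield (string, fields) for each non-empty line of a gzip'd cluster file.

    ASSUMPTION: fields are separated by `sep` (tab by default) and the first
    field is the looked-up string. The remaining fields (cluster memberships
    and similarities) are returned as raw text, uninterpreted.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            string, *fields = line.split(sep)
            yield string, fields
```

For example, dict(read_cluster_lines(path)) would build an in-memory lookup table from string to raw fields, where path points at one of the files in Clusters/.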

String Processing: To look up strings in our clusters, you will need to normalize them with the same processing that we used (e.g., lower-casing and removal of honorifics, symbols, punctuation, and numbers). To facilitate this, we share the Perl scripts that we used to generate the clusters. One script formats a name (from which you would then extract the first or last token as desired) and one formats a location. The scripts are in a tar'd and zipped file: processAttributeScripts.tgz.
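To give a rough idea of the kind of normalization described above, here is a small Python sketch. It is not a port of the Perl scripts: the honorific list, the regular expression, and the function names normalize_name and first_and_last are all assumptions made for illustration. Because any difference in normalization will cause lookups to miss, use the shared Perl scripts for real lookups.

```python
import re

# ASSUMPTION: an illustrative honorific list; the shared Perl scripts may
# remove a different set.
HONORIFICS = {"mr", "mrs", "ms", "miss", "dr", "prof", "sir"}


def normalize_name(raw):
    """Lower-case a name and drop honorifics, symbols, punctuation, and numbers."""
    text = raw.lower()
    # Replace runs of non-word characters, digits, and underscores with a space.
    # Letters (including non-ASCII letters) are kept.
    text = re.sub(r"[\W\d_]+", " ", text)
    tokens = [tok for tok in text.split() if tok not in HONORIFICS]
    return " ".join(tokens)


def first_and_last(normalized):
    """Return the first and last tokens of a normalized name, or (None, None)."""
    tokens = normalized.split()
    if not tokens:
        return None, None
    return tokens[0], tokens[-1]
```

With these definitions, normalize_name("Dr. Jane Doe!!") returns "jane doe", and first_and_last("jane doe") returns ("jane", "doe").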

Experimental Data

The gold-standard training, development, and test data used for the seven tasks explored in our experiments is also available as a tar'd and zipped file: gold.exper.twitclusters.tgz.

Attribution

If you use this data in your work, please cite as:

  • S. Bergsma, M. Dredze, B. Van Durme, T. Wilson, D. Yarowsky, Broadly Improving User Classification via Communication-Based Name and Location Clustering on Twitter, In Proc. NAACL-HLT 2013. [pdf] [bib]

We share all the data on this page under a Creative Commons Attribution 3.0 Unported License. So feel free to share, remix, or make commercial use of this data, provided that when you share any data derived from our work, you mention the above paper. Where any of our data is already in the public domain, that status is not affected by this license.

Questions?

Please send an e-mail to shane.a.bergsma@gmail.com if you have any questions.

Shane Bergsma
July 16, 2013