This page provides the data for two publications that involve classifying hidden attributes of scientific authors:

(1) Stylometric Analysis of Scientific Articles: Data

The following file provides the data used in our paper:

[Stylo.data.tgz]

You can uncompress the file using "tar -xzf Stylo.data.tgz". The archive contains processed versions of ACL Anthology papers and the exact division of papers used in our experiments, along with a README that gives further details.
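For readers who prefer to unpack the archive from a script, the same step can be sketched in Python with the standard tarfile module. The function name extract_archive and the destination directory are illustrative assumptions, not part of the data release.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir="."):
    """Extract a gzip-compressed tar archive and return its member names.

    Rough equivalent of "tar -xzf <archive_path>" run inside dest_dir.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()  # list the contents before unpacking
        tar.extractall(dest)
    return names
```

Calling extract_archive("Stylo.data.tgz", "stylo_data") would unpack the archive into a directory named stylo_data (a hypothetical name) and return the list of extracted paths.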

If you use this data in your work, please cite as:

The ACL Anthology (specifically the ACL Anthology Network) should also be acknowledged as the ultimate source of our data. Please see our paper for full details.

Please send an e-mail to sbergsma@jhu.edu if you have any questions.

Shane Bergsma
July 5, 2012

(2) Explicit and Implicit Syntactic Features for Text Classification: Data

Overview of data collection

In this paper, we experimented with native language classification on scientific articles from the ACL Anthology Network (AAN). We used all papers from the years 2001-2007 as training data, and randomly divided the papers from 2008 and 2009 into evenly sized development and test sets. We took steps to anonymize the papers and to remove references, footers, table captions, etc.
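The year-based division described above can be sketched as follows. This is a minimal illustration, not the script used for the paper: the function name split_by_year, the (paper_id, year) input format, and the random seed are all assumptions. To reproduce our experiments, use the split files provided below rather than re-running a random split.

```python
import random


def split_by_year(papers, seed=0):
    """Divide papers into train, dev, and test lists by publication year.

    papers: iterable of (paper_id, year) pairs (hypothetical input format).
    Papers from 2001-2007 go to train; papers from 2008-2009 are shuffled
    and cut in half to form dev and test. The seed is an assumption.
    """
    papers = list(papers)
    train = [pid for pid, year in papers if 2001 <= year <= 2007]
    held_out = sorted(pid for pid, year in papers if year in (2008, 2009))
    rng = random.Random(seed)
    rng.shuffle(held_out)
    half = len(held_out) // 2
    dev, test = held_out[:half], held_out[half:]
    return train, dev, test
```

With an odd number of 2008-2009 papers, this sketch places the extra paper in the test set.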

We focus on the five most common native languages of ACL authors in our training era: English, Japanese, German, Chinese, and French. The AAN provides normalized author names and affiliation information for each paper through 2009, which we exploit to semi-automatically annotate articles for our experiments. We manually labeled each country of affiliation with the language predominantly spoken there. We then applied heuristics to label the data. For a paper to be labeled as English, all author names and affiliations had to be English. To determine whether a name corresponds to an English speaker, we took a list of common first names in the United States (from U.S. census data) and added to it common nicknames, e.g., Rob for Robert, Chris for Christopher, and so on. For a paper to be labeled as one of the other languages, (1) every author had to have a country of affiliation where that language is spoken, and (2) every author's name had to either be on a list of names affiliated with that language or be of unknown origin. Our source of male and female names associated with non-English-speaking countries was www.20000-names.com. These heuristics provide annotations for 1,959 of 8,483 papers (23%) in the 2001-2009 AAN.
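The labeling heuristics above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code used for the paper: the function name label_paper, the (first_name, country) author format, and the two lookup tables are hypothetical, and "unknown origin" is interpreted here as "not on any of the name lists".

```python
def label_paper(authors, country_language, names_by_language):
    """Return a native-language label for a paper, or None if unlabeled.

    authors: list of (first_name, country) pairs (hypothetical format).
    country_language: dict mapping a country to its predominant language.
    names_by_language: dict mapping a language to a set of lowercase
        first names associated with it.
    """
    # Every author must be affiliated with a country of the same language.
    languages = {country_language.get(country) for _, country in authors}
    if len(languages) != 1:
        return None
    (language,) = languages
    if language is None or language not in names_by_language:
        return None

    # Assumption: a name is of "unknown origin" if it is on no list at all.
    known_names = set().union(*names_by_language.values())
    for first_name, _ in authors:
        name = first_name.lower()
        on_list = name in names_by_language[language]
        unknown = name not in known_names
        if language == "English":
            # English papers require every name to be on the English list.
            if not on_list:
                return None
        elif not (on_list or unknown):
            # Other languages accept listed names or names of unknown origin.
            return None
    return language
```

Papers for which the function returns None correspond to the papers left unannotated by the heuristics.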

Data

(1) Mapping of ACL codes to native language can be found here.

(2) Splits used in our experiments: train, dev, and test papers.

If you use this data in your work, please cite as:

The ACL Anthology (specifically the ACL Anthology Network) should also be acknowledged as the ultimate source of our data. Please see our paper for full details.

Please send an e-mail to sbergsma@jhu.edu if you have any questions.

Shane Bergsma
June 2, 2013