The Center for Language and Speech Processing at Johns Hopkins University





Web-derived Pronunciation Data

The following files contain the raw pronunciation data (prior to any correction or site-specific normalization) for English words extracted from non-EU sites. The data is in a TAB-separated format: the first column is the pronunciation, the second column is the orthography, and the remaining columns contain additional information such as the source webpage, the extraction score, and the words in context.
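As an illustration of the column layout described above, the following is a minimal Python sketch for reading such a file. The names `PronunciationEntry` and `read_entries` are hypothetical and not part of the data release. Because the exact order and meaning of the columns after the second is not specified here, the sketch keeps them as an uninterpreted tuple.

```python
from typing import Iterable, Iterator, NamedTuple, Tuple


class PronunciationEntry(NamedTuple):
    """One row of the TAB-separated data (hypothetical helper type)."""
    pronunciation: str       # first column
    orthography: str         # second column
    extra: Tuple[str, ...]   # remaining columns, left uninterpreted


def read_entries(lines: Iterable[str]) -> Iterator[PronunciationEntry]:
    """Yield one entry per line, skipping lines with fewer than two columns."""
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 2:
            continue  # blank or malformed line
        yield PronunciationEntry(fields[0], fields[1], tuple(fields[2:]))
```

Any iterable of lines can be passed in, for example a file object opened with `open(path, encoding="utf-8")`. The UTF-8 encoding is an assumption, chosen because the IPA pronunciations contain non-ASCII characters.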

The non-EU IPA and Ad-hoc pronunciation archives each contain two files, named 'large' and 'small'. Note that 'small' is not a subset of 'large'; it was extracted from a different set of websites [and is smaller in size :-)].

This data was extracted by Google from its web and news repositories for the 2008 Summer Research Workshop and is freely available to others under a Creative Commons Attribution 3.0 United States License.

     Ad-hoc data from non-EU sites .tar.gz (195 MB)
     Ad-hoc data from news sites .tar.gz (1.5 MB)
     IPA data from non-EU sites .tar.gz (131 MB)


Citation

Arnab Ghoshal, Martin Jansche, Sanjeev Khudanpur, Michael Riley and Morgan Ulinski, "Web-derived pronunciations", in Proc. IEEE ICASSP, 2009.

Contact

For technical questions regarding the data, you can contact the team members listed below:

     Arnab Ghoshal, Johns Hopkins University: ag at jhu dot edu
     Martin Jansche, Google Inc: jansche at acm dot org
     Sanjeev Khudanpur, Johns Hopkins University: khudanpur at jhu dot edu
     Michael Riley, Google Inc: riley at google dot com
     Morgan Ulinski, Cornell University: meu3 at cornell dot edu

The page is maintained by Arnab Ghoshal.