Web-derived Pronunciation Data
The following files contain the raw pronunciation data (prior to any correction or site-specific normalization) for English words extracted from non-EU sites. The data is in a TAB-separated format: the first column is the pronunciation, the second column is the orthography, and the remaining columns hold additional information such as the source webpage, extraction score, words in context, etc.
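As a rough illustration of the column layout described above, the following Python sketch reads TAB-separated lines and splits each into pronunciation, orthography, and the remaining columns. The names PronEntry and read_entries are hypothetical, and the two sample lines are made up for the example; they are not taken from the released files. The order and meaning of the columns after the second one are not specified here, so the sketch keeps them as an uninterpreted list.

```python
import csv
import io
from typing import Iterator, List, NamedTuple, TextIO


class PronEntry(NamedTuple):
    pronunciation: str  # first column
    orthography: str    # second column
    extra: List[str]    # all remaining columns, left uninterpreted


def read_entries(handle: TextIO) -> Iterator[PronEntry]:
    """Yield one PronEntry per TAB-separated line, skipping short lines."""
    # QUOTE_NONE: treat quote characters in the web-derived text as plain data.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # blank or malformed line: no pronunciation/orthography pair
        yield PronEntry(row[0], row[1], row[2:])


# Made-up sample lines (assumption: not real entries from the data set).
sample = (
    "heh-LOH\thello\thttp://example.com/a\t0.9\n"
    "\n"
    "wurld\tworld\thttp://example.com/b\n"
)

entries = list(read_entries(io.StringIO(sample)))
for entry in entries:
    print(entry.orthography, entry.pronunciation, len(entry.extra))
```

To read an unpacked file instead of the in-memory sample, pass an open file handle to read_entries, for example open(path, encoding="utf-8", newline=""); the UTF-8 encoding is an assumption and should be checked against the actual files.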
The non-EU IPA and Ad-hoc pronunciation archives each contain two files, named 'large' and 'small'. Note that 'small' is not a subset of 'large'; it was extracted from a different set of websites [and is smaller in size :-)].
This data was extracted by Google from its web and news repositories for the 2008 Summer Research Workshop and is freely available to others under a Creative Commons Attribution 3.0 United States License.
| Ad-hoc data from non-EU sites | .tar.gz (195 MB) |
| Ad-hoc data from news sites | .tar.gz (1.5 MB) |
| IPA data from non-EU sites | .tar.gz (131 MB) |
Citation
Arnab Ghoshal, Martin Jansche, Sanjeev Khudanpur, Michael Riley and Morgan Ulinski, "Web-derived pronunciations", in Proc. IEEE ICASSP, 2009.
Contact
For technical questions regarding the data, you can contact the team members listed below:
| Arnab Ghoshal | ag at jhu dot edu | Johns Hopkins University |
| Martin Jansche | jansche at acm dot org | Google Inc |
| Sanjeev Khudanpur | khudanpur at jhu dot edu | Johns Hopkins University |
| Michael Riley | riley at google dot com | Google Inc |
| Morgan Ulinski | meu3 at cornell dot edu | Cornell University |
The page is maintained by Arnab Ghoshal.

