Corpora and Statistical Analysis of Non-Linguistic Symbol Systems – Richard Sproat (Google)

March 26, 2013 (all day)

We report on the creation and analysis of a set of corpora of non-linguistic symbol systems. The resource, the first of its kind, consists of data from seven systems, both ancient and modern, with four further systems under development and several others planned. The systems represent a range of types, including heraldic systems, formal systems, and systems that are mostly or purely decorative. We also compare these systems statistically with a large set of linguistic systems, which likewise range over both time and type.

We show that none of the measures proposed in published work by Rao and colleagues (Rao et al., 2009a; Rao, 2010) or Lee and colleagues (Lee et al., 2010a) works. In particular, Rao’s entropic measures are evidently useless when one considers a wider range of examples of real non-linguistic symbol systems, and Lee’s measures, with the cutoff values they propose, misclassify nearly all of our non-linguistic systems. However, we also show that one of Lee’s measures, with different cutoff values, as well as another measure we develop here, do seem useful. We further demonstrate that they are useful largely because both are highly correlated with a rather trivial feature: mean text length.

Richard Sproat received his Ph.D. in Linguistics from the Massachusetts Institute of Technology in 1985. He worked at AT&T Bell Labs, at Lucent’s Bell Labs, and at AT&T Labs Research before joining the faculty of the University of Illinois. From there he moved to the Center for Spoken Language Understanding at the Oregon Health & Science University. In the fall of 2012 he moved to Google, New York, as a Research Scientist.

Sproat has worked in numerous areas relating to language and computational linguistics, including syntax, morphology, computational morphology, articulatory and acoustic phonetics, text processing, text-to-speech synthesis, and text-to-scene conversion. Some of his recent work includes multilingual named entity transliteration, the effects of script layout on readers’ phonological awareness, and tools for automated assessment of child language. At Google he works on multilingual text normalization and finite-state methods for language processing. He also has a long-standing interest in writing systems and symbol systems more generally.

Center for Language and Speech Processing