Unsupervised learning of natural languages – Shimon Edelman (Cornell)

October 5, 2004

We describe an unsupervised algorithm capable of finding hierarchical, context-sensitive structure in corpora of raw symbolic sequential data such as text or transcribed speech. In the domain of language, the algorithm handles both artificial stochastic context-free grammar data and real natural-language corpora, including raw transcribed child-directed speech. It identifies candidate structures iteratively as patterns of partially aligned sequences of symbols, accompanied by equivalence classes of symbols that are in complementary distribution in the context of their patterns. Pattern significance is estimated using a context-sensitive probabilistic criterion defined in terms of local flow quantities in a graph whose vertices are the lexicon entries and whose paths correspond, initially, to corpus sentences. New patterns and equivalence classes can incorporate those added previously, leading to the emergence of recursively structured units that also support highly productive and safe generalization by opening context-dependent paths that do not exist in the original corpus. This is the first time an unsupervised algorithm has been shown capable of learning complex, grammar-like linguistic representations that are demonstrably productive, exhibit a range of structure-dependent syntactic phenomena, and score well in standard language proficiency tests.
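
As a rough illustration of the graph representation mentioned in the abstract, the following Python sketch builds a graph whose vertices are tokens and whose paths are corpus sentences. It then groups tokens that occur in the same immediate context. This is not the speaker's algorithm: the toy corpus, the function names `build_graph` and `equivalence_classes`, the boundary markers `<s>` and `</s>`, and the one-token context window are all assumptions made for illustration, and the probabilistic significance criterion is omitted entirely.

```python
from collections import defaultdict


def build_graph(sentences):
    """Count directed edges between adjacent tokens; each sentence is one path."""
    edges = defaultdict(int)
    for sent in sentences:
        # Hypothetical boundary markers so every path has a start and an end.
        path = ["<s>"] + sent.split() + ["</s>"]
        for a, b in zip(path, path[1:]):
            edges[(a, b)] += 1
    return edges


def equivalence_classes(sentences, min_size=2):
    """Group tokens that fill the same slot (same left and right neighbour).

    Simplifying assumption: the context is a single token on each side,
    and no statistical test is applied to the candidate classes.
    """
    slots = defaultdict(set)
    for sent in sentences:
        path = ["<s>"] + sent.split() + ["</s>"]
        for left, mid, right in zip(path, path[1:], path[2:]):
            slots[(left, right)].add(mid)
    return {ctx: toks for ctx, toks in slots.items() if len(toks) >= min_size}


# Toy corpus, invented for this sketch.
corpus = ["the cat sleeps", "the dog sleeps", "the cat eats", "a dog eats"]
edges = build_graph(corpus)
classes = equivalence_classes(corpus)

for context, tokens in sorted(classes.items()):
    print(context, sorted(tokens))
```

On this toy corpus, "cat" and "dog" share the slot between "the" and "sleeps", and "the" and "a" share the slot before "dog". Substituting one member of a class for another yields a path such as "the dog eats", which is absent from the corpus. That is a crude analogue of the context-dependent generalization the abstract describes.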

Center for Language and Speech Processing