Scaling of Information in Natural Language – Naftali Tishby (The Hebrew University of Jerusalem)

March 30, 2004 all-day

The idea that the observed semantic structure of human language is a result of an adaptive competition between accuracy of expression and efficient communication is not new. It has been suggested in various forms by Zipf, Shannon, and Mandelbrot, among many others. In this talk I will discuss a novel technique for studying such a competition between accuracy and efficiency of communication, solely from the statistics of large linguistic corpora. By exploiting the deep and intriguing duality between source and channel coding in Shannon’s information theory, we can explore directly the relationship between the semantic accuracy and the complexity of the representation in a large corpus of English documents. We do this by evaluating the accuracy of identifying the topic of a document as a function of the complexity of the semantic representation, as captured by relevant hierarchical clustering of words via the information bottleneck method, which can be viewed as a combination of perfectly matched source and channel. What we obtain is a scaling relation (a power law) that, unlike the famous Zipf's law, quantifies directly the statistical way words are semantically refined in human language. It may therefore reveal some quantitative properties of human cognition, which can now be explored experimentally in other languages or in other complex cognitive modalities such as music and mathematics. This work is partly based on joint work with Noam Slonim.

Dr. Naftali Tishby is currently on sabbatical at the CIS department at UPenn. Until last summer he served as the founding chair of the new computer engineering program at the School of Computer Science and Engineering at the Hebrew University. He is a founding member of the Interdisciplinary Center for Neural Computation (ICNC) and one of the key teachers of the ICNC's well-known computational neuroscience graduate program. He received his PhD in theoretical physics from the Hebrew University in 1985 and has since been a research staff member at MIT, Bell Labs, AT&T, and NECI. His current research is on the interface between computer science, statistical physics, and computational biology. He introduced various methods from statistical mechanics into computational learning theory and machine learning, and is particularly interested in the role of phase transitions in learning and cognitive phenomena. More recently he has been working on the foundations of biological information processing and has developed novel conceptual frameworks for relevant data representation and learning algorithms based on information theory, such as the Information Bottleneck method and Sufficient Dimensionality Reduction.

Center for Language and Speech Processing