Towards Semi-Supervised Algorithms for Semantic Relation Detection in BioScience Text – Marti Hearst (Berkeley)

October 21, 2004

A crucial step toward the goal of automatic extraction of propositional information from natural language text is the identification of semantic relations between constituents in sentences. In the bioscience text domain, we have developed a simple ontology-based algorithm for determining which semantic relation holds between terms in noun compounds, and a supervised learning algorithm for discovering relations between entities. In this talk, I will first briefly describe these results. A major bottleneck for semantic labeling work is the development of labeled training data. To remedy this, we propose a new approach for creating semantically labeled data that makes use of what we call *citances*: the text of the sentences surrounding citations to research articles. Citances provide us with differently worded statements of approximately the same semantic information; by looking at the way that different authors talk about the same facts, we obtain paraphrases nearly for free. We have just begun to assess how well citances work for the creation of labeled training data for the problem of detecting protein-protein interaction relations. We also hypothesize that citances will be useful for synonym creation, document summarization, and database curation.

Joint work with Preslav Nakov, Barbara Rosario, Ariel Schwartz, and Janice Hamer. This work is part of the BioText project, supported by NSF DBI-0317510.

Dr. Marti Hearst is an associate professor in SIMS, the School of Information Management and Systems at UC Berkeley, with an affiliate appointment in the Computer Science Division. Her primary research interests are user interfaces and visualization for information retrieval, empirical computational linguistics, and text data mining. She received BA, MS, and PhD degrees in Computer Science from the University of California at Berkeley, and she was a Member of the Research Staff at Xerox PARC from 1994 to 1997. Prof. Hearst is on the editorial boards of ACM Transactions on Information Systems and ACM Transactions on Computer-Human Interaction. She was formerly on the boards of Computational Linguistics and IEEE Intelligent Systems, and served as program co-chair of HLT-NAACL.

Center for Language and Speech Processing