Information Extraction from the World Wide Web using Finite State Models and Scoped Learning: Andrew McCallum - 07/17/2002
See slides from Andrew McCallum's lecture (.pdf format)
- Location: Shaffer Hall, Room 100
- Time: 10:30 am - 12:00 noon
- Abstract:
The Web is the world's largest knowledge base. However, its data is in a form intended for human reading, not manipulation, data mining and reasoning by computers. Today's search engines return web pages. Tomorrow's search engines will also return "things" (like people, jobs, companies, events), facts, their relations and trends.
Information extraction is the process of filling fields in a database by automatically extracting sub-sequences of human readable text. Finite state machines are the dominant model for information extraction both in research and industry. In this talk I will give several examples of information extraction tasks at WhizBang Labs, and then describe two new finite state models designed to take special advantage of the multi-faceted nature of text on the web. Maximum Entropy Markov Models and Conditional Random Fields are discriminative sequence models that allow each observation to be represented as a collection of arbitrary overlapping features (such as word identity, capitalization, part-of-speech, layout and formatting, plus features from the past and future, and agglomerative features of the entire sequence). I will introduce both models, skim over their parameter estimation algorithms, and present experimental results on real-world tasks. I will then describe Scoped Learning, a method that further improves information extraction and classification by taking advantage of local regularities in training and test sets.
(Joint work with Fernando Pereira, John Lafferty, Dayne Freitag, David Blei, Drew Bagnell and many others at WhizBang Labs.)
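As a rough illustration of what "arbitrary overlapping features" for one observation could look like, here is a minimal Python sketch. It is not taken from the talk or from any WhizBang Labs system; the function name `token_features` and every feature name in it are hypothetical, chosen only to mirror the feature types listed in the abstract (word identity, capitalization, features from the past and future, and a feature of the entire sequence).

```python
from typing import Dict, List


def token_features(tokens: List[str], i: int) -> Dict[str, float]:
    """Return overlapping binary features for the token at position i.

    Hypothetical feature set for illustration only; a discriminative
    sequence model would learn one weight per (feature, label) pair.
    """
    word = tokens[i]
    feats: Dict[str, float] = {f"word={word.lower()}": 1.0}  # word identity

    # Capitalization and shape features; these overlap with word identity.
    if word[:1].isupper():
        feats["is_capitalized"] = 1.0
    if word.isupper():
        feats["is_all_caps"] = 1.0
    if any(ch.isdigit() for ch in word):
        feats["has_digit"] = 1.0

    # Features from the past and the future of the sequence.
    if i > 0:
        feats[f"prev_word={tokens[i - 1].lower()}"] = 1.0
    else:
        feats["is_first"] = 1.0
    if i + 1 < len(tokens):
        feats[f"next_word={tokens[i + 1].lower()}"] = 1.0
    else:
        feats["is_last"] = 1.0

    # A feature computed over the entire sequence.
    if any(t.lower() == "university" for t in tokens):
        feats["seq_mentions_university"] = 1.0

    return feats


if __name__ == "__main__":
    sentence = "Andrew McCallum joined WhizBang Labs in Pittsburgh".split()
    for name in sorted(token_features(sentence, 1)):
        print(name)
```

Because the features are not required to be independent of one another, several of them can fire for the same token at once, which is the property the abstract attributes to these discriminative models.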
- Biography:
Andrew McCallum was Vice President of Research and Development at WhizBang Labs, and Director of their 30-person research and development lab in Pittsburgh, PA. He also holds an adjunct faculty position at Carnegie Mellon University. Prior to joining WhizBang, he was a Research Scientist and Coordinator at Just Research (Justsystem Pittsburgh Research Center), where he spearheaded development of new methods for statistical text processing and created the research paper search engine now available at www.cora.whizbang.com. In 1996 he was a post-doctoral fellow at Carnegie Mellon University, where he worked with Sebastian Thrun on the Intelligent Building project and with Tom M. Mitchell on the WebKB project. Andrew graduated summa cum laude from Dartmouth College in 1989, and received his PhD in computer science from the University of Rochester in 1995, where he worked with Dana Ballard.
For the past six years, Andrew has been active in research and publication on machine learning and statistical methods applied to text. His particular research interests include information extraction, document classification, and learning from combinations of labeled and unlabeled data. He is on the editorial board of the Journal of Machine Learning Research, and has served on the program committees for many technical conferences, including IJCAI, AAAI, ICML, SIGIR, UAI and NIPS. He has given invited talks at MIT, Stanford, CMU, Brown, Xerox PARC, IBM Almaden, SRI, AT&T Research and Google.
This fall he will be a Research Associate Professor at the University of Massachusetts, Amherst.