David D. Lewis
AT&T Labs - Research
Title: "Uncertainty Sampling for Supervised Learning: A Logarithmic Lunch?"
**************************************************************************
Random sampling is often used to choose training data for
language processing tasks such as text retrieval, email filtering,
parsing, and tagging. We propose an alternative, iterative approach:
label some data, train the system, and then select for labeling the
examples for which the system is least certain of the correct answer.
On a text categorization task, this method, which we call uncertainty
sampling, reduced the amount of training data needed to achieve a
given level of categorization accuracy by up to 500-fold.
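
The label-train-select loop described above can be sketched on a toy
problem. This is a minimal illustration under stated assumptions, not
the system used in the talk: the one-dimensional pool, the threshold
learner, and the names fit_threshold, uncertainty_sampling, accuracy,
and TRUE_THRESHOLD are all hypothetical, and "least certain" is
approximated here as "closest to the current decision threshold".

```python
import random


def fit_threshold(labeled):
    """Toy learner (an assumption): place the decision threshold midway
    between the largest negative and the smallest positive example."""
    neg = [x for x, y in labeled if y == 0]
    pos = [x for x, y in labeled if y == 1]
    lo = max(neg) if neg else 0.0
    hi = min(pos) if pos else 1.0
    return (lo + hi) / 2.0


def uncertainty_sampling(pool, oracle, n_queries):
    """Repeatedly train, then ask the oracle to label the unlabeled
    example the current model is least certain about."""
    labeled = []
    unlabeled = list(pool)
    for _ in range(n_queries):
        if not unlabeled:
            break
        threshold = fit_threshold(labeled)
        # Least certain example: the one closest to the current threshold.
        x = min(unlabeled, key=lambda v: abs(v - threshold))
        unlabeled.remove(x)
        labeled.append((x, oracle(x)))
    return fit_threshold(labeled), labeled


def accuracy(threshold, pool, oracle):
    """Fraction of pool examples the threshold classifies correctly."""
    correct = sum((x >= threshold) == bool(oracle(x)) for x in pool)
    return correct / len(pool)


rng = random.Random(0)
pool = [rng.random() for _ in range(1000)]  # cheap unlabeled data
TRUE_THRESHOLD = 0.37                       # hidden from the learner


def oracle(x):
    """Stands in for the human labeler."""
    return int(x >= TRUE_THRESHOLD)


threshold, labeled = uncertainty_sampling(pool, oracle, n_queries=20)
print(len(labeled), "labels requested out of", len(pool))
print("pool accuracy:", accuracy(threshold, pool, oracle))
```

In this toy setting each query roughly halves the region where the
threshold could lie, so a few labels go a long way. Real text
categorization is far less tidy, and nothing here should be read as
reproducing the 500-fold figure.
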
The computational learning theory results that inspired our own
(heuristic) work suggest that, asymptotically, a labeled training set
of size *logarithmic* in the amount of unlabeled training data can be
used without sacrificing accuracy. For applications where unlabeled
training data is cheap, this would be the next best thing to a free
lunch. Much remains unknown about these methods, and we will
discuss avenues for future research.
Biographical Information
David D. Lewis is a Principal Research Staff Member at AT&T Labs.
Before that, he was a Member of Technical Staff at AT&T Bell Labs
and a Research Associate at the University of Chicago. He did his
Ph.D. research at the University of Massachusetts under Bruce Croft.
Lewis's research interests include information retrieval, machine
learning, and natural language processing.
**************************************************************************