Uncertainty Sampling for Supervised Learning: A Logarithmic Lunch? – David D. Lewis (AT&T Labs – Research)

September 30, 1997

Random sampling is often used to choose training data for language processing tasks such as text retrieval, email filtering, parsing, tagging, etc. We propose an alternative approach: label some data, train the system, select for labeling the examples for which the system is least certain of the correct answer, and repeat. On a text categorization task this method, which we call uncertainty sampling, reduced by up to 500-fold the amount of training data needed to achieve a given level of categorization accuracy. The computational learning theory results that inspired our own (heuristic) work suggest that, asymptotically, a labeled training set of size *logarithmic* in the amount of unlabeled training data can be used without sacrificing accuracy. For applications where unlabeled training data is cheap, this would be the next best thing to a free lunch. A great deal is unknown about these methods, and we will discuss avenues for research.
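The abstract does not specify which classifier or uncertainty measure Lewis used; as a rough sketch of the train/select/label loop it describes, the following uses a hypothetical nearest-centroid classifier and a distance-margin uncertainty score (both stand-ins, not the method from the talk):

```python
import random

def train_centroids(labeled):
    """Fit a toy nearest-centroid classifier: one mean vector per class.
    (Stand-in for whatever text categorizer is actually in use.)"""
    sums, counts = {}, {}
    for x, y in labeled:
        counts[y] = counts.get(y, 0) + 1
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict(x, centroids):
    """Assign x to the class with the nearest centroid (squared Euclidean)."""
    return min(centroids, key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])))

def margin(x, centroids):
    """Gap between the two nearest centroid distances; a small gap means the
    classifier is least certain which class x belongs to."""
    d = sorted(sum((a - b) ** 2 for a, b in zip(x, c)) for c in centroids.values())
    return d[1] - d[0]

def uncertainty_sampling(labeled, unlabeled, oracle, rounds, batch):
    """Repeatedly train, pick the unlabeled examples the classifier is least
    sure of, and ask the (human) oracle to label just those."""
    for _ in range(rounds):
        centroids = train_centroids(labeled)
        unlabeled.sort(key=lambda x: margin(x, centroids))  # least certain first
        chosen, unlabeled[:] = unlabeled[:batch], unlabeled[batch:]
        labeled += [(x, oracle(x)) for x in chosen]
    return train_centroids(labeled)

# Toy demo: 2-D points whose true label is the sign of the first coordinate.
random.seed(0)
oracle = lambda x: 1 if x[0] > 0 else 0          # hypothetical human labeler
labeled = [([1.0, 0.2], 1), ([-1.0, -0.2], 0)]   # tiny seed set
unlabeled = [[random.uniform(-2, 2), random.uniform(-1, 1)] for _ in range(40)]
model = uncertainty_sampling(labeled, unlabeled, oracle, rounds=3, batch=5)
```

In this toy run the oracle is asked about only 15 of the 40 unlabeled points; the abstract's logarithmic-lunch conjecture concerns how slowly that labeled fraction needs to grow as the pool of unlabeled data grows.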
David D. Lewis is a Principal Research Staff Member at AT&T Labs. Prior to that he was a Member of Technical Staff at AT&T Bell Labs, and a Research Associate at the University of Chicago. He did his Ph.D. research at the University of Massachusetts under Bruce Croft. Lewis’ research interests are in the areas of information retrieval, machine learning, and natural language processing.

Center for Language and Speech Processing