Learning Under Differing Training and Test Distributions
Tobias Scheffer, Machine Learning Research Group of the Max Planck Institute for Computer Science
June 1, 2007
Most learning algorithms are constructed under the assumption that the training data is governed by exactly the same distribution to which the model will later be exposed. In practice, control over the data generation process is often less perfect. Training data may consist of a benchmark corpus (e.g., the Penn Treebank) that does not reflect the distribution of sentences to which a parser will later be applied. Spam filters may be used by individuals whose distribution of inbound emails diverges from the distribution reflected in public training corpora (e.g., the TREC spam corpus).
In this talk, I will analyze the problem of learning classifiers that perform well under a test distribution that may differ arbitrarily from the training distribution. I will discuss the correct optimization criterion and solutions, including a kernel logistic regression classifier for differing training and test distributions.
In filtering spam, phishing, and virus emails, distributions vary greatly across users and IP domains, and over time. Taking into account that spam senders change their email templates in response to the filtering mechanisms employed leads to the related but even more challenging problem of adversarial learning.
Tobias Scheffer is Research Associate Professor and head of the Machine Learning Research Group of the Max Planck Institute for Computer Science. He is an adjunct faculty member of Humboldt-Universitaet zu Berlin. Between 2003 and 2006, he was a Research Assistant Professor at Humboldt-Universitaet zu Berlin. Prior to that, he worked at the University of Magdeburg, at Technische Universitaet Berlin, at the University of New South Wales in Sydney, and at Siemens Corporate Research in Princeton, N.J. He was awarded an Emmy Noether Fellowship of the German Science Foundation DFG in 2003 and an Ernst von Siemens Fellowship by Siemens AG in 1996. He received a Master's Degree in Computer Science (Diplom-Informatiker) in 1995 and a Ph.D. (Dr. rer. nat.) in 1999 from Technische Universitaet Berlin. Tobias serves on the Editorial Board of the Data Mining and Knowledge Discovery Journal. He served as Program Chair of the European Conference on Machine Learning and the European Conference on Principles and Practice of Knowledge Discovery in Databases.