Abstract:
Most learning algorithms are constructed under the assumption that the
training data is governed by exactly the same distribution that the
model will later be exposed to. In practice, control over the data
generation process is often less perfect. Training data may consist of
a benchmark corpus (e.g., the Penn Treebank) that does not reflect the
distribution of sentences to which a parser will later be applied.
Spam filters may be used by individuals whose distribution of inbound
emails diverges from the distribution reflected in public training
corpora (e.g., the TREC spam corpus).
In the talk, I will analyze the problem of learning classifiers that
perform well under a test distribution that may differ arbitrarily
from the training distribution. I will discuss the correct
optimization criterion and solutions, including a kernel logistic
regression classifier for differing training and test distributions.
In filtering spam, phishing, and virus emails, distributions vary
greatly over users, IP domains, and time. Taking into account that
spam senders change their email templates in response to the filtering
mechanisms employed leads to the related but even more challenging
problem of adversarial learning.