Chris Re (Stanford University) “Bootleg: Guidable Self-Supervision for Named Entity Disambiguation”
Abstract
Named Entity Disambiguation (NED), the task of mapping textual mentions to entities in a knowledge graph, is a key step in using knowledge graphs. A key challenge in NED is generalizing to rarely seen (tail) entities. Traditionally, NED uses hand-tuned patterns from a knowledge base to capture rare but reliable signals. Hand-built features make NED challenging to deploy and maintain, especially in multiple locales. While at Apple in 2018, we built a self-supervised system for NED that was deployed in a handful of locales and that significantly improved the performance of downstream models. However, due to the fog of production, it was unclear which aspects of these models were most valuable. Motivated by this experience, we built Bootleg, a clean-slate, open-source, self-supervised system that improves tail performance using a simple transformer-based architecture. Bootleg improves tail generalization through a new inverse regularization scheme that automatically favors more generalizable signals. Bootleg-like models are used by several downstream applications. As a result, quality issues fixed in one application may need to be fixed independently in many others. Thus, we initiate the study of techniques to fix systematic errors in self-supervised models using weak supervision, augmentation, and training set refinement. Bootleg achieves new state-of-the-art performance on the three major NED benchmarks by up to 3.3 F1 points, and it improves performance over BERT baselines on tail slices by 50.1 F1 points.