Mike Hopcroft (Microsoft) “How Bing Uses Bloom Filters for Search”
3400 N Charles St
Baltimore, MD 21218
USA
Abstract
Sometimes complex problems have simple solutions that buck conventional wisdom. The BitFunnel algorithm, which powers the Bing search engine, uses Bloom filters to process queries. With a small team and a blank page, we built a system to search billions of documents. This is our story.
Since the mid-90s there has been a widely-held belief that signature files are inferior to inverted files for text indexing. In recent years the Bing search engine has developed and deployed an index based on bit-sliced signatures. This index, known as BitFunnel, replaced an existing production system based on an inverted index. The driving factor behind the shift away from the inverted index was operational cost savings. This paper describes algorithmic innovations and changes in the cloud computing landscape that led us to reconsider and eventually field a technology that was once considered unusable. The BitFunnel algorithm directly addresses four fundamental limitations in bit-sliced block signatures. At the same time, our mapping of the algorithm onto a cluster offers opportunities to avoid other costs associated with signatures. We show these innovations yield a significant efficiency gain versus classic bit-sliced signatures and then compare BitFunnel with Partitioned Elias-Fano Indexes, MG4J, and Lucene.
Biography
Michael Hopcroft grew up in a college town where he would ride his bike to the physics department stockroom to buy transistors and scavenge used equipment. He wrote video games for Atari in high school, but chose to pursue his interest in electrical engineering in college. After graduating from Cornell with a B.S. he went on to earn a Masters and PhD in computational geometry and computer vision. He has held a number of jobs from summer intern at a chip fab and Hewlett Packard, to researcher at Xerox PARC and software engineer at Silicon Graphics. For the past twenty years he has worked at Microsoft on projects ranging from Office to Windows to Visual Studio to Bing. He almost always prefers spending weeks writing a program to perform simple tasks that would otherwise take a few minutes.