Probabilistic hashing for similarity searching and machine learning on large datasets in high dimensions

Ping Li, Cornell University

October 14, 2011


Abstract

Many applications such as information retrieval make use of efficient (approximate) estimates of set similarity. A number of such estimates have been discussed in the literature: minwise hashing, random projections and compressed sensing. This talk presents an improvement: b-bit minwise hashing. An evaluation on large real-life datasets will show large gains in both space and time. In addition, we will characterize the improvement theoretically, and show that the theory matches the practice.More recently, we realized that (b-bit) minwise hashing can not only be used for similarity matching but also for machine learning. Applying logistic regression and SVMs to large datasets faces numerous practical challenges. As datasets become larger and larger, they take too long to load and may not fit in memory. Training and testing time can become an issue. Error analysis and exploratory data analysis are rarely performed on large datasets because it is too painful to run lots of what-if scenarios and explore lots of high-order interactions (pairwise, 3-way, etc.). The proposed method has been applied to two large datasets: a "smaller" dataset (24GB in 16M dimensions) and a "larger" dataset (200GB in 1B dimensions). Using a single desktop computer, the proposed method takes 3 seconds to train an SVM for the smaller dataset and 30 seconds for the larger dataset.