AT&T Fellow, 2001
Association for Computational Linguistics Vice President-elect 2010
According to Google Scholar, my top 3 cited papers are:
1. Word association norms, mutual information, and lexicography 1990 (with Patrick Hanks)
3. A program for aligning sentences in bilingual corpora 1993 (with Bill Gale)
Some recent papers with citations (according to Google Scholar)
Received Ph.D., MS, and SB degrees in 1983, 1980, and 1978, all from MIT in Computer Science.
Undergraduate thesis, supervised by Richard Greenblatt, on chess programming. Thanks to considerable help and encouragement from Hans Berliner, it was published in IJCAI-79 as Co-ordinate Squares: A Solution to Many Chess Pawn Endgames.
Masters thesis, supervised by Peter Szolovits, argued that a finite state machine (FSM) was sufficient, and that one did not need much more complex mechanisms (ATNs or Turing Machines) that were in vogue at the time. The thesis was originally intended to help Szolovits’ Expert System MEDG (Medical Decision Making Group) generate better explanations. Became convinced that the MEDG explanations were difficult to understand because they were full of center-embedding, a kind of recursion that requires unbounded memory to process. This argument was based on Chomsky’s 1959 proof that center-embedded characterizes the difference between push down automata and finite state machines (FSMs). The thesis ultimately evolved into a demonstration that language could be processed on a FSM by implementing a parser based on Mitch Marcus’ thesis and Kaplan-Bresnan’s Lexical-Functional Grammar, and showing that the stack did not need more than about three cells as long as one used certain compiler optimizations such as tail recursion to conserve memory.
Doctoral thesis, supervised by Jon Allen, argued that well-known parsing methods could be used help take advantage of allophonic variation in speech. The phoneme /t/, for example, is typically realized with a heavily aspirated burst at the beginning of a syllable as in Tom, but not at the end of a syllable as in atlas. At the time, these sorts of variations were seen as a kind of noise that a speech recognition machine would have to overcome. I preferred to view the speech system as consisting of two types of clues, those that varied systematically with the suprasegmental (syllable structure and stress) context such as aspiration and those that tended to be relatively invariant to contextual influences such as place, manner and voicing. Both types of clues are useful. Variant clues can be used in a parser to discover the underlying context, and therefore should not be considered noise. An abbreviated version of the thesis was published in Cognition in 1987; the complete thesis was published by Kluwer the following year.
Joined AT&T Bell Laboratories research in 1983 to work with Mitch Marcus and Mark Liberman in Osamu Fujimura’s Linguistic Research and Artificial Intelligence department. Worked on a number of problems in speech and language, especially part of speech tagging, pronunciation of surnames for speech synthesis and text analysis. Published a statistical trigram part of speech tagger in 1988. The approach has since become standard, but at the time it was believed that much more complex mechanisms such as ATNs or Turing Machines were required. (There was also a strong bias at the time against empirical/statistical approaches to language.)
My interest in text analysis grew out of a collaboration with the lexicographer Patrick Hanks. I ran into a number of lexicographers while working on letter-to-sound rules for speech synthesis. At the time, letter-to-sound rules were being used to guess when the letter “t” should be pronounced with the phoneme /t/ and when it shouldn’t. Word accuracy rates were unacceptably low. After acquiring the rights to a dictionary, word accuracy quickly climbed from about 80% to 99.9% or better. The dictionary-based approach has since become standard, but at the time there was considerable controversy over the “larger” memory requirement (0.25 megabytes). Ever since I have been acquiring data just about as fast as I have been able to acquire disks. These days, we buy disks by the terabyte.
After this successful interaction with lexicography, Patrick Hanks visited Bell Laboratories for a month in early 1989. He showed me how lexicographers were beginning to use large corpora to find collocations, words that are often found together like doctor/nurse, or save…from. We found ways to automate some of the drudgery by using well-understood statistical measures like mutual information. Began working with William Gale, a statistician at Bell Laboratories, on estimating the probability of rare events in large corpora, comparing the Good-Turing method with a held-out estimator. Over the next five years, ending with Gale’s retirement, we published a series of papers together on spelling correction, word-sense disambiguation, aligning parallel corpora, and alternatives to the Poisson for modeling the distribution of words in text.
Visited Ed Hovy at USC/ISI for an academic year, with support from ARPA for work on machine translation and parallel corpora (texts such as the Canadian Parliamentary debates that are published in two languages). Spent the time reading in many areas, especially in machine translation (MT) and information retrieval (IR). Started attending meetings in both areas. Became very active in corpus lexicography, publishing a number of papers with Hanks, Gale, Yarowsky and others. Also used the time to help members of the community get started in corpus-based research by serving on data collection committees such as the ACL/DCI (Association for Computational Data Collection Initiative), writing review articles (for Computational Linguistics, Machine Translation, and CACM) and tutorials (Coling-90, 1992 European Summer School, ACL-95). Taught a graduate seminar course at USC with Ed Hovy in the spring term. Started reviewing large numbers of papers for conferences, journals and funding agencies, perhaps as many as a hundred papers in some years.
Returned to Bell Laboratories in fall 1991 and switched into a software engineering organization, headed by Peter Weinberger. Continued my interest in large corpora, by showing how a large multi-million line program could be thought of as a large corpora, and how corpus-based techniques could be used in the discovery process, the process of reading a large program for the first time. Wrote a program with Jon Helfman called dotplot that helps find self-similarity (copied regions) in text (large corpora, source code, object code) using methods borrowed from biology for matching long sequences of DNA.
Developed the Termworks platform with AT&T Language Line, a commercial translation service. The platform was designed to help translators with technical terminology and translation reuse. Translators don’t need help with easy vocabulary and easy grammar (the focus of most machine translation research) but they do need help with technical terms like dialog box. The solutions to these terminology puzzles can be surprising, even to a native speaker of the language. Microsoft uses finestra(window) in their Italian manuals, but boite(box) in their French manuals. The tools help translators conform to house standards by extracting examples of the term and its translations from a previous release of the manual that has already been published in both languages. A statistical alignment program is used to figure out which parts of the source text correspond to which parts of the translation. The tool and its implications for machine translation research were presented at an invited talk at EACL-93. Pierre Isabelle and I edited a special issue of the journal Machine Translation (1997 12:1/2) on tool-based alternatives to machine translation.
Worked on a digital library project based on a collection of 500,000 pages of AT&T memos supplied by the Murray Hill Library in 400 dots-per-inch bitmap format (20 gigabytes). Demonstrated how off-the-shelf optical character recognition (OCR) and information retrieval (IR) programs could be used to implement a “fax-a-query” information retrieval service where both the input and output could be faxes or bitmaps or any format that could be converted to a fax or bitmap. Argued at Coling-94 that fax had advantages over proposed ascii-based standards such as SGML (availability, ease-of-use, longevity). Began work on bitmap-level matching, avoiding the need for OCR. Found that bitmap-level matching could also be used to improve up-sampling and down-sampling of documents, which is important if the scanner, the fax transmission line and the printer use different sampling rates.
Taught a seminar course on corpus-based methods at Penn with Mitch Marcus. Served on thesis committees for Pascale Fung and Vasileios Hatzivassiloglou, both at Columbia . In Pascale’s case, I was in a certain sense her de facto advisor. Her interest in aligning parallel and non-parallel corpora, which ultimately became her thesis, started with a summer job in my lab at AT&T. Switched into a new organization headed by Eric Sumner, focusing on network services (issues more closely aligned with the core telephone business than speech and language). Worked on some systems problems such as how to replace a billing factory with a few PCs. There is a tendency to believe that if you have a serious “big data” problem (e.g., billing, datamining), then you need a serious (=expensive) computer center full of mainframes, tape robots and bureaucracy. Attempted to show that a gang of PCs could replace a mainframe, but discovered, at least in one application (Billing Edge) that a single PC could replace a gang of mainframes.
Promoted to head a new department in research. AT&T has vast quantities of data, some is textual, but most is not. The department uses visualization and datamining, methods that have become popular in corpus-based natural language processing, to find patterns of interest. Imagine that we have a large steady stream of usage records indicating which customers are buying which services and when. This kind of data feed is usually developed for billing purposes, but many companies are discovering a broad range of other applications such as fraud detection, inventory control, provisioning scarce resources, forecasting demand, etc. The problems are challenging from a research point of view because of the massive scale. The telephone network, for example, has about 100 million customers (nodes), who purchase about 100 million calls (edges) per day. This data stream is much larger than the very large corpora that I had just a few years ago. We used to think that a billion words of text was a lot of data, but we are currently receiving a couple billion records a week.
Developed with Glenn Fowler and Don Caldwell a program called pzip that is like gzip, but uses an automatic training procedure to induce an optimal schema for compressing tabular data such as database records and spread sheets. Sometimes it is better to compress a table in row major order and sometimes it is better to compress a table in column major order. Pzipuses a schema to numerate the values in the table and then passes the results to gzip. Another program, pin (partition inducer), uses dynamic programming to find the optimal partition (or schema) by applying gzip to all possible partitions of a sample of the table. Pzip typically saves a factor of two in both space and time over gzip, which has been very much appreciated by several large data warehousing applications. A paper was published in SODA-2000 pdf. Won a “Circle of Excellence” award in 1998 from the AT&T CIO (Chief Information Office) for this work.
Established excellent relationships with AT&T marketing, especially the folks behind the new Lucky Dog flanker brand. My team maintains a web site that tracks sales in “real time” (faster than previously possible). This web site is changing the culture in marketing. Now that they know they can get quick feedback, they are much more willing to experiment. Try a little bit of this and a little bit of that, and do more of what works and less of what doesn’t. The web site has moved SWIFT from a research prototype into a highly visible business reality. SWIFT was a very nice demo that visualized traffic on the AT&T network on a map projected on a powerwall, a huge wall-sized terminal. The point of the demo is that we ought to be able to react to a change in the marketplace in less than a day. To make the idea a business reality, we simplified the interface so that marketing managers could drive it, and replaced the fancy powerwall hardware with a standard web browser. In addition, we worked closely with marketing managers to understand what matters to them, a task that reminded me of my days in Peter Szolovits’ expert systems group at MIT.
Built upon our success with Lucky Dog to establish relationships with other marketing groups, especially the launch of AT&T local service in New York and Texas. Used datamining techniques to discover considerable room for improvement in operations. Won “salesman of the month” in August 2000 for finding large numbers of sales that could be “reflowed.” This discovery played a major role in the team meeting its sales commitments.
Led a team on virtual integration that used methods such as web crawling and screen scraping to produce a view that made a bunch of disparate databases appear to be more unified than they were.
Despite my new interests in datamining, visualization, management and business, I remain active in computational linguistics, my first love. With so many responsibilities, it has been more difficult to publish, but nevertheless, together with Yamamoto, we showed how to compute term frequencies and document frequencies, two statistics used in Information Retrieval, for all substrings in a large corpus. It is well known how to compute these statistics for short substrings, e.g., unigrams, bigrams and trigrams, but we showed how to compute many of these kinds of statistics for all substrings including very long ones, e.g., million-grams. One might have thought that this would be impractical since there are n2 substrings, but it turns out that the n2 substrings can be grouped into n classes using a data structure called suffix arrays, and many of the statistics of interest can be computed over the n classes rather than the n2 substrings. Since we were able to compute these statistics over a larger set of words and phrases than previously possible, we found an interesting new heuristic for distinguishing good keywords from general vocabulary, which seems promising for both Japanese and English. Both mutual information and a new measure called residual IDF (inverse document frequency) compare an observation to chance, but they use different notions of chance. Mutual information looks for deviations from compositionality. It grew out of work with lexicographers (Hanks) and so it isn’t surprising that it is better for finding general vocabulary, the kind of words and phrases that one would expect to find in a dictionary. On the other hand, residual IDF looks for words and phrases that are not randomly distributed across documents. It grew out of the information retrieval literature and so it isn’t surprising that it is better for finding keywords, the words and phrases that distinguish one document from another.
Joined David Heckerman’s Machine Learning Group in Microsoft Research and later joined Eric Brill’s Text Mining, Search and Navigation research group (TMSN), currently headed by Chris Burges. Web logs are similar to logs of telephone calls (call detail records). Usage logs are quite a bit larger than the documents themselves. As big as the web is, there are more readers than writers, and therefore usage logs (such as click logs) tend to be even larger than the documents (web crawls). We used to think that the AP news was a large corpus, but that is just a million words a week. Telephone companies and web search engines see much larger traffic volumes (e.g., a few hundred million records per day).
Big data problems become more manageable if we can make them smaller. We used a method borrowed from the Unix Spell program for the context sensitive speller in Office-2007. For example, if you use here instead of hear in a context such as, “you should here what he said,” then Office-2007 will underline the typo with blue squiggles. The method uses a standard trigram language model, but heroic compression method were required to compress the language model down to just a few megabytes. Doug McIlroy faces a similar problem when he wrote the Unix Speller. He used Golomb coding to squeeze a 32,000 word dictionary into the address space of a PDP-11 (64k bytes). That was just 2 bytes per word. We used similar method to compress the trigram language model to just a few bytes per trigram.
Ping Li and I have written quite a few papers on dimension reduction. Methods such as random projections and sketches make it possible to find strong associations from small samples. Ping Li won a best student paper at KDD-2006 for his work on random projections. Sketches are probably even better. We should not have to look at the entire corpus (e.g., the Web) to know if two (or more) words are associated or not. Sketches were originally introduced to remove duplicate Web pages. We generalize sketches to estimate contingency tables and associations, using a maximum likelihood estimator to find the most likely contingency table given a sample and the margins (also known as document frequencies in information retrieval).
Many techniques that work well on text corpora also work well on web logs. With Qiaozhu Mei in WSDM-2009, we argued that entropy is a useful tool for estimating how hard is search, as well as the potential for personalization. We estimated the entropy of queries, urls and IP addresses, taken one at time, two at a time, and three at a time. We found that it takes 20-something bits to guess the next query, the next url or the next user. But search is easier. If we give you the query, then it takes about 3 bits to guess the url that the user will click on. Personalized search is even easier. If the search engine was given both the query and the IP address, then the entropy is cut in half. There is a huge opportunity for personalization to improve the search experience. It is much easier to answer the question if we know something about the audience. We borrowed a number of techniques from language modeling in speech and language such as Katz backoff. With lots of backoff, personalization starts to look like market segments in database marketing.
Bo Thiesson and I developed “The Wild Thing!” Users are encouraged to enter queries with wild cards. For example “C Rice” or “2#7423” is treated as a regular expression. The system returns the k-best (most popular) matches which includes: “Condoleezza Rice.” The method is particularly attractive for queries that are hard to spell, or for scenarios where it is hard to type (like mobile phones). On a phone, “2#7423” maps to “C Rice” since “C” is on the 2 key, and “R” is on the 7 key and so on. To make the matching more efficient, we introduced K-best suffix arrays, a variant on suffix arrays designed to find the K-best matches, as opposed to all matches. Best can be defined by any total order on the strings such as popularity (frequency in the logs).
With Albert Greenberg and James Hamilton, we published a paper in Hotnets-2008 comparing the mega-model and the micro-model for data centers. If we want to put a million servers in the cloud, should we deploy them in a single mega-data center, or distribute them across a large number of micro-data centers? Given how much the industry is currently investing in the mega model, the industry would do well to consider the micro alternative, especially for embarrassingly distributed applications such as web serving and email.
Chief Scientist of the Human Language Technology Center of Excellence at The Johns Hopkins University.
Served on editorial board of Computational Linguistics (CL), Machine Translation (MT), Quantitative Linguistics (QL), International Journal of Corpus Linguistics (IJCL). Served on program committees for numerous conferences including SIGIR and ACL, both more than once. Helped to create SIGDAT, a special interest group focusing on corpus data, and organized, chaired, or co-chaired many of its annual workshops on very large corpora, which have evolved into the EMNLP conferences with 200+ participants per meeting, about equally distributed over Europe, Asia and America . Co-chaired ACL-99, the major meeting in computational linguistics in 1999. Co-chaired the Industrial Track of KDD-2000, the major meeting in datamining in 2000.