Modern learning architectures for natural language processing have been very successful at incorporating huge amounts of text into their parameters. However, by and large, such models store and use knowledge in a distributed and decentralized way. This proves unreliable and makes the models ill-suited for knowledge-intensive tasks that require reasoning over factual information expressed in language. In this talk, I will give a few examples of alternative architectures that tackle these challenges. In particular, we can improve the performance of such (language) models by representing, storing, and accessing knowledge in a dedicated memory component.
This talk is based on several joint works with Yury Zemlyanskiy (Google Research), Michiel de Jong (USC and Google Research), William Cohen (Google Research and CMU) and our other collaborators in Google Research.
Fei is a research scientist at Google Research. Before that, he was a Professor of Computer Science at the University of Southern California. His primary research interests are machine learning and its application to various AI problems: speech and language processing, computer vision, robotics, and, recently, weather forecasting and climate modeling. He has a PhD (2007) in Computer and Information Science from the University of Pennsylvania and B.Sc. and M.Sc. degrees in Biomedical Engineering from Southeast University (Nanjing, China).
Transformers are essential to pretraining. As we approach 5 years of BERT, the connection between attention as architecture and transfer learning remains key to this central thread in NLP. Other architectures such as CNNs and RNNs have been used to replicate pretraining results, but these either fail to reach the same accuracy or require supplemental attention layers. This work revisits the seminal BERT result and considers pretraining without attention. We consider replacing self-attention layers with recently developed approaches for long-range sequence modeling and transformer architecture variants. Specifically, inspired by recent work such as the structured state space sequence model (S4), we use simple routing layers based on state-space models (SSMs) and a bidirectional model architecture based on multiplicative gating. We discuss the results of the proposed Bidirectional Gated SSM (BiGS) and present a range of analyses of its properties. Results show that architecture does seem to have a notable impact on downstream performance and a different inductive bias that is worth exploring further.
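To make the idea of SSM routing with multiplicative gating concrete, here is a minimal NumPy sketch. It is an illustration under simplifying assumptions, not the layer used in the work described above: the state-space model is reduced to a diagonal linear recurrence, normalization and residual connections are omitted, and every name (diagonal_ssm_scan, bidirectional_gated_ssm, init_params, the W_* matrices) is hypothetical.

```python
import numpy as np


def diagonal_ssm_scan(u, a, b, c):
    """Diagonal linear state-space recurrence along the time axis.

    x_t = a * x_{t-1} + b * u_t,   y_t = c * x_t   (all elementwise)
    u has shape (T, D); a, b, c have shape (D,).
    """
    x = np.zeros(u.shape[1])
    y = np.empty(u.shape, dtype=float)
    for t in range(u.shape[0]):
        x = a * x + b * u[t]
        y[t] = c * x
    return y


def gelu(z):
    # tanh approximation of GELU
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))


def bidirectional_gated_ssm(u, params):
    """Simplified bidirectional gated SSM layer (assumed form, no attention)."""
    # Position-wise gate branch: no mixing across time steps.
    v = gelu(u @ params["W_v"])
    # Forward branch: the SSM scans left to right.
    fwd = diagonal_ssm_scan(gelu(u @ params["W_f"]), *params["ssm_f"])
    # Backward branch: flip time, scan, then flip back.
    bwd = diagonal_ssm_scan(gelu(u[::-1] @ params["W_b"]), *params["ssm_b"])[::-1]
    # Multiplicative gating: combine both directions, then gate with v.
    mixed = gelu((fwd * bwd) @ params["W_u"])
    return (mixed * v) @ params["W_o"]


def init_params(d, rng):
    """Random parameters for illustration only."""
    def mat():
        return rng.normal(scale=d ** -0.5, size=(d, d))

    def ssm():
        # Decay a in (0, 1) keeps the recurrence stable; b = c = 1 for simplicity.
        return (rng.uniform(0.5, 0.99, size=d), np.ones(d), np.ones(d))

    return {
        "W_v": mat(), "W_f": mat(), "W_b": mat(), "W_u": mat(), "W_o": mat(),
        "ssm_f": ssm(), "ssm_b": ssm(),
    }


rng = np.random.default_rng(0)
params = init_params(4, rng)
tokens = rng.normal(size=(6, 4))  # 6 time steps, 4 features
output = bidirectional_gated_ssm(tokens, params)
```

In this sketch, information moves between positions only through the two scans, so the first output position can depend on the last input token through the backward branch even though no attention matrix is ever formed.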