Sasha Rush (Cornell University) “Pretraining Without Attention”
Abstract: Transformers are essential to pretraining. As we approach 5 years of BERT, the connection between attention as an architecture and transfer learning remains key to this central thread in NLP. Other architectures such as CNNs and RNNs[…]