Block Skim Transformer for Efficient Question Answering

Yue Guan, Jingwen Leng, Yuhao Zhu, Minyi Guo
2021-01-01
Abstract: Transformer-based encoder models have achieved promising results on natural language processing (NLP) tasks, including question answering (QA). Unlike sequence classification or language modeling, QA uses the hidden states at all positions for the final classification. However, not all of the context is needed to answer a given question. Following this idea, we propose the Block Skim Transformer (BST) to improve and accelerate transformer QA models. The key idea of BST is to identify, early during inference, the context blocks that must be processed further and those that can be safely discarded. Critically, we learn this information from self-attention weights. As a result, the model's hidden states are pruned along the sequence dimension, achieving significant inference speedup. We also show that this extra training objective improves model accuracy. As a plugin to transformer-based QA models, BST is compatible with other model compression methods and requires no changes to existing network architectures. BST improves QA models' accuracy on different datasets and achieves a 1.6× speedup on the BERT-large model.
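To make the pruning idea in the abstract concrete, here is a minimal sketch of block-level skimming along the sequence dimension. It is not the authors' implementation: the abstract says the keep/discard decision is learned from self-attention weights, whereas this sketch substitutes a fixed threshold on the mean attention each block receives. The names `block_scores`, `skim_blocks`, `block_size`, and `threshold` are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def block_scores(attn: Matrix, block_size: int) -> List[float]:
    """Mean attention received by each block of key positions.

    attn[i][j] is the weight query position i places on key position j.
    """
    seq_len = len(attn)
    # Attention each key position receives, averaged over all queries.
    received = [sum(row[j] for row in attn) / seq_len for j in range(seq_len)]
    scores = []
    for start in range(0, seq_len, block_size):
        block = received[start:start + block_size]
        scores.append(sum(block) / len(block))
    return scores


def skim_blocks(
    hidden: Matrix, attn: Matrix, block_size: int, threshold: float
) -> Tuple[Matrix, List[int]]:
    """Drop whole blocks whose score is below `threshold`.

    Assumption: a fixed threshold stands in for the learned
    keep/discard decision described in the abstract.
    """
    scores = block_scores(attn, block_size)
    kept: List[int] = []
    for b, score in enumerate(scores):
        if score >= threshold:
            start = b * block_size
            kept.extend(range(start, min(start + block_size, len(hidden))))
    # Later layers would only see the surviving positions.
    return [hidden[i] for i in kept], kept


# Toy input: 4 positions, 2 blocks of size 2; the first block gets most attention.
attn = [[0.4, 0.4, 0.1, 0.1] for _ in range(4)]
hidden = [[1.0], [2.0], [3.0], [4.0]]
pruned, kept = skim_blocks(hidden, attn, block_size=2, threshold=0.25)
```

In this toy input the second block receives little attention and is dropped, so subsequent layers would process a sequence half as long. Pruning whole blocks rather than individual tokens keeps the surviving positions contiguous within each block.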