Fovea Transformer: Efficient Long-Context Modeling with Structured Fine-to-Coarse Attention

Ziwei He, Jian Yuan, Le Zhou, Jingwen Leng, Bo Jiang
DOI: https://doi.org/10.1109/icassp48485.2024.10446483
2024-01-01
Abstract: The quadratic complexity of self-attention in Transformers has hindered the processing of long text. To alleviate this problem, previous works have proposed to sparsify the attention matrix, taking advantage of the observation that crucial information about a token can be derived from its neighbors. These methods typically combine some form of local attention with global attention. Such combinations introduce abrupt changes in contextual granularity when going from local to global, which may be undesirable. We believe that a smoother transition could enhance the model's ability to capture long-context dependencies. In this study, we introduce Fovea Transformer, a long-context focused transformer that addresses the challenge of capturing global dependencies while maintaining computational efficiency. To achieve this, we construct a multi-scale tree from the input sequence and use representations of context tokens at a progressively coarser granularity in the tree as their distance to the query token increases. We evaluate our model on three long-context summarization tasks. It achieves state-of-the-art performance on two of them and competitive results on the third, where some evaluation metrics improve and others regress. Our code is publicly available at: https://github.com/ZiweiHe/Fovea-Transformer
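The fine-to-coarse idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation (see their repository for that); it assumes a binary tree built by mean-pooling adjacent pairs, a power-of-two sequence length, scalar token values, and a selection rule borrowed from hierarchical (multipole-style) methods. The names build_tree and fovea_nodes are hypothetical.

```python
def build_tree(tokens):
    """Return levels[k][i] = mean of tokens in [i * 2**k, (i + 1) * 2**k).

    Assumption: mean pooling over adjacent pairs; the paper may pool differently.
    """
    n = len(tokens)
    assert n > 0 and n & (n - 1) == 0, "sketch assumes a power-of-two length"
    levels = [list(map(float, tokens))]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([(prev[i] + prev[i + 1]) / 2 for i in range(0, len(prev), 2)])
    return levels


def fovea_nodes(n, q):
    """Pick (level, index) nodes that tile [0, n): fine near q, coarser far away.

    Assumption: n is a power of two and 0 <= q < n.
    """
    depth = n.bit_length() - 1  # n == 2**depth

    def near(k):
        # The query's node at level k plus its immediate left/right neighbours.
        c = q >> k
        return {i for i in (c - 1, c, c + 1) if 0 <= i < (n >> k)}

    # Finest level: the query token and its direct neighbours.
    picked = [(0, i) for i in near(0)]
    for k in range(depth):
        # Children of the parent's neighbourhood that a finer level has not covered.
        children = {2 * p + b for p in near(k + 1) for b in (0, 1)}
        picked += [(k, i) for i in children - near(k)]
    # Order nodes by the first token position they cover.
    return sorted(picked, key=lambda node: node[1] << node[0])
```

Under these assumptions, each query attends to a number of tree nodes that grows logarithmically with sequence length rather than to every token, and the selected nodes cover each position exactly once, so no context is dropped; distant context is only summarized more coarsely.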