Triplet Attention: Rethinking the similarity in Transformers

Haoyi Zhou, Jianxin Li, Jieqi Peng, Shuai Zhang, Shanghang Zhang
DOI: https://doi.org/10.1145/3447548.3467241
2021-01-01
Abstract: The Transformer model has benefited various real-world applications, where the dot-product self-attention mechanism shows superior alignment ability in building long-range dependencies. However, because self-attention attends only to pairs, it limits further performance improvement on challenging tasks. To the best of our knowledge, this is the first work to define Triplet Attention (A³) for the Transformer, which introduces triplet connections as a complementary dependency. Specifically, we define triplet attention based on the scalar triplet product; it can be used interchangeably with canonical attention within multi-head attention. It allows the self-attention mechanism to attend to diverse triplets and capture complex dependencies. We then use the permuted formulation and kernel tricks to establish a linear approximation to A³. The proposed architecture can be smoothly integrated into pre-training by modifying the head configurations. Extensive experiments show that our methods achieve significant performance improvements on various tasks and two benchmarks.
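For readers unfamiliar with the operation the abstract names, the following is a minimal sketch of the classical scalar triple product a · (b × c) for 3-D vectors, assuming this is what "scalar triplet product" refers to. It is not the paper's attention formula: the abstract does not say how the product is extended to high-dimensional vectors or how it is normalized, and the function names `cross` and `scalar_triple_product` are illustrative only.

```python
def cross(b, c):
    """Cross product of two 3-D vectors given as length-3 sequences."""
    return (
        b[1] * c[2] - b[2] * c[1],
        b[2] * c[0] - b[0] * c[2],
        b[0] * c[1] - b[1] * c[0],
    )


def scalar_triple_product(a, b, c):
    """Return a . (b x c), the signed volume spanned by a, b and c.

    Illustrative 3-D case only; how the paper generalizes this to
    high-dimensional attention vectors is not stated in the abstract.
    """
    bc = cross(b, c)
    return a[0] * bc[0] + a[1] * bc[1] + a[2] * bc[2]


a, b, c = (1, 2, 3), (4, 5, 6), (7, 8, 10)
value = scalar_triple_product(a, b, c)
# A cyclic permutation of the arguments leaves the value unchanged,
# while swapping two arguments flips its sign.
cyclic = scalar_triple_product(b, c, a)
swapped = scalar_triple_product(a, c, b)
```

Unlike the dot product, which scores a pair of vectors, this quantity depends jointly on three vectors and vanishes whenever any two of them are parallel.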