Abstract: Many natural language processing and information retrieval problems can be formalized as the task of semantic matching. Existing work in this area has largely focused on matching between short texts (e.g., question answering), or between a short and a long text (e.g., ad-hoc retrieval). Semantic matching between long-form documents, which has many important applications such as news recommendation, related article recommendation, and document clustering, is less explored and needs more research effort. In recent years, self-attention-based models like Transformers and BERT have achieved state-of-the-art performance in the task of text matching. These models, however, are still limited to short texts, such as a few sentences or one paragraph, because the computational complexity of self-attention is quadratic in the input text length. In this paper, we address this issue by proposing the Siamese Multi-depth Transformer-based Hierarchical (SMITH) Encoder for long-form document matching. Our model contains several innovations to adapt self-attention models to longer text input. To better capture sentence-level semantic relations within a document, we pre-train the model with a novel masked sentence block language modeling task in addition to the masked word language modeling task used by BERT. Our experimental results on several benchmark datasets for long-form document matching show that our proposed SMITH model outperforms the previous state-of-the-art models, including hierarchical attention, multi-depth attention-based hierarchical recurrent neural network, and BERT. Compared to BERT-based baselines, our model is able to increase the maximum input text length from 512 to 2048. We will open-source a Wikipedia-based benchmark dataset, code, and a pre-trained checkpoint to accelerate future research on long-form document matching.
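The quadratic cost mentioned in the abstract can be illustrated with a rough count of attention pairs. The sketch below is not the authors' implementation: it only compares full self-attention over all tokens with a generic two-level scheme that attends within fixed-size blocks and then across block representations. The function names and the block length of 32 are illustrative assumptions, not values taken from the abstract.

```python
def full_attention_pairs(n_tokens: int) -> int:
    """Number of query-key pairs when every token attends to every token."""
    return n_tokens * n_tokens


def hierarchical_attention_pairs(n_tokens: int, block_len: int) -> int:
    """Pair count for a generic two-level scheme (an assumption for illustration):
    attention inside each block, then attention across one vector per block."""
    n_blocks = -(-n_tokens // block_len)  # ceiling division
    within_blocks = n_blocks * block_len * block_len
    across_blocks = n_blocks * n_blocks
    return within_blocks + across_blocks


if __name__ == "__main__":
    block_len = 32  # hypothetical block length, chosen only for this example
    for n in (512, 2048):
        full = full_attention_pairs(n)
        hier = hierarchical_attention_pairs(n, block_len)
        print(f"n={n}: full={full}, hierarchical={hier}")
```

Under these assumptions, growing the input from 512 to 2048 tokens multiplies the full-attention pair count by 16, while the two-level count grows much more slowly. This is the general motivation for hierarchical encoders on long documents.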

Multi-Level Head-Wise Match and Aggregation in Transformer for Textual Sequence Matching

Select & Re-Rank: Effectively and Efficiently Matching Multimodal Data with Dynamically Evolving Attention

Densely-Connected Transformer with Co-attentive Information for Matching Text Sequences

Multi-level network based on transformer encoder for fine-grained image–text matching

Enhanced Pre-Trained Transformer with Aligned Attention Map for Text Matching

A Compare-Aggregate Model for Matching Text Sequences

Generalizable Memory-driven Transformer for Multivariate Long Sequence Time-series Forecasting

Short text matching model with multiway semantic interaction based on multi-granularity semantic embedding

Enhanced Text Matching Based on Semantic Transformation

Comateformer: Combined Attention Transformer for Semantic Sentence Matching

Improved Transformer with Multi-Head Dense Collaboration

Chunk, Align, Select: A Simple Long-sequence Processing Method for Transformers

Beyond 512 Tokens: Siamese Multi-depth Transformer-based Hierarchical Encoder for Long-Form Document Matching

Improving Transformers with Dynamically Composable Multi-Head Attention

Lightweight Text Matching Method with Rich Features

Hi-Transformer: Hierarchical Interactive Transformer for Efficient and Effective Long Document Modeling

Hierarchical Feature Aggregation based on Transformer for Image-text Matching

Metaformer: A Transformer That Tends to Mine Metaphorical-Level Information

Multi-Granularity Aggregation Transformer for Joint Video-Audio-Text Representation Learning