Abstract: Many natural language processing and information retrieval problems can be formalized as the task of semantic matching. Existing work in this area has largely focused on matching between short texts (e.g., question answering), or between a short and a long text (e.g., ad-hoc retrieval). Semantic matching between long-form documents, which has many important applications such as news recommendation, related article recommendation and document clustering, is less explored and needs more research effort. In recent years, self-attention-based models like Transformers and BERT have achieved state-of-the-art performance in the task of text matching. These models, however, are still limited to short texts such as a few sentences or one paragraph, due to the quadratic computational complexity of self-attention with respect to input text length. In this paper, we address the issue by proposing the Siamese Multi-depth Transformer-based Hierarchical (SMITH) Encoder for long-form document matching. Our model contains several innovations to adapt self-attention models to longer text input. To better capture sentence-level semantic relations within a document, we pre-train the model with a novel masked sentence block language modeling task in addition to the masked word language modeling task used by BERT. Our experimental results on several benchmark datasets for long-form document matching show that our proposed SMITH model outperforms the previous state-of-the-art models, including hierarchical attention, multi-depth attention-based hierarchical recurrent neural network, and BERT. Compared to BERT-based baselines, our model is able to increase the maximum input text length from 512 to 2048. We will open source a Wikipedia-based benchmark dataset, code and a pre-trained checkpoint to accelerate future research on long-form document matching.
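
The quadratic cost mentioned in the abstract can be illustrated with a small back-of-the-envelope sketch. It is not the SMITH implementation: the function names are hypothetical, and the two-level split (attention within fixed-size blocks, then attention across one representation per block) and the block length of 32 are assumptions made only for this illustration.

```python
def full_attention_pairs(n_tokens: int) -> int:
    """Token pairs scored by one full self-attention layer: n^2."""
    return n_tokens * n_tokens


def hierarchical_attention_pairs(n_tokens: int, block_len: int) -> int:
    """Pairs scored when attention runs within blocks, then across blocks.

    Assumption (not from the abstract): the document is split into equal
    blocks of `block_len` tokens, and each block contributes a single
    representation to a second, document-level attention layer.
    """
    n_blocks = -(-n_tokens // block_len)  # ceiling division
    within_blocks = n_blocks * block_len * block_len
    across_blocks = n_blocks * n_blocks
    return within_blocks + across_blocks


# Compare the two input lengths named in the abstract.
for n in (512, 2048):
    print(n, full_attention_pairs(n), hierarchical_attention_pairs(n, block_len=32))
```

Under these assumptions, going from 512 to 2048 tokens multiplies the full-attention pair count by 16, while the two-level count grows far more slowly. Real costs also depend on layer counts, hidden sizes and padding, which this sketch ignores.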

Enhanced Pre-Trained Transformer with Aligned Attention Map for Text Matching

Select & Re-Rank: Effectively and Efficiently Matching Multimodal Data with Dynamically Evolving Attention

Comateformer: Combined Attention Transformer for Semantic Sentence Matching

Densely-Connected Transformer with Co-attentive Information for Matching Text Sequences

Multi-level network based on transformer encoder for fine-grained image–text matching

Beyond 512 Tokens: Siamese Multi-depth Transformer-based Hierarchical Encoder for Long-Form Document Matching

Predictive Attention Transformer: Improving Transformer with Attention Map Prediction

Multi-Level Head-Wise Match and Aggregation in Transformer for Textual Sequence Matching

Enhanced Text Matching Based on Semantic Transformation

Alignment-Enriched Tuning for Patch-Level Pre-trained Document Image Models

Enhancing Cross-domain Pre-Trained Decision Transformers with Adaptive Attention

Lightweight Text Matching Method with Rich Features

Image–Text Matching Model Based on CLIP Bimodal Encoding

Improving Semantic Matching Through Dependency-Enhanced Pre-trained Model with Adaptive Fusion

Improved Transformer with Multi-Head Dense Collaboration

Match-Ignition: Plugging PageRank into Transformer for Long-form Text Matching

DABERT: Dual Attention Enhanced BERT for Semantic Matching

Towards a Unified Foundation Model: Jointly Pre-Training Transformers on Unpaired Images and Text

Effective entity matching with transformers

A text matching model based on dynamic multi‐mask and augmented adversarial