Abstract: Machine reading comprehension (MRC) is a challenging natural language processing (NLP) task. Recently, the emergence of pre-trained models (PTMs) has brought this research field into a new era, in which the training objective plays a key role. The masked language model (MLM) is a self-supervised training objective that is widely used in various PTMs. As training objectives have developed, many variants of MLM have been proposed, such as whole word masking, entity masking, phrase masking, and span masking. These variants differ in the length of the masked token spans. Similarly, answer length differs across MRC tasks: the answer is often a word, a phrase, or a sentence. Thus, for MRC tasks with different answer lengths, whether the masking length of MLM is related to performance is a question worth studying. If this hypothesis holds, it can guide how to pre-train an MLM with a mask length distribution suited to a given MRC task. In this paper, we try to uncover how much of MLM's success on MRC tasks comes from the correlation between the masking length distribution and the answer length in the MRC dataset. To address this question, (1) we propose four MRC tasks with different answer length distributions, namely a short span extraction task, a long span extraction task, a short multiple-choice cloze task, and a long multiple-choice cloze task; (2) we create four Chinese MRC datasets for these tasks; (3) we pre-train four masked language models according to the answer length distributions of these datasets; and (4) we conduct ablation experiments on the datasets to verify our hypothesis. The experimental results demonstrate that our hypothesis holds.
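
To make "masking length distribution" concrete, the following is a minimal sketch of span masking in which span lengths are drawn from a configurable distribution. It is an illustration under stated assumptions, not the procedure used in the paper: the function name `sample_span_mask`, the `length_weights` mapping, the 15% masking budget, and the example weights are all hypothetical.

```python
import random


def sample_span_mask(tokens, length_weights, mask_ratio=0.15,
                     mask_token="[MASK]", seed=0):
    """Mask contiguous spans whose lengths follow `length_weights`.

    `length_weights` maps a span length (positive int) to a relative weight.
    All names and defaults here are illustrative assumptions.
    """
    if not tokens:
        return [], []
    rng = random.Random(seed)
    lengths = list(length_weights.keys())
    weights = list(length_weights.values())
    # Number of tokens to mask in total (at least one).
    budget = max(1, round(len(tokens) * mask_ratio))
    masked = list(tokens)
    covered = set()
    attempts = 0
    while len(covered) < budget and attempts < 1000:
        attempts += 1
        # Draw a span length, then truncate it to the remaining budget.
        span_len = rng.choices(lengths, weights=weights, k=1)[0]
        span_len = min(span_len, budget - len(covered), len(tokens))
        start = rng.randrange(0, len(tokens) - span_len + 1)
        span = range(start, start + span_len)
        # Skip spans that overlap an already masked position.
        if any(i in covered for i in span):
            continue
        covered.update(span)
        for i in span:
            masked[i] = mask_token
    return masked, sorted(covered)


# Hypothetical distributions: one favouring short spans, one favouring long spans.
short_weights = {1: 0.7, 2: 0.2, 3: 0.1}
long_weights = {3: 0.2, 5: 0.4, 8: 0.4}

tokens = list("abcdefghijklmnopqrstuvwxyz0123456789abcd")
print(sample_span_mask(tokens, short_weights)[0])
print(sample_span_mask(tokens, long_weights)[0])
```

Under this sketch, matching a model to a dataset would amount to setting `length_weights` from that dataset's empirical answer length histogram; whether the paper does exactly this is not stated in the abstract.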

ColBERT's [MASK]-based Query Augmentation: Effects of Quadrupling the Query Input Length

An Improved Mask Approach Based on Pointer Network for Domain Adaptation of BERT

Reducing the Footprint of Multi-Vector Retrieval with Minimal Performance Impact via Token Pooling

Mask More and Mask Later: Efficient Pre-training of Masked Language Models by Disentangling the [MASK] Token

Beyond Questions: Leveraging ColBERT for Keyphrase Search

ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction

Analysing the Effect of Masking Length Distribution of MLM: An Evaluation Framework and Case Study on Chinese MRC Datasets

NarrowBERT: Accelerating Masked Language Model Pretraining and Inference

Jina-ColBERT-v2: A General-Purpose Multilingual Late Interaction Retriever

"Is Whole Word Masking Always Better for Chinese BERT?": Probing on Chinese Grammatical Error Correction

Boosting Point-BERT by Multi-choice Tokens

NextLevelBERT: Masked Language Modeling with Higher-Level Representations for Long Documents

A Reproducibility Study of PLAID

Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models

Investigating Masking-based Data Generation in Language Models

Should You Mask 15% in Masked Language Modeling?

Noise-Robust Dense Retrieval via Contrastive Alignment Post Training

QuadrupletBERT: an Efficient Model for Embedding-Based Large-Scale Retrieval

Query Augmentation with Brain Signals

Improving Low-resource Question Answering by Augmenting Question Information

Studying Strategically: Learning to Mask for Closed-book QA