Abstract: The development of modern NLP applications often relies on various benchmark datasets containing plenty of manually labeled tests to evaluate performance. While constructing datasets often costs many resources, the performance on the held-out data may not properly reflect their capability in real-world application scenarios and thus cause tremendous misunderstanding and monetary loss. To alleviate this problem, in this paper, we propose an automated test generation method for detecting erroneous behaviors of various NLP applications. Our method is designed based on the sentence parsing process of classic linguistics, and thus it is capable of assembling basic grammatical elements and adjuncts into a grammatically correct test with proper oracle information. We implement this method into NLPLego, which is designed to fully exploit the potential of seed sentences to automate the test generation. NLPLego disassembles the seed sentence into the template and adjuncts and then generates new sentences by assembling context-appropriate adjuncts with the template in a specific order. Unlike the task-specific methods, the tests generated by NLPLego have derivation relations and different degrees of variation, which makes constructing appropriate metamorphic relations easier. Thus, NLPLego is general, meaning it can meet the testing requirements of various NLP applications. To validate NLPLego, we experiment with three common NLP tasks, identifying failures in four state-of-the-art models. Given seed tests from SQuAD 2.0, SST, and QQP, NLPLego successfully detects 1,732, 5,301, and 261,879 incorrect behaviors with around 95.7% precision in three tasks, respectively.

What problem does this paper attempt to address?

The paper addresses the concern that existing benchmark datasets for natural language processing (NLP) applications do not adequately reflect how those applications perform in real-world scenarios, which can lead to misjudged performance and economic losses. To mitigate this problem, the paper proposes an automatic test generation method for detecting erroneous behaviors in various NLP applications. The method is based on the sentence parsing process of classical linguistics: it assembles basic grammatical elements and modifiers (adjuncts) into grammatically correct test cases, each accompanied by expected-output (oracle) information. The generated test cases are intended to detect erroneous behaviors in NLP applications and to evaluate their capabilities across tasks without relying on specific benchmark datasets.

### Main Contributions:

1. **Method**: An automatic test generation method is proposed that splits seed sentences into templates and modifiers, and creates derivation trees by iteratively mutating and assembling grammatical elements. The derivation relations between generated sentences facilitate metamorphic testing, which increases the method's generality.
2. **Tool**: The test generation method is implemented as an automated testing tool called NLPLego. NLPLego uses derivation trees to generate new test cases based on predefined metamorphic relations and the input format of the NLP application under test.
3. **Evaluation**: NLPLego is validated through experiments on three common NLP tasks (machine reading comprehension, sentiment analysis, and semantic similarity measurement). The results show that NLPLego can efficiently generate test cases and effectively detect erroneous behaviors of models.

### Background and Motivation:

- **NLP Applications**: NLP applications can be divided into tasks such as machine reading comprehension, sentiment analysis, and semantic similarity measurement.
  These tasks require understanding and generating natural language, but because the output space is vast, manually constructing and checking the correct output is very complex.
- **Limitations of Existing Evaluation Methods**: The current standard evaluation paradigm estimates performance using a train-validation-test split. Because resources are limited and expected-output information is lacking, researchers can only sample from usage scenarios and hire crowd workers to construct benchmark datasets. As a result, these datasets may not adequately reflect the performance of NLP applications in real-world scenarios.
- **Metamorphic Testing**: To address the missing expected-output information (the oracle problem), the paper adopts metamorphic testing. By defining appropriate metamorphic relations between the outputs of related inputs, testing can be conducted without an explicit expected output.

### Methodology:

1. **Key Idea**: Inspired by the sentence parsing process, a test generation method based on splitting and assembling is proposed. The method splits seed sentences into basic structures and modifiers, and then generates new sentences by assembling these components.
2. **Sentence Splitting**: Basic sentence structures are obtained by removing modifiers, and the slots where modifiers can be inserted are identified.
3. **Grammatical Element Assembly**: New sentences with different patterns and diverse semantics are generated by sequentially inserting modifiers that fit the current syntactic structure. To make the generated tests more comprehensive, mutation operators are also used to generate additional contextually appropriate modifiers.
4. **Expected Output Information Generation**: Using the characteristics of the newly generated sentences together with metamorphic testing theory, expected-output information for the tests is derived by defining appropriate metamorphic relations.
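
The splitting and assembly steps above can be illustrated with a minimal sketch. It is not NLPLego's implementation: the `Adjunct` structure, the `assemble` and `derive` helpers, and the hand-split toy sentence are all hypothetical, and the sketch builds only a single linear path of a derivation tree, with no mutation operators.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Adjunct:
    """A modifier removed from the seed sentence (hypothetical structure)."""
    slot: int   # index of the template token the modifier is inserted before
    text: str   # the modifier itself


def assemble(template: List[str], adjuncts: List[Adjunct]) -> str:
    """Insert the given adjuncts into the template at their slots."""
    by_slot: Dict[int, List[str]] = {}
    for adj in adjuncts:
        by_slot.setdefault(adj.slot, []).append(adj.text)
    words: List[str] = []
    for i, token in enumerate(template):
        words.extend(by_slot.get(i, []))
        words.append(token)
    words.extend(by_slot.get(len(template), []))  # modifiers after the last token
    return " ".join(words)


def derive(template: List[str], adjuncts: List[Adjunct]) -> List[str]:
    """Build a derivation chain: each sentence adds one more adjunct
    to its parent, starting from the bare template."""
    return [assemble(template, adjuncts[:k]) for k in range(len(adjuncts) + 1)]


# Toy seed sentence, split by hand for illustration (assumption, not tool output):
# "The young boy quickly ate the apple in the kitchen"
template = ["The", "boy", "ate", "the", "apple"]
adjuncts = [
    Adjunct(slot=1, text="young"),
    Adjunct(slot=2, text="quickly"),
    Adjunct(slot=5, text="in the kitchen"),
]

chain = derive(template, adjuncts)
for sentence in chain:
    print(sentence)
```

Each sentence in `chain` differs from its parent by exactly one modifier, which is the kind of derivation relation that makes it easier to state what should or should not change in a model's output.
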

### Implementation:

- **Template Generation**: Dependency structures and constituent structures of seed sentences are obtained using spaCy and Stanford CoreNLP, and simple sentences that retain the basic grammatical elements are generated using the sentence compression model SLAHAN.
- **Sentence Generation**: New sentences are generated by combining basic sentence structures and modifiers using assembly operators. To make the generated tests more comprehensive, the idea of fuzz testing is adopted: contextually appropriate modifiers are generated through synonym replacement and masked language model prediction.
- **Metamorphic Relation Construction**: Different metamorphic relations are designed for different NLP tasks. For example, for machine reading comprehension tasks, semantic invariance relations are used to determine the expected output of the tests.

In summary, the proposed method and its tool NLPLego can efficiently generate test cases, detect erroneous behaviors in NLP applications, and evaluate their capabilities in different tasks without relying on specific benchmark datasets.
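
To make the idea of testing without an explicit expected output concrete, the following sketch checks a semantic invariance relation of the kind mentioned for machine reading comprehension: the answer on each derived context should match the answer on the seed context. Everything here is an assumption for illustration, including the `check_invariance` helper, the deliberately brittle stand-in model, and the exact-string comparison, which a real setup would likely relax.

```python
from typing import Callable, List, Tuple


def check_invariance(
    model: Callable[[str, str], str],
    question: str,
    seed_context: str,
    derived_contexts: List[str],
) -> List[Tuple[str, str, str]]:
    """Return (context, expected, actual) for each derived context whose
    answer differs from the answer on the seed context."""
    expected = model(seed_context, question)
    failures = []
    for context in derived_contexts:
        actual = model(context, question)
        if actual != expected:
            failures.append((context, expected, actual))
    return failures


def brittle_model(context: str, question: str) -> str:
    """Stand-in for a reading-comprehension model (purely hypothetical):
    it always answers with the last word of the context."""
    return context.split()[-1]


seed = "The boy ate the apple"
derived = [
    "The young boy ate the apple",
    "The young boy quickly ate the apple",
    "The young boy quickly ate the apple in the kitchen",
]

failures = check_invariance(brittle_model, "What did the boy eat?", seed, derived)
for context, expected, actual in failures:
    print(f"expected {expected!r}, got {actual!r} for: {context}")
```

No labeled answer is supplied anywhere: the oracle is the consistency between the seed output and the derived outputs, so a violation flags suspicious behavior even when the correct answer is unknown.
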

Intergenerational Test Generation for Natural Language Processing Applications