Abstract: Recent diagnostic datasets on compositional generalization, such as SCAN (Lake and Baroni, 2018) and COGS (Kim and Linzen, 2020), expose severe problems in models trained from scratch on these datasets. However, in contrast to this poor performance, state-of-the-art models trained on larger and more general datasets show better generalization ability. In this work, to reconcile this inconsistency, we conduct an empirical analysis by training Transformer models on a variety of training sets with different data factors, including dataset scale, pattern complexity, example difficulty, etc. First, we show that increased dataset complexity can lead to better generalization behavior on multiple different generalization challenges. To further understand this improvement, we show two axes of the benefit from more complex datasets: they provide more diverse examples so compositional understanding becomes more effective, and they also prevent ungeneralizable memorization of the examples due to reduced example repetition frequency. Finally, we explore how training examples of different difficulty levels influence generalization differently. On synthetic datasets, simple examples invoke stronger compositionality than hard examples do. On larger-scale real language datasets, while hard examples become more important potentially to ensure decent data coverage, a balanced mixture of simple and hard examples manages to induce the strongest generalizability. The code and data for this work are available at https://github.com/owenzx/data4comp
### What problem does this paper attempt to address?
This paper addresses the weak compositional generalization of neural sequence-to-sequence (seq2seq) models. When such models are trained from scratch on diagnostic datasets, they perform very poorly on test examples that combine familiar elements in new ways. In contrast, models trained or pre-trained on larger and more general datasets generalize better. To reconcile the two observations, the paper empirically analyzes how data factors (dataset scale, pattern complexity, example difficulty, and so on) affect the generalization of Transformer models trained from scratch, and explains why more complex datasets improve compositional generalization.
### Main contributions of the paper
1. **Relationship between dataset complexity and generalization ability**: The study finds that increasing dataset complexity can improve model performance on several different generalization challenges. More complex datasets provide more diverse examples, which makes compositional understanding more effective, and they lower how often each example is repeated, which prevents non-generalizable memorization.
2. **Analysis of the advantages of complex datasets**: The authors propose two hypotheses to explain why more complex datasets improve generalization:
- **Diversity hypothesis**: More unique patterns in the dataset (for example, more unique primitive words) make surface-level memorization harder.
- **Frequency hypothesis**: In larger datasets the model sees similar examples less often, which prevents it from memorizing them.
3. **The impact of training examples of different difficulties on generalization**: On synthetic datasets, simple examples promote compositional generalization more than hard examples do. On larger-scale real language datasets, hard examples become more important, potentially because they ensure adequate data coverage, and a balanced mixture of simple and hard examples induces the strongest generalization.
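The two hypotheses can be made concrete by measuring a dataset directly. The following is a minimal sketch, not the authors' code: the function name `dataset_stats`, the whitespace tokenization, and the toy examples are all assumptions made for illustration. It counts unique input tokens as a rough proxy for diversity and the average number of copies per distinct input as a proxy for repetition frequency.

```python
from collections import Counter


def dataset_stats(examples):
    """Return rough diversity and repetition statistics.

    `examples` is a list of (input, output) string pairs. Whitespace
    tokenization is an assumption made for this illustration.
    """
    if not examples:
        raise ValueError("examples must not be empty")
    inputs = [src for src, _ in examples]
    vocab = {tok for src in inputs for tok in src.split()}
    counts = Counter(inputs)
    return {
        "unique_tokens": len(vocab),                   # diversity proxy
        "unique_inputs": len(counts),
        "mean_repetition": len(inputs) / len(counts),  # frequency proxy
        "max_repetition": max(counts.values()),
    }


# Two hypothetical toy datasets of the same size (8 examples each).
repetitive = [("jump twice", "JUMP JUMP")] * 4 + [("walk twice", "WALK WALK")] * 4
diverse = [
    (f"{verb} {adverb}", "")
    for verb in ("jump", "walk", "run", "look")
    for adverb in ("twice", "thrice")
]

print(dataset_stats(repetitive))
print(dataset_stats(diverse))
```

In this toy comparison both datasets have the same number of examples, but the second has more unique tokens and no repeated inputs, which is the direction both hypotheses associate with better generalization.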
### Experimental design and results
- **Experimental design**: The authors ran experiments on multiple datasets, including the synthetic dataset SCAN and its extended version SCAN*, as well as the real language datasets GeoQuery, ATIS, and SMCalFlow. They varied dataset complexity and example difficulty and observed the effect on the model's generalization performance.
- **Results**: Increasing dataset complexity significantly improves the model's generalization ability, especially on compositional generalization tasks. In addition, reducing the number of frequently repeated examples further improves generalization.
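The two data manipulations described above, limiting repeated examples and controlling the share of simple versus hard examples, could be sketched as follows. This is an illustrative sketch under stated assumptions rather than the paper's procedure: the helper names `cap_repeats` and `mix_by_difficulty` are hypothetical, and input length in tokens is used here only as a stand-in for whatever difficulty measure is actually applied.

```python
import random
from collections import defaultdict


def cap_repeats(examples, max_copies=1):
    """Keep at most `max_copies` of each identical example, preserving order."""
    seen = defaultdict(int)
    kept = []
    for ex in examples:
        if seen[ex] < max_copies:
            seen[ex] += 1
            kept.append(ex)
    return kept


def mix_by_difficulty(examples, n_total, simple_fraction=0.5, seed=0):
    """Sample a training set with a chosen share of simple examples.

    Assumption: difficulty is approximated by input length in tokens,
    and the shorter half of the data counts as "simple".
    """
    rng = random.Random(seed)
    ranked = sorted(examples, key=lambda ex: len(ex[0].split()))
    mid = len(ranked) // 2
    simple, hard = ranked[:mid], ranked[mid:]
    # Clamp the requested counts to what each pool can supply.
    n_simple = min(round(n_total * simple_fraction), len(simple))
    n_hard = min(n_total - n_simple, len(hard))
    return rng.sample(simple, n_simple) + rng.sample(hard, n_hard)


# Hypothetical toy data: (input, output) pairs, some repeated.
data = [("jump", "JUMP")] * 3 + [
    ("walk", "WALK"),
    ("jump twice and walk", "JUMP JUMP WALK"),
    ("walk thrice after jump twice", "JUMP JUMP WALK WALK WALK"),
]
deduplicated = cap_repeats(data, max_copies=1)
balanced = mix_by_difficulty(deduplicated, n_total=2, simple_fraction=0.5)
```

Changing `simple_fraction` to 1.0 or 0.0 gives the simple-only and hard-only conditions, which makes it easy to compare them against a balanced mixture on the same data budget.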
### Conclusion
Through empirical analysis, this paper shows that dataset complexity has a positive effect on a model's compositional generalization. It also proposes AugZero, a simple data augmentation method that improves generalization without introducing additional knowledge. These findings help explain and improve model generalization in natural language processing tasks.