A State-Vector Framework for Dataset Effects

Esmat Sahak,Zining Zhu,Frank Rudzicz
2023-10-17
Abstract:The impressive success of recent deep neural network (DNN)-based systems is significantly influenced by the high-quality datasets used in training. However, the effects of the datasets, especially how they interact with each other, remain underexplored. We propose a state-vector framework to enable rigorous studies in this direction. This framework uses idealized probing test results as the bases of a vector space. This framework allows us to quantify the effects of both standalone and interacting datasets. We show that the significant effects of some commonly-used language understanding datasets are characteristic and are concentrated on a few linguistic dimensions. Additionally, we observe some ``spill-over'' effects: the datasets could impact the models along dimensions that may seem unrelated to the intended tasks. Our state-vector framework paves the way for a systematic understanding of the dataset effects, a crucial component in responsible and robust model development.
Computation and Language,Artificial Intelligence,Machine Learning
What problem does this paper attempt to address?
This paper mainly discusses the influence of datasets in the training of deep neural network (DNN) models and proposes a state vector framework to quantify the effects of individual datasets and their interactions. The study points out the crucial importance of high-quality datasets for the success of DNNs, but the interactions between datasets have not been fully explored. The proposed state vector framework utilizes idealized probe test results as the basis for the vector space to measure the effects of separate or interacting datasets. The authors found that the effectiveness of commonly used language understanding datasets exhibits characteristic patterns concentrated on a few language dimensions, while also observing an "overflow" effect where datasets may affect the models in dimensions seemingly unrelated to their original tasks. They also observed specific and significant interaction effects when combining multiple datasets. The paper emphasizes the importance of systematically understanding the influence of datasets, which is a key component for responsible and robust model development. The authors evaluated the effects of datasets through probe analysis, considering not only difficulty scores but also multidimensional impacts. Their proposed state vector framework allows for statistical testing, providing a rigorous approach to explaining how datasets affect models. In the experimental section, the authors conducted multi-task fine-tuning using BERT and RoBERTa models, and utilized the SentEval toolkit for probe testing to quantify the individual and interaction effects of different datasets. The results showed that dataset effects depend on model selection, dataset composition, and task types. Additionally, unexpected "overflow" effects were observed between semantic and syntactic dimensions for certain datasets. In conclusion, this paper proposes a novel approach to systematically study the influence of datasets on deep learning models, providing valuable insights for understanding and optimizing data usage in the model training process.