Abstract: The impressive success of recent deep neural network (DNN)-based systems is significantly influenced by the high-quality datasets used in training. However, the effects of the datasets, especially how they interact with each other, remain underexplored. We propose a state-vector framework to enable rigorous studies in this direction. This framework uses idealized probing test results as the bases of a vector space. This framework allows us to quantify the effects of both standalone and interacting datasets. We show that the significant effects of some commonly-used language understanding datasets are characteristic and are concentrated on a few linguistic dimensions. Additionally, we observe some "spill-over" effects: the datasets could impact the models along dimensions that may seem unrelated to the intended tasks. Our state-vector framework paves the way for a systematic understanding of the dataset effects, a crucial component in responsible and robust model development.

What problem does this paper attempt to address?

This paper studies how datasets influence the training of deep neural network (DNN) models and proposes a state-vector framework to quantify the effects of individual datasets and of their interactions. High-quality datasets are crucial to the success of DNNs, yet the interactions between datasets have not been fully explored. The proposed framework uses idealized probing test results as the bases of a vector space, in which the effects of standalone or interacting datasets can be measured.

The authors found that the effects of commonly used language understanding datasets are characteristic and concentrated on a few linguistic dimensions. They also observed "spill-over" effects, where a dataset affects the model along dimensions seemingly unrelated to its intended task, and specific, significant interaction effects when multiple datasets are combined.

The authors evaluate dataset effects through probing analysis, capturing multidimensional impacts rather than only difficulty scores. The state-vector framework supports statistical testing, which gives a rigorous basis for explaining how datasets affect models. In the experiments, the authors fine-tuned BERT and RoBERTa models in a multi-task setting and ran probing tests with the SentEval toolkit to quantify the individual and interaction effects of different datasets. The results showed that dataset effects depend on the choice of model, the composition of the datasets, and the task types. For certain datasets, unexpected spill-over effects appeared between semantic and syntactic dimensions.

In conclusion, this paper proposes a novel approach to systematically studying the influence of datasets on deep learning models, providing insights for understanding and optimizing data usage in model training.
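
The idea of treating probing results as coordinates can be illustrated with a small sketch. The summary above does not give the paper's exact formulas, so everything here is an assumption made for illustration: the probe names are hypothetical, the individual effect is taken to be the mean change in the state vector across runs, and the interaction effect is taken to be a difference-in-differences (the joint effect minus the sum of the individual effects). It should be read as one plausible way to set up the computation, not as the authors' implementation.

```python
import numpy as np

# Hypothetical probing dimensions; the names are illustrative only.
PROBES = ["sentence_length", "tree_depth", "tense"]


def state_vector(probe_accuracies):
    """Turn {probe name: accuracy} into a vector with one coordinate per probe."""
    return np.array([probe_accuracies[p] for p in PROBES], dtype=float)


def individual_effect(states_without, states_with):
    """Assumed definition: mean change in state across runs.

    Both arguments are arrays of shape (n_runs, n_probes).
    """
    return np.mean(states_with, axis=0) - np.mean(states_without, axis=0)


def interaction_effect(base, with_x, with_y, with_xy):
    """Assumed definition: joint effect minus the sum of the individual effects."""
    joint = individual_effect(base, with_xy)
    return joint - individual_effect(base, with_x) - individual_effect(base, with_y)


# Made-up probing accuracies for two runs per condition (not real results).
base = np.array([[0.60, 0.30, 0.80], [0.62, 0.30, 0.80]])
with_x = np.array([[0.70, 0.30, 0.80], [0.72, 0.30, 0.80]])
with_y = np.array([[0.60, 0.35, 0.80], [0.62, 0.35, 0.80]])
with_xy = np.array([[0.75, 0.35, 0.80], [0.77, 0.35, 0.80]])

effect_x = individual_effect(base, with_x)
interaction_xy = interaction_effect(base, with_x, with_y, with_xy)
print(dict(zip(PROBES, effect_x)))
print(dict(zip(PROBES, interaction_xy)))
```

In this toy setup, a non-zero coordinate of `effect_x` on a probe unrelated to dataset X's task would correspond to a spill-over effect, and a non-zero `interaction_xy` indicates that the two datasets together shift the model differently than their separate effects would suggest. With several runs per condition, each coordinate could also be checked with a significance test.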

A State-Vector Framework for Dataset Effects
