Abstract: It is held as a truism that deep neural networks require large datasets to train effective models. However, large datasets, especially with high-quality labels, can be expensive to obtain. This study sets out to investigate (i) how large a dataset must be to train well-performing models, and (ii) what impact can be shown from fractional changes to the dataset size. A practical method to investigate these questions is to train a collection of deep neural answer selection models using fractional subsets of varying sizes of an initial dataset. We observe that dataset size has a conspicuous lack of effect on the training of some of these models, bringing the underlying algorithms into question.

What problem does this paper attempt to address?

This paper explores how the size of the training dataset affects the performance of neural answer selection models. Specifically, the researchers address two questions:

1. **How large a dataset is required to train a well-performing model?**
2. **What specific impact do changes in dataset size have on model performance?**

To answer these questions, the authors take a practical approach: they train a series of deep neural network models on subsets of the original dataset of different proportions and evaluate how performance changes with the amount of training data. This approach clarifies the effect of dataset size on model training, and it also reveals whether particular models perform as expected on smaller datasets.

### Research Background

- **Deep Learning and Big Data**: Deep learning models usually require a large amount of data to train effectively, but high-quality, large-scale datasets are costly to obtain.
- **Task Definition**: The paper focuses on answer selection, a task at the intersection of information retrieval (IR) and natural language processing (NLP). The goal is to select the correct answer to a natural language question from a given set of candidate answers.

### Research Methods

- **Dataset**: The WikiQA dataset serves as the original dataset, and subsets of different proportions (10%, 25%, 50%, 75%, 100%) are generated by random sampling.
- **Model**: A variety of deep neural network models are studied, including DSSM, CDSSM, ARC-I, ARC-II, MV-LSTM, DRMM, aNMM, DUET, MatchPyramid and DRMM TKS.
- **Evaluation Metric**: Mean Average Precision (MAP) is used to evaluate model performance on the validation set and the test set.
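
The sampling and evaluation steps listed under Research Methods can be sketched roughly as follows. This is a minimal illustration, not the authors' code: it assumes subsets are drawn at the question level and independently for each fraction, and that questions without any correct candidate are skipped when computing MAP. The function names and the `(question_id, score, label)` record layout are hypothetical.

```python
import random
from collections import defaultdict

# Fractions of the original training set named in the summary above.
FRACTIONS = (0.10, 0.25, 0.50, 0.75, 1.00)


def fractional_subsets(question_ids, fractions=FRACTIONS, seed=0):
    """Return {fraction: sampled question ids}.

    Assumption: sampling is per question and independent per fraction;
    the summary only says the subsets come from random sampling.
    """
    rng = random.Random(seed)
    ids = sorted(set(question_ids))
    subsets = {}
    for frac in fractions:
        k = max(1, round(frac * len(ids)))
        subsets[frac] = rng.sample(ids, k)  # sampling without replacement
    return subsets


def average_precision(labels_in_ranked_order):
    """Average precision for one question, given 0/1 labels in ranked order."""
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(labels_in_ranked_order, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(scored):
    """MAP over an iterable of (question_id, score, label) records."""
    by_question = defaultdict(list)
    for qid, score, label in scored:
        by_question[qid].append((score, label))
    aps = []
    for candidates in by_question.values():
        if not any(label for _, label in candidates):
            continue  # assumption: skip questions with no correct answer
        ranked = sorted(candidates, key=lambda c: c[0], reverse=True)
        aps.append(average_precision([label for _, label in ranked]))
    return sum(aps) / len(aps) if aps else 0.0
```

In such a setup, each model would be trained once per fraction on the sampled questions and scored with `mean_average_precision` on fixed validation and test sets, so that only the amount of training data varies between runs.
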
### Main Findings

- **Most Models Do Not Perform as Expected**: For most models, performance does not improve significantly as the training dataset grows. Only CDSSM, ARC-II and DRMM TKS show a relative performance improvement of more than 20% when the dataset increases from 10% to 100%.
- **Overfitting**: Some models (such as DSSM and MatchPyramid) overfit quickly on small datasets, which may indicate that these models have a strong capacity to memorize training data.
- **Validation Set Performance**: Performance trends on the validation set are inconsistent with those on the training set, which points to differences in how well the models generalize under different amounts of data.

### Conclusion

The results show that dataset size affects the performance of neural answer selection models less than expected. This finding can guide the choice of algorithms and strategies under resource-limited conditions. Performance on the validation set remains an important basis for evaluating how well a model generalizes, and future research can further explore how different amounts of data affect generalization.
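
The 20% threshold cited in the main findings is a relative change in MAP between the 10% and 100% training subsets. A minimal sketch of that calculation is given below; the function name is hypothetical and the MAP values are placeholders chosen for illustration, not results from the paper.

```python
def relative_improvement(map_small, map_full):
    """Relative change in MAP when moving from the small to the full subset."""
    if map_small <= 0:
        raise ValueError("baseline MAP must be positive")
    return (map_full - map_small) / map_small


# Hypothetical MAP values, for illustration only (not taken from the paper).
map_at_10_percent = 0.50
map_at_100_percent = 0.62

gain = relative_improvement(map_at_10_percent, map_at_100_percent)
exceeds_threshold = gain > 0.20  # the "more than 20%" criterion from the findings
print(f"relative improvement: {gain:.0%}, exceeds 20%: {exceeds_threshold}")
```

Because the change is measured relative to the 10% baseline, a model that starts with a low MAP can cross the threshold with a small absolute gain, so it helps to read relative and absolute changes together.
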

Impact of Training Dataset Size on Neural Answer Selection Models

Training dataset and dictionary sizes matter in BERT models: the case of Baltic languages

Is Training Data Quality or Quantity More Impactful to Small Language Model Performance?

Statistical Challenges with Dataset Construction: Why You Will Never Have Enough Images

Large-Scale Dataset Pruning in Adversarial Training through Data Importance Extrapolation

Smaller Language Models are capable of selecting Instruction-Tuning Training Data for Larger Language Models

Scaling Down Deep Learning with MNIST-1D

An Empirical Study of Pre-trained Model Selection for Out-of-Distribution Generalization and Calibration

Training Data Subset Search With Ensemble Active Learning

EVALUATING THE EFFECT OF DATASET SIZE ON PREDICTIVE MODEL USING SUPERVISED LEARNING TECHNIQUE

An Information-Theoretic Analysis of Compute-Optimal Neural Scaling Laws

Budgeted Training: Rethinking Deep Neural Network Training Under Resource Constraints

Effect of Training Data Volume on Performance of Convolutional Neural Network Pneumothorax Classifiers

Does your data spark joy? Performance gains from domain upsampling at the end of training

Towards Accelerated Model Training via Bayesian Data Selection

A Reasonable Effectiveness of Features in Modeling Visual Perception of User Interfaces

The Effect of Training Dataset Size on Discriminative and Diffusion-Based Speech Enhancement Systems

Optimization of deep learning models: benchmark and analysis

"Why" Has the Least Side Effect on Model Editing

Insights from the Use of Previously Unseen Neural Architecture Search Datasets

How Small is Big Enough? Open Labeled Datasets and the Development of Deep Learning