Abstract: It is held as a truism that deep neural networks require large datasets to train effective models. However, large datasets, especially with high-quality labels, can be expensive to obtain. This study sets out to investigate (i) how large a dataset must be to train well-performing models, and (ii) what impact can be shown from fractional changes to the dataset size. A practical method to investigate these questions is to train a collection of deep neural answer selection models using fractional subsets of varying sizes of an initial dataset. We observe that dataset size has a conspicuous lack of effect on the training of some of these models, bringing the underlying algorithms into question.
What problem does this paper attempt to address?
This paper attempts to explore the impact of the size of the training data set on the performance of neural answer selection models. Specifically, the researchers want to address the following two questions:
1. **How large a data set is required to train a well-performing model?**
2. **What are the specific impacts of changes in data set size on model performance?**
To answer these questions, the authors take a practical approach: they train a series of deep neural network models on subsets of the original data set of different proportions and evaluate how each model's performance changes with the amount of training data. This shows how data set size affects model training and reveals whether particular models perform as expected on smaller data sets.
### Research Background
- **Deep Learning and Big Data**: Deep learning models usually require large amounts of data to train effectively, but high-quality, large-scale data sets are costly to obtain.
- **Task Definition**: This paper focuses on answer selection, a task at the intersection of information retrieval (IR) and natural language processing (NLP). The goal is to select the correct answer to a natural language question from a given set of candidate answers.
### Research Methods
- **Data Set**: The WikiQA data set is used as the original data set, and subsets of different proportions (10%, 25%, 50%, 75%, 100%) are generated by random sampling.
- **Model**: A variety of deep neural network models are studied, including DSSM, CDSSM, ARC-I, ARC-II, MV-LSTM, DRMM, aNMM, DUET, MatchPyramid and DRMM TKS.
- **Evaluation Metric**: Mean Average Precision (MAP) is used to evaluate the performance of models on the validation set and the test set.
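
The sampling and evaluation steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code. The function names are hypothetical. Drawing each subset independently (rather than nesting them) and skipping questions with no correct candidate when averaging are both assumptions, since the summary does not specify either detail.

```python
import random
from collections import defaultdict

# Proportions listed in the summary.
FRACTIONS = (0.10, 0.25, 0.50, 0.75, 1.00)


def fractional_subsets(examples, fractions=FRACTIONS, seed=0):
    """Return {fraction: random subset of `examples`} (hypothetical helper).

    Assumption: each subset is drawn independently without replacement.
    """
    rng = random.Random(seed)
    subsets = {}
    for frac in fractions:
        k = max(1, round(len(examples) * frac))
        subsets[frac] = rng.sample(examples, k)
    return subsets


def average_precision(labels_in_ranked_order):
    """Average precision for one question, given 0/1 labels in ranked order.

    Returns None when the question has no correct candidate.
    """
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(labels_in_ranked_order, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else None


def mean_average_precision(scored):
    """MAP over an iterable of (question_id, score, label) triples.

    Assumption: questions without any correct candidate are skipped.
    """
    by_question = defaultdict(list)
    for qid, score, label in scored:
        by_question[qid].append((score, label))

    ap_values = []
    for candidates in by_question.values():
        # Rank candidates by descending model score.
        ranked = sorted(candidates, key=lambda c: c[0], reverse=True)
        ap = average_precision([label for _, label in ranked])
        if ap is not None:
            ap_values.append(ap)
    return sum(ap_values) / len(ap_values) if ap_values else 0.0
```

In such a setup, each model would be trained once per subset and `mean_average_precision` would be applied to its scores on the fixed validation and test sets, so that only the amount of training data varies.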
### Main Findings
- **Most Models Do Not Perform as Expected**: As the training data set grows, the performance of most models does not improve significantly. For example, only CDSSM, ARC-II and DRMM TKS show a relative performance improvement of more than 20% when the data set increases from 10% to 100%.
- **Over-fitting Phenomenon**: Some models (such as DSSM and MatchPyramid) over-fit quickly on small data sets, which may indicate that these models have a strong capacity to memorize training data.
- **Validation Set Performance**: Performance trends on the validation set do not match those on the training set, which indicates that the models' generalization ability differs across amounts of training data.
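
The 20% criterion in the first finding is a relative improvement, that is, the MAP gain divided by the MAP at the smaller data set size. A minimal sketch of that calculation follows. The function names are hypothetical, and the usage note below uses placeholder values, not results from the paper.

```python
def relative_improvement(map_small, map_full):
    """Relative MAP gain when moving from the small subset to the full data set."""
    if map_small <= 0:
        raise ValueError("map_small must be positive")
    return (map_full - map_small) / map_small


def models_above_threshold(map_by_model, threshold=0.20):
    """Return the names of models whose relative gain exceeds `threshold`.

    `map_by_model` maps a model name to (MAP at 10%, MAP at 100%).
    """
    return sorted(
        name
        for name, (map_small, map_full) in map_by_model.items()
        if relative_improvement(map_small, map_full) > threshold
    )
```

For instance, a hypothetical model whose MAP rises from 0.50 to 0.65 gains 30% in relative terms and passes the threshold, while one that rises from 0.60 to 0.63 gains only 5% and does not.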
### Conclusion
The results show that data set size affects the performance of neural answer selection models less than expected. This finding can guide the choice of algorithms and strategies under resource-limited conditions. Validation set performance is also an important basis for evaluating generalization, and future research can further explore how different amounts of data affect models' generalization ability.