Abstract: Rapid model validation via the train-test paradigm has been a key driver for the breathtaking progress in machine learning and AI. However, modern AI systems often depend on a combination of tasks and data collection practices that violate all assumptions ensuring test validity. Yet, without rigorous model validation we cannot ensure the intended outcomes of deployed AI systems, including positive social impact, nor continue to advance AI research in a scientifically sound way. In this paper, I will show that for widely considered inference settings in complex social systems the train-test paradigm does not only lack a justification but is indeed invalid for any risk estimator, including counterfactual and causal estimators, with high probability. These formal impossibility results highlight a fundamental epistemic issue, i.e., that for key tasks in modern AI we cannot know whether models are valid under current data collection practices. Importantly, this includes variants of both recommender systems and reasoning via large language models, and neither naïve scaling nor limited benchmarks are suited to address this issue. I am illustrating these results via the widely used MovieLens benchmark and conclude by discussing the implications of these results for AI in social systems, including possible remedies such as participatory data curation and open science.

What problem does this paper attempt to address?

The core problem this paper addresses is whether, in complex social systems, models can be validly evaluated when the data is collected passively. Specifically, the author asks whether, under current data collection practices, we can know that AI systems (such as recommender systems and large language models) perform well on complex tasks.

### Background of the Paper and Problem Statement

Validation of modern AI systems usually relies on the train-test paradigm: a data set is split into a training set and a test set, and performance on the test set is used to estimate how well the model generalizes. However, as AI tasks become more complex and data collection practices change, the assumptions that justify this paradigm may no longer hold. Especially in complex social systems, passive data collection (such as scraping data from the Internet) may lead to sample bias and unevenly distributed samples, which undermines the validity of model validation.

### Research Questions

The research question posed by the author is:

**Research Question 1 (RQ1)**: Given the complex tasks we require AI systems to solve and the current data collection methods, can we know whether a model performs well on such tasks?

### Main Contributions

1. **Theorem 1 (informally stated)**: In complex social systems, for most nodes of the system, if the data is passively collected, then under the ontological parsimony assumption the train-test paradigm cannot validly evaluate the model.
2. **Corollary 2 (informally stated)**: Naïve scaling (such as increasing the amount of data) and limited benchmarks are not sufficient to resolve the problem in Theorem 1, and are therefore not suited to obtaining valid test results in this setting.
3. **Experimental Evidence**: Experiments on the popular MovieLens benchmark data set illustrate the invalidity of testing in recommendation tasks.

### Result Explanation

By introducing the concepts of "possible worlds" and "sample graphs", the author shows that in complex social systems the train-test paradigm cannot ensure valid model validation, because the data follows heavy-tailed distributions and the samples are generated by the internal dynamics of the system itself. In particular, when the k-connectivity of the sample graph is lower than the system complexity, model validation fails.

### Practical Implications

These results have important implications for the application of AI in social systems. For example, recommender systems and question-answering systems (such as systems based on large language models) may encounter serious generalization problems in actual deployment, because existing validation methods cannot accurately evaluate their performance. For this reason, the author suggests remedies such as participatory data curation and open science to improve model validation.

### Summary

The paper points out that in complex social systems the train-test paradigm is invalid in many cases when data is collected passively. This not only affects model validation but also puts the intended social impact of AI systems at risk. To ensure the reliability and effectiveness of AI systems, new validation methods and data collection strategies need to be developed.

No Free Delivery Service: Epistemic limits of passive data collection in complex social systems

Scaling Laws Do Not Scale

Improving the Validity and Practical Usefulness of AI/ML Evaluations Using an Estimands Framework

When not to use machine learning: A perspective on potential and limitations

Evaluation Gaps in Machine Learning Practice

AI and the Problem of Knowledge Collapse

Practical approaches in evaluating validation and biases of machine learning applied to mobile health studies

Statistical Challenges with Dataset Construction: Why You Will Never Have Enough Images

A Cautionary Tail: A Framework and Case Study for Testing Predictive Model Validity

The Causal Chambers: Real Physical Systems as a Testbed for AI Methodology

A Common Misassumption in Online Experiments with Machine Learning Models

Wrong side of the tracks: Big Data and Protected Categories

AI and Social Theory

Automating Ambiguity: Challenges and Pitfalls of Artificial Intelligence

Uncovering the Data-Related Limits of Human Reasoning Research: An Analysis based on Recommender Systems

A Case Study on a Sustainable Framework for Ethically Aware Predictive Modeling

On (in)validating environmental models. 1. Principles for formulating a Turing‐like Test for determining when a model is fit‐for purpose

The Challenges of Machine Learning and Their Economic Implications

The Return of Pseudosciences in Artificial Intelligence: Have Machine Learning and Deep Learning Forgotten Lessons from Statistics and History?

Rigorous (in)validation of ecological models

Moving towards more holistic validation of machine learning-based approaches in ecology and evolution