Defining data science: a new field of inquiry

Michael L Brodie
DOI: https://doi.org/10.48550/arXiv.2306.16177
IF: 5.414
2023-06-28
Machine Learning
Abstract: Data science is not a science. It is a research paradigm. Its power, scope, and scale will surpass science, our most powerful research paradigm, to enable knowledge discovery and change our world. We have yet to understand and define it, a step vital to realizing its potential and managing its risks. Modern data science is in its infancy. Emerging slowly since 1962 and rapidly since 2000, it is a fundamentally new field of inquiry, one of the most active, powerful, and rapidly evolving 21st-century innovations. Due to its value, power, and applicability, it is emerging in 40+ disciplines, hundreds of research areas, and thousands of applications. Millions of data science publications contain myriad definitions of data science and data science problem solving. Due to its infancy, many definitions are independent, application-specific, mutually incomplete, redundant, or inconsistent, hence so is data science. This research addresses this data science multiple-definitions challenge by proposing the development of a coherent, unified definition based on a data science reference framework, using a data science journal for the data science community to achieve such a definition. This paper provides candidate definitions for essential data science artifacts that are required to discuss such a definition. They are based on the classical research paradigm concept, consisting of a philosophy of data science, the data science problem-solving paradigm, and the six-component data science reference framework (axiology, ontology, epistemology, methodology, methods, technology), a frequently called-for unifying framework with which to define, unify, and evolve data science. The paper presents challenges for defining data science, solution approaches (i.e., means for defining data science), and their requirements and benefits as the basis of a comprehensive solution.
What problem does this paper attempt to address?
The main problem this paper attempts to address is the multiplicity and inconsistency of definitions of data science. Specifically, the paper focuses on the following aspects:

1. **Definition of Data Science**: What is data science? How can it be defined as a new field of inquiry, either as a research paradigm category or as a research domain?
2. **Definition of Data Science Disciplines**: How is data science, as applied in different disciplines, defined? How can a data science discipline be defined across all data science disciplines?
3. **Definition of Specific Data Science Disciplines**: How is data science, as applied in a specific discipline, defined? For example, is there value in unifying the multiple definitions of data science as applied in Natural Language Processing (NLP)?
4. **Definition of Data Analysis**: What is data analysis? How can it be defined as a new problem-solving paradigm category?
5. **Data Analysis in Specific Data Science Disciplines**: How is data analysis, as conducted in specific data science disciplines, defined? How can a data science problem-solving paradigm category be defined across all data science disciplines?
6. **Problem Solving in Specific Data Science Disciplines**: How is problem solving, as conducted in a specific data science discipline, defined? For example, is there value in unifying the multiple definitions of data science problem solving within NLP?

The paper proposes a systematic approach to these challenges, introducing the concepts of classical paradigms, categories, and disciplines with the aim of establishing a unified, coherent, and evolving framework for defining data science.