Abstract: Big Data bring new opportunities to modern society and challenges to data scientists. On the one hand, Big Data hold great promise for discovering subtle population patterns and heterogeneities that cannot be detected with small-scale data. On the other hand, the massive sample size and high dimensionality of Big Data introduce unique computational and statistical challenges, including scalability and storage bottlenecks, noise accumulation, spurious correlation, incidental endogeneity, and measurement errors. These challenges are distinctive and require new computational and statistical paradigms. This article gives an overview of the salient features of Big Data and of how these features drive paradigm change in statistical and computational methods as well as in computing architectures. We also provide various new perspectives on Big Data analysis and computation. In particular, we emphasize the viability of the sparsest solution in a high-confidence set and point out that the exogeneity assumptions in most statistical methods for Big Data cannot be validated due to incidental endogeneity. When violated, these assumptions can lead to wrong statistical inferences and, consequently, wrong scientific conclusions.

What problem does this paper attempt to address?

The paper primarily explores the opportunities and challenges of big data analysis, especially for high-dimensional data with large sample sizes. Specifically:

1. **Characteristics of Big Data and the Challenges They Bring**: The paper points out that big data is characterized by high dimensionality and large sample sizes, which bring three unique challenges to data analysis: (1) high dimensionality leads to noise accumulation, spurious correlations, and incidental endogeneity; (2) the combination of high dimensionality and large sample sizes results in enormous computational costs and algorithmic instability; (3) large samples aggregated from different sources introduce heterogeneity, experimental variability, and statistical biases.
2. **Transformation of Statistical Methods**: Addressing these challenges requires new statistical thinking and computational methods. Traditional methods designed for moderate sample sizes do not scale to massive data, and methods designed for low-dimensional data face significant difficulties in high dimensions. Effective statistical procedures for exploring and predicting with big data must therefore account for heterogeneity, noise accumulation, spurious correlations, and incidental endogeneity, while balancing statistical accuracy against computational efficiency.
3. **Development of Computational Infrastructure**: Big data has also driven the development of new computational infrastructure and data storage methods. Optimization is no longer the goal but a tool for analyzing big data. This paradigm shift has prompted the development of efficient algorithms that can process large-scale, high-dimensional data.

Taken together, these points show the paper's aim: to explain how new statistical and computational methods can be used to analyze and process big data effectively in the big data era, enabling scientific discovery and economic value.
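The spurious-correlation challenge named in the summary can be illustrated with a small simulation. This is a minimal sketch and not the paper's own experiment: the function name `max_spurious_correlation`, the sample size `n=60`, the values of `p`, and the standard normal data are all illustrative assumptions. The response and every feature are generated independently, so any observed correlation is purely spurious, yet the largest absolute sample correlation tends to grow as the number of features increases.

```python
import numpy as np


def max_spurious_correlation(n, p, seed=0):
    """Largest absolute sample correlation between a response and p
    noise features, all drawn independently from N(0, 1).

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    y = rng.standard_normal(n)
    X = rng.standard_normal((n, p))

    # Center and rescale to unit norm so that a dot product between
    # the response and a feature column equals their Pearson correlation.
    y_c = y - y.mean()
    y_c /= np.linalg.norm(y_c)
    X_c = X - X.mean(axis=0)
    X_c /= np.linalg.norm(X_c, axis=0)

    correlations = X_c.T @ y_c  # one correlation per feature
    return float(np.max(np.abs(correlations)))


if __name__ == "__main__":
    # The true correlation is zero for every feature; only p changes.
    for p in (10, 100, 1000, 10000):
        print(p, round(max_spurious_correlation(n=60, p=p), 3))
```

With the sample size held fixed, a variable-selection procedure that picks the feature most correlated with the response would, in this setting, select pure noise with an apparently strong association, which is why the paper treats high dimensionality as a statistical problem and not only a computational one.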

Challenges of Big Data Analysis
