Abstract: As the basis of data management and analysis, data quality has become a research hotspot in related fields, and improving it contributes to the optimization of big data and artificial intelligence technology. Physical failures or technical defects in data collectors and recorders commonly cause anomalies in the collected data. These anomalies strongly affect subsequent data analysis and artificial intelligence processes, so data should be cleaned before use. Existing smoothing-based repair methods over-repair a large number of originally correct data points into wrong values. Constraint-based methods such as sequential dependency and SCREEN cannot accurately repair data under complex conditions because their constraints are relatively simple. This paper therefore proposes a time series data repairing method under multi-speed constraints, based on the principle of minimum repairing, and uses dynamic programming to find the optimal repair. Specifically, multiple speed intervals are set to constrain the time series data, and a set of candidate repair points is generated for each data point according to the speed constraints. The optimal repair solution is then selected from these candidates by dynamic programming. To evaluate the feasibility of the method, experiments are conducted on an artificial dataset, two real datasets, and another real dataset with real anomalies, under different anomaly rates and data sizes. Experimental results demonstrate that, compared with existing methods based on smoothing or constraints, the proposed method performs better in terms of RMS error and time cost. In addition, an investigation of clustering and classification accuracy on several datasets reveals the impact of data quality on subsequent data analysis and artificial intelligence. The proposed method can improve the quality of data analysis and artificial intelligence results.
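The candidate-generation and dynamic-programming steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the function names (`speed_ok`, `candidates`, `repair`), the `window` parameter, the way candidates are derived from neighbouring observations, and the use of total absolute change as the "minimum repairing" cost are all assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple

Interval = Tuple[float, float]  # one allowed speed range (lo, hi); hypothetical type alias


def speed_ok(t0: float, v0: float, t1: float, v1: float,
             intervals: Sequence[Interval], eps: float = 1e-9) -> bool:
    """True if the speed between two points lies in at least one allowed interval."""
    speed = (v1 - v0) / (t1 - t0)
    return any(lo - eps <= speed <= hi + eps for lo, hi in intervals)


def candidates(times: Sequence[float], values: Sequence[float], i: int,
               intervals: Sequence[Interval], window: int = 1) -> List[float]:
    """Assumed candidate set: the observed value plus the values that sit exactly
    on a speed bound relative to each observed neighbour within `window`."""
    cands = {float(values[i])}
    for j in range(max(0, i - window), min(len(values), i + window + 1)):
        if j == i:
            continue
        dt = times[i] - times[j]  # negative when the neighbour comes later
        for lo, hi in intervals:
            cands.add(values[j] + lo * dt)
            cands.add(values[j] + hi * dt)
    return sorted(cands)


def repair(times: Sequence[float], values: Sequence[float],
           intervals: Sequence[Interval], window: int = 1) -> List[float]:
    """Choose one candidate per point so that consecutive points satisfy the speed
    constraints and the total absolute change from the observations is minimal."""
    n = len(values)
    if n == 0:
        return []
    cands = [candidates(times, values, i, intervals, window) for i in range(n)]
    # cost[i][k]: cheapest repair of points 0..i that ends at candidate k of point i
    cost = [[abs(c - values[0]) for c in cands[0]]]
    back = [[-1] * len(cands[0])]
    for i in range(1, n):
        row_cost, row_back = [], []
        for c in cands[i]:
            best, arg = math.inf, -1
            for k, p in enumerate(cands[i - 1]):
                if cost[i - 1][k] < best and speed_ok(times[i - 1], p, times[i], c, intervals):
                    best, arg = cost[i - 1][k], k
            row_cost.append(best + abs(c - values[i]))
            row_back.append(arg)
        cost.append(row_cost)
        back.append(row_back)
    end = min(range(len(cands[-1])), key=lambda k: cost[-1][k])
    if math.isinf(cost[-1][end]):
        raise ValueError("no feasible repair within the assumed candidate sets")
    # Follow the back-pointers from the cheapest final candidate.
    repaired = [0.0] * n
    k = end
    for i in range(n - 1, -1, -1):
        repaired[i] = cands[i][k]
        k = back[i][k]
    return repaired


if __name__ == "__main__":
    ts = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
    xs = [0.0, 1.0, 2.0, 100.0, 4.0, 5.0]  # one spike at index 3
    print(repair(ts, xs, [(-2.0, 2.0)]))
```

With the single speed interval `(-2.0, 2.0)`, the spike at index 3 is replaced by `4.0` and every other point is left unchanged. Passing several intervals, for example `[(-1.0, 1.0), (8.0, 10.0)]`, lets a legitimate fast change pass without repair, which is the motivation for multi-speed constraints. Because the candidate sets here are finite and built only from observed neighbours, this sketch can fail to find a feasible repair on heavily corrupted data.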
