Abstract: Big data techniques have been applied to the power grid for the prediction and evaluation of grid conditions. However, raw data quality rarely meets the requirements of precise data analytics, since a raw data set usually contains samples with missing data, to which common data mining models are sensitive. Moreover, the raw training data from a single monitoring system, e.g. dissolved gas analysis (DGA), rarely provide enough valid instances for training, since the raw data set usually also contains noisy samples. Though classic methods such as neural networks can be used to fill the gaps left by missing data and to classify the fault type, their models often fail to fit the rules of power grid conditions. This paper presents an integrated data preprocessing framework (DPF) based on Apache Spark that combines missing data prediction, data fusion, data cleansing and fault type classification. It aims to improve prediction accuracy for data sets with missing data points and classification accuracy for noisy data, while meeting big data requirements. First, a prediction model is trained based on linear regression (LinR). Afterwards, we propose an optimized linear method (OLR) to improve the prediction accuracy. Then, to better utilize the strong correlation among different data sources, new data features extracted using the Pearson correlation coefficient (PCC) are fused into the training data set. Next, principal component analysis (PCA) is applied to reduce the side effects introduced by the new features while retaining the information significant for classification. Finally, classification models based on logistic regression (LogR) and support vector machine (SVM) are trained to classify the fault type of electric equipment. We test the DPF framework on missing data prediction and fault type classification for power transformers in a power grid system. The experimental results show that predictors based on the proposed framework achieve lower mean square error, and classifiers obtain higher accuracy, than traditional ones. Besides, the training time required for large-scale data shows a decreasing trend. Therefore, DPF is a good candidate for predicting missing data and classifying fault types in power grid systems.
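As a rough illustration of the missing-data prediction and correlation steps named in the abstract, the sketch below fits a single-predictor least-squares line on complete samples, uses it to fill a missing value, and computes a Pearson correlation coefficient. It is a minimal sketch under stated assumptions, not the paper's method: it uses plain Python rather than Apache Spark, ordinary least squares rather than the proposed OLR (which the abstract does not specify), and the column names `h2` and `ch4` and all readings are made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a * x + b for one predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return a, my - a * mx


def impute(rows, target, predictor):
    """Fill None values in `target` using a line fitted on the complete rows."""
    complete = [r for r in rows if r[target] is not None]
    a, b = fit_line(
        [r[predictor] for r in complete], [r[target] for r in complete]
    )
    return [
        r if r[target] is not None else {**r, target: a * r[predictor] + b}
        for r in rows
    ]


# Hypothetical readings (invented numbers); the third sample lacks "ch4".
rows = [
    {"h2": 10.0, "ch4": 21.0},
    {"h2": 20.0, "ch4": 41.0},
    {"h2": 30.0, "ch4": None},
    {"h2": 40.0, "ch4": 81.0},
]
filled = impute(rows, target="ch4", predictor="h2")

# Correlation between the two columns on the complete samples only.
complete = [r for r in rows if r["ch4"] is not None]
r_value = pearson([r["h2"] for r in complete], [r["ch4"] for r in complete])
```

In a setting like the one the abstract describes, a correlation score of this kind could be used to decide which features from another data source are worth fusing into the training set; the threshold for doing so is not given in the abstract.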

DPASF: a flink library for streaming data preprocessing

An efficient architecture for processing real-time traffic data streams using apache flink

SPARQL2Flink: Evaluation of SPARQL Queries on Apache Flink

A stream processing abstraction framework

GeoFlink: A Distributed and Scalable Framework for the Real-time Processing of Spatial Streams

Exploring Real-Time Data Processing Using Big Data Frameworks

Using streaming data and Apache Flink to infer energy consumption

Differentially Private Stream Processing at Scale

FaaS and Furious: abstractions and differential caching for efficient data pre-processing

A Spark ML driven preprocessing approach for deep learning based scholarly data applications

An Integrated Data Preprocessing Framework Based on Apache Spark for Fault Diagnosis of Power Grid Equipment

Technical Report: On the Usability of Hadoop MapReduce, Apache Spark & Apache Flink for Data Science

FlashView

Evaluating New Approaches of Big Data Analytics Frameworks

s2p: Provenance Research for Stream Processing System

SDPPF — A MapReduce based parallel processing framework for spatial data

Distributed Streaming Analytics on Large-scale Oceanographic Data using Apache Spark

A new Apache Spark-based framework for big data streaming forecasting in IoT networks

Towards Health Data Stream Analytics

Data Provenance and Management in Radio Astronomy: A Stream Computing Approach

Feature-Selected and -Preserved Sampling for High-Dimensional Stream Data Summary