Abstract: Big data techniques have been applied to the power grid for predicting and evaluating grid conditions. However, raw data quality rarely meets the requirements of precise data analytics, since raw data sets usually contain samples with missing values, to which common data mining models are sensitive. Moreover, the raw training data from a single monitoring system, e.g. dissolved gas analysis (DGA), rarely provide enough valid instances for training, since raw data sets usually also contain noisy samples. Although classic methods such as neural networks can be used to fill in missing data and classify the fault type, their models often fail to fit the rules of power grid conditions. This paper presents an integrated data preprocessing framework (DPF) based on Apache Spark that combines missing data prediction, data fusion, data cleansing and fault type classification. It aims to improve prediction accuracy on data sets with missing data points and classification accuracy on noisy data, while meeting big data requirements. First, the prediction model is trained using linear regression (LinR). Afterwards, we propose an optimized linear method (OLR) to improve the prediction accuracy. Then, to better exploit the strong correlation among different data sources, new data features extracted using the Pearson correlation coefficient (PCC) are fused into the training data set. Next, principal component analysis (PCA) is applied to reduce the side effects introduced by the new features while retaining the information significant for classification. Finally, classification models based on logistic regression (LogR) and support vector machine (SVM) are trained to classify the fault type of electric equipment. We test the DPF on missing data prediction and fault type classification of power transformers in a power grid system.
The experimental results show that the predictors based on the proposed framework achieve lower mean square error and the classifiers obtain higher accuracy than traditional ones. In addition, the training time on large-scale data shows a decreasing trend. Therefore, the DPF would be a good candidate for predicting missing data and classifying fault types in power grid systems.
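
The first two steps named in the abstract, choosing a correlated data source with the Pearson correlation coefficient and predicting a missing value with linear regression, can be illustrated with a minimal sketch. This is not the paper's OLR method or its Spark implementation. It is plain single-predictor least squares in standard-library Python, and the function names (`pearson`, `fit_line`, `impute`) and the gas and temperature readings are hypothetical values chosen only for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a * x + b with one predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx


def impute(target, candidates):
    """Fill None entries in `target` using the most correlated candidate."""
    known = [i for i, v in enumerate(target) if v is not None]
    ys = [target[i] for i in known]

    # Rank candidates by |PCC|, computed only on rows where target is known.
    def strength(name):
        return abs(pearson([candidates[name][i] for i in known], ys))

    best = max(candidates, key=strength)
    a, b = fit_line([candidates[best][i] for i in known], ys)
    filled = [v if v is not None else a * candidates[best][i] + b
              for i, v in enumerate(target)]
    return best, filled


# Hypothetical readings: one hydrogen sample is missing (None).
h2 = [10.0, 20.0, None, 40.0, 50.0]
sources = {
    "ch4": [1.0, 2.0, 3.0, 4.0, 5.0],        # hypothetical second gas channel
    "temp": [30.0, 29.0, 31.0, 30.0, 28.0],  # hypothetical oil temperature
}
best, filled = impute(h2, sources)
print(best, filled)
```

In this toy data the `ch4` series is exactly proportional to the known `h2` values, so it is selected as the predictor and the gap is filled from the fitted line. A full pipeline in the spirit of the abstract would go on to apply PCA and train LogR or SVM classifiers on the fused features, which this sketch does not attempt.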

Improving Power Grid Monitoring Data Quality: an Efficient Machine Learning Framework for Missing Data Prediction

Robust and Automatic Data Cleansing Method for Short-Term Load Forecasting of Distribution Feeders

A Data Fusion and Data Cleaning System for Smart Grids Big Data

Power Grid Missing Data Filling Method Based on Historical Data Mining Assisted Multi-dimensional Scenario Analysis

Enhancing Smart Grid Sustainability: Using Advanced Hybrid Machine Learning Techniques While Considering Multiple Influencing Factors for Imputing Missing Electric Load Data

An Integrated Data Preprocessing Framework Based on Apache Spark for Fault Diagnosis of Power Grid Equipment

Mat-Transformer-Based State Prediction Method for Information Equipment

Data-Based Line Trip Fault Prediction in Power Systems Using LSTM Networks and SVM

Analysis and Prediction of Power Distribution Network Loss Based on Machine Learning

Improving Power Grid Resilience Through Predictive Outage Estimation

Missing-Data Tolerant Hybrid Learning Method for Solar Power Forecasting

Recovery Algorithm of Power Metering Data Based on Collaborative Fitting

A hybrid approach based machine learning models in electricity markets

Short-term power grid load forecasting based on variable weight combination hybrid model

A Hybrid System Based on LSTM for Short-Term Power Load Forecasting

Physical-Model-Aided Data-Driven Linear Power Flow Model: an Approach to Address Missing Training Data

An advanced framework for net electricity consumption prediction: Incorporating novel machine learning models and optimization algorithms

Evaluation and Analysis of Urban Power Grid Operation Status Based on Online Sequence Extreme Learning Machine and Self-Coding Network

Machine Learning-Based Sensor Data Modeling Methods for Power Transformer PHM

Big Data Cleaning Model of Multi-Source Heterogeneous Power Grid Based On Machine Learning Classification Algorithm

Assessing deep learning performance in power demand forecasting for smart grid