Data quality challenges in existing distribution network datasets

Frederik Geth, Marta Vanin, Dirk Van Hertem
2023-08-01
Abstract: Existing digital distribution network models, like those in the databases of network utilities, are known to contain erroneous or untrustworthy information. This can compromise the effectiveness of physics-based engineering simulations and technologies, in particular those that are needed to deliver the energy transition. The large-scale rollout of smart meters presents new opportunities for data-driven system identification in distribution networks, enabling the improvement of existing data sets. Despite the increasing academic attention to system identification for distribution networks, researchers often make troublesome assumptions on what data is available and/or trustworthy. In this paper, we highlight some differences between academic efforts and first-hand industrial experiences, in order to steer the former towards more applicable research solutions.
Systems and Control
What problem does this paper attempt to address?
The paper focuses on data quality issues in distribution network datasets and on how they undermine the physics-based engineering simulations and technologies needed for the energy transition. The authors point out that, although the large-scale rollout of smart meters opens new opportunities for data-driven system identification in distribution networks, academic research often makes unrealistic assumptions about which data is available and trustworthy. Specifically, the paper discusses four issues:

1. **Modeling errors or simplifications**: for example, applying Kron reduction to networks where the neutral conductor is not grounded at every node.
2. **Network data errors**: such as incorrect impedance values and topology information.
3. **Inevitable measurement errors**: noisy or bad data caused by sensor tolerances or failures.
4. **Insufficient measurements**: including semantic mismatches (e.g., average values instead of instantaneous values), granularity mismatches (e.g., three-phase totals instead of per-phase measurements), and label mismatches (e.g., incorrect location or phase metadata).

The paper emphasizes the importance of addressing these issues and offers several recommendations: understanding, applying, and improving best practices for network dataset development; developing automated data cleaning and maintenance tools; and establishing practical methods to verify that data corrections are effective.

The paper also details specific issues found in actual network data and proposes the concept of a system identification and network data cleaning framework that gradually improves the quality of existing distribution network models through a series of calibration tasks. The framework includes steps such as analyzing the input data, selecting the most appropriate processing workflow, applying the chosen methods, and verifying the results.
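To make the Kron reduction mentioned under the first issue concrete, the following is a minimal sketch, not the authors' implementation. It eliminates the neutral conductor from a four-wire series impedance matrix using the standard formula Z_red[i][j] = Z[i][j] - Z[i][n] * Z[n][j] / Z[n][n]. The function name `kron_reduce_neutral` and the impedance values are hypothetical and chosen only for illustration. The formula relies on the assumption that the voltage drop along the neutral is zero, which is exactly the assumption that fails when the neutral is not grounded at every node.

```python
def kron_reduce_neutral(z, n):
    """Eliminate conductor `n` from a square impedance matrix `z`.

    `z` is a list of lists of complex numbers (ohm per unit length).
    Assumes zero voltage drop along conductor `n`, i.e. the neutral is
    perfectly grounded at both ends of the line segment.
    """
    size = len(z)
    if any(len(row) != size for row in z):
        raise ValueError("impedance matrix must be square")
    if z[n][n] == 0:
        raise ValueError("self-impedance of the eliminated conductor is zero")
    keep = [i for i in range(size) if i != n]
    # Schur complement with respect to the eliminated conductor.
    return [
        [z[i][j] - z[i][n] * z[n][j] / z[n][n] for j in keep]
        for i in keep
    ]


# Illustrative (made-up) four-wire matrix: phases a, b, c and neutral.
z_self = 0.5 + 0.8j
z_mutual = 0.1 + 0.4j
z_4wire = [
    [z_self if i == j else z_mutual for j in range(4)]
    for i in range(4)
]

# Reduce to an equivalent 3x3 phase impedance matrix (neutral is index 3).
z_3wire = kron_reduce_neutral(z_4wire, 3)
for row in z_3wire:
    print(row)
```

If the neutral is grounded only at some nodes, the reduced 3x3 matrix no longer represents the line correctly, and the explicit four-wire model would have to be kept.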
Finally, the paper calls on researchers to consider real-world application scenarios and suggests that future research should address underexplored sources of network data errors, combine methods so that multiple error sources can be handled together, examine how adverse measurement conditions affect system identification methods, and incorporate real-world validation strategies.