Graphical representation of data prediction potential: correlation graphs and correlation chains

Adam Dudáš
DOI: https://doi.org/10.1007/s00371-023-03240-y
IF: 2.835
2024-01-24
The Visual Computer
Abstract: The correlation of the set of attributes is a crucial statistical value for the measuring of prediction potential present in a dataset. The correlation coefficient, which measures the correlation between the values of two attributes, can be used in order to measure the prediction potential between two-element subsets of a dataset containing a high number of attributes. In this way two common summary visualizations of prediction potential in datasets are formed: correlation matrices and correlation heatmaps. Both of these visualizations are focused on the presentation of correlation between pairs of attributes but not much more regarding the context of correlations in the dataset. The main objective of this article is the design and implementation of graphical models usable in a visual representation of data prediction potential, namely correlation graphs and correlation chains, which emphasize the pseudo-transitivity of prediction potential in a dataset.
computer science, software engineering
What problem does this paper attempt to address?
The problem that this paper attempts to solve is the insufficiency of existing visualization methods for predictive potential in datasets. Specifically, although traditional correlation matrices and correlation heatmaps can show the correlations between attribute pairs, they have limitations in presenting the correlation context in the dataset and in discovering patterns and trends in the dataset. Moreover, these methods have poor readability and interpretability when dealing with large or high-dimensional datasets. To overcome these problems, the paper designs and implements two new graphical models, correlation graphs and correlation chains, to represent the predictive potential in the dataset more effectively and to emphasize the pseudo-transitivity of the predictive potential in the dataset. These models are based on graph theory principles and are designed to help analysts discover patterns and trends when analyzing data.

### Main contributions:

1. **Propose original visualization models**: These models are suitable for correlation analysis and subsequent predictive analysis of large and multi-dimensional datasets. Based on graph theory principles, they are called correlation graphs and correlation chains.
2. **Implement the proposed graphical representations**: These graphical representations of predictive potential are implemented in the form of freely available Python code.
3. **Evaluate the proposed graphical models**: These models are evaluated on two datasets of different sizes and structures: one is the standard Iris dataset, and the other is the original graph-attribute dataset containing multiple attributes and records.

### Specific methods:

- **Correlation graph**: Represent the correlations in the dataset by constructing an undirected weighted graph, where each node corresponds to an attribute and the weight of an edge represents the correlation coefficient between two attributes. Simplify the complexity of the graph through a two-stage pruning method (selecting edges with the maximum correlation value and setting a correlation threshold).
- **Correlation chain**: Extract a sub-graph from the correlation graph, which contains edges with correlations greater than the set threshold, further emphasizing the pseudo-transitivity of the predictive potential in the dataset.

### Evaluation results:

- **Iris dataset**: Correlation graphs and correlation chains can clearly show the direct and indirect influences between various attributes, especially when predicting the types of flowers.
- **Cubic-graph-attribute dataset**: When dealing with larger and more complex datasets, correlation graphs and correlation chains still maintain good readability and interpretability and can effectively identify the correlations between attributes.

In conclusion, through the design and implementation of new graphical models, this paper solves the deficiencies of traditional correlation visualization methods in dealing with large and high-dimensional datasets and provides more effective tools for data analysis.
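
The correlation graph, its two-stage pruning, and the correlation chain described under "Specific methods" can be illustrated with a minimal Python sketch. This is not the paper's published code. It is one possible reading of the summary, under the following assumptions: the Pearson coefficient is used, edges are compared by absolute correlation, "selecting edges with the maximum correlation value" means keeping the strongest edge of each node, and a chain is the set of attributes reachable from a chosen start attribute over the pruned edges. All function names, the toy columns `x`, `y`, `z`, `w`, and the threshold `0.6` are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_graph(columns):
    """Complete undirected weighted graph: one node per attribute, one edge
    per attribute pair, weighted by the correlation coefficient."""
    return {
        frozenset((a, b)): pearson(columns[a], columns[b])
        for a, b in combinations(columns, 2)
    }


def prune(graph, threshold):
    """Two-stage pruning (assumed interpretation).

    Stage 1 keeps, for every node, its strongest incident edge by absolute
    correlation. Stage 2 drops kept edges below the threshold.
    """
    nodes = set().union(*graph)
    strongest = set()
    for node in nodes:
        incident = [edge for edge in graph if node in edge]
        strongest.add(max(incident, key=lambda edge: abs(graph[edge])))
    return {e: graph[e] for e in strongest if abs(graph[e]) >= threshold}


def correlation_chain(pruned, start):
    """Attributes reachable from `start` over pruned edges (assumed reading
    of a correlation chain), returned in sorted order."""
    chain, frontier = {start}, [start]
    while frontier:
        node = frontier.pop()
        for edge in pruned:
            if node in edge:
                (other,) = edge - {node}
                if other not in chain:
                    chain.add(other)
                    frontier.append(other)
    return sorted(chain)


# Hypothetical toy dataset: x, y, z move together, w is mostly noise.
columns = {
    "x": [1, 2, 3, 4, 5],
    "y": [1, 2, 3, 4, 6],
    "z": [2, 1, 4, 3, 6],
    "w": [3, 1, 2, 1, 3],
}

graph = correlation_graph(columns)
pruned = prune(graph, threshold=0.6)
chain = correlation_chain(pruned, "x")
print(chain)  # ['x', 'y', 'z']
```

In this toy example `z` is linked to `x` only through `y`, which is the kind of indirect (pseudo-transitive) relationship that a correlation matrix or heatmap does not make visible. The weakly correlated attribute `w` drops out at the threshold stage.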