AccuStripes: Adaptive Binning for the Visual Comparison of Univariate Data Distributions

Anja Heim, Eduard Gröller, Christoph Heinzl
DOI: https://doi.org/10.48550/arXiv.2207.13663
2022-09-06
Abstract: Understanding and comparing distributions of data (e.g., regarding their modes, shapes, or outliers) is a common challenge in many scientific disciplines. Typically, this challenge is addressed using side-by-side comparisons of histograms or density plots. However, comparing multiple density plots is mentally demanding. Uniform histograms often represent distributions imprecisely, since missing values, outliers, or modes are hidden by a grouping of equal size. In this paper, a novel type of overview visualization for the comparison of univariate data distributions is presented: AccuStripes (i.e., accumulated stripes) is a new visual metaphor encoding accumulations of data distributions according to adaptive binning using color-coded stripes of irregular width. We provide detailed insights into the challenges of binning. Specifically, we explore different adaptive binning concepts, such as Bayesian Blocks binning and Jenks Natural Breaks binning, for the computation of bin boundaries in terms of their capability to represent the datasets as accurately as possible. In addition, we discuss issues arising in the design of representations for the comparative visualization of distributions: to allow for a comparison of many distributions, their accumulated representations are plotted below each other in a stacked mode. Based on our findings, we propose three different layouts for the comparative visualization of multiple distributions. The usefulness of AccuStripes is investigated using a statistical evaluation of the binning methods. Using a similarity metric from cluster analysis, it is shown which binning method statistically yields the best grouping results. Through a user study, we evaluate which binning strategy visually represents the distribution in the most intuitive form and investigate which layout allows users to compare many distributions with the least effort.
Human-Computer Interaction, Graphics
What problem does this paper attempt to address?
The paper addresses how to understand and compare univariate data distributions (e.g., their modes, shapes, or outliers) more accurately, a task common to many scientific disciplines. The usual approach is a side-by-side comparison of histograms or density plots, which becomes hard to read when multiple distributions need to be compared. Specifically:

1. **Limitations of Histograms**: Uniform binning is simple and easy to use, but it represents data distributions imprecisely. Missing values, outliers, or modes may be hidden inside bins of equal size, which makes the structure of the data difficult to identify.
2. **Limitations of Density Plots**: Density plots show the shape of the entire distribution, but comparing several of them side by side is mentally demanding.

To overcome these limitations, the paper proposes AccuStripes (i.e., "accumulated stripes"), a new visualization for representing and comparing univariate data distributions. AccuStripes uses adaptive binning to encode the accumulations of a distribution as color-coded stripes of irregular width, so that the structural features of the data are captured more precisely than with uniform bins. The paper explores different adaptive binning methods (such as Bayesian Blocks binning and Jenks Natural Breaks binning) and how accurately they represent the data, and proposes three layouts (Bin Layout, Bin+Curve Layout, and Filled Curve Layout) in which the accumulated representations of many distributions are stacked below each other for comparison.

The evaluation has two parts. A statistical evaluation uses a similarity metric from cluster analysis to determine which binning method yields the best grouping results. A user study examines which binning strategy represents the shape of a distribution most intuitively and which layout makes comparing many distributions least effortful. According to this summary, adaptive binning represents the shape of data distributions more intuitively and accurately than uniform binning.
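The difference between uniform and adaptive binning can be illustrated with a small sketch. The code below is not the authors' implementation: it assumes the common dynamic-programming formulation of Jenks-style natural breaks (minimizing the within-class sum of squared deviations), and the function names `uniform_counts` and `jenks_classes` as well as the toy dataset are hypothetical, chosen only for illustration.

```python
from typing import List


def uniform_counts(values: List[float], k: int) -> List[int]:
    """Count values in k equal-width bins spanning [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    counts = [0] * k
    for x in values:
        # Clamp so that the maximum value falls into the last bin.
        idx = min(int((x - lo) / width), k - 1)
        counts[idx] += 1
    return counts


def jenks_classes(values: List[float], k: int) -> List[List[float]]:
    """Split sorted values into k contiguous classes that minimize the
    total within-class sum of squared deviations (Jenks-style sketch)."""
    xs = sorted(values)
    n = len(xs)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of values")

    # Prefix sums give the squared deviation of any slice in O(1).
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, x in enumerate(xs):
        s1[i + 1] = s1[i] + x
        s2[i + 1] = s2[i] + x * x

    def ssd(i: int, j: int) -> float:
        """Sum of squared deviations from the mean of xs[i:j]."""
        s = s1[j] - s1[i]
        return (s2[j] - s2[i]) - s * s / (j - i)

    inf = float("inf")
    # cost[c][j]: best total cost of splitting the first j values into c classes.
    cost = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):
                cand = cost[c - 1][i] + ssd(i, j)
                if cand < cost[c][j]:
                    cost[c][j] = cand
                    back[c][j] = i

    # Walk the back-pointers from the end to recover the classes.
    classes: List[List[float]] = []
    j = n
    for c in range(k, 0, -1):
        i = back[c][j]
        classes.append(xs[i:j])
        j = i
    classes.reverse()
    return classes


# Toy data (an assumption): two tight groups and one outlier.
data = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 50.0]
print(uniform_counts(data, 3))                 # [6, 0, 1]
print([len(c) for c in jenks_classes(data, 3)])  # [3, 3, 1]
```

With three equal-width bins, the two groups collapse into a single bin and one bin stays empty, whereas the adaptive split separates both groups and isolates the outlier. In a stripe-style display, each adaptive class would presumably map to one stripe whose width follows the class range and whose color encodes the count.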