ScaleViz: Scaling Visualization Recommendation Models on Large Data

Ghazi Shazan Ahmad, Shubham Agarwal, Subrata Mitra, Ryan Rossi, Manav Doshi, Vibhor Porwal, Syam Manoj Kumar Paila
DOI: https://doi.org/10.1007/978-981-97-2262-4_8
2024-11-27
Abstract: Automated visualization recommendations (vis-rec) help users derive crucial insights from new datasets. Typically, such automated vis-rec models first calculate a large number of statistics from the datasets and then use machine-learning models to score or classify multiple visualization choices, recommending the most effective ones according to those statistics. However, state-of-the-art models rely on a very large number of expensive statistics, so using such models on large datasets becomes infeasible due to prohibitively long computation time, which limits the effectiveness of such techniques on most complex, large real-world datasets. In this paper, we propose a novel reinforcement-learning (RL) based framework that takes a given vis-rec model and a time budget from the user and identifies the set of input statistics that would be most effective for generating visual insights within that time budget, using the given model. Using two state-of-the-art vis-rec models applied to three large real-world datasets, we show the effectiveness of our technique in significantly reducing time-to-visualize while introducing only a very small amount of error. Our approach is about 10X faster than baseline approaches that introduce similar amounts of error.
Artificial Intelligence, Human-Computer Interaction, Machine Learning
What problem does this paper attempt to address?
The main problem this paper addresses is the scalability of existing visualization recommendation (vis-rec) models on large datasets. To generalize to unseen datasets, these models compute a large number of statistical features. On large datasets that computation takes too long to be feasible, which limits how useful these techniques are in practice.

### Summary of Main Problems:
1. **High Computational Complexity**: Existing vis-rec models rely on a large number of statistical features (for example, Qian et al. used 1,006 high-order statistical features per column in [13]), which makes computing them on large datasets expensive and time-consuming.
2. **Not Applicable to Large-Scale Datasets**: As the dataset grows, the time needed to compute these statistical features rises sharply, so the models cannot work effectively in practical applications.
3. **Lack of Flexibility**: Traditional workarounds (such as random feature selection or data sampling) may degrade the quality of the generated visualization recommendations or fail to reflect the characteristics of the entire dataset.

### Solutions Proposed in the Paper:
To solve the above problems, the paper proposes a new framework named ScaleViz. ScaleViz optimizes the performance of vis-rec models on large datasets through the following steps:
1. **Cost Analysis**: Use a small number of data samples of different sizes to estimate the computation cost of each statistical feature.
2. **Budget-Aware Reinforcement Learning**: Use reinforcement learning (RL) to gradually learn and select the most effective statistical features within a given time budget.
3. **Feature Selection and Inference**: Based on the learned selection, compute only the selected statistical features and generate visualization recommendations.

With this method, ScaleViz significantly reduces computation time while preserving recommendation quality, running about 10 times faster than baseline approaches that introduce similar amounts of error.

### Formula Representation:
The formulas in the paper mainly describe the optimization problem and the reinforcement learning process. For example, the optimization problem can be formalized as:

\[ \min_{\theta} L[P(\theta(f) \odot f) - P(f)], \quad \text{subject to:} \sum_{i,j} \theta(f) \odot c(f) \leq B \]

where:
- \(f\) denotes the statistical features, \(\theta(f)\) is the selection mask applied to them, and \(P\) is the given vis-rec model.
- \(L\) is the loss function, which compares the model's output on the selected features with its output on the full feature set.
- \(c(f)\) is the cost function for computing the features.
- \(B\) is the time budget specified by the user.

In this way, ScaleViz can efficiently generate high-quality visualization recommendations on large datasets.
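
The cost-analysis step and the budget constraint above can be illustrated with a small sketch. This is not the paper's method: the RL agent is replaced here by a plain greedy heuristic, \(L\) is assumed to be a mean squared difference, per-feature cost is assumed to grow linearly with the number of rows, and the vis-rec model \(P\) is a toy linear function. All function names (`extrapolate_cost`, `masked_loss`, `greedy_select`) are hypothetical.

```python
import numpy as np

def extrapolate_cost(sample_sizes, seconds, full_size):
    """Fit seconds ~ a * n + b on small samples, then predict the cost at full_size.
    Assumption: the cost of a statistic grows linearly with the row count."""
    a, b = np.polyfit(sample_sizes, seconds, deg=1)
    return float(a * full_size + b)

def masked_loss(model, f, theta):
    """L[P(theta * f) - P(f)], with L assumed to be the mean squared difference."""
    return float(np.mean((model(theta * f) - model(f)) ** 2))

def greedy_select(model, f, costs, budget):
    """Greedy stand-in for the RL agent: repeatedly add the affordable feature
    that reduces the loss the most per unit of cost, until nothing else fits."""
    theta = np.zeros_like(f)
    spent = 0.0
    while True:
        base = masked_loss(model, f, theta)
        best, best_gain = None, 0.0
        for i in np.flatnonzero(theta == 0):
            if spent + costs[i] > budget:
                continue  # adding this feature would violate the budget B
            trial = theta.copy()
            trial[i] = 1.0
            gain = (base - masked_loss(model, f, trial)) / costs[i]
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:
            return theta
        theta[best] = 1.0
        spent += costs[best]

# Cost analysis: timings measured on three small samples (made-up numbers).
full_cost = extrapolate_cost([1000, 2000, 4000], [0.01, 0.02, 0.04], 1_000_000)

# Toy model P: a weighted sum of four features (illustrative only).
w = np.array([5.0, 3.0, 1.0, 0.5])
model = lambda x: np.array([w @ x])
f = np.ones(4)
costs = np.array([4.0, 1.0, 1.0, 1.0])  # estimated seconds per feature
budget = 3.0
theta = greedy_select(model, f, costs, budget)
```

In this toy setup the most informative feature costs more than the whole budget, so the mask keeps only the three cheaper features. A learned policy, as in the paper, would be expected to handle such trade-offs across many datasets rather than through a fixed per-step rule.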