Large Scale Behavioral Analytics via Topical Interaction

Shih-Chieh Su
DOI: https://doi.org/10.48550/arXiv.1608.07625
2016-08-27
Abstract: We propose the split-diffuse (SD) algorithm that takes the output of an existing dimension reduction algorithm, and distributes the data points uniformly across the visualization space. The result, called the topic grids, is a set of grids on various topics which are generated from the free-form text content of any domain of interest. The topic grids efficiently utilize the visualization space to provide visual summaries for massive data. Topical analysis, comparison and interaction can be performed on the topic grids in a more perceivable way.
Machine Learning
What problem does this paper attempt to address?
The problem this paper addresses is how to visualize and interact with large-scale behavioral data so that human experts can perceive, compare, and understand it more easily. Specifically, the paper proposes the split-diffuse (SD) algorithm, which aims to distribute high-dimensional data points evenly across a two-dimensional or three-dimensional visualization space, making individual points easier to distinguish and interact with.

### Specific Background of the Problem

1. **Visualization Challenges of High-Dimensional Data**:
   - When data has multiple measurement dimensions, each sample is represented in a high-dimensional space \( H \). Examples include data collected by network sensors, quantitative indicators in the stock market, and word-frequency vectors of documents.
   - High-dimensional data is difficult to visualize directly, so dimension-reduction techniques (such as PCA, MDS, and t-SNE) are used to map it to a low-dimensional space \( L \) (usually 2D or 3D). However, existing dimension-reduction methods may distribute data points unevenly in the visualization space, which obscures the relative relationships between points and hurts readability.
2. **Limitations of Existing Dimension-Reduction Methods**:
   - Data points may overlap in the reduced visualization space, making the information difficult to identify.
   - Data points are too dense in some areas, which makes interacting with the data harder.
   - When comparing behaviors across different time periods or different targets, geometric relationships may mask the actual differences.

### The Method Proposed in the Paper

To overcome the above problems, the paper proposes the split-diffuse (SD) algorithm. The main objectives of this algorithm are:

- **Evenly Distribute Data Points**: Distribute data points evenly in the visualization space through recursive splitting and diffusion.
- **Maintain Topological Relationships between Points**: Try to maintain the relative positional relationships between data points during the dimension-reduction process.
- **Improve Interactivity**: With evenly distributed data points, users can interact with and compare the data more conveniently.

### Application Scenarios

The paper demonstrates the SD algorithm in the following application scenarios:

- **Network Security Field**: Analyze network activity logs to detect abnormal behaviors and provide visual risk assessments.
- **Other Fields**: E-commerce log analysis of customers' shopping behaviors and preferences, credit card transaction analysis, customer complaint analysis, and similar tasks.

### Summary

The core problem of the paper is how to improve the interpretability and interactivity of large-scale behavioral data analysis by improving the distribution of data points in the visualization space. The SD algorithm offers a solution to this problem, enabling human experts to understand and compare complex behavioral data more intuitively.
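
### Illustrative Sketch

The idea of recursive splitting and diffusion described above can be illustrated with a small sketch. This is only one possible reading of the summary, not the paper's actual SD procedure: it assumes the input is a list of 2D points already produced by some dimension-reduction step, and it recursively halves both the point set (by sorted coordinate) and a grid of cells until each point owns a distinct cell. The names `spread_on_grid`, `_split`, and `_share` are hypothetical and chosen for illustration.

```python
def _share(n, cap_a, cap_b, part, whole):
    """How many of n points go to the first sub-region (assumed rule)."""
    k = (n * part) // whole      # proportional to the sub-region's size
    k = max(k, n - cap_b)        # second sub-region must not overflow
    return min(k, cap_a)         # first sub-region must not overflow


def _split(items, col0, row0, cols, rows, cells):
    """Recursively split points and grid region together."""
    if not items:
        return
    if cols == 1 and rows == 1:
        # A single cell receives at most one point by construction.
        cells[items[0][0]] = (col0, row0)
        return
    if cols >= rows:
        # Split along x: points with smaller x go to the left columns.
        items.sort(key=lambda it: it[1][0])
        a = cols // 2
        k = _share(len(items), a * rows, (cols - a) * rows, a, cols)
        _split(items[:k], col0, row0, a, rows, cells)
        _split(items[k:], col0 + a, row0, cols - a, rows, cells)
    else:
        # Split along y: points with smaller y go to the lower rows.
        items.sort(key=lambda it: it[1][1])
        a = rows // 2
        k = _share(len(items), a * cols, (rows - a) * cols, a, rows)
        _split(items[:k], col0, row0, cols, a, cells)
        _split(items[k:], col0, row0 + a, cols, rows - a, cells)


def spread_on_grid(points, cols, rows):
    """Map each 2D point index to a distinct (col, row) grid cell."""
    if len(points) > cols * rows:
        raise ValueError("grid has fewer cells than points")
    cells = {}
    _split(list(enumerate(points)), 0, 0, cols, rows, cells)
    return cells


if __name__ == "__main__":
    # Sixteen tightly clustered points end up in sixteen separate cells.
    clustered = [(0.01 * i, 0.01 * (i % 3)) for i in range(16)]
    for idx, cell in sorted(spread_on_grid(clustered, 4, 4).items()):
        print(idx, cell)
```

Because each split sorts the points along one axis before dividing them, points that were to the left of (or below) others tend to stay that way on the grid, which is the sense in which this sketch tries to keep relative positions while removing overlap.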