Abstract: Handling sparse and unstructured geometric data, such as point clouds or event-based vision, is a pressing challenge in the field of machine vision. Recently, sequence models such as Transformers and state-space models entered the domain of geometric data. These methods require specialized preprocessing to create a sequential view of a set of points. Furthermore, prior works involving sequence models iterate geometric data with either uniform or learned step sizes, implicitly relying on the model to infer the underlying geometric structure. In this work, we propose to encode geometric structure explicitly into the parameterization of a state-space model. State-space models are based on linear dynamics governed by a one-dimensional variable such as time or a spatial coordinate. We exploit this dynamic variable to inject relative differences of coordinates into the step size of the state-space model. The resulting geometric operation computes interactions between all pairs of N points in O(N) steps. Our model deploys the Mamba selective state-space model with a modified CUDA kernel to efficiently map sparse geometric data to modern hardware. The resulting sequence model, which we call STREAM, achieves competitive results on a range of benchmarks from point-cloud classification to event-based vision and audio classification. STREAM demonstrates a powerful inductive bias for sparse geometric data by improving the PointMamba baseline when trained from scratch on the ModelNet40 and ScanObjectNN point cloud analysis datasets. It further achieves, for the first time, 100% test accuracy on all 11 classes of the DVS128 Gestures dataset.
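
One way to read the claim about pairwise interactions in O(N) steps is the following worked recurrence. It is a sketch under simplifying assumptions that are not taken from the abstract: a diagonal (here scalar-wise) state matrix $A$, inputs applied as impulses without step-size scaling, points sorted along a single coordinate $t$, and $h_0 = 0$. The symbols $A$, $B$, $C$, $h_i$, $x_i$, $y_i$ follow the usual state-space convention and are introduced here for illustration only.

$$
h_i = e^{A\,\Delta_i}\, h_{i-1} + B\,x_i, \qquad \Delta_i = t_i - t_{i-1}, \qquad y_i = C\,h_i
$$

Because the step sizes telescope, $\sum_{k=j+1}^{i} \Delta_k = t_i - t_j$, unrolling the recurrence gives

$$
y_i = \sum_{j \le i} C\, e^{A\,(t_i - t_j)}\, B\, x_j ,
$$

so each output depends on every earlier point through the coordinate difference $t_i - t_j$, while the recurrence itself takes only N sequential steps.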

What problem does this paper attempt to address?

The paper addresses the challenge of handling sparse and unstructured geometric data (such as point clouds or event-based vision data) in machine vision. Specifically, the authors point out:

1. **Processing of Sparse Geometric Data**: Traditional convolutional neural networks (CNNs) are suited to structured, uniformly distributed data (such as images and videos), but they perform poorly on sparse, irregularly distributed geometric data (such as LiDAR point clouds or event-based camera data).
2. **Application of Sequence Models**: In recent years, Transformers and state-space models (SSMs) have been introduced into geometric data processing. However, these methods usually require special preprocessing steps to convert geometric data into a sequence, and they rely on the model to infer the underlying geometric structure on its own.
3. **Limitations of Existing Methods**: Existing sequence-based models use either uniform or learned time steps, which may leave the model unable to capture the complex relationships in geometric data effectively.

To solve these problems, the authors propose the STREAM model. Its main innovations are:

- **Unified Framework**: A unified state-space model (SSM) framework for modeling sparse geometric data that can handle irregularly spaced time steps.
- **Explicit Encoding of Geometric Structure**: The relative differences between coordinates are injected into the parameterization of the state-space model, so the geometric structure is encoded explicitly.
- **Efficient Implementation**: A modified CUDA kernel maps the model efficiently to modern hardware.

### Specific Problem Summary

- **Processing of Sparse Geometric Data**: How to effectively process sparse, irregularly distributed geometric data, such as point clouds and event-based vision data.
- **Explicit Encoding of Geometric Information**: How to encode the geometric structure explicitly in the model instead of relying on the model to infer it on its own.
- **Efficient Computation**: How to design an efficient state-space model that reduces computational complexity while maintaining accuracy.

Through these innovations, STREAM achieves competitive results on multiple benchmarks, from point-cloud classification to event-based vision and audio classification. For example, on the DVS128 Gestures dataset, STREAM achieves 100% test accuracy on all 11 classes for the first time.
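
To make the idea of injecting coordinate differences into the step size concrete, here is a minimal NumPy sketch. It is not the authors' implementation and does not use Mamba or the modified CUDA kernel: it assumes a diagonal state matrix `a`, fixed (non-selective) vectors `b` and `c`, a scalar input per point, and points sorted along one coordinate `t`. All function and variable names are hypothetical. The sketch checks that a single O(N) pass whose step size is `t[i] - t[i-1]` reproduces an explicit sum over all causal pairs weighted by `exp(a * (t[i] - t[j]))`.

```python
import numpy as np

def scan_with_coordinate_steps(t, x, a, b, c):
    """O(N) recurrence with geometric step sizes (illustrative only).

    h_i = exp(a * (t_i - t_{i-1})) * h_{i-1} + b * x_i,   y_i = c . h_i
    """
    h = np.zeros_like(a)          # hidden state, one entry per diagonal mode
    y = np.empty(len(t))
    prev_t = t[0]                 # first step size is zero; h starts at zero anyway
    for i in range(len(t)):
        delta = t[i] - prev_t     # relative coordinate difference used as step size
        h = np.exp(a * delta) * h + b * x[i]
        y[i] = c @ h
        prev_t = t[i]
    return y

def pairwise_reference(t, x, a, b, c):
    """O(N^2) reference: explicit sum over all pairs j <= i."""
    n = len(t)
    y = np.zeros(n)
    for i in range(n):
        for j in range(i + 1):
            kernel = np.sum(c * np.exp(a * (t[i] - t[j])) * b)
            y[i] += kernel * x[j]
    return y

# Irregularly spaced coordinates, e.g. event timestamps (synthetic data).
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0.0, 1.0, size=64))
x = rng.normal(size=64)
a = -rng.uniform(0.5, 2.0, size=8)   # negative values give decaying dynamics
b = rng.normal(size=8)
c = rng.normal(size=8)

y_scan = scan_with_coordinate_steps(t, x, a, b, c)
y_pair = pairwise_reference(t, x, a, b, c)
max_abs_diff = np.max(np.abs(y_scan - y_pair))
```

The two outputs agree up to floating-point error because the per-step decays multiply to `exp(a * (t[i] - t[j]))`. A selective model such as Mamba would additionally make `b`, `c`, and the step size depend on the input, which this sketch leaves out.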

STREAM: A Universal State-Space Model for Sparse Geometric Data
