Abstract: Often pieces of information are received sequentially over time. When did one collect enough such pieces to classify? Trading wait time for decision certainty leads to early classification problems that have recently gained attention as a means of adapting classification to more dynamic environments. However, so far results have been limited to unimodal sequences. In this pilot study, we expand into early classifying multimodal sequences by combining existing methods. We show our new method yields experimental AUC advantages of up to 8.7%.

What problem does this paper attempt to address?

The problem this paper addresses is the **early classification of multimodal sequences**. Specifically, the authors focus on how to classify a multimodal sequence as early as possible, once sufficient information has been received, so as to balance classification accuracy against timeliness.

### Problem Background

In many practical applications, information arrives gradually over time. For example:

- A user browsing movies on an online platform decides quickly whether to watch one, based on a few seconds of video and audio clips.
- Doctors diagnose a patient's condition as soon as possible from images, laboratory results, and other data obtained at different times, so that treatment can start early.

The common question in these scenarios is: **When has enough information been collected to stop and classify?** Traditional early-classification methods focus mainly on unimodal sequences, while this paper extends the problem to multimodal sequences.

### Core Problems

The core problems mentioned in the paper are:

- After receiving new information at each time step, the model must decide whether enough information has been collected to classify, or whether to keep waiting for more information to improve prediction accuracy.
- Balancing these two goals (classifying as early as possible and classifying as accurately as possible) is the core challenge of this research.

### Solutions

To address this problem, the authors combine the following two techniques:

1. **OmniNet-like Transformer**: This architecture explicitly models spatio-temporal interactions and is suited to the early classification of multimodal sequences.
2. **Classifier-Induced Stopping (CIS)**: This is an efficient method that learns a policy which decides, at each time step, whether to stop and classify or to continue waiting, and that uses the classifier's own outputs to find the best stopping time.
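
The stop-or-wait decision described above can be illustrated with a minimal sketch. It is not the paper's CIS method: a fixed confidence threshold stands in for the learned stopping policy, and the names `early_classify`, `time_penalized_cost`, `stop_threshold`, and `mu` are hypothetical, chosen only for illustration. The per-step class probabilities are assumed to come from some classifier that has seen the sequence up to that step.

```python
from typing import Sequence, Tuple


def early_classify(
    step_probs: Sequence[Sequence[float]],
    stop_threshold: float = 0.9,
) -> Tuple[int, int]:
    """Return (predicted_class, stop_time) for one sequence.

    step_probs[t - 1] holds the class probabilities after t pieces of
    information have been observed. A fixed threshold is an assumption
    here; CIS learns its stopping policy instead.
    """
    if not step_probs:
        raise ValueError("step_probs must contain at least one time step")
    for t, probs in enumerate(step_probs, start=1):
        confidence = max(probs)
        # Stop as soon as the classifier is confident enough.
        if confidence >= stop_threshold:
            return list(probs).index(confidence), t
    # No step was confident enough: classify at the end of the sequence.
    final = list(step_probs[-1])
    return final.index(max(final)), len(step_probs)


def time_penalized_cost(
    predicted: int, label: int, stop_time: int, horizon: int, mu: float = 0.1
) -> float:
    """Illustrative trade-off: 0/1 error plus a penalty for waiting."""
    error = 0.0 if predicted == label else 1.0
    return error + mu * stop_time / horizon


if __name__ == "__main__":
    probs = [[0.5, 0.5], [0.6, 0.4], [0.05, 0.95]]
    pred, t = early_classify(probs, stop_threshold=0.9)
    print(pred, t, time_penalized_cost(pred, label=1, stop_time=t, horizon=3))
```

Lowering `stop_threshold` makes the sketch stop earlier at the risk of a wrong label, while raising `mu` makes waiting more expensive; this is the accuracy-versus-timeliness trade-off in its simplest form.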
### Main Contributions

- This is the first time that early classification has been applied to multimodal sequences composed of different modalities (such as images, texts, and structured classification data).
- Experiments show that the method combining the spatio-temporal Transformer with CIS yields AUC advantages of up to 8.7%.

### Summary

This research emphasizes how common early classification of multimodal sequences is in the real world, and it shows the effectiveness of combining an OmniNet-like spatio-temporal Transformer with CIS. This provides a new direction for future research, especially for classification tasks in dynamic environments.

Early Classifying Multimodal Sequences

A Policy for Early Sequence Classification

Modality-invariant Temporal Representation Learning for Multimodal Sentiment Classification

Integrating Data-Driven Segmentation, Local Feature Extraction and Fisher Kernel Encoding to Improve Time Series Classification

Look and Listen: A Multi-modality Late Fusion Approach to Scene Classification for Autonomous Machines

A Sequential Model for Multi-Class Classification

Advancing Time Series Classification with Multimodal Language Modeling

Multivariate Time Series Early Classification Across Channel and Time Dimensions

Finding Discriminative Subsequences Via a Coverage Measure and Mutual Information Selection Strategy for Multi-Class Time Series Classification

HCMS: Hierarchical and Conditional Modality Selection for Efficient Video Recognition

Unaligned Multimodal Sequences for Depression Assessment From Speech

Longitudinal Ensemble Integration for sequential classification with multimodal data

Representation Learning of Tangled Key-Value Sequence Data for Early Classification

Multimodal Difference Learning for Sequential Recommendation

Multimodal Classification for Analysing Social Media

Multimodal Graph for Unaligned Multimodal Sequence Analysis via Graph Convolution and Graph Pooling

Learn to Combine Modalities in Multimodal Deep Learning

Improving Fine-grained Image Classification with Multimodal Information

Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models

Learning Unseen Modality Interaction

Many Could Be Better Than All: A Novel Instance-Oriented Algorithm for Multi-modal Multi-label Problem.