Early Classifying Multimodal Sequences

Alexander Cao, Jean Utke, Diego Klabjan
2023-05-02
Abstract: Often pieces of information are received sequentially over time. When did one collect enough such pieces to classify? Trading wait time for decision certainty leads to early classification problems that have recently gained attention as a means of adapting classification to more dynamic environments. However, so far results have been limited to unimodal sequences. In this pilot study, we expand into early classifying multimodal sequences by combining existing methods. We show our new method yields experimental AUC advantages of up to 8.7%.
Machine Learning
What problem does this paper attempt to address?
The problem this paper addresses is the **early classification of multimodal sequences**: deciding, as the pieces of a multimodal sequence arrive over time, when enough information has been received to classify, so that accuracy is balanced against timeliness.

### Problem Background

In many practical applications, information arrives gradually over time. For example:

- A viewer browsing movies on an online platform decides whether to watch one based on a few seconds of video and audio.
- A doctor diagnoses a patient as early as possible from images, laboratory results, and other data obtained at different times, so that treatment can start sooner.

The question common to these scenarios is: **When has enough information been collected to stop and classify?** Existing early classification methods focus on unimodal sequences; this paper extends them to multimodal sequences.

### Core Problems

The core problems identified in the paper are:

- At each time step, after new information arrives, the model must decide whether to stop and classify now or to wait for more information that could improve prediction accuracy.
- Balancing the two goals, classifying as early as possible and classifying as accurately as possible, is the core challenge of this research.

### Solutions

To address this problem, the authors combine two existing techniques:

1. **OmniNet-like Transformer**: This architecture explicitly models spatio-temporal interactions, which makes it suitable for early classification of multimodal sequences.
2. **Classifier-Induced Stopping (CIS)**: This is an efficient method that learns a policy deciding, at each time step, whether to stop and classify or to keep waiting; it uses the classifier's own results to find the optimal stopping time.
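The stop-or-wait decision described above can be illustrated with a minimal Python sketch. It is not the authors' CIS implementation or the OmniNet-like Transformer: `classify_early`, `toy_model`, and `stop_threshold` are hypothetical names introduced here, and the toy model simply treats the classifier's own confidence as the stopping signal, which is a simplifying assumption.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: given the prefix observed so far, return
# (class probabilities, probability that we should stop now).
StepModel = Callable[[Sequence[float]], Tuple[List[float], float]]


def classify_early(
    sequence: Sequence[float],
    model: StepModel,
    stop_threshold: float = 0.5,
) -> Tuple[int, int]:
    """Return (predicted_class, stop_time) for one sequence.

    stop_time is 1-indexed: the number of items consumed before classifying.
    """
    if len(sequence) == 0:
        raise ValueError("sequence must contain at least one item")

    probs: List[float] = []
    stop_time = 0
    for t in range(1, len(sequence) + 1):
        probs, stop_prob = model(sequence[:t])
        stop_time = t
        # Stop once the policy says so; the loop also ends when input runs out.
        if stop_prob >= stop_threshold:
            break

    predicted = max(range(len(probs)), key=probs.__getitem__)
    return predicted, stop_time


def toy_model(prefix: Sequence[float]) -> Tuple[List[float], float]:
    """Toy stand-in for a trained network (assumption, for illustration only).

    Evidence for class 1 is the running mean of the prefix, and the stop
    probability is the classifier's own confidence in its top class.
    """
    mean = sum(prefix) / len(prefix)
    p1 = min(max(0.5 + mean / 2, 0.0), 1.0)
    probs = [1.0 - p1, p1]
    return probs, max(probs)


if __name__ == "__main__":
    print(classify_early([0.0, 0.5, 1.0, 1.0], toy_model, stop_threshold=0.7))
```

Raising `stop_threshold` makes the sketch wait longer before committing, which is the wait-time versus certainty trade-off the summary describes; in a learned approach such as CIS the stopping policy would be trained rather than fixed by hand.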
### Main Contributions

- This is the first application of early classification to multimodal sequences composed of different modalities (such as images, text, and structured classification data).
- Experiments show that combining the spatio-temporal Transformer with CIS yields AUC advantages of up to 8.7%.

### Summary

This research highlights how common early classification of multimodal sequences is in real-world settings and shows the effectiveness of combining an OmniNet-like spatio-temporal Transformer with CIS. It points to a direction for future research on classification tasks in dynamic environments.