Wang-Wang Yu, Xian-Shi Zhang, Fu-Ya Luo, Yijun Cao, Kai-Fu Yang, Hong-Mei Yan, Yong-Jie Li
Abstract: Frame-level micro- and macro-expression spotting methods require time-consuming frame-by-frame observation during annotation. Meanwhile, video-level spotting lacks sufficient information about the location and number of expressions during training, resulting in significantly inferior performance compared with fully-supervised spotting. To bridge this gap, we propose a point-level weakly-supervised expression spotting (PWES) framework, where each expression needs to be annotated with only one random frame (i.e., a point). To mitigate the issue of sparse label distribution, the prevailing solution is pseudo-label mining, which, however, introduces new problems: localizing contextual background snippets results in inaccurate boundaries, and discarding foreground snippets leads to fragmentary predictions. Therefore, we design two strategies, multi-refined pseudo label generation (MPLG) and distribution-guided feature contrastive learning (DFCL), to address these problems. Specifically, MPLG generates more reliable pseudo labels by merging class-specific probabilities, attention scores, fused features, and point-level labels. DFCL enhances feature similarity within the same category and feature variability across different categories while capturing global representations across the entire dataset. Extensive experiments on the CAS(ME)^2, CAS(ME)^3, and SAMM-LV datasets demonstrate that PWES achieves promising performance comparable to that of recent fully-supervised methods.
What problem does this paper attempt to address?
This paper addresses two limitations in the spotting of micro-expressions (MEs) and macro-expressions (MaEs). Frame-level methods require time-consuming frame-by-frame annotation, while video-level weakly-supervised methods lack information about where expressions occur and how many there are, so they perform significantly worse than fully-supervised methods. To solve this problem, the authors propose a point-level weakly-supervised expression spotting framework (PWES), which requires only one randomly chosen frame (i.e., one point) to be annotated per expression, reducing annotation cost while improving on video-level performance.
### Problem Background
1. **Frame-level Annotation**
- Frame-level micro-expression and macro-expression spotting methods require frame-by-frame observation during annotation.
- This supervision is accurate, but it is very time-consuming and costly to produce.
2. **Video-level Annotation**
- Video-level weakly-supervised methods need only video-level labels, with no timestamps and no count of the expressions.
- Because these labels carry no information about the position or number of expressions, performance is significantly lower than with full supervision.
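To make the difference between these supervision levels concrete, the following minimal sketch derives frame-level, video-level, and point-level labels from the same interval annotations. The function names, the `(onset, offset, class)` tuple layout, and the numeric class ids are assumptions made for illustration; they are not taken from the paper or from any dataset toolkit.

```python
import random

def frame_level_labels(num_frames, intervals):
    """Full supervision: one class label per frame (0 = background)."""
    labels = [0] * num_frames
    for onset, offset, cls in intervals:
        for t in range(onset, offset + 1):
            labels[t] = cls
    return labels

def video_level_labels(intervals):
    """Video-level weak supervision: only which classes occur in the video."""
    return sorted({cls for _, _, cls in intervals})

def point_level_labels(intervals, rng):
    """Point-level weak supervision: one random frame per expression."""
    return [(rng.randint(onset, offset), cls) for onset, offset, cls in intervals]

# Hypothetical 100-frame video: class 1 = macro-expression, class 2 = micro-expression.
intervals = [(10, 40, 1), (80, 88, 2)]
rng = random.Random(0)

frames = frame_level_labels(100, intervals)   # 100 labels, one per frame
video = video_level_labels(intervals)         # classes only, no positions or counts
points = point_level_labels(intervals, rng)   # one (frame, class) pair per expression
```

The point-level labels keep the number of expressions and a rough location for each, which is exactly the information the video-level labels discard, while still requiring only one annotated frame per expression.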
### Proposed Solution
To bridge the gap between frame-level fully-supervised methods and video-level weakly-supervised methods, the authors propose a point-level weakly-supervised expression spotting framework (PWES). Its core components are:
1. **Point-level Weakly-supervised Expression Spotting Framework (PWES)**
- Annotating only one random frame per expression reduces the annotation cost.
- The framework aims to achieve performance comparable to that of fully-supervised methods.
2. **Multi-refined Pseudo-label Generation (MPLG)**
- Combines class-specific probabilities, attention scores, fused features, and point-level labels to generate more reliable pseudo-labels.
- Targets two problems common in pseudo-label mining: contextual background snippets localized as expressions, which yields inaccurate boundaries, and foreground snippets discarded, which yields fragmentary predictions.
3. **Distribution-guided Feature Contrastive Learning (DFCL)**
- Uses a memory bank to store enhanced features and captures global representations through contrastive learning.
- Guides the model toward foreground-background separation, inter-class isolation, and intra-class aggregation.
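The idea of combining several cues around an annotated point can be illustrated with a small sketch. This is not the paper's MPLG algorithm, whose details are not given here; it is a simplified stand-in that grows a pseudo-label segment outward from the point while two assumed conditions hold: the product of class probability and attention stays above a threshold, and the snippet feature stays similar to the feature at the point. The function name `grow_pseudo_label`, both thresholds, and the toy inputs are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def grow_pseudo_label(point, cls_prob, attention, features,
                      score_thr=0.5, sim_thr=0.8):
    """Expand a segment around an annotated point (illustrative rule only)."""
    def accept(t):
        score = cls_prob[t] * attention[t]          # fuse probability and attention
        similar = cosine(features[t], features[point]) >= sim_thr
        return score >= score_thr and similar

    start = end = point                             # the point itself is always kept
    while start - 1 >= 0 and accept(start - 1):
        start -= 1
    while end + 1 < len(cls_prob) and accept(end + 1):
        end += 1
    return start, end

# Toy snippet-level inputs for one class; the annotated point is snippet 2.
cls_prob = [0.1, 0.9, 0.9, 0.8, 0.2, 0.1]
attention = [0.2, 0.9, 1.0, 0.9, 0.3, 0.1]
features = [[0.0, 1.0], [1.0, 0.1], [1.0, 0.0], [1.0, 0.2], [0.1, 1.0], [0.0, 1.0]]

segment = grow_pseudo_label(2, cls_prob, attention, features)
print(segment)  # (1, 3)
```

Requiring agreement between several cues is one way to keep low-confidence context snippets out of the segment while still covering neighbouring foreground snippets, which mirrors the two failure modes listed above.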
### Main Contributions
- **Propose the point-level weakly-supervised expression spotting framework (PWES) for the first time**, achieving frame-level micro-expression and macro-expression spotting in untrimmed facial videos using only point-level labels.
- **Propose the MPLG algorithm**, which generates more reliable pseudo-labels by combining class-specific probabilities, attention scores, current video features, and point-level labels.
- **Introduce the DFCL algorithm**, which uses a memory bank and a distribution-guided feature sampling module to perform feature contrastive learning and strengthen the model's representation learning.
- **Experimental results show that** the performance of PWES on the CAS(ME)^2, CAS(ME)^3, and SAMM-LV datasets is comparable to that of recent fully-supervised methods.
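The memory-bank contrastive learning mentioned for DFCL can be sketched in a generic form. The code below is an InfoNCE-style loss over a per-class feature bank, written as an assumption-laden illustration: the `MemoryBank` class, the `contrastive_loss` function, the temperature value, and the class names are all hypothetical, and the paper's distribution-guided sampling module is not modelled.

```python
import math
from collections import deque

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class MemoryBank:
    """Fixed-capacity store of features per class, shared across videos."""
    def __init__(self, classes, capacity=64):
        self.store = {c: deque(maxlen=capacity) for c in classes}

    def add(self, cls, feature):
        self.store[cls].append(list(feature))

def contrastive_loss(anchor, cls, bank, temperature=0.1):
    """Pull the anchor toward same-class bank features, push it from the rest."""
    positive = 0.0
    total = 0.0
    for other_cls, feats in bank.store.items():
        for feat in feats:
            weight = math.exp(cosine(anchor, feat) / temperature)
            total += weight
            if other_cls == cls:
                positive += weight
    if positive == 0.0:
        raise ValueError("memory bank holds no features for this class")
    return -math.log(positive / total)

bank = MemoryBank(["background", "macro", "micro"])
bank.add("background", [0.0, 1.0])
bank.add("macro", [1.0, 0.0])
bank.add("micro", [-1.0, 0.0])

anchor = [1.0, 0.1]                                   # lies close to the macro feature
loss_same = contrastive_loss(anchor, "macro", bank)   # small: anchor matches its class
loss_diff = contrastive_loss(anchor, "micro", bank)   # large: anchor is far from class
```

Because the bank holds features from many videos, each anchor is compared against a dataset-wide sample rather than only the current video, which is the sense in which such a loss can capture global representations.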
Together, these components allow PWES to substantially reduce annotation cost while achieving performance comparable to that of recent fully-supervised methods.