Abstract: Frame-level micro- and macro-expression spotting methods require time-consuming frame-by-frame observation during annotation. Meanwhile, video-level spotting lacks sufficient information about the location and number of expressions during training, resulting in significantly inferior performance compared with fully-supervised spotting. To bridge this gap, we propose a point-level weakly-supervised expression spotting (PWES) framework, where each expression needs to be annotated with only one random frame (i.e., a point). To mitigate the issue of sparse label distribution, the prevailing solution is pseudo-label mining, which, however, introduces new problems: localizing contextual background snippets results in inaccurate boundaries, and discarding foreground snippets leads to fragmentary predictions. Therefore, we design the strategies of multi-refined pseudo label generation (MPLG) and distribution-guided feature contrastive learning (DFCL) to address these problems. Specifically, MPLG generates more reliable pseudo labels by merging class-specific probabilities, attention scores, fused features, and point-level labels. DFCL is utilized to enhance feature similarity within the same category and feature variability across different categories while capturing global representations across the entire dataset. Extensive experiments on the CAS(ME)^2, CAS(ME)^3, and SAMM-LV datasets demonstrate that PWES achieves promising performance comparable to that of recent fully-supervised methods.
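To make the point-level annotation setting concrete, the following is a minimal sketch of how one random frame per expression could be drawn from full interval annotations, for example to simulate point labels on a fully annotated dataset. The function name `sample_point_labels`, the `(onset, offset, label)` tuple layout, and the example intervals are assumptions for illustration and are not taken from the paper.

```python
import random


def sample_point_labels(intervals, seed=None):
    """Pick one random frame inside each annotated expression interval.

    `intervals` is a hypothetical list of (onset, offset, label) tuples with
    inclusive frame indices. Returns a list of (frame, label) point labels.
    """
    rng = random.Random(seed)
    points = []
    for onset, offset, label in intervals:
        if onset > offset:
            raise ValueError(f"onset {onset} is after offset {offset}")
        frame = rng.randint(onset, offset)  # inclusive on both ends
        points.append((frame, label))
    return points


# Hypothetical interval annotations for one untrimmed video.
intervals = [(120, 134, "micro"), (300, 345, "macro"), (800, 810, "micro")]
points = sample_point_labels(intervals, seed=0)
```

Each expression contributes exactly one labeled frame, so the model sees neither the onset nor the offset of any interval during training.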

What problem does this paper attempt to address?

The paper addresses the trade-off between annotation cost and accuracy in spotting micro-expressions (MEs) and macro-expressions (MaEs). Frame-level methods require time-consuming frame-by-frame annotation, while video-level weakly-supervised methods lack information about the location and number of expressions and therefore perform significantly worse than fully-supervised methods. To close this gap, the authors propose a point-level weakly-supervised expression spotting framework (PWES), in which each expression is annotated with only one random frame (i.e., one point). This reduces annotation cost while aiming for performance close to that of fully-supervised methods.

### Problem Background

1. **Frame-level Annotation**
   - Frame-level micro- and macro-expression spotting methods require frame-by-frame annotation.
   - The resulting labels are accurate, but producing them is time-consuming and costly.
2. **Video-level Annotation**
   - Video-level weakly-supervised methods need only video-level labels, with no timestamps and no count of expressions.
   - Without information about the location and number of expressions, these methods perform poorly.

### Proposed Solution

To bridge the gap between frame-level fully-supervised methods and video-level weakly-supervised methods, the authors propose PWES. Its core components are:

1. **Point-level Weakly-supervised Expression Spotting Framework (PWES)**
   - Annotating only one random frame per expression reduces the annotation cost.
   - The framework aims for performance comparable to that of fully-supervised methods.
2. **Multi-refined Pseudo Label Generation (MPLG)**
   - Merges class-specific probabilities, attention scores, fused features, and point-level labels to generate more reliable pseudo labels.
   - Targets two problems common in pseudo-label mining: localizing contextual background snippets, which yields inaccurate boundaries, and discarding foreground snippets, which yields fragmentary predictions.
3. **Distribution-guided Feature Contrastive Learning (DFCL)**
   - Uses a memory bank to store enhanced features and captures global representations through contrastive learning.
   - Guides the model toward foreground-background separation, inter-class isolation, and intra-class aggregation.

### Main Contributions

- **Proposes, for the first time, a point-level weakly-supervised expression spotting framework (PWES)**, which achieves frame-level micro- and macro-expression spotting in untrimmed facial videos using point-level labels.
- **Proposes the MPLG algorithm**, which generates more reliable pseudo labels by combining class-specific probabilities, attention scores, current video features, and point-level labels.
- **Introduces the DFCL algorithm**, which uses a memory bank and a distribution-guided feature sampling module to perform feature contrastive learning and strengthen the model's representation learning.
- **Reports experimental results** on the CAS(ME)^2, CAS(ME)^3, and SAMM-LV datasets showing that PWES performs comparably to recent fully-supervised methods.

Through these innovations, the PWES framework reduces annotation cost while achieving performance comparable to that of recent fully-supervised methods.
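The contrastive objective described for DFCL (pulling same-category features together and pushing different categories apart, using features stored in a memory bank) can be illustrated with a generic supervised InfoNCE-style loss. This is a minimal sketch of that general technique, not the authors' DFCL: the distribution-guided sampling and the memory bank update rule are not specified here, and the function name `contrastive_loss`, the temperature value, and the toy features are assumptions.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def contrastive_loss(anchor, anchor_label, bank, temperature=0.1):
    """Supervised InfoNCE-style loss for one anchor against a memory bank.

    `bank` is a hypothetical list of (feature, label) pairs. Entries sharing
    the anchor's label are positives; all other entries act as negatives.
    """
    a = _normalize(anchor)
    logits = []
    for feat, label in bank:
        f = _normalize(feat)
        sim = sum(x * y for x, y in zip(a, f))
        logits.append((sim / temperature, label))

    positives = [z for z, label in logits if label == anchor_label]
    if not positives:
        raise ValueError("memory bank holds no feature with the anchor's label")

    # Log-sum-exp over every bank entry, shifted by the max for stability.
    z_max = max(z for z, _ in logits)
    log_denominator = z_max + math.log(
        sum(math.exp(z - z_max) for z, _ in logits)
    )
    # Average negative log-probability assigned to the positives.
    return -sum(z - log_denominator for z in positives) / len(positives)


# Toy memory bank with three categories (values are illustrative only).
bank = [
    ([0.9, 0.2], "micro"),
    ([1.0, 0.0], "micro"),
    ([0.0, 1.0], "macro"),
    ([-1.0, 0.1], "background"),
]
anchor = [1.0, 0.1]
loss_matching = contrastive_loss(anchor, "micro", bank)
loss_mismatched = contrastive_loss(anchor, "macro", bank)
```

In this toy setup the anchor lies close to the stored "micro" features, so labeling it "micro" gives a much smaller loss than labeling it "macro". Minimizing such a loss encourages intra-class aggregation and inter-class isolation.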
