Realtime Interpersonal Human Synchrony Detection Based on Action Segmentation

Bowen Chen, Jiamin Zhang, Zuode Liu, Ruihan Lin, Weihong Ren, Luodi Yu, Honghai Liu
DOI: https://doi.org/10.1007/978-3-031-13844-7_32
2022-01-01
Abstract: Interpersonal synchrony (IS), in which a follower (the participant) tries to perform the same action in step with a raiser (a human or a metronome), is an essential social interaction skill. Evaluating interpersonal synchrony is valuable for early autism screening. However, research on IS evaluation is limited, and current approaches usually score the IS task with a "motion energy" computed from imprecise corner detection on the participant, which is not robust in an uncontrolled clinical environment. Moreover, these approaches require the start and end anchors of each action segment to be marked manually, which is labor-intensive. In this paper, we construct a realtime action segmentation model that automatically recognizes each person's action class frame by frame. A simple yet efficient backbone classifies the action directly instead of extracting motion features (e.g. optical flow) with high computational complexity. Specifically, given an action video, a sliding window stacks a fixed number of frames and feeds them to a ResNet-like action classification branch (ACB), which predicts the current action label. To further improve the accuracy of action boundaries and eliminate over-segmentation noise, we incorporate a boundary prediction branch (BPB) that, together with a majority-voting strategy, refines the action classification produced by the ACB. The IS overlap can then be calculated easily by comparing the action timelines of the raiser and the follower. To evaluate the proposed model, we collect 200K annotated images from 40 subjects who perform 2 tasks (nod and clap) in 2 conditions (interpersonal and human-metronome). The experimental results demonstrate that our model achieves 87.1% accuracy at 200 FPS and can locate the start and end of an action precisely in realtime.
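The post-processing steps named in the abstract (sliding window, boundary-guided majority voting, and timeline comparison) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the rule of assigning each segment between predicted boundaries its most frequent label, and the intersection-over-union overlap score are all assumptions made for illustration, since the abstract does not specify them. The ACB and BPB networks themselves are not modeled; their per-frame outputs are taken as given inputs.

```python
from collections import Counter


def sliding_windows(frames, window_size):
    """Yield (current_frame_index, window) for every full window of frames.

    Assumption: the window ends at the current frame, so the label
    predicted for a window is attributed to its last frame.
    """
    for end in range(window_size, len(frames) + 1):
        yield end - 1, frames[end - window_size:end]


def refine_with_boundaries(frame_labels, boundaries):
    """Assign every segment between predicted boundaries its majority label.

    frame_labels: per-frame labels from a classifier (stand-in for ACB output).
    boundaries:   frame indices where a new segment starts (stand-in for BPB output).
    """
    n = len(frame_labels)
    # Segment edges: start of video, each in-range boundary, end of video.
    edges = sorted({0, n, *[b for b in boundaries if 0 < b < n]})
    refined = []
    for start, end in zip(edges, edges[1:]):
        segment = frame_labels[start:end]
        majority = Counter(segment).most_common(1)[0][0]
        refined.extend([majority] * len(segment))
    return refined


def is_overlap(raiser, follower, action):
    """Hypothetical IS score: intersection over union of the frames
    in which the raiser and the follower each perform `action`."""
    pairs = list(zip(raiser, follower))
    both = sum(r == action and f == action for r, f in pairs)
    either = sum(r == action or f == action for r, f in pairs)
    return both / either if either else 0.0


if __name__ == "__main__":
    noisy = ["idle", "idle", "nod", "idle", "nod", "nod", "nod", "idle", "idle"]
    raiser = refine_with_boundaries(noisy, boundaries=[2, 7])
    follower = ["idle"] * 3 + ["nod"] * 5 + ["idle"]
    print(raiser)
    print(is_overlap(raiser, follower, "nod"))
```

In this toy input the isolated "idle" frame inside the nod segment is an over-segmentation error; voting within the boundary-delimited segment removes it before the two timelines are compared.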