Abstract: Tutorial videos of mobile apps have become a popular and compelling way for users to learn unfamiliar app features. To make the video accessible to the users, video creators always need to annotate the actions in the video, including what actions are performed and where to tap. However, this process can be time-consuming and labor-intensive. In this paper, we introduce a lightweight approach, Video2Action, to automatically generate the action scenes and predict the action locations from the video by using image-processing and deep-learning methods. The automated experiments demonstrate the good performance of Video2Action in acquiring actions from the videos, and a user study shows the usefulness of our generated action cues in assisting video creators with action annotation.

What problem does this paper attempt to address?

The problem this paper attempts to address is: how to reduce the time and labor burden of manually annotating user actions in mobile application tutorial videos. Specifically, the authors propose a lightweight, non-intrusive method called **Video2Action** that automatically extracts action scenes and predicts action locations from videos using image processing and deep learning techniques, thereby helping video creators efficiently annotate actions.

### Background and Problem

Mobile application tutorial videos have become a popular and effective way for users to learn unfamiliar application features. To make these videos more user-friendly, video creators often need to annotate actions in the videos, including what actions were performed and where clicks occurred. However, this process is very time-consuming and labor-intensive, especially when it involves watching the video frame by frame, extracting action segments, recalling specific click locations, and annotating them.

### Solution

The **Video2Action** method aims to automate action annotation through the following two main stages:

1. **Action Scene Generation**: Using image processing techniques to segment the video into action scenes.
2. **Action Location Prediction**: Using deep learning models to predict the specific locations of actions.

### Method Details

#### Action Scene Generation

- **Shot Detection**: Detecting shot boundaries by calculating the brightness difference (Y-Diff) between consecutive frames. Specific steps include:
  - Converting the RGB color space to the YUV color space and extracting the luminance component.
  - Using the Structural Similarity Index (SSIM) to calculate the similarity value per pixel.
  - Determining shot boundaries based on similarity scores, selecting stable states with longer durations as fully rendered UIs.
- **Scene Segmentation**: Identifying different types of scenes based on the similarity scores of consecutive frames and corresponding shots:
  - **TAP**: Typically transitions instantly to a completely different UI, with a sharp drop in similarity scores.
  - **SCROLL**: Indicates a continuous transition from one UI to another, with similarity scores first dropping sharply and then gradually increasing.
  - **BACKWARD**: Returns from the current UI to the previous UI, using a stack structure to check if the palindrome UI shots are consistent.

#### Action Location Prediction

- **Model Architecture**: Proposing a deep learning model that first identifies potential clickable areas in the first UI and then predicts the perceived click location that transitions to the second UI.
- **Data Augmentation**: Integrating human knowledge into the model by combining UI-specific data augmentation methods to improve the model's robustness.
- **Post-processing**: Further refining the model's prediction results.

### Evaluation and Results

- **Automated Experiments**: Evaluated on the large-scale crowdsourced dataset Rico, the results show that **Video2Action** outperforms six commonly used baseline methods in action scene generation, with an F1 score of 81.6% and a Levenshtein score of 86.4%.
- **User Study**: Evaluated the effectiveness of **Video2Action** in assisting action annotation in real-world environments through user studies. The results show that participants saved 85% of the time when annotating using the action information generated by **Video2Action**.

### Conclusion

The **Video2Action** method significantly reduces the time and labor burden of video creators in action annotation by automatically extracting action scenes and predicting action locations, thereby improving the usability and user experience of tutorial videos.
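The shot-detection step summarized under Method Details (luminance extraction followed by a frame-to-frame similarity score) can be illustrated with a small sketch. This is not the authors' implementation: it assumes frames are tiny nested lists of RGB tuples, uses the standard BT.601 luma weights, and replaces windowed per-pixel SSIM with a single global SSIM window over the whole frame. The function names and the `threshold` value of 0.9 are hypothetical choices for illustration only.

```python
def luminance(frame):
    """Flatten an RGB frame (rows of (r, g, b) tuples, 0-255) into BT.601 luma values."""
    return [0.299 * r + 0.587 * g + 0.114 * b for row in frame for (r, g, b) in row]


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def global_ssim(y1, y2, dynamic_range=255.0):
    """SSIM computed over one global window (a simplification of windowed SSIM)."""
    if len(y1) != len(y2) or not y1:
        raise ValueError("frames must be non-empty and the same size")
    # Standard SSIM stabilising constants.
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    mu1, mu2 = mean(y1), mean(y2)
    var1 = mean([(a - mu1) * (a - mu1) for a in y1])
    var2 = mean([(b - mu2) * (b - mu2) for b in y2])
    cov = mean([(a - mu1) * (b - mu2) for a, b in zip(y1, y2)])
    numerator = (2 * mu1 * mu2 + c1) * (2 * cov + c2)
    denominator = (mu1 * mu1 + mu2 * mu2 + c1) * (var1 + var2 + c2)
    return numerator / denominator


def similarity_scores(frames):
    """Similarity between each pair of consecutive frames, on the luma channel only."""
    lumas = [luminance(f) for f in frames]
    return [global_ssim(a, b) for a, b in zip(lumas, lumas[1:])]


def shot_boundaries(scores, threshold=0.9):
    """Indices of frames that start a new shot (hypothetical fixed threshold)."""
    return [i + 1 for i, s in enumerate(scores) if s < threshold]


# Usage: two identical white frames followed by a black frame.
white = [[(255, 255, 255)] * 2] * 2
black = [[(0, 0, 0)] * 2] * 2
scores = similarity_scores([white, white, black])
boundaries = shot_boundaries(scores)  # the black frame (index 2) starts a new shot
```

A real pipeline would decode video frames with an imaging library and would also need the "stable state" selection described in the summary, which this sketch leaves out.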
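The stack-based BACKWARD check mentioned under Scene Segmentation can be sketched in a similar spirit. The summary does not give the exact procedure, so the following is an assumption about how such a check might work: each shot is pushed onto a stack, and a transition counts as BACKWARD when the new shot matches the UI just beneath the top of the stack. Shots are represented here by plain string identifiers and compared with a hypothetical `same_ui` predicate; the label `FORWARD` is a placeholder for any non-backward transition (such as TAP or SCROLL) and does not come from the source.

```python
def label_transitions(shots, same_ui=lambda a, b: a == b):
    """Label each transition between consecutive shots as BACKWARD or FORWARD."""
    stack = []   # UIs on the current navigation path
    labels = []  # one label per transition
    for shot in shots:
        if not stack:
            stack.append(shot)  # first shot: nothing to compare against
            continue
        if len(stack) >= 2 and same_ui(stack[-2], shot):
            # The new shot matches the UI visited before the current one.
            stack.pop()
            labels.append("BACKWARD")
        else:
            stack.append(shot)
            labels.append("FORWARD")
    return labels


# Usage: navigate two screens deep, return twice, then open another screen.
path = ["home", "settings", "about", "settings", "home", "search"]
labels = label_transitions(path)
```

In practice `same_ui` would compare rendered UI frames with an image-similarity measure rather than string equality.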

Video2Action: Reducing Human Interactions in Action Annotation of App Tutorial Videos
