Abstract: In this paper, we present our solutions for a spectrum of automation tasks in life-saving intervention procedures within the Trauma THOMPSON (T3) Challenge, encompassing action recognition, action anticipation, and Visual Question Answering (VQA). For action recognition and anticipation, we propose a pre-processing strategy that samples and stitches multiple inputs into a single image and then incorporates momentum- and attention-based knowledge distillation to improve the performance of the two tasks. For training, we present an action dictionary-guided design, which consistently yields the most favorable results across our experiments. In the realm of VQA, we leverage object-level features and deploy co-attention networks to train both object and question features. Notably, we introduce a novel frame-question cross-attention mechanism at the network's core for enhanced performance. Our solutions achieve the $2^{nd}$ rank in action recognition and anticipation tasks and $1^{st}$ rank in the VQA task.
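
The pre-processing step mentioned in the abstract (sampling several inputs and stitching them into one image) can be illustrated with a minimal sketch. The abstract does not state how frames are sampled or laid out, so the uniform sampling rule, the grid layout, and the names `sample_and_stitch`, `num_samples`, and `grid_cols` are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def sample_and_stitch(frames: np.ndarray, num_samples: int = 4, grid_cols: int = 2) -> np.ndarray:
    """Pick evenly spaced frames from a clip and tile them into one image.

    frames: array of shape (T, H, W, C).
    Returns an array of shape (rows * H, grid_cols * W, C).
    NOTE: uniform sampling and a row-major grid are assumptions.
    """
    t, h, w, c = frames.shape
    if num_samples % grid_cols != 0:
        raise ValueError("num_samples must be divisible by grid_cols")

    # Evenly spaced frame indices across the clip (assumed sampling rule).
    idx = np.linspace(0, t - 1, num_samples).round().astype(int)
    picked = frames[idx]  # (num_samples, H, W, C)

    rows = num_samples // grid_cols
    # View as a (rows, cols) grid of frames, then interleave the axes so
    # that each frame occupies one H x W tile of the output image.
    grid = picked.reshape(rows, grid_cols, h, w, c)
    return grid.transpose(0, 2, 1, 3, 4).reshape(rows * h, grid_cols * w, c)


# Toy clip: 8 frames of size 2x3 with 1 channel, frame i filled with value i.
clip = np.stack([np.full((2, 3, 1), i) for i in range(8)])
stitched = sample_and_stitch(clip, num_samples=4, grid_cols=2)
print(stitched.shape)
```

A single stitched image lets a standard image backbone see several time steps at once, which is one plausible motivation for this kind of design.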

What problem does this paper attempt to address?

This paper addresses several automation tasks in life-saving intervention procedures viewed from a first-person perspective. Specifically, the authors propose solutions for the three tasks in the Trauma THOMPSON (T3) challenge: action recognition, action anticipation (also called action prediction), and visual question answering (VQA). These tasks matter for medical assistance systems, particularly for providing remote guidance in uncontrolled and harsh conditions so that first-aiders or people without professional medical training can deliver effective first aid.

### Action Recognition and Action Anticipation

For the action recognition and anticipation tasks, the authors propose a pre-processing strategy that samples multiple inputs and stitches them into a single image, combined with momentum- and attention-based knowledge distillation to improve performance on both tasks. They also introduce an action dictionary-guided (ADG) design for training. With this design, the model can learn more effective action label representations, and it consistently yields the best results across their experiments. Together, these measures improve the model's ability to recognize current actions and anticipate future ones.

### Visual Question Answering (VQA)

For the VQA task, the authors use object-level features and deploy co-attention networks to train both object and question features. In particular, they introduce a novel frame-question cross-attention mechanism at the core of the network to improve performance, so that the model can relate the content of the video frames to the question being asked.
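
To make the idea of frame-question cross-attention concrete, here is a minimal sketch of generic scaled dot-product cross-attention in which question tokens attend over frame features. The text does not specify the attention direction, the dimensions, or the projection layers, so the choice of the question as the query side and the names `cross_attention`, `w_q`, `w_k`, and `w_v` are assumptions for illustration and should not be read as the authors' architecture.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(question, frame, w_q, w_k, w_v):
    """Question tokens (queries) attend over frame features (keys/values).

    question: (Lq, D) question token features.
    frame:    (N, D) frame or object-level features.
    Returns the attended features (Lq, Dv) and the weights (Lq, N).
    NOTE: using the question as the query side is an assumption.
    """
    q = question @ w_q  # (Lq, Dk)
    k = frame @ w_k     # (N, Dk)
    v = frame @ w_v     # (N, Dv)
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (Lq, N)
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ v, weights


# Toy inputs: 5 question tokens, 9 frame regions, feature size 16.
rng = np.random.default_rng(0)
question = rng.standard_normal((5, 16))
frame = rng.standard_normal((9, 16))
w_q, w_k, w_v = (rng.standard_normal((16, 8)) for _ in range(3))

attended, weights = cross_attention(question, frame, w_q, w_k, w_v)
print(attended.shape, weights.shape)
```

Each row of `weights` shows how strongly one question token draws on each frame region, which is the kind of alignment a cross-attention layer is meant to learn.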
### Conclusion

Overall, this research aims to support decision-making in medical emergencies through AI methods, especially where resources are limited or conditions are harsh. In the T3 challenge, the authors' solutions ranked second in the action recognition and anticipation tasks and first in the VQA task, and they may serve as a reference for future work.

QuIIL at T3 challenge: Towards Automation in Life-Saving Intervention Procedures from First-Person View
