QuIIL at T3 challenge: Towards Automation in Life-Saving Intervention Procedures from First-Person View

Trinh T. L. Vuong, Doanh C. Bui, Jin Tae Kwak
2024-07-18
Abstract: In this paper, we present our solutions for a spectrum of automation tasks in life-saving intervention procedures within the Trauma THOMPSON (T3) Challenge, encompassing action recognition, action anticipation, and Visual Question Answering (VQA). For action recognition and anticipation, we propose a pre-processing strategy that samples and stitches multiple inputs into a single image and then incorporates momentum- and attention-based knowledge distillation to improve the performance of the two tasks. For training, we present an action dictionary-guided design, which consistently yields the most favorable results across our experiments. In the realm of VQA, we leverage object-level features and deploy co-attention networks to train both object and question features. Notably, we introduce a novel frame-question cross-attention mechanism at the network's core for enhanced performance. Our solutions achieve the $2^{nd}$ rank in action recognition and anticipation tasks and $1^{st}$ rank in the VQA task.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the automation of several tasks in life-saving intervention procedures recorded from a first-person view. Specifically, the authors propose solutions for the three tasks in the Trauma THOMPSON (T3) challenge: action recognition, action anticipation (also called action prediction), and visual question answering (VQA). These tasks matter for medical assistance systems, particularly for remote guidance in uncontrolled and harsh conditions, where first responders or people without professional medical training need help to deliver effective first aid.

### Action Recognition and Action Anticipation

For action recognition and anticipation, the authors propose a pre-processing strategy that samples multiple input frames and stitches them into a single image, and they combine it with momentum- and attention-based knowledge distillation to improve performance on both tasks. They also introduce an Action Dictionary-Guided (ADG) design for training, which is intended to help the model learn more effective action label representations and which gave the best results across their experiments. Together, these measures improve the model's ability to recognize the current action and to anticipate future ones.

### Visual Question Answering (VQA)

For VQA, the authors use object-level features and deploy co-attention networks to train both object and question features. They also introduce a novel frame-question cross-attention mechanism at the core of the network, which they report enhances performance. The goal is for the model to better relate the content of the video frames to the question being asked.
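
To make the idea of cross-attention between question and frame features more concrete, the following is a minimal sketch of generic single-head scaled dot-product cross-attention in plain Python. It is not the authors' architecture: the source does not specify which side provides the queries, how many heads are used, or how the features are projected. Here question-token features act as queries and per-frame features act as keys and values, and the names `cross_attention`, `question_feats`, and `frame_feats` are hypothetical, chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Generic single-head scaled dot-product cross-attention.

    queries: list of vectors (assumed here: question-token features)
    keys, values: lists of vectors (assumed here: per-frame features)
    Returns (outputs, weights): one attended vector and one row of
    attention weights per query.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    out_dim = len(values[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dim).
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        w = softmax(scores)
        # Weighted sum of the value vectors.
        attended = [
            sum(wj * v[d] for wj, v in zip(w, values)) for d in range(out_dim)
        ]
        outputs.append(attended)
        weights.append(w)
    return outputs, weights


# Toy features (made up for illustration): 2 question tokens, 3 frames.
question_feats = [[1.0, 0.0], [0.0, 1.0]]
frame_feats = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
attended, attn = cross_attention(question_feats, frame_feats, frame_feats)
```

In this toy setup each question token places most of its attention on the frame whose feature vector points in the same direction, and each row of `attn` sums to one. A real model would add learned query, key, and value projections and typically several heads.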

### Conclusion

Overall, this work aims to support decision-making in medical emergencies, especially where resources are limited or conditions are harsh. In the T3 challenge, the authors' solutions ranked second in the action recognition and anticipation tasks and first in the VQA task, and they may serve as a reference for future research.