EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Language Models

Sijie Cheng, Zhicheng Guo, Jingwen Wu, Kechen Fang, Peng Li, Huaping Liu, Yang Liu
DOI: https://doi.org/10.1109/cvpr52733.2024.01355
2024-01-01
Computer Vision and Pattern Recognition
Abstract: Vision-language models (VLMs) have recently shown promising results in traditional downstream tasks. Evaluation studies have emerged to assess their abilities, with the majority focusing on the third-person perspective, and only a few addressing specific tasks from the first-person perspective. However, the capability of VLMs to "think" from a first-person perspective, a crucial attribute for advancing autonomous agents and robotics, remains largely unexplored. To bridge this research gap, we introduce EgoThink, a novel visual question-answering benchmark that encompasses six core capabilities with twelve detailed dimensions. The benchmark is constructed using selected clips from egocentric videos, with manually annotated question-answer pairs containing first-person information. To comprehensively assess VLMs, we evaluate eighteen popular VLMs on EgoThink. Moreover, given the open-ended format of the answers, we use GPT-4 as the automatic judge to compute single-answer grading. Experimental results indicate that although GPT-4V leads in numerous dimensions, all evaluated VLMs still possess considerable potential for improvement in first-person perspective tasks. Meanwhile, enlarging the number of trainable parameters has the most significant impact on model performance on EgoThink. In conclusion, EgoThink serves as a valuable addition to existing evaluation benchmarks for VLMs, providing an indispensable resource for future research in the realm of embodied artificial intelligence and robotics.