Beyond Captioning: Task-Specific Prompting for Improved VLM Performance in Mathematical Reasoning

Ayush Singh, Mansi Gupta, Shivank Garg, Abhinav Kumar, Vansh Agrawal
2024-10-08
Abstract: Vision-Language Models (VLMs) have transformed tasks requiring visual and reasoning abilities, such as image retrieval and Visual Question Answering (VQA). Despite their success, VLMs face significant challenges with tasks involving geometric reasoning, algebraic problem-solving, and counting. These limitations stem from difficulties effectively integrating multiple modalities and accurately interpreting geometry-related tasks. Various works claim that introducing a captioning pipeline before VQA tasks enhances performance. We incorporated this pipeline for tasks involving geometry, algebra, and counting. We found that captioning results are not generalizable, specifically with larger VLMs primarily trained on downstream QnA tasks showing random performance on math-related challenges. However, we present a promising alternative: task-based prompting, enriching the prompt with task-specific guidance. This approach shows promise and proves more effective than direct captioning methods for math-heavy problems.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
The main problem that this paper attempts to solve is: **Vision-Language Models (VLMs) perform poorly on tasks involving mathematical reasoning, such as geometric reasoning, algebraic problem-solving, and counting**. Although these models perform well on tasks such as image retrieval and Visual Question Answering (VQA), they face significant challenges with math-related tasks, especially in geometry, algebra, and counting.

### Specific problems include:

1. **Effective integration of multi-modal information**: VLMs have difficulty combining and interpreting visual and textual information, especially in tasks that require an understanding of geometric relationships.
2. **The particularity of math tasks**: VLMs handle math tasks poorly, especially those involving counting. This is mainly due to the scarcity of accurately labeled object quantities in the training data, a problem that worsens as the number of objects increases.
3. **Limitations of existing methods**: Although some studies have shown that introducing a captioning pipeline before VQA tasks can improve performance, this method does not generalize to larger VLMs, which show unstable results on math-related challenges.

### Solution:

The authors propose a new method, **task-based prompting**, which enhances the model's reasoning ability by adding task-specific guidance to the prompt. This method is more effective than directly using captioning, especially for math-intensive problems.

### Main research objectives:

- Evaluate the performance of different VLMs on geometric, algebraic, and counting tasks.
- Explore whether task-based prompting can effectively improve the performance of VLMs on mathematical reasoning tasks.
- Compare the effects of different prompting strategies (such as random prompting and adversarial prompting) on model performance to test the model's robustness and generalization ability.

Through these studies, the authors hope to find an effective method to improve the performance of VLMs on mathematical reasoning tasks, thereby promoting the application of multi-modal models in complex problem-solving scenarios.
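
To make the contrast between the prompting strategies concrete, here is a minimal Python sketch of how the text part of a prompt might be assembled for direct questioning, caption-then-answer, and task-based prompting. The hint wording, the `TASK_HINTS` table, and the `build_prompt` and `accuracy` helpers are illustrative assumptions, not the prompts or evaluation code used in the paper, and no particular VLM API is implied.

```python
from typing import Optional, Sequence

# Hypothetical task hints: illustrative wording, not the prompts used in the paper.
TASK_HINTS = {
    "geometry": "Identify the shapes, labelled lengths, and angles, "
                "then apply the relevant geometric relations step by step.",
    "algebra": "Read the quantities or equations shown in the image, "
               "define the unknowns, and solve step by step.",
    "counting": "Scan the image systematically and count each requested "
                "object exactly once.",
}


def build_prompt(question: str, strategy: str,
                 task: Optional[str] = None,
                 caption: Optional[str] = None) -> str:
    """Return the text prompt that would accompany the image for one strategy."""
    if strategy == "direct":
        # Baseline: the question alone.
        return f"Question: {question}\nAnswer:"
    if strategy == "caption":
        # Captioning pipeline: a previously generated caption is prepended.
        if caption is None:
            raise ValueError("the 'caption' strategy needs a caption")
        return f"Image description: {caption}\nQuestion: {question}\nAnswer:"
    if strategy == "task":
        # Task-based prompting: task-specific guidance is prepended instead.
        if task not in TASK_HINTS:
            raise ValueError(f"unknown task: {task!r}")
        return f"Task guidance: {TASK_HINTS[task]}\nQuestion: {question}\nAnswer:"
    raise ValueError(f"unknown strategy: {strategy!r}")


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Exact-match accuracy after trimming whitespace and lowercasing."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        return 0.0
    correct = sum(p.strip().lower() == a.strip().lower()
                  for p, a in zip(predictions, answers))
    return correct / len(answers)


if __name__ == "__main__":
    question = "How many triangles are in the figure?"
    for strategy in ("direct", "caption", "task"):
        print(build_prompt(question, strategy, task="counting",
                           caption="A square split by both diagonals."))
        print("---")
```

In such a setup, each strategy's prompts would be sent to the same model with the same images, and `accuracy` would be computed per strategy and per task so that captioning and task-based prompting can be compared against the direct baseline.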