Enhancing Vision Models for Text-Heavy Content Understanding and Interaction

Adithya TG,Adithya SK,Abhinav R Bharadwaj,Abhiram HA,Dr. Surabhi Narayan
2024-05-31
Abstract:Interacting and understanding with text heavy visual content with multiple images is a major challenge for traditional vision models. This paper is on enhancing vision models' capability to comprehend or understand and learn from images containing a huge amount of textual information from the likes of textbooks and research papers which contain multiple images like graphs, etc and tables in them with different types of axes and scales. The approach involves dataset preprocessing, fine tuning which is by using instructional oriented data and evaluation. We also built a visual chat application integrating CLIP for image encoding and a model from the Massive Text Embedding Benchmark which is developed to consider both textual and visual inputs. An accuracy of 96.71% was obtained. The aim of the project is to increase and also enhance the advance vision models' capabilities in understanding complex visual textual data interconnected data, contributing to multimodal AI.
Computer Vision and Pattern Recognition,Artificial Intelligence,Computation and Language
What problem does this paper attempt to address?
The paper aims to address the challenges faced by traditional visual models in understanding and interacting with text-intensive image content (such as research papers, textbooks, and documents). Specifically, the goal of the paper is to enhance the capabilities of visual models through the following methods: 1. **Understanding complex textual data**: Improve the ability of visual models to process images containing a large amount of textual information (e.g., charts, tables, etc.). 2. **Multimodal data fusion**: Optimize the ability of visual models to seamlessly integrate textual information, enabling them to extract valuable information from text-intensive images. 3. **Improving model performance**: Enhance model performance through data preprocessing, fine-tuning (using LoRA technology), and evaluation. Through these improvements, the paper aims to advance the development of multimodal artificial intelligence, enabling models to more effectively understand and interact with text-rich visual content.