Directed Domain Fine-Tuning: Tailoring Separate Modalities for Specific Training Tasks

Daniel Wen, Nafisa Hussain

2024-06-24

Abstract: Large language models (LLMs) and large visual language models (LVLMs) have been at the forefront of the artificial intelligence field, particularly for tasks like text generation, video captioning, and question-answering. Typically, these models are trained on broad knowledge bases or datasets to increase generalizability, learn relationships between topics, and recognize patterns. Instead, we propose to provide instructional datasets specific to the task of each modality within a distinct domain and then fine-tune the parameters of the model using LoRA. With our approach, we can eliminate all noise irrelevant to the given task while also ensuring that the model generates with enhanced precision. For this work, we use Video-LLaVA to generate recipes given cooking videos without transcripts. Video-LLaVA's multimodal architecture allows us to provide cooking images to its image encoder, cooking videos to its video encoder, and general cooking questions to its text encoder. Thus, we aim to remove all noise unrelated to cooking while improving our model's ability to generate specific ingredient lists and detailed instructions. As a result, our approach to fine-tuning Video-LLaVA yields a 2% gain over the baseline Video-LLaVA on the YouCook2 dataset. While this may seem like a marginal increase, our model trains on an image instruction dataset 2.5% the size of Video-LLaVA's and a video instruction dataset 23.76% the size of Video-LLaVA's.

Computer Vision and Pattern Recognition, Artificial Intelligence

What problem does this paper attempt to address?

The paper addresses the problem of generating precise step-by-step recipes from cooking videos. Specifically, the authors propose Directed Domain Fine-Tuning, an approach for adapting large visual language models (LVLMs) so that they generate detailed ingredient lists and step-by-step instructions from cooking videos that do not contain subtitles or audio information. They use the Video-LLaVA model and train it on a task-specific dataset for each of three modalities (images, videos, and text) to improve its performance in the cooking domain. Experimental results show that the fine-tuned model improves accuracy by 2% on the YouCook2 dataset compared to the base version of Video-LLaVA, despite using far less training data than the base model. The method aims to reduce irrelevant noise and improve the relevance and accuracy of the generated content.
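
The abstract states only that the model's parameters are fine-tuned with LoRA. As a rough illustration of the general low-rank adaptation idea, and not the authors' training code, the sketch below keeps a weight matrix frozen and adds a small trainable update scaled by alpha / r. The class name LoRALinear, the rank and alpha values, and the initialization scheme are all assumptions made for this example.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Frozen weight W (out x in) plus a low-rank update (alpha / r) * B @ A.

    Hypothetical illustration: only A (r x in) and B (out x r) would be trained.
    """

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight  # frozen: never updated during fine-tuning
        self.scale = alpha / rank
        # A starts small and random, B starts at zero,
        # so the adapter changes nothing before training.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]

    def effective_weight(self):
        """Return W + (alpha / r) * B @ A as a list of rows."""
        delta = matmul(self.B, self.A)
        return [
            [w + self.scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(self.weight, delta)
        ]

    def forward(self, x):
        """Apply the adapted layer to one input vector x of length in_dim."""
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.effective_weight()]

    def trainable_params(self):
        """Count the entries of A and B, the only values LoRA would update."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])
```

For a weight matrix with out x in entries, the adapter trains r * (in + out) values, which is far fewer than out * in when r is small relative to both dimensions. That is the usual reason LoRA is paired with a small domain-specific dataset.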

LLaVA-Chef: A Multi-modal Generative Model for Food Recipes

Video Instruction Tuning With Synthetic Data

Vision and Structured-Language Pretraining for Cross-Modal Food Retrieval

Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback

Multimodal Language Models for Domain-Specific Procedural Video Summarization

COCO is "ALL" You Need for Visual Instruction Fine-tuning

Visual Cue Enhancement and Dual Low-Rank Adaptation for Efficient Visual Instruction Fine-Tuning

StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data

Large Language Models for Ingredient Substitution in Food Recipes using Supervised Fine-tuning and Direct Preference Optimization

Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision

Align²LLaVA: Cascaded Human and Large Language Model Preference Alignment for Multi-modal Instruction Curation

Vision-Language Instruction Tuning: A Review and Analysis

VILA: On Pre-training for Visual Language Models

Vision-Flan: Scaling Human-Labeled Tasks in Visual Instruction Tuning

Fine-tuning large language models for domain adaptation: Exploration of training strategies, scaling, model merging and synergistic capabilities

LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding

Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions

LLaVA-KD: A Framework of Distilling Multimodal Large Language Models