Abstract: Fine-tuning facilitates the adaptation of text-to-image generative models to novel concepts (e.g., styles and portraits), empowering users to forge creatively customized content. Recent efforts on fine-tuning focus on reducing training data and lightening computation overload but neglect alignment with user intentions, particularly in manual curation of multi-modal training data and intent-oriented evaluation. Informed by a formative study with fine-tuning practitioners for comprehending user intentions, we propose IntentTuner, an interactive framework that intelligently incorporates human intentions throughout each phase of the fine-tuning workflow. IntentTuner enables users to articulate training intentions with imagery exemplars and textual descriptions, automatically converting them into effective data augmentation strategies. Furthermore, IntentTuner introduces novel metrics to measure user intent alignment, allowing intent-aware monitoring and evaluation of model training. Application exemplars and user studies demonstrate that IntentTuner streamlines fine-tuning, reducing cognitive effort and yielding superior models compared to the common baseline tool.

What problem does this paper attempt to address?

The paper addresses the problem of aligning the fine-tuning of text-to-image generation models with user intent. The researchers observe that existing fine-tuning methods reduce the number of training images and the computational resources required, but neglect the alignment between user intent and technical implementation, especially in the manual curation of multimodal training data and in intent-based evaluation.

To solve this problem, the research team proposes an interactive framework called IntentTuner. The framework integrates user intent through three main aspects:

1. **Understanding User Intent**: Capturing user intent through natural language descriptions and interactive methods.
2. **Efficiently Translating User Intent into Data Strategies**: Automatically converting user intent into matching data augmentation strategies.
3. **Monitoring and Evaluating Intent Alignment**: Introducing new metrics that measure the degree of user intent alignment and allow intent-aware monitoring and evaluation during model training.

The design of IntentTuner is informed by a formative study with fine-tuning practitioners, and it lets users express their training intent through example images and text descriptions.

The paper also details the specific challenges users face in fine-tuning practice: the difficulty of translating abstract intent into clear data strategies, the lack of effective model selection and evaluation methods, and the lack of intuitive training monitoring tools. To address these challenges, IntentTuner provides an integrated system that unifies the fine-tuning and generation processes, allows both expert and novice users to flexibly customize text-to-image generation models according to their intent, and supports user-friendly monitoring and evaluation functions for intuitive model selection.

IntentTuner: An Interactive Framework for Integrating Human Intents in Fine-tuning Text-to-Image Generative Models
