VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment
Lei Li,Zhihui Xie,Mukai Li,Shunian Chen,Peiyi Wang,Liang Chen,Yazheng Yang,Benyou Wang,Lingpeng Kong,Qi Liu
2024-10-18
Abstract:As large vision-language models (LVLMs) evolve rapidly, the demand for high-quality and diverse data to align these models becomes increasingly crucial. However, the creation of such data with human supervision proves costly and time-intensive. In this paper, we investigate the efficacy of AI feedback to scale supervision for aligning LVLMs. We introduce VLFeedback, the first large-scale vision-language feedback dataset, comprising over 82K multi-modal instructions and comprehensive rationales generated by off-the-shelf models without human annotations. To evaluate the effectiveness of AI feedback for vision-language alignment, we train Silkie, an LVLM fine-tuned via direct preference optimization on VLFeedback. Silkie showcases exceptional performance regarding helpfulness, visual faithfulness, and safety metrics. It outperforms its base model by 6.9\% and 9.5\% in perception and cognition tasks, reduces hallucination issues on MMHal-Bench, and exhibits enhanced resilience against red-teaming attacks. Furthermore, our analysis underscores the advantage of AI feedback, particularly in fostering preference diversity to deliver more comprehensive improvements. Our dataset, training code and models are available at <a class="link-external link-https" href="https://vlf-silkie.github.io" rel="external noopener nofollow">this https URL</a>.
Computer Vision and Pattern Recognition,Computation and Language