GenLens: A Systematic Evaluation of Visual GenAI Model Outputs

Tica Lin, Hanspeter Pfister, Jui-Hsien Wang
2024-02-06
Abstract: The rapid development of generative AI (GenAI) models in computer vision necessitates effective evaluation methods to ensure their quality and fairness. Existing tools primarily focus on dataset quality assurance and model explainability, leaving a significant gap in GenAI output evaluation during model development. Current practices often depend on developers' subjective visual assessments, which may lack scalability and generalizability. This paper bridges this gap by conducting a formative study with GenAI model developers in an industrial setting. Our findings led to the development of GenLens, a visual analytic interface designed for the systematic evaluation of GenAI model outputs during the early stages of model development. GenLens offers a quantifiable approach for overviewing and annotating failure cases, customizing issue tags and classifications, and aggregating annotations from multiple users to enhance collaboration. A user study with model developers reveals that GenLens effectively enhances their workflow, evidenced by high satisfaction rates and a strong intent to integrate it into their practices. This research underscores the importance of robust early-stage evaluation tools in GenAI development, contributing to the advancement of fair and high-quality GenAI models.
Human-Computer Interaction, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses the early-stage evaluation of generative AI (GenAI) models in computer vision. Existing tools focus mainly on dataset quality assurance and model explainability, leaving a gap in evaluating outputs during GenAI model development. Developers often rely on subjective visual assessment, which may lack scalability and generalizability.
Through a formative study with GenAI model developers in an industrial setting, the paper shows why systematic evaluation matters at the early stages of development. GenLens, an interactive web application for annotating and analyzing GenAI model outputs, fills this gap. It supports an evaluation process that runs from pattern discovery through issue labeling and result aggregation to evidence-based insights. It offers a quantifiable way to overview and annotate failure cases, lets users customize issue tags and classifications, and aggregates annotations from multiple users to support collaboration, thereby supporting improvements in model training. GenLens targets four key objectives: pattern discovery, issue identification, performance analysis, and insight summarization.
A user study indicates that GenLens improves model developers' workflow and efficiency, yields high satisfaction, and helps them gain better insights for validation; participants also report a strong intention to integrate it into their practice. The paper also discusses related work on generative AI, visual analytics for machine learning, and the challenges of evaluating GenAI model outputs.
Finally, the paper presents the design iteration process of GenLens, including key components such as the discovery page, annotation modes, and analysis page, along with its implementation and user feedback. User evaluations indicate that GenLens is highly useful for model evaluation and user-friendly, and that users strongly intend to use it. The research also offers two insights for GenAI model development: enhancing collaboration in the early model evaluation stage and promoting human-centric GenAI development. Future work may include optimizing GenLens for larger-scale data and different tasks, as well as evaluating the use of model outputs by end users after deployment.
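To make the idea of aggregating annotations from multiple users concrete, here is a minimal Python sketch. It is not GenLens's implementation: the record layout, the annotator names, the image IDs, the issue tags, and the `aggregate` helper are all hypothetical and chosen only for illustration. It counts how many distinct images carry each issue tag and records which annotators agreed on each (image, tag) pair.

```python
from collections import Counter, defaultdict

# Hypothetical annotation records: (annotator, image_id, issue_tag).
# None of these names or tags come from the paper.
annotations = [
    ("alice", "img_001", "distorted hands"),
    ("bob", "img_001", "distorted hands"),
    ("bob", "img_002", "blurry background"),
    ("carol", "img_001", "blurry background"),
    ("carol", "img_003", "distorted hands"),
]


def aggregate(records):
    """Merge annotations from several users.

    Returns a Counter of distinct images per issue tag and a dict
    mapping each (image_id, tag) pair to the set of annotators who
    assigned that tag to that image.
    """
    voters = defaultdict(set)
    for annotator, image_id, tag in records:
        voters[(image_id, tag)].add(annotator)
    # Each (image, tag) pair counts once, however many users flagged it.
    tag_counts = Counter(tag for (_image_id, tag) in voters)
    return tag_counts, voters


tag_counts, voters = aggregate(annotations)

for tag, n_images in tag_counts.most_common():
    print(f"{tag}: flagged on {n_images} image(s)")

for (image_id, tag), who in sorted(voters.items()):
    print(f"{image_id} / {tag}: {len(who)} annotator(s)")
```

Counting each (image, tag) pair once keeps a frequently annotated image from inflating a tag's total, while the per-pair annotator sets give a simple measure of agreement between users.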