Abstract: The comic domain is rapidly advancing with the development of single-page analysis and synthesis models. However, evaluation metrics and datasets lag behind, often limited to small-scale or single-style test sets. We introduce a novel benchmark, CoMix, designed to evaluate the multi-task capabilities of models in comic analysis. Unlike existing benchmarks that focus on isolated tasks such as object detection or text recognition, CoMix addresses a broader range of tasks including object detection, speaker identification, character re-identification, reading order, and multi-modal reasoning tasks like character naming and dialogue generation. Our benchmark comprises three existing datasets with expanded annotations to support multi-task evaluation. To mitigate the over-representation of manga-style data, we have incorporated a new dataset of carefully selected American comic-style books, thereby enriching the diversity of comic styles. CoMix is designed to assess pre-trained models in zero-shot and limited fine-tuning settings, probing their transfer capabilities across different comic styles and tasks. The validation split of the benchmark is publicly available for research purposes, and an evaluation server for the held-out test split is also provided. Comparative results between human performance and state-of-the-art models reveal a significant performance gap, highlighting substantial opportunities for advancements in comic understanding. The dataset, baseline models, and code are accessible at https://github.com/emanuelevivoli/CoMix-dataset. This initiative sets a new standard for comprehensive comic analysis, providing the community with a common benchmark for evaluation on a large and varied set.

Comic MTL: optimized multi-task learning for comic book image analysis

Dense Multitask Learning to Reconfigure Comics

CoMix: A Comprehensive Benchmark for Multi-Task Comic Understanding

EmoComicNet : A multi-task model for comic emotion recognition

Multimodal Transformer for Comics Text-Cloze

CNN-based segmentation of speech balloons and narrative text boxes from comic book page images

Comic Text Detection and Recognition Based on Deep Learning

ComiCap: A VLMs pipeline for dense captioning of Comic Panels

Optimizing Dense Visual Predictions Through Multi-Task Coherence and Prioritization

A tree conditional random field model for panel detection in comic images

Comics Datasets Framework: Mix of Comics datasets for detection benchmarking

MangaUB: A Manga Understanding Benchmark for Large Multimodal Models

Automating Manga Character Analysis: A Robust Deep Vision-Transformer Approach to Facial Landmark Detection

Robust Analysis of Multi-Task Learning Efficiency: New Benchmarks on Light-Weighed Backbones and Effective Measurement of Multi-Task Learning Challenges by Feature Disentanglement

When Multitask Learning Meets Partial Supervision: A Computer Vision Review

DenseMTL: Cross-task Attention Mechanism for Dense Multi-task Learning

AutoMTL: A Programming Framework for Automating Efficient Multi-Task Learning

Distribution Matching for Multi-Task Learning of Classification Tasks: a Large-Scale Study on Faces & Beyond

AP-MTL: Attention Pruned Multi-task Learning Model for Real-time Instrument Detection and Segmentation in Robot-assisted Surgery

CML-MOTS: Collaborative Multi-task Learning for Multi-Object Tracking and Segmentation