Toward accessible comics for blind and low vision readers

Christophe Rigaud, Jean-Christophe Burie, Samuel Petit
2024-09-10
Abstract: This work explores how to fine-tune large language models using prompt engineering techniques with contextual information for generating an accurate text description of the full story, ready to be forwarded to off-the-shelf speech synthesis tools. We propose to use existing computer vision and optical character recognition techniques to build a grounded context from the comic strip image content, such as panels, characters, text, reading order and the association of bubbles and characters. Then we infer character identification and generate a comic book script with context-aware panel descriptions including characters' appearance, posture, mood, dialogues, etc. We believe that such enriched content description can be easily used to produce audiobooks and eBooks with various voices for characters, captions and sound effects.
Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses how to make comic books more accessible to blind and low-vision readers. Specifically, the researchers explore how to use large language models (LLMs) and prompt engineering techniques to generate accurate text descriptions that existing speech synthesis tools can convert into audio, allowing blind and low-vision readers to follow and enjoy comic stories by listening.

The key challenges mentioned in the paper include:

1. **Automatic identification and description of image content**: This involves using computer vision and optical character recognition technologies to extract information from comic strip images, such as panels, characters, text, reading order, and the association between speech bubbles and characters.
2. **Generating comic scripts with contextual information**: This includes identifying characters, describing their appearance, posture, emotions, dialogues, etc., and organizing this information in a natural reading order.
3. **Improving the quality of text-to-speech**: Ensuring that the generated text descriptions are detailed and accurate enough to produce high-quality audio output through speech synthesis tools, enabling blind and low-vision readers to better understand the comic content.

By addressing these issues, the researchers hope to give blind and low-vision readers a new way to experience comics, so that they can enjoy the art and stories of comics just as sighted readers do.
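
To make the idea of a grounded context more concrete, here is a minimal Python sketch of how already-detected elements could be turned into plain text for an LLM prompt. It is not the authors' implementation. Every name in it (`Box`, `Panel`, `reading_order`, `nearest_speaker`, `build_context`) is hypothetical, the row-then-column ordering assumes a Western left-to-right layout, and the nearest-center rule is only a stand-in for whatever bubble-to-character association method the paper actually uses.

```python
from dataclasses import dataclass, field
from math import hypot
from typing import List, Optional


@dataclass
class Box:
    """Axis-aligned bounding box in image pixels (hypothetical format)."""
    x: float
    y: float
    w: float
    h: float

    @property
    def center(self):
        return (self.x + self.w / 2, self.y + self.h / 2)


@dataclass
class Character:
    box: Box
    name: str
    description: str = ""  # e.g. appearance, posture, mood


@dataclass
class Bubble:
    box: Box
    text: str  # OCR output for this speech bubble


@dataclass
class Panel:
    box: Box
    characters: List[Character] = field(default_factory=list)
    bubbles: List[Bubble] = field(default_factory=list)


def reading_order(panels, row_height=50.0):
    """Assumed heuristic: group panels into rows by their top edge,
    then read each row left to right."""
    return sorted(panels, key=lambda p: (p.box.y // row_height, p.box.x))


def nearest_speaker(bubble, characters) -> Optional[Character]:
    """Assumed heuristic: the speaker is the character whose box center
    is closest to the bubble center."""
    if not characters:
        return None
    bx, by = bubble.box.center
    return min(
        characters,
        key=lambda c: hypot(c.box.center[0] - bx, c.box.center[1] - by),
    )


def panel_to_text(index, panel):
    """Describe one panel: its characters first, then its dialogue."""
    lines = [f"Panel {index}."]
    for c in panel.characters:
        lines.append(f"Character {c.name}: {c.description or 'no description'}")
    for b in sorted(panel.bubbles, key=lambda b: (b.box.y, b.box.x)):
        speaker = nearest_speaker(b, panel.characters)
        who = speaker.name if speaker else "Narrator"
        lines.append(f"{who} says: {b.text}")
    return "\n".join(lines)


def build_context(panels):
    """Concatenate panel descriptions in reading order."""
    ordered = reading_order(panels)
    return "\n\n".join(panel_to_text(i, p) for i, p in enumerate(ordered, 1))
```

The string returned by `build_context` could then be placed in an LLM prompt that asks for a narrated script. Keeping speaker names explicit on each dialogue line is what would let a downstream speech synthesis tool assign a different voice to each character.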