Abstract: This study explores efficient tuning methods for the screenshot captioning task. Image captioning has advanced significantly in recent years, but research on captioning mobile screens remains relatively scarce, and current datasets and use cases describing user behaviors within product screenshots are notably limited. We therefore fine-tune pre-existing models for the screenshot captioning task. However, fine-tuning large pre-trained models is resource-intensive, requiring considerable time, computational power, and storage because of the vast number of parameters in image captioning models. To tackle this challenge, this study proposes a combination of adapter methods, which requires tuning only the additional modules added to the model. These methods were originally designed for vision or language tasks, and we apply them to address similar challenges in screenshot captioning. By freezing the parameters of the image captioning model and training only the weights associated with the adapter methods, performance comparable to fine-tuning the entire model can be achieved while significantly reducing the number of trainable parameters. This study represents the first comprehensive investigation into the effectiveness of combining adapters within the context of the screenshot captioning task. Through our experiments and analyses, this study aims to provide valuable insights into the application of adapters in vision-language models and to contribute to the development of efficient tuning techniques for the screenshot captioning task. Our study is available at https://github.com/RainYuGG/BLIP-Adapter
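
The abstract's point that freezing the captioning model and training only adapter weights shrinks the trainable parameter count can be illustrated with simple arithmetic. The sketch below assumes a generic bottleneck adapter (a down-projection followed by an up-projection) attached to one standard transformer layer. The hidden size of 768, the bottleneck size of 64, and the function names are illustrative assumptions, not values or code from the study, and the adapter combination the study actually uses may differ.

```python
"""Back-of-the-envelope count of trainable parameters with a bottleneck adapter.

All sizes below are illustrative assumptions, not values reported by the study.
"""


def linear_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features


def adapter_params(hidden: int, bottleneck: int) -> int:
    """Down-projection followed by up-projection (generic bottleneck adapter)."""
    return linear_params(hidden, bottleneck) + linear_params(bottleneck, hidden)


def transformer_layer_params(hidden: int, ffn_mult: int = 4) -> int:
    """Frozen backbone layer: Q/K/V/output projections, 2-layer FFN, 2 LayerNorms."""
    attention = 4 * linear_params(hidden, hidden)
    ffn = linear_params(hidden, ffn_mult * hidden) + linear_params(ffn_mult * hidden, hidden)
    layer_norms = 2 * (2 * hidden)  # scale and shift per LayerNorm
    return attention + ffn + layer_norms


def trainable_fraction(hidden: int, bottleneck: int) -> float:
    """Share of parameters that are trained when only the adapter is unfrozen."""
    adapter = adapter_params(hidden, bottleneck)
    return adapter / (transformer_layer_params(hidden) + adapter)


if __name__ == "__main__":
    hidden, bottleneck = 768, 64  # assumed sizes for illustration only
    print(f"adapter parameters per layer: {adapter_params(hidden, bottleneck):,}")
    print(f"frozen parameters per layer:  {transformer_layer_params(hidden):,}")
    print(f"trainable fraction:           {trainable_fraction(hidden, bottleneck):.2%}")
```

Under these assumed sizes the adapter accounts for only a small share of each layer's parameters, which is the kind of saving the abstract describes. The actual ratio depends on the backbone and on which adapter methods are combined.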

Read It, Don't Watch It: Captioning Bug Recordings Automatically

Reverse Engineering Time-Series Interaction Data from Screen-Captured Videos

An Empirical Investigation into the Use of Image Captioning for Automated Software Documentation

RECAP: Retrieval-Augmented Audio Captioning

GUI Action Narrator: Where and When Did That Action Take Place?

Caption positioning structure for hard of hearing people using deep learning method

QAVidCap: Enhancing Video Captioning Through Question Answering Techniques

Seeing Bot

Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions

Prompting Is All You Need: Automated Android Bug Replay with Large Language Models

Caption Anything: Interactive Image Description with Diverse Multimodal Controls

Towards Effective Bug Reproduction for Mobile Applications

NarrationBot and InfoBot: A Hybrid System for Automated Video Description

Delving Deeper into the Decoder for Video Captioning

MultiCapCLIP: Auto-Encoding Prompts for Zero-Shot Multilingual Visual Captioning

BLIP-Adapter: Parameter-Efficient Transfer Learning for Mobile Screenshot Captioning

SnapCap: Efficient Snapshot Compressive Video Captioning

Video captioning – a survey

Mubug: a Mobile Service for Rapid Bug Tracking

Extracting Replayable Interactions from Videos of Mobile App Usage

Video2Action: Reducing Human Interactions in Action Annotation of App Tutorial Videos