Image captioning for cultural artworks: a case study on ceramics
Baoying Zheng, Fang Liu, Mohan Zhang, Tongqing Zhou, Shenglan Cui, Yunfan Ye, Yeting Guo
DOI: https://doi.org/10.1007/s00530-023-01178-8
IF: 3.9
2023-01-01
Multimedia Systems
Abstract: When viewing ancient artworks, people try to build connections with them to ‘read’ the correct messages from the past. A proper descriptive caption is essential for viewers to attain a shared understanding and cognitive appreciation. Recent advances in tailoring deep learning for image analysis predominantly focus on generating captions for natural images. However, these techniques are ill-suited to interpreting ancient artworks, which exhibit differing appearances, various design functions, and, more importantly, implicit cultural metaphors that can hardly be summarized in a single short caption or sentence. This work presents the design and implementation of a novel framework, termed ARTalk, for comprehensive image captioning of ancient artworks, with ceramics as the running case. First, we conduct an exploratory study on understanding ancient artwork captions, elaborate 15 factors through semi-structured discussions with experts, and form a dedicated caption template based on a statistical analysis of factor importance. Second, we build a dataset (i.e., CArt15K) with factor-granularity annotations on the visuals and texts of ceramics. Third, we jointly fine-tune multiple deep networks for automatic factor extraction and construct a knowledge graph for metaphor inference. We train the networks on CArt15K, evaluate their performance against baselines, and conduct a qualitative analysis of the generated captions in practice. We have also implemented a prototype of ARTalk that interactively assists experts in caption generation. We will release the CArt15K dataset for further research.
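The pipeline outlined in the abstract (extracted factors, a caption template, and a knowledge graph for metaphor inference) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not list the 15 factors, the template wording, or the graph schema, so the factor names (`glaze`, `shape`, `period`, `motif`), the `TEMPLATE` string, the `KNOWLEDGE_GRAPH` contents, and the functions `infer_metaphors` and `build_caption` are hypothetical. The factor values are supplied by hand here, standing in for the output of the fine-tuned networks.

```python
from typing import Dict, List, Tuple

# Hypothetical factor names and values; the paper's 15 factors are not
# enumerated in the abstract.
Factors = Dict[str, str]

# Toy knowledge graph stored as (subject, relation) -> object.
# The entries are illustrative only, not taken from the paper.
KNOWLEDGE_GRAPH: Dict[Tuple[str, str], str] = {
    ("lotus", "symbolizes"): "purity",
    ("dragon", "symbolizes"): "imperial power",
}

# Hypothetical caption template with one slot per factor.
TEMPLATE = "A {glaze} {shape} from the {period}, decorated with a {motif} motif"


def infer_metaphors(factors: Factors) -> List[str]:
    """Look up a symbolic meaning for each factor value in the toy graph."""
    meanings = []
    for value in factors.values():
        meaning = KNOWLEDGE_GRAPH.get((value, "symbolizes"))
        if meaning is not None:
            meanings.append(f"{value} symbolizes {meaning}")
    return meanings


def build_caption(factors: Factors) -> str:
    """Fill the template, then append any metaphors found in the graph."""
    caption = TEMPLATE.format(**factors)
    metaphors = infer_metaphors(factors)
    if metaphors:
        caption += "; the " + " and the ".join(metaphors)
    return caption + "."


example = {
    "glaze": "celadon",
    "shape": "vase",
    "period": "Song dynasty",
    "motif": "lotus",
}
print(build_caption(example))
```

Separating the slot filling from the graph lookup keeps the visually observable factors apart from the inferred cultural meaning, mirroring the abstract's distinction between factor extraction and metaphor inference.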