A Multimodal Text Block Segmentation Framework for Photo Translation.

Jiajia Wu, Anni Li, Kun Zhao, Zhengyan Yang, Bing Yin, Cong Liu, Li-Rong Dai
DOI: https://doi.org/10.1007/978-3-031-46311-2_10
2023-01-01
Abstract: Advances in OCR (Optical Character Recognition) and machine translation have made photo translation a convenient tool for everyday life and study. However, when the content of an image is translated line by line, each line is cut off from the context of adjacent, semantically related text lines, which seriously degrades translation quality and makes the result difficult to understand. To tackle this problem, we propose a novel multimodal encoder-decoder model for text block segmentation. Specifically, we construct a convolutional encoder that extracts, for each text line, a multimodal representation combining visual, semantic, and positional features. In the decoder stage, inspired by the pointer network, an LSTM (Long Short-Term Memory) module outputs the predicted segmentation sequence. Experimental results show that our model outperforms the baselines by a large margin.
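The pointer-style decoding mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the attention form or the encoding of the segmentation sequence. The sketch assumes dot-product scoring (the original pointer network uses additive attention) and assumes each predicted element is the index of the last text line in a block. The names `pointer_step` and `group_lines` are hypothetical.

```python
import math


def pointer_step(query, keys):
    """One pointer-attention step (assumed dot-product scoring).

    query: decoder state as a list of floats.
    keys: one encoded vector per text line.
    Returns (index of the highest-scoring line, probability per line).
    """
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    return probs.index(max(probs)), probs


def group_lines(lines, block_ends):
    """Merge OCR lines into blocks.

    Assumption: block_ends holds, in reading order, the index of the
    last line of each block.
    """
    blocks, start = [], 0
    for end in block_ends:
        if not start <= end < len(lines):
            raise ValueError(f"invalid block end: {end}")
        blocks.append(" ".join(lines[start:end + 1]))
        start = end + 1
    if start < len(lines):  # lines left over form a final block
        blocks.append(" ".join(lines[start:]))
    return blocks


lines = ["Opening hours:", "9 am to 5 pm", "No parking", "in front of", "the gate"]
print(group_lines(lines, [1, 4]))
```

Under these assumptions, each merged block rather than each isolated line would be passed to the translation system, so that lines such as "No parking", "in front of", and "the gate" are translated together.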