Transferring General Multimodal Pretrained Models to Text Recognition

Junyang Lin, Xuancheng Ren, Yichang Zhang, Gao Liu, Peng Wang, An Yang, Chang Zhou
DOI: https://doi.org/10.48550/arXiv.2212.09297
2022-12-19
Abstract: This paper proposes a new method, OFA-OCR, to transfer multimodal pretrained models to text recognition. Specifically, we recast text recognition as image captioning and directly transfer a unified vision-language pretrained model to the end task. Without pretraining on large-scale annotated or synthetic text recognition data, OFA-OCR outperforms the baselines and achieves state-of-the-art performance on the Chinese text recognition benchmark. Additionally, we construct an OCR pipeline with OFA-OCR and demonstrate that it achieves performance competitive with the product-level API. The code (https://github.com/OFA-Sys/OFA) and demo (https://modelscope.cn/studios/damo/ofa_ocr_pipeline/summary) are publicly available.
Computer Vision and Pattern Recognition