Image-to-LaTeX Converter for Mathematical Formulas and Text

Daniil Gurgurov, Aleksey Morshnev
2024-08-08
Abstract: In this project, we train a vision encoder-decoder model to generate LaTeX code from images of mathematical formulas and text. Utilizing a diverse collection of image-to-LaTeX data, we build two models: a base model with a Swin Transformer encoder and a GPT-2 decoder, trained on machine-generated images, and a fine-tuned version enhanced with Low-Rank Adaptation (LoRA) trained on handwritten formulas. We then compare the BLEU performance of our specialized model on a handwritten test set with other similar models, such as Pix2Text, TexTeller, and Sumen. Through this project, we contribute open-source models for converting images to LaTeX and provide from-scratch code for building these models with distributed training and GPU optimizations.
Computation and Language, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper develops Im2Latex, an image-to-LaTeX converter that turns images of mathematical formulas and text into LaTeX code. Its main contributions are:

1. **An image-to-LaTeX conversion method**: The authors build a vision encoder-decoder model from a Swin Transformer encoder and a GPT-2 decoder to convert images into LaTeX code.
2. **Recognition of complex mathematical formulas**: With the Swin Transformer as the encoder, the model can handle images containing complex mathematical formulas, which improves recognition accuracy.
3. **Recognition of both printed and handwritten formulas**: A base model is first trained on printed, machine-generated formula images. It is then fine-tuned with LoRA so that it can also recognize handwritten formulas.
4. **Comparison with existing models**: The authors compare the fine-tuned model against similar models (such as Pix2Text, TexTeller, and Sumen) on a handwritten test set, using BLEU to evaluate its relative performance and robustness.
5. **Open-source resources**: The authors release their code and pre-trained models to support further research and development in OCR, particularly for mathematical and scientific documents.

In summary, the paper proposes an efficient and accurate image-to-LaTeX conversion method and validates it experimentally.
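
To make the LoRA fine-tuning step more concrete, here is a minimal pure-Python sketch of the low-rank update that LoRA applies to a frozen weight matrix, W' = W + (alpha / r) * B A. It is an illustration under stated assumptions, not the authors' training code: the matrix sizes, rank, scaling factor, and helper names (`matmul`, `lora_effective_weight`, `trainable_params`) are all hypothetical, and the summary does not state which layers were adapted or which hyperparameters were used.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where r is the adapter rank."""
    r = len(a)  # A has shape (r, d_in); B has shape (d_out, r)
    scale = alpha / r
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def trainable_params(d_out, d_in, r):
    """Return (LoRA parameter count for A and B, full parameter count of W)."""
    return r * (d_in + d_out), d_out * d_in


# Hypothetical sizes for illustration only; not the paper's configuration.
d_out, d_in, r, alpha = 6, 4, 2, 4
rng = random.Random(0)
w = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen base weight
a = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # small random init
b = [[0.0] * r for _ in range(d_out)]                               # zero init

# Because B starts at zero, the adapted layer initially equals the base layer.
w_start = lora_effective_weight(w, a, b, alpha)

# Parameter counts for a hypothetical 768 x 768 layer adapted at rank 8.
lora_count, full_count = trainable_params(768, 768, 8)
print(lora_count, full_count)
```

In this sketch only `a` and `b` would be updated during fine-tuning while `w` stays frozen, which is why adapting a base model to handwritten formulas can cost far fewer trainable parameters than retraining every weight.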