MathBridge: A Large Corpus Dataset for Translating Spoken Mathematical Expressions into $LaTeX$ Formulas for Improved Readability
Kyudan Jung,Sieun Hyeon,Jeong Youn Kwon,Nam-Joon Kim,Hyun Gon Ryu,Hyuk-Jae Lee,Jaeyoung Do
2024-08-16
Abstract:Improving the readability of mathematical expressions in text-based document such as subtitle of mathematical video, is an significant task. To achieve this, mathematical expressions should be convert to compiled formulas. For instance, the spoken expression ``x equals minus b plus or minus the square root of b squared minus four a c, all over two a'' from automatic speech recognition is more readily comprehensible when displayed as a compiled formula $x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$. To convert mathematical spoken sentences to compiled formulas, two processes are required: spoken sentences are converted into LaTeX formulas, and LaTeX formulas are converted into compiled formulas. The latter can be managed by using LaTeX engines. However, there is no way to do the former effectively. Even if we try to solve this using language models, there is no paired data between spoken sentences and LaTeX formulas to train it. In this paper, we introduce MathBridge, the first extensive dataset for translating mathematical spoken sentences into LaTeX formulas. MathBridge comprises approximately 23 million LaTeX formulas paired with the corresponding mathematical spoken sentences. Through comprehensive evaluations, including fine-tuning with proposed data, we discovered that MathBridge significantly enhances the capabilities of pretrained language models for converting to LaTeX formulas from mathematical spoken sentences. Specifically, for the T5-large model, the sacreBLEU score increased from 4.77 to 46.8, demonstrating substantial enhancement.
Machine Learning,Artificial Intelligence,Computation and Language