Why Machines Cannot Learn Mathematics, Yet

André Greiner-Petter, Terry Ruas, Moritz Schubotz, Akiko Aizawa, William Grosky, Bela Gipp
DOI: https://doi.org/10.48550/arXiv.1905.08359
2019-05-21
Abstract: Nowadays, Machine Learning (ML) is seen as the universal solution to improve the effectiveness of information retrieval (IR) methods. However, while mathematics is a precise and accurate science, it is usually expressed by less accurate and imprecise descriptions, contributing to the relative dearth of machine learning applications for IR in this domain. Generally, mathematical documents communicate their knowledge with an ambiguous, context-dependent, and non-formal language. Given recent advances in ML, it seems canonical to apply ML techniques to represent and retrieve mathematics semantically. In this work, we apply popular text embedding techniques to the arXiv collection of STEM documents and explore how these are unable to properly understand mathematics from that corpus. In addition, we also investigate the missing aspects that would allow mathematics to be learned by computers.
Digital Libraries, Artificial Intelligence, Information Retrieval
What problem does this paper attempt to address?
The paper addresses the shortcomings of current machine learning (ML) techniques in mathematical information retrieval (MIR). Although ML has made remarkable progress in natural language processing (NLP), it still struggles to understand and represent the semantics of mathematical expressions. These problems stem mainly from the following:

1. **Ambiguity and context dependence of mathematical expressions**: Mathematical documents usually describe complex concepts and relationships in imprecise, ambiguous language, which makes it difficult for computers to recover the intended meaning of an expression.
2. **Lack of standardized definitions of mathematical symbols**: Existing mathematical markup languages (such as MathML) do not define mathematical symbols clearly enough. The same symbol can therefore be interpreted and represented in different ways, which makes machine understanding harder still.
3. **Limitations of existing embedding techniques**: Current word embedding techniques (such as word2vec and GloVe) perform well on natural language but are not effective on mathematical expressions. They cannot capture the complex relationships between mathematical symbols, nor can they distinguish the different meanings of the same symbol in different contexts.
4. **Lack of annotated data**: For machines to learn mathematics, large datasets with detailed annotations are required, but such resources are currently very scarce.
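The third point above can be made concrete with a toy example. The sketch below is not word2vec and does not use the paper's data: it builds simple co-occurrence count vectors from two hypothetical sentences, as a stand-in for any static embedding that stores exactly one vector per token. The corpus, the `window` parameter, and the function name `cooccurrence_vectors` are assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: "W" is a van der Waerden number in the first
# sentence and a physical quantity (work) in the second.
CORPUS = [
    "the van der waerden number W is bounded below".split(),
    "the work W done by the force is positive".split(),
]


def cooccurrence_vectors(sentences, window=2):
    """Build one context-count vector per token type (a static embedding)."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, token in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[token][tokens[j]] += 1
    return vectors


VECTORS = cooccurrence_vectors(CORPUS)

# There is a single entry for "W", so both meanings are merged into it.
W_VECTOR = VECTORS["W"]
print(sorted(W_VECTOR))
```

Because the table is keyed by token alone, the vector for `W` mixes context words from both sentences, and nothing in the representation can tell the two meanings apart.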
### Main research content of the paper

To investigate these problems, the authors carry out the following work:

- **Applying popular text embedding techniques**: The authors apply text embedding techniques such as word2vec to the arXiv collection of STEM documents and evaluate how well these techniques understand and represent mathematical expressions.
- **Analyzing the deficiencies of existing embedding techniques**: The experiments show that existing embedding techniques have several problems with mathematical expressions, such as failing to capture the semantics of mathematical symbols and failing to recognize different calling forms of the same function.
- **Proposing improvements**: Based on the experimental results, the authors suggest developing more precise standards for defining mathematical symbols, constructing a specialized mathematical lexicon (similar to WordNet), and training in phases: first learning basic knowledge from educational literature, then gradually expanding to more complex academic literature.

### Conclusion

The paper argues that for machines to learn and understand mathematics effectively, the ambiguity and context dependence of mathematical expressions and the limitations of existing embedding techniques must be addressed. This requires more precise standards for defining mathematical symbols, richly annotated datasets, and machine learning algorithms adapted to the special needs of the mathematical domain.

### Example of a formula

One formula mentioned in the paper is:

\[ W(2, k) > \frac{2^k}{k^\varepsilon} \]

This formula comes from the English Wikipedia page about Van der Waerden's theorem.
The symbols \( W \), \( k \), and \( \varepsilon \) in the formula may have multiple interpretations, depending on the context. For example, if \( W \) is read as a variable rather than a function, \( W(2, k) \) means something entirely different. Understanding the semantics of a mathematical expression therefore requires analyzing not only the symbols themselves but also the context in which they appear.
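The two readings of \( W(2, k) \) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the paper's method: the function `interpret`, the tuple-based tree format, and the `known_functions` argument (standing in for whatever the surrounding text says about a symbol) are all hypothetical.

```python
import re


def interpret(expr, known_functions):
    """Return a tiny syntax tree for an expression of the form X(a, b, ...).

    known_functions plays the role of context: it lists the symbols that
    the surrounding document has declared to be functions.
    """
    match = re.fullmatch(r"\s*([A-Za-z])\s*\((.*)\)\s*", expr)
    if match is None:
        raise ValueError(f"unsupported expression: {expr!r}")
    head, inner = match.group(1), match.group(2)
    args = [arg.strip() for arg in inner.split(",")]
    if head in known_functions:
        # Context says the head symbol is a function: function application.
        return ("apply", head, args)
    # Otherwise read the head as a variable multiplied by a parenthesized tuple.
    return ("times", head, ("tuple", args))


# Same string, two contexts, two different structures.
AS_FUNCTION = interpret("W(2, k)", known_functions={"W"})
AS_VARIABLE = interpret("W(2, k)", known_functions=set())
print(AS_FUNCTION)
print(AS_VARIABLE)
```

The string is identical in both calls. Only the context passed alongside it decides which structure comes out, and that context is exactly the information a purely symbol-level representation lacks.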