Automated formatting verification technique of paperwork based on the gradient boosting on decision trees

Nail Nasyrov, Mikhail Komarov, Petr Tartynskikh, Nataliya Gorlushkina
DOI: https://doi.org/10.1016/j.procs.2020.11.038
2020-01-01
Procedia Computer Science
Abstract: The article describes an automated technique for verifying the formatting of docx document elements, which forms the basis of a developed online service. Checking that a document is formatted correctly is an important task when writing research papers, explanatory notes for course projects, and other scientific works. The article describes an approach for identifying the design features of text document elements. The structure of the service's client-server interaction, whose operation was simulated, is also given. CatBoost, a machine learning algorithm based on gradient boosting on decision trees, was chosen as the primary tool for multiclass classification. Empirically refined parameters of the algorithm that increase its accuracy are described. Special attention is paid to the developed method for checking the classification results of elements whose sequence follows different regularities. This approach makes it possible to identify the formatting classes of docx file elements that the classifier identified incorrectly. In some cases, the classifier's results can be overridden to increase the accuracy of checking element formatting in docx files.
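The idea of checking classifier output against sequence regularities can be illustrated with a minimal sketch. The abstract does not give the actual rules, class names, or thresholds, so everything below is an assumption made for illustration: the four element classes, the `ALLOWED_NEXT` table, the `check_sequence` function, and the `min_prob` threshold are hypothetical. The per-element probability dictionaries stand in for the output of a multiclass classifier such as CatBoost.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical rules (not taken from the paper): for each class, the set of
# classes that may directly follow it in a document.
ALLOWED_NEXT: Dict[str, Set[str]] = {
    "heading": {"heading", "paragraph", "figure"},
    "paragraph": {"heading", "paragraph", "figure"},
    "figure": {"figure_caption"},
    "figure_caption": {"heading", "paragraph", "figure"},
}


def check_sequence(
    probs: List[Dict[str, float]], min_prob: float = 0.2
) -> Tuple[List[str], List[int]]:
    """Return (labels, flagged) for a sequence of class-probability dicts.

    Elements are scanned left to right. An element is flagged when its most
    probable class may not follow the previous element's accepted class.
    A flagged label is overridden only if the best permitted class has
    probability >= min_prob; otherwise the classifier's choice is kept.
    """
    labels: List[str] = []
    flagged: List[int] = []
    for i, dist in enumerate(probs):
        label = max(dist, key=dist.get)  # classifier's top choice
        if i > 0:
            allowed = ALLOWED_NEXT[labels[-1]]
            if label not in allowed:
                flagged.append(i)
                # Most probable class that the rules permit at this position.
                best = max(allowed, key=lambda c: dist.get(c, 0.0))
                if dist.get(best, 0.0) >= min_prob:
                    label = best  # override the classifier
        labels.append(label)
    return labels, flagged
```

In this sketch, an element predicted as a paragraph directly after a figure would be flagged, and relabeled as a figure caption only when the classifier assigned that class a probability of at least `min_prob`. The greedy left-to-right scan trusts each previously accepted label, which is a simplification; the paper's own checking method may differ.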