Document Parsing Tool for Language Translation and Web Crawling using Django REST Framework
Kruthika Alnavar, R Uday Kumar, C Narendra Babu
DOI: https://doi.org/10.1088/1742-6596/1962/1/012018
2021-07-01
Journal of Physics: Conference Series
Abstract: The world has 7.5 billion inhabitants and over 7,117 languages, but only 20% of people speak English. To understand the wisdom and knowledge of other cultures, language translation becomes a basic need. In this paper, a computer-assisted document parsing tool is investigated. The proposed approach uses a language translator that translates text directly from images, which removes the need for a human translator for images and reduces the scope for misinterpretation and misunderstanding among people of different ethnic groups. The proposed tool can also perform web crawling using the Django Representational State Transfer (REST) framework. Further, the proposed approach employs the Python packages pytesseract, textblob and beautifulsoup to perform Optical Character Recognition, translation and extraction of Hypertext Markup Language (HTML) data, respectively. Experimental results of translation on four categories of images (Maps, Comics, Newspapers and Magazines, and Scientific Publications) demonstrate accuracies of 97.2%, 93.3%, 95.82% and 98.27%, respectively. For web crawling, across websites such as E-commerce, Magazines, Blogs, Social Media, News and Educational sites, an average precision of 5.4, recall of 7.45 and F-score of 6.24 is achieved. The results suggest that the proposed system can be used as an improvement over a human translator and a data entry operator.
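The abstract reports precision, recall and F-score for the web-crawling component but does not say how these were computed. The following is a minimal sketch of one plausible evaluation, not the authors' method: it extracts links from HTML and scores them against a hand-labelled set of relevant links. It uses the standard-library `html.parser` in place of beautifulsoup so that it runs without extra dependencies. The names `LinkExtractor`, `extract_links` and `precision_recall_f1`, the sample HTML and the "relevant" set are all hypothetical.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags (stdlib stand-in for beautifulsoup)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return the set of link targets found in an HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    return set(parser.links)


def precision_recall_f1(extracted, relevant):
    """Score a set of extracted items against a set of relevant items.

    precision = |extracted AND relevant| / |extracted|
    recall    = |extracted AND relevant| / |relevant|
    F-score   = harmonic mean of precision and recall
    """
    true_positives = len(extracted & relevant)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical page and hand-labelled ground truth (assumptions for illustration).
sample_html = '<a href="/a">A</a><a href="/b">B</a><p>text</p><a href="/ad">Ad</a>'
relevant_links = {"/a", "/b", "/c", "/d"}

extracted_links = extract_links(sample_html)
precision, recall, f1 = precision_recall_f1(extracted_links, relevant_links)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here the metrics are fractions between 0 and 1. The abstract does not state the scale of the figures it reports, so this sketch should not be read as reproducing them.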