Abstract: In this paper, we propose a machine learning approach to title extraction from general documents. By general documents, we mean documents that can belong to any one of a number of specific genres, including presentations, book chapters, technical papers, brochures, reports, and letters. Previous methods have mainly addressed title extraction from research papers, and it has been unclear whether automatic title extraction from general documents is feasible. As a case study, we consider extraction from Office documents, specifically Word and PowerPoint. In our approach, we annotate titles in sample documents (for Word and PowerPoint separately) and use them as training data, train machine learning models, and perform title extraction with the trained models. Our method is unique in that it mainly uses formatting information, such as font size, as features in the models. We find that formatting information leads to quite accurate extraction from general documents: in an experiment on intranet data, precision and recall are 0.810 and 0.837 for Word and 0.875 and 0.895 for PowerPoint. Other important new findings are that models trained in one domain can be applied to other domains and, more surprisingly, that models trained in one language can be applied to other languages. Moreover, using the extracted titles significantly improves search ranking results in document retrieval.
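As a rough illustration of the pipeline described in the abstract (annotate titles, train a model on formatting features, extract with the trained model, then score with precision and recall), a minimal sketch follows. Everything specific in it is an assumption made for illustration: the abstract names font size as one feature but does not list the full feature set or the model type, so the `Line` record, the other features, the perceptron learner, the toy documents, and the single-line title simplification are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """One line of a document with formatting metadata (hypothetical schema)."""
    text: str
    font_size: float        # in points
    bold: bool
    centered: bool
    index: int              # position in the document, 0 = first line
    is_title: bool = False  # human annotation, used only for training


def features(line, max_font_size):
    """Formatting-only feature vector; no lexical features are used."""
    return [
        1.0,                              # bias term
        line.font_size / max_font_size,   # font size relative to the largest in the doc
        1.0 if line.bold else 0.0,
        1.0 if line.centered else 0.0,
        1.0 / (1 + line.index),           # earlier lines get a larger value
    ]


def score(weights, feats):
    return sum(w * f for w, f in zip(weights, feats))


def train_perceptron(docs, epochs=20):
    """Plain perceptron, standing in for whatever model the authors used."""
    weights = [0.0] * 5
    for _ in range(epochs):
        for doc in docs:
            max_size = max(l.font_size for l in doc)
            for line in doc:
                feats = features(line, max_size)
                predicted = score(weights, feats) > 0
                if predicted != line.is_title:
                    sign = 1.0 if line.is_title else -1.0
                    weights = [w + sign * f for w, f in zip(weights, feats)]
    return weights


def extract_title(weights, doc):
    """Simplification: return the single highest-scoring line as the title."""
    max_size = max(l.font_size for l in doc)
    best = max(doc, key=lambda l: score(weights, features(l, max_size)))
    return best.text


def precision_recall(predicted, gold):
    """Precision and recall over sets of extracted vs. annotated items."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall


# Toy annotated training documents (invented for this sketch).
doc1 = [
    Line("Quarterly Sales Review", 28, True, True, 0, True),
    Line("Prepared by the finance team", 12, False, True, 1),
    Line("Revenue grew in all regions.", 11, False, False, 2),
]
doc2 = [
    Line("CONFIDENTIAL", 9, False, False, 0),
    Line("Project Plan", 24, True, True, 1, True),
    Line("This document describes milestones.", 11, False, False, 2),
]
weights = train_perceptron([doc1, doc2])

# Unseen document: the title is not the first line.
test_doc = [
    Line("Internal memo", 10, False, False, 0),
    Line("Travel Policy Update", 22, True, True, 1),
    Line("Effective next month, bookings change.", 11, False, False, 2),
]
print(extract_title(weights, test_doc))  # prints "Travel Policy Update"
```

Because the features describe only layout and not words, a model of this kind has no direct dependence on vocabulary, which is one plausible reading of why the abstract reports that models transfer across domains and languages.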
