Abstract: With the rapid development of the internet over the past decade, efficiently extracting valuable information from vast resources has become increasingly important. This capability is crucial for establishing a comprehensive digital ecosystem, particularly for research surveys and comprehension. These tasks rest on accurate extraction and deep mining of data from scientific documents, which are essential for building a robust data infrastructure. However, parsing raw data and extracting data from complex scientific documents remain ongoing challenges. Current data extraction methods for scientific documents typically use rule-based (RB) or machine learning (ML) approaches. Rule-based methods can incur high coding costs for articles with intricate typesetting. Conversely, relying solely on machine learning methods requires annotating complex content types within the scientific document, which can be costly. Additionally, few studies have thoroughly defined and explored the hierarchical layout within scientific documents. The lack of a comprehensive definition of the internal structure and elements of documents indirectly affects the accuracy of text classification and object recognition tasks. By analyzing the standard layout and typesetting used in the specified publication, we propose a new document layout analysis framework called CTBR (Compartment & Text Blocks Refinement). First, we divide scientific documents into hierarchical levels: base domain, compartment, and text blocks. Next, we conduct an in-depth exploration and classification of the meanings of text blocks. Finally, we use the results of text block classification to implement object recognition within scientific documents based on rule-based compartment segmentation.
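
The three-level hierarchy described in the abstract (base domain, compartment, text blocks) can be pictured with a minimal Python sketch. This is not the CTBR implementation: the class names, the label set ("body", "caption"), the keyword-based classifier, and the vertical-gap segmentation rule with its `max_gap` threshold are all assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# (x0, y0, x1, y1) with y growing downward; an assumed coordinate convention.
BBox = Tuple[float, float, float, float]


@dataclass
class TextBlock:
    text: str
    bbox: BBox
    label: str = "unknown"  # hypothetical labels: "body", "caption"


@dataclass
class Compartment:
    blocks: List[TextBlock] = field(default_factory=list)


@dataclass
class BaseDomain:
    compartments: List[Compartment] = field(default_factory=list)


def classify_block(block: TextBlock) -> str:
    """Toy stand-in for text block classification (an assumed keyword rule)."""
    head = block.text.lstrip().lower()
    if head.startswith(("fig.", "figure", "table")):
        return "caption"
    return "body"


def segment_compartments(blocks: List[TextBlock], max_gap: float = 20.0) -> BaseDomain:
    """Group blocks top to bottom, opening a new compartment when the
    vertical gap to the previous block exceeds max_gap (assumed rule)."""
    domain = BaseDomain()
    prev_bottom = None
    for block in sorted(blocks, key=lambda b: b.bbox[1]):
        block.label = classify_block(block)
        gap_too_large = prev_bottom is not None and block.bbox[1] - prev_bottom > max_gap
        if not domain.compartments or gap_too_large:
            domain.compartments.append(Compartment())
        domain.compartments[-1].blocks.append(block)
        prev_bottom = block.bbox[3]
    return domain


def find_object_compartments(domain: BaseDomain) -> List[Compartment]:
    """Treat any compartment that contains a caption block as a candidate object."""
    return [c for c in domain.compartments
            if any(b.label == "caption" for b in c.blocks)]


if __name__ == "__main__":
    page = [
        TextBlock("Introduction text", (0, 0, 100, 10)),
        TextBlock("More body text", (0, 12, 100, 22)),
        TextBlock("Figure 1: Pipeline overview", (0, 200, 100, 210)),
    ]
    domain = segment_compartments(page)
    for compartment in find_object_compartments(domain):
        print([b.text for b in compartment.blocks])
```

The point of the sketch is the ordering suggested by the abstract: text blocks are classified first, and the resulting labels are then used together with rule-based compartment boundaries to locate objects such as figures or tables. A real system would presumably replace both toy rules with the publication-specific layout rules and a trained classifier.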

Multi-documents Automatic Abstracting Based on Text Clustering and Semantic Analysis

Approach for multi-dimensional associated heterogeneous engineering document semantic retrieval

Document Clustering Based on Semantic Smoothing Approach

Multi-page Document Analysis Based on Format Consistency and Clustering

Multi-document Chinese Name Disambiguation Based on Latent Semantic Analysis

An automatic approach for efficient text segmentation

Automatic Multi-Document Summarization for Digital Libraries

Study on Topic Partition in Automatic Abstracting System

Clustering-based Semantic Retrieval Algorithm

Design and development of a concept-based multi-document summarization system for research abstracts

DocStruct: A Multimodal Method to Extract Hierarchy Structure in Document for General Form Understanding

A Semantic approach for effective document clustering using WordNet

Object Recognition from Scientific Document based on Compartment Refinement Framework

Automatic Decomposition of Multi-Author Documents Using Grammar Analysis

Using a Double Clustering Approach to Build Extractive Multi-document Summaries

Semantic Smoothing for Model-based Document Clustering

Abstract Meaning Representation for Multi-Document Summarization

A Semantics Enabled Intelligent Semi-structured Document Processor

Document Clustering Based on Word Sense Cluster

Topic-Centric Unsupervised Multi-Document Summarization of Scientific and News Articles

Automatic Text Summarization Method Based on Improved TextRank Algorithm and K-Means Clustering