Abstract: Simple linear models, which usually learn word-level representations that are later combined to form document representations, have recently shown impressive performance. To improve document-level classification, it is crucial to understand the factors that affect the quality of the document vector. In this paper, we propose the concept of containers and explore the properties of word containers and document containers through experiments and theoretical demonstrations. We find that the document container has a fixed capacity: a document vector obtained by simply averaging too many word embeddings cannot be fully held by the container and loses some semantic and syntactic information on very large text datasets. We also propose an efficient approach to document representation that uses clustering algorithms to divide a document container into several subcontainers and establishes the relationships between the subcontainers. We additionally report and discuss the properties of two clustering-based methods, DVEM-Kmeans and DVEM-Random, on large text datasets for sentiment analysis and topic classification tasks. The results show that, compared to simple linear models, our models outperform the existing state of the art in generating high-quality document representations for document-level classification relatedness tasks. Our approaches can also be introduced into other neural-network-based models, such as convolutional neural networks, recurrent neural networks and generative adversarial networks, in supervised or semisupervised settings.
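The contrast between a single averaged document vector and a document split into subcontainers can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not say how words are assigned to subcontainers or how the subcontainers are related, so the random assignment (loosely in the spirit of the name DVEM-Random), the concatenation of per-group averages, and all function names below are assumptions made for illustration only.

```python
import random

def average(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def random_subcontainers(vectors, k, seed=0):
    """Assumption: split the word vectors into k groups by random assignment."""
    rng = random.Random(seed)
    indices = list(range(len(vectors)))
    rng.shuffle(indices)
    # Deal the shuffled indices round-robin, so no group is empty
    # as long as len(vectors) >= k.
    return [[vectors[i] for i in indices[g::k]] for g in range(k)]

def document_vector(vectors, k):
    """Assumption: concatenate the per-subcontainer averages."""
    doc = []
    for group in random_subcontainers(vectors, k):
        doc.extend(average(group))
    return doc

# Toy 2-dimensional "word embeddings" for a four-word document.
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]

flat = average(words)                 # one container: a single 2-d vector
split = document_vector(words, k=2)   # two subcontainers: one 4-d vector
```

With one container, all four words collapse into a single 2-dimensional vector. With two subcontainers, the document keeps two separate averages, so the representation has twice as many dimensions available for the same words. A k-means assignment could replace the random split in `random_subcontainers` without changing the rest of the sketch.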

Paragraph Vector Representation Based on Word to Vector and CNN Learning

Knowledge-based Document Embedding for Cross-Domain Text Classification

Topical Paragraph Vector learning

A Hierarchical Neural Autoencoder for Paragraphs and Documents

New Generation Model of Word Vector Representation Based on CBOW or Skip-Gram

Generative Paragraph Vector

Sentence Vector Model Based on Implicit Word Vector Expression

Paragraph Vector Based Topic Model for Language Model Adaptation

Spherical Paragraph Model

Comprehensive Relation Modelling for Image Paragraph Generation

The Deep Learning Word Vector Model Using Part of Speech and Sentiment Information

Analysis of the Paragraph Vector Model for Information Retrieval

Document Vector Extension for Documents Classification

Context Vector Model for Document Representation: A Computational Study

Paragraph Generation Network with Visual Relationship Detection

Learning Task Specific Distributed Paragraph Representations Using a 2-Tier Convolutional Neural Network

Learning Document Embeddings by Predicting N-grams for Sentiment Classification of Long Movie Reviews

Dual-CNN: A Convolutional Language Decoder for Paragraph Image Captioning

Three Convolutional Neural Network-based Models for Learning Sentiment Word Vectors Towards Sentiment Analysis

Document Classification Based on Word Vectors

Semantic-aware network embedding via optimized random walk and paragaraph2vec