Abstract: We introduce sub-sentence encoder, a contrastively-learned contextual embedding model for fine-grained semantic representation of text. In contrast to the standard practice with sentence embeddings, where the meaning of an entire sequence of text is encoded into a fixed-length vector, the sub-sentence encoder learns to produce distinct contextual embeddings corresponding to different atomic propositions, i.e. atomic units of meaning expressed within a text sequence. The sub-sentence embeddings are contrastively learned to recognize (inferred) semantic equivalence between propositions across different text sequences. Our experiments show the effectiveness of sub-sentence encoders in applications, such as retrieving supporting facts for fine-grained text attribution or recognizing the conditional semantic similarity between texts. In practice, we demonstrate that sub-sentence encoders keep the same level of inference cost and space complexity compared to sentence encoders.

What problem does this paper attempt to address?

The paper addresses the challenge of fine-grained semantic representation of text: how to represent the individual atomic propositions expressed within a sentence (i.e., the smallest units of meaning in a text sequence) and recognize (inferred) semantic equivalence between such propositions across different text sequences. Traditional methods encode the entire text sequence into a single fixed-length vector. Although this yields a unified and compact semantic representation, it is difficult to query a fixed-dimension sentence embedding for finer-grained semantic information or structure. For example, two sentences that differ in overall meaning may still share meaning at the level of some atomic propositions. In the example in Figure 1, both sentences agree that "Dracula is a novel" and "Dracula was published in the 19th century". To address this, the paper proposes the sub-sentence encoder, a contextual embedding model trained with contrastive learning for fine-grained semantic representation of text. The sub-sentence encoder produces a distinct contextual embedding for each atomic proposition and learns, through contrastive training, to recognize (inferred) semantic equivalence between propositions across different text sequences. This makes it useful in applications such as retrieving supporting facts for fine-grained text attribution or recognizing conditional semantic similarity between texts. The experiments also indicate that the sub-sentence encoder stays at the same level of inference cost and space complexity as a sentence encoder.
In conclusion, the paper aims to overcome the limitations of existing sentence-embedding methods in fine-grained semantic representation by introducing the sub-sentence encoder, which offers a way to encode and index text at the level of individual propositions, particularly in application scenarios such as long-text evaluation, attribution, or factuality estimation.
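
The idea summarized above, one embedding per proposition rather than one per sentence, trained contrastively, can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes that each proposition is marked by a binary mask over the sentence's tokens, that a proposition embedding is the masked mean of contextual token embeddings, and that training uses an InfoNCE-style loss with in-batch negatives. The function names, the temperature value, and the random stand-in token embeddings are all hypothetical.

```python
import numpy as np


def masked_mean_pool(token_embeddings, proposition_masks):
    """Pool (T, D) token embeddings into (P, D) proposition embeddings.

    proposition_masks is a (P, T) binary matrix; row p marks the tokens
    assumed to express proposition p (an assumption of this sketch).
    """
    masks = np.asarray(proposition_masks, dtype=float)
    summed = masks @ token_embeddings                      # (P, D)
    counts = np.clip(masks.sum(axis=1, keepdims=True), 1.0, None)
    return summed / counts


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def info_nce_loss(anchors, positives, temperature=0.05):
    """InfoNCE with in-batch negatives: row i of `positives` is the match
    for row i of `anchors`; every other row acts as a negative."""
    a = l2_normalize(anchors)
    p = l2_normalize(positives)
    logits = (a @ p.T) / temperature                       # (N, N)
    logits = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


# Stand-in for the contextual token embeddings of one 6-token sentence.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 8))

# Two hypothetical propositions that share the first token.
masks = np.array([
    [1, 1, 1, 0, 0, 0],
    [1, 0, 0, 1, 1, 1],
])
props = masked_mean_pool(tokens, masks)   # one vector per proposition

matched_loss = info_nce_loss(props, props)           # correct pairing
mismatched_loss = info_nce_loss(props, props[::-1])  # swapped pairing
```

Under these assumptions the sentence is encoded once and every proposition embedding is then read off the same token embeddings with a different mask. That is one way an approach of this kind could keep inference cost close to that of a sentence encoder.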

Sub-Sentence Encoder: Contrastive Learning of Propositional Semantic Representations

OssCSE: Overcoming Surface Structure Bias in Contrastive Learning for Unsupervised Sentence Embedding

A Sentence is Worth 128 Pseudo Tokens: A Semantic-Aware Contrastive Learning Framework for Sentence Embeddings

Non-Linguistic Supervision for Contrastive Learning of Sentence Embeddings

InfoCSE: Information-aggregated Contrastive Learning of Sentence Embeddings

Contrastive Learning Models for Sentence Representations

DiffCSE: Difference-based Contrastive Learning for Sentence Embeddings

Contrastive Learning with Prompt-derived Virtual Semantic Prototypes for Unsupervised Sentence Embedding

Unsupervised Sentence Embedding Model Based on Contrastive Learning

Conceptual Sentence Embeddings

A Contrastive Framework to Enhance Unsupervised Sentence Representation Learning

Generate, Discriminate and Contrast: A Semi-Supervised Sentence Representation Learning Framework

CLSESSP: Contrastive learning of sentence embedding with strong semantic prototypes

SNCSE: Contrastive Learning for Unsupervised Sentence Embedding with Soft Negative Samples

CMLM-CSE: Based on Conditional MLM Contrastive Learning for Sentence Embeddings

Bridging Continuous and Discrete Spaces: Interpretable Sentence Representation Learning via Compositional Operations

Self-Adaptive Reconstruction with Contrastive Learning for Unsupervised Sentence Embeddings

Distilling Semantic Concept Embeddings from Contrastively Fine-Tuned Language Models

Sparse Contrastive Learning of Sentence Embeddings

Enhancing Multilingual Universal Sentence Embeddings by Monolingual Contrastive Learning

A Sequential Neural Encoder with Latent Structured Description for Modeling Sentences