Sub-Sentence Encoder: Contrastive Learning of Propositional Semantic Representations

Sihao Chen, Hongming Zhang, Tong Chen, Ben Zhou, Wenhao Yu, Dian Yu, Baolin Peng, Hongwei Wang, Dan Roth, Dong Yu
2023-11-08
Abstract: We introduce sub-sentence encoder, a contrastively-learned contextual embedding model for fine-grained semantic representation of text. In contrast to the standard practice with sentence embeddings, where the meaning of an entire sequence of text is encoded into a fixed-length vector, the sub-sentence encoder learns to produce distinct contextual embeddings corresponding to different atomic propositions, i.e. atomic units of meaning expressed within a text sequence. The sub-sentence embeddings are contrastively learned to recognize (inferred) semantic equivalence between propositions across different text sequences. Our experiments show the effectiveness of sub-sentence encoders in applications, such as retrieving supporting facts for fine-grained text attribution or recognizing the conditional semantic similarity between texts. In practice, we demonstrate that sub-sentence encoders keep the same level of inference cost and space complexity compared to sentence encoders.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses fine-grained semantic representation of text: how to represent the individual atomic propositions expressed in a text sequence (its smallest units of meaning) and how to recognize (inferred) semantic equivalence between propositions that appear in different text sequences. The standard practice encodes an entire text sequence into a single fixed-length vector. That gives a unified and compact representation, but a fixed-dimension sentence embedding is hard to query for finer-grained semantic information or structure. For example, two sentences whose overall meanings differ may still agree at the level of individual atomic propositions. In the example in Figure 1, both sentences agree that "Dracula is a novel" and that "Dracula was published in the 19th century".
To address this, the paper proposes the sub-sentence encoder, a contextual embedding model trained with contrastive learning. Instead of one vector per sequence, it produces a distinct contextual embedding for each atomic proposition in the sequence, and the contrastive objective trains these embeddings to recognize (inferred) semantic equivalence between propositions across different text sequences. The paper reports that this is effective in applications such as retrieving supporting facts for fine-grained text attribution and recognizing conditional semantic similarity between texts. The authors also demonstrate that, in practice, sub-sentence encoders keep the same level of inference cost and space complexity as sentence encoders.
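To make the idea of "one embedding per proposition" concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: it assumes each proposition is marked by a 0/1 mask over the tokens of its sentence, that a proposition embedding is the normalized mean of the masked token vectors, and that training uses an InfoNCE-style loss with in-batch negatives. The function names (`pool_proposition`, `cosine`, `info_nce`), the temperature value, and the toy 3-dimensional token vectors are all hypothetical; a real model would take its token vectors from a trained contextual encoder.

```python
import math


def pool_proposition(token_vecs, mask):
    """Mean-pool the token vectors selected by a 0/1 mask, then L2-normalize.

    Assumption: a proposition is identified by the tokens that express it.
    """
    selected = [v for v, m in zip(token_vecs, mask) if m]
    if not selected:
        raise ValueError("mask selects no tokens")
    dim = len(selected[0])
    mean = [sum(v[i] for v in selected) / len(selected) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean)) or 1.0
    return [x / norm for x in mean]


def cosine(u, v):
    """Cosine similarity of two unit-length vectors (a plain dot product)."""
    return sum(a * b for a, b in zip(u, v))


def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE loss with in-batch negatives.

    anchors[i] should match positives[i]; every other positives[j] in the
    batch serves as a negative for anchors[i].
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        top = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = top + math.log(sum(math.exp(l - top) for l in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


# Toy, hand-made "contextual token vectors" for two sentences about Dracula.
tokens_a = [[1.0, 0.0, 0.0], [1.0, 0.2, 0.0], [0.0, 1.0, 0.0], [0.0, 1.0, 0.2]]
tokens_b = [[0.9, 0.1, 0.0], [0.0, 0.9, 0.1], [0.0, 0.0, 1.0]]

# One sentence yields several embeddings, one per proposition mask.
a_novel = pool_proposition(tokens_a, [1, 1, 0, 0])    # "Dracula is a novel"
a_century = pool_proposition(tokens_a, [0, 0, 1, 1])  # "published in the 19th century"
b_novel = pool_proposition(tokens_b, [1, 0, 0])
b_century = pool_proposition(tokens_b, [0, 1, 0])

# Equivalent propositions paired up versus deliberately mismatched pairs.
aligned_loss = info_nce([a_novel, a_century], [b_novel, b_century])
misaligned_loss = info_nce([a_novel, a_century], [b_century, b_novel])
print(aligned_loss < misaligned_loss)
```

With these toy vectors, the loss is lower when each proposition from the first sentence is paired with its equivalent from the second. That is the signal a contrastive objective uses to pull equivalent propositions together and push unrelated ones apart. Because each proposition embedding is pooled from token vectors that are computed once per sentence, this sketch is also consistent with the claim that inference cost stays comparable to that of a sentence encoder.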
In conclusion, the paper aims to address the limitations of existing sentence-embedding methods in fine-grained semantic representation. By introducing the sub-sentence encoder, it offers a way to encode and index text at the level of individual propositions, which is particularly relevant to applications such as long-form text evaluation, attribution, and factuality estimation.