D2LLM: Decomposed and Distilled Large Language Models for Semantic Search

Zihan Liao, Hang Yu, Jianguo Li, Jun Wang, Wei Zhang

2024-06-25

Abstract: The key challenge in semantic search is to create models that are both accurate and efficient in pinpointing relevant sentences for queries. While BERT-style bi-encoders excel in efficiency with pre-computed embeddings, they often miss subtle nuances in search tasks. Conversely, GPT-style LLMs with cross-encoder designs capture these nuances but are computationally intensive, hindering real-time applications. In this paper, we present D2LLM (Decomposed and Distilled LLMs for semantic search), which combines the best of both worlds. We decompose a cross-encoder into an efficient bi-encoder integrated with Pooling by Multihead Attention and an Interaction Emulation Module, achieving nuanced understanding and pre-computability. Knowledge from the LLM is distilled into this model using contrastive, rank, and feature imitation techniques. Our experiments show that D2LLM surpasses five leading baselines in terms of all metrics across three tasks, particularly improving NLI task performance by at least 6.45%. The source code is available at https://github.com/codefuse-ai/D2LLM.

Computation and Language

What problem does this paper attempt to address?

The paper addresses the difficulty of building semantic search models that are both accurate and efficient. BERT-style bi-encoders are efficient because their embeddings can be pre-computed, but they often miss the subtle nuances of search tasks. GPT-style large language models (LLMs) used as cross-encoders capture these nuances, but they are computationally expensive and hard to deploy in real-time scenarios. The paper therefore proposes D2LLM (Decomposed and Distilled Large Language Models), which combines the advantages of both approaches. D2LLM decomposes a cross-encoder into an efficient bi-encoder (integrated with Pooling by Multihead Attention) and an Interaction Emulation Module, and distills the knowledge of the LLM into this model using contrastive, rank, and feature imitation techniques. The reported experiments show that D2LLM outperforms five baseline methods on all metrics across three tasks, with the largest gain on the natural language inference (NLI) task, where performance improves by at least 6.45%.
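The decomposition described above can be illustrated with a minimal, dependency-free Python sketch. It is not the authors' implementation: the function names (`pma_pool`, `interaction_score`), the single shared learnable query, the one-hidden-layer MLP, and the random weights are all assumptions made for illustration, and the projection and feed-forward layers of a full attention-pooling block are omitted. The sketch only shows the general shape of the idea: token vectors are pooled into one embedding per text by attention against a learnable query, passage embeddings are computed once offline, and a small module scores each query-passage pair from the two embeddings.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def pma_pool(tokens, query, num_heads):
    """Pool token vectors into one vector via multi-head attention.

    tokens: list of d-dimensional vectors (stand-ins for LLM hidden states).
    query:  d-dimensional learnable query vector (hypothetical parameter).
    Each head attends over its own slice of the dimensions.
    """
    d = len(query)
    assert d % num_heads == 0, "dimension must be divisible by num_heads"
    head_dim = d // num_heads
    pooled = []
    for h in range(num_heads):
        lo, hi = h * head_dim, (h + 1) * head_dim
        q = query[lo:hi]
        # Scaled dot-product score between the query slice and each token slice.
        scores = [
            sum(a * b for a, b in zip(q, t[lo:hi])) / math.sqrt(head_dim)
            for t in tokens
        ]
        weights = softmax(scores)
        # Attention-weighted average of the token slices for this head.
        for j in range(lo, hi):
            pooled.append(sum(w * t[j] for w, t in zip(weights, tokens)))
    return pooled


def interaction_score(q_emb, p_emb, w1, b1, w2, b2):
    """Toy interaction module: an MLP over the concatenated embeddings.

    Returns a relevance probability in (0, 1). The layer sizes and the
    ReLU/sigmoid choice are assumptions, not taken from the paper.
    """
    x = q_emb + p_emb  # list concatenation
    hidden = [
        max(0.0, sum(w * v for w, v in zip(row, x)) + b)
        for row, b in zip(w1, b1)
    ]
    logit = sum(w * h for w, h in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-logit))


rng = random.Random(0)
d, heads, hidden_dim = 8, 2, 4


def rand_vec(n):
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


# Hypothetical parameters; in practice these would be learned by distillation.
learned_query = rand_vec(d)
w1 = [rand_vec(2 * d) for _ in range(hidden_dim)]
b1 = rand_vec(hidden_dim)
w2 = rand_vec(hidden_dim)
b2 = 0.0

# Offline: pool every passage once and store the embedding.
passages = [[rand_vec(d) for _ in range(5)] for _ in range(3)]
passage_embs = [pma_pool(p, learned_query, heads) for p in passages]

# Online: pool the query, then score it against the stored embeddings.
query_tokens = [rand_vec(d) for _ in range(4)]
q_emb = pma_pool(query_tokens, learned_query, heads)
scores = [interaction_score(q_emb, p, w1, b1, w2, b2) for p in passage_embs]
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

The point of this layout is that only `interaction_score` runs per query-passage pair at search time, and it operates on two fixed-size vectors rather than on the full token sequences, which is what makes pre-computation of passage embeddings possible.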
