Abstract: The advancement of transformer neural networks has significantly elevated the capabilities of sentence similarity models, but they struggle with highly discriminative tasks and produce sub-optimal representations of important documents like scientific literature. With the increased reliance on retrieval augmentation and search, representing diverse documents as concise and descriptive vectors is crucial. This paper improves upon the vector embeddings of scientific literature by assembling niche datasets using co-citations as a similarity metric, focusing on biomedical domains. We apply a novel Mixture of Experts (MoE) extension pipeline to pretrained BERT models, where every multi-layer perceptron section is enlarged and copied into multiple distinct experts. Our MoE variants perform well over $N$ scientific domains with $N$ dedicated experts, whereas standard BERT models excel in only one domain. Notably, extending just a single transformer block to MoE captures 85% of the benefit seen from full MoE extension at every layer. This holds promise for versatile and efficient One-Size-Fits-All transformer networks for numerically representing diverse inputs. Our methodology marks significant advancements in representing scientific text and holds promise for enhancing vector database search and compilation.
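
The MoE extension mentioned in the abstract, where a pretrained multi-layer perceptron section is copied into multiple distinct experts, can be pictured with a small sketch. This is an illustration under stated assumptions, not the authors' pipeline: `DenseMLP` and `MoEBlock` are hypothetical names, the block uses ReLU for brevity, the "enlarging" step is omitted, and routing by a caller-supplied `domain_id` is an assumption about how $N$ experts map to $N$ domains.

```python
import copy
import random


class DenseMLP:
    """Tiny two-layer feed-forward block, a stand-in for one pretrained MLP section."""

    def __init__(self, d_model, d_hidden, seed=0):
        rng = random.Random(seed)
        # Random weights stand in for pretrained parameters.
        self.w_in = [[rng.uniform(-0.1, 0.1) for _ in range(d_model)]
                     for _ in range(d_hidden)]
        self.w_out = [[rng.uniform(-0.1, 0.1) for _ in range(d_hidden)]
                      for _ in range(d_model)]

    def __call__(self, x):
        # Hidden layer with ReLU (an assumption made for simplicity).
        hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)))
                  for row in self.w_in]
        return [sum(w * h for w, h in zip(row, hidden)) for row in self.w_out]


class MoEBlock:
    """Replaces one dense MLP with several experts initialised as copies of it."""

    def __init__(self, dense_mlp, num_experts):
        # Each expert starts from the same weights but is an independent copy,
        # so later fine-tuning can specialise the experts separately.
        self.experts = [copy.deepcopy(dense_mlp) for _ in range(num_experts)]

    def __call__(self, x, domain_id):
        # Hard routing: one expert per domain (assumed, not taken from the paper).
        return self.experts[domain_id](x)


dense = DenseMLP(d_model=4, d_hidden=8)
moe = MoEBlock(dense, num_experts=3)
token = [0.5, -0.2, 0.1, 0.9]
# Before any fine-tuning, every expert reproduces the original dense output.
print(moe(token, domain_id=1) == dense(token))
```

Because every expert starts as an exact copy, the extended model initially behaves like the original one; differences between experts would only appear after domain-specific training.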

What problem does this paper attempt to address?

The main problem this paper addresses is the inadequacy of existing large language models (LLMs) at generating reliable vector embeddings and performing precise classification, especially in information retrieval and web search. Although transformer-based language models have been widely adopted since 2017, they still struggle with highly discriminative tasks, particularly for documents that require high-precision representation, such as scientific literature. The paper points out that current sentence similarity models, despite breakthroughs in areas like sentiment analysis, handle subtle differences within a specific domain poorly, which leaves many important documents with suboptimal representations.

To address this, the paper combines contrastive learning with Mixture of Experts (MoE) to extend pre-trained BERT models and improve vector embeddings of scientific literature. The approach has two parts:

1. **Domain-specific fine-tuning**: Using co-citation as a similarity measure, the pre-trained BERT model is fine-tuned contrastively so that it learns a specific scientific domain.
2. **General applicability through Mixture of Experts**: A scalable method applies MoE to pre-trained BERT models across multiple domains, with the aim of creating a versatile "one-size-fits-all" model.

Experimental results show that the proposed models significantly outperform general pre-trained models, fine-tuned sentence similarity models, and science-oriented BERT models in multiple biomedical fields. The MoE variant achieves performance comparable to multiple independent models across various domains, suggesting that a "one-size-fits-all" transformer network might be feasible for certain tasks.

The methodology marks a significant advancement in representing scientific texts and promises to enhance the search and compilation capabilities of vector databases. These models have significant implications for applications that rely on precise text classification and vector embeddings, such as information retrieval and web search.
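
Contrastive fine-tuning on co-citation pairs can be sketched with a common contrastive objective. The example below uses InfoNCE with in-batch negatives. That choice of loss is an assumption made for illustration, since this summary does not state which loss the paper uses. `cosine`, `info_nce_loss`, and the `temperature` value are hypothetical names and settings, and the toy vectors stand in for BERT embeddings of co-cited papers.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(anchors, positives, temperature=0.05):
    """Mean InfoNCE loss where positives[i] is the co-cited partner of anchors[i].

    Every other positive in the batch acts as a negative for anchor i.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Log-sum-exp with max subtraction for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Negative log-probability of picking the true co-cited partner.
        total += log_denominator - logits[i]
    return total / len(anchors)


# Toy embeddings: each anchor is close to its own co-cited partner.
anchors = [[1.0, 0.1], [0.1, 1.0]]
co_cited = [[0.9, 0.0], [0.0, 0.8]]
print(info_nce_loss(anchors, co_cited))
```

The loss is small when each paper's embedding is closer to its co-cited partner than to the other papers in the batch, and it grows when the pairs are confused, which is the signal that would push co-cited papers together during fine-tuning.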

Contrastive Learning and Mixture of Experts Enables Precise Vector Embeddings

Mixture of A Million Experts

Monet: Mixture of Monosemantic Experts for Transformers

Multilinear Mixture of Experts: Scalable Expert Specialization through Factorization

Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models

EvoMoE: An Evolutional Mixture-of-Experts Training Framework via Dense-To-Sparse Gate

Nexus: Specialization meets Adaptability for Efficiently Training Mixture of Experts

Investigating the potential of Sparse Mixtures-of-Experts for multi-domain neural machine translation

CompeteSMoE -- Effective Training of Sparse Mixture of Experts via Competition

Unchosen Experts Can Contribute Too: Unleashing MoE Models' Power by Self-Contrast

From Sparse to Soft Mixtures of Experts

Upcycling Large Language Models into Mixture of Experts

Your Mixture-of-Experts LLM Is Secretly an Embedding Model For Free

Residual Mixture of Experts

Towards Being Parameter-Efficient: A Stratified Sparsely Activated Transformer with Dynamic Capacity

MoEUT: Mixture-of-Experts Universal Transformers

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Approximating Two-Layer Feedforward Networks for Efficient Transformers

Beyond Distillation: Task-level Mixture-of-Experts for Efficient Inference

Mixture of Parrots: Experts improve memorization more than reasoning