Sparse Meets Dense: A Hybrid Approach to Enhance Scientific Document Retrieval

Priyanka Mandikal, Raymond Mooney
2024-01-09
Abstract: Traditional information retrieval is based on sparse bag-of-words vector representations of documents and queries. More recent deep-learning approaches have used dense embeddings learned using a transformer-based large language model. We show that on a classic benchmark for scientific document retrieval in the medical domain of cystic fibrosis, both of these models perform roughly equivalently. Notably, dense vectors from the state-of-the-art SPECTER2 model do not significantly enhance performance. However, a hybrid model that we propose, combining these methods, yields significantly better results, underscoring the merits of integrating classical and contemporary deep learning techniques in information retrieval in the domain of specialized scientific documents.
Information Retrieval
What problem does this paper attempt to address?
The paper addresses how to improve the accuracy and efficiency of scientific literature retrieval in the medical field. Specifically, the authors compare traditional sparse vector representations (TF/IDF-weighted bag-of-words models) with modern dense embeddings (the transformer-based SPECTER2 model) for literature retrieval, and propose a hybrid model that combines the two to achieve better retrieval results.

### Main Research Questions:
1. **Performance Comparison of Traditional Sparse Vector Representations and Modern Dense Embedding Methods**: The authors evaluated both methods on the Cystic Fibrosis dataset and found that they performed roughly equivalently, with the traditional sparse representation slightly ahead in some cases.
2. **Performance Improvement of the Hybrid Model**: The authors proposed a hybrid model that combines sparse vector representations and dense embeddings, and showed experimentally that it has a significant advantage on precision/recall (PR) and normalized discounted cumulative gain (NDCG) metrics.

### Research Background:
- **Traditional Information Retrieval Methods**: Based on sparse vector representations, such as TF/IDF-weighted bag-of-words models. These methods are simple and effective, and were widely used in early information retrieval systems.
- **Modern Deep Learning Methods**: Based on dense embeddings, such as those produced by the transformer-based SPECTER2 model. These methods can capture the semantics and context of text, and usually perform well at text representation.

### Research Methods:
- **Dataset**: The Cystic Fibrosis (CF) database, which contains 1,239 documents and 100 queries.
- **Models**:
  - **Sparse Retrieval Model**: The classic TF/IDF-weighted bag-of-words model.
  - **Dense Retrieval Model**: The SPECTER2 model, used to generate dense embeddings for documents and queries.
  - **Hybrid Retrieval Model**: Combines the sparse and dense representations, computing query-document similarity as a weighted combination of the two.

### Experimental Results:
- **Performance Comparison**: The hybrid model significantly outperformed the individual sparse and dense models on PR and NDCG metrics.
- **Hyperparameter Optimization**: Tuning the weight λ that balances the sparse and dense representations showed that the hybrid model performed best at λ=0.8.

### Conclusion:
- **Complementarity of Traditional and Modern Methods**: Although modern dense embedding methods perform well on some tasks, traditional sparse vector representations remain competitive in specialized fields such as medical literature retrieval.
- **Advantages of the Hybrid Model**: Combining sparse and dense representations draws on the strengths of both to improve the accuracy and efficiency of literature retrieval.

### Significance:
- **Theoretical Contribution**: Offers a new perspective for information retrieval research, emphasizing the value of combining traditional and modern methods for specific tasks.
- **Practical Application**: Offers practical guidance for building more effective literature retrieval systems, especially in specialized fields such as medical research.
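
### Illustrative Sketch:
The weighted combination used by the hybrid retrieval model can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the dense vectors are placeholder lists standing in for SPECTER2 embeddings (which are not computed here), the IDF smoothing is one common variant, and the summary does not say which term λ multiplies, so the sketch assumes λ weights the sparse similarity and 1 − λ the dense one.

```python
import math
from collections import Counter


def build_idf(doc_tokens):
    """Compute IDF weights (log(N / df) + 1) from tokenized documents."""
    n_docs = len(doc_tokens)
    doc_freq = Counter(term for tokens in doc_tokens for term in set(tokens))
    return {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """Sparse TF-IDF vector as a dict; terms unseen in the corpus are dropped."""
    counts = Counter(tokens)
    return {t: tf * idf[t] for t, tf in counts.items() if t in idf}


def cosine_sparse(a, b):
    """Cosine similarity between two dict-based sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0


def cosine_dense(a, b):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_score(sparse_sim, dense_sim, lam=0.8):
    """ASSUMPTION: lam weights the sparse similarity, (1 - lam) the dense one."""
    return lam * sparse_sim + (1.0 - lam) * dense_sim


def rank_documents(query_tokens, query_dense, doc_tokens, doc_dense, lam=0.8):
    """Return document indices sorted by descending hybrid score."""
    idf = build_idf(doc_tokens)
    q_sparse = tfidf_vector(query_tokens, idf)
    scores = []
    for tokens, dense in zip(doc_tokens, doc_dense):
        sparse_sim = cosine_sparse(q_sparse, tfidf_vector(tokens, idf))
        dense_sim = cosine_dense(query_dense, dense)
        scores.append(hybrid_score(sparse_sim, dense_sim, lam))
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy data: two tokenized documents with made-up 2-d "embeddings".
toy_docs = [["cystic", "fibrosis", "lung"], ["heart", "disease"]]
toy_dense = [[1.0, 0.0], [0.0, 1.0]]
ranking = rank_documents(["cystic", "fibrosis"], [1.0, 0.0], toy_docs, toy_dense)
print(ranking)
```

Setting `lam=1.0` reduces the ranking to pure TF/IDF retrieval and `lam=0.0` to pure dense retrieval, so a sweep over `lam` is one plausible way to run the kind of hyperparameter search the summary describes.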