Domain-specific ChatBots for Science using Embeddings

Kevin G. Yager
2023-08-25
Abstract: Large language models (LLMs) have emerged as powerful machine-learning systems capable of handling a myriad of tasks. Tuned versions of these systems have been turned into chatbots that can respond to user queries on a vast diversity of topics, providing informative and creative replies. However, their application to physical science research remains limited owing to their incomplete knowledge in these areas, contrasted with the needs of rigor and sourcing in science domains. Here, we demonstrate how existing methods and software tools can be easily combined to yield a domain-specific chatbot. The system ingests scientific documents in existing formats, and uses text embedding lookup to provide the LLM with domain-specific contextual information when composing its reply. We similarly demonstrate that existing image embedding methods can be used for search and retrieval across publication figures. These results confirm that LLMs are already suitable for use by physical scientists in accelerating their research efforts.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper primarily explores how to utilize existing methods and technologies to build domain-specific chatbots, particularly for applications in physical science research. The main issues the paper attempts to address are:

1. **Enhancing research efficiency**: By integrating existing technologies, the paper aims to construct a chatbot capable of understanding and answering specialized questions in the field of physical sciences, thereby accelerating the scientific research process.
2. **Addressing the limitations of large language models (LLMs) in scientific applications**: Although current large language models are powerful, their knowledge in specific fields like physical sciences is incomplete, making it difficult to meet the demands for precision and source traceability in scientific research.
3. **Overcoming the hallucination problem**: Large language models sometimes generate information that appears reasonable but is actually incorrect (hallucinations). The paper proposes a method to reduce such issues by providing the model with specific document fragments as contextual information.
4. **Avoiding the need to train new models from scratch**: The paper presents a method that does not require retraining large language models. Instead, it uses text embeddings to retrieve relevant document fragments and provides them as contextual information to the model, thereby achieving domain-specific conversational capabilities.
5. **Integrating image retrieval functionality**: In addition to textual information, the paper also discusses how to use image embedding technology to retrieve image data from scientific publications related to user queries, further enriching the chatbot's functionality.

In summary, the paper aims to demonstrate how to quickly build a chatbot that can assist in physical science research using existing technologies and tools, thereby improving the efficiency and quality of researchers' work.
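
The retrieval step described above (embed the document fragments, find those closest to the user query, and hand them to the LLM as context) can be illustrated with a minimal sketch. This is not the paper's implementation: the function names (`embed`, `cosine`, `retrieve`, `build_prompt`) are hypothetical, the sample fragments are invented for illustration, and the bag-of-words vector is only a toy stand-in for a learned text embedding model, chosen so the example runs with the Python standard library alone.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a learned embedding: a sparse word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=2):
    """Return the k document fragments most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(query, chunks, k=2):
    """Assemble an LLM prompt that carries the retrieved fragments as context."""
    excerpts = "\n".join(
        f"[{i + 1}] {c}" for i, c in enumerate(retrieve(query, chunks, k))
    )
    return (
        "Answer using only the excerpts below and cite them by number.\n\n"
        f"Excerpts:\n{excerpts}\n\nQuestion: {query}"
    )


# Invented example fragments (assumption: not taken from the paper).
chunks = [
    "X-ray scattering measures nanoscale structure in thin films.",
    "Block copolymers self-assemble into ordered morphologies.",
    "The cafeteria opens at noon.",
]
query = "How do block copolymers self-assemble?"
prompt = build_prompt(query, chunks, k=2)
```

In a real system the `embed` function would call a trained embedding model and the fragments would be stored in a vector index, but the flow is the same: the model is never retrained, and the numbered excerpts give it sourced material to cite, which is how this approach targets both the hallucination and the traceability concerns. The same nearest-neighbor lookup applies to figure retrieval if text embeddings are swapped for image embeddings.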