NinjaLLM: Fast, Scalable and Cost-effective RAG using Amazon SageMaker and AWS Trainium and Inferentia2

Tengfei Xue, Xuefeng Li, Roman Smirnov, Tahir Azim, Arash Sadrieh, Babak Pahlavan

2024-07-11

Abstract: Retrieval-augmented generation (RAG) techniques are widely used today to retrieve and present information in a conversational format. This paper presents a set of enhancements to traditional RAG techniques, focusing on large language models (LLMs) fine-tuned and hosted on AWS Trainium and Inferentia2 AI chips via SageMaker. These chips are characterized by their elasticity, affordability, and efficient performance for AI compute tasks. Besides enabling deployment on these chips, this work aims to improve tool usage, add citation capabilities, and mitigate the risks of hallucinations and unsafe responses due to context bias. We benchmark our RAG system's performance on the Natural Questions and HotPotQA datasets, achieving an accuracy of 62% and 59% respectively, exceeding other models such as DBRX and Mixtral Instruct.

Computation and Language, Artificial Intelligence

What problem does this paper attempt to address?

This paper addresses several challenges in deploying and applying current Retrieval-Augmented Generation (RAG) systems. Specifically, the researchers propose the following improvements:

1. **Cost-effectiveness**: Reducing the cost of deploying large language models (LLMs) by using AWS Trainium and Inferentia2 chips instead of traditional Nvidia GPUs.
2. **Elastic computing**: Using the elasticity of these chips to scale resources dynamically and meet the needs of real-time applications such as personal assistants.
3. **Model optimization**: Fine-tuning Meta's Llama3-Instruct 70B model to improve its handling of complex queries and the accuracy, citation capability, and safety of its answers.
4. **Performance improvement**: On the Natural Questions and HotPotQA benchmarks, the fine-tuned model, NinjaLLM, achieved 62.22% and 58.84% accuracy, respectively, outperforming other models such as DBRX and Mixtral Instruct.

In summary, the paper aims to build a more efficient, economical, and reliable RAG system that handles complex queries better.
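The accuracy figures above are percentages of benchmark questions answered correctly. As a rough illustration of how such a figure can be computed, the sketch below scores predictions against gold answers. The text above does not say how correctness was judged, so the matching rule here (a normalized gold answer appearing inside the model's response) is an assumption, and the function names `normalize`, `is_correct`, and `accuracy` are hypothetical rather than taken from the paper.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def is_correct(prediction: str, gold_answers: list[str]) -> bool:
    """Assumed criterion: any normalized gold answer occurs in the prediction."""
    pred = normalize(prediction)
    golds = [normalize(g) for g in gold_answers]
    # Skip gold answers that normalize to an empty string (they would match anything).
    return any(g in pred for g in golds if g)


def accuracy(predictions: list[str], references: list[list[str]]) -> float:
    """Return the percentage of predictions judged correct."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(is_correct(p, refs) for p, refs in zip(predictions, references))
    return 100.0 * hits / len(predictions)
```

A substring rule like this is lenient toward long conversational answers, which suits RAG responses but can overcount; stricter exact-match or judged scoring would give different numbers.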

IM-RAG: Multi-Round Retrieval-Augmented Generation Through Learning Inner Monologues

SFR-RAG: Towards Contextually Faithful LLMs

RAGLAB: A Modular and Research-Oriented Unified Framework for Retrieval-Augmented Generation

RAG Foundry: A Framework for Enhancing LLMs for Retrieval Augmented Generation

RAG-DDR: Optimizing Retrieval-Augmented Generation Using Differentiable Data Rewards

SimRAG: Self-Improving Retrieval-Augmented Generation for Adapting Large Language Models to Specialized Domains

T-RAG: Lessons from the LLM Trenches

DomainRAG: A Chinese Benchmark for Evaluating Domain-specific Retrieval-Augmented Generation

RQ-RAG: Learning to Refine Queries for Retrieval Augmented Generation

AssistRAG: Boosting the Potential of Large Language Models with an Intelligent Information Assistant

Accelerating Inference of Retrieval-Augmented Generation via Sparse Context Selection

Learning When to Retrieve, What to Rewrite, and How to Respond in Conversational QA

Introducing Super RAGs in Mistral 8x7B-v1

Retrieval-Augmented Generation for Large Language Models: A Survey

Long-Context LLMs Meet RAG: Overcoming Challenges for Long Inputs in RAG

ERATTA: Extreme RAG for Table To Answers with Large Language Models

Adopting RAG for LLM-Aided Future Vehicle Design

JMLR: Joint Medical LLM and Retrieval Training for Enhancing Reasoning and Professional Question Answering Capability

Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection

Deploying Large Language Models With Retrieval Augmented Generation