An introduction to Retrieval-Augmented Generation (RAG)

Large Language Models (LLMs) are sophisticated artificial intelligence systems designed to understand and generate human-like text. Built with deep learning techniques, these models are trained on large, diverse text datasets, primarily scraped from the web. Training involves processing hundreds of terabytes of data, allowing LLMs to learn the patterns and structures of human language. As a result, they can generate text that closely resembles human writing.

Although LLMs are powerful tools for creating and understanding text, they have some limitations:

  1. Static knowledge: LLMs have a fixed knowledge base up to the date of their training, which means they don't have access to real-time or recently updated information.
  2. Generalisation: They provide generalised answers that may not always be accurate or contextually relevant, especially for specialised questions.
  3. Hallucination: LLMs can sometimes produce plausible-sounding but incorrect answers, known as "hallucinations".

To overcome these limitations, the Retrieval-Augmented Generation (RAG) approach was born.

Retrieval-Augmented Generation approach

Retrieval-Augmented Generation (RAG) provides LLMs with contextual information derived from domain-specific knowledge. RAG is currently being used to power chatbots and Q&A systems.

[Figure: high-level process chart of a RAG pipeline, showing its three stages: retrieve relevant knowledge from a knowledge base using the user input, combine the retrieved knowledge with the user input and the instructions for the LLM, and let the LLM generate the output.]

RAG works in three main steps:

  1. Retrieve relevant information (text, data, etc.) from a knowledge base based on user input.
  2. Augment the user input by combining it with the retrieved information and the instructions for the LLM; the result is called the prompt.
  3. Generate an output by querying a large language model with the compiled prompt (sketched in code below).
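
To make these steps concrete, here is a minimal sketch in Python. The `retrieve` and `generate` functions are hypothetical stand-ins for a real retriever and a real LLM client, not the API of any particular library:

```python
def retrieve(query: str, top_k: int = 3) -> list[str]:
    """Hypothetical retriever; a real system would query a knowledge base."""
    return ["(relevant passage 1)", "(relevant passage 2)", "(relevant passage 3)"][:top_k]

def generate(prompt: str) -> str:
    """Hypothetical LLM call; a real system would invoke a hosted or local model."""
    return "(model output for the compiled prompt)"

def answer(user_query: str) -> str:
    # Step 1: retrieve relevant information from the knowledge base.
    documents = retrieve(user_query)

    # Step 2: augment the user input with the retrieved context and the
    # instructions for the LLM; together these form the prompt.
    context = "\n\n".join(documents)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {user_query}"
    )

    # Step 3: generate the output by querying the LLM with the compiled prompt.
    return generate(prompt)

print(answer("What is RAG?"))
```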

The rationale behind the RAG approach is that incorporating additional information from domain knowledge sources can improve the effectiveness and completeness of responses to user queries.

Two fundamental architectural elements drive a RAG application:

  1. Knowledge base: this is where all domain-specific data and information are stored and indexed. When a user submits a query, the system retrieves relevant data from it. One of the most common data sources is a vector database, which stores the content together with its numerical representation (embeddings), so that the similarity between the user's query and the stored information can be computed and the most relevant content retrieved (see the sketch after this list). In the figure above, the vector database is used in step 1.
  2. Large Language Model: one or more LLMs process the user query, taking into account the relevant context and additional information retrieved from the knowledge base, to generate a comprehensive and accurate answer (step 3 of the diagram).
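
As an illustration of how a vector database retrieves by similarity, the sketch below builds a toy in-memory index with NumPy and ranks documents by cosine similarity. The `embed` function is a purely illustrative placeholder: it produces pseudo-random vectors, so the ranking here is meaningless, whereas a real embedding model maps similar texts to nearby vectors:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Illustrative placeholder for a real embedding model (e.g. a sentence encoder)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.random(384)  # toy 384-dimensional vector

# Index: store each document alongside its embedding.
documents = [
    "Our refund policy lasts 30 days.",
    "Shipping takes 3-5 business days.",
    "Support is available 24/7 via chat.",
]
index = [(doc, embed(doc)) for doc in documents]

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query: str, top_k: int = 2) -> list[str]:
    """Return the top_k stored documents most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine_similarity(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:top_k]]

print(retrieve("How long do refunds take?"))
```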

It's important to note that RAG applications are not tied to any specific LLM, which makes it possible to use open-source models and retain complete control over the data flow.
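
A narrow interface is enough to make the model swappable: the pipeline only ever sends a prompt string and reads back a completion. The class names in this sketch are illustrative:

```python
from typing import Protocol

class LLM(Protocol):
    """All the RAG pipeline needs from a model: prompt in, text out."""
    def complete(self, prompt: str) -> str: ...

class LocalOpenSourceLLM:
    """Illustrative wrapper around a self-hosted open-source model."""
    def complete(self, prompt: str) -> str:
        return "(completion from a local model)"  # a real call would go here

class HostedLLM:
    """Illustrative wrapper around a commercial hosted API."""
    def complete(self, prompt: str) -> str:
        return "(completion from a hosted API)"  # a real call would go here

def generate(llm: LLM, prompt: str) -> str:
    # The pipeline depends only on the interface, so open-source and
    # hosted models are interchangeable without touching the rest of the code.
    return llm.complete(prompt)

print(generate(LocalOpenSourceLLM(), "What is RAG?"))
```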

Retrieval-Augmented Generation vs. ChatGPT

There are two main differences between a retrieval-augmented generation system and ChatGPT:

  1. Retrieval: While RAG actively retrieves specific information from a knowledge base to respond to user queries, ChatGPT relies solely on the static knowledge encoded during its training, without accessing real-time or specific external databases (except for the new GPT-4o, but with many limitations).
  2. Augmentation: While RAG augments the user input with the retrieved information before generating a response, ChatGPT uses the input provided by the user directly, without any intermediate augmentation; the contrast is sketched below.
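
The practical effect of the augmentation step is visible in what the model actually receives. A plain chat request forwards the user's words as-is, whereas a RAG prompt splices retrieved passages and instructions around them. The template below is illustrative, not a fixed standard:

```python
user_query = "What is our refund window?"

# Without augmentation: the model sees only the user's words.
plain_prompt = user_query

# With augmentation: retrieved passages and instructions surround the query.
retrieved = ["Our refund policy lasts 30 days."]  # from the knowledge base
rag_prompt = (
    "You are a support assistant. Answer using only the context below.\n\n"
    "Context:\n" + "\n".join(retrieved) + "\n\n"
    "Question: " + user_query
)
print(rag_prompt)
```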

In general, a customised RAG application offers the benefits of real-time, specific and contextually relevant information, improving the quality and effectiveness of responses, especially in specialised domains.