Every iteration of AI chatbots and large language models (LLMs) just gets better and better. Take ChatGPT, for example. In a matter of seconds, it can generate a detailed history of the Olympics, a comparison of King Lear (1606) versus King Kong (2005), answers for the New York Bar Exam, and a whole lot more.
While this kind of general knowledge makes for a neat party trick, it has its limits. For example, if you ask ChatGPT how much vacation time you're entitled to at work, it's going to give you general platitudes, at best—not specific advice from your employee handbook. And at worst, it'll make up an answer.
This is where retrieval augmented generation (RAG) comes in. Broadly speaking, RAG is a method for giving AI models access to additional external information that they haven't been trained on. Crucially, it allows AI models to access new and up-to-date information without needing to be retrained. But that's putting things simply, so let's dive in.
Large language models (LLMs): A quick overview
Before getting into the nitty-gritty of retrieval augmented generation, it's important to understand the problem it solves in large language models.
LLMs are powerful text prediction engines. They take your AI prompt and generate an answer by predicting a string of plausible follow-on text. To do this, every LLM is trained on a massive corpus of data. Where this data comes from varies by LLM, but at a minimum, you can assume it includes a huge swath of the public internet and a vast library of published books. State-of-the-art models are also trained on proprietary data or "synthetic data" generated by other AI models. And once this training is done, it's done.
While LLMs are trained on more information than any human could read and internalize in countless lifetimes, they still have a few quirks. One quirk is that every LLM has a cutoff date where its training data stops dead. OpenAI's GPT-4o, for example, stops at October 2023. This means that without some way to access and use external data, GPT-4o doesn't know anything that happened after October 2023—it has only its internal training data to rely on.
This leads us to the problem with LLMs that RAG solves: getting new information.
For a deeper dive into how LLMs work, check out How does ChatGPT work?
What is RAG?
Retrieval augmented generation (RAG) is a way of giving an LLM information that it wasn't trained on. It requires two major components: an LLM and a database containing all the additional information you want it to have access to.
How does retrieval augmented generation work?
The best way to explain retrieval augmented generation is to walk you through an example of RAG in action.
Let's say you sell running shoes, and you've added an AI chatbot to your website to answer customers' questions about your products.
When a customer asks your chatbot a question (the input), it doesn't get sent directly to the LLM. Instead, it gets analyzed, and the database with all the additional info about your products—for example, sales copy, care instructions, and fitting guidelines—is searched for any information that could be useful in generating a response.
If the customer asked about a particular shoe, the search would return all the relevant information about that shoe. And if there's no relevant info, you could configure the pipeline to say as much, or to fall back on the LLM's training data. This is all part of the retrieval step.
All that additional information, or context, gets appended to the initial prompt, and the whole lot is sent to the LLM. The LLM then generates a response grounded in the context it received from your RAG database. This is the augmented generation part of things.
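To make that concrete, here's a rough sketch of the retrieve-then-generate loop in Python. The search_product_docs helper, the product copy, the prompt wording, and the model name are all illustrative placeholders I've made up; the generation call uses the standard OpenAI chat completions API, but any LLM API would work the same way.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment


def search_product_docs(question: str, top_k: int = 3) -> list[str]:
    """Placeholder retrieval step: in a real pipeline, this would query a
    vector database for the chunks most relevant to the question."""
    return [
        "The TrailRunner 2 runs about half a size small.",  # made-up product copy
        "Hand wash the TrailRunner 2 in cold water and air dry.",
    ][:top_k]


def answer_question(question: str) -> str:
    # Retrieval: grab the chunks most relevant to the customer's question.
    context = "\n\n".join(search_product_docs(question))

    # Augmented generation: append the retrieved context to the prompt and
    # ask the model to answer from it (or admit it doesn't know).
    prompt = (
        "Answer the customer's question using only the context below. "
        "If the context doesn't contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # any chat-capable model works here
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


print(answer_question("Do the TrailRunner 2s run small?"))
```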
Elements of a RAG pipeline
There are countless ways to implement a RAG pipeline, but all require the same core elements.
LLMs. The large language model is used to process the initial input and figure out what context is required to respond accurately, and then to generate a response to the finalized prompt. Some complex RAG-based apps have multiple LLMs in their pipelines: small language models that parse the initial input, along with larger models with bigger context windows that generate the response.
RAG database. Retrieval augmented generation employs a vector database like Pinecone or Chroma, where "chunks" of text are encoded as embeddings in vector space, much like the way LLMs represent text internally (there's a rough sketch of this after the list). How you decide to chunk the information and what search strategies you employ can have a major impact on speed, accuracy, and, more importantly, cost.
Controller. The controller or orchestrator decides what gets retrieved, when, and how it's stitched into the prompt. This is often built using frameworks like LangChain or Semantic Kernel.
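If you're curious what the database piece looks like in practice, here's a rough sketch of chunking and indexing documents. Chroma is used only because it runs in-process with no setup; Pinecone or any other vector database follows the same pattern. The document text, IDs, and chunk size are all made up for illustration.

```python
import chromadb

client = chromadb.Client()
collection = client.create_collection(name="product_docs")


# Chunking: split each document into smaller passages before indexing.
# A fixed-size split is the simplest strategy; real pipelines often split
# by heading, paragraph, or sentence instead.
def chunk(text: str, size: int = 500) -> list[str]:
    return [text[i : i + size] for i in range(0, len(text), size)]


docs = {
    "trailrunner-care": "Hand wash the TrailRunner 2 in cold water and air dry.",
    "trailrunner-fit": "The TrailRunner 2 runs about half a size small.",
}

for doc_id, text in docs.items():
    pieces = chunk(text)
    collection.add(
        documents=pieces,
        ids=[f"{doc_id}-{i}" for i in range(len(pieces))],
    )

# Retrieval: the controller would run something like this for each question
# and pass the returned chunks to the LLM as context.
results = collection.query(query_texts=["Do these shoes run small?"], n_results=3)
print(results["documents"][0])
```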
The specifics of setting up a RAG pipeline go far beyond the scope of this article. Unless you're a developer working in this space, you're more likely to encounter RAG when it's employed in everyday apps like ChatGPT, Notion AI, and Zapier Chatbots. Which, when you think about it, is incredibly cool.
Take Zapier Chatbots, for example. Without needing to first get an engineering degree, you can easily build a customer support or lead management chatbot that draws on everything its underlying LLM was trained on, along with the additional knowledge sources and live data you feed it, so it can accurately answer queries.
You can even take it a step further and connect Zapier Chatbots with the rest of your apps. This way, you can automate the rest of your workflows right from your chatbot. Learn more about how to create a custom AI chatbot with Zapier Chatbots, or get started with one of these templates.
Zapier is the leader in workflow automation—integrating with thousands of apps from partners like Google, Salesforce, and Microsoft. Use interfaces, data tables, and logic to build secure, automated systems for your business-critical workflows across your organization's technology stack. Learn more.
Benefits of RAG
If it's well implemented, retrieval augmented generation offers a few huge advantages over employing an LLM on its own.
Offers additional context
The biggest benefit of RAG is that it allows you to include additional context that the LLM wasn't trained on. This can be proprietary or personal data, up-to-date information like your current client projects or deadlines, or anything else you can think of. It massively extends the usefulness of LLMs in the real world.
Reduces AI hallucinations
For better or worse, LLMs and chatbots try to be as helpful as possible, even if it means responding with something totally made up instead of saying they don't know the answer. RAG somewhat alleviates the problem of AI hallucinations by giving the model a source of truth to defer to. It's not a perfect solution, though, so it's still a good idea to fact-check your chatbot's responses.
Easier to deploy and update
Retrieval augmented generation itself is easy and cheap to deploy—at least compared to training or fine-tuning an LLM. It doesn't require weeks of training, and mistakes can be fixed by just updating a database entry.
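As a loose illustration, continuing the hypothetical Chroma sketch from earlier: fixing a wrong answer is a matter of updating the offending chunk, not retraining anything. The ID and correction text here are made up.

```python
# Correcting a mistake is just a database update, not a retraining run.
collection.update(
    ids=["trailrunner-fit-0"],
    documents=["Correction: the TrailRunner 2 fits true to size."],
)
```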
Cons of RAG
With that said, RAG isn't without its costs. Here are some notable ones.
Increases overall costs
RAG requires significantly more LLM compute, which costs GPU time (or API credits) and, as a result, cash. Even a well-implemented RAG pipeline is going to cost more to run than a bare LLM, and a poorly implemented one will cost a lot more.
Slower response times
Since RAG adds extra retrieval steps to the overall process, responses take longer to generate. Again, a well-implemented RAG pipeline will have less of a slowdown than a poorly implemented one, but it's slower nonetheless.
The RAGged edge
Retrieval augmented generation fundamentally fixes one of the biggest downsides of current large language models—that their knowledge of the world is limited to their training data. While it isn't without its quirks, I think it'll become the core of most AI products over the next few years, if not sooner.