
In the Loop: What is RAG, Anyway?

In this episode of In The Loop, ML Evangelist Mikaela Kaplan introduces RAG (retrieval-augmented generation) and explains how it connects knowledge bases to LLMs to produce more accurate, context-rich responses. Learn the building blocks of a RAG system and why it matters.

Transcript

Hi, I’m Mikaela Kaplan, the ML Evangelist at HumanSignal, and this is In The Loop—the series where we help you stay in the loop with all things data science and AI.

Today, we’re diving into RAG: what it is, how it works, and how you can use it effectively in your own environment.

RAG stands for retrieval-augmented generation. Introduced in a 2020 paper by Lewis et al., it’s a method of connecting a knowledge base to a large language model (LLM) to provide specialized, up-to-date context for generated responses.

Why is this helpful? LLMs have known limitations—they can hallucinate, lack domain-specific or recent knowledge, and struggle with long context windows. RAG helps mitigate these issues by dynamically retrieving relevant information from a connected knowledge base, instead of relying solely on what the LLM was pre-trained on.

The first core component of a RAG system is the knowledge base, which stores domain-specific information as embeddings—numerical representations of text that capture semantic meaning. Similar meanings result in similar embeddings. For example, if your query is about a specific topic, documents with related content will have embeddings close to that query’s embedding.
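To make that concrete, here is a minimal sketch of embedding similarity in Python. The sentence-transformers library and the all-MiniLM-L6-v2 model are just one common choice for an embedding model, not something the episode prescribes.

```python
# A minimal sketch of how "similar meaning -> similar embeddings" plays out in code.
# The embedding model and library here are one common choice, not a requirement.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "How do I reset my account password?"
docs = [
    "To reset your password, open Settings and click 'Forgot password'.",
    "Our quarterly revenue grew 12% year over year.",
]

# Encode the query and the documents into dense vectors.
query_emb = model.encode(query)
doc_embs = model.encode(docs)

# Cosine similarity: higher means the texts are closer in meaning.
def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

for doc, emb in zip(docs, doc_embs):
    print(f"{cosine(query_emb, emb):.3f}  {doc}")
# The password-reset document should score noticeably higher than the revenue one,
# which is exactly the property the retriever relies on.
```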

The second key component is the Retriever, which links the LLM and the knowledge base. Here’s how the RAG workflow operates (a minimal code sketch follows the steps below):

The user submits a query.

That query is converted into an embedding using an embedding model.

The knowledge base is queried for documents with similar embeddings.

Relevant chunks from those documents are added to the prompt context.

The original query and retrieved context are sent to the LLM.

The LLM generates a response using both the context and the query.
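Put together, those six steps fit in a few dozen lines. The sketch below keeps the knowledge base in memory, reuses the same assumed embedding model as above, and uses a hypothetical call_llm() placeholder where your actual LLM API call would go.

```python
# A toy end-to-end version of the six steps above, with an in-memory "knowledge base".
# call_llm() is a hypothetical placeholder for whatever LLM provider you actually use.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

# Knowledge base: documents stored alongside their embeddings.
documents = [
    "Label Studio supports audio, image, text, and time series annotation.",
    "RAG retrieves relevant documents and adds them to the LLM prompt.",
    "Re-ranking orders retrieved chunks by relevance before prompting.",
]
doc_embs = model.encode(documents)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def call_llm(prompt: str) -> str:
    # Placeholder: swap in your provider's chat/completions call here.
    return f"[LLM response to a {len(prompt)}-character prompt]"

# 1. The user submits a query.
query = "What does RAG actually do?"

# 2. Convert the query into an embedding.
query_emb = model.encode(query)

# 3. Query the knowledge base for documents with similar embeddings.
scores = [cosine(query_emb, emb) for emb in doc_embs]
top_k = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)[:2]

# 4. Add the relevant chunks to the prompt context.
context = "\n".join(documents[i] for i in top_k)

# 5. Send the original query and retrieved context to the LLM.
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# 6. The LLM generates a response using both the context and the query.
print(call_llm(prompt))
```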

In most RAG pipelines, we use two-stage retrieval:

The first stage retrieves all potentially relevant documents.

The second stage—known as re-ranking—orders those documents or chunks by relevance.

This two-stage approach helps optimize for speed and memory efficiency. The re-ranker can incorporate both mathematical and business logic to refine results.
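Here is one way those two stages can look in code. The bi-encoder and cross-encoder model names are illustrative assumptions, and the small recency bonus stands in for whatever business logic your application actually needs.

```python
# A sketch of two-stage retrieval: a fast vector search narrows the field, then a
# cross-encoder re-ranker (plus a toy "business logic" boost) orders the survivors.
from sentence_transformers import SentenceTransformer, CrossEncoder
import numpy as np

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
re_ranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# Each chunk carries metadata that the re-ranker's business logic can use.
chunks = [
    {"text": "RAG adds retrieved context to the LLM prompt.", "year": 2024},
    {"text": "Retrieval-augmented generation was introduced in 2020.", "year": 2020},
    {"text": "Time series annotation supports synchronized channels.", "year": 2025},
]

query = "How does RAG give an LLM extra context?"
query_emb = bi_encoder.encode(query)
chunk_embs = bi_encoder.encode([c["text"] for c in chunks])

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stage 1: cheap vector similarity pulls in a broad candidate set (top 2 of 3 here).
scored = sorted(zip(chunks, chunk_embs),
                key=lambda pair: cosine(query_emb, pair[1]), reverse=True)
candidates = [chunk for chunk, _ in scored[:2]]

# Stage 2: a slower, more accurate cross-encoder scores each (query, chunk) pair;
# a small recency bonus is a stand-in for business rules in the re-ranker.
ce_scores = re_ranker.predict([(query, c["text"]) for c in candidates])
reranked = sorted(zip(candidates, ce_scores),
                  key=lambda pair: float(pair[1]) + 0.1 * (pair[0]["year"] >= 2024),
                  reverse=True)

for chunk, score in reranked:
    print(f"{float(score):+.2f}  {chunk['text']}")
```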

While RAG systems may seem complex, they’re fundamentally about giving LLMs better context. By injecting external knowledge into the generation process, RAG systems make model outputs more relevant and reliable.

But like any AI system, RAG can fail.

In the next episode of In The Loop, we’ll cover common failure points in RAG pipelines and how to fix them.

That’s all for this week. Thanks for staying in the loop.
