Retrieval-Augmented Generation: How Chatbots Stop Hallucinating

A practical guide to RAG systems that ground LLM responses in real data, reducing hallucinations and keeping answers current.

By Daniel Evershaw (ML Engineer & Technical Writer) | May 3, 2026 | 5 min read

Last updated: May 14, 2026

Quick Answer

RAG reduces LLM hallucinations by retrieving relevant documents from a knowledge base at inference time and including them as context for generation.

Every organization that has deployed a large language model in production has encountered the same problem: the model confidently states something that is simply not true. It invents product features that do not exist, cites policies that were never written, or provides outdated information with complete assurance. Retrieval-Augmented Generation (RAG) is the most practical solution to this problem today.

The Hallucination Problem

LLMs hallucinate because they are trained to produce plausible text, not truthful text. When asked about your company's specific return policy, the model has no way to know the actual policy: either it was not in the training data, or the training data is outdated. So it generates something that sounds like a return policy based on patterns it has seen. The result reads convincingly but may be entirely fabricated.

This is not a problem you can solve with better prompting alone. No matter how carefully you instruct the model to “only state facts,” it cannot distinguish between what it knows accurately and what it is confabulating. The architecture does not support that distinction.

How RAG Works

RAG addresses hallucination by giving the model access to a curated knowledge base at inference time. The process has three stages:

Indexing: Your documents — product manuals, policy documents, knowledge base articles, whatever your source of truth is — are split into chunks, converted into vector embeddings, and stored in a vector database. This happens once (or on a schedule as documents update).

Retrieval: When a user asks a question, their query is also converted into a vector embedding. The system searches the vector database for chunks whose embeddings are most similar to the query embedding. This returns the most relevant pieces of your documentation.

Generation: The retrieved chunks are inserted into the LLM prompt as context, along with the user's question. The model generates its response based on this provided context rather than relying solely on its training data. You can instruct it to answer only from the provided context and to say “I don't know” when the context does not contain the answer.
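
The three stages above can be sketched end to end in a few lines. This is a toy illustration, not a production design: the `embed` function is a bag-of-words stand-in for a real embedding model, the three policy sentences are invented sample data, and the in-memory list plays the role of the vector database. All function names are hypothetical.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# 1. Indexing: embed each chunk once and store it next to its vector.
#    (Invented sample chunks; a real system would load your documents.)
chunks = [
    "Returns are accepted within 30 days of delivery with a receipt.",
    "Standard shipping takes 3 to 5 business days.",
    "Gift cards cannot be refunded or exchanged for cash.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# 2. Retrieval: embed the query and return the k most similar chunks.
def retrieve(query: str, k: int = 2) -> list[str]:
    query_vec = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

# 3. Generation: place the retrieved chunks in the prompt as context.
def build_prompt(question: str, context: list[str]) -> str:
    sources = "\n".join(f"- {c}" for c in context)
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say \"I don't know\".\n\n"
        f"Context:\n{sources}\n\nQuestion: {question}\n"
    )

question = "How many days do I have for returns?"
prompt = build_prompt(question, retrieve(question))
print(prompt)  # send this string to whichever LLM you use
```

Swapping `embed` for a real embedding model and `index` for a vector database changes the quality of the results, not the shape of the pipeline.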

Chunking Strategies

How you split documents into chunks dramatically affects retrieval quality. Too large and you waste context window space on irrelevant information. Too small and you lose the coherence needed for the model to generate good answers.

Fixed-size chunking (splitting every N tokens) is simple but often splits mid-sentence or mid-paragraph, losing context. Semantic chunking uses natural boundaries — paragraphs, sections, or topic shifts — to create more meaningful units.
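
As a rough sketch of boundary-aware chunking, the function below splits on blank lines and packs whole paragraphs into chunks up to a size budget. It assumes paragraphs are separated by blank lines and counts whitespace-separated words as a cheap proxy for tokens; `chunk_by_paragraph` and `max_words` are hypothetical names, and a single paragraph larger than the budget is kept whole rather than split.

```python
import re

def chunk_by_paragraph(text: str, max_words: int = 200) -> list[str]:
    """Pack consecutive paragraphs into chunks without splitting any paragraph."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for para in paragraphs:
        n = len(para.split())
        # Start a new chunk if adding this paragraph would exceed the budget.
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```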

Overlapping chunks (where each chunk shares some content with its neighbors) help ensure that information at chunk boundaries is not lost. A common approach uses chunks of 500-1000 tokens with 100-200 token overlap.
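
A minimal sketch of overlapping chunks follows. It takes a list of tokens that has already been produced by some tokenizer (the usage line uses whitespace splitting as a stand-in), and the function name is hypothetical.

```python
def chunk_with_overlap(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens with the previous window."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks

# Example: 500-token windows with a 100-token overlap.
tokens = ("word " * 1200).split()
windows = chunk_with_overlap(tokens, size=500, overlap=100)
```

Each window starts `size - overlap` tokens after the previous one, so a sentence that straddles one boundary still appears intact in the neighboring chunk.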

For structured documents, hierarchical chunking preserves the document structure: a section header chunk links to its subsection chunks, allowing the retrieval system to pull in related context at multiple levels of detail.
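
One way to represent that structure is to give every chunk a pointer to its parent, then walk up the chain when a chunk is retrieved. The sketch below assumes a simple in-memory dictionary as the chunk store; the `Chunk` class, its field names, and the sample text are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Chunk:
    chunk_id: str
    text: str
    parent_id: Optional[str] = None  # None for a top-level section

def expand_with_ancestors(chunk_id: str, store: dict[str, Chunk]) -> list[str]:
    """Return the texts from the top-level section down to the requested chunk."""
    chain: list[str] = []
    current = store.get(chunk_id)
    while current is not None:
        chain.append(current.text)
        current = store.get(current.parent_id) if current.parent_id else None
    return list(reversed(chain))

# Invented sample data: a section header chunk with one subsection chunk.
store = {
    "returns": Chunk("returns", "Returns Policy"),
    "returns.window": Chunk("returns.window", "Items can be returned within 30 days.", "returns"),
}
context = expand_with_ancestors("returns.window", store)
```

Retrieval can then match on the small, specific chunk while the prompt receives the surrounding section context as well.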

Embedding Models

The embedding model converts text into dense vectors that capture semantic meaning. Two pieces of text about the same topic should have similar embeddings even if they use different words.
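
"Similar embeddings" is usually measured with cosine similarity, which compares the direction of two vectors and ignores their length. The sketch below uses invented three-dimensional vectors purely for illustration; real embeddings have hundreds of dimensions and come from a model.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two dense vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Made-up vectors standing in for embeddings of three short texts.
refund_policy = [0.9, 0.1, 0.0]
money_back = [0.8, 0.2, 0.1]      # different words, same topic
shipping_times = [0.0, 0.2, 0.9]  # different topic

same_topic = cosine_similarity(refund_policy, money_back)
different_topic = cosine_similarity(refund_policy, shipping_times)
```

With these values, `same_topic` comes out close to 1 and `different_topic` close to 0, which is the behavior a good embedding model aims for on real text.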

The choice of embedding model matters more than most teams realize. General-purpose embeddings work adequately for broad topics, but domain-specific fine-tuned embeddings significantly improve retrieval accuracy for specialized content. If your knowledge base is full of medical terminology or legal language, a general embedding model may not capture the nuances that distinguish relevant from irrelevant chunks.

Dimension size is a practical consideration. Higher-dimensional embeddings capture more nuance but require more storage and slow down similarity search. For most applications, embeddings in the 768-1536 dimension range offer a good balance.
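
To get a feel for the storage side of that trade-off, here is a back-of-the-envelope calculation. It assumes each value is stored as a 4-byte float and uses a hypothetical corpus of one million chunks; index structures and metadata add overhead on top of this raw figure.

```python
def raw_vector_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw storage for the vectors alone, excluding index and metadata overhead."""
    return num_vectors * dims * bytes_per_value

for dims in (768, 1536):
    gib = raw_vector_bytes(1_000_000, dims) / 2**30
    print(f"{dims} dims: {gib:.2f} GiB")
# 768 dims: 2.86 GiB
# 1536 dims: 5.72 GiB
```

Doubling the dimension count doubles the raw storage, and the per-comparison cost of similarity search grows with it.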

Vector Databases

Vector databases are optimized for similarity search over high-dimensional vectors. Unlike traditional databases that match exact values, vector databases find the nearest neighbors to a query vector, usually approximately, using algorithms like HNSW (Hierarchical Navigable Small World) or IVF (Inverted File Index). The approximation trades a small amount of recall for much faster search on large collections.

Popular options include Pinecone (managed, simple), Weaviate (open-source, feature-rich), Qdrant (open-source, performant), and pgvector (PostgreSQL extension, familiar). The choice depends on your scale, infrastructure preferences, and whether you need additional features like filtering or hybrid search.

For most teams starting out, pgvector offers the lowest friction — you can add vector search to your existing PostgreSQL database without introducing a new service.
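
As a sketch of what that looks like, the snippet below holds the SQL for a pgvector-backed table and a small search helper. The table name `chunks`, its columns, and the 768-dimension size are assumptions for illustration. The helper expects a DB-API style connection such as the one a PostgreSQL driver like psycopg provides, and it has not been tuned for any particular driver version.

```python
# Run these once against your database to set up storage and an HNSW index.
SCHEMA_STATEMENTS = [
    "CREATE EXTENSION IF NOT EXISTS vector",
    """CREATE TABLE IF NOT EXISTS chunks (
           id bigserial PRIMARY KEY,
           content text NOT NULL,
           embedding vector(768)
       )""",
    """CREATE INDEX IF NOT EXISTS chunks_embedding_idx
           ON chunks USING hnsw (embedding vector_cosine_ops)""",
]

# <=> is pgvector's cosine distance operator; smaller means more similar.
SEARCH_SQL = """
SELECT content
FROM chunks
ORDER BY embedding <=> %s::vector
LIMIT %s
"""

def to_vector_literal(embedding: list[float]) -> str:
    """Format a Python list as pgvector's text representation, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(str(float(x)) for x in embedding) + "]"

def search(conn, query_embedding: list[float], k: int = 5) -> list[str]:
    """Return the content of the k chunks closest to the query embedding."""
    cur = conn.cursor()
    cur.execute(SEARCH_SQL, (to_vector_literal(query_embedding), k))
    return [row[0] for row in cur.fetchall()]
```

Because the vectors live in an ordinary table, the same query can add a `WHERE` clause or join against your existing data.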

Improving RAG Quality

Basic RAG often disappoints because retrieval quality is mediocre. Several techniques improve it:

Hybrid search combines vector similarity with keyword matching (BM25). Some queries are better served by exact keyword matches, while others benefit from semantic understanding. Combining both catches more relevant results.
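
There are several ways to merge the two result lists. One common choice, shown here as an example rather than the only option, is reciprocal rank fusion (RRF): each document earns `1 / (k + rank)` from every list it appears in, and the sums decide the final order. The constant `k = 60` is a widely used default, and the document IDs below are invented.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one fused ranking."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

vector_hits = ["a", "b", "c"]   # ranked by embedding similarity
keyword_hits = ["c", "a", "d"]  # ranked by BM25
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)  # ['a', 'c', 'b', 'd']
```

RRF only looks at ranks, so it avoids having to put BM25 scores and cosine similarities on a common scale.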

Re-ranking applies a more expensive model to re-score the top retrieved results. The initial retrieval casts a wide net cheaply, then a cross-encoder model evaluates each candidate more carefully to surface the truly relevant ones.
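
The two-stage pattern can be sketched as a function that takes the cheap first-pass candidates and a more careful scoring function. In the sketch below, `overlap_score` is a deliberately simple stand-in for a cross-encoder so the example runs without a model; in practice you would pass a function that calls your re-ranking model. All names and sample passages are hypothetical.

```python
from typing import Callable

def rerank(
    query: str,
    candidates: list[str],
    score_pair: Callable[[str, str], float],
    top_n: int = 3,
) -> list[str]:
    """Re-score each (query, candidate) pair and keep the top_n candidates."""
    scored = [(score_pair(query, passage), passage) for passage in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:top_n]]

def overlap_score(query: str, passage: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query words found in the passage."""
    q_words = set(query.lower().split())
    p_words = set(passage.lower().split())
    return len(q_words & p_words) / len(q_words) if q_words else 0.0

candidates = [
    "shipping takes five days",
    "returns accepted within 30 days",
    "gift cards are final sale",
]
best = rerank("returns within how many days", candidates, overlap_score, top_n=2)
```

Because the expensive scorer only sees the shortlist, its cost scales with the number of candidates, not with the size of the knowledge base.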

Query transformation rewrites the user query before retrieval. A vague question like “how does it work?” can be expanded into a more specific query based on conversation context, improving retrieval accuracy.
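
A minimal sketch of this step, assuming you have some function that sends a prompt to an LLM and returns its text: the `llm` parameter below is that hypothetical callable, and the template wording is only an example.

```python
from typing import Callable

REWRITE_TEMPLATE = (
    "Rewrite the final user question as a standalone search query.\n"
    "Resolve pronouns using the conversation. Return only the query.\n\n"
    "Conversation:\n{history}\n\nFinal question: {question}\n"
)

def rewrite_query(history: list[str], question: str, llm: Callable[[str], str]) -> str:
    """Ask the model for a standalone query; fall back to the original question."""
    prompt = REWRITE_TEMPLATE.format(history="\n".join(history), question=question)
    rewritten = llm(prompt).strip()
    return rewritten or question
```

The rewritten query is used only for retrieval; the user's original question still goes into the final generation prompt.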

Metadata filtering narrows the search space before vector similarity is computed. If you know the user is asking about a specific product or time period, filtering by metadata before searching reduces noise.
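
In code, the idea is to discard non-matching records first and rank only what remains. The sketch below uses an in-memory list of records with invented vectors and a `product` metadata field; a vector database would apply the same filter inside the query instead.

```python
def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def filtered_search(query_vec: list[float], records: list[dict], where: dict, k: int = 3) -> list[str]:
    """Keep records whose metadata matches every key in `where`, then rank by similarity."""
    candidates = [
        r for r in records
        if all(r["meta"].get(key) == value for key, value in where.items())
    ]
    ranked = sorted(candidates, key=lambda r: dot(query_vec, r["vec"]), reverse=True)
    return [r["id"] for r in ranked[:k]]

# Invented sample records.
records = [
    {"id": "a", "vec": [1.0, 0.0], "meta": {"product": "router"}},
    {"id": "b", "vec": [0.9, 0.1], "meta": {"product": "modem"}},
    {"id": "c", "vec": [0.2, 0.8], "meta": {"product": "router"}},
]
router_hits = filtered_search([0.9, 0.1], records, where={"product": "router"})
```

Record `b` is a close match on similarity alone, but the filter removes it because it describes a different product.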

When RAG Is Not Enough

RAG works well for factual question-answering over a defined knowledge base. It struggles with:

  • Questions requiring synthesis across many documents
  • Reasoning that requires understanding relationships not explicit in any single chunk
  • Real-time information that changes faster than your indexing pipeline runs
  • Tasks requiring the model to follow complex multi-step procedures

For these cases, consider combining RAG with other techniques: agentic workflows that can query multiple sources, fine-tuning for domain-specific reasoning patterns, or structured data retrieval (SQL queries) alongside unstructured document retrieval.

Key Takeaways

  • RAG grounds LLM responses in real data by retrieving relevant documents at inference time
  • Chunking strategy, embedding model choice, and retrieval quality are the primary levers for RAG performance
  • Hybrid search (vector + keyword) and re-ranking significantly improve retrieval accuracy
  • RAG works best for factual Q&A over defined knowledge bases; complex reasoning may need additional techniques
  • Start simple with pgvector and iterate based on measured retrieval quality

Getting Started

The simplest RAG implementation uses an embedding API, pgvector, and a few dozen lines of code. Start there, measure retrieval quality (what percentage of questions get relevant context?), and add complexity only where measurements show it is needed. Most RAG failures are retrieval failures, not generation failures — fix retrieval first.
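
One simple way to put a number on retrieval quality is hit rate at k: the share of test questions for which at least one known-relevant chunk appears in the top k results. The sketch below assumes you have hand-labeled a small set of questions with the chunk IDs that should be retrieved; the IDs and function name are hypothetical.

```python
def hit_rate_at_k(results: dict[str, list[str]], relevant: dict[str, set[str]], k: int = 5) -> float:
    """Fraction of questions with at least one relevant chunk in the top k results."""
    if not relevant:
        return 0.0
    hits = sum(
        1 for question, gold in relevant.items()
        if gold & set(results.get(question, [])[:k])
    )
    return hits / len(relevant)

# Invented evaluation data: retrieved chunk IDs and the labeled relevant IDs.
results = {"q1": ["c1", "c2"], "q2": ["c9", "c3"], "q3": []}
relevant = {"q1": {"c2"}, "q2": {"c4"}, "q3": {"c5"}}
score = hit_rate_at_k(results, relevant, k=2)  # one hit out of three questions
```

Tracking this number before and after each change (new chunking, hybrid search, re-ranking) shows whether the added complexity is paying off.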

Frequently Asked Questions

Does RAG completely eliminate hallucinations?

No, but it significantly reduces them. The model can still misinterpret retrieved context or generate information not present in it. Instructing the model to cite its sources and to say “I do not know” when the context lacks the answer helps further.

How much does a RAG system cost to run?

Costs include embedding computation, vector database hosting, and LLM inference. For most applications, embedding costs are minimal, vector DB hosting ranges from free (pgvector on existing infra) to hundreds per month (managed services), and LLM costs depend on query volume.

Can RAG work with private or sensitive documents?

Yes. Documents stay in your own infrastructure and only the retrieved chunks are sent to the LLM, so you keep control over data access. Those chunks do leave your infrastructure if the LLM is a hosted API, which is why some teams use local LLMs for additional privacy.
