RAG · Production · Agentic Systems

Why your RAG pipeline works in staging and fails your users

Anirudh Voruganti
March 28, 2026 · 4 min read

You tuned your retrieval pipeline for two weeks. Precision looked good. The demo impressed the team. You shipped it.

Three weeks later, your users are getting answers that have nothing to do with their questions.

Here's what happened — and it's not the embedding model.

The problem is almost never the model

When most teams build RAG, they make the same sequence of decisions:

  1. Chunk documents into fixed-size pieces
  2. Embed them with whatever model is available
  3. Run cosine similarity at query time
  4. Stuff the top-k results into the prompt

This works in staging because your test queries are written by engineers who know the document. They use the same vocabulary. Their questions match the document's language almost perfectly.

Real users don't do that.

A user asking "what's the process for getting reimbursed" will not match a policy document that says "expense claim submission procedure." Same intent. Different language. Your retrieval returns nothing useful. Your LLM hallucinates a process that sounds plausible.

The gap between your test query distribution and your real user query distribution is where most RAG systems quietly fail.

The retrieval design is the system design

Most teams treat retrieval as a configuration problem: which model, which chunk size, which top-k value. These matter. But they're secondary to three structural decisions that determine whether your system works at all.

Decision 1: Chunking strategy for your actual document shape

Fixed-size chunking is a default, not a strategy. A policy document with structured sections needs semantic chunking that respects section boundaries. A transcript needs speaker-turn chunking. A codebase needs function-level chunking.

When you chunk wrong, you split context that belongs together and merge context that should be separate. No embedding model recovers from that.
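As one illustration, a section-aware chunker for a markdown-style policy document can split on heading boundaries instead of a fixed word count. This is a sketch, assuming headings mark section starts:

```python
import re

def chunk_by_section(doc):
    # Split immediately before each markdown-style heading, so every chunk
    # is one coherent section rather than an arbitrary fixed-size window.
    parts = re.split(r"(?m)^(?=#+ )", doc)
    return [p.strip() for p in parts if p.strip()]
```

The same idea transfers to the other shapes mentioned above: split transcripts on speaker turns, codebases on function boundaries.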

The diagnostic: take your worst-performing queries, find the source documents, and look at exactly what chunks are being retrieved. You'll see the problem immediately.

Decision 2: Hybrid search

Dense retrieval alone — embedding similarity — misses exact-match cases. If a user asks about a specific product SKU or a person's name, dense retrieval will find semantically similar content, not the exact match they need.

Sparse retrieval — BM25, keyword matching — misses semantic similarity. It finds the exact words but misses paraphrases.

Production systems use both, combined with a reranker that scores the merged results. This single change fixes the majority of retrieval failures I have debugged. It is not a premature optimisation — it is a baseline for any system that handles real user queries.
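One simple way to merge the two result lists — short of a learned reranker — is reciprocal rank fusion, sketched here with hypothetical chunk IDs:

```python
def reciprocal_rank_fusion(rankings, k=60):
    # Score each chunk by summing 1 / (k + rank) over every result list it
    # appears in; chunks found by both retrievers rise to the top. k=60 is
    # the conventional damping constant.
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["chunk_b", "chunk_a", "chunk_c"]  # semantic neighbours
sparse = ["chunk_a", "chunk_d"]             # exact keyword hits, e.g. a SKU
merged = reciprocal_rank_fusion([dense, sparse])
```

Because `chunk_a` appears in both lists, it outranks every chunk found by only one retriever — which is exactly the behaviour you want for queries that mix exact terms with paraphrase.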

Decision 3: Confidence thresholds with graceful fallback

If your top retrieved chunk has a similarity score below your calibrated threshold, do not answer. Return a graceful fallback instead.

An honest "I couldn't find a reliable answer to that" is infinitely better than a confident hallucination. Most teams skip this step because it requires knowing what low confidence looks like for their specific data — which means you need to measure it.

Measure it before launch. Not after your users have learned not to trust the system.
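A minimal version of that gate looks like the following. The threshold value here is illustrative only — it has to be calibrated on your own data, and `retrieve` and `generate` are stand-ins for your pipeline:

```python
FALLBACK = "I couldn't find a reliable answer to that."

def answer(query, retrieve, generate, threshold=0.45):
    # retrieve() returns (chunk, similarity) pairs, best first;
    # generate() is whatever calls your LLM. Both are stand-ins here.
    hits = retrieve(query)
    if not hits or hits[0][1] < threshold:
        return FALLBACK  # refuse rather than hallucinate
    return generate(query, [chunk for chunk, _ in hits])
```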

The pattern that fixes most failures

The teams that ship reliable RAG systems do one thing differently: they design retrieval for their actual data shape and their actual user language, not for the happy path in their test suite.

That means:

  • Running real user queries — or realistic synthetic ones — against retrieval before launch
  • Measuring retrieval precision separately from generation quality
  • Designing fallback behaviour as a first-class feature, not an afterthought
  • Treating chunking as a domain-specific design decision, not a parameter to tune
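Measuring retrieval on its own can be as simple as recall@k over a labelled query set — the data below is hypothetical; in practice the gold labels come from annotating which chunk answers each query:

```python
def recall_at_k(retrieved, gold, k=5):
    # Fraction of queries whose gold chunk appears in the top-k results.
    hits = sum(1 for q in gold if gold[q] in retrieved[q][:k])
    return hits / len(gold)

retrieved = {
    "how do I get reimbursed":   ["c7", "c2", "c9"],
    "what is the travel policy": ["c1", "c4"],
}
gold = {"how do I get reimbursed": "c2", "what is the travel policy": "c3"}
```

If this number is low, no amount of prompt engineering downstream will save the answers — which is why it needs its own metric, separate from generation quality.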

The moment you stop debugging the embedding model and start examining the retrieval design, the failures start making sense — and so do the fixes.

RAG is a system design problem. Treat it like one from the start.

goBIGai

Building something like this?

I work with a small number of technical teams to design and ship production-grade agentic systems. If you're dealing with a version of the problems above, let's talk.

Book a 30-min call →