AI delivery notes

RAG in Production: Chunking, Retrieval Quality, and the Problems the Demo Hides

A RAG assistant becomes a product the moment someone relies on it during real work. From that point on, the engineering around the model matters more than the demo.

RAG · Retrieval · Evaluation · LLM Ops
7 min read

Working principle

If I cannot explain why the assistant answered that way, I do not call the feature ready.

  • Working rule: trace every answer
  • First fix: ingestion quality
  • Failure to avoid: confident guesswork

The first production bug is usually not the model

Most RAG demos look better than the first live release. The reason is simple: demos are curated. Production traffic is not.

In a notebook, the same handful of documents are clean, recent, and already known by the person asking the question. In a real product, users ask with partial context, stale wording, and permissions you have to respect. If the answer fails, they do not say "interesting limitation of retrieval." They say the assistant is wrong.

That is why I treat a RAG feature as a data product with an LLM at the end, not as a chatbot with a vector database bolted on.

Ingestion is where trust starts

If source documents are inconsistent, retrieval quality degrades quietly. You can spend days tuning prompts and never recover from weak inputs.

The rules I keep from the start

  • Every chunk gets a stable document ID, source version, language, and access scope
  • Re-ingestion is diff-based, so unchanged files are not re-embedded
  • Chunking follows the source format instead of a single global rule
  • Deleted or superseded documents are explicitly retired from the index

On regulatory content, article boundaries matter. On technical docs, headings, tables, and code blocks matter. On support conversations, thread context matters. "512 tokens everywhere" is a reasonable default, not a strategy.
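The metadata and diff rules above can be sketched in a few lines. This is a minimal illustration, not a production schema: the `Chunk` fields and function names are my own, and a real pipeline would also handle retirement of superseded documents.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """One retrievable unit. Field names are illustrative, not a standard schema."""
    doc_id: str          # stable across re-ingestions
    source_version: str  # version of the source document
    language: str
    access_scope: str    # e.g. a team or permission group
    text: str


def content_hash(text: str) -> str:
    """Hash used to detect whether a chunk's content actually changed."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def needs_reembedding(chunk: Chunk, known_hashes: dict[str, str]) -> bool:
    """Diff-based re-ingestion: only changed or new chunks go back to the embedder."""
    return known_hashes.get(chunk.doc_id) != content_hash(chunk.text)
```

With this in place, a re-ingestion job compares hashes first and touches the embedding API only for the chunks where `needs_reembedding` is true.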

Evaluate retrieval before you optimize the answer

Teams often watch the final answer and skip the step that decides whether a good answer was even possible.

The smallest useful evaluation set is not huge. I usually start with 30 to 50 real questions collected from the team that will own the feature. For each one, I want to know which document or section should have been retrieved. That is enough to detect obvious regressions when chunking, filters, or reranking change.

The signals I care about

  • Did the right evidence appear in the top results?
  • Was the answer able to cite the relevant source?
  • Did the system refuse to answer when the corpus did not support the request?

If the system cannot abstain, it will improvise. That is the shortest route to a support problem.
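The first of those signals reduces to a single number you can track across changes. Here is a minimal sketch, assuming an evaluation set of (question, expected document ID) pairs and a `retrieve` function that returns ranked document IDs; both are stand-ins for whatever your stack provides.

```python
def hit_rate_at_k(eval_set, retrieve, k=5):
    """Fraction of questions whose expected evidence appears in the top-k results.

    eval_set: list of (question, expected_doc_id) pairs.
    retrieve: callable mapping a question to a ranked list of doc IDs.
    """
    hits = 0
    for question, expected_doc_id in eval_set:
        if expected_doc_id in retrieve(question)[:k]:
            hits += 1
    return hits / len(eval_set)
```

Run it before and after every chunking, filter, or reranking change; a drop on 30 to 50 real questions is usually a regression worth investigating, whatever the final answers look like.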

Retrieval needs two stages in practice

Pure vector search is rarely enough on its own. Exact terms matter: article numbers, product names, ticket IDs, internal acronyms. Dense retrieval helps with semantic closeness; lexical search helps with precision. I have had better results by combining both than by endlessly benchmarking embedding models.

A pattern that keeps paying off:

  1. Filter by permissions, language, document type, and freshness.
  2. Retrieve a larger candidate set with hybrid search.
  3. Rerank the candidates before building the final context window.
  4. Keep only the evidence the model can realistically use.

That pipeline is less glamorous than "pick the best model," but it is where most quality gains come from.
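Step 2, merging lexical and dense results, is often done with reciprocal rank fusion, which needs only the two ranked ID lists, not comparable scores. A minimal version, assuming each retriever already returns IDs in rank order:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of doc IDs (e.g. lexical and dense results).

    Standard RRF: each list contributes 1 / (k + rank) per document,
    so documents ranked well by multiple retrievers rise to the top.
    k=60 is the commonly used default constant.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

The fused list is the candidate set you then hand to the reranker in step 3; permission and freshness filters (step 1) should already have been applied to both input rankings.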

Observability has to explain a bad answer in minutes

The question I ask after every release is simple: if a client sends a screenshot of a wrong answer, can I explain what happened without guessing?

To do that, I need a trace of the request, the retrieved chunks, the prompt, the model choice, and the final citations. Langfuse is useful here because it gives a concrete history instead of a vague feeling that the system "usually works."

The important part is not the tool name. The important part is being able to answer these four questions quickly:

  • What did the user ask?
  • What evidence was retrieved?
  • What instruction path was sent to the model?
  • Why did we allow that answer to ship?
If one of those is missing, debugging becomes opinion instead of engineering.
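Concretely, that means one structured record per request, with a field for each of those four questions. The shape below is illustrative (it is not the Langfuse schema); the point is that the record is complete and serializable.

```python
import json
import time


def build_trace(question, retrieved_chunks, prompt, model, citations, ship_reason):
    """One per-request trace record. Field names are illustrative."""
    return {
        "timestamp": time.time(),
        "question": question,                                   # what the user asked
        "retrieved": [c["doc_id"] for c in retrieved_chunks],   # what evidence came back
        "prompt": prompt,                                       # instruction path sent to the model
        "model": model,
        "citations": citations,
        "ship_reason": ship_reason,  # why this answer was allowed out, e.g. "citations cover the claim"
    }
```

When a screenshot of a wrong answer arrives, you look up this record and read off the failure: empty `retrieved` means an ingestion or filter problem, good `retrieved` with a bad answer means a prompt or model problem.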

Cost problems appear after the first internal success

The expensive moment is rarely the pilot. It is the month after people start trusting the feature and usage spreads across teams.

Three controls help me more than anything else:

  • Cache similar questions so repeat traffic does not trigger full generation every time
  • Route simpler requests to smaller models and keep the stronger model for ambiguous or multi-step questions
  • Track document hashes so embedding jobs only run on material changes

A RAG system does not need to be cheap at all costs. It needs to have a cost profile you can explain before adoption accelerates.
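The first control can start as an exact-match cache on a normalized question. This is a deliberately minimal sketch: it catches only trivially repeated traffic, and a real deployment would add TTLs, invalidation on re-ingestion, and possibly embedding-based similarity for near-duplicates.

```python
import hashlib


class AnswerCache:
    """Exact-match cache keyed on a normalized question (illustrative, in-memory only)."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(question: str) -> str:
        # Normalize case and whitespace so trivial variants hit the same entry.
        normalized = " ".join(question.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, question):
        """Return a cached answer, or None on a miss."""
        return self._store.get(self._key(question))

    def put(self, question, answer):
        self._store[self._key(question)] = answer
```

Even this crude version makes repeat traffic visible: the hit rate tells you how much generation spend a smarter semantic cache would actually save.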

The week-one checklist I would repeat

  1. Pick one document family and make it work end to end.
  2. Add structured metadata before worrying about fancy prompts.
  3. Build a small evaluation set with expected evidence.
  4. Trace every answer and store citations.
  5. Add a refusal path for unsupported questions.
  6. Set a token budget and basic caching before launch.

That is enough to ship something serious. Most teams can add better ranking, smarter orchestration, and more model options later. They cannot recover as easily from a launch that burns trust on day three.

The useful RAG systems I have seen share the same property: they behave like boring software. They return grounded answers, expose their sources, and fail in understandable ways. That is a much better goal than sounding impressive in a demo.
