The first production bug is usually not the model
Most RAG demos look better than the first live release. The reason is simple: demos are curated. Production traffic is not.
In a notebook, the same handful of documents are clean, recent, and already known by the person asking the question. In a real product, users ask with partial context, stale wording, and permissions you have to respect. If the answer fails, they do not say "interesting limitation of retrieval." They say the assistant is wrong.
That is why I treat a RAG feature as a data product with an LLM at the end, not as a chatbot with a vector database bolted on.
Ingestion is where trust starts
If source documents are inconsistent, retrieval quality degrades quietly. You can spend days tuning prompts and never recover from weak inputs.
The rules I keep from the start
- Every chunk gets a stable document ID, source version, language, and access scope (see the sketch after this list)
- Re-ingestion is diff-based, so unchanged files are not re-embedded
- Chunking follows the source format instead of a single global rule
- Deleted or superseded documents are explicitly retired from the index
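A minimal sketch of the first two rules, assuming a hypothetical `ChunkMetadata` record and a stored hash map; the names are illustrative, not from any particular library:

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class ChunkMetadata:
    # Stable identifiers so every chunk can be traced back to its source.
    document_id: str      # stable across re-ingestions
    source_version: str   # e.g. a git SHA or CMS revision
    language: str         # used later as a retrieval filter
    access_scope: str     # permission group allowed to see this chunk

def needs_reembedding(content: bytes,
                      stored_hashes: dict[str, str],
                      document_id: str) -> bool:
    """Diff-based re-ingestion: skip files whose content is unchanged."""
    digest = hashlib.sha256(content).hexdigest()
    if stored_hashes.get(document_id) == digest:
        return False  # unchanged, keep the existing embeddings
    stored_hashes[document_id] = digest
    return True
```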
Evaluate retrieval before you optimize the answer
Teams often watch the final answer and skip the step that decides whether a good answer was even possible.
The smallest useful evaluation set is not huge. I usually start with 30 to 50 real questions collected from the team that will own the feature. For each one, I want to know which document or section should have been retrieved. That is enough to detect obvious regressions when chunking, filters, or reranking change. A small harness for that check is sketched after the list of signals below.
The signals I care about
- Did the right evidence appear in the top results?
- Was the answer able to cite the relevant source?
- Did the system refuse to answer when the corpus did not support the request?
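A sketch of the first signal as an automated check. It assumes a `retrieve` function that returns ranked chunks carrying a `document_id`; both are placeholders for whatever your pipeline exposes:

```python
# eval_set: list of (question, expected_document_id) pairs
# collected from the team that owns the feature.

def retrieval_hit_rate(eval_set, retrieve, k=5):
    """Fraction of questions whose expected evidence shows up in the top k."""
    hits = 0
    for question, expected_doc in eval_set:
        results = retrieve(question, top_k=k)  # ranked chunk metadata
        if any(chunk.document_id == expected_doc for chunk in results):
            hits += 1
    return hits / len(eval_set)
```

Run it before and after every chunking, filter, or reranking change; a drop here tells you a good answer became impossible, regardless of how the prompt behaves.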
Retrieval needs two stages in practice
Pure vector search is rarely enough on its own. Exact terms matter: article numbers, product names, ticket IDs, internal acronyms. Dense retrieval helps with semantic closeness; lexical search helps with precision. I have had better results by combining both than by endlessly benchmarking embedding models.
A pattern that keeps paying off (sketched after the list):
- Filter by permissions, language, document type, and freshness.
- Retrieve a larger candidate set with hybrid search.
- Rerank the candidates before building the final context window.
- Keep only the evidence the model can realistically use.
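One way to combine the two result lists is reciprocal rank fusion, which merges rankings without having to reconcile dense and lexical score scales. The `dense_search`, `lexical_search`, and `rerank` callables below are assumptions standing in for your stack:

```python
def reciprocal_rank_fusion(dense_ids, lexical_ids, k=60):
    """Merge two ranked ID lists; higher fused score = earlier in both."""
    scores = {}
    for ranking in (dense_ids, lexical_ids):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

def retrieve(question, dense_search, lexical_search, rerank, filters, top_k=8):
    # 1. Both searches apply permission/language/type/freshness filters.
    dense = dense_search(question, filters, top_k=50)
    lexical = lexical_search(question, filters, top_k=50)
    # 2. Fuse into one candidate set, 3. rerank against the question.
    candidates = reciprocal_rank_fusion(dense, lexical)[:50]
    # 4. Keep only the evidence that fits the context budget.
    return rerank(question, candidates)[:top_k]
```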
Observability has to explain a bad answer in minutes
The question I ask after every release is simple: if a client sends a screenshot of a wrong answer, can I explain what happened without guessing?
To do that, I need a trace of the request, the retrieved chunks, the prompt, the model choice, and the final citations. Langfuse is useful here because it gives a concrete history instead of a vague feeling that the system "usually works."
The important part is not the tool name. The important part is being able to answer these four questions quickly (a tool-agnostic sketch of the trace follows the list):
- What did the user ask?
- What evidence was retrieved?
- What instruction path was sent to the model?
- Why did we allow that answer to ship?
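A minimal sketch of the record I would keep per answer. The field names are illustrative, not a Langfuse schema; any tracing backend can store something shaped like this:

```python
from dataclasses import dataclass, field

@dataclass
class AnswerTrace:
    """One record per answer: enough to reconstruct a bad response
    from a screenshot without guessing."""
    question: str                 # what the user asked
    retrieved_chunks: list[str]   # evidence IDs, in ranked order
    prompt: str                   # the exact instruction path sent
    model: str                    # which model served the request
    citations: list[str] = field(default_factory=list)  # sources shown
    # Records why the answer was allowed to ship.
    passed_groundedness_gate: bool = True
```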
Cost problems appear after the first internal success
The expensive moment is rarely the pilot. It is the month after people start trusting the feature and usage spreads across teams.
Three controls help me more than anything else; the first is sketched after the list:
- Cache similar questions so repeat traffic does not trigger full generation every time
- Route simpler requests to smaller models and keep the stronger model for ambiguous or multi-step questions
- Track document hashes so embedding jobs only run on material changes
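A sketch of the caching control as a plain exact-match cache. The class and methods are hypothetical; note the access scope in the key, so a cached answer never leaks across permission boundaries. Matching genuinely similar questions would need embedding similarity on top of this:

```python
import hashlib

class AnswerCache:
    """Exact-match cache keyed on a normalized question plus the
    caller's access scope."""

    def __init__(self):
        self._store: dict[str, str] = {}

    def _key(self, question: str, access_scope: str) -> str:
        normalized = " ".join(question.lower().split())
        return hashlib.sha256(f"{access_scope}:{normalized}".encode()).hexdigest()

    def get(self, question: str, access_scope: str):
        return self._store.get(self._key(question, access_scope))

    def put(self, question: str, access_scope: str, answer: str) -> None:
        self._store[self._key(question, access_scope)] = answer
```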
The week-one checklist I would repeat
- Pick one document family and make it work end to end.
- Add structured metadata before worrying about fancy prompts.
- Build a small evaluation set with expected evidence.
- Trace every answer and store citations.
- Add a refusal path for unsupported questions (sketched below).
- Set a token budget and basic caching before launch.
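The refusal path from the checklist, as a sketch. The `score` attribute on chunks and the `min_score` threshold are assumptions; the threshold is a placeholder to calibrate against your own evaluation set:

```python
REFUSAL = "I can't answer that from the current document base."

def answer_or_refuse(question, retrieve, generate, min_score=0.35):
    """Refuse when retrieval finds nothing strong enough to ground an answer."""
    chunks = retrieve(question)
    if not chunks or max(c.score for c in chunks) < min_score:
        return REFUSAL, []  # no answer, no citations
    answer = generate(question, chunks)
    return answer, [c.document_id for c in chunks]
```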
The useful RAG systems I have seen share the same property: they behave like boring software. They return grounded answers, expose their sources, and fail in understandable ways. That is a much better goal than sounding impressive in a demo.