RAG in Production: Chunking, Retrieval Quality, and the Problems the Demo Hides
A RAG assistant becomes a product the moment someone relies on it during real work. From that point on, the engineering around the model matters more than the demo.
This piece covers context window limits, poor chunking, retrieval evaluation, hybrid search, observability, and cost control: the engineering that keeps a RAG system working past the demo.
What matters
Index like a real pipeline
Stable IDs, source versions, and diff-based ingestion matter before prompt tuning does.
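As a rough illustration of diff-based ingestion, the sketch below assumes each chunk gets a stable ID built from its source and position, and that the index stores a content hash per chunk. The names `chunk_id`, `content_hash`, and `plan_ingestion`, as well as the `handbook` source, are hypothetical and not taken from any particular library; a real pipeline would likely also record the source version next to each hash.

```python
import hashlib


def chunk_id(source_id: str, position: int) -> str:
    """Stable ID: depends only on the source and the chunk's position."""
    return f"{source_id}#{position}"


def content_hash(text: str) -> str:
    """Fingerprint of the chunk text, used to detect changes."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_ingestion(indexed: dict[str, str], incoming: dict[str, str]) -> dict[str, list[str]]:
    """Compare stored hashes with fresh ones and return only the work needed.

    indexed:  chunk ID -> hash already in the index
    incoming: chunk ID -> hash of the freshly chunked source
    """
    return {
        # New or changed chunks must be re-embedded and written.
        "upsert": sorted(cid for cid, h in incoming.items() if indexed.get(cid) != h),
        # Chunks that no longer exist in the source must be removed.
        "delete": sorted(cid for cid in indexed if cid not in incoming),
    }


# Hypothetical example: one chunk edited, one chunk removed.
old_chunks = ["Intro", "Refund policy v1", "Contacts"]
new_chunks = ["Intro", "Refund policy v2"]
old = {chunk_id("handbook", i): content_hash(t) for i, t in enumerate(old_chunks)}
new = {chunk_id("handbook", i): content_hash(t) for i, t in enumerate(new_chunks)}
plan = plan_ingestion(old, new)
```

Here only `handbook#1` would be re-embedded and `handbook#2` deleted; the unchanged intro chunk is left alone. Position-based IDs are the simplest choice, but they shift when text is inserted mid-document, so heading-based or content-anchored IDs may suit some corpora better.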
Score retrieval separately
If the right evidence is not showing up, the model never had a fair chance.
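One way to score retrieval on its own, before any answer is generated, is to keep a small labeled set of questions with known relevant chunk IDs and compute recall@k and reciprocal rank. This is a minimal sketch under that assumption; the `eval_set` data and the function names are made up for illustration.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = relevant.intersection(retrieved[:k])
    return len(hits) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, cid in enumerate(retrieved, start=1):
        if cid in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical labeled queries: ranked retriever output plus ground-truth IDs.
eval_set = [
    {"retrieved": ["a", "b", "c"], "relevant": {"b"}},
    {"retrieved": ["d", "e", "f"], "relevant": {"x"}},
]

mean_recall = sum(recall_at_k(q["retrieved"], q["relevant"], 3) for q in eval_set) / len(eval_set)
mrr = sum(reciprocal_rank(q["retrieved"], q["relevant"]) for q in eval_set) / len(eval_set)
```

In this toy set the second query never retrieves its evidence, so no amount of prompt tuning would help it; the retrieval metrics expose that before the model is involved.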
Plan for refusal and tracing
A trustworthy assistant cites, abstains, and leaves behind a debuggable trail.
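The cite, abstain, and trace behavior can be sketched as a small decision step that runs before generation. This assumes the retriever returns `(chunk_id, score)` pairs with higher scores meaning better matches; the `MIN_SCORE` threshold, the `Trace` record, and the `decide` function are illustrative assumptions, not a prescribed design.

```python
import json
import uuid
from dataclasses import asdict, dataclass

MIN_SCORE = 0.5  # hypothetical cutoff; would need tuning against labeled data


@dataclass
class Trace:
    trace_id: str
    question: str
    retrieved: list[tuple[str, float]]  # everything the retriever returned
    cited: list[str]                    # chunk IDs strong enough to cite
    abstained: bool                     # True when no evidence cleared the bar


def decide(question: str, hits: list[tuple[str, float]]) -> Trace:
    """Keep only evidence above the threshold; abstain if nothing qualifies."""
    cited = [cid for cid, score in hits if score >= MIN_SCORE]
    return Trace(
        trace_id=uuid.uuid4().hex,
        question=question,
        retrieved=hits,
        cited=cited,
        abstained=not cited,
    )


trace = decide("What is the refund window?", [("handbook#1", 0.82), ("handbook#7", 0.31)])
log_line = json.dumps(asdict(trace))  # one JSON line per answer, ready for a log store
```

Because the trace keeps both what was retrieved and what was cited, a wrong or refused answer can later be traced back to either a retrieval miss or a threshold that was too strict.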
Working rule: Trace every answer
First fix: Ingestion quality
Failure to avoid: Confident guesswork