LLMs

Building a RAG Pipeline That Actually Works in Production

By Marius Joksas2026-05-144 min read

In production, latency distributions matter far more than averages. A pipeline whose mean response time looks acceptable can still feel sluggish if the 95th percentile drifts upward during peak hours. We instrument every stage with histograms so regressions surface immediately.

Next steps

Cost modelling is now part of our pre-merge checklist. Every PR that touches an LLM call includes an estimate of the per-request token spend and the expected daily volume. Surprises in the monthly invoice have dropped to nearly zero.

Evaluation suites grow faster than the codebase they cover. We treat them as first-class artefacts: versioned, reviewed, and regenerated on a schedule. The team that owns the model owns the eval set, not a separate QA group.

Next steps

Documentation written by the team that builds the system tends to be more useful than documentation written by anyone else. The trade-off is consistency, which we address with a shared style guide and a lightweight review process.

Retrieval quality is the lever that moves the most weight. No amount of prompt engineering compensates for a retriever that consistently surfaces the wrong passages. We spent two weeks tuning chunking and reranking before touching the prompt template.

Observability for agent runs is qualitatively different from traditional APM. A single user request can spawn dozens of tool calls, each with its own latency, cost, and failure mode. Flat traces become unreadable; we render them as collapsible trees.

Building a RAG Pipeline That Actually Works in Production

Next steps

Next steps

Related articles

Voice Agents That Actually Understand Lithuanian

Reducing Hallucinations With Citation-First Retrieval

n8n vs Airflow vs Temporal for AI Workflows