A Practical Guide to Evaluating LLM Outputs

By Karolis Stankevicius · 2026-05-15 · 9 min read

Retrieval quality is the lever that moves the most weight. No amount of prompt engineering compensates for a retriever that consistently surfaces the wrong passages. We spent two weeks tuning chunking and reranking before touching the prompt template.
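One way to put a number on "surfaces the wrong passages" is recall@k over a small hand-labelled set of queries. The sketch below is an illustration only: the metric choice, the function names, and the passage IDs are assumptions, not something the original setup is known to have used.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant passages that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant) / len(relevant)


def mean_recall_at_k(examples, k):
    """Average recall@k over (retrieved_ids, relevant_ids) pairs."""
    if not examples:
        raise ValueError("examples must not be empty")
    scores = [recall_at_k(retrieved, relevant, k) for retrieved, relevant in examples]
    return sum(scores) / len(scores)


# Hypothetical labelled queries: (ranked retriever output, known-relevant passages).
labelled = [
    (["p1", "p7", "p3"], ["p1", "p3"]),
    (["p9", "p2", "p4"], ["p4", "p5"]),
]
score = mean_recall_at_k(labelled, k=3)
```

Tracking a score like this before and after each chunking or reranking change makes it possible to tell whether the retriever improved, independently of anything the prompt template does.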

What we changed

Hardware is a moving target. The Jetson Orin we benchmarked in January was outperformed by an off-the-shelf mini-PC by August. We re-run the benchmark matrix every quarter and have stopped making long-term hardware commitments.

The first version of this system was deliberately simple. We wanted a baseline we could measure against, rather than an architecture that anticipated every possible failure mode. That decision paid off: most of the issues we eventually hit were unrelated to the ones we had originally feared.

Open questions

In production, latency distributions matter far more than averages. A pipeline whose mean response time looks acceptable can still feel sluggish if the 95th percentile drifts upward during peak hours. We instrument every stage with histograms so regressions surface immediately.
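A minimal sketch of per-stage histogram instrumentation, assuming fixed bucket boundaries in milliseconds. The class name, the bucket values, and the sample latencies are hypothetical; a production setup would more likely use the histogram type of whatever metrics library is already in place.

```python
import bisect

# Hypothetical bucket upper bounds in milliseconds, chosen for illustration.
BUCKET_BOUNDS_MS = [50, 100, 250, 500, 1000, 2500, 5000]


class StageHistogram:
    """Fixed-bucket latency histogram for a single pipeline stage."""

    def __init__(self, bounds=BUCKET_BOUNDS_MS):
        self.bounds = list(bounds)
        # One counter per bucket, plus a final overflow bucket.
        self.counts = [0] * (len(self.bounds) + 1)
        self.total = 0
        self.sum_ms = 0.0

    def observe(self, latency_ms):
        # bisect_left picks the first bucket whose upper bound is >= latency_ms.
        self.counts[bisect.bisect_left(self.bounds, latency_ms)] += 1
        self.total += 1
        self.sum_ms += latency_ms

    def mean(self):
        return self.sum_ms / self.total if self.total else None

    def percentile_upper_bound(self, q):
        """Upper bound of the bucket that contains the q-th percentile."""
        if self.total == 0:
            return None
        running = 0
        for i, count in enumerate(self.counts):
            running += count
            # Integer comparison avoids floating-point rounding on the threshold.
            if running * 100 >= q * self.total:
                return self.bounds[i] if i < len(self.bounds) else float("inf")
        return float("inf")


# Made-up sample: 90 fast requests and 10 slow ones.
hist = StageHistogram()
for _ in range(90):
    hist.observe(80)
for _ in range(10):
    hist.observe(900)

mean_ms = hist.mean()                      # 162.0
p95_ms = hist.percentile_upper_bound(95)   # 1000, the bucket holding the slow tail
```

In this made-up sample the mean is 162 ms, which looks fine on a dashboard, while the 95th percentile falls in the 1000 ms bucket. That gap is the kind of drift an average hides.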

Cost modelling is now part of our pre-merge checklist. Every PR that touches an LLM call includes an estimate of the per-request token spend and the expected daily volume. Surprises in the monthly invoice have dropped to nearly zero.
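The estimate itself is simple arithmetic. The sketch below shows one way to compute it, assuming pricing is quoted per million tokens; the class name, field names, token counts, and prices are all placeholders rather than real rates or the format of our checklist.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallEstimate:
    """Back-of-the-envelope spend estimate for one LLM call site."""

    input_tokens: int              # expected prompt tokens per request
    output_tokens: int             # expected completion tokens per request
    requests_per_day: int          # expected daily volume
    input_price_per_mtok: float    # USD per million input tokens (placeholder)
    output_price_per_mtok: float   # USD per million output tokens (placeholder)

    def cost_per_request(self) -> float:
        total = (
            self.input_tokens * self.input_price_per_mtok
            + self.output_tokens * self.output_price_per_mtok
        )
        return total / 1_000_000

    def cost_per_day(self) -> float:
        return self.cost_per_request() * self.requests_per_day


# Placeholder numbers, not real provider prices.
estimate = CallEstimate(
    input_tokens=2_000,
    output_tokens=500,
    requests_per_day=10_000,
    input_price_per_mtok=1.0,
    output_price_per_mtok=4.0,
)
summary = (
    f"${estimate.cost_per_request():.4f} per request, "
    f"${estimate.cost_per_day():.2f} per day"
)
```

With these placeholder inputs the per-request cost is (2,000 × 1.0 + 500 × 4.0) / 1,000,000 = $0.004, or $40 per day at 10,000 requests. Pasting a line like `summary` into the PR description is enough for a reviewer to sanity-check the change.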

Documentation written by the team that builds the system tends to be more useful than documentation written by anyone else. The trade-off is consistency, which we address with a shared style guide and a lightweight review process.
