RAG

Lessons From Shipping 40 AI Agents in 18 Months

By Marius Joksas2026-04-2110 min read

Retrieval quality is the lever that moves the most weight. No amount of prompt engineering compensates for a retriever that consistently surfaces the wrong passages. We spent two weeks tuning chunking and reranking before touching the prompt template.

Results

Observability for agent runs is qualitatively different from traditional APM. A single user request can spawn dozens of tool calls, each with its own latency, cost, and failure mode. Flat traces become unreadable; we render them as collapsible trees.

Tool definitions should read like API documentation written for a careful junior engineer. The model behaves better when each parameter has a concrete example, a unit, and an explicit statement of what happens when the value is omitted.

What we changed

Hardware is a moving target. The Jetson Orin we benchmarked in January was outperformed by an off-the-shelf mini-PC by August. We re-run the benchmark matrix every quarter and have stopped making long-term hardware commitments.

In production, latency distributions matter far more than averages. A pipeline whose mean response time looks acceptable can still feel sluggish if the 95th percentile drifts upward during peak hours. We instrument every stage with histograms so regressions surface immediately.

The first version of this system was deliberately simple. We wanted a baseline that could be measured against, rather than an architecture that anticipated every possible failure mode. That decision paid off — most of the issues we eventually hit were unrelated to the ones we had originally feared.

Lessons From Shipping 40 AI Agents in 18 Months

Results

What we changed

Related articles

Building a Compliance-Ready Audit Trail for AI Decisions

Voice Agents That Actually Understand Lithuanian

Securing LLM Endpoints Against Prompt Injection