Streaming Inference at Scale With Kafka and Triton

By Tomas Vaitkus | 2026-04-22 | 12 min read

In production, latency distributions matter far more than averages. A pipeline whose mean response time looks acceptable can still feel sluggish if the 95th percentile drifts upward during peak hours. We instrument every stage with histograms so regressions surface immediately.
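To make the mean-versus-tail point concrete, here is a minimal sketch of a fixed-bucket latency histogram for a single stage. The class name `LatencyHistogram`, the bucket bounds, and the sample values are all illustrative assumptions, not our production instrumentation. The percentile is reported as the upper bound of the bucket it falls into, which is the usual coarse estimate for bucketed data.

```python
from bisect import bisect_left

# Hypothetical bucket upper bounds in milliseconds; anything above the
# last bound lands in an overflow bucket.
BUCKET_BOUNDS_MS = [5, 10, 25, 50, 100, 250, 500, 1000]


class LatencyHistogram:
    """Fixed-bucket latency histogram for one pipeline stage (illustrative)."""

    def __init__(self, bounds=BUCKET_BOUNDS_MS):
        self.bounds = list(bounds)
        # One counter per bound plus one overflow bucket.
        self.counts = [0] * (len(self.bounds) + 1)
        self.total = 0
        self.sum_ms = 0.0

    def observe(self, latency_ms):
        # bisect_left returns the first bucket whose upper bound is >= latency_ms.
        self.counts[bisect_left(self.bounds, latency_ms)] += 1
        self.total += 1
        self.sum_ms += latency_ms

    def mean(self):
        if self.total == 0:
            raise ValueError("no observations")
        return self.sum_ms / self.total

    def percentile(self, q):
        """Upper bound of the bucket that contains the q-th percentile."""
        if self.total == 0:
            raise ValueError("no observations")
        rank = q / 100 * self.total
        seen = 0
        for i, count in enumerate(self.counts):
            seen += count
            if seen >= rank:
                return self.bounds[i] if i < len(self.bounds) else float("inf")
        return float("inf")


hist = LatencyHistogram()
for _ in range(90):
    hist.observe(8)    # fast path
for _ in range(10):
    hist.observe(400)  # slow tail
```

With these made-up samples the mean is about 47 ms, the median bucket tops out at 10 ms, and the 95th percentile lands in the 500 ms bucket. A dashboard that shows only the mean would hide that tail.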

Trade-offs

Documentation written by the team that builds the system tends to be more useful than documentation written by anyone else. The trade-off is consistency, which we address with a shared style guide and a lightweight review process.

Observability for agent runs is qualitatively different from traditional APM. A single user request can spawn dozens of tool calls, each with its own latency, cost, and failure mode. Flat traces become unreadable; we render them as collapsible trees.
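The flat-to-tree step can be sketched in a few lines. The span fields (`id`, `parent`, `name`, `ms`) and the sample data below are hypothetical and stand in for whatever a tracing backend exports. The `max_depth` argument is a text-only approximation of collapsing: nodes at that depth with children are marked `+` and their subtree is hidden.

```python
from collections import defaultdict

# Hypothetical flat span records for one agent run.
spans = [
    {"id": "a", "parent": None, "name": "user_request", "ms": 1200},
    {"id": "b", "parent": "a", "name": "plan", "ms": 300},
    {"id": "c", "parent": "a", "name": "search_tool", "ms": 650},
    {"id": "d", "parent": "c", "name": "rerank", "ms": 120},
]


def build_children(spans):
    """Index spans by parent id so each node's children can be looked up."""
    children = defaultdict(list)
    for span in spans:
        children[span["parent"]].append(span)
    return children


def render(children, parent=None, depth=0, max_depth=None):
    """Return indented lines; subtrees at max_depth are collapsed to '+'."""
    lines = []
    for span in children.get(parent, []):
        has_kids = bool(children.get(span["id"]))
        collapsed = max_depth is not None and depth >= max_depth and has_kids
        marker = "+" if collapsed else "-"
        lines.append(f"{'  ' * depth}{marker} {span['name']} ({span['ms']} ms)")
        if not collapsed:
            lines.extend(render(children, span["id"], depth + 1, max_depth))
    return lines


tree = build_children(spans)
full_view = render(tree)
collapsed_view = render(tree, max_depth=1)
```

Printing `"\n".join(full_view)` shows every tool call nested under its parent, while `collapsed_view` hides the `rerank` call behind a `+ search_tool` line.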

What we changed

When the system is wrong, the user should be able to understand why in under thirty seconds. Citation links, confidence scores, and the exact retrieved passages are surfaced in the UI for every generated answer.
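One way to keep those three pieces together is to carry them in a single answer record from the retrieval layer to the UI. The sketch below is an assumption about shape only: the `Answer` and `Citation` names, their fields, and the example URL are hypothetical.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class Citation:
    url: str      # link shown next to the answer
    passage: str  # exact retrieved text, unmodified


@dataclass
class Answer:
    text: str
    confidence: float  # assumed to be normalised to the range 0.0 to 1.0
    citations: list[Citation] = field(default_factory=list)

    def to_payload(self):
        """Serialise for the UI, rejecting answers that cannot be explained."""
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0.0 and 1.0")
        if not self.citations:
            raise ValueError("an answer needs at least one citation")
        return asdict(self)


answer = Answer(
    text="Batching is enabled per model.",
    confidence=0.82,
    citations=[Citation(url="https://example.com/doc", passage="Batching is enabled per model.")],
)
payload = answer.to_payload()
```

Refusing to serialise an answer without citations is a design choice in this sketch: it makes the missing evidence a hard error upstream rather than an empty panel in the UI.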

Evaluation suites grow faster than the codebase they cover. We treat them as first-class artefacts: versioned, reviewed, and regenerated on a schedule. The team that owns the model owns the eval set, not a separate QA group.

Hardware is a moving target. The Jetson Orin we benchmarked in January was outperformed by an off-the-shelf mini-PC by August. We re-run the benchmark matrix every quarter and have stopped making long-term hardware commitments.
