Securing LLM Endpoints Against Prompt Injection

By Marius Joksas2026-03-204 min read

In production, latency distributions matter far more than averages. A pipeline whose mean response time looks acceptable can still feel sluggish if the 95th percentile drifts upward during peak hours. We instrument every stage with histograms so regressions surface immediately.

Trade-offs

Observability for agent runs is qualitatively different from traditional APM. A single user request can spawn dozens of tool calls, each with its own latency, cost, and failure mode. Flat traces become unreadable; we render them as collapsible trees.

When the system is wrong, the user should be able to understand why in under thirty seconds. Citation links, confidence scores, and the exact retrieved passages are surfaced in the UI for every generated answer.

Background

Retrieval quality is the lever that moves the most weight. No amount of prompt engineering compensates for a retriever that consistently surfaces the wrong passages. We spent two weeks tuning chunking and reranking before touching the prompt template.

The first version of this system was deliberately simple. We wanted a baseline that could be measured against, rather than an architecture that anticipated every possible failure mode. That decision paid off — most of the issues we eventually hit were unrelated to the ones we had originally feared.

Documentation written by the team that builds the system tends to be more useful than documentation written by anyone else. The trade-off is consistency, which we address with a shared style guide and a lightweight review process.

Securing LLM Endpoints Against Prompt Injection

Trade-offs

Background

Related articles

Building Internal Copilots Without Leaking Confidential Data

Real-World Latency Numbers for GPT-5, Claude, and Gemini

Real-Time Anomaly Detection on Manufacturing Lines