Scraping at Scale Without Getting Blocked

By Marius Joksas2026-02-057 min read

Tool definitions should read like API documentation written for a careful junior engineer. The model behaves better when each parameter has a concrete example, a unit, and an explicit statement of what happens when the value is omitted.

Trade-offs

Cost modelling is now part of our pre-merge checklist. Every PR that touches an LLM call includes an estimate of the per-request token spend and the expected daily volume. Surprises in the monthly invoice have dropped to nearly zero.

Hardware is a moving target. The Jetson Orin we benchmarked in January was outperformed by an off-the-shelf mini-PC by August. We re-run the benchmark matrix every quarter and have stopped making long-term hardware commitments.

Next steps

When the system is wrong, the user should be able to understand why in under thirty seconds. Citation links, confidence scores, and the exact retrieved passages are surfaced in the UI for every generated answer.

In production, latency distributions matter far more than averages. A pipeline whose mean response time looks acceptable can still feel sluggish if the 95th percentile drifts upward during peak hours. We instrument every stage with histograms so regressions surface immediately.

Retrieval quality is the lever that moves the most weight. No amount of prompt engineering compensates for a retriever that consistently surfaces the wrong passages. We spent two weeks tuning chunking and reranking before touching the prompt template.

Scraping at Scale Without Getting Blocked

Trade-offs

Next steps

Related articles

Cost Control Strategies for High-Volume LLM Applications

Computer Vision for Transport Monitoring

Reducing Hallucinations With Citation-First Retrieval