Forecasting Demand With Hybrid Statistical and ML Models
When the system is wrong, the user should be able to understand why in under thirty seconds. Citation links, confidence scores, and the exact retrieved passages are surfaced in the UI for every generated answer.
Results
Retrieval quality is the lever that moves the most weight. No amount of prompt engineering compensates for a retriever that consistently surfaces the wrong passages. We spent two weeks tuning chunking and reranking before touching the prompt template.
Hardware is a moving target. The Jetson Orin we benchmarked in January was outperformed by an off-the-shelf mini-PC by August. We re-run the benchmark matrix every quarter and have stopped making long-term hardware commitments.
Results
Documentation written by the team that builds the system tends to be more useful than documentation written by anyone else. The trade-off is consistency, which we address with a shared style guide and a lightweight review process.
Tool definitions should read like API documentation written for a careful junior engineer. The model behaves better when each parameter has a concrete example, a unit, and an explicit statement of what happens when the value is omitted.
The first version of this system was deliberately simple. We wanted a baseline that could be measured against, rather than an architecture that anticipated every possible failure mode. That decision paid off — most of the issues we eventually hit were unrelated to the ones we had originally feared.