Fine-Tuning vs Prompting: Choosing the Right Approach

By Marius Joksas · 2026-01-24 · 8 min read

Retrieval quality is the lever that moves the most weight. No amount of prompt engineering compensates for a retriever that consistently surfaces the wrong passages. We spent two weeks tuning chunking and reranking before touching the prompt template.
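One way to make "surfaces the wrong passages" measurable is a hit rate over a small labelled query set. The sketch below is an illustration, not the metric used in this project: the function name `hit_rate_at_k`, the passage IDs, and the two sample queries are all hypothetical.

```python
def hit_rate_at_k(retrieved, relevant, k):
    """Fraction of queries with at least one relevant passage in the top k.

    retrieved: dict mapping query -> ranked list of passage IDs
    relevant:  dict mapping query -> set of passage IDs judged relevant
    """
    if not relevant:
        raise ValueError("relevant must contain at least one query")
    hits = 0
    for query, gold in relevant.items():
        top_k = retrieved.get(query, [])[:k]
        if any(passage_id in gold for passage_id in top_k):
            hits += 1
    return hits / len(relevant)


# Hypothetical data: the retriever finds the right passage for q1 but not q2.
retrieved = {
    "q1": ["p3", "p1", "p9"],
    "q2": ["p4", "p5", "p6"],
}
relevant = {
    "q1": {"p1"},
    "q2": {"p7"},
}

hit_at_1 = hit_rate_at_k(retrieved, relevant, k=1)
hit_at_3 = hit_rate_at_k(retrieved, relevant, k=3)
```

Tracking a number like this before and after each chunking or reranking change shows whether the retriever improved independently of the prompt.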

Open questions

The first version of this system was deliberately simple. We wanted a baseline we could measure against, rather than an architecture that anticipated every possible failure mode. That decision paid off: most of the issues we eventually hit were unrelated to the ones we had originally feared.

In production, latency distributions matter far more than averages. A pipeline whose mean response time looks acceptable can still feel sluggish if the 95th percentile drifts upward during peak hours. We instrument every stage with histograms so regressions surface immediately.
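To illustrate why the mean can mislead, here is a minimal sketch that compares the mean with the median and the 95th percentile on made-up data. It computes a nearest-rank percentile from raw samples rather than from histogram buckets, and the `percentile` helper and the latency values are assumptions for the example, not measurements from our pipeline.

```python
import math
from statistics import mean


def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample with at least q% of values at or below it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: 90 fast requests and 10 slow ones.
latencies_ms = [200] * 90 + [3000] * 10

mean_ms = mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p95_ms = percentile(latencies_ms, 95)
```

On this data the mean sits well below the slow requests and the median hides them entirely, while the 95th percentile lands on the slow group, which is the number a user waiting on a slow request actually experiences.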

Results

Evaluation suites grow faster than the codebase they cover. We treat them as first-class artefacts: versioned, reviewed, and regenerated on a schedule. The team that owns the model owns the eval set, not a separate QA group.

Tool definitions should read like API documentation written for a careful junior engineer. The model behaves better when each parameter has a concrete example, a unit, and an explicit statement of what happens when the value is omitted.
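As a sketch of what that looks like, the definition below uses a JSON Schema style layout similar to what many function-calling APIs accept. The tool name `search_orders`, its parameters, and the default value are hypothetical, and the `parameters_missing_guidance` check is a crude string heuristic written for this example, not a real linter.

```python
# Hypothetical tool definition; every name and value here is illustrative.
search_orders_tool = {
    "name": "search_orders",
    "description": "Look up customer orders placed within a recent time window.",
    "parameters": {
        "type": "object",
        "properties": {
            "customer_id": {
                "type": "string",
                "description": "Internal customer identifier, e.g. 'cus_10482'. "
                               "Required; the call is rejected without it.",
            },
            "lookback_days": {
                "type": "integer",
                "description": "How far back to search, in days, e.g. 30. "
                               "If omitted, the search covers the last 7 days.",
            },
        },
        "required": ["customer_id"],
    },
}


def parameters_missing_guidance(tool):
    """Return parameter names whose description lacks an example or omission behaviour.

    Heuristic only: looks for 'e.g.' and, for optional parameters, the word 'omitted'.
    """
    params = tool["parameters"]
    required = set(params.get("required", []))
    missing = []
    for name, spec in params["properties"].items():
        text = spec.get("description", "")
        has_example = "e.g." in text
        covers_omission = name in required or "omitted" in text.lower()
        if not (has_example and covers_omission):
            missing.append(name)
    return missing


problems = parameters_missing_guidance(search_orders_tool)
```

A check of this kind can run in CI so that a new parameter without an example or a stated default is caught in review rather than in production.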

Cost modelling is now part of our pre-merge checklist. Every PR that touches an LLM call includes an estimate of the per-request token spend and the expected daily volume. Surprises in the monthly invoice have dropped to nearly zero.
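The estimate itself is simple arithmetic: token counts times per-token prices, times daily volume. The sketch below shows one way to structure it. The `CallEstimate` class, the token counts, the request volume, and the prices are all placeholder assumptions, so substitute your provider's current rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallEstimate:
    """Rough cost model for one LLM call site; all fields are estimates."""
    input_tokens: int               # average prompt tokens per request
    output_tokens: int              # average completion tokens per request
    requests_per_day: int           # expected daily volume
    usd_per_million_input: float    # placeholder price, not a real quote
    usd_per_million_output: float   # placeholder price, not a real quote

    def cost_per_request(self) -> float:
        total = (self.input_tokens * self.usd_per_million_input
                 + self.output_tokens * self.usd_per_million_output)
        return total / 1_000_000

    def cost_per_day(self) -> float:
        return self.cost_per_request() * self.requests_per_day


# Hypothetical numbers for a single call site.
estimate = CallEstimate(
    input_tokens=2_000,
    output_tokens=500,
    requests_per_day=10_000,
    usd_per_million_input=3.0,
    usd_per_million_output=15.0,
)

per_request_usd = estimate.cost_per_request()
per_day_usd = estimate.cost_per_day()
```

With these placeholder values the call costs about 1.35 cents per request and about 135 USD per day, which is the kind of figure a reviewer can sanity-check before merge.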
