Every other week, a founder or CTO asks us the same question: "Should we fine-tune our own model or use RAG?" Roughly 80% of the time, the right answer is neither. The other 20% of the time, the choice matters, and choosing wrong is expensive. This is the framework we use to answer.
Start by asking what's actually broken
Before you compare techniques, define the failure mode. Most teams jump to "we need fine-tuning" without articulating what the base model is getting wrong. The diagnosis matters because different problems have different fixes.
There are really only four classes of LLM failure that need an architectural response:
- Knowledge gap — the model doesn't know facts specific to your domain or business
- Style or format gap — the model knows the answer but expresses it wrong (tone, structure, format)
- Reasoning gap — the model fails at multi-step logic specific to your problem
- Latency or cost gap — the model works, but it's too slow or too expensive at your volume
Match the diagnosis to the technique and you're 90% of the way to the right answer.
The decision tree
Walk through these six questions in order. The first "yes" tells you what to do next.
- Can a better prompt or example fix it? — try that first
- Is the answer in a document you have? — use RAG
- Is the gap in style, format or tone? — use few-shot or light fine-tune
- Does the model need a domain-specific reasoning pattern? — use fine-tuning
- Are you serving so much volume that cost-per-token is the killer? — distill to a smaller fine-tuned model
- Are you protecting sensitive data? — host an open-weight model, fine-tune optionally
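A minimal sketch of that walk in Python, assuming each question has already been answered as a plain yes or no. The field names and the fallback message are hypothetical labels chosen for illustration; only the order of the questions and the suggested actions come from the list above.

```python
from dataclasses import dataclass


@dataclass
class Diagnosis:
    """Yes/no answers to the six questions (field names are illustrative)."""
    prompt_can_fix: bool = False
    answer_in_documents: bool = False
    style_or_format_gap: bool = False
    needs_domain_reasoning: bool = False
    cost_dominated_by_volume: bool = False
    data_must_stay_private: bool = False


# Questions in the order given above, each paired with its suggested next step.
STEPS = [
    ("prompt_can_fix", "Try prompt engineering first"),
    ("answer_in_documents", "Use RAG"),
    ("style_or_format_gap", "Use few-shot prompting or a light fine-tune"),
    ("needs_domain_reasoning", "Use fine-tuning"),
    ("cost_dominated_by_volume", "Distill to a smaller fine-tuned model"),
    ("data_must_stay_private", "Host an open-weight model; fine-tune optionally"),
]


def recommend(diagnosis: Diagnosis) -> str:
    """Return the action attached to the first question answered 'yes'."""
    for field_name, action in STEPS:
        if getattr(diagnosis, field_name):
            return action
    # Assumption: the article does not say what to do when every answer is 'no'.
    return "No architectural change indicated; revisit the diagnosis"


# A knowledge gap and a style gap together: the earlier question wins.
print(recommend(Diagnosis(answer_in_documents=True, style_or_format_gap=True)))
```

Because the first "yes" wins, a team with both a knowledge gap and a style gap is pointed at RAG first and only revisits style once retrieval is in place.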
1. Prompt engineering first
Frontier models in 2026 are extraordinarily capable. Before you invest in either RAG or fine-tuning, spend two days on prompt engineering. Add structured examples. Add an explicit chain-of-thought scaffold. Add a self-critique step.
You'd be surprised how often a thoughtful prompt closes 70% of a perceived "gap." It's also the cheapest fix possible — measured in afternoons, not months.
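As a rough illustration, the three additions (structured examples, a chain-of-thought scaffold, a self-critique step) can be assembled into one prompt string. The function name, the wording of the instructions, and the "Final answer:" convention are assumptions made for this sketch, not a recommended template.

```python
def build_prompt(task: str, examples: list[tuple[str, str]], question: str) -> str:
    """Assemble a prompt with worked examples, a reasoning scaffold, and a self-check."""
    parts = [f"Task: {task}", ""]

    # Structured examples: each one shows an input and the output we want.
    for i, (example_input, example_output) in enumerate(examples, start=1):
        parts += [f"Example {i}", f"Input: {example_input}", f"Output: {example_output}", ""]

    parts += [
        # Chain-of-thought scaffold.
        "Work through the problem step by step before answering.",
        # Self-critique step.
        "Then review your draft for errors and fix any you find.",
        "Finish with a line starting with 'Final answer:'.",
        "",
        f"Input: {question}",
    ]
    return "\n".join(parts)


prompt = build_prompt(
    task="Classify the support ticket as 'billing', 'bug', or 'other'.",
    examples=[("I was charged twice this month.", "billing")],
    question="The export button does nothing when I click it.",
)
print(prompt)
```

Keeping the builder in a function makes it easy to store prompts in version control and change one element at a time while measuring the effect.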
2. When RAG is the right answer
Retrieval-augmented generation is the right tool when the failure is a knowledge gap — the model doesn't know your private documents, your product catalogue, your policies, your meeting transcripts.
RAG wins on three things:
- Freshness — your knowledge base changes weekly; updating a vector index is trivial, retraining a model is not
- Provenance — every answer cites the source document, which is non-negotiable in regulated industries
- Cost — adding a document is essentially free; fine-tuning to learn the same facts costs serious money
RAG loses when the question requires synthesising across many sources, or when style and tone matter as much as facts. It also loses when retrieval is hard — large unstructured PDFs, tabular data, and audio transcripts all need specialised pre-processing to retrieve well.
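The shape of the pattern fits in a few lines. The sketch below uses naive word overlap as a stand-in for a real vector index, and the document IDs and texts are invented for illustration; the point is only that retrieved passages carry their source ID into the prompt, and that adding knowledge means adding an entry rather than retraining.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, docs: dict[str, str], k: int = 2) -> list[str]:
    """Rank document IDs by word overlap with the query (toy scoring, not a vector search)."""
    query_words = tokenize(query)
    ranked = sorted(docs, key=lambda doc_id: (-len(query_words & tokenize(docs[doc_id])), doc_id))
    return ranked[:k]


def build_rag_prompt(query: str, docs: dict[str, str], k: int = 2) -> str:
    """Put the top-k passages, tagged with their source IDs, in front of the question."""
    context = "\n".join(f"[{doc_id}] {docs[doc_id]}" for doc_id in retrieve(query, docs, k))
    return (
        "Answer using only the sources below and cite the source ID in brackets.\n\n"
        f"{context}\n\nQuestion: {query}"
    )


# Hypothetical knowledge base; updating it is a dictionary write, not a training run.
docs = {
    "refund-policy": "Refunds are issued within 14 days of purchase.",
    "shipping-policy": "Orders ship within 2 business days.",
    "holiday-hours": "Support is closed on public holidays.",
}

print(build_rag_prompt("How many days do refunds take?", docs, k=1))
```

A production system would replace `retrieve` with an embedding index or a hybrid retriever, but the provenance mechanism (source IDs travelling with the passages) stays the same.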
3. When few-shot or light fine-tuning is enough
Sometimes the model knows what to say but says it wrong. The CFO wants quarterly summaries in a very specific format. The legal team wants memos that follow a fixed structure. The customer support team wants the brand voice to be consistent across 10,000 chats a day.
These are style and format gaps. Start with few-shot prompting: put 5–20 high-quality examples in the prompt itself. If that gets you to 80% quality but you can't afford the prompt overhead at production volume, graduate to lightweight fine-tuning on a few hundred examples.
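To see when prompt overhead starts to bite, it helps to put a number on it. The sketch below is back-of-envelope arithmetic only: the 150 tokens per example and the $3 per million input tokens are made-up figures, and it ignores any prompt-caching discount a provider may offer. The 10,000 calls a day comes from the support example above.

```python
def few_shot_overhead_per_day(
    num_examples: int,
    tokens_per_example: int,
    calls_per_day: int,
    usd_per_million_input_tokens: float,
) -> float:
    """Daily cost of the extra input tokens that few-shot examples add to every call."""
    overhead_tokens = num_examples * tokens_per_example * calls_per_day
    return overhead_tokens / 1_000_000 * usd_per_million_input_tokens


# Hypothetical inputs: 20 examples of ~150 tokens each, 10,000 chats a day,
# and an assumed price of $3 per million input tokens.
cost = few_shot_overhead_per_day(20, 150, 10_000, 3.0)
print(f"${cost:.2f} per day")  # prints "$90.00 per day"
```

Whether that figure justifies a fine-tuning project depends on your own prices and volumes, which is the comparison to run before graduating from few-shot.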
4. When real fine-tuning earns its keep
Full fine-tuning makes sense in a narrow band of cases:
- The model needs to apply a domain-specific reasoning pattern that's hard to describe in a prompt — medical coding, legal cite-checking, certain forms of financial classification
- You have thousands to tens of thousands of high-quality input-output pairs
- You've already exhausted prompt engineering and RAG
- The improvement matters in dollars, not vibes
If you can't tick all four boxes, you're not ready to fine-tune. The honest answer is "not yet" — and that's fine.
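The four boxes can be written down as an explicit gate. In this sketch the field names are illustrative and the 1,000-pair floor is an assumption standing in for "thousands"; set it to whatever your own data shows you need.

```python
from dataclasses import dataclass


@dataclass
class FineTuneCase:
    """The four criteria from the list above (field names are illustrative)."""
    needs_domain_reasoning: bool
    high_quality_pairs: int
    exhausted_prompting_and_rag: bool
    improvement_measured_in_dollars: bool


def unmet_criteria(case: FineTuneCase, min_pairs: int = 1_000) -> list[str]:
    """Return the boxes that are not ticked; an empty list means ready to fine-tune."""
    unmet = []
    if not case.needs_domain_reasoning:
        unmet.append("no domain-specific reasoning pattern identified")
    if case.high_quality_pairs < min_pairs:
        unmet.append(f"only {case.high_quality_pairs} high-quality pairs (need {min_pairs}+)")
    if not case.exhausted_prompting_and_rag:
        unmet.append("prompt engineering and RAG not exhausted")
    if not case.improvement_measured_in_dollars:
        unmet.append("improvement not tied to a dollar figure")
    return unmet


case = FineTuneCase(
    needs_domain_reasoning=True,
    high_quality_pairs=400,
    exhausted_prompting_and_rag=True,
    improvement_measured_in_dollars=False,
)
for reason in unmet_criteria(case):
    print("Not yet:", reason)
```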
5. When distillation matters
At very high volume (millions of inference calls a day), cost-per-token starts to dominate the business case. This is where distilling a smaller fine-tuned model from your frontier-model outputs pays off. You generate training data with the big model, fine-tune a 7B or 13B open-weight model on it, and run that model on your own infrastructure for roughly one-tenth the cost.
Don't start here. Start with the frontier model, prove the use case, then distill once you have the volume to justify the engineering investment.
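The volume argument is easiest to see as a break-even calculation. Every number below is hypothetical: the per-call price, the one-time engineering cost, and the 0.1 cost ratio (taken from the one-tenth figure above) should be replaced with your own. The ratio is assumed to include the cost of running your own infrastructure.

```python
def breakeven_days(
    calls_per_day: int,
    frontier_usd_per_call: float,
    distilled_cost_ratio: float,
    one_time_engineering_usd: float,
) -> float:
    """Days until the per-call savings repay the one-time distillation investment."""
    daily_saving = calls_per_day * frontier_usd_per_call * (1 - distilled_cost_ratio)
    if daily_saving <= 0:
        raise ValueError("No saving: the distilled model is not cheaper per call")
    return one_time_engineering_usd / daily_saving


# Hypothetical inputs: $0.002 per frontier call, distilled model at one-tenth the cost,
# $150,000 of one-time engineering work.
high_volume = breakeven_days(2_000_000, 0.002, 0.1, 150_000)
low_volume = breakeven_days(10_000, 0.002, 0.1, 150_000)
print(f"2M calls/day:  break-even after about {high_volume:.0f} days")
print(f"10k calls/day: break-even after about {low_volume:.0f} days")
```

With these made-up inputs, the high-volume case pays back in about six weeks, while the low-volume case takes more than twenty years, which is the arithmetic behind waiting for the volume before distilling.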
6. When privacy forces your hand
Some workloads can't leave your perimeter. Patient records, classified material, certain national-security categories. Here the choice isn't really about RAG vs. fine-tuning at all — it's about which open-weight base model to host. RAG is still usually the right augmentation pattern; fine-tuning becomes a question of whether the cost is justified, not whether it's the only option.
The most expensive mistake we see
The pattern is depressingly consistent. A team commits to a six-month fine-tuning project because "we're an AI company and we need our own model." They burn through engineering capacity, end up with a model that performs marginally better than a well-prompted GPT-4-class model on their evals, and then watch the next frontier release overtake their custom model within two quarters of shipping it.
Meanwhile, the team that started with prompt engineering and added RAG when they hit a knowledge gap is shipping. They iterate weekly. When the next frontier model lands, they get an immediate quality bump for free.
Fine-tuning is a bet on a moat that often doesn't exist. RAG is a bet on engineering discipline that almost always pays off.
What we actually recommend in 80% of cases
For most teams, the stack we recommend looks like:
- A strong frontier model accessed via API (Claude, GPT, or equivalent)
- Careful prompt engineering, with prompts versioned as code
- RAG over your private knowledge — hybrid retrieval (BM25 + vector + reranking) for quality
- Few-shot examples in the prompt for style and format alignment
- An evaluation harness that catches regressions before they ship
This stack handles the vast majority of business use cases. It's cheaper, faster to build, easier to iterate, and improves automatically as the underlying models get better. Fine-tuning is a powerful tool when you genuinely need it — but the bar for "need" is much higher than the bar for "want."
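For the hybrid retrieval item, one common way to merge a BM25 ranking with a vector ranking before reranking is reciprocal rank fusion. The sketch below assumes the two rankings have already been produced by separate retrievers; the document IDs are invented, `k=60` is a conventional default rather than a tuned value, and this is one possible fusion method, not necessarily the one any particular stack uses.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists: each document scores 1 / (k + rank) per list it appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical outputs of a keyword (BM25) retriever and a vector retriever.
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Documents that both retrievers rank well rise to the top, and the fused shortlist is what a reranker would then reorder.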
The buyer's checklist
Before you sign off on a fine-tuning project, ask your engineering team:
- What specific failure mode are we addressing — knowledge, style, reasoning, or cost?
- What does our current eval suite show is the actual gap?
- Have we tried prompt engineering with a strong frontier model first?
- Do we have at least 1,000 high-quality input-output pairs already?
- If the frontier model improves 30% over the next two quarters, do we still need this?
If the answer to any of those is "we're not sure," fine-tuning is the wrong move. Run the experiment with prompts and RAG first. The data you collect will sharpen the question — and might dissolve it entirely.
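Running that experiment needs very little machinery. The sketch below scores two candidate systems on the same cases and flags a regression; exact-match scoring, the stub "models", and the two-point tolerance are all simplifying assumptions, and a real harness would call your actual pipelines and use a task-appropriate metric.

```python
from typing import Callable


def pass_rate(model: Callable[[str], str], cases: list[tuple[str, str]]) -> float:
    """Fraction of cases where the model's output exactly matches the expected answer."""
    passed = sum(1 for prompt, expected in cases if model(prompt).strip() == expected)
    return passed / len(cases)


def has_regressed(baseline_rate: float, candidate_rate: float, max_drop: float = 0.02) -> bool:
    """True if the candidate scores more than max_drop below the baseline."""
    return candidate_rate < baseline_rate - max_drop


# Hypothetical eval cases and stub systems standing in for real pipelines.
cases = [("2+2", "4"), ("capital of France", "Paris")]
answers = {"2+2": "4", "capital of France": "Paris"}


def baseline(prompt: str) -> str:
    return answers[prompt]


def candidate(prompt: str) -> str:
    return "4"  # deliberately wrong on the second case


baseline_rate = pass_rate(baseline, cases)
candidate_rate = pass_rate(candidate, cases)
print(f"baseline {baseline_rate:.0%}, candidate {candidate_rate:.0%}")
print("regression" if has_regressed(baseline_rate, candidate_rate) else "ok to ship")
```

The same harness answers the checklist's second question: the difference between two pass rates is the measured gap.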