RAG vs. fine-tuning: a buyer's decision tree

Every other week, a founder or CTO asks us the same question: "Should we fine-tune our own model or use RAG?" Roughly 80% of the time, the right answer is neither. The other 20% of the time, it matters which one you pick, and the cost of choosing wrong is large. This is the framework we use to answer.

Start by asking what's actually broken

Before you compare techniques, define the failure mode. Most teams jump to "we need fine-tuning" without articulating what the base model is getting wrong. The diagnosis matters because different problems have different fixes.

There are really only four classes of LLM failure that need an architectural response:

Knowledge gap — the model doesn't know facts specific to your domain or business
Style or format gap — the model knows the answer but expresses it wrong (tone, structure, format)
Reasoning gap — the model fails at multi-step logic specific to your problem
Latency or cost gap — the model works, but it's too slow or too expensive at your volume

Match the diagnosis to the technique and you're 90% of the way to the right answer.

The decision tree

Walk through these six questions in order. The first "yes" tells you what to do next.

The 6-question decision tree

Can a better prompt or example fix it? — try that first
Is the answer in a document you have? — use RAG
Is the gap in style, format or tone? — use few-shot or light fine-tune
Does the model need a domain-specific reasoning pattern? — use fine-tuning
Are you serving so much volume that cost-per-token is the killer? — distill to a smaller fine-tuned model
Are you protecting sensitive data? — host an open-weight model, fine-tune optionally
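
For readers who think in code, the six questions can be written as a first-match lookup. This is a minimal sketch, not a tool we ship: the `Diagnosis` field names and the fallback message are our own hypothetical labels, and the recommendations are copied from the list above.

```python
from dataclasses import dataclass


@dataclass
class Diagnosis:
    """Yes/no answers to the six questions (field names are illustrative)."""
    prompt_can_fix: bool = False
    answer_in_documents: bool = False
    style_or_format_gap: bool = False
    needs_domain_reasoning: bool = False
    cost_per_token_dominates: bool = False
    data_must_stay_private: bool = False


# (question, recommendation) pairs in the order the tree asks them.
TREE = [
    ("prompt_can_fix", "try a better prompt or examples first"),
    ("answer_in_documents", "use RAG"),
    ("style_or_format_gap", "use few-shot prompting or a light fine-tune"),
    ("needs_domain_reasoning", "use fine-tuning"),
    ("cost_per_token_dominates", "distill to a smaller fine-tuned model"),
    ("data_must_stay_private", "host an open-weight model; fine-tuning is optional"),
]


def recommend(diagnosis: Diagnosis) -> str:
    """Return the recommendation attached to the first question answered yes."""
    for field_name, action in TREE:
        if getattr(diagnosis, field_name):
            return action
    # Assumption: the article does not say what to do when every answer is no.
    return "no architectural change indicated; revisit the diagnosis"


# Two yes answers: the earlier question in the list wins.
print(recommend(Diagnosis(answer_in_documents=True, needs_domain_reasoning=True)))
# prints: use RAG
```

The ordering is the point: a knowledge gap that RAG can close takes priority over a fine-tuning project, even when both questions get a yes.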

1. Prompt engineering first

Frontier models in 2026 are extraordinarily capable. Before you invest in either RAG or fine-tuning, spend two days on prompt engineering. Add structured examples. Add an explicit chain-of-thought scaffold. Add a self-critique step.

You'd be surprised how often a thoughtful prompt closes 70% of a perceived "gap." It's also the cheapest fix possible — measured in afternoons, not months.
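
As a rough illustration of those three additions, here is a sketch of a prompt builder that layers structured examples, a step-by-step scaffold, and a self-critique instruction. The function name, the section labels, and the exact wording of the instructions are assumptions for illustration; treat them as a starting point to tune against your own evals.

```python
def build_prompt(task: str, examples: list[tuple[str, str]], question: str) -> str:
    """Assemble a prompt from a task description, worked examples, and the new input."""
    parts = [f"Task: {task}", ""]

    # Structured examples: each one shows an input and the output we want.
    for i, (example_input, example_output) in enumerate(examples, start=1):
        parts += [f"Example {i}", f"Input: {example_input}", f"Output: {example_output}", ""]

    # Chain-of-thought scaffold followed by a self-critique step.
    parts += [
        "Think through the problem step by step before answering.",
        "Then review your draft for errors and fix any you find.",
        "",
        f"Input: {question}",
        "Output:",
    ]
    return "\n".join(parts)


prompt = build_prompt(
    task="Classify the sentiment of a support ticket as positive or negative.",
    examples=[("The new dashboard is great.", "positive")],
    question="The export button has been broken for a week.",
)
print(prompt)
```

Keeping the builder in a function also makes it easy to version prompts alongside the rest of your code.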

2. When RAG is the right answer

Retrieval-augmented generation is the right tool when the failure is a knowledge gap — the model doesn't know your private documents, your product catalogue, your policies, your meeting transcripts.

RAG wins on three things:

Freshness — your knowledge base changes weekly; updating a vector index is trivial, retraining a model is not
Provenance — every answer cites the source document, which is non-negotiable in regulated industries
Cost — adding a document is essentially free; fine-tuning to learn the same facts costs serious money

RAG loses when the question requires synthesising across many sources, or when style and tone matter as much as facts. It also loses when retrieval is hard — large unstructured PDFs, tabular data, and audio transcripts all need specialised pre-processing to retrieve well.
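
To make the provenance point concrete, here is a toy sketch of the RAG loop: retrieve the most relevant documents, then build a prompt that carries each source id so the answer can cite it. The word-overlap scoring is a deliberate stand-in for a real vector index, and the function names and document ids are hypothetical.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase word set; a crude stand-in for an embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, docs: dict[str, str], k: int = 2) -> list[str]:
    """Return up to k document ids ranked by word overlap with the query."""
    query_tokens = tokens(query)
    scored = [(len(query_tokens & tokens(body)), doc_id) for doc_id, body in docs.items()]
    # Highest overlap first; ties broken by id so the result is deterministic.
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in ranked if score > 0][:k]


def build_rag_prompt(query: str, docs: dict[str, str], k: int = 2) -> str:
    """Put the retrieved passages, tagged with their ids, in front of the question."""
    context = "\n".join(f"[{doc_id}] {docs[doc_id]}" for doc_id in retrieve(query, docs, k))
    return (
        "Answer using only the sources below and cite the source id in brackets.\n"
        f"{context}\n"
        f"Question: {query}"
    )


docs = {
    "refund-policy": "Refunds are issued within 14 days of purchase.",
    "shipping": "Orders ship in two business days.",
}
print(retrieve("How many days for refunds?", docs, k=1))
# prints: ['refund-policy']
```

Adding a document here means adding one entry to `docs`, which is the freshness and cost argument in miniature.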

3. When few-shot or light fine-tuning is enough

Sometimes the model knows what to say but says it wrong. The CFO wants quarterly summaries in a very specific format. The legal team wants memos that follow a fixed structure. The customer support team wants the brand voice to be consistent across 10,000 chats a day.

These are style and format gaps. Start with few-shot prompting — putting 5–20 high-quality examples in the prompt itself. If that gets you to 80% quality and you can't afford the prompt overhead at production volume, graduate to lightweight fine-tuning on a few hundred examples.
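
Whether you can afford the prompt overhead is simple arithmetic. The sketch below uses the 20 examples and 10,000 chats a day mentioned above; the 300 tokens per example and the price of $3 per million input tokens are placeholder assumptions, so substitute your provider's actual rates.

```python
def daily_few_shot_overhead_usd(
    examples: int,
    tokens_per_example: int,
    requests_per_day: int,
    usd_per_million_input_tokens: float,
) -> float:
    """Extra daily spend caused by sending the few-shot examples with every request."""
    extra_tokens = examples * tokens_per_example * requests_per_day
    return extra_tokens / 1_000_000 * usd_per_million_input_tokens


# 20 examples x 300 tokens x 10,000 requests = 60 million extra input tokens a day.
cost = daily_few_shot_overhead_usd(
    examples=20,
    tokens_per_example=300,          # assumption
    requests_per_day=10_000,
    usd_per_million_input_tokens=3.0,  # assumption
)
print(cost)  # prints: 180.0
```

If that daily figure is small next to the cost of a fine-tuning project, staying with few-shot prompting is usually the easier call.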

4. When real fine-tuning earns its keep

Full fine-tuning makes sense in a narrow band of cases:

The model needs to apply a domain-specific reasoning pattern that's hard to describe in a prompt — medical coding, legal cite-checking, certain forms of financial classification
You have thousands to tens of thousands of high-quality input-output pairs
You've already exhausted prompt engineering and RAG
The improvement matters in dollars, not vibes

If you can't tick all four boxes, you're not ready to fine-tune. The honest answer is "not yet" — and that's fine.

5. When distillation matters

At very high volume (millions of inference calls a day), cost-per-token starts to dominate the business case. This is where distilling a smaller fine-tuned model from your frontier-model outputs pays off. You generate training data with the big model, fine-tune a 7B or 13B open-weight model on it, and run that model on your own infrastructure at a fraction of the per-token cost. One-tenth is a reasonable target, though the real figure depends on your hardware and utilisation.

Don't start here. Start with the frontier model, prove the use case, then distill once you have the volume to justify the engineering investment.
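
A quick break-even calculation shows when you have "the volume to justify the engineering investment." Every number in the example is a placeholder assumption: the call volume, tokens per call, both prices (the distilled price is set to one-tenth of the frontier price), and the one-off engineering budget.

```python
def breakeven_days(
    calls_per_day: int,
    tokens_per_call: int,
    frontier_usd_per_million: float,
    distilled_usd_per_million: float,
    one_off_engineering_usd: float,
) -> float:
    """Days of traffic needed for per-token savings to repay the distillation project."""
    daily_tokens_millions = calls_per_day * tokens_per_call / 1_000_000
    daily_saving = daily_tokens_millions * (frontier_usd_per_million - distilled_usd_per_million)
    if daily_saving <= 0:
        return float("inf")  # the smaller model is not cheaper, so it never pays back
    return one_off_engineering_usd / daily_saving


days = breakeven_days(
    calls_per_day=2_000_000,
    tokens_per_call=500,
    frontier_usd_per_million=5.0,
    distilled_usd_per_million=0.5,
    one_off_engineering_usd=450_000,
)
print(days)  # prints: 100.0
```

Run the same numbers at 20,000 calls a day and the payback stretches to 10,000 days, which is why this step comes last. The sketch also ignores the ongoing cost of hosting and retraining the smaller model, which pushes the break-even further out.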

6. When privacy forces your hand

Some workloads can't leave your perimeter. Patient records, classified material, certain national-security categories. Here the choice isn't really about RAG vs. fine-tuning at all — it's about which open-weight base model to host. RAG is still usually the right augmentation pattern; fine-tuning becomes a question of whether the cost is justified, not whether it's the only option.

The most expensive mistake we see

The pattern is depressingly consistent. A team commits to a six-month fine-tuning project because "we're an AI company and we need our own model." They burn through engineering capacity, end up with a model that performs marginally better on their evals than a well-prompted GPT-4-class model, and then watch the frontier model overtake their custom model within two quarters of its launch.

Meanwhile, the team that started with prompt engineering and added RAG when they hit a knowledge gap is shipping. They iterate weekly. When the next frontier model lands, they get an immediate quality bump for free.

Fine-tuning is a bet on a moat that often doesn't exist. RAG is a bet on engineering discipline that almost always pays off.

What we actually recommend in 80% of cases

For most teams, the stack we recommend looks like:

A strong frontier model accessed via API (Claude, GPT, or equivalent)
Careful prompt engineering, with prompts versioned as code
RAG over your private knowledge — hybrid retrieval (BM25 + vector + reranking) for quality
Few-shot examples in the prompt for style and format alignment
An evaluation harness that catches regressions before they ship

This stack handles the vast majority of business use cases. It's cheaper, faster to build, easier to iterate, and improves automatically as the underlying models get better. Fine-tuning is a powerful tool when you genuinely need it — but the bar for "need" is much higher than the bar for "want."
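
The hybrid retrieval in that stack needs some way to merge the BM25 ranking with the vector ranking before reranking. One common option is reciprocal rank fusion, sketched below. It is our choice for illustration rather than the only way to combine the two; the document ids are made up, and the constant 60 is the conventional default, not a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document ids into one.

    Each document scores 1 / (k + rank) in every list it appears in,
    and the scores are summed across lists.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_ranking = ["a", "b", "c"]      # keyword search results, best first
vector_ranking = ["b", "d", "a"]    # embedding search results, best first
print(reciprocal_rank_fusion([bm25_ranking, vector_ranking]))
# prints: ['b', 'a', 'd', 'c']
```

Documents that both retrievers rank highly rise to the top, and the fused list is what you would hand to the reranker.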

The buyer's checklist

Before you sign off on a fine-tuning project, ask your engineering team:

What specific failure mode are we addressing — knowledge, style, reasoning, or cost?
What does our current eval suite show is the actual gap?
Have we tried prompt engineering with a strong frontier model first?
Do we have at least 1,000 high-quality input-output pairs already?
If the frontier model improves 30% over the next two quarters, do we still need this?

If the answer to any of those is "we're not sure," fine-tuning is the wrong move. Run the experiment with prompts and RAG first. The data you collect will sharpen the question — and might dissolve it entirely.
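
If you do not yet have an eval suite to answer the second question, a very small one is enough to start. The sketch below scores a model on exact-match cases and blocks a change that regresses against a baseline. The function names, the exact-match metric, and the 0.01 tolerance are all illustrative assumptions, and `predict` stands in for whatever calls your model.

```python
from typing import Callable


def exact_match_rate(predict: Callable[[str], str], cases: list[tuple[str, str]]) -> float:
    """Fraction of cases where the model output equals the expected answer."""
    hits = sum(1 for prompt, expected in cases if predict(prompt).strip() == expected)
    return hits / len(cases)


def passes_gate(candidate_score: float, baseline_score: float, tolerance: float = 0.01) -> bool:
    """Allow a change only if it does not fall more than `tolerance` below the baseline."""
    return candidate_score >= baseline_score - tolerance


cases = [("2+2", "4"), ("capital of France", "Paris")]


def stub_model(prompt: str) -> str:
    # Stand-in for a real model call; it only knows one answer.
    return {"2+2": "4"}.get(prompt, "unknown")


score = exact_match_rate(stub_model, cases)
print(score)                     # prints: 0.5
print(passes_gate(score, 0.6))   # prints: False
```

The same harness measures the gap before a fine-tuning project and tells you whether a new prompt, a RAG change, or a new frontier model has already closed it.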
