Applied AI·11 min read·April 2026

RAG vs. fine-tuning: a buyer's decision tree

When to retrieve, when to fine-tune, and when to do nothing. A 6-question flowchart that non-technical buyers can use to make the right call — before spending six figures on the wrong one.

OlloSoft Engineering
Published April 22, 2026

Every other week, a founder or CTO asks us the same question: "Should we fine-tune our own model or use RAG?" Roughly 80% of the time, the right answer is neither. In the other 20%, the choice between them matters, and the cost of choosing wrong is large. This is the framework we use to answer.

Start by asking what's actually broken

Before you compare techniques, define the failure mode. Most teams jump to "we need fine-tuning" without articulating what the base model is getting wrong. The diagnosis matters because different problems have different fixes.

There are really only four classes of LLM failure that need an architectural response:

  1. Knowledge gap — the model doesn't know facts specific to your domain or business
  2. Style or format gap — the model knows the answer but expresses it wrong (tone, structure, format)
  3. Reasoning gap — the model fails at multi-step logic specific to your problem
  4. Latency or cost gap — the model works, but it's too slow or too expensive at your volume

Match the diagnosis to the technique and you're 90% of the way to the right answer.

The decision tree

Walk through these six questions in order. The first "yes" tells you what to do next.

The 6-question decision tree
  1. Can a better prompt or example fix it? — try that first
  2. Is the answer in a document you have? — use RAG
  3. Is the gap in style, format or tone? — use few-shot or light fine-tune
  4. Does the model need a domain-specific reasoning pattern? — use fine-tuning
  5. Are you serving so much volume that cost-per-token is the killer? — distill to a smaller fine-tuned model
  6. Are you protecting sensitive data? — host an open-weight model, fine-tune optionally
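
For teams that want the tree in a form they can paste into a planning doc or script, the six questions can be sketched as a first-match lookup. This is only an illustration: the `Diagnosis` field names, the `recommend` function, and the fallback text for "no to everything" are our own hypothetical choices, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class Diagnosis:
    """One yes/no answer per question, in the order the tree asks them."""
    prompt_fixes_it: bool = False
    answer_in_documents: bool = False
    style_or_format_gap: bool = False
    needs_domain_reasoning: bool = False
    cost_per_token_dominates: bool = False
    data_must_stay_private: bool = False


# (field name, recommendation), kept in the same order as the six questions.
QUESTIONS = [
    ("prompt_fixes_it", "Improve the prompt or add examples first"),
    ("answer_in_documents", "Use RAG"),
    ("style_or_format_gap", "Use few-shot prompting or a light fine-tune"),
    ("needs_domain_reasoning", "Use fine-tuning"),
    ("cost_per_token_dominates", "Distill to a smaller fine-tuned model"),
    ("data_must_stay_private", "Host an open-weight model, fine-tune optionally"),
]

# Assumption: if every answer is "no", nothing architectural is indicated.
FALLBACK = "Do nothing: keep the base model as it is"


def recommend(diagnosis: Diagnosis) -> str:
    """Return the recommendation attached to the first 'yes'."""
    for field_name, action in QUESTIONS:
        if getattr(diagnosis, field_name):
            return action
    return FALLBACK


# A knowledge gap plus a style gap: the earlier question wins.
print(recommend(Diagnosis(answer_in_documents=True, style_or_format_gap=True)))
```

Because the first "yes" wins, the ordering of `QUESTIONS` is the whole design: cheaper fixes are checked before more expensive ones.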

1. Prompt engineering first

Frontier models in 2026 are extraordinarily capable. Before you invest in either RAG or fine-tuning, spend two days on prompt engineering. Add structured examples. Add an explicit chain-of-thought scaffold. Add a self-critique step.

You'd be surprised how often a thoughtful prompt closes 70% of a perceived "gap." It's also the cheapest fix possible — measured in afternoons, not months.
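
To make the three additions concrete (structured examples, a chain-of-thought scaffold, a self-critique step), here is a minimal sketch of a prompt builder. The function name `build_prompt` and the exact instruction wording are illustrative assumptions; the right phrasing depends on your model and task, and the resulting string would be sent to whichever model API you already use.

```python
def build_prompt(task: str, examples: list[tuple[str, str]], question: str) -> str:
    """Assemble a prompt with worked examples, a reasoning scaffold,
    and a self-critique instruction, ending with the new input."""
    parts = [task.strip(), ""]

    # Structured examples: each one shows an input and the desired output.
    for i, (example_input, example_output) in enumerate(examples, start=1):
        parts += [f"Example {i}", f"Input: {example_input}", f"Output: {example_output}", ""]

    # Chain-of-thought scaffold followed by a self-critique step.
    parts += [
        "Work through the problem step by step before answering.",
        "Then review your draft: list any errors or unsupported claims and fix them.",
        "Finish with a line that starts with 'Final answer:'.",
        "",
        f"Input: {question}",
    ]
    return "\n".join(parts)


prompt = build_prompt(
    task="Classify each support ticket as 'billing', 'bug', or 'other'.",
    examples=[
        ("I was charged twice this month.", "billing"),
        ("The export button crashes the app.", "bug"),
    ],
    question="My invoice shows the wrong company name.",
)
print(prompt)
```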

2. When RAG is the right answer

Retrieval-augmented generation is the right tool when the failure is a knowledge gap — the model doesn't know your private documents, your product catalogue, your policies, your meeting transcripts.

RAG wins on three things:

RAG loses when the question requires synthesising across many sources, or when style and tone matter as much as facts. It also loses when retrieval is hard — large unstructured PDFs, tabular data, and audio transcripts all need specialised pre-processing to retrieve well.
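
The mechanics are simple enough to sketch in a few lines: score your documents against the question, keep the best matches, and put them in the prompt. The example below is a toy under stated assumptions. It uses plain word-overlap cosine similarity as a stand-in for a real retriever (embeddings, BM25, or both), and the names `retrieve` and `grounded_prompt` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return up to k documents with non-zero similarity, best first."""
    query_vec = Counter(tokenize(query))
    scored = [(cosine(query_vec, Counter(tokenize(doc))), doc) for doc in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:k] if score > 0]


def grounded_prompt(query: str, docs: list[str], k: int = 2) -> str:
    """Build a prompt that asks the model to answer from retrieved passages only."""
    passages = retrieve(query, docs, k)
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer using only the passages below and cite passage numbers. "
        "If the passages do not contain the answer, say so.\n\n"
        f"{context}\n\nQuestion: {query}"
    )


policies = [
    "Refunds are accepted within 30 days of purchase.",
    "Our office is closed on public holidays.",
    "Shipping takes five business days.",
]
print(grounded_prompt("Are refunds accepted within 30 days?", policies, k=1))
```

Exact word overlap is brittle ("refund" does not match "refunds"), which is one reason production retrieval needs more than this.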

3. When few-shot or light fine-tuning is enough

Sometimes the model knows what to say but says it wrong. The CFO wants quarterly summaries in a very specific format. The legal team wants memos that follow a fixed structure. The customer support team wants the brand voice to be consistent across 10,000 chats a day.

These are style and format gaps. Start with few-shot prompting: put 5–20 high-quality examples in the prompt itself. If that gets you to 80% quality but the extra prompt tokens cost too much at production volume, graduate to lightweight fine-tuning on a few hundred examples.
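
Whether the prompt overhead is affordable is simple arithmetic. The sketch below works it through for 20 examples at 10,000 requests a day. The 300 tokens per example and the $3 per million input tokens are placeholder assumptions, not quoted prices; substitute your own numbers.

```python
def few_shot_overhead(
    examples: int,
    tokens_per_example: int,
    requests_per_day: int,
    usd_per_million_input_tokens: float,
) -> dict[str, float]:
    """Extra input tokens and cost caused by repeating examples in every request."""
    extra_tokens_per_request = examples * tokens_per_example
    daily_tokens = extra_tokens_per_request * requests_per_day
    daily_cost = daily_tokens / 1_000_000 * usd_per_million_input_tokens
    return {
        "extra_tokens_per_request": extra_tokens_per_request,
        "daily_tokens": daily_tokens,
        "daily_cost_usd": daily_cost,
        "monthly_cost_usd": daily_cost * 30,  # assumes a 30-day month
    }


# Hypothetical inputs: 20 examples of ~300 tokens, 10,000 chats/day, $3 per 1M tokens.
overhead = few_shot_overhead(20, 300, 10_000, 3.0)
print(overhead)
# 20 * 300 = 6,000 extra tokens per request
# 6,000 * 10,000 = 60,000,000 extra tokens per day, i.e. $180/day at the assumed price
```

If the monthly figure is small next to the cost of building and maintaining a fine-tune, staying with few-shot prompting is usually the easier call.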

4. When real fine-tuning earns its keep

Full fine-tuning makes sense in a narrow band of cases:

If you can't tick all four boxes, you're not ready to fine-tune. The honest answer is "not yet" — and that's fine.

5. When distillation matters

At very high volume (millions of inference calls a day), cost-per-token starts to dominate the business case. This is where distilling a smaller fine-tuned model from your frontier-model outputs pays off. You generate training data with the big model, fine-tune a 7B or 13B open-weight model on it, and run that model on your own infrastructure for roughly one-tenth the cost.

Don't start here. Start with the frontier model, prove the use case, then distill once you have the volume to justify the engineering investment.
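
"Enough volume to justify the investment" can be checked with a break-even calculation. The sketch below assumes the distilled model runs at one-tenth of the frontier cost, as described above. Every other number (call volume, tokens per call, price per million tokens, one-time engineering cost) is a hypothetical placeholder.

```python
import math


def break_even_days(
    calls_per_day: int,
    tokens_per_call: int,
    frontier_usd_per_million_tokens: float,
    distilled_cost_ratio: float,
    one_time_engineering_usd: float,
) -> float:
    """Days of production traffic needed for distillation savings to repay
    the one-time cost. Returns infinity if the distilled model saves nothing."""
    daily_tokens = calls_per_day * tokens_per_call
    frontier_daily = daily_tokens / 1_000_000 * frontier_usd_per_million_tokens
    distilled_daily = frontier_daily * distilled_cost_ratio
    daily_savings = frontier_daily - distilled_daily
    if daily_savings <= 0:
        return math.inf
    return one_time_engineering_usd / daily_savings


# Hypothetical: 2M calls/day, 1,000 tokens each, $5 per 1M tokens,
# distilled model at 10% of that cost, $270,000 to build and validate it.
days = break_even_days(2_000_000, 1_000, 5.0, 0.1, 270_000)
print(f"Break-even after about {days:.0f} days")
```

At a hundredth of that volume, with the same assumed costs, the payback period stretches to years, which is the argument for proving the use case first.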

6. When privacy forces your hand

Some workloads can't leave your perimeter: patient records, classified material, certain national-security categories. Here the choice isn't really about RAG vs. fine-tuning at all; it's about which open-weight base model to host. RAG is still usually the right augmentation pattern; fine-tuning becomes a question of whether the cost is justified, not whether it's the only option.

The most expensive mistake we see

The pattern is depressingly consistent. A team commits to a six-month fine-tuning project because "we're an AI company and we need our own model." They burn through engineering capacity, end up with a model that performs marginally better than a well-prompted GPT-4-class model on their evals, and watch the frontier model improve past their custom model within two quarters of its release.

Meanwhile, the team that started with prompt engineering and added RAG when they hit a knowledge gap is shipping. They iterate weekly. When the next frontier model lands, they get an immediate quality bump for free.

Fine-tuning is a bet on a moat that often doesn't exist. RAG is a bet on engineering discipline that almost always pays off.

What we actually recommend in 80% of cases

For most teams, the stack we recommend looks like:

  1. A strong frontier model accessed via API (Claude, GPT, or equivalent)
  2. Careful prompt engineering, with prompts versioned as code
  3. RAG over your private knowledge — hybrid retrieval (BM25 + vector + reranking) for quality
  4. Few-shot examples in the prompt for style and format alignment
  5. An evaluation harness that catches regressions before they ship

This stack handles the vast majority of business use cases. It's cheaper, faster to build, easier to iterate, and improves automatically as the underlying models get better. Fine-tuning is a powerful tool when you genuinely need it — but the bar for "need" is much higher than the bar for "want."
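
The evaluation harness is the item teams most often skip, so here is a minimal sketch of one. It assumes your model is callable as a plain function from prompt to text. `pass_rate`, `no_regression`, the stub model, and the 2-point tolerance are all illustrative choices, not a standard.

```python
from typing import Callable

# A test case pairs a prompt with a check on the model's output.
Case = tuple[str, Callable[[str], bool]]


def pass_rate(model: Callable[[str], str], cases: list[Case]) -> float:
    """Fraction of cases whose output passes its check."""
    if not cases:
        raise ValueError("need at least one evaluation case")
    passed = sum(1 for prompt, check in cases if check(model(prompt)))
    return passed / len(cases)


def no_regression(candidate: float, baseline: float, tolerance: float = 0.02) -> bool:
    """True if the candidate is no more than `tolerance` below the baseline."""
    return candidate >= baseline - tolerance


# Stub standing in for a real model call (hypothetical canned answers).
def stub_model(prompt: str) -> str:
    canned = {"What is 2 + 2?": "The answer is 4.", "Capital of France?": "Paris"}
    return canned.get(prompt, "I don't know")


cases: list[Case] = [
    ("What is 2 + 2?", lambda out: "4" in out),
    ("Capital of France?", lambda out: "paris" in out.lower()),
    ("Refund window?", lambda out: "30 days" in out),
]

rate = pass_rate(stub_model, cases)
print(f"pass rate: {rate:.2f}, safe to ship: {no_regression(rate, baseline=0.90)}")
```

Run the same cases on every prompt change and every model upgrade, and store the last accepted pass rate as the baseline.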

The buyer's checklist

Before you sign off on a fine-tuning project, ask your engineering team:

If the answer to any of those is "we're not sure," fine-tuning is the wrong move. Run the experiment with prompts and RAG first. The data you collect will sharpen the question — and might dissolve it entirely.

Choosing your AI architecture?

We'll pressure-test your approach in a 30-minute call.

No slides, no upsell — just an honest opinion on whether you should retrieve, fine-tune, or do neither.
