How is this different from hiring full-time engineers?

Full-time engineers are usually the right answer eventually. We're the bridge. You hire us when you need someone shipping by Monday and a job posting won't close for months. When you've hired the right team, we hand off everything. The code is already yours, the infra is already in your cloud, the runbook is already written. Working ourselves out of the job is the goal, and we don't mind when it happens.

What if our codebase is a mess?

Most are. The founders' decade-each in production engineering has touched most stack vintages still in use: Rails monoliths, 4-year-old Next.js apps with three router migrations, greenfield TypeScript. We adapt to your conventions, your CI, your branching model. We won't try to rewrite your stack to use our preferred one. That's a tell of an agency that's actually selling templates.

Do you really have no contracts longer than a month?

Real answer: our MSA is signed once, and engagements run on monthly purchase orders. You can pause or end at any time on 30 days' notice. That's not a fine-print clause, it's the operating model. We'd rather stay because we're earning it than because you signed a year of it.

What does 'senior engineer' actually mean here?

Founder-built. Senior operators with a decade in production engineering across AI, infrastructure, and platform work. There is no junior bench to hide behind. You get the people who'd actually be writing the architecture doc anywhere else. We don't post bios on the site because LinkedIn isn't the right hiring surface; we'd rather you meet whoever is staffed on the first call and decide from there.

How fast can you actually start?

Sprint engagements: typically 7 days from signature. Pod engagements: 14 days. We won't lie about availability. Founder-led means slots are real and limited. We'd rather tell you 'we can't take you for six weeks' than start late and miss the timeline.

We've been burned by agencies before. Why is this different?

Mostly because we don't try to be everything. Six services. Senior engineers only. Code in your repo. Monthly contracts. Weekly demos. Founder-led. No PMs in the loop, no offshore handoff, no proprietary platform. If you've been burned, you know what bit you. We've tried to make ourselves the opposite of that.

What does week one actually look like?

Day 1: kickoff call, Slack channel created, repo access exchanged, problem statement written and pinned. Day 2: scoping doc with the smallest shippable thing identified. Day 3–4: working prototype in a sandbox. Day 5: Loom walkthrough, demo on your calendar for next Friday. Week one is choreographed. The improvising starts in week two.

Do you sign NDAs? BAAs? SOC 2 vendor questionnaires?

Yes to all three. Vendor security questionnaires are turned around quickly because we are deliberately small and there's no committee to route around. BAAs are ready for legal review on day one for healthcare engagements. The goal is to be the easiest vendor your procurement team deals with this quarter.

May 14, 20267 min readfine-tuning · rag · models · ai-engineering

Fine-tuning answers a narrower question than you think

When a team says they want to fine-tune, the next question is usually 'to fix what?' The answers cluster, and most of them are not fine-tuning problems. Fine-tuning changes how a model behaves, not what it knows.

Written by the milebits founders.

OpenAI has started winding down self-serve fine-tuning, telling developers that newer base models follow instructions and formats well enough that prompting is usually cheaper and faster. You can read that as a business decision, and partly it is. You can also read it as one of the largest model providers conceding, out loud, what the order of operations should have been all along. Most of what teams reach for fine-tuning to fix is not a fine-tuning problem.

When a team tells us they want to fine-tune a model, the useful next question is not "on what data." It is "to fix what?"

The answers cluster into a small number of buckets. The model does not know enough about our domain. The model will not return the format we need. The model is too expensive or too slow at the quality we want. The model does not sound like us. Each of those is a real problem. Most of them are not fine-tuning problems, and reaching for fine-tuning first is how teams spend two months and a data-labelling budget to end up roughly where prompting and retrieval would have put them in a week.

Fine-tuning has a specific job. It is worth understanding the job precisely, because the gap between what teams expect it to do and what it actually does is where the disappointment lives.

Fine-tuning changes behaviour more than knowledge

This is the distinction that resolves most of the confusion. Fine-tuning reliably changes how a model behaves. It is an unreliable way to change what a model knows. It can push facts into the weights, but loosely, the way you half-remember something you read once and cannot quite cite. Teams reach for it to fix knowledge problems, which is why it so often disappoints.

If your complaint is "the model does not know our refund policy," fine-tuning is the wrong tool. You can fine-tune a model on a thousand examples of your refund policy and it will still confidently invent a refund policy when asked a question slightly outside the training distribution, because you did not give it the policy, you gave it a thousand examples of policy-shaped text. The knowledge is not reliably in there. It is statistically suggested, which is worse than absent because it looks like presence.

If your complaint is "the model does not answer in the terse, structured way our downstream system needs," that is a behaviour problem, and fine-tuning is genuinely good at it. If your complaint is "the model writes in a generic voice and we want it to sound like our brand," that is a behaviour problem too, and fine-tuning can move it.

The test is simple. Knowledge problems, where the issue is the model not having current or proprietary facts, are retrieval problems. Behaviour problems, where the issue is how the model uses facts it already has access to, are candidates for fine-tuning. Most teams that want to fine-tune have a knowledge problem wearing a behaviour problem's clothes.

The boring ladder

The order we work through, almost always, is prompting, then retrieval, then fine-tuning. Each rung is cheaper to build, cheaper to change, and cheaper to debug than the one after it. You climb only when the rung you are on demonstrably cannot reach.

Prompting first. A surprising amount of what teams want from fine-tuning is achievable with a well-constructed system prompt, a few good examples in context, and a clear output schema. Few-shot examples in the prompt move format and tone a long way. This is the rung most teams skip past too quickly because it feels too simple to be the answer.

Retrieval next. If the problem is knowledge, the fix is getting the right facts in front of the model at inference time, with citations, scoped and current. This is where most "the model does not know our domain" problems actually get solved. We have a whole note on why retrieval systems fail, and the short version is that good retrieval beats fine-tuning for knowledge, because retrieval can be updated by changing a row in a database and a fine-tune cannot.

Fine-tuning last, and only for what the first two rungs cannot do. By the time you get here, you should be able to state precisely what fine-tuning will fix that prompting and retrieval could not, and you should have the prompted-plus-retrieval baseline measured so you can prove the fine-tune actually beat it.

When fine-tuning genuinely earns its place

It does earn its place. The cases where we reach for it:

Format adherence at scale. If you need the model to return a specific structured format on every single call, with no preamble and no drift, and you are making millions of calls, a fine-tuned model holds the format more reliably than a prompted one and you stop paying for the few-shot examples on every request. The cost and reliability case compounds with volume.

Latency and cost through a smaller model. A small model fine-tuned on your specific task can match a large model's quality on that narrow task while running faster and cheaper. This is one of the strongest cases. You are not trying to make the model smarter. You are trying to make a cheap model competent at one thing so you can stop paying premium-model prices for it. The usual targets are the small open models, Qwen, Gemma, the Llama and Ministral families, tuned with LoRA or QLoRA, or a small hosted model where the provider still allows it. For a high-volume narrow task, the inference savings are real money, and this is the case that survives even as the providers pull back on general-purpose fine-tuning.

Consistent voice or domain register. If the output needs to consistently sound a specific way, in a register that is hard to fully specify in a prompt, fine-tuning on a corpus of the target style moves it further than prompt instructions do. Legal tone, clinical tone, a brand voice with specific rules about what it never says.

Notice what these have in common. They are all behaviour, format, or economics. None of them is "the model needs to know more facts." When the genuine need is one of these three, fine-tuning is the right call and we make it without hesitation.

The bill that arrives later

Fine-tuning has costs that do not show up in the demo and do show up six months later.

You inherit an eval burden. A fine-tuned model is a model you now own the quality of. Every base-model upgrade from the provider is one you cannot simply adopt, because your fine-tune was on the old base. You are holding a snapshot while the frontier moves. You need an eval harness specifically to know whether re-tuning on a newer base is worth it, which is more of the week-one eval work that teams already underinvest in.

You inherit a data pipeline. The training set that made the fine-tune good is now a thing you maintain. When your task shifts, the data has to be regenerated and the model retrained. This is a standing cost, not a one-time one.

You inherit version lock. The fine-tuned model is tied to a base model version. When the provider deprecates it, and they will, you are on a retraining treadmill on someone else's schedule. This is the same lock-in argument from the boring-stack note, applied to weights instead of frameworks. The OpenAI wind-down made this concrete for anyone who had fine-tuned on their platform: the base you tuned against is now on a clock you do not control, and the migration is yours to absorb.

None of these are reasons to never fine-tune. They are reasons to be sure the first two rungs of the ladder genuinely could not reach, because once you fine-tune you are maintaining a model, and maintaining a model is a heavier standing commitment than maintaining a prompt.

The reflex to fine-tune is understandable. It feels like the serious, sophisticated move, the thing real AI teams do, while prompting feels like something anyone could try. But the goal is not to look sophisticated. The goal is to ship a system that works and keeps working after the base models underneath it change three more times this year.

Most of the time, that system is a good prompt over good retrieval, and the fine-tune is a thing you considered, scoped honestly, and decided you did not need yet. When you do need it, you will be able to say exactly why, and you will have the baseline to prove it earned its place.

Considering fine-tuning? Get a second opinion on whether you need it.

Book a 20-min call

More field notes