How to Create a Custom AI Model with Fine-Tuning

I have been working on a project that requires an AI model to behave in a very specific and consistent way, and after weighing the options, I concluded that fine-tuning was the way forward. This post is the story of actually doing that: taking an open-weight model and turning it into something that behaves the way I need it to, end to end.

A word on that conclusion, because it is contested. There is a long-running debate about whether you should fine-tune a model at all or just write a better system prompt, and I am not going to try to settle it here. A lot of behavior people reach for fine-tuning to fix can be handled with prompting alone, and it is worth being honest with yourself about which camp your problem actually falls into before you spend the effort — fine-tuning is the heavier, slower, more expensive tool. This post assumes you have already done that research, weighed the tradeoffs, and, like me, concluded that fine-tuning is genuinely the right move for your use case. If you are still deciding between the two, this is not the post for you yet. Come back once you have made the call.

What “your own model” actually means

Worth clearing up a common misconception before we start. When most people say they have “their own proprietary AI model,” they almost never mean they trained a model from scratch. Pretraining a foundation model from the ground up is a game played by a small handful of very well funded labs. It takes an enormous amount of compute, data, and money, and the number of organizations doing it for real is tiny.

What everyone else is doing, and what I did, is taking an open-weight model that one of those labs already pretrained and fine-tuning it to specialize in a specific behavior. That is still a real, defensible, proprietary artifact. It is just worth being honest about where the heavy lifting happened. The base model is the foundation someone else poured. Your fine-tune is the part that makes it yours.

Why I am in a position to do this

I built my own small AI lab at home, which means I have the hardware to do this in my office. Running the GPUs does not cost me anything beyond electricity, so I can iterate as many times as I want without watching a cloud meter tick. That puts me in a slightly unusual position compared to most people, who work exclusively with cloud models and pay per token or per GPU-hour. When experimentation is effectively free, you experiment a lot more, and that changes what is worth attempting.

If you do not have that kind of hardware at home, none of what follows is impossible. You can rent GPUs by the hour from services like Runpod and run the same basic workflow there. You will just be more deliberate about it, because every smoke test and every aborted run shows up on the bill.

Step 1: Choose a base model

The first real decision is which base model to build on. I pulled mine from Hugging Face, the primary hub for open-weight AI models. Think of it as GitHub for AI models.

Two things drove the choice. First, what I actually needed the model to do. Different base models have different strengths, and you want to start from one that is already reasonably good at the kind of task you are specializing it for. Fine-tuning nudges behavior; it does not perform miracles on a model that was never any good at the underlying task. Second, the VRAM available on my hardware. There is no point picking a base model that cannot fit on the GPUs you have, even after quantization. Your hardware sets a hard ceiling, so know your ceiling before you fall in love with a model.

Before committing, I used the base model extensively. I ran it through real work, got familiar with its quirks and its capabilities, and built up an honest feel for what it could and could not do out of the box. Only after that did I decide this was the model I wanted to base my custom model on. Do not skip this part. You are about to invest a lot of effort on top of this foundation, so you should know the foundation cold.

Step 2: Establish a baseline with promptfoo

You cannot tell whether fine-tuning changed anything if you never measured where you started. So before touching the weights, I built a fairly extensive baseline test suite using promptfoo.

The suite covers a range of scenarios, including the specific behaviors I eventually wanted to change, and I saved all of the base model’s outputs. That saved baseline is the reference point for everything that follows. When I later compare the fine-tuned model against it, I can point to concrete, measurable differences in behavior rather than relying on a vague sense that “it feels better now.”

A useful property here: at this stage, the base model should fail the tests that target the new behavior, because it has not been trained for that behavior yet. That failure is the thing you are going to fix, and watching it flip to a pass later is how you know the fine-tune worked.

Step 3: Define the behavior change you want

This is the part that happens in your head, not in a terminal, and it is the part most people rush. While I was building the baseline I spent real time thinking hard about the precise behavior I wanted my custom model to have. Not “make it better,” but specifically what it should do differently, in what situations, and what the output should look like.

The clearer you are here, the easier every downstream step becomes, because the behavior you define is exactly what your synthetic data and your evaluations are built around. A fuzzy goal produces fuzzy data, which produces a fuzzy model.

Step 4: Generate the synthetic data

Fine-tuning needs examples. The way you nudge a model toward a behavior is by showing it many examples of that behavior, and unless you happen to have a large hand-labeled dataset lying around, you are going to generate synthetic data.

This is another place where having my own hardware paid off. I used my own GPUs to generate the synthetic training data, which meant I skipped the step of asking a cloud model to produce it for me. No per-token cost, no rate limits, no data leaving the building.

Once the data was generated, I inspected it carefully. This matters. Synthetic data is only as good as the process that made it, and bad examples teach bad behavior just as efficiently as good examples teach good behavior. I went through the dataset to make sure the examples actually demonstrated the behavior I wanted, that the formatting was consistent, and that there was nothing weird or off in there that would quietly poison the model.

Step 5: Smoke test and back up the pristine model

Before the real run, I ran a small smoke test to confirm the whole pipeline actually works end to end. The point is not to train a good model; it is to catch the dumb stuff: a broken data path, a misconfigured trainer, a format mismatch. You want to find those problems on a thirty-second run, not three-quarters of the way through the real one.

I also made a clean copy of the original, pristine base model at this point. If anything goes sideways during fine-tuning, I want to be able to fall back to a known-good starting point without re-downloading massive files.

A necessary detour: quantization

Here is a wrinkle worth explaining, because it shaped the next several steps.

Up to this point I had been running a quantized version of the model for everyday use. A model’s weights are just numbers, and in their native form those numbers are usually stored at 16-bit floating point precision. Quantization compresses them down to lower precision, commonly 4-bit integers. The result is dramatically smaller and lighter to run. As a rough sense of scale, a 30 billion parameter model at 16-bit precision needs somewhere around 60GB of memory, while the same model quantized to 4-bit fits in roughly 15 to 20GB. That difference is exactly what lets a large model run on a single consumer GPU instead of a rack of data center cards. The cost is a small loss in quality, which for a well-quantized model is often barely noticeable.

The catch: you cannot train a model this size at full precision on a single consumer GPU. The full-precision weights alone are several times larger than the card’s memory before you have spent a single byte on the training itself. The technique that squares this circle is QLoRA, and it is the route I took. You take the full-precision base, quantize it down to 4-bit in memory for the duration of training, freeze it, and train a small adapter on top of that frozen 4-bit base (more on what an adapter is in Step 7). Because the giant base is frozen and compressed and only the tiny adapter is actually learning, the whole thing fits.

One nuance that matters later: the 4-bit quantization I do for training is not the same artifact I serve. After training I merge the adapter back into the full-precision base and quantize that merged model fresh. So quantization shows up twice, for two different reasons — memory headroom during training, and a small footprint during serving — and it is worth keeping the two straight.

Step 6: Keep the comparison honest

A trap lurks here. It is easy to end up comparing two models that differ in more than one way — different precision, different settings, different anything — and then you genuinely cannot tell which change caused which result. The cleanest before/after holds everything constant except the single thing you are testing: the fine-tune itself.

In practice that meant evaluating the base and the fine-tuned version at the same precision, with the adapter as the only difference between them — same quantization, same prompts, same settings, adapter on or off. When the only variable is the adapter, any change in the results is unambiguously the fine-tune, not a side effect of how the model happened to be loaded. Control your variables, or your evals are telling you a story you cannot actually trust.

Step 7: Fine-tune with LoRA

Now the actual fine-tuning. The adapter I keep referring to is a LoRA (Low-Rank Adaptation).

In plain terms, LoRA freezes the original model weights and trains a small set of additional low-rank matrices that get applied alongside them. Instead of updating the model’s billions of parameters, you train a tiny fraction of that, which slashes the memory and compute the run requires. It is what makes fine-tuning a large model feasible on a single GPU. When training is done, those learned matrices can be merged back into the base weights to produce one consolidated model.

I also kept the adapter deliberately small and aimed it narrowly. The change I wanted was behavioral, not a new body of knowledge, so I pointed the adapter at a focused part of the network and used a low rank rather than trying to rewrite the whole model. A bigger, broader adapter is more capacity to learn the target behavior, but also more capacity to forget everything else. Matching the size of the intervention to the size of the change is most of the art here.

The run itself is the anticlimactic part. The GPU works through the synthetic data, the loss comes down, and after about an hour it was finished. Your mileage will vary a lot here: how long it takes depends on how much synthetic data you have, your hardware, your software stack, the number of epochs, and your sequence lengths. An hour is what it was for me, not a universal number.

Step 8: Evaluate the fine-tuned model

This is the satisfying part. I ran the full promptfoo suite again, this time against the fine-tuned model.

Remember that the base model failed the tests targeting the new behavior, because it had never been trained for it. The fine-tuned model should pass those same tests with flying colors. That is the whole point of the exercise made visible: the model now behaves the way I trained it to, replying in the specific ways the synthetic data taught it. Same tests, opposite result, and the difference is your fine-tune.

If your evals do not flip the way you expected, this is where you find out, before you have shipped anything. Better here than in production.

There is a second thing I check for at this stage that is just as important as the new behavior: catastrophic forgetting. It is entirely possible to teach a model its new trick so aggressively that it gets measurably worse at everything else. For exactly this reason my eval suite includes a block of general-capability tests that have nothing to do with the behavior I was changing, and I confirmed those scores did not move. I also verified that the more specialized abilities I actually depend on — structured tool-calling, in my case — still worked afterward. Teaching the model one new thing is a bad trade if it quietly forgets the others.

Step 9: Merge and re-quantize

With a fine-tuned model that passes, the next job is getting it back onto my GPU for serving. At this stage I have a LoRA adapter sitting on top of the full-precision base, so I merge the adapter into the base weights to produce a single standalone model, then quantize that merged model back down to 4-bit so it fits on my hardware.

The merge is not purely a stylistic preference, and this is worth saying because most write-ups present it as one. Some serving stacks can load an adapter on top of a base at runtime, which is genuinely convenient when it works. It did not work for my base model’s architecture — the serving engine choked on applying a runtime adapter to a quantized version of it. Merging sidesteps the entire problem: once the adapter is baked into the weights, there is no adapter to apply at serve time, it is just an ordinary model. What looked optional turned out to be the only clean path.

Two things about that final re-quantization caught me off guard, and they are the kind of detail that does not show up until you are in it:

Quantization needs calibration data. Compressing weights well is not blind rounding; the process runs real examples through the model to learn how to round each layer with the least damage. I reused a sample of my own training data for this, so the calibration distribution matched what the model would actually see.
It nearly ran out of memory, in a way I did not predict. As the quantizer worked through the model layer by layer, the compressed output accumulated in system RAM and crept steadily upward until it was on track to exceed the machine’s physical memory near the very end — after I had already sunk most of an hour into the run. The fix was unglamorous: I added a large swap file as a buffer before it hit the wall. The finished layers were cold data that paged out cleanly, so it cost almost nothing in speed and saved the run. Watch your memory headroom on the quantize step, not just on training.
If your base is a Mixture-of-Experts model, do not quantize the whole thing uniformly. Mine was, and these models contain a small internal “router” that decides which expert handles each token. Crush the router to 4-bit along with everything else and it starts routing to the wrong experts — quietly degrading quality in a way that is annoying to diagnose. The router (and a couple of other sensitive pieces) has to stay at full precision while the bulk of the weights get compressed.

Step 10: Evaluate again, quantized

Quantization can subtly shift behavior, so I do not assume the quantized fine-tune behaves identically to the full-precision one. I run the promptfoo suite one more time against the freshly quantized model to confirm it still passes everything. If the quantized version had regressed on the behaviors I cared about, this is where I would catch it and reconsider the quantization settings.

It passed — with one honest caveat I think is worth stating, because the temptation is to round these results up to “perfect.” The new behavior holds rock-solid under the conditions the model actually runs in day to day. Push it into adversarial corners it was not really trained for, though, and you can still find the seams — a stray edge case where the old behavior leaks through. For my use case those corners do not occur in normal operation, so it is a known, bounded limitation rather than a blocker, and a future round of data could close it. But “passes the suite” and “flawless in every conceivable prompt” are not the same claim, and it is more useful to tell you which one is true.

This before-and-after-quantization check is cheap insurance regardless. Skipping it is how you end up surprised in production by a model that tested perfectly at full precision and then quietly degraded the moment you shrank it.

Step 11: Serve it

The last step is the payoff. I hooked the new quantized model into my pipeline through vLLM, and my proprietary AI model is now serving inference inside the system.

That is the full arc: pick a solid open-weight base, measure where it starts, decide precisely what you want it to do, teach it with synthetic data, train a LoRA adapter with QLoRA, evaluate, merge, quantize, evaluate again, and serve. None of the individual steps are exotic. The discipline is in the measurement around them: the baseline at the start, and the evaluations at every checkpoint.

If you are running your own home lab and want to compare notes on hardware, base models, or eval setups, get in touch.