The Two Axes of Fine-Tuning: A Mental Model That Stops the Confusion

Published June 11, 2026 · 6 min read

Someone asked me last month: "LoRA or GRPO — which is better for our agent?"

It's a trick question, and the fact that it sounds reasonable is exactly the problem. LoRA and GRPO don't compete. Asking which is better is like asking whether an electric screwdriver is better than a Phillips bit: one is about how the work gets driven, the other about what it is shaped to fit, and you have to choose both.

Every fine-tuning term you've heard — SFT, LoRA, QLoRA, DPO, RLHF, PPO, GRPO, RLVR — fits on two independent axes. Once you see the axes, the whole space collapses into something you can reason about in a meeting without hand-waving.

The one idea holding all of it together

Here is the sentence I come back to every time:

Every fine-tuning method writes into the same transformer weights. They only differ in the learning signal they use.

Same student, same brain, different teacher. SFT says "copy this answer." DPO says "answer A is better than B." PPO/RLHF says "the reward for that answer was high." GRPO and RLVR say "the verifier says you got it right."

The weights being updated are identical. What changes is the instruction you give the optimizer about which direction is good. Hold that, and the two axes fall out naturally.

[Figure: The two independent axes of fine-tuning]

Axis B — how you update the weights

This axis is about parameter efficiency: when the gradient arrives, which weights actually move?

Full fine-tune — every weight is trainable. Maximum flexibility, maximum cost. For a 7B model with the Adam optimizer you need roughly 16 bytes per parameter (weights + gradients + optimizer states), so ~112GB of memory before you've even loaded a batch. You reach for this when you have a lot of data and a deep domain shift.
LoRA — freeze the big weight matrix, and learn a thin low-rank adapter beside it. Only the adapter (~1% of parameters) gets gradients and optimizer states. This is the sensible default for most work.
QLoRA — LoRA on top of a base that's been quantized to 4 bits. The frozen base shrinks ~4×, so a large model fits on a single GPU. This is what you pick when you're memory-bound.
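
The arithmetic behind those three options is worth running once. This is a rough sketch, not a profiler: it assumes decimal gigabytes, a 16-bit frozen base for plain LoRA, an adapter of exactly 1% of the parameters, and it ignores activations and quantization overhead. The function names are mine, for illustration only.

```python
GB = 1e9  # decimal gigabytes, matching the ~112GB figure above


def full_finetune_bytes(n_params, bytes_per_param=16):
    """Weights + gradients + Adam states at roughly 16 bytes per parameter."""
    return n_params * bytes_per_param


def adapter_finetune_bytes(n_params, base_bits=16, adapter_fraction=0.01,
                           bytes_per_trainable=16):
    """Frozen base stored at base_bits, plus full training state for the adapter only.

    Assumption: the adapter is ~1% of the base parameter count.
    """
    frozen = n_params * base_bits / 8
    trainable = n_params * adapter_fraction * bytes_per_trainable
    return frozen + trainable


n = 7e9  # a 7B model
full_gb = full_finetune_bytes(n) / GB                     # roughly 112
lora_gb = adapter_finetune_bytes(n, base_bits=16) / GB    # roughly 15
qlora_gb = adapter_finetune_bytes(n, base_bits=4) / GB    # roughly 4.6

print(f"full: {full_gb:.1f} GB, LoRA: {lora_gb:.1f} GB, QLoRA: {qlora_gb:.1f} GB")
```

Real numbers will be higher once activations and batch size enter the picture, but the ordering between the three options is what matters here.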

Notice these are three points on one dial. They answer "how expensive is the update," not "what is the model learning."
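
To make "a thin low-rank adapter beside it" concrete, here is a minimal pure-Python sketch of a LoRA-style linear layer. It is an illustration under assumptions, not any library's implementation: the class name, the zero initialization of B, and the alpha/rank scaling follow the common LoRA convention, and the layer sizes are arbitrary.

```python
import random


def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


class LoRALinear:
    """Hypothetical toy layer: y = W x + (alpha / rank) * B (A x)."""

    def __init__(self, d_in, d_out, rank, alpha=None, seed=0):
        rng = random.Random(seed)
        # Frozen base weight: never receives gradients.
        self.W = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
        # Trainable adapter: A is (rank x d_in), B is (d_out x rank).
        self.A = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]  # zero init: no change at step 0
        self.scale = (alpha if alpha is not None else rank) / rank

    def forward(self, x):
        base = matvec(self.W, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_fraction(self):
        frozen = len(self.W) * len(self.W[0])
        adapter = len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])
        return adapter / (frozen + adapter)


layer = LoRALinear(d_in=256, d_out=256, rank=2)
x = [1.0] * 256
frac = layer.trainable_fraction()
print(f"trainable fraction: {frac:.2%}")
```

Because B starts at zero, the layer initially behaves exactly like the frozen base, and only the small A and B matrices would ever need gradients and optimizer states.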

Axis A — what signal you train on

This axis is the learning signal — the teacher.

SFT (supervised fine-tuning) — imitate labelled answers. The model copies curated (prompt, response) examples. This is where almost every project starts.
DPO (direct preference optimization) — learn from preference pairs (this answer is better than that one), offline, with no separate reward model.
RLHF / PPO — train a reward model on human preferences, then optimize the policy against it with reinforcement learning. Powerful, but it keeps four models in memory at once: the policy, a frozen reference copy, the reward model, and the critic.
GRPO / RLVR — reinforcement learning with the expensive parts removed: GRPO drops the critic and scores answers against the average of a group; RLVR replaces the learned reward model with an automatic verifier (does the code pass? is the math right?).
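
The GRPO / RLVR pairing is easier to see with numbers. The sketch below is a simplified illustration under assumptions: the verifier is a hypothetical exact-match check, and the advantages are divided by the group's standard deviation, which is a common normalization that goes slightly beyond "score against the average".

```python
from statistics import mean, pstdev


def verifier(answer: str, expected: str) -> float:
    """Hypothetical RLVR-style check: 1.0 if the answer matches, else 0.0."""
    return 1.0 if answer.strip() == expected.strip() else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Score each sampled answer against its own group, with no critic model."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to the same prompt; the correct answer is "42".
samples = ["42", "41", "42 ", "7"]
rewards = [verifier(s, "42") for s in samples]
advantages = group_relative_advantages(rewards)

for s, r, a in zip(samples, rewards, advantages):
    print(f"{s!r:6} reward={r} advantage={a:+.3f}")
```

Answers that beat the group average get a positive advantage and are reinforced; the rest are pushed down. No learned reward model and no critic appear anywhere in the calculation.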

These are points on a different dial. They answer "what is the model learning," not "how expensive is the update."

Why this matters: you pick one from each

The two axes are independent. You don't choose "LoRA or GRPO." You choose one option from each axis and combine them:

SFT + QLoRA → a cheap instruction tune
GRPO + LoRA → cheap reinforcement learning
A typical agent recipe: SFT → DPO → GRPO, each stage usually running with LoRA or QLoRA underneath
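
If it helps to see the independence literally, the whole space is a cross product. This is only an illustrative enumeration; the list names and the staged pipeline at the end are my own example, not a prescribed recipe.

```python
from itertools import product

SIGNALS = ["SFT", "DPO", "RLHF/PPO", "GRPO/RLVR"]  # Axis A: what you train on
UPDATES = ["full", "LoRA", "QLoRA"]                 # Axis B: how you update

# Every pairing is a valid recipe, because the two choices do not constrain each other.
recipes = [f"{signal} + {update}" for signal, update in product(SIGNALS, UPDATES)]

# A staged pipeline just picks one pairing per stage (example values only).
pipeline = [("SFT", "QLoRA"), ("DPO", "QLoRA"), ("GRPO/RLVR", "LoRA")]

print(len(recipes), "possible pairings")
print(" -> ".join(f"{s} + {u}" for s, u in pipeline))
```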

So the honest answer to "LoRA or GRPO?" is: "Those aren't alternatives. GRPO decides the objective; LoRA decides parameter efficiency. I'd typically run GRPO with LoRA." That single reframing is the difference between sounding confused and sounding like you've actually shipped this.

The universal training loop

Here's the part that makes the mental model click. Every method on both axes runs the same loop. Only one box changes.

[Figure: The universal training loop every fine-tuning method shares]

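Batch the data, apply the chat template and tokenize, run a forward pass to get logits, compute the learning signal, backpropagate, let the optimizer step, repeat. Steps 1–3 and 5–7 are identical whether you're doing SFT or GRPO. (Online RL methods such as PPO and GRPO first sample the answers they are about to score, but the loop around that sampling is unchanged.)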
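
Here is the same loop as a toy you can run. It is a deliberately tiny sketch under heavy assumptions: the "model" is a single weight, the optimizer is plain SGD, and the two loss functions are generic stand-ins rather than real SFT or GRPO losses. The only point is that swapping the function passed in as step 4 leaves every other line untouched.

```python
def train(weight, batches, loss_and_grad, lr=0.1, steps=200):
    """Toy one-parameter model: prediction = weight * x."""
    for step in range(steps):
        x, target = batches[step % len(batches)]       # 1. batch the data
        # 2. a real LLM would apply the chat template and tokenize here
        prediction = weight * x                         # 3. forward pass
        _loss, dloss_dpred = loss_and_grad(prediction, target)  # 4. learning signal
        grad = dloss_dpred * x                          # 5. backpropagate (chain rule)
        weight -= lr * grad                             # 6. optimizer step (plain SGD)
    return weight                                       # 7. repeat until done


def squared_error(prediction, target):
    diff = prediction - target
    return diff * diff, 2 * diff


def huber(prediction, target, delta=1.0):
    diff = prediction - target
    if abs(diff) <= delta:
        return 0.5 * diff * diff, diff
    sign = 1.0 if diff > 0 else -1.0
    return delta * (abs(diff) - 0.5 * delta), delta * sign


data = [(1.0, 3.0), (2.0, 6.0)]  # consistent with weight = 3
w_squared = train(0.0, data, squared_error)
w_huber = train(0.0, data, huber)
print(w_squared, w_huber)
```

Both runs end up at the same weight because the data agree; what changed between them is only the instruction the optimizer received about which direction is good.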

The only thing that differs is box 4 — the loss:

Method	What box 4 computes	In plain words
SFT	cross-entropy vs the one correct answer	"copy this"
DPO	push chosen answer up, rejected down	"A beats B"
PPO / RLHF	maximise reward, critic estimates advantage	"reward was high"
GRPO / RLVR	score a group of answers, no critic	"verifier says correct"
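
The first two rows of that table fit in a few lines each. This is a sketch on plain floats, assuming you already have log-probabilities from the model; the function names and the beta value are illustrative, and the DPO expression is the standard negative log-sigmoid of the scaled margin against a frozen reference model.

```python
import math


def sft_loss(token_logprobs):
    """Cross-entropy: mean negative log-likelihood of the reference answer's tokens."""
    return -sum(token_logprobs) / len(token_logprobs)


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * margin), where each argument is a sequence log-probability.

    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return math.log1p(math.exp(-beta * margin))  # equals -log(sigmoid(beta * margin))


# SFT: the model gives each reference token probability 0.5.
print(sft_loss([math.log(0.5)] * 3))

# DPO: the policy equals the reference, so the margin is zero.
neutral = dpo_loss(-10.0, -10.0, -10.0, -10.0)
# DPO: the policy now prefers the chosen answer more than the reference does.
improved = dpo_loss(-8.0, -12.0, -10.0, -10.0)
print(neutral, improved)
```

The DPO loss falls as the policy separates the chosen answer from the rejected one relative to the reference, which is all "A beats B" means in practice.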

Axis B changes which weights box 5 sends gradients into (all of them, or just the LoRA adapters). Axis A changes what box 4 says is good. That's the entire space.

How I use this before a project

When a new fine-tuning task lands, I make two decisions, in order:

Axis A first — what's my teacher? Do I have labelled demonstrations (SFT)? Preference pairs (DPO)? A way to automatically verify correctness like tests or a SQL result (RLVR with GRPO)? The signal I can actually get usually decides this for me.
Axis B second — what's my budget? One GPU and a big model → QLoRA. Room to spare and no unusual constraints → LoRA, the default. Lots of data and a deep domain shift with real hardware → full fine-tune.
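
Written down as code, the two decisions look like this. Treat it as a hypothetical helper that mirrors the checklist above: the function name, the flags, and the tie-breaking order on Axis B (memory pressure wins) are my assumptions, not a rule.

```python
def plan_finetune(has_demonstrations=False, has_preference_pairs=False,
                  has_verifier=False, memory_bound=False,
                  deep_shift_with_data_and_hardware=False):
    """Return (stages, update): Axis A is decided first, Axis B second."""
    # Axis A: which teachers can I actually get?
    stages = []
    if has_demonstrations:
        stages.append("SFT")
    if has_preference_pairs:
        stages.append("DPO")
    if has_verifier:
        stages.append("GRPO/RLVR")

    # Axis B: what is my budget?
    if memory_bound:
        update = "QLoRA"
    elif deep_shift_with_data_and_hardware:
        update = "full fine-tune"
    else:
        update = "LoRA"
    return stages, update


stages, update = plan_finetune(has_demonstrations=True, has_verifier=True,
                               memory_bound=True)
print(stages, update)
```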

Two questions, and the method names stop being a fog. Everything else — response masking, rank, KL leashes, verifiers — is detail that lives inside one of these two choices.

In the rest of this series I go one level deeper into each piece, always tied back to real work: a tool-calling agent fine-tuned with SFT + QLoRA on Gemma 3 4B, and the reward design behind a multi-tool agent. If you've ever had to choose between prompting, RAG, and fine-tuning in the first place, that decision sits one level above this one — I covered it in the decision framework post.

The takeaway

Two axes, not one list. How you update the weights (full / LoRA / QLoRA) is independent of what signal you train on (SFT / DPO / RLHF / GRPO).
Same brain, different teacher. Every method runs the identical training loop; only the loss in box 4 changes.
Pick one from each. "LoRA + GRPO" is a normal recipe, not a contradiction.
Decide the signal before the budget. What you can teach the model is usually more constrained than how cheaply you can teach it.

Next in the series: Inside a Transformer Block — why where knowledge lives in the model decides which layers your adapter should touch.