LoRA & QLoRA: Fine-Tuning a Model That Doesn't Fit on Your GPU
LoRA & QLoRA: Fine-Tuning a Model That Doesn't Fit on Your GPU
The first time I tried to full fine-tune a 7B model, the math stopped me before the code did. With the Adam optimizer you need roughly 16 bytes per parameter — weights, gradients, and two optimizer states. That's about 112GB for a 7B model, before a single batch of data. My GPU had 48GB.
There are two ways out, and they stack. LoRA stops updating almost all the weights. QLoRA shrinks the frozen ones to 4 bits. Together they put a large model on a single card.
LoRA: train a thin detour, freeze the rest
The insight behind LoRA is that the change needed to adapt a model is "low-rank" — you don't need to move all of a giant weight matrix's numbers, just a small correction. So you freeze the big matrix and learn a thin detour beside it.
The input flows two ways: through the big frozen weight W (never trained), and through a small trainable detour made of two low-rank matrices, A then B. The outputs are added. Only A and B get gradients — roughly 1% of the parameters — so the optimizer state you have to hold collapses from "the whole model" to "a thin adapter."
One detail matters more than it looks: B is initialized to zero. At step 0 the detour contributes nothing, so training starts exactly at the pretrained model and drifts gently from there. (If you randomly initialized both A and B, you'd inject a large random perturbation at step 0 — effectively corrupting the pretrained model before training even begins, and the run becomes unstable. A random, B zero is the safe, smooth start.)
Sticky notes on a textbook
The analogy I use to explain this to non-specialists:
The frozen base is a printed textbook you're not allowed to rewrite. LoRA is sticky notes in the margins. The book stays untouched; your handful of sticky notes (the tiny A·B matrices) carry all the task-specific corrections. The final model is the book plus the notes. And rank r is how much you're allowed to write on each note — too little and you can't capture the change, too much and you're effectively rewriting the book (overfit, more memory).
The knobs
| Knob | What it controls | Typical | Trade-off |
|---|---|---|---|
r (rank) | capacity of the adapter | 16 or 32 (range 8–128) | ↑ more expressive but more VRAM & overfit; ↓ cheaper but may underfit |
lora_alpha | update scaling (effective ≈ α/r) | r or 2r | too high → unstable/over-strong updates |
lora_dropout | regularization on the adapter | 0 | ↑ helps tiny datasets overfit less |
target_modules | which matrices get an adapter | all 7 (attn + MLP) | attn-only = fewer params but underfits knowledge-heavy domains |
| learning rate | step size | ~2e-4 | too high → loss spikes; too low → underfits |
That target_modules row is the one from the transformer-block post: for domain knowledge, include the MLP (gate_proj, up_proj, down_proj), not just attention. And a common trap to be ready for — "you raised r from 16 to 256 on a 2,000-row dataset and quality dropped, why?" — is overfitting: rank 256 gives the adapter far more free parameters than 2,000 rows can constrain, so it memorizes phrasing and loses generalization. Match rank to data size and task complexity, not "bigger is better."
QLoRA: put the frozen base in 4 bits
LoRA shrinks the trainable footprint. The frozen base is still sitting in memory at full precision. QLoRA attacks that — it quantizes the frozen base to 4 bits so it takes ~4× less space, then runs LoRA on top. It's four parts, and interviewers probe all of them.
1 — NF4 (the big saving). The base weights are stored in a 4-bit format. The clever bit: NF4's 16 levels sit at the quantiles of a normal distribution. This trips people up, so say it precisely: NF4 does not "make the weights normal" — transformer weights are already approximately normal, and NF4 places its 16 levels to match that, with more levels where weights cluster (near zero) and fewer in the tails. That minimizes quantization error versus uniform spacing.
2 — Double Quantization. Weights are quantized in blocks, and each block needs a scale to reconstruct it. For a large model that's millions of scales — enough metadata to matter. So QLoRA quantizes the scales too. Think of zipping a folder where the file index itself also got compressed.
3 — LoRA adapters. The same low-rank adapters as before, kept in BF16/FP16 because adapter training needs higher precision for stable updates. They're already tiny, so this costs little — the big saving came from quantizing the base, not the adapter.
4 — Paged Optimizers. Optimizer state can spike during training (AdamW holds momentum and variance per parameter), and a spike past your GPU's memory is a CUDA OOM crash. Paged optimizers act like OS virtual memory: spill optimizer state to CPU RAM during spikes so training continues instead of dying.
The single most important QLoRA fact to state out loud: gradients flow only into the LoRA adapters. The 4-bit base stays frozen — it's dequantized on the fly purely to compute activations, never updated. A classic production bug is accidentally setting base_model.requires_grad_(True): now gradients are computed for all ~billions of base params instead of just the adapters → CUDA OOM, training crawls, optimizer memory explodes.
The QLoRA config in practice
from transformers import BitsAndBytesConfig bnb_config = BitsAndBytesConfig( load_in_4bit=True, # store the frozen base in 4-bit NF4 bnb_4bit_quant_type="nf4", # NF4 levels tuned for normal-distributed weights bnb_4bit_use_double_quant=True, # quantize the per-block scales too bnb_4bit_compute_dtype="bfloat16", # dequantize to bf16 for the actual math ) # optimizer: paged_adamw_8bit → survives memory spikes + 8-bit states save memory # the base stays frozen; only the LoRA A/B matrices are trainable
load_in_4bit saves >75% VRAM versus 16-bit at a small accuracy cost; the base must stay frozen so gradients flow only into A and B.
How I actually used this
For the tool-using assistant on Gemma 3 4B, QLoRA was the whole reason it fit comfortably and trained fast on a single GPU: the 4-bit NF4 base plus BF16 LoRA adapters on attention and MLP, with a paged optimizer so a mid-run memory spike didn't kill a long job. Rank stayed modest (matched to the dataset size, not cranked up), and the base stayed frozen — which, as a bonus, is also why LoRA resists catastrophic forgetting: the general knowledge lives in weights that can't be overwritten. I'll put the full pipeline together — data, training, and serving behind the agent — in the case-study post.
This all lives on Axis B from the pillar post: LoRA and QLoRA are how you update the weights, independent of what signal you train on. You can pair either with SFT, DPO, or GRPO.
The takeaway
- Full fine-tuning is memory-bound. ~16 bytes/param with Adam → ~112GB for 7B. LoRA and QLoRA exist to dodge that.
- LoRA trains a low-rank detour. Freeze
W, learnA·B(~1% of params),B=0for a safe start. Rank = capacity; match it to your data. - QLoRA = four pieces. NF4 4-bit base (big saving), double-quant the scales, BF16 adapters, paged optimizer for spikes.
- Gradients touch only the adapters. The 4-bit base is frozen and dequantized only to compute — which is also why LoRA resists catastrophic forgetting.
Next in the series: the case study that puts it all together — Fine-Tuning a Tool-Calling Agent: SFT + QLoRA on Gemma 3 4B.