Inside a Transformer Block: Why Where Knowledge Lives Decides Where You Fine-Tune
Here's a mistake I made early, and watched plenty of others make since: I set up a LoRA fine-tune, targeted q_proj, k_proj, v_proj, o_proj because that's what every tutorial showed, ran it on a specialized medical dataset — and the model came out fluent but no smarter about the domain. It phrased things nicely. It didn't know more.
The reason is structural. I was adapting the part of the transformer that decides what to look at, and leaving untouched the part that actually stores knowledge. To see why that matters, you have to open up a single transformer block.
A block has two machines, not one
People learn the four attention projections and assume that's the transformer. But every block in every modern model — Llama, Qwen, Mistral, Gemma — has two machines stacked together, and they do completely different jobs.
Attention (q_proj, k_proj, v_proj, o_proj) answers "which tokens should I look at?" It's routing and retrieval — it moves information between positions, discovering relationships and context. After a residual add, the result flows into the second machine.
The MLP (gate_proj, up_proj, down_proj) answers "what should I do with it?" This is computation and, crucially, knowledge storage. The facts and task-specific patterns the model has learned live largely in these weights.
The catchphrase I use to keep it straight: attention retrieves, the MLP remembers.
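To make the two-machine picture concrete, here is a minimal sketch of how one block wires them together. It assumes the pre-norm residual layout that Llama-style models use (other families may add extra norms). The names `transformer_block`, `mean_attention`, `double_mlp`, and `identity_norm` are hypothetical stand-ins rather than real library functions, and the toy "attention" simply averages all positions.

```python
from typing import Callable, List

Vector = List[float]
TokenSeq = List[Vector]  # one vector per token position


def add(a: TokenSeq, b: TokenSeq) -> TokenSeq:
    """Residual add: elementwise sum of two same-shaped sequences."""
    return [[p + q for p, q in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def transformer_block(
    x: TokenSeq,
    attention: Callable[[TokenSeq], TokenSeq],  # mixes information ACROSS positions
    mlp: Callable[[Vector], Vector],            # transforms EACH position on its own
    norm: Callable[[Vector], Vector],
) -> TokenSeq:
    # Machine 1: attention reads the normalized sequence; its output is added back.
    x = add(x, attention([norm(v) for v in x]))
    # Machine 2: the MLP sees one position at a time; its output is added back.
    x = add(x, [mlp(norm(v)) for v in x])
    return x


# --- Toy stand-ins (assumptions for illustration only) ---
def identity_norm(v: Vector) -> Vector:
    return list(v)


def mean_attention(seq: TokenSeq) -> TokenSeq:
    """Every position receives the average of all positions."""
    n = len(seq)
    mean = [sum(col) / n for col in zip(*seq)]
    return [list(mean) for _ in seq]


def double_mlp(v: Vector) -> Vector:
    return [2.0 * t for t in v]


tokens = [[1.0, 0.0], [3.0, 2.0]]
out = transformer_block(tokens, mean_attention, double_mlp, identity_norm)
print(out)  # [[9.0, 3.0], [15.0, 9.0]]
```

Note the asymmetry in the signatures: attention takes the whole sequence, while the MLP only ever receives a single position's vector.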
A worked example
Take the input "The capital of France is Paris." Attention discovers the link France → Paris — the relationship, the context. But attention alone doesn't create the knowledge. The MLP is where the pattern France → capital → Paris is actually encoded in the weights. Attention found the connection; the MLP is what holds the fact.
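The routing half of that example can be sketched with single-query scaled dot-product attention. Everything numeric here is made up by hand: in a real model the query, key, and value vectors come from the learned q_proj, k_proj, and v_proj weights, and `attend` is a hypothetical helper name. The point is only that the output is a weighted mix of value vectors that other positions already carry.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(
    query: List[float], keys: List[List[float]], values: List[List[float]]
) -> Tuple[List[float], List[float]]:
    """Return (attention weights, weighted mix of the value vectors)."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    mixed = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, mixed


# Hand-picked toy vectors (assumption): the key for "France" is built to
# align with the query issued from the final position.
tokens = ["The", "capital", "of", "France", "is"]
keys = [[0.1, 0.0], [0.5, 0.5], [0.0, 0.1], [2.0, 2.0], [0.2, 0.1]]
values = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [5.0, 5.0], [0.5, 0.5]]
query = [1.0, 1.0]

weights, mixed = attend(query, keys, values)
top = max(range(len(tokens)), key=lambda i: weights[i])
print(tokens[top], [round(w, 3) for w in weights])
```

In this toy setup the largest weight lands on "France", so the mixed vector is dominated by that token's value. What the model then does with that vector is the MLP's job.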
The library researcher
If the projection names blur together, this analogy makes it stick.
Attention is the researcher walking the library and pulling the relevant books off the shelves. The MLP is the researcher sitting down and reasoning over those books to produce an answer. Pulling books (routing) is useless without a brain to process them; a brain is useless if it can't fetch the right books. You need both — but only one of them is where the content of the books is stored.
What the MLP actually does inside
The MLP isn't a single matrix. It expands, gates, then compresses — modern Llama-style models use a SwiGLU shape that combines gate_proj and up_proj.
The input goes to two projections in parallel. up_proj expands it into a much wider space (more room to compute); gate_proj, passed through an activation (SiLU in the SwiGLU variant), decides which of those expanded features actually matter. The two are multiplied together elementwise, and down_proj compresses the result back down to the original size. That wide middle is where there's enough capacity to store and combine knowledge, which is exactly why it matters for fine-tuning.
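The expand, gate, compress flow can be written out in a few lines. This is a minimal sketch assuming the SwiGLU form used by Llama-style models, with no bias terms and a SiLU activation on the gate; other families may use a different activation. The tiny 2 → 4 → 2 weight matrices are invented for illustration, and `swiglu_mlp` is a hypothetical function name.

```python
import math
from typing import List

Matrix = List[List[float]]  # rows = output features, columns = input features


def matvec(w: Matrix, x: List[float]) -> List[float]:
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]


def silu(z: float) -> float:
    """SiLU (swish): z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def swiglu_mlp(
    x: List[float], gate_proj: Matrix, up_proj: Matrix, down_proj: Matrix
) -> List[float]:
    gate = [silu(g) for g in matvec(gate_proj, x)]  # which features matter
    up = matvec(up_proj, x)                         # expanded features
    hidden = [g * u for g, u in zip(gate, up)]      # elementwise gating
    return matvec(down_proj, hidden)                # back to the input width


# Toy weights (assumption): hidden size 2, intermediate size 4.
gate_proj = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
up_proj = [[1.0, 1.0], [1.0, -1.0], [0.5, 0.0], [0.0, 0.5]]
down_proj = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]

y = swiglu_mlp([2.0, -1.0], gate_proj, up_proj, down_proj)
print(len(y))  # 2: same width as the input
```

All three matrices carry learned weights, and the intermediate width is typically several times the hidden width, which is why most of a block's parameters sit in the MLP.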
Why this decides your LoRA config
Now the practical payoff. Here are the two configs people actually argue about:
    # Config A: attention only (the common default)
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"]

    # Config B: attention + MLP (knowledge-heavy domains)
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"]
A classic interview question is: "Config A is attention only, Config B is attention plus MLP — which learns more medical knowledge?" The answer is Config B, and now you can say why, not just guess:
Attention layers mainly change how the model routes and relates tokens. The MLP layers are heavily involved in encoding and transforming knowledge. In a specialized domain the goal isn't only better attention patterns; it's learning new concepts, terminology, and relationships. Adapting gate_proj, up_proj, and down_proj gives the model the capacity to store and use that domain knowledge.
| What you adapt | What it changes | Good for |
|---|---|---|
| Attention only | routing / how tokens relate | style, format, light behavior shifts |
| Attention + MLP | routing and knowledge storage | new domain facts, terminology, deep adaptation |
So the question 'Is "Warfarin interacts with Aspirin" more about attention or the MLP?' has a clear answer: it's a new fact to store, not just a routing pattern, so it's mostly an MLP job. If your adapter only touches attention, the model has far less room to encode it.
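A rough way to see the capacity gap is to count the trainable parameters each config adds. For a weight matrix with `in` inputs and `out` outputs, a rank-`r` LoRA adapter adds `r * (in + out)` parameters. The sketch below assumes Llama-2-7B-like shapes (hidden size 4096, intermediate size 11008, 32 layers, full-size k/v projections) and rank 16. These numbers are illustrative assumptions, not the setup from this post, and `lora_trainable_params` is a hypothetical helper.

```python
from typing import Dict, List, Tuple

# ASSUMPTION: Llama-2-7B-like shapes, for illustration only.
HIDDEN = 4096
INTERMEDIATE = 11008
NUM_LAYERS = 32

# (in_features, out_features) of each projection inside one block.
SHAPES: Dict[str, Tuple[int, int]] = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_trainable_params(target_modules: List[str], rank: int) -> int:
    """Each adapted matrix gains A (rank x in) and B (out x rank)."""
    per_layer = sum(rank * (SHAPES[m][0] + SHAPES[m][1]) for m in target_modules)
    return per_layer * NUM_LAYERS


config_a = ["q_proj", "k_proj", "v_proj", "o_proj"]
config_b = config_a + ["gate_proj", "up_proj", "down_proj"]

params_a = lora_trainable_params(config_a, rank=16)
params_b = lora_trainable_params(config_b, rank=16)
print(params_a)  # 16777216
print(params_b)  # 39976960
```

Under these assumptions Config B trains a little over twice as many adapter parameters as Config A, and all of the extra ones sit in the MLP. Parameter count is only a proxy for capacity, and models with grouped-query attention have smaller k_proj and v_proj matrices, so the exact ratio will differ.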
How this played out in practice
When I fine-tuned Gemma 3 4B on a multi-turn conversational dataset for a tool-using assistant, this was a deliberate config decision, not a copy-paste. The assistant needed to learn domain behavior and vocabulary, not just talk more smoothly — so the LoRA adapters targeted the MLP projections alongside attention. The "attention-only" default would have produced exactly the fluent-but-shallow result I described at the top. I'll cover that full setup in the LoRA & QLoRA post and the agent case study later in this series.
This sits one level down from the two axes of fine-tuning: which layers your adapter touches is a detail inside Axis B (how you update the weights). Get the mental model of the block right, and the config stops being guesswork.
The takeaway
- A transformer block has two machines. Attention (q/k/v/o) routes and retrieves; the MLP (gate/up/down) computes and stores knowledge.
- Attention retrieves, the MLP remembers. New domain facts live largely in the MLP, not in attention.
- Target the MLP for knowledge-heavy domains. Attention-only LoRA changes style; attention + MLP gives the model capacity to learn new concepts.
- Config is a decision, not a default. Match target_modules to whether you're shifting behavior or teaching knowledge.
Next in the series: SFT — what the model is actually predicting — why supervised fine-tuning is the same objective as pretraining, and the masking detail that quietly decides whether it works.