Tags: Transformers, Attention, MLP, LoRA, Fine-Tuning, Model Internals

Inside a Transformer Block: Why Where Knowledge Lives Decides Where You Fine-Tune

Published June 12, 2026 · 5 min read

Here's a mistake I made early, and watched plenty of others make since: I set up a LoRA fine-tune, targeted q_proj, k_proj, v_proj, o_proj because that's what every tutorial showed, ran it on a specialized medical dataset — and the model came out fluent but no smarter about the domain. It phrased things nicely. It didn't know more.

The reason is structural. I was adapting the part of the transformer that decides what to look at, and leaving untouched the part that actually stores knowledge. To see why that matters, you have to open up a single transformer block.

A block has two machines, not one

People learn the four attention projections and assume that's the transformer. But every block in every modern model — Llama, Qwen, Mistral, Gemma — has two machines stacked together, and they do completely different jobs.

[Figure: The two machines inside every transformer block]

Attention (q_proj, k_proj, v_proj, o_proj) answers "which tokens should I look at?" It's routing and retrieval — it moves information between positions, discovering relationships and context. After a residual add, the result flows into the second machine.

The MLP (gate_proj, up_proj, down_proj) answers "what should I do with it?" This is computation and, crucially, knowledge storage. The facts and task-specific patterns the model has learned live largely in these weights.

The catchphrase I use to keep it straight: attention retrieves, the MLP remembers.
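To make the two-machine picture concrete, here is a minimal NumPy sketch of one block. It is an illustration under assumptions, not any particular model's implementation: the toy sizes, the single attention head, the pre-norm layout, and the missing learned norm scale are all simplifications I chose for readability.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 8, 32, 5  # toy sizes, assumed for illustration only


def init(shape):
    return rng.normal(scale=0.1, size=shape)


# Machine 1: the four attention projections.
W = {name: init((d_model, d_model)) for name in ("q_proj", "k_proj", "v_proj", "o_proj")}
# Machine 2: the three MLP projections (weights stored as in x out here).
W["gate_proj"] = init((d_model, d_ff))
W["up_proj"] = init((d_model, d_ff))
W["down_proj"] = init((d_ff, d_model))


def rms_norm(x, eps=1e-6):
    # Scale each token vector by its root-mean-square (learned scale omitted).
    return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def silu(z):
    return z / (1.0 + np.exp(-z))


def attention(x):
    # Single-head causal attention: each position reads from itself and earlier ones.
    n = x.shape[0]
    q, k, v = x @ W["q_proj"], x @ W["k_proj"], x @ W["v_proj"]
    scores = q @ k.T / np.sqrt(d_model)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v @ W["o_proj"]


def mlp(x):
    # Acts on every position separately; no information moves between tokens.
    return (silu(x @ W["gate_proj"]) * (x @ W["up_proj"])) @ W["down_proj"]


def block(x):
    x = x + attention(rms_norm(x))  # machine 1, then residual add
    x = x + mlp(rms_norm(x))        # machine 2, then residual add
    return x


x = rng.normal(size=(seq_len, d_model))
y = block(x)
```

The structural difference shows up directly in the code: `attention` contains the only step where rows (token positions) interact, while `mlp` gives the same result for a token whether or not its neighbors are present.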

A worked example

Take the input "The capital of France is Paris." Attention discovers the link France → Paris — the relationship, the context. But attention alone doesn't create the knowledge. The MLP is where the pattern France → capital → Paris is actually encoded in the weights. Attention found the connection; the MLP is what holds the fact.

The library researcher

If the projection names blur together, this analogy makes it stick.

[Figure: The library researcher analogy]

Attention is the researcher walking the library and pulling the relevant books off the shelves. The MLP is the researcher sitting down and reasoning over those books to produce an answer. Pulling books (routing) is useless without a brain to process them; a brain is useless if it can't fetch the right books. You need both — but only one of them is where the content of the books is stored.

What the MLP actually does inside

The MLP isn't a single matrix. It expands, gates, then compresses — modern Llama-style models use a SwiGLU shape that combines gate_proj and up_proj.

[Figure: What the MLP does inside: expand, gate, multiply, compress]

The input goes to two projections in parallel. up_proj expands it into a much wider space (more room to compute); gate_proj, passed through a SiLU activation, decides which of those expanded features actually matter. The two are multiplied together elementwise, and down_proj compresses the result back down to the original size. That wide middle is where there's enough capacity to store and combine knowledge, which is exactly why it matters for fine-tuning.
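The expand, gate, multiply, compress sequence can be sketched in a few lines of NumPy. The sizes below are toy values I picked for illustration, and I'm assuming SiLU as the gate activation (the "Swi" in SwiGLU); real models use hidden sizes in the thousands and store these weights inside linear layers rather than as bare matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, intermediate = 8, 32  # toy sizes, assumed for illustration only

# Weights stored as (in, out) so that `x @ W` applies the projection.
gate_proj = rng.normal(scale=0.1, size=(hidden, intermediate))
up_proj = rng.normal(scale=0.1, size=(hidden, intermediate))
down_proj = rng.normal(scale=0.1, size=(intermediate, hidden))


def silu(z):
    # SiLU (a.k.a. swish): z * sigmoid(z)
    return z / (1.0 + np.exp(-z))


def swiglu_mlp(x):
    expanded = x @ up_proj        # expand: hidden -> intermediate
    gate = silu(x @ gate_proj)    # gate: per-feature weighting, same wide shape
    gated = gate * expanded       # multiply: elementwise
    out = gated @ down_proj       # compress: intermediate -> hidden
    return out, gated


x = rng.normal(size=(3, hidden))  # 3 tokens
out, gated = swiglu_mlp(x)
print(gated.shape, out.shape)  # (3, 32) (3, 8)
```

The `gated` tensor is the "wide middle": each token briefly lives in a space four times larger here than the residual stream before being projected back.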

Why this decides your LoRA config

Now the practical payoff. Here are the two configs people actually argue about:

# Config A — attention only (the common default)
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"]

# Config B — attention + MLP (knowledge-heavy domains)
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                  "gate_proj", "up_proj", "down_proj"]

A classic interview question is: "Config A is attention only, Config B is attention plus MLP — which learns more medical knowledge?" The answer is Config B, and now you can say why, not just guess:

Attention layers mainly change how the model routes and relates tokens. The MLP layers are heavily involved in encoding and transforming knowledge. In a specialized domain the goal isn't only better attention patterns — it's learning new concepts, terminology, and relationships. Adapting gate_proj, up_proj, down_proj gives the model the capacity to store and use that domain knowledge.
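One way to see the capacity argument is to count the trainable parameters each config adds. This is a back-of-the-envelope sketch under stated assumptions: hidden size 4096 and MLP width 11008 (roughly a 7B Llama-style model), plain multi-head attention so all four attention projections are square, and a LoRA rank of 16. With grouped-query attention, k_proj and v_proj would be smaller than shown.

```python
def lora_params(d_in, d_out, r):
    # LoRA adds two low-rank factors beside a frozen (d_out x d_in) weight:
    # A with shape (r, d_in) and B with shape (d_out, r).
    return r * (d_in + d_out)


hidden, intermediate, r = 4096, 11008, 16  # assumed sizes, not from any specific run

# (d_in, d_out) for each projection in one block.
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, hidden),
    "v_proj": (hidden, hidden),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}

config_a = ["q_proj", "k_proj", "v_proj", "o_proj"]
config_b = config_a + ["gate_proj", "up_proj", "down_proj"]


def per_block(target_modules):
    return sum(lora_params(*shapes[name], r) for name in target_modules)


a, b = per_block(config_a), per_block(config_b)
print(a, b)             # 524288 1249280
print(round(b / a, 2))  # 2.38
```

Under these assumptions Config B trains a bit more than twice as many adapter parameters per block, and all of the extra ones sit on the MLP. More parameters don't guarantee more learned knowledge, but they do show where the added capacity goes.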

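| Config          | What you adapt                | Good for                                       |
|-----------------|-------------------------------|------------------------------------------------|
| Attention only  | routing / how tokens relate   | style, format, light behavior shifts           |
| Attention + MLP | routing and knowledge storage | new domain facts, terminology, deep adaptation |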

So the question "is the fact 'Warfarin interacts with Aspirin' more about attention or the MLP?" has a clear answer: it's a new fact to store, not just a routing pattern, so it's mostly an MLP job. If your adapter only touches attention, the model has far less room to encode it.
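To check which weights a given target_modules list would actually touch, a small sketch helps. It assumes Hugging Face-style module names for a Llama-like layer and suffix matching on the last name component, similar in spirit to how PEFT resolves a list of target modules. `is_targeted` is a hypothetical helper written for this post, not a PEFT function.

```python
def is_targeted(module_name, target_modules):
    # Hypothetical helper: a module is adapted if its name equals a target
    # or ends with ".<target>".
    return any(
        module_name == t or module_name.endswith("." + t)
        for t in target_modules
    )


# Assumed module names for one Llama-style decoder layer.
layer_modules = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.0.self_attn.o_proj",
    "model.layers.0.mlp.gate_proj",
    "model.layers.0.mlp.up_proj",
    "model.layers.0.mlp.down_proj",
]

config_a = ["q_proj", "k_proj", "v_proj", "o_proj"]
config_b = config_a + ["gate_proj", "up_proj", "down_proj"]

adapted_a = [m for m in layer_modules if is_targeted(m, config_a)]
adapted_b = [m for m in layer_modules if is_targeted(m, config_b)]

mlp_in_a = [m for m in adapted_a if ".mlp." in m]
mlp_in_b = [m for m in adapted_b if ".mlp." in m]
print(len(adapted_a), len(mlp_in_a))  # 4 0
print(len(adapted_b), len(mlp_in_b))  # 7 3
```

Under Config A, no MLP weight receives an adapter, so every update has to be expressed through the attention projections. Running a check like this against your real model's module names before training is a cheap way to confirm the config does what you intended.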

How this played out in practice

When I fine-tuned Gemma 3 4B on a multi-turn conversational dataset for a tool-using assistant, this was a deliberate config decision, not a copy-paste. The assistant needed to learn domain behavior and vocabulary, not just talk more smoothly — so the LoRA adapters targeted the MLP projections alongside attention. The "attention-only" default would have produced exactly the fluent-but-shallow result I described at the top. I'll cover that full setup in the LoRA & QLoRA post and the agent case study later in this series.

This sits one level down from the two axes of fine-tuning: which layers your adapter touches is a detail inside Axis B (how you update the weights). Get the mental model of the block right, and the config stops being guesswork.

The takeaway

  • A transformer block has two machines. Attention (q/k/v/o) routes and retrieves; the MLP (gate/up/down) computes and stores knowledge.
  • Attention retrieves, the MLP remembers. New domain facts live largely in the MLP, not in attention.
  • Target the MLP for knowledge-heavy domains. Attention-only LoRA changes style; attention + MLP gives the model capacity to learn new concepts.
  • Config is a decision, not a default. Match target_modules to whether you're shifting behavior or teaching knowledge.

Next in the series: SFT — what the model is actually predicting — why supervised fine-tuning is the same objective as pretraining, and the masking detail that quietly decides whether it works.