Fine-TuningSFTQLoRALangGraphReAct AgentMCPTool CallingGemma

Fine-Tuning a Tool-Calling Agent: SFT + QLoRA on Gemma 3 4B

Published June 15, 20267 min read

Fine-Tuning a Tool-Calling Agent: SFT + QLoRA on Gemma 3 4B

The agent worked. It also cost too much and drifted.

It was a LangGraph ReAct agent sitting over a custom MCP server that exposed 40+ tools into an existing backend. The model was a large general-purpose API model, prompted with the full tool catalog on every turn. It was capable — but each multi-step task burned a long context of tool schemas, latency was high, and the tool-call formatting wobbled enough that the harness occasionally failed to parse a call.

The fix wasn't a better prompt. It was teaching a small model to use these specific tools well — SFT + QLoRA on Gemma 3 4B. This post is the case study that ties together the two axes, the transformer block, SFT, and LoRA & QLoRA — applied to one real system.

Why fine-tune instead of prompt

Three reasons made fine-tuning the right call here, not premature optimization:

The tool surface was fixed and large. 40+ tools with stable schemas. Re-describing them in the prompt every turn is paying, in tokens and latency, for knowledge the model could just have.
The behavior was repetitive and verifiable. The agent did the same families of multi-step tasks. There were plenty of successful runs to learn from.
Format reliability mattered. A flaky tool-call format breaks the whole trajectory. A model trained on the exact format emits it far more consistently than one steered by a prompt.

On the two axes: the signal is SFT (imitate successful trajectories) and the weight method is QLoRA (fit Gemma 3 4B comfortably on one GPU). A clean "one from each axis" choice.

Step 1: the dataset is the product

The agent had been running in production, which meant it was already generating exactly the data I needed: multi-turn tool-use trajectories. The work was turning that exhaust into a clean training set.

Building the fine-tuning dataset from agent tool trajectories

The pipeline:

Filter. Keep only successful, verified runs — trajectories where the task actually completed correctly. Drop failures, dead-ends, and near-duplicates. A failed trajectory teaches the model to fail.
Format. Render each trajectory as a multi-turn conversation using Gemma's own chat template, with tool calls and tool results in the exact structure the harness uses. Then apply masking (next section).
Dataset. The output is a curated, on-distribution multi-turn SFT set. This is "golden trajectories" — hand-verified perfect runs — plus rejection-sampled completions (generate many, keep only the ones that pass).

Most of the effort lives in steps 1 and 2. Garbage in, garbage out: the dataset is the product.

Step 2: masking in multi-turn data

This is where the SFT post's masking detail gets sharper. In a single Q&A row you mask the prompt and compute loss on the response. In a multi-turn agent trajectory, there are many turns — user messages, assistant tool calls, tool results, more assistant turns — and you have to mask the right ones.

The rule: every assistant turn carries loss; every user and tool turn is masked to -100.

<user>      task description            → masked (-100)
<assistant> tool_call: lookup(args)     → LOSS  (learn to call the tool)
<tool>      {result: ...}               → masked (-100)
<assistant> tool_call: update(args)     → LOSS  (learn the next step)
<tool>      {ok: true}                  → masked (-100)
<assistant> final answer                → LOSS  (learn to finish)

Get this wrong — mask the assistant turns, or compute loss on tool results — and the model learns to generate the user's side of the conversation or hallucinate tool outputs. Both are exactly the failure you don't want in an agent. Tools like TRL's collators handle multi-turn response masking, but you must verify it on a real tokenized row, not assume it.

Step 3: train with QLoRA

Gemma 3 4B is small enough that QLoRA makes training comfortable on a single GPU. From the LoRA & QLoRA post: a 4-bit NF4 frozen base, BF16 LoRA adapters, a paged optimizer for memory spikes.

from peft import LoraConfig
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype="bfloat16",
)

lora_config = LoraConfig(
    r=16,                       # modest rank — matched to dataset size
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=[            # attention AND MLP — tool behavior + domain knowledge
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)
# optimizer: paged_adamw_8bit  ·  1–3 epochs  ·  base stays frozen

Two decisions worth calling out, both straight from earlier posts:

Target the MLP, not just attention. From the transformer-block post: the tools carry domain vocabulary and behavior, so the adapters touch gate_proj, up_proj, down_proj too — not just q/k/v/o. Attention-only would have produced a fluent agent that didn't really learn the tools.
Modest rank, few epochs. Rank 16, 1–3 epochs. Cranking rank on a finite trajectory set overfits phrasing; few epochs protect the base model's general ability (and QLoRA's frozen base resists catastrophic forgetting by construction).

Step 4: serve it back behind the agent

The fine-tuned model drops into the same ReAct loop it learned from — nothing about the orchestration changes.

The tool-calling loop a ReAct agent runs

A user message arrives with the tool schemas; the model emits a structured tool call (name + JSON args); the harness runs the matching MCP tool and feeds the result back as a tool message; repeat until the model emits a final answer. The whole system looks like this:

Architecture of the fine-tuned tool-calling agent

The LangGraph agent orchestrates. It calls the fine-tuned Gemma 3 4B to decide each step and the MCP tools to act, against the existing backend. The fine-tune didn't replace the architecture — it replaced the expensive, prompt-steered decision model with a small one that already knows these tools.

The one thing you must keep identical between training and serving is the chat template. Train on Gemma's template, serve on Gemma's template — via apply_chat_template on both sides. A mismatch here is the silent quality collapse from the SFT post, and in an agent it shows up as rising tool-call parse failures.

What changed, and what to watch

The wins were the ones the three reasons predicted: the tool catalog no longer had to live in every prompt, latency and cost dropped with the smaller model, and tool-call formatting got far more reliable because the model had seen the exact format thousands of times.

The things I watched closely:

Tool-call parse rate as a first-class metric. If it ever drifts, the template or the format is wrong — catch it early.
Read real multi-turn generations, not the loss curve. An agent can have great loss and still loop, call the wrong tool, or fabricate a result. Curves lie; trajectories don't.
State, not just final text. When evaluating, check that the right tools were called in a legal order and the backend ended in the right state — not only that the final sentence looked good. (That reward/verifier design is its own topic, coming next in the series.)

The takeaway

Fine-tune when the tool surface is fixed, the behavior is repetitive, and format reliability matters. Otherwise keep prompting.
The dataset is the product. Mine successful production trajectories, filter hard, format with the model's own chat template.
Mask every non-assistant turn. In multi-turn agent data, loss goes on assistant turns only — or the model learns to write the user and hallucinate tool outputs.
QLoRA on a small model, MLP included. 4-bit base, BF16 adapters on attention and MLP, modest rank, few epochs — and keep the chat template identical from training to serving.

This is the payoff of the whole series so far: the two axes, the transformer block, SFT, and QLoRA, composed into one working system. Next, I'll go up a level into reward design and verifiers — how you'd push this agent further with RL by scoring whether it took the right actions, not just whether the text looked right.