Project Showcase

Fine-Tuned Tool-Calling Agent

A Small Model Taught to Use 40+ Tools

A LangGraph ReAct agent over a custom MCP backend exposing 40+ tools, with the decision model replaced by a fine-tuned Gemma 3 4B. Successful multi-turn tool trajectories from production were filtered and reformatted into an SFT dataset, then used to train Gemma 3 4B with QLoRA — cutting cost and latency while making tool-call formatting far more reliable.

View on GitHub

Tech Stack

Gemma 3 4BQLoRASFTPEFTTRLLangGraphMCPPythonPyTorch

Key Features

Fine-Tuned Decision Model

Gemma 3 4B trained with SFT + QLoRA on the agent’s own successful tool trajectories, replacing an expensive prompt-steered API model.

40+ Tool MCP Backend

A custom MCP server exposes the existing backend as 40+ tools the ReAct agent calls during multi-step tasks.

Trajectory-Mined Dataset

Production multi-turn runs filtered to verified successes, deduplicated, and formatted with the model’s own chat template and multi-turn response masking.

Reliable Tool Calling

Training on the exact tool-call format makes the model emit valid name + JSON-args calls consistently, so the harness parses them reliably.

Architecture

LangGraph ReAct agent orchestrates the multi-turn tool-calling loop

Custom MCP server exposes 40+ backend operations as tools

Gemma 3 4B fine-tuned with QLoRA (4-bit NF4 base, BF16 LoRA adapters)

LoRA adapters target attention and MLP projections for tool + domain behavior

Paged optimizer to survive training memory spikes on a single GPU

Identical chat template at training and serving via apply_chat_template

Diagrams

Fine-tuned tool-calling agent architecture — A LangGraph ReAct agent orchestrates the loop, calling the fine-tuned Gemma 3 4B to decide each step and the custom MCP server (40+ tools) to act against the existing backend.

Tool-calling loop — A user message with tool schemas goes to the model, which emits a structured tool call; the harness runs the MCP tool and feeds the result back, repeating over multiple steps until a final answer.

API Usage

QLoRA Config (Gemma 3 4B)

from peft import LoraConfig
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype="bfloat16",
)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=[              # attention AND MLP
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)
# optimizer: paged_adamw_8bit  ·  base stays frozen

Impact

Replaced a large prompt-steered model with a small fine-tuned one: lower cost and latency, and far more reliable tool-call formatting across multi-step tasks.

40+

MCP Tools

Gemma Parameters

QLoRA

Single-GPU Training