Technical Blog

Thoughts on AI, ML, and Engineering

Deep dives into production-grade AI systems, agentic workflows, and lessons learned from building ML products at scale.

Fine-Tuning a Tool-Calling Agent: SFT + QLoRA on Gemma 3 4B
Fine-Tuning, SFT, QLoRA, +5 more

How I turned a prompt-driven ReAct agent over a 40+ tool MCP backend into a fine-tuned one — collecting multi-turn tool trajectories, building the SFT dataset, training Gemma 3 4B with QLoRA, and serving it back behind the agent loop. The case study that ties the whole fine-tuning series together.

June 15, 2026 · 7 min read

LoRA & QLoRA: Fine-Tuning a Model That Doesn't Fit on Your GPU
LoRA, QLoRA, PEFT, +4 more

Fully fine-tuning a 7B model needs ~112 GB of memory before you load a batch. LoRA trains ~1% of the weights; QLoRA squeezes the frozen base into 4 bits so a large model fits on a single card. Here's how both work, and the four pieces of QLoRA that interviewers always probe.

June 14, 2026 · 7 min read

SFT: What the Model Is Actually Predicting (and the Mask That Decides If It Works)
SFT, Supervised Fine-Tuning, Chat Templates, +3 more

Supervised fine-tuning looks like 'teach the model the answer.' Mathematically that's not what happens. It's the same next-token prediction as pretraining — and two quiet details, response masking and the chat template, decide whether your fine-tune works or silently rots.

June 13, 2026 · 6 min read

Inside a Transformer Block: Why Where Knowledge Lives Decides Where You Fine-Tune
Transformers, Attention, MLP, +3 more

Most people learn q_proj, k_proj, v_proj, o_proj and think that's the whole transformer. It isn't. A block has two machines — attention and the MLP — and knowing which does what is the difference between a LoRA config that learns your domain and one that doesn't.

June 12, 2026 · 5 min read

The Two Axes of Fine-Tuning: A Mental Model That Stops the Confusion
Fine-Tuning, LoRA, QLoRA, +5 more

LoRA, SFT, QLoRA, DPO, PPO, GRPO — they all blur together until you see they live on two independent axes. One decides how you touch the weights, the other decides what signal you train on. The map I use before every fine-tuning project.

June 11, 2026 · 6 min read

Using Mem0 for User Preference Memory and Context Switching in AI Agents
Mem0, Agent Memory, User Preferences, +4 more

How to store user preferences in Mem0, retrieve them by context tag before each response, and handle context switching when users shift between different tasks. Practical patterns with LangGraph integration.

June 9, 2026 · 10 min read

Building a 4-Tier Verification Layer for LLM Outputs in Regulated Domains
LLM Verification, NLI Cross-Encoder, Hallucination Detection, +3 more

How I built a verification pipeline that catches hallucinated citations before they reach production. NLI cross-encoder, Claude judge, hard-fail gates, and the economics of getting it right at $0.05 per query.

June 2, 2026 · 8 min read

The Rise of the Autonomous Company: Hermes, OpenClaw, and the Paperclip Revolution
AI Agents, Autonomous Companies, OpenClaw, +2 more

A comparative analysis of the leading AI agent frameworks in May 2026: Hermes' self-improvement, OpenClaw's massive ecosystem, and Paperclip's orchestration.

May 18, 2026 · 4 min read

Building a Self-Hosted Voice AI Assistant with LiveKit Agents
LiveKit, Voice AI, FastAPI, +7 more

How I built a real-time conversational voice AI using LiveKit Agents, FastAPI, Whisper, Ollama, and Edge-TTS — fully self-hosted with no cloud dependencies, zero API costs, and sub-2s response latency.

May 13, 2026 · 7 min read

Prompt Engineering vs RAG vs Fine-Tuning: The Decision Framework I Actually Use
Prompt Engineering, RAG, Fine-Tuning, +3 more

Stop guessing which LLM technique to use. A data-driven decision framework with ROI analysis, cost comparisons, and real production case studies from the Document Extraction Pipeline and DeepAgent projects.

April 25, 2026 · 14 min read

Evaluating Generative AI in Production: Metrics Beyond 'Correct' and 'Incorrect'
LLM Evaluation, LLM-as-Judge, Generative AI Metrics, +3 more

Stop eyeballing LLM outputs. Learn production-grade evaluation: LLM-as-judge, deterministic checks, RAG faithfulness, and A/B testing frameworks that actually work.

April 24, 2026 · 14 min read

Memory Management for AI Agents: Context Window Optimization Without Token Bankruptcy
AI Agent Memory, Context Window, LangGraph, +3 more

Stop letting context windows bankrupt your AI budget. Learn sliding window, summarization, and vectorized memory strategies with real cost savings from production deployments.

April 23, 2026 · 11 min read

From Document Processing to LLM Resilience: Patterns That Scale
LLM Resilience, Circuit Breaker, Exponential Backoff, +3 more

Building on the Document Extraction Pipeline's Celery/Redis foundation, this post extends its async patterns to LLM-specific resilience: circuit breakers, multi-provider fallbacks, and token bucket rate limiting.

April 22, 2026 · 8 min read

Production-Grade LLM System Architecture: From Notebook to 10k RPM
LLM System Design, FastAPI, Celery, +4 more

Learn how to design decoupled LLM systems that handle 10,000 requests per minute. Covers async queues, caching strategies, RAG integration, and the three biggest bottlenecks you'll face.

April 21, 2026 · 6 min read

Building a Production-Ready LangChain DeepAgent with SSE Streaming
LangChain, LangGraph, FastAPI, +4 more

How I built a production AI agent API using LangGraph, Ollama, and FastAPI with real-time SSE streaming, Redis rate limiting, and multi-turn session memory.

April 13, 2026 · 7 min read

OpenClaw: A Self-Hosted AI Assistant with Ollama, Telegram & Discord
OpenClaw, Ollama, AI Agents, +8 more

Install OpenClaw, wire Anthropic/OpenAI/Google or a local Ollama model, control it from Telegram and Discord, extend it with skills.sh, and turn it into a business gateway for lead qualification, customer support, and agent-ecosystem management.

April 13, 2026 · 14 min read

Building a Production-Ready Document Extraction Pipeline with FastAPI and MinerU
FastAPI, MinerU, Ollama, +5 more

How I built a high-performance document extraction system using FastAPI, Celery, MinerU, and LLMs. Learn the architecture, challenges, and lessons from deploying AI-powered document processing at scale.

April 9, 2026 · 7 min read

Getting Started with LLM Agents in Production
LLM, LangChain, FastAPI, +2 more

A practical guide to building and deploying LLM-powered agents using LangChain and FastAPI, with lessons from real-world implementations.

April 9, 2026 · 4 min read