Thoughts on AI, ML, and Engineering
Deep dives into production-grade AI systems, agentic workflows, and lessons learned from building ML products at scale.
Fine-Tuning a Tool-Calling Agent: SFT + QLoRA on Gemma 3 4B
How I turned a prompt-driven ReAct agent over a 40+ tool MCP backend into a fine-tuned one — collecting multi-turn tool trajectories, building the SFT dataset, training Gemma 3 4B with QLoRA, and serving it back behind the agent loop. The case study that ties the whole fine-tuning series together.
LoRA & QLoRA: Fine-Tuning a Model That Doesn't Fit on Your GPU
Fully fine-tuning a 7B model needs ~112GB of GPU memory before you load a batch. LoRA freezes the base and trains small adapter matrices, roughly 1% as many parameters; QLoRA quantizes the frozen base to 4 bits so a large model fits on a single card. Here's how both work, and the four pieces of QLoRA interviewers always probe.
SFT: What the Model Is Actually Predicting (and the Mask That Decides If It Works)
Supervised fine-tuning looks like 'teach the model the answer.' Mathematically that's not what happens. It's the same next-token prediction as pretraining — and two quiet details, response masking and the chat template, decide whether your fine-tune works or silently rots.
Inside a Transformer Block: Why Where Knowledge Lives Decides Where You Fine-Tune
Most people learn q_proj, k_proj, v_proj, o_proj and think that's the whole transformer. It isn't. A block has two machines — attention and the MLP — and knowing which does what is the difference between a LoRA config that learns your domain and one that doesn't.
The Two Axes of Fine-Tuning: A Mental Model That Stops the Confusion
LoRA, SFT, QLoRA, DPO, PPO, GRPO — they all blur together until you see they live on two independent axes. One decides how you touch the weights, the other decides what signal you train on. The map I use before every fine-tuning project.
Using Mem0 for User Preference Memory and Context Switching in AI Agents
How to store user preferences in Mem0, retrieve them by context tag before each response, and handle context switching when users shift between different tasks. Practical patterns with LangGraph integration.
Building a 4-Tier Verification Layer for LLM Outputs in Regulated Domains
How I built a verification pipeline that catches hallucinated citations before they reach production. NLI cross-encoder, Claude judge, hard-fail gates, and the economics of getting it right at $0.05 per query.
The Rise of the Autonomous Company: Hermes, OpenClaw, and the Paperclip Revolution
A comparative analysis of the leading AI agent frameworks in May 2026: Hermes' self-improvement, OpenClaw's massive ecosystem, and Paperclip's orchestration.
Building a Self-Hosted Voice AI Assistant with LiveKit Agents
How I built a real-time conversational voice AI using LiveKit Agents, FastAPI, Whisper, Ollama, and Edge-TTS — fully self-hosted with no cloud dependencies, zero API costs, and sub-2s response latency.
Prompt Engineering vs RAG vs Fine-Tuning: The Decision Framework I Actually Use
Stop guessing which LLM technique to use. A data-driven decision framework with ROI analysis, cost comparisons, and real production case studies from the Document Extraction Pipeline and DeepAgent projects.
Evaluating Generative AI in Production: Metrics Beyond 'Correct' and 'Incorrect'
Stop eyeballing LLM outputs. Learn production-grade evaluation: LLM-as-judge, deterministic checks, RAG faithfulness, and A/B testing frameworks that actually work.
Memory Management for AI Agents: Context Window Optimization Without Token Bankruptcy
Stop letting context windows bankrupt your AI budget. Learn sliding window, summarization, and vectorized memory strategies with real cost savings from production deployments.
From Document Processing to LLM Resilience: Patterns That Scale
Building on the Document Extraction Pipeline's Celery/Redis foundation, this post extends its async patterns to LLM-specific resilience: circuit breakers, multi-provider fallbacks, and token-bucket rate limiting.
Production-Grade LLM System Architecture: From Notebook to 10k RPM
Learn how to design decoupled LLM systems that handle 10,000 requests per minute. Covers async queues, caching strategies, RAG integration, and the three biggest bottlenecks you'll face.
Building a Production-Ready LangChain DeepAgent with SSE Streaming
How I built a production AI agent API using LangGraph, Ollama, and FastAPI with real-time SSE streaming, Redis rate limiting, and multi-turn session memory.
OpenClaw: A Self-Hosted AI Assistant with Ollama, Telegram & Discord
Install OpenClaw, connect it to Anthropic, OpenAI, Google, or a local Ollama model, control it from Telegram and Discord, extend it with skills.sh, and turn it into a business gateway for lead qualification, customer support, and agent-ecosystem management.
Building a Production-Ready Document Extraction Pipeline with FastAPI and MinerU
How I built a high-performance document extraction system using FastAPI, Celery, MinerU, and LLMs. Learn the architecture, challenges, and lessons from deploying AI-powered document processing at scale.
Getting Started with LLM Agents in Production
A practical guide to building and deploying LLM-powered agents using LangChain and FastAPI, with lessons from real-world implementations.