Blog

Notes on AI engineering, product systems, and the work behind taking prototypes into production.

Observability for AI Systems: Tracing, Evals, and Feedback Loops

Jun 13, 2026

AI-powered systems need a different kind of observability than traditional software. This guide covers tracing LLM calls, tracking evaluation scores over time, and building feedback loops that let you improve continuously.

AI engineering, Observability, LLM, Tracing, Production

How AI Tools Actually Work

Jun 13, 2026

Models do not read files or run commands on their own. They ask for tools, and a harness runs them.

AI, Tools, Agents, Tooling

LLM Evaluation in Practice: Beyond MMLU

Jun 12, 2026

MMLU tells you nothing about whether your LLM-powered application works. This is a practical guide to evaluating LLM pipelines for production — harness-based benchmarks, LLM-as-judge, and the signals that actually matter.

LLM, Evaluation, Benchmarks, AI engineering, Quality

RAG Is Not Enough: The Pipeline Problems Nobody Talks About

Jun 11, 2026

Embedding and retrieving is the easy part. The real failure modes in production RAG systems are chunking strategy, retrieval quality, re-ranking, and evaluation — and most teams discover them too late.

RAG, AI, Embeddings, Vector search, NLP

Building Reliable MCP Servers in Production

Jun 10, 2026

A practical guide to designing MCP servers that behave predictably under load — schema validation, error handling, observability, and the tool shapes that actually hold up in production.

MCP, AI, Tool calling, Production, Schema

Agentic AI: What Actually Breaks in Production

Jun 9, 2026

Agent demos look autonomous. Production agents fail in predictable ways: tool loops, bad recovery, memory drift, weak permissions, missing observability, and workflows with no clear stop condition.

AI, Agents, Production, Automation

LLM Evaluation in Practice: Beyond MMLU

Jun 9, 2026

Academic benchmarks tell you whether a model is generally capable. Product evals tell you whether your AI application works for your users, data, tools, and failure cases.

AI, Evals, LLMs, Quality

Observability for AI Systems: Tracing, Evals, and Feedback Loops

Jun 9, 2026

AI observability is not just logs around a model call. Reliable systems need traces across prompts, tools, retrieval, evals, costs, feedback, and production regressions.

AI, Observability, Tracing, Evals

RAG Is Not Enough: The Pipeline Problems Nobody Talks About

Jun 9, 2026

Basic embedding plus retrieval is only the start. Real RAG quality depends on ingestion, chunking, hybrid search, reranking, context assembly, and evaluation.

AI, RAG, Search, Evaluation

Building Reliable MCP Servers in Production

Jun 9, 2026

MCP servers are easy to demo and surprisingly easy to break in production. Reliability comes from narrow tools, schema validation, explicit permissions, timeouts, idempotency, and observability.

AI, MCP, Production, Tooling

MCP Servers: A Practical Introduction

Jun 6, 2026

A practical guide to Model Context Protocol servers: what they expose, how clients call them, and how to design tools that stay predictable.

AI, MCP, Agents, Tooling