Jun 13, 2026
AI-powered systems need a different kind of observability than traditional software. This guide covers tracing LLM calls, tracking evaluation scores over time, and building feedback loops that let you improve continuously.
Notes on AI engineering, product systems, and the work behind taking prototypes into production.
Jun 13, 2026
Models do not read files or run commands on their own. They ask for tools, and a harness runs them.
Jun 12, 2026
MMLU tells you nothing about whether your LLM-powered application works. This is a practical guide to evaluating LLM pipelines for production — harness-based benchmarks, LLM-as-judge, and the signals that actually matter.
Jun 11, 2026
Embedding and retrieving is the easy part. The real failure modes in production RAG systems are chunking strategy, retrieval quality, re-ranking, and evaluation — and most teams discover them too late.
Jun 10, 2026
A practical guide to designing MCP servers that behave predictably under load — schema validation, error handling, observability, and the tool shapes that actually hold up in production.
Jun 9, 2026
Agent demos look autonomous. Production agents fail in predictable ways: tool loops, bad recovery, memory drift, weak permissions, missing observability, and workflows with no clear stop condition.
Jun 9, 2026
Academic benchmarks tell you whether a model is generally capable. Product evals tell you whether your AI application works for your users, data, tools, and failure cases.
Jun 9, 2026
AI observability is not just logs around a model call. Reliable systems need traces across prompts, tools, retrieval, evals, costs, feedback, and production regressions.
Jun 9, 2026
Basic embedding plus retrieval is only the start. Real RAG quality depends on ingestion, chunking, hybrid search, reranking, context assembly, and evaluation.
Jun 9, 2026
MCP servers are easy to demo and surprisingly easy to break in production. Reliability comes from narrow tools, schema validation, explicit permissions, timeouts, idempotency, and observability.
Jun 6, 2026
A practical guide to Model Context Protocol servers: what they expose, how clients call them, and how to design tools that stay predictable.