Building agentic systems that actually work in production

Every enterprise we work with is asking a variant of the same question: "How do we build AI agents that actually ship to production — and stay shipped?" Most have a prototype that demos beautifully but stalls in compliance review. Some have systems that made it to production and then drifted, broke silently, or got too expensive to operate.

This essay is the playbook we've developed across more than a dozen agentic deployments. It's opinionated, occasionally contrarian, and structured around the engineering decisions that matter — not the model selection or framework debates that dominate most agent content.

The gap between a demo and a system.

An agent demo is easy. Wire an LLM to a few tools, give it a system prompt, and watch it handle a contrived task. The result is genuinely impressive — and almost entirely unrelated to what it takes to run an agent in production for two years.

The gap between a demo and a production agent has little to do with the model. Foundation models are good enough for almost every enterprise use case we encounter. The gap is operational: can your business trust this thing to act on its behalf, debug it when it fails, hold it accountable when it does something wrong, and improve it over time without breaking what's already working?

That trust is engineered, not promised. Here's how.

1. The tool registry is the system.

Every production agent we've shipped has a centralized, typed, versioned tool registry. Tools are not defined in prompts. They're not configured in the application code. They live in a registry the agent reads at runtime, with explicit schemas, RBAC permissions, audit hooks, and version history.

Why does this matter? Three reasons:

Permissions follow tools, not agents. When the same agent serves different surfaces (customer-facing vs internal vs admin), the available tools differ. Permission gating at the tool level is auditable; permission gating in prompts is not.
You can swap models without changing tools. When you upgrade from GPT-4 to GPT-5 to Claude Opus 5 to whatever's next, your tool registry doesn't change. Your evaluation can compare models head-to-head on the same tool set.
Audit and compliance become possible. Every tool call is logged with the agent, the user, the input, the output, the permissions checked. Regulators love this. Auditors love this. Your future self will love this.

// Example tool registry entry (simplified)
{
  "id": "lookup_employee",
  "version": "2.1.0",
  "description": "Look up employee by ID or email",
  "input_schema": {
    "type": "object",
    "properties": {
      "id_or_email": { "type": "string" }
    },
    "required": ["id_or_email"]
  },
  "output_schema": { /* ... */ },
  "permissions": {
    "roles": ["EMPLOYEE", "MANAGER", "HR_ADMIN"],
    "scopes": ["read:employees"]
  },
  "side_effects": "none",
  "rate_limit": "100/minute/user",
  "deprecated": false
}

Tool registries that don't look roughly like this are not yet production-ready.
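
To make the permission and audit points concrete, here is a minimal in-memory sketch of how a registry might gate and log a call. The class names, the handler signature, and the shape of the audit record are assumptions for illustration, not a description of any specific deployment. A real registry would validate the full input schema and write to durable storage.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass(frozen=True)
class ToolEntry:
    """Hypothetical in-code mirror of a registry entry like the one above."""
    id: str
    version: str
    roles: frozenset[str]
    handler: Callable[[dict[str, Any]], Any]
    required: tuple[str, ...] = ()


@dataclass
class ToolRegistry:
    tools: dict[str, ToolEntry] = field(default_factory=dict)
    audit_log: list[dict[str, Any]] = field(default_factory=list)

    def register(self, entry: ToolEntry) -> None:
        self.tools[entry.id] = entry

    def call(self, tool_id: str, user: str, role: str, args: dict[str, Any]) -> Any:
        entry = self.tools[tool_id]  # raises KeyError for unregistered tools
        allowed = role in entry.roles
        missing = [name for name in entry.required if name not in args]
        record = {
            "tool": entry.id, "version": entry.version, "user": user,
            "role": role, "input": args, "allowed": allowed, "output": None,
        }
        # Log before executing so that denied calls are audited too.
        self.audit_log.append(record)
        if not allowed:
            raise PermissionError(f"role {role!r} may not call {tool_id!r}")
        if missing:
            raise ValueError(f"missing required arguments: {missing}")
        record["output"] = entry.handler(args)
        return record["output"]
```

Because the check lives in the registry rather than in the prompt, the same agent can be pointed at different surfaces simply by changing the role passed to call.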

2. Deterministic routing for known intents.

The single biggest gap between agent demos and production agents is what happens when the intent is known. In a demo, the LLM handles every turn. In production, that's slow, expensive, and unnecessary for 60-80% of interactions.

If the user's intent is known, don't ask the model.

Our standard pattern: a deterministic router catches well-formed intents (status checks, lookups, simple commands) and routes them to a finite-state machine or a small task handler. The LLM gets invoked only when the intent is genuinely open-ended, when reasoning is required, or when no rule applies.
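
A rough sketch of that pattern, assuming regular-expression rules and a caller-supplied fallback function standing in for the LLM. The rule set, the handler return values, and the route function are hypothetical; a production router would more likely dispatch to a state machine or task handler than return a string.

```python
import re
from typing import Callable

Handler = Callable[[re.Match], str]

# Hypothetical rules: each pattern describes one well-formed intent.
RULES: list[tuple[re.Pattern, Handler]] = [
    (
        re.compile(r"^\s*(?:what is the )?status of order #?(\d+)\s*\??\s*$", re.I),
        lambda m: f"order_status:{m.group(1)}",
    ),
    (
        re.compile(r"^\s*look ?up employee (\S+)\s*$", re.I),
        lambda m: f"lookup_employee:{m.group(1)}",
    ),
]


def route(text: str, llm_fallback: Callable[[str], str]) -> tuple[str, str]:
    """Return (tier, result). The LLM is called only when no rule matches."""
    for pattern, handler in RULES:
        match = pattern.match(text)
        if match:
            return "deterministic", handler(match)
    return "llm", llm_fallback(text)
```

The first element of the returned tuple records which tier handled the turn, which is what makes per-tier share and cost reporting possible later.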

The economics are striking. On a typical customer-service agent, we see 65% of interactions handled by the deterministic layer at sub-100ms latency, 25% by a single LLM call, and only 10% requiring multi-step reasoning. The relative cost per interaction across these three tiers is roughly 1:50:500.
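
The arithmetic behind those figures is worth working through. The sketch below uses the shares and cost ratios quoted above; the "no router" comparison assumes that, without a deterministic layer, those interactions would each cost one single LLM call, which is our assumption rather than a measured number.

```python
# (share of interactions, relative cost per interaction)
TIERS = {
    "deterministic": (0.65, 1),
    "single_llm_call": (0.25, 50),
    "multi_step": (0.10, 500),
}


def blended_cost(tiers: dict[str, tuple[float, float]]) -> float:
    """Expected relative cost of one interaction across all tiers."""
    return sum(share * cost for share, cost in tiers.values())


def cost_share(tiers: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Fraction of total spend attributable to each tier."""
    total = blended_cost(tiers)
    return {name: share * cost / total for name, (share, cost) in tiers.items()}


# Assumption: with no router, deterministic traffic costs one LLM call each.
NO_ROUTER = {**TIERS, "deterministic": (0.65, 50)}
```

Under these numbers the blended cost is about 63 units per interaction, against 95 without the router, a reduction of roughly a third. The 10% of interactions that need multi-step reasoning account for about 79% of the spend, so that tier is where further optimization pays off.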

3. Observability that's built in, not bolted on.

If you can't see what your agent is doing, you can't operate it. Every production agent needs at minimum:

Trace logging for every conversation — the full prompt history, tool calls, and outputs
Latency and cost metrics per turn, per session, per user, per intent
Quality signals — explicit user feedback, implicit signals like retry rate, escalation rate
Drift detection — are the kinds of questions the agent receives shifting over time?
Hallucination detection — are tool calls being fabricated? Are responses citing non-existent sources?

We typically deploy LangSmith or Helicone for trace logging, with Prometheus + Grafana for the operational metrics, and a custom quality dashboard built on the conversation logs.
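
Independently of the tooling, the underlying data model can be small. The following sketch assumes a hypothetical per-turn trace record and shows two of the checks listed above: cost and latency aggregated per intent, and a simple test for tool calls that name a tool absent from the registry. It does not use the LangSmith, Helicone, or Prometheus APIs.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class TurnTrace:
    """Hypothetical record captured for every agent turn."""
    session_id: str
    intent: str
    latency_ms: float
    cost_usd: float
    tool_calls: tuple[str, ...] = ()


def summarize_by_intent(traces: list[TurnTrace]) -> dict[str, dict[str, float]]:
    """Aggregate turn count, mean latency, and total cost per intent."""
    grouped: dict[str, list[TurnTrace]] = defaultdict(list)
    for trace in traces:
        grouped[trace.intent].append(trace)
    return {
        intent: {
            "turns": len(items),
            "mean_latency_ms": mean(t.latency_ms for t in items),
            "total_cost_usd": sum(t.cost_usd for t in items),
        }
        for intent, items in grouped.items()
    }


def fabricated_tool_calls(
    traces: list[TurnTrace], registered: set[str]
) -> list[tuple[str, str]]:
    """Return (session_id, tool_name) for calls to tools not in the registry."""
    return [
        (t.session_id, name)
        for t in traces
        for name in t.tool_calls
        if name not in registered
    ]
```

The fabricated-call check only works if the registry is the single source of truth for tool names, which is one more reason to centralize it.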

4. Evaluation is the real product.

Here's a contrarian opinion: the most valuable artifact from an agent engagement is rarely the agent itself. It's the evaluation framework, because that is what makes the agent operable.

A good evaluation framework includes:

A test set of representative interactions, growing over time as production interactions are reviewed
Automated scoring per interaction across multiple dimensions (correctness, helpfulness, safety, format)
A way to compare two model/prompt/tool variants head-to-head on the test set
A way to debug specific failures — which test cases failed, why, and how to fix them

Without this, every model change is a leap of faith. With it, you can ship updates confidently and measurably.
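
A head-to-head comparison can start as something this small. The sketch assumes each test case is a dict with id, input, and expected keys, that an agent variant is any callable from input text to output text, and that exact match is an acceptable scorer. All three are simplifications; real scoring usually covers several dimensions.

```python
from typing import Callable

Agent = Callable[[str], str]


def exact_match(output: str, expected: str) -> float:
    """Simplest possible scorer: case-insensitive exact match."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(
    agent: Agent,
    cases: list[dict],
    scorer: Callable[[str, str], float] = exact_match,
) -> dict[str, float]:
    """Score one variant on every test case, keyed by case id."""
    return {c["id"]: scorer(agent(c["input"]), c["expected"]) for c in cases}


def compare(baseline: Agent, candidate: Agent, cases: list[dict]) -> dict:
    """Compare two variants on the same test set, case by case."""
    base = run_eval(baseline, cases)
    cand = run_eval(candidate, cases)
    return {
        "baseline_mean": sum(base.values()) / len(base),
        "candidate_mean": sum(cand.values()) / len(cand),
        "regressions": sorted(cid for cid in base if cand[cid] < base[cid]),
        "improvements": sorted(cid for cid in base if cand[cid] > base[cid]),
    }
```

Reporting regressions per case, rather than only the mean, matters because a candidate can match the baseline's average score while breaking cases that used to pass.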

5. Human-in-the-loop for actions that matter.

The most common production failure mode we see: an agent autonomously takes an action it shouldn't have. Refunded the wrong customer. Cancelled the wrong subscription. Sent a confidential document to the wrong recipient.

These incidents usually don't require a model failure. They happen when a system is built so that the agent can take consequential actions without human review. The solution is straightforward: categorize tools by reversibility, and require human-in-the-loop confirmation for the irreversible ones.

In practice, this means one more field on each tool registry entry, alongside permissions and rate limits: a requires_confirmation flag. When it is set, the agent generates the proposed action and routes it to a human reviewer instead of executing it directly. The reviewer accepts, modifies, or rejects it.
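
One way that gate might look in code, as a sketch only: the ActionGate and ProposedAction names, the in-memory queue, and the review method are hypothetical, and a real system would persist the queue and record who reviewed each action.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class ProposedAction:
    tool_id: str
    args: dict[str, Any]
    status: str = "pending"  # pending -> approved | rejected


@dataclass
class ActionGate:
    handlers: dict[str, Callable[[dict[str, Any]], Any]]
    requires_confirmation: set[str]
    queue: list[ProposedAction] = field(default_factory=list)

    def submit(self, tool_id: str, args: dict[str, Any]) -> Any:
        """Execute reversible tools directly; queue the rest for review."""
        if tool_id in self.requires_confirmation:
            action = ProposedAction(tool_id, args)
            self.queue.append(action)
            return action
        return self.handlers[tool_id](args)

    def review(
        self,
        action: ProposedAction,
        approve: bool,
        modified_args: Optional[dict[str, Any]] = None,
    ) -> Any:
        """Apply a reviewer's decision: accept, modify and accept, or reject."""
        if action.status != "pending":
            raise ValueError("action has already been reviewed")
        if not approve:
            action.status = "rejected"
            return None
        if modified_args is not None:
            action.args = modified_args
        action.status = "approved"
        return self.handlers[action.tool_id](action.args)
```

The agent never holds a code path that executes a flagged tool on its own; only review does, which keeps the guarantee structural rather than dependent on prompt wording.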

Putting it together: a reference architecture.

Every production agentic system we've shipped looks roughly like this:

User input
    ↓
[Deterministic router] ─ handles 60-80% of intents
    ↓ (unmatched)
[Concierge LLM] ─ classifies intent, selects specialist agent
    ↓
[Specialist agent] ─ handles domain-specific reasoning
    ↓ (tool calls)
[Tool registry] ─ permission-checked, audit-logged, rate-limited
    ↓
[HITL queue] ─ for irreversible actions
    ↓
[Action executor]
    ↓
[Response synthesizer] ─ formats output for surface
    ↓
[Observability] ─ logs, metrics, evals — captured throughout

This architecture has shipped in finance, healthcare, retail, and enterprise SaaS contexts. The specifics differ by domain. The bones are the same.

Closing thought

The hardest thing about agentic systems isn't building them. Foundation models have gotten good enough that the modeling work is increasingly commoditized. The hard part is the operational discipline: tool registries, deterministic routing, observability, evaluation, human-in-the-loop design.

If you're starting an agentic project, invest disproportionately in this discipline from day one. The teams that get this right ship in months. The teams that don't, ship demos for years.

Ketul Kumar
