We deployed our first LLM agent into production in July 2024. It was supposed to handle Tier-1 support tickets. Within six hours, it had confidently told three customers that their SLA was covered under a plan they weren't on, fabricated a refund policy that didn't exist, and cheerfully escalated a routine billing question with the message "This appears to be a legal matter." We turned it off. Then we rebuilt it correctly. Here's what that took.
The LLM agent space has matured dramatically in 18 months. Frameworks like LangChain, LlamaIndex, and CrewAI have lowered the barrier to building. But the distance between a demo and a production system is enormous, and most content covers the demo side. This article covers the other side.
The Three Problems Nobody Talks About
1. Reliability Under Non-Determinism
LLMs are stochastic. Temperature > 0 means you get different outputs for identical inputs. In a traditional software system, this is catastrophic. Your entire engineering culture is built around reproducibility — unit tests, integration tests, regression suites. None of these work on LLM agents the way they work on deterministic code.
The solution is not to eliminate non-determinism — it's to design around it. Successful production agents have three properties: bounded action spaces (the agent can only call approved tools), structured outputs (JSON schemas enforced by the model), and idempotent operations (every tool call results in the same system state regardless of how many times it's called).
2. Cost at Scale
GPT-4o tokens are cheap. At scale, they are not. A support agent handling 10,000 conversations per day with a 3,000-token context window costs $18,000/month at GPT-4o pricing — before you add retrieval, tool calls, or multi-turn memory. We've seen startups receive $40K surprise invoices from OpenAI because nobody built cost attribution into their agent systems.
Cost Architecture Principle
Every agent action should have an estimated cost attached before execution. Build a cost governor that can pause agent execution, switch to a cheaper model, or compress context when spending exceeds per-conversation thresholds.
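A cost governor can be sketched in a few lines. Everything below is illustrative: the per-1K-token prices are placeholders (not real OpenAI rates), and the 80% downgrade threshold is an assumption you would tune per use case:

```python
# Hypothetical cost governor: estimate spend before each action, then
# proceed, downgrade to a cheaper model, or pause when over budget.
# Prices and thresholds are illustrative, not actual vendor rates.
PRICE_PER_1K = {"gpt-4o": 0.0050, "gpt-4o-mini": 0.00015}  # USD per 1K tokens

class CostGovernor:
    def __init__(self, budget_usd: float):
        self.budget = budget_usd   # per-conversation spending cap
        self.spent = 0.0

    def decide(self, est_tokens: int, model: str) -> str:
        est_cost = est_tokens / 1000 * PRICE_PER_1K[model]
        if self.spent + est_cost > self.budget:
            return "pause"      # over budget: stop and escalate to a human
        if self.spent + est_cost > 0.8 * self.budget and model != "gpt-4o-mini":
            return "downgrade"  # near budget: switch to the cheaper model
        return "proceed"

    def record(self, tokens_used: int, model: str) -> None:
        self.spent += tokens_used / 1000 * PRICE_PER_1K[model]
```

The key design choice is that `decide` runs *before* the API call, so the governor can act on an estimate rather than reconcile an invoice after the fact.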
3. Trust & Guardrails
Agents can take real actions in the real world: send emails, modify databases, call APIs, create infrastructure resources. Every tool you give an agent is a surface area for unintended consequences. The question is not "can the agent do this?" but "under what conditions, with what oversight, and with what rollback mechanism?"
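Those three questions — conditions, oversight, rollback — can be made structural rather than aspirational by attaching them to every tool. The sketch below is one possible shape, with entirely hypothetical names:

```python
# Illustrative guardrail wrapper: every tool carries an execution condition,
# an oversight rule, and a rollback hook. All names are hypothetical.
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class GuardedTool:
    run: Callable[..., Any]
    precondition: Callable[..., bool]    # "under what conditions?"
    needs_approval: Callable[..., bool]  # "with what oversight?"
    rollback: Callable[..., None]        # "with what rollback mechanism?"

def execute(tool: GuardedTool, *args, approved: bool = False) -> dict:
    if not tool.precondition(*args):
        return {"status": "refused"}
    if tool.needs_approval(*args) and not approved:
        return {"status": "pending_human_approval"}
    try:
        return {"status": "ok", "result": tool.run(*args)}
    except Exception:
        tool.rollback(*args)  # undo any partial effect before reporting failure
        return {"status": "rolled_back"}
```

An agent given only `GuardedTool` instances cannot take an action whose failure mode nobody has thought about, because registering the tool forces you to answer all three questions up front.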
The Architecture That Actually Works in Production
After running six different production agents, we've converged on an architecture we call the "Constrained Autonomy Stack." The annotated listing below walks through its layers, from typed tool contracts through cost gating to the audit trail:
```python
# production_agent.py — The Constrained Autonomy Pattern
from openai import OpenAI
from pydantic import BaseModel
from typing import Literal
import json, logging

logger = logging.getLogger(__name__)
client = OpenAI()

# ── 1. Strongly-typed tool definitions ───────────────────────────
class TicketAction(BaseModel):
    action: Literal["escalate", "resolve", "request_info", "apply_credit"]
    reason: str
    confidence: float  # 0.0–1.0
    requires_human_review: bool

# ── 2. Constrained tool registry ─────────────────────────────────
APPROVED_TOOLS = {
    "get_customer_account": lambda cid: fetch_customer(cid),
    "get_ticket_history": lambda tid: fetch_history(tid),
    "apply_credit": lambda cid, amt: apply_credit_safe(cid, amt),  # idempotent
    "escalate_to_human": lambda tid, reason: escalate(tid, reason),
}

# ── 3. Agent execution with guardrails ───────────────────────────
def run_support_agent(ticket: dict) -> TicketAction:
    system_prompt = """You are a Tier-1 support agent.
STRICT RULES:
- Never state policies not explicitly in the knowledge base.
- If confidence < 0.75, set requires_human_review = true.
- Credits > $50 always require human review.
- Never access tools not in your approved list."""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": json.dumps(ticket)},
    ]

    # ── 4. Cost-gated execution ──────────────────────────────────
    estimated_tokens = estimate_tokens(messages)
    if estimated_tokens > 4000:
        logger.warning(f"Context too large ({estimated_tokens} tokens), compressing")
        messages = compress_context(messages, max_tokens=3000)

    response = client.beta.chat.completions.parse(
        model="gpt-4o-mini",  # cheaper for routine; upgrade only for complex
        messages=messages,
        response_format=TicketAction,
        temperature=0.1,  # low temperature for consistency
    )
    action = response.choices[0].message.parsed

    # ── 5. Pre-execution safety check ────────────────────────────
    if action.action == "apply_credit" and action.requires_human_review:
        logger.info("Credit action flagged for human review, escalating")
        action.action = "escalate"  # must stay within the Literal; the
        # escalate_to_human tool is invoked downstream from this action

    # ── 6. Audit trail ────────────────────────────────────────────
    log_agent_action(ticket["id"], action, estimated_tokens)
    return action
```
Observability for LLM Agents
You cannot improve what you cannot measure. LLM agent observability requires a new stack: traces at the conversation level (not just the API call), token consumption per action, tool call success/failure rates, hallucination detection metrics, and human override rates (the percentage of agent decisions that a human later reverses — your best signal of agent quality).
- Langfuse or Helicone for LLM-native tracing and cost attribution
- Arize AI or Evidently for model drift and quality monitoring
- Custom Prometheus metrics for business-level outcomes (resolution rate, escalation rate, CSAT correlation)
- Alerting on human override rate > 15% (threshold depends on your use case)
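The human override rate is simple enough to compute in-process before you wire it into Prometheus. A minimal sketch, assuming a rolling window of recent decisions (window size, warm-up count, and the 15% threshold are all illustrative):

```python
# Sketch of the human-override-rate signal: the share of recent agent
# decisions that a human later reversed. All parameters are illustrative.
from collections import deque

class OverrideRateMonitor:
    def __init__(self, window: int = 200, threshold: float = 0.15):
        self.decisions = deque(maxlen=window)  # True = human overrode the agent
        self.threshold = threshold

    def record(self, overridden: bool) -> None:
        self.decisions.append(overridden)

    def rate(self) -> float:
        return sum(self.decisions) / len(self.decisions) if self.decisions else 0.0

    def should_alert(self) -> bool:
        # Require a minimum sample before alerting to avoid noise on cold start.
        return len(self.decisions) >= 20 and self.rate() > self.threshold
```

In production you would feed `rate()` into a gauge metric and alert on it, but the rolling-window shape is the important part: a lifetime average hides a sudden quality regression that a recent-window rate surfaces immediately.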
The Honest Truth
Fully autonomous agents that make consequential decisions without human oversight are not appropriate for most business contexts today. The most successful production agents are those with well-defined scopes, clear escalation paths, and humans in the loop for high-stakes decisions. "AI-assisted" is safer and more valuable than "AI-autonomous" at most maturity levels.
LLM agents are genuinely transformative when deployed thoughtfully. The teams winning with this technology are not the ones who gave their agent the most tools — they're the ones who gave it the fewest tools necessary, surrounded it with the best guardrails, and measured rigorously. Start constrained. Expand based on data. Your users will thank you.