What Agent Logs Taught Me About Failure Modes

Jordan Ellis · 6 min read

Five failure modes that only show up on real data and over many runs: silent truncation, confident hallucination, context bleed, infinite loops, and schema drift, with the fix pattern for each. A catalog from production agent logs.

The first time a production Claude agent silently truncated its output at the token cap and returned a half-finished JSON object to the caller, the downstream service parsed it, wrote garbage to the database, and nobody noticed for three days. That is the failure mode the demo never shows.

What happens when output hits the token cap?

Silent truncation is the failure mode most people hit first and diagnose last. The agent runs, it returns something, and the caller assumes completion. But max_tokens is an absolute ceiling on the response. The model cannot exceed it. When the model hits it mid-output, it stops. No exception. No failed request. Just a response that ends wherever the token budget ran out, with a stop reason of max_tokens in the response metadata that a caller reading only the text never checks.

For structured output this is especially bad. A JSON object that terminates mid-value, a list that cuts off before the final item, a reasoning trace that ends in the middle of a sentence. The response is syntactically broken or semantically incomplete, and a naive caller will swallow it.

The fix is output validation before you do anything with the result. If you asked for JSON, parse it. If the parse fails or the schema check fails, treat it as an agent error and retry with a larger max_tokens budget or a prompt that asks for a shorter response.

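import { ZodType } from 'zod';

// Carries the raw output or the validation error so it can be logged.
class AgentOutputError extends Error {
  constructor(message: string, readonly detail: unknown) {
    super(message);
    this.name = 'AgentOutputError';
  }
}

// Parse first, then check the schema; either failure is an agent error.
function validateAgentOutput<T>(raw: string, schema: ZodType<T>): T {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    throw new AgentOutputError('Truncated or malformed JSON', raw);
  }
  const result = schema.safeParse(parsed);
  if (!result.success) {
    throw new AgentOutputError('Schema validation failed', result.error);
  }
  return result.data;
}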

Set max_tokens generously for structured output tasks. Then validate. In that order.
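Truncation can often be caught before the parse step. Here is a minimal sketch, assuming a flattened response shape that exposes a stop_reason field (the Anthropic Messages API sets stop_reason to "max_tokens" when output is cut off at the cap). The ModelResponse and Verdict types, the checkForTruncation name, and the doubling policy are illustrative assumptions, not part of any SDK.

```typescript
// Hypothetical flattened response; real SDK responses carry text in content blocks.
interface ModelResponse {
  stop_reason: string | null;
  text: string;
}

type Verdict =
  | { kind: 'ok'; value: unknown }
  | { kind: 'retry'; nextMaxTokens: number; reason: string };

// Decide whether to accept a response or retry with a bigger budget.
// The doubling policy and the ceiling are arbitrary choices for illustration.
function checkForTruncation(
  res: ModelResponse,
  maxTokens: number,
  ceiling = 8192,
): Verdict {
  if (res.stop_reason === 'max_tokens') {
    return {
      kind: 'retry',
      nextMaxTokens: Math.min(maxTokens * 2, ceiling),
      reason: 'hit max_tokens',
    };
  }
  try {
    return { kind: 'ok', value: JSON.parse(res.text) };
  } catch {
    // Not cut off at the cap, but still not valid JSON: retry at the same budget.
    return { kind: 'retry', nextMaxTokens: maxTokens, reason: 'malformed JSON' };
  }
}
```

The function only decides what to do next. The caller still owns the retry count, so a response that keeps hitting the ceiling surfaces as a failure rather than retrying forever.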

Why does the agent invent a tool result?

Tool call hallucination is quieter than it sounds. The agent does not announce that it is making something up. It returns a plausible-looking tool result in the right schema, with realistic-looking values, and moves on. The call never happened.

This shows up most often when the tool is slow, the context is long, or the agent is near its context limit. The model learns the shape of your tool responses from the examples in context. When it is uncertain or running low on attention, it fills the gap.

The agent that hallucinates a tool result is not broken. It is doing exactly what it was trained to do: complete the pattern. The problem is that the pattern it is completing is your tool response schema rather than your actual data.

The fix is server-side verification. Do not trust that a tool was called just because the agent says it was called. Log every real tool invocation with a unique ID. If the agent returns a result that does not have a matching log entry, it invented the result. Structured output with a required tool_call_id field that you issue and verify closes most of this gap.
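The issue-and-verify idea can be sketched as a small in-memory ledger. This is an assumption-laden illustration: ToolCallLedger, record, and verify are hypothetical names, and a production system would likely persist entries rather than hold them in a Map.

```typescript
import { randomUUID } from 'node:crypto';

interface LedgerEntry {
  tool: string;
  result: unknown;
}

// Records every tool call that actually ran, keyed by an ID the server issues.
class ToolCallLedger {
  private readonly calls = new Map<string, LedgerEntry>();

  // Call this from the code path that really executes the tool.
  record(tool: string, result: unknown): string {
    const id = randomUUID(); // unguessable, so the model cannot invent a valid one
    this.calls.set(id, { tool, result });
    return id;
  }

  // A claim is trusted only if its ID was issued here for the same tool.
  verify(claim: { tool_call_id: string; tool: string }): boolean {
    const entry = this.calls.get(claim.tool_call_id);
    return entry !== undefined && entry.tool === claim.tool;
  }
}

// Usage: the ID travels with the real result; the agent must echo it back.
const ledger = new ToolCallLedger();
const issuedId = ledger.record('lookup_order', { status: 'shipped' });
console.log(ledger.verify({ tool_call_id: issuedId, tool: 'lookup_order' })); // true
console.log(ledger.verify({ tool_call_id: 'call_123', tool: 'lookup_order' })); // false
```

Random IDs matter here: a sequential counter is easy for a model to pattern-complete, which would defeat the check.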

How does context filling break long tasks?

On tasks that run for many turns, the context window fills up and early instructions get pushed out. The agent was told to use a specific output format in turn 1. By turn 40, that instruction is gone. It starts formatting differently. It forgets constraints. It starts repeating work it already did.

This is not a bug in the model. It is a property of attention over long sequences. The fix is to summarize and continue: at a threshold, pause the agent, summarize the conversation so far into a compact state object, and restart with that summary as the new system context. The agent picks up from the summary instead of a 60,000-token scroll of prior turns.

For production agents that run many turns, build the summarize-and-continue pattern in from the start. Retrofitting it is painful.
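The threshold check and the restart step might look like the sketch below, assuming the summary itself is produced by a separate model call that is not shown. AgentState, shouldCompact, restartContext, and the four-characters-per-token estimate are all illustrative assumptions; prefer your provider's reported token counts where they are available.

```typescript
interface Turn {
  role: 'user' | 'assistant';
  content: string;
}

// Compact state carried across a restart. Field names are illustrative.
interface AgentState {
  goal: string;
  constraints: string[]; // e.g. the output format given in turn 1
  completedSteps: string[];
  openQuestions: string[];
}

// Rough heuristic only: about four characters per token.
function estimateTokens(turns: Turn[]): number {
  const chars = turns.reduce((total, turn) => total + turn.content.length, 0);
  return Math.ceil(chars / 4);
}

function shouldCompact(turns: Turn[], thresholdTokens: number): boolean {
  return estimateTokens(turns) >= thresholdTokens;
}

// Build a fresh context: original instructions plus the state, no old turns.
function restartContext(
  baseSystemPrompt: string,
  state: AgentState,
): { system: string; turns: Turn[] } {
  const system = [
    baseSystemPrompt,
    'State carried over from earlier turns (JSON):',
    JSON.stringify(state, null, 2),
  ].join('\n\n');
  return {
    system,
    turns: [{ role: 'user', content: 'Continue the task from the state above.' }],
  };
}
```

Keeping constraints as their own field means the turn-1 instructions are restated verbatim after every restart instead of surviving only if the summarizer happens to mention them.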

How do you stop an agent that loops forever?

An agent loops when it cannot finish a task and cannot recognize that it cannot finish it. It retries the same approach, hits the same wall, and tries again. Without an exit condition, it will run until you run out of budget or patience.

The cause is usually one of three things: the tool is returning an error the agent does not know how to handle, the task is underspecified and the agent is oscillating between interpretations, or the agent is stuck waiting on a result that will never come.

| Failure mode | Symptom | Fix |
| --- | --- | --- |
| Silent truncation | Incomplete or unparseable output, no error | Validate output schema; increase max_tokens; retry asking for a shorter response |
| Tool hallucination | Plausible tool results that never happened | Server-side tool call logging; verify call IDs; structured output with issued IDs |
| Context bleed | Agent forgets early instructions, repeats work | Summarize-and-continue at a turn threshold; compact state object |
| Infinite loop | Agent retries the same failing approach indefinitely | Hard turn cap; budget limit; loop detection on repeated tool calls |
| Schema drift | Structured output stops matching expected schema | Strict schema validation on every response; retry with explicit schema in prompt |

The fix for loops is mechanical: a hard turn cap and a budget limit. Pick a maximum number of turns that represents the worst plausible legitimate run, and add a margin. When the agent hits it, surface a failure instead of continuing. A bounded failure is always better than an unbounded loop.

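const MAX_TURNS = 20;
const MAX_COST_USD = 0.50;

class AgentLoopError extends Error {}
class AgentBudgetError extends Error {}

// Result of one turn. agentTurn is your own wrapper around the model call.
interface TurnResult {
  done: boolean;
  repeatingPriorToolCall: boolean;
  usage: { cost: number }; // USD spent on this turn
}
declare function agentTurn(task: string): Promise<TurnResult>;

async function runBoundedAgent(task: string): Promise<TurnResult> {
  let turns = 0;
  let cost = 0;

  // Stop as soon as either the turn cap or the budget is exhausted.
  while (turns < MAX_TURNS && cost < MAX_COST_USD) {
    const result = await agentTurn(task);
    cost += result.usage.cost;
    turns++;

    if (result.done) return result;
    if (result.repeatingPriorToolCall) {
      throw new AgentLoopError(`Loop detected at turn ${turns}`);
    }
  }

  // Falling out of the loop means a limit was hit before the task finished.
  throw new AgentBudgetError(`Exceeded limits: ${turns} turns, $${cost.toFixed(4)} cost`);
}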

Loop detection on repeated tool calls is worth adding. If the agent calls the same tool with the same arguments twice in a row, it is stuck.
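One way the repeatingPriorToolCall flag might be computed is sketched below. The ToolCall shape and the function names are assumptions for illustration; the key detail is canonicalizing the arguments so that key order does not hide a repeat.

```typescript
interface ToolCall {
  tool: string;
  args: Record<string, unknown>;
}

// Recursively sort object keys so {a:1,b:2} and {b:2,a:1} serialize the same.
function canonical(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(canonical);
  if (value !== null && typeof value === 'object') {
    const source = value as Record<string, unknown>;
    const sorted: Record<string, unknown> = {};
    for (const key of Object.keys(source).sort()) {
      sorted[key] = canonical(source[key]);
    }
    return sorted;
  }
  return value;
}

function callKey(call: ToolCall): string {
  return `${call.tool}:${JSON.stringify(canonical(call.args))}`;
}

// True when the two most recent calls are the same tool with the same arguments.
function isRepeatingPriorToolCall(history: ToolCall[]): boolean {
  const [previous, latest] = history.slice(-2);
  if (previous === undefined || latest === undefined) return false;
  return callKey(previous) === callKey(latest);
}
```

Comparing only the last two calls matches the "twice in a row" rule. Catching longer cycles (A, B, A, B) would need a window over more of the history, which this sketch does not attempt.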

What is schema drift and why does it happen suddenly?

Schema drift is when the agent's structured output stops matching the schema you specified. It starts dropping required fields, renaming keys, nesting objects differently. It was working yesterday. It is broken today.

Usually this is a context length issue combined with a prompt issue. As the context gets longer, the schema specification in the prompt gets less weight. Or you updated the prompt and the schema example in the prompt no longer matches the validation schema in your code.

The fix is to include the full JSON schema in the system prompt on every call, validate every response against it strictly, and on failure retry with the schema appended again to the user turn. Three retries with exponential backoff catch most transient drift. Persistent drift means your prompt and your code schema have diverged; reconcile them.
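The retry-with-schema loop could be sketched like this. The model call and the validator are injected, so callModel, validate, callWithSchemaRetry, and the default delay are all hypothetical names and values rather than any library's API.

```typescript
type CallModel = (userTurn: string) => Promise<string>;

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// One initial attempt plus up to maxRetries retries. Each retry waits
// baseDelayMs * 2^(retry - 1) and re-appends the schema to the user turn.
async function callWithSchemaRetry<T>(
  callModel: CallModel,
  userTurn: string,
  jsonSchema: object,
  validate: (raw: string) => T, // must throw on invalid output
  maxRetries = 3,
  baseDelayMs = 500,
): Promise<T> {
  let prompt = userTurn;
  let lastError: unknown;

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    if (attempt > 0) {
      await sleep(baseDelayMs * 2 ** (attempt - 1));
      prompt =
        `${userTurn}\n\nRespond with JSON matching this schema exactly:\n` +
        JSON.stringify(jsonSchema);
    }
    try {
      return validate(await callModel(prompt));
    } catch (err) {
      // Any failure (call error or validation error) counts as one attempt.
      lastError = err;
    }
  }

  throw new Error(
    `Schema validation failed after ${maxRetries} retries: ${String(lastError)}`,
  );
}
```

In practice validate would typically wrap a parse-and-schema-check function such as the validateAgentOutput helper shown earlier in the post.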

The one pattern that covers most of these

Validate, cap, and retry. Validate output schema on every response. Cap turns and budget on every agent run. Retry on validation failure with the schema re-injected. These three things together close most of the failure modes that kill production agent reliability.

The logs will tell you which one you are missing.


FAQ

How many retries should I allow before surfacing a failure?

Three retries covers most transient failures (context noise, model variance, brief tool errors). More than three without a change in the prompt or context usually signals that the task is broken rather than that the model is unlucky. Surface the failure and let the caller decide.

Does summarize-and-continue work for all agent types?

It works well for research and planning agents where the accumulated context is text. It works poorly for agents that depend on exact prior tool outputs because the summary loses precision. For those, serialize the tool results to a structured state object rather than a prose summary.

How do I know if my agent is looping versus legitimately retrying?

Check the tool call log for repeated (tool, args) pairs. A legitimate retry calls the same tool with different arguments or on a different resource. A loop calls the same tool with the same arguments. That distinction is detectable in your logs without any model-side changes.

