The missing layer between observability and benchmarks.


Over the last year, Agentic AI has exploded.

OpenAI has agents.

Anthropic has agents.

Google DeepMind has agents.

Every startup suddenly has a multi-agent architecture diagram.

And if you look closely, something interesting has happened.

The industry solved observability.

The industry largely solved benchmarks.

Yet somehow, we still cannot answer a deceptively simple question:

Was this agent actually good?

Not “did it finish.”

Not “did it return the correct answer.”

Not “did it beat another model on a leaderboard.”

Was the behavior itself good?

That sounds like a trivial question.

It isn’t.


The Three Layers of Agent Evaluation

Most people think evaluation is a single thing.

In reality, the ecosystem has evolved into three distinct layers.

The three layers of agent evaluation: observability and benchmarks are solved; the evaluation layer in the middle is the gap

The first two layers are advancing rapidly.

The third layer barely exists.


Layer A: Observability Is Mostly Solved

A few years ago, debugging AI agents was painful.

You had no idea what happened between the prompt and the final answer.

Today that’s no longer true.

The ecosystem has largely standardized around tracing.

Platforms such as LangSmith, Braintrust, Arize Phoenix, Langfuse, MLflow, AgentOps, and OpenTelemetry can capture nearly every step of an agent execution.

A modern trace can tell you:

  • Which tools were called
  • Which prompts were used
  • Which agent delegated work
  • How many tokens were consumed
  • How much latency was introduced
  • Which branch of execution was taken

The industry now has excellent visibility into agent behavior.

The question is no longer:

What happened?

The traces already tell us.


The observability layer: a user request flows through a planner to research and coder agents, each calling tools, down to the final output — every step captured by the trace


Layer B: Benchmarks Are Mostly Solved

The second layer is benchmark evaluation.

The ecosystem has produced a remarkable collection of benchmarks:

BenchmarkMeasures
GAIATool use + reasoning
SWE-benchSoftware engineering
OSWorldComputer use
WebArenaWeb navigation
MLE-benchML engineering
PaperBenchResearch replication
τ-benchAgent-user interactions

These benchmarks answer a useful question:

Can the agent complete a predefined task?

This is important.

But there is a hidden assumption.

The benchmark only cares about the final score.

It rarely cares about how the score was achieved.


The Hidden Problem

Imagine two agents.

Both successfully book a flight.

Same outcome, different behavior: Agent A takes one clean step while Agent B stumbles through retries and a cancellation — both log the same Success = True

Both agents receive the same score.

Yet anyone who has deployed production systems knows these agents are fundamentally different.

One is reliable.

One is fragile.

The benchmark cannot see the difference.


The Missing Layer

This is where evaluation begins.

Not benchmarking.

Evaluation.

The missing question is:

How well did the agent behave while solving the task?

Consider these failure modes:

Failure TypeExample
Wrong planCorrect tools, bad reasoning
Wrong toolSQL needed, calculator used
Wrong parametersAPI called incorrectly
Handoff failureContext lost between agents
Infinite delegationAgents bouncing forever
Memory corruptionShared state overwritten
Policy violationUnauthorized action
Cost explosionThousands of unnecessary tokens
Latency explosionDozens of unnecessary calls

A benchmark score cannot distinguish among these.

But production systems care deeply about them.


Benchmarks grade the endpoints — task in, answer out, pass or fail. Evaluation grades every stage along the path: planning, delegation, tool use, recovery, policy checks


Anthropic’s Important Insight

One of the most important evaluation principles published recently is surprisingly simple:

The transcript is not the outcome.

An agent can say:

“Your flight has been booked.”

That does not mean the flight was booked.

The real evaluation target is the environment itself.

Did a reservation actually appear in the database?

Did the file actually get created?

Did the code actually compile?

Did the money actually move?

The distinction sounds obvious.

Yet much of the industry still evaluates text instead of reality.


The Gap Nobody Owns

This is where things become interesting.

Observability platforms do not solve this problem.

Benchmark leaderboards do not solve this problem.

Most agent frameworks do not solve this problem.

Everyone captures traces.

Everyone reports benchmark scores.

Almost nobody grades the trace itself.

Especially for:

  • Multi-agent coordination
  • Context preservation
  • Delegation quality
  • Safety compliance
  • Recovery behavior
  • Long-horizon reasoning

The industry has built cameras.

The industry has built leaderboards.

The industry has not built referees.

Cameras for observability and leaderboards for benchmarks are built; the referee that judges agent behavior is missing


The Opportunity

The next major category in Agentic AI will not be another benchmark.

It will not be another observability platform.

It will be an evaluation layer.

A system that can consume traces from any framework, evaluate how the agent behaved, and produce standardized scores for:

  • Trajectory quality
  • Coordination quality
  • Safety compliance
  • Cost efficiency
  • Reliability
  • Recovery ability

In other words:

Observability
Evaluation
Benchmarking

Today the middle layer is missing.

That is where the biggest opportunity lies.


Final Thought

The AI industry spent years asking:

Can models answer questions?

Then it asked:

Can agents complete tasks?

The next question is harder.

And ultimately more important.

Can we objectively measure whether an agent behaved well while completing those tasks?

Until we answer that question, every leaderboard is incomplete.

Every benchmark is partial.

And every production deployment is operating with a blind spot.

The future of Agentic AI is not just building agents.

It’s learning how to evaluate them.