Jun 23, 2026

🧪 Testing AI Agents: Multi-Layer Evaluation Strategy

Testing AI agents is fundamentally different from testing traditional software.

Traditional systems follow a simple contract:

input → expected output → assertion

Agents don't.

They reason dynamically, choose tools, compose responses, and operate on data that changes constantly. Hardcoded expected answers quickly become stale and brittle.

To address this, we use a three-layer evaluation strategy that combines deterministic validation, semantic evaluation, and model migration testing.


🚫 No Hardcoded Expected Results

One of the biggest challenges in agent testing is avoiding stale expectations.

Market data changes.
Accounts change.
Reference data evolves.

A hardcoded answer is often outdated the moment it's written.

Instead of storing static expected responses, each test case points to one or more live reference APIs.

Before evaluation runs, the framework fetches current ground-truth data directly from upstream services and evaluates the agent response against that live data.

This shifts the goal from validating snapshots to validating reasoning and communication against current reality.
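
Here is a minimal sketch of what such a test case could look like. The names `AgentTestCase`, `reference_urls`, and `fetch_references` are hypothetical, and the fetcher is injected so the example does not assume any particular HTTP client or upstream service.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class AgentTestCase:
    """A test case that stores where the truth lives, not what the truth is."""
    question: str
    reference_urls: List[str] = field(default_factory=list)  # hypothetical field


def fetch_references(case: AgentTestCase,
                     fetch: Callable[[str], Any]) -> Dict[str, Any]:
    """Fetch current ground truth for every reference the case points to.

    `fetch` is injected (for example a thin wrapper around an HTTP client),
    so no snapshot of the expected answer is ever stored with the test.
    """
    return {url: fetch(url) for url in case.reference_urls}


# Usage with a stand-in fetcher; a real run would call the upstream service.
case = AgentTestCase(
    question="What is the current balance of account A-1?",
    reference_urls=["https://example.invalid/accounts/A-1"],
)
fake_upstream = {"https://example.invalid/accounts/A-1": {"balance": 659161818}}
references = fetch_references(case, fake_upstream.__getitem__)
```

The test case itself never changes when the data does. Only the fetched reference payload moves.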


⚡ Layer 1: Deterministic Validation

The first layer performs fast, reproducible checks against the agent response.

Examples include:

  • Required values present
  • Expected identifiers returned
  • Missing fields detected
  • Error or fallback messages detected
  • Structured formats validated

A key capability is validating values against live reference data.

Expected values are extracted from the reference payload and normalized into multiple formats before checking whether they appear in the response.

For example:

659161818

might be recognized as:

659,161,818
$659.2M
659.2 million
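
A small sketch of that normalization step, assuming one decimal place for abbreviated forms and plain substring matching. The function names are illustrative, not part of any specific framework.

```python
def numeric_variants(value: int) -> set:
    """Return common textual renderings of an integer value."""
    variants = {str(value), f"{value:,}"}
    for divisor, short, word in ((1_000_000_000, "B", "billion"),
                                 (1_000_000, "M", "million"),
                                 (1_000, "K", "thousand")):
        if abs(value) >= divisor:
            scaled = f"{value / divisor:.1f}"  # assumption: one decimal place
            variants.update({f"${scaled}{short}",
                             f"{scaled}{short}",
                             f"{scaled} {word}"})
            break  # only use the largest unit that fits
    return variants


def value_in_response(value: int, response: str) -> bool:
    """True if any rendering of `value` appears in the response text."""
    return any(variant in response for variant in numeric_variants(value))


print(sorted(numeric_variants(659161818)))
print(value_in_response(659161818, "Total assets are $659.2M as of today."))
```

Substring matching is deliberately simple. It would miss a response that rounds differently (say, "$659M") and could match inside a longer number, so a production version would likely want token boundaries and a tolerance.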

This layer catches:

  • Empty responses
  • Missing information
  • Incorrect values
  • Malformed output
  • Tool execution failures

Fast, deterministic, and easy to debug.
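
The checks in this layer can be sketched as a function that returns a list of failure reasons. The fallback phrases and the `expect_json` flag below are assumptions for illustration; a real suite would use the agent's actual error messages and output schema.

```python
import json
from typing import List

# Hypothetical markers; replace with the agent's real fallback phrases.
FALLBACK_MARKERS = ("i'm sorry", "unable to retrieve", "something went wrong")


def deterministic_checks(response: str, required_ids: List[str],
                         expect_json: bool = False) -> List[str]:
    """Return failure reasons; an empty list means the response passed."""
    text = response.strip()
    if not text:
        return ["empty response"]

    failures = []
    lowered = text.lower()
    for marker in FALLBACK_MARKERS:
        if marker in lowered:
            failures.append(f"fallback message detected: {marker!r}")
    for identifier in required_ids:
        if identifier not in text:
            failures.append(f"missing identifier: {identifier}")
    if expect_json:
        try:
            json.loads(text)
        except json.JSONDecodeError:
            failures.append("malformed JSON output")
    return failures
```

Returning reasons instead of a single boolean is what keeps this layer easy to debug: a failed test says which check tripped.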


🤖 Layer 2: LLM-as-Judge

Deterministic checks verify facts.

They cannot determine whether an answer is complete, coherent, or grounded.

For that, a second LLM acts as a judge.

The judge evaluates:

  • Completeness — Did the response answer everything that was asked?
  • Coherence — Is the response clear and logically structured?
  • Groundedness — Are all claims supported by reference data?

The judge receives:

  • The original question
  • The agent response
  • The live reference data

It reasons step-by-step before assigning scores.

Importantly, the judge does not validate numeric accuracy. That responsibility remains with the deterministic layer.

Both layers use the same reference data but in different ways:

  • Layer 1 verifies that required values appear in the response
  • Layer 2 verifies that claims made in the response are supported by the reference

This separation prevents overlap and conflicting evaluations.
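
A rough sketch of the judge call, under a few assumptions: the model is reached through an injected `call_model` function (prompt in, text out), scores use a 1 to 5 scale, and the judge replies in JSON. None of these details come from a specific library.

```python
import json
from typing import Any, Callable, Dict

CRITERIA = ("completeness", "coherence", "groundedness")

PROMPT_TEMPLATE = """You are evaluating an AI agent's answer.
Do not check numeric accuracy; a separate deterministic layer does that.

Question:
{question}

Agent response:
{response}

Reference data:
{reference}

Reason step by step, then reply with JSON only:
{{"reasoning": "...", "completeness": 1-5, "coherence": 1-5, "groundedness": 1-5}}"""


def judge(question: str, response: str, reference: Any,
          call_model: Callable[[str], str]) -> Dict[str, int]:
    """Build the judge prompt, call the model, and validate its scores."""
    prompt = PROMPT_TEMPLATE.format(
        question=question,
        response=response,
        reference=json.dumps(reference, indent=2),
    )
    verdict = json.loads(call_model(prompt))
    scores = {name: int(verdict[name]) for name in CRITERIA}
    for name, score in scores.items():
        if not 1 <= score <= 5:
            raise ValueError(f"{name} score out of range: {score}")
    return scores


# Usage with a stub in place of a real model call.
def stub_model(prompt: str) -> str:
    return ('{"reasoning": "All claims match the reference.", '
            '"completeness": 5, "coherence": 4, "groundedness": 5}')


scores = judge("What is the balance?", "The balance is 42.",
               {"balance": 42}, stub_model)
```

Asking for the reasoning field before the scores mirrors the step-by-step behavior described above, and the explicit instruction about numeric accuracy keeps the judge out of Layer 1's territory.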


🔄 Layer 3: Model Migration Testing

Sometimes the question isn't:

Is this answer correct?

It's:

Can I safely switch from one model to another?

For migration testing, each test case runs twice:

  • Baseline model
  • Candidate model

A judge then compares the two responses and classifies the candidate relative to the baseline.

This mode answers a fundamentally different question from the first two layers.

The first two layers compare responses against objective ground truth.

The migration layer compares a candidate model against the current production baseline.
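
One way to sketch the migration run. The three labels below are hypothetical placeholders for whatever classification scheme the judge uses, and both models plus the judge are passed in as plain callables.

```python
from collections import Counter
from typing import Callable, Iterable

# Hypothetical labels; substitute the judge's actual classification scheme.
LABELS = ("better", "equivalent", "worse")


def run_migration(questions: Iterable[str],
                  baseline: Callable[[str], str],
                  candidate: Callable[[str], str],
                  pairwise_judge: Callable[[str, str, str], str]) -> Counter:
    """Run every question on both models and tally the judge's verdicts."""
    tally = Counter()
    for question in questions:
        base_answer = baseline(question)
        cand_answer = candidate(question)
        label = pairwise_judge(question, base_answer, cand_answer)
        if label not in LABELS:
            raise ValueError(f"unexpected judge label: {label!r}")
        tally[label] += 1
    return tally


# Usage with stand-ins: the "models" only differ in letter case.
def stub_judge(question: str, base: str, cand: str) -> str:
    return "equivalent" if base.lower() == cand.lower() else "worse"


tally = run_migration(["q1", "q2"], str.upper, str.lower, stub_judge)
```

The tally is the migration signal: a candidate that is mostly "equivalent" with a handful of "worse" cases tells you exactly which test cases to read before switching.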


🏗️ Evaluation Pipeline

Ground Truth Evaluation

invoke_agent
    ↓
fetch_live_reference
    ↓
deterministic_validation
    ↓
(optional) llm_judge
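
The ground-truth path above can be sketched as a single function that wires the four steps together in order. The step names follow the diagram; their signatures are assumptions made for this example.

```python
from typing import Any, Callable, Dict, List, Optional


def evaluate_ground_truth(
    question: str,
    invoke_agent: Callable[[str], str],
    fetch_live_reference: Callable[[], Any],
    deterministic_validation: Callable[[str, Any], List[str]],
    llm_judge: Optional[Callable[[str, str, Any], Dict[str, int]]] = None,
) -> Dict[str, Any]:
    """Run the ground-truth steps in pipeline order."""
    response = invoke_agent(question)
    reference = fetch_live_reference()
    result: Dict[str, Any] = {
        "response": response,
        "failures": deterministic_validation(response, reference),
    }
    if llm_judge is not None:  # the judge step is optional
        result["scores"] = llm_judge(question, response, reference)
    return result


# Usage with stand-in steps and no judge.
result = evaluate_ground_truth(
    "What is the balance?",
    invoke_agent=lambda q: "The balance is 42.",
    fetch_live_reference=lambda: {"balance": 42},
    deterministic_validation=lambda resp, ref: (
        [] if str(ref["balance"]) in resp else ["missing balance"]),
)
```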



Model Migration Evaluation

baseline_model
        ↓
candidate_model
        ↓
pairwise_judge

This path focuses solely on migration safety and does not run the ground-truth evaluation layers.


🎯 Why Multiple Layers?

No single evaluation method is sufficient.

Layer 1 provides:

  • Speed
  • Exactness
  • Reproducibility

Layer 2 provides:

  • Semantic validation
  • Contextual reasoning
  • Groundedness checks

Layer 3 provides:

  • Safe model migration
  • Regression detection
  • Comparative evaluation

Together they provide a practical framework for testing AI agents without relying on brittle hardcoded outputs.

As agents become more autonomous and business-critical, having a robust evaluation strategy becomes just as important as the agent itself.



Jun 5, 2026

🔥 Chaos Engineering in Production: The Challenges Nobody Talks About

Part 2 of our multi-region failover series

Last week I wrote about running monthly chaos engineering exercises in production to validate our disaster recovery architecture.

The obvious follow-up:

"How do you get there safely?"

That's the right question.

Because chaos engineering in production is not where you start.

It's where you arrive — after building the operational maturity that makes failure survivable.

Here are the challenges you need to solve before you get there.


🔍 Challenge 1: Observability Gaps Will Expose You

Before triggering your first failover exercise, ask yourself:

If failover started right now, could you tell exactly what was happening?

Not after the fact.

In real time.

Can you see:

  • traffic shifting between regions?
  • application health in both regions?
  • user impact during the transition?

Most dashboards are built for normal operations. Failover produces a different set of signals that spans infrastructure, DNS, networking, and applications at the same time.

If you can't observe the recovery process, you can't safely test it.

Maturity bar: Build dashboards specifically for failover scenarios, not just general infrastructure health.


❤️ Challenge 2: Health Checks That Lie

Does your health check confirm the application is ready to serve traffic — or just that the process is running?

Applications can report healthy while:

  • connection pools are still initializing
  • caches are cold
  • downstream services are unavailable

A failover mechanism that trusts shallow health checks can route traffic to a region that isn't actually ready.

Maturity bar: Validate readiness, not existence.

A lying health check is worse than no health check at all.
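
A minimal sketch of a readiness check that reflects dependencies rather than process existence. The check names in the usage example are hypothetical; each one stands in for a real probe such as "can I borrow a connection from the pool?".

```python
from typing import Callable, Dict, Tuple


def readiness(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, bool]]:
    """Return an HTTP-style status plus per-dependency results.

    200 only when every dependency check passes, 503 otherwise. A check
    that raises is treated as failed instead of crashing the endpoint.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, results


# Usage: the process is up, but the cache is still cold.
status, detail = readiness({
    "db_pool_ready": lambda: True,
    "cache_warm": lambda: False,
    "downstream_reachable": lambda: True,
})
```

A shallow check would have returned 200 here. This one reports 503 and says which dependency is holding the region back.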


🌐 Challenge 3: DNS TTL Is Not Your Friend

Route 53 failover is not a switch.

Traffic does not instantly move from one region to another.

DNS caches.
Clients cache.
Resolvers cache.

Even with aggressive TTLs, some traffic continues flowing to the original region during transition.

Maturity bar: Understand your propagation behavior before running production exercises.

What looks like a failure may simply be DNS doing exactly what DNS does.
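
A back-of-the-envelope estimate helps set expectations. The sketch below assumes failover is triggered by a health check that must fail a number of consecutive times, and that clients and resolvers honor the record TTL. Real traffic can lag further, because some clients cache longer than the TTL says.

```python
def estimated_shift_window(check_interval_s: int,
                           failure_threshold: int,
                           record_ttl_s: int) -> int:
    """Rough time in seconds before TTL-respecting clients move regions.

    Detection time (check interval x consecutive failures) plus the time
    for cached DNS answers to expire. Treat the result as a planning
    estimate, not a guarantee.
    """
    return check_interval_s * failure_threshold + record_ttl_s


# Example values (assumptions): 30 s checks, 3 failures to trip, 60 s TTL.
print(estimated_shift_window(30, 3, 60))  # 150
```

If traffic is still reaching the original region inside that window, that is expected behavior and not a failed exercise.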


🚀 Challenge 4: Cold Start Reality vs. Cold Start Assumption

Recovery timelines often look great on architecture diagrams.

What they rarely account for:

  • connection pool initialization
  • cache warmup
  • downstream dependency stabilization
  • application readiness under real load

Container startup is only the beginning.

Maturity bar: Measure end-to-end recovery under load and let observed behavior define your recovery objectives.
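
One way to measure this is to poll a readiness probe while load is running and stop the clock only after several consecutive successes. This is a sketch: the probe, the success threshold, and the polling interval are all assumptions to tune for your system.

```python
import time
from typing import Callable, Optional


def measure_recovery(probe: Callable[[], bool],
                     required_successes: int = 3,
                     interval_s: float = 1.0,
                     timeout_s: float = 600.0,
                     clock: Callable[[], float] = time.monotonic,
                     sleep: Callable[[float], None] = time.sleep) -> Optional[float]:
    """Seconds until the probe succeeds `required_successes` times in a row.

    Returns None if the timeout is reached first. `clock` and `sleep`
    are injectable so the function can be tested without waiting.
    """
    start = clock()
    streak = 0
    while clock() - start < timeout_s:
        streak = streak + 1 if probe() else 0
        if streak >= required_successes:
            return clock() - start
        sleep(interval_s)
    return None
```

Requiring a streak instead of a single success guards against the region that answers one request and then falls over while its pools and caches are still warming.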


🛑 Challenge 5: Blast Radius Without Boundaries

Chaos engineering gives automation authority over production infrastructure.

Without guardrails, a controlled exercise can become an actual incident.

Every exercise should have:

  • hard limits
  • abort conditions
  • rollback procedures
  • a designated kill switch

Maturity bar: Define the boundaries before the exercise starts.
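
These boundaries can be written down as data and checked on every loop of the exercise. The sketch below uses hypothetical thresholds (a time limit and an error-rate ceiling) and a kill switch modeled as a callable, which could read a feature flag or a file in practice.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Guardrails:
    max_duration_s: float            # hard limit on exercise length
    max_error_rate: float            # abort condition, e.g. 0.05 for 5%
    kill_switch: Callable[[], bool]  # True once a human has pulled the switch


def abort_reason(guardrails: Guardrails, elapsed_s: float,
                 error_rate: float) -> Optional[str]:
    """Return why the exercise must stop, or None if it may continue."""
    if guardrails.kill_switch():
        return "kill switch engaged"
    if elapsed_s > guardrails.max_duration_s:
        return "hard time limit exceeded"
    if error_rate > guardrails.max_error_rate:
        return "error rate above abort threshold"
    return None


# Usage: 15-minute limit, 5% error ceiling, switch not pulled.
limits = Guardrails(max_duration_s=900, max_error_rate=0.05,
                    kill_switch=lambda: False)
reason = abort_reason(limits, elapsed_s=120, error_rate=0.08)
```

Any non-None reason should hand control to the rollback procedure. The kill switch is checked first so a human decision always wins over the automated thresholds.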


💓 Challenge 6: The Traffic Shift Window Is Invisible Without a Heartbeat

During failover:

  • the primary region is degrading
  • the secondary region is coming online
  • DNS is propagating

The question isn't:

"Did failover work?"

The real question is:

"Did users experience downtime?"

To answer that, we run an external synthetic heartbeat every 30 seconds through the same public endpoint users access.

The heartbeat records:

  • success/failure
  • latency
  • timestamps

After every exercise we have evidence.

Not:

"We think there was no downtime."

But:

"Every heartbeat succeeded during the entire failover window."

The heartbeat doesn't prevent outages.

It provides evidence of their absence, down to the probe interval.

Maturity bar: Build an independent synthetic monitor before attempting production chaos.
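
A minimal heartbeat sketch using only the Python standard library. The `Beat` record and function names are illustrative, and the monitor is assumed to run somewhere outside both regions so it sees what users see.

```python
import time
import urllib.request
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Beat:
    timestamp: float   # Unix time when the probe started
    ok: bool           # True if the probe succeeded
    latency_s: float   # wall-clock duration of the probe


def http_probe(url: str, timeout_s: float = 5.0) -> bool:
    """Single GET against the public endpoint; any error counts as failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except Exception:
        return False


def run_heartbeat(probe: Callable[[], bool], beats: int,
                  interval_s: float = 30.0,
                  sleep: Callable[[float], None] = time.sleep) -> List[Beat]:
    """Probe `beats` times, pausing `interval_s` between probes."""
    log = []
    for i in range(beats):
        started = time.time()
        t0 = time.perf_counter()
        ok = probe()
        log.append(Beat(started, ok, time.perf_counter() - t0))
        if i < beats - 1:
            sleep(interval_s)
    return log


# A real run might look like this (the URL is a placeholder):
# log = run_heartbeat(lambda: http_probe("https://example.invalid/health"), beats=40)
# downtime_observed = any(not beat.ok for beat in log)
```

Because the pause happens after each probe, the real cadence is the interval plus probe latency. That is fine for evidence gathering, but an outage shorter than the interval can still slip between two beats.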


✅ The Maturity Stack Before Production Chaos

Before running chaos engineering in production, you should have:

  • Full observability
  • Deep readiness-based health checks
  • External heartbeat monitoring
  • Tested recovery automation
  • Practiced failback procedures
  • Defined blast-radius controls
  • Team readiness and communication plans
  • Successful non-production validation

🎯 Final Thought

Chaos engineering in production is not about proving you're brave.

It's about proving your recovery process works.

The architecture matters.

The automation matters.

But confidence comes from continuous validation.

The heartbeat monitor proves users weren't impacted.

The recovery tests prove automation still works.

Together they turn disaster recovery from a theoretical capability into a continuously validated one.

That's not chaos.

That's engineering.