Jun 5, 2026

🔥 Chaos Engineering in Production: The Challenges Nobody Talks About

Part 2 of our multi-region failover series

Last week I wrote about running monthly chaos engineering exercises in production to validate our disaster recovery architecture.

The obvious follow-up:

"How do you get there safely?"

That's the right question.

Because chaos engineering in production is not where you start.

It's where you arrive — after building the operational maturity that makes failure survivable.

Here are the challenges you need to solve before you get there.


🔍 Challenge 1: Observability Gaps Will Expose You

Before triggering your first failover exercise, ask yourself:

If failover started right now, could you tell exactly what was happening?

Not after the fact.

In real time.

Can you see:

  • traffic shifting between regions?
  • application health in both regions?
  • user impact during the transition?

Most dashboards are built for normal operations. Failover creates a completely different signal spanning infrastructure, DNS, networking, and applications simultaneously.

If you can't observe the recovery process, you can't safely test it.

Maturity bar: Build dashboards specifically for failover scenarios, not just general infrastructure health.


❤️ Challenge 2: Health Checks That Lie

Does your health check confirm the application is ready to serve traffic — or just that the process is running?

Applications can report healthy while:

  • connection pools are still initializing
  • caches are cold
  • downstream services are unavailable

A failover mechanism that trusts shallow health checks can route traffic to a region that isn't actually ready.

Maturity bar: Validate readiness, not existence.

A lying health check is worse than no health check at all.
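
As a rough illustration of "readiness, not existence", a readiness endpoint can aggregate one check per dependency and report ready only when all of them pass. This is a minimal sketch: the check names and the `readiness` helper are hypothetical, and the lambdas stand in for real probes of the connection pool, cache, and downstream services.

```python
from typing import Callable, Dict, Tuple

Check = Callable[[], bool]


def readiness(checks: Dict[str, Check]) -> Tuple[bool, Dict[str, bool]]:
    """Run every dependency check; report ready only if all of them pass."""
    results: Dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises is treated as "not ready".
            results[name] = False
    return all(results.values()), results


# Hypothetical checks; real ones would query the pool, cache, and dependencies.
ready, detail = readiness({
    "db_pool_initialized": lambda: True,
    "cache_warm": lambda: False,
    "downstream_reachable": lambda: True,
})
print(ready, detail)
```

Returning the per-check detail alongside the overall verdict makes it easier to see why a region is not ready during an exercise.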


🌐 Challenge 3: DNS TTL Is Not Your Friend

Route 53 failover is not a switch.

Traffic does not instantly move from one region to another.

DNS caches.
Clients cache.
Resolvers cache.

Even with aggressive TTLs, some traffic continues flowing to the original region during transition.

Maturity bar: Understand your propagation behavior before running production exercises.

What looks like a failure may simply be DNS doing exactly what DNS does.
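
A back-of-the-envelope estimate helps set expectations here. The sketch below assumes detection takes roughly the health check interval times the failure threshold, and that well-behaved resolvers then hold the old answer for up to one TTL. The numbers (30-second interval, threshold of 3, 60-second TTL) are illustrative assumptions, not values from our setup, and clients that ignore TTLs or keep connections open can take longer.

```python
def worst_case_shift_seconds(check_interval_s: int,
                             failure_threshold: int,
                             ttl_s: int) -> int:
    """Rough upper bound for TTL-respecting resolvers: detection time plus one TTL."""
    detection_s = check_interval_s * failure_threshold
    return detection_s + ttl_s


# Illustrative values only.
print(worst_case_shift_seconds(check_interval_s=30, failure_threshold=3, ttl_s=60))  # 150
```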


🚀 Challenge 4: Cold Start Reality vs. Cold Start Assumption

Recovery timelines often look great on architecture diagrams.

What they rarely account for:

  • connection pool initialization
  • cache warmup
  • downstream dependency stabilization
  • application readiness under real load

Container startup is only the beginning.

Maturity bar: Measure end-to-end recovery under load and let observed behavior define your recovery objectives.
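
One way to make "container startup is only the beginning" visible is to timestamp each recovery phase and look at where the time actually goes. This is a small sketch: the phase names and the timeline values are made up for illustration and are not measurements from our exercises.

```python
from typing import Dict, List, Tuple


def phase_durations(events: List[Tuple[str, float]]) -> Dict[str, float]:
    """Given ordered (phase, timestamp_seconds) pairs, return seconds spent reaching each phase."""
    durations: Dict[str, float] = {}
    for (_, prev_ts), (name, ts) in zip(events, events[1:]):
        durations[name] = ts - prev_ts
    return durations


# Hypothetical timeline, in seconds since the exercise started.
timeline = [
    ("failover_triggered", 0.0),
    ("containers_running", 90.0),
    ("readiness_passing", 210.0),
    ("first_user_success", 380.0),
]
print(phase_durations(timeline))
print("end-to-end seconds:", timeline[-1][1] - timeline[0][1])
```

In this made-up timeline, containers are running after 90 seconds, but users are not served successfully until 380 seconds in.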


🛑 Challenge 5: Blast Radius Without Boundaries

Chaos engineering gives automation authority over production infrastructure.

Without guardrails, a controlled exercise can become an actual incident.

Every exercise should have:

  • hard limits
  • abort conditions
  • rollback procedures
  • a designated kill switch

Maturity bar: Define the boundaries before the exercise starts.
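
The four guardrails above can be encoded as a single check that the exercise automation evaluates on every step. The sketch below is one possible shape, not our actual tooling: the `Guardrails` fields, thresholds, and reason strings are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Guardrails:
    max_error_rate: float    # abort if the user-facing error rate exceeds this
    max_duration_s: float    # hard limit on how long the exercise may run
    min_healthy_tasks: int   # never let the primary drop below this


def abort_reason(g: Guardrails, *, error_rate: float, elapsed_s: float,
                 healthy_tasks: int, kill_switch: bool) -> Optional[str]:
    """Return why the exercise must stop, or None if it may continue."""
    if kill_switch:
        return "kill switch engaged"
    if error_rate > g.max_error_rate:
        return "error rate above limit"
    if healthy_tasks < g.min_healthy_tasks:
        return "healthy capacity below floor"
    if elapsed_s > g.max_duration_s:
        return "exercise exceeded time limit"
    return None


limits = Guardrails(max_error_rate=0.01, max_duration_s=3600, min_healthy_tasks=2)
print(abort_reason(limits, error_rate=0.05, elapsed_s=120,
                   healthy_tasks=3, kill_switch=False))
```

Checking the kill switch first means a human decision always wins over any metric-based condition.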


💓 Challenge 6: The Traffic Shift Window Is Invisible Without a Heartbeat

During failover:

  • the primary region is degrading
  • the secondary region is coming online
  • DNS is propagating

The question isn't:

"Did failover work?"

The real question is:

"Did users experience downtime?"

To answer that, we run an external synthetic heartbeat every 30 seconds through the same public endpoint users access.

The heartbeat records:

  • success/failure
  • latency
  • timestamps

After every exercise we have evidence.

Not:

"We think there was no downtime."

But:

"Every heartbeat succeeded during the entire failover window."

The heartbeat doesn't prevent outages.

It gives you evidence of their absence, down to the 30-second probe interval.

Maturity bar: Build an independent synthetic monitor before attempting production chaos.
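
A heartbeat like the one described above can be very small. The following is a minimal sketch using only the Python standard library, not our production monitor: `http_probe`, `run_heartbeat`, and the `Beat` record are hypothetical names, and the demo uses a fake probe so that it runs without network access.

```python
import time
import urllib.request
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Beat:
    timestamp: float
    ok: bool
    latency_s: float


def http_probe(url: str, timeout_s: float = 5.0) -> bool:
    """True if the endpoint answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except Exception:
        return False


def run_heartbeat(probe: Callable[[], bool], beats: int, interval_s: float,
                  clock: Callable[[], float] = time.time,
                  sleep: Callable[[float], None] = time.sleep) -> List[Beat]:
    """Call the probe `beats` times, recording success, latency, and timestamp."""
    log: List[Beat] = []
    for i in range(beats):
        start = clock()
        ok = probe()
        log.append(Beat(timestamp=start, ok=ok, latency_s=clock() - start))
        if i < beats - 1:
            sleep(interval_s)
    return log


def failures(log: List[Beat]) -> List[Beat]:
    return [beat for beat in log if not beat.ok]


# Demo with a fake probe and no real sleeping.
outcomes = iter([True, True, False, True])
log = run_heartbeat(lambda: next(outcomes), beats=4, interval_s=30,
                    sleep=lambda s: None)
print(len(failures(log)), "failed of", len(log))  # 1 failed of 4
```

In a real run the probe would be something like `lambda: http_probe("https://example.com/health")` (a placeholder URL), executed from outside both regions so the monitor does not fail together with the thing it measures.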


✅ The Maturity Stack Before Production Chaos

Before running chaos engineering in production, you should have:

  • Full observability
  • Deep readiness-based health checks
  • External heartbeat monitoring
  • Tested recovery automation
  • Practiced failback procedures
  • Defined blast-radius controls
  • Team readiness and communication plans
  • Successful non-production validation

🎯 Final Thought

Chaos engineering in production is not about proving you're brave.

It's about proving your recovery process works.

The architecture matters.

The automation matters.

But confidence comes from continuous validation.

The heartbeat monitor proves users weren't impacted.

The recovery tests prove automation still works.

Together they turn disaster recovery from a theoretical capability into a continuously validated one.

That's not chaos.

That's engineering.

May 30, 2026

🌪️ Chaos Engineering for Disaster Recovery: Proving Multi-Region Failover Actually Works

Most disaster recovery architectures are built with a hidden assumption:

The failover process will work when we need it.

The problem is that assumptions don't survive outages.

Infrastructure changes.
Deployments drift.
Permissions break.
Health checks evolve.
Automation silently fails.

A disaster recovery strategy is only as good as the last time it was tested.

That's why we built chaos engineering directly into our multi-region architecture.


The Architecture

Like many organizations, we run a primary AWS region that handles all production traffic.

The secondary region is fully provisioned but runs with zero application tasks during normal operation.

When the primary region becomes unhealthy:

  • CloudWatch detects degradation
  • Lambda initiates recovery
  • The secondary region scales up
  • Health checks begin passing
  • Route 53 shifts traffic

The entire process is automated and completes in roughly 10 minutes.

This isn't designed for instant failover.

It's designed to provide a balance between resilience and cost efficiency.
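
To make the "Lambda initiates recovery" step more concrete, here is a sketch of the scale-up call. It assumes the application tasks run on Amazon ECS, which this post does not actually state, and the cluster and service names are placeholders. The client is passed in so the logic can be tried with a fake; in a real Lambda it would typically be `boto3.client("ecs", region_name=...)`, whose `update_service` call accepts these keyword arguments.

```python
from typing import Any, Dict, List, Protocol, Tuple


class EcsLike(Protocol):
    """The one ECS client method this sketch relies on."""
    def update_service(self, *, cluster: str, service: str,
                       desiredCount: int) -> Dict[str, Any]: ...


def scale_secondary(ecs: EcsLike, cluster: str, service: str,
                    desired_count: int) -> int:
    """Ask the secondary region's service to run `desired_count` tasks."""
    if desired_count < 1:
        raise ValueError("desired_count must be at least 1")
    ecs.update_service(cluster=cluster, service=service,
                       desiredCount=desired_count)
    return desired_count


class FakeEcs:
    """Stand-in client that records calls instead of talking to AWS."""
    def __init__(self) -> None:
        self.calls: List[Tuple[str, str, int]] = []

    def update_service(self, *, cluster: str, service: str,
                       desiredCount: int) -> Dict[str, Any]:
        self.calls.append((cluster, service, desiredCount))
        return {}


fake = FakeEcs()
scale_secondary(fake, cluster="dr-cluster", service="web", desired_count=3)
print(fake.calls)  # [('dr-cluster', 'web', 3)]
```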


The Real Challenge Isn't Failover

The real challenge is confidence.

Most teams test disaster recovery once during implementation and then assume it continues to work forever.

But recovery paths are software.

And software breaks.

The critical question becomes:

How do you know your failover automation still works six months from now?


Enter Chaos Engineering

Once a month we intentionally trigger a failover event in production.

Not a simulation.

A real failover.

We reduce capacity in the primary region and allow the system to respond naturally.

Alarms fire.
Recovery automation executes.
The secondary region activates.
Route 53 redirects traffic.
Production traffic runs from the backup region.

Several hours later we restore the primary region and validate failback behavior.


What Gets Validated

Each exercise validates the entire recovery chain:

✅ CloudWatch alarms

✅ Lambda execution

✅ Auto-scaling behavior

✅ Route 53 failover

✅ Application startup

✅ Service dependencies

✅ Recovery procedures

Instead of testing components individually, we're testing the complete system under real conditions.


The Detail That Prevents Downtime

One implementation detail made these exercises much safer.

During chaos testing, we don't scale the primary region to zero.

Instead, we reduce capacity by a single task.

That leaves enough healthy capacity to continue serving traffic while the secondary region comes online.

As DNS transitions occur, users continue receiving responses.

The recovery path is exercised without creating customer-visible downtime.
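
The "reduce by one task, never to zero" rule is simple enough to encode as a guard in the exercise tooling. This is a hypothetical helper, not our actual script, and the default floor of 1 is an assumption.

```python
def chaos_target(current_tasks: int, floor: int = 1) -> int:
    """Desired task count for the exercise: one fewer, but never below the floor."""
    target = current_tasks - 1
    if target < floor:
        raise ValueError(
            f"refusing to reduce {current_tasks} task(s) below floor of {floor}"
        )
    return target


print(chaos_target(4))  # 3
```

Refusing to proceed, rather than silently clamping, makes it obvious when the primary region is already too small to run the exercise safely.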


Why This Matters

The biggest risk in disaster recovery isn't infrastructure failure.

It's recovery procedures that haven't been tested recently.

A recovery plan sitting in a wiki isn't resilience.

A recovery plan executed successfully every month is.


🎯 Final Thought

Most organizations invest heavily in disaster recovery infrastructure.

Far fewer invest in continuously validating it.

Our multi-region architecture is intentionally cost-optimized, with the secondary region sitting idle most of the time.

But the real value isn't the architecture.

It's the confidence that comes from proving every month that failover still works.

Because in disaster recovery, the question isn't:

"Do we have a failover plan?"

It's:

"When was the last time we proved it actually works?"

May 18, 2026

🚦 Controlling Tool Output: Response Field Projection in Agent Workflows

One of the less obvious performance problems in agentic systems isn’t which tool gets called — it’s how much data comes back from it.

As agent workflows become more sophisticated, context growth can quietly become one of the biggest drivers of:

  • latency
  • token cost
  • reasoning instability

🧠 The Problem: Context Growth Across Cycles

In chained workflows, tool responses accumulate across reasoning cycles:

Cycle 1 → tool response (2,000 tokens)
Cycle 2 → tool response + prior context (4,500 tokens)
Cycle 3 → accumulated context (8,000+ tokens)

Most enterprise APIs are designed for systems integration, not LLM efficiency.

A financial data endpoint may return:

  • dozens of fields per record
  • nested metadata
  • audit attributes
  • internal identifiers
  • unused fields

But the agent may only need two or three fields to answer the user’s question.

When raw responses flow into the model unfiltered:

  • 📈 token usage grows every cycle
  • 🐢 latency increases as context expands
  • ⚠️ field-selection mistakes become more common
  • 🧾 prompt-level filtering is ineffective, because the full payload is already in the context (and already paid for) before the model can act on an instruction to ignore fields

A simple lookup can easily turn into thousands of unnecessary tokens.


🚀 The Fix: Projection at the Tool Layer

Instead of relying on the LLM to discard unnecessary fields after receiving the response, we moved the optimization into the tool layer itself.

We added a response_fields parameter to the HTTP request tool.

The agent specifies exactly which fields it needs before making the request, and the tool filters the response before returning it to the model.

Instead of:

Tool → large raw JSON → LLM → filter + reason + respond

We now use:

Tool → projected response → LLM → respond

The projection supports:

  • arrays and nested objects
  • dot-notation field selection
  • graceful fallback to full responses when projection is unavailable

✅ Minimal payloads
✅ Smaller context
✅ Faster reasoning
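
To show the idea, here is a minimal sketch of dot-notation projection over nested objects and arrays. It is not the actual implementation behind response_fields: the `project` function, its fallback rule (return the response unchanged when no fields are given or the payload is not JSON-like), and the sample payload are assumptions for illustration.

```python
from typing import Any, Dict, Iterable


def _field_tree(fields: Iterable[str]) -> Dict[str, dict]:
    """Turn ["a.b", "a.c"] into {"a": {"b": {}, "c": {}}}."""
    tree: Dict[str, dict] = {}
    for path in fields:
        node = tree
        for part in path.split("."):
            node = node.setdefault(part, {})
    return tree


def _apply(data: Any, tree: Dict[str, dict]) -> Any:
    if not tree:                      # leaf of the field tree: keep the whole value
        return data
    if isinstance(data, list):        # arrays: project every element
        return [_apply(item, tree) for item in data]
    if isinstance(data, dict):        # objects: keep only the requested keys
        return {key: _apply(data[key], sub)
                for key, sub in tree.items() if key in data}
    return data                       # scalar where a nested path was requested


def project(data: Any, fields: Iterable[str]) -> Any:
    """Keep only the requested fields; fall back to the full response otherwise."""
    fields = list(fields or [])
    if not fields or not isinstance(data, (dict, list)):
        return data
    return _apply(data, _field_tree(fields))


# Hypothetical API response.
payload = {
    "accounts": [
        {"id": 1, "balance": {"amount": 10.5, "currency": "USD"}, "audit": {"by": "x"}},
        {"id": 2, "balance": {"amount": 3.0, "currency": "EUR"}, "audit": {"by": "y"}},
    ],
    "meta": {"page": 1},
}
print(project(payload, ["accounts.id", "accounts.balance.amount"]))
# {'accounts': [{'id': 1, 'balance': {'amount': 10.5}}, {'id': 2, 'balance': {'amount': 3.0}}]}
```

Building a field tree first keeps the projected output in the same shape as the original response, so the model sees familiar nesting with less data.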


🧩 Closing the Loop

Projection only works if the agent knows which fields to request.

That knowledge can come from:

  • system prompts
  • tool metadata
  • endpoint descriptions
  • execution guidance
  • field-level documentation

The important part is that the agent identifies required fields before making the tool call instead of reasoning over a large payload afterward.

This shifts optimization from post-processing responses to controlling responses at the source.

🔄 Execution Pattern

Execution guidance
→ agent identifies required fields
→ tool-level response projection
→ minimal structured output into LLM context

Rather than continuously expanding context across cycles, the agent keeps context compact and purpose-driven.


📊 Results

Queries that previously returned thousands of tokens per cycle now return only a fraction of that.

For multi-step workflows, this:

  • 💰 reduces token consumption
  • ⚡ lowers latency
  • 📏 stabilizes context growth
  • 🎯 improves reasoning reliability

The pattern is similar to GraphQL:
the client declares what it needs, and only that data comes back.

In this case, the “client” is the LLM itself.


🎯 Final Thought

Efficient agents don’t just call the right tools.

They also control what data comes back from them.

In many production systems, optimizing tool output has a larger impact on performance and reliability than changing the model itself.

May 14, 2026

Reviewing an Agent: From “Works” to “Works Efficiently”

I recently reviewed an agent that was functionally correct: it answered queries, used tools properly, and produced accurate results.

👉 But it wasn’t efficient.

This is a quick summary of what was happening and what improved.

🧠 Initial Behavior: Reactive Execution

The agent followed a step-by-step execution loop:

Cycle 1 → resolve date via tool
Cycle 2 → fetch primary data
Cycle 3 → fetch additional data
Cycle 4 → fetch metadata
Cycle 5 → final response

What was happening?

  • No upfront planning
  • Data discovered incrementally
  • Each missing piece triggered another tool call
  • All operations executed sequentially

The agent was technically correct, but operationally inefficient.


⏱️ Why a Time Tool Existed

The agent handled queries like:

“first 10 days of last month”

Since LLMs don’t reliably know the current date, a time tool was added to:

  • ensure correct date calculations
  • avoid inconsistent outputs

👉 It solved correctness
👉 But added an extra execution cycle every time


🚨 Core Issue

The agent was reactive instead of planned.

Execution looked like:

Do something → discover missing data → do more → repeat

Instead of:

Understand requirements → plan execution → execute efficiently

🚀 Improvements

1. Provide Current Date Directly

Removed dependency on the time tool by injecting the current date into context.

✅ Eliminated one full execution cycle.
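
Injecting the date can be as simple as appending it to the system prompt when the request is built. This is a sketch, assuming the agent framework lets you set the system prompt per request; `with_current_date` is a hypothetical helper name.

```python
from datetime import date
from typing import Optional


def with_current_date(system_prompt: str, today: Optional[date] = None) -> str:
    """Append today's date (ISO format) so the model never has to ask a tool for it."""
    today = today or date.today()
    return f"{system_prompt}\n\nCurrent date: {today.isoformat()}"


print(with_current_date("You are a reporting assistant.", date(2026, 5, 14)))
```

With the date in context, a query like “first 10 days of last month” can be resolved by the model in the same cycle that plans the data fetches.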


2. Add Upfront Planning

The agent now:

  • identifies required data first
  • plans execution before calling tools
  • understands dependencies early

3. Parallelize Independent Calls

Independent data fetches now execute together instead of sequentially.

This reduced unnecessary waiting between cycles.


4. Add Dependency Awareness

Execution flow became smarter:

  • independent data → parallel execution
  • dependent data → delayed until required
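
The parallel-then-dependent pattern can be sketched with asyncio. The `fetch` coroutine below is a stand-in for real tool calls, and the data names are hypothetical.

```python
import asyncio
from typing import Dict


async def fetch(name: str, delay_s: float = 0.01) -> str:
    """Stand-in for a tool call; sleeps briefly instead of doing I/O."""
    await asyncio.sleep(delay_s)
    return f"{name}-data"


async def run_plan() -> Dict[str, str]:
    # Cycle 1: independent fetches run together.
    primary, additional = await asyncio.gather(
        fetch("primary"), fetch("additional")
    )
    # Cycle 2: this call needs the primary result, so it waits for it.
    metadata = await fetch(f"metadata-for-{primary}")
    return {"primary": primary, "additional": additional, "metadata": metadata}


result = asyncio.run(run_plan())
print(result)
```

`asyncio.gather` returns results in the order the coroutines were passed, so the unpacking into `primary` and `additional` is safe regardless of which call finishes first.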

✅ Final Execution Patterns

No dependency

Cycle 1 → fetch all data (parallel)
Cycle 2 → final response

With dependency

Cycle 1 → fetch base data (parallel)
Cycle 2 → fetch dependent data
Cycle 3 → final response

🎯 Final Thought

Many agents already “work.”

The bigger challenge is making them:

  • efficient
  • predictable
  • low latency
  • cost aware

In agent systems, execution planning often matters as much as model quality.

Apr 12, 2026

Beyond Last-K Turns: Building Memory That Actually Thinks

Every multi-turn AI agent needs memory. The simplest implementation is obvious: load the last N turns of conversation before each call, then append the new turn after.
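
For reference, the simple baseline described above fits in a few lines. This is a sketch in which each stored entry is one message; if you count a user message plus its reply as one turn, the window size would need adjusting. `LastKMemory` is a hypothetical name.

```python
from collections import deque
from typing import Deque, Dict, List


class LastKMemory:
    """Keeps only the most recent k messages."""

    def __init__(self, k: int) -> None:
        self._turns: Deque[Dict[str, str]] = deque(maxlen=k)

    def load(self) -> List[Dict[str, str]]:
        """Messages to prepend to the next model call, oldest first."""
        return list(self._turns)

    def append(self, role: str, content: str) -> None:
        # When the deque is full, the oldest message is dropped automatically.
        self._turns.append({"role": role, "content": content})


memory = LastKMemory(k=2)
memory.append("user", "Hi")
memory.append("assistant", "Hello!")
memory.append("user", "What did I just say?")
print(memory.load())
```

Here the first message ("Hi") has already fallen out of the window by the time the third arrives, which is exactly the limitation of this approach.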

Aug 30, 2025

Building Smarter AI Agents

When building production-ready AI agents, the model is just one part of the story. Equally important are the evaluation frameworks, logging/observability tools, lightweight client libraries, and prompt orchestration frameworks.

Here are the key packages I use, what they do, and other similar options in the ecosystem.


RAGAS (Retrieval Augmented Generation Assessment)

  • What it is:
    Ragas is a framework to evaluate retrieval-augmented generation (RAG) systems. It provides automatic metrics for faithfulness, answer relevance, retrieval precision/recall, and more.
  • Use case:
    When testing RAG pipelines, I can automatically score how well my system retrieves documents and whether the model’s answers stick to the evidence.
  • Similar libraries:
    DeepEval – generic LLM evaluation toolkit.
    TruLens – for evaluating and monitoring LLM apps (esp. RAG).
    Evalchemy – simpler eval DSL for agents and RAG pipelines.


Langfuse

  • What it is:
    Langfuse is an observability and logging platform for LLM applications. It captures traces, spans, prompts, model outputs, tool calls, and lets you replay & analyze runs.
  • Use case:
    I use it to debug agent workflows, track cost/latency, and visualize multi-tool execution. It’s like “OpenTelemetry for LLMs.”

LiteLLM

  • What it is:
    LiteLLM is a unified API wrapper for more than 100 LLM providers (OpenAI, Anthropic, Bedrock, Ollama, Azure, etc.).
  • Use case:
    Lets me switch between models (Claude, GPT-4, Llama, etc.) without rewriting my code. Also supports rate-limiting, retries, logging, and cost tracking.


Aug 7, 2025

Deep Dive into Strands Agent Trace

When debugging or understanding LLM agents, such as those built with Strands, tracing is critical. Recently, I ran a simple prompt, “greet rahul using tool”, and captured the trace emitted by Strands Agent using OpenTelemetry. Even though the use case was simple, the trace revealed the elegance of how the agent plans, executes, and finalizes its response in structured steps.

Here’s a breakdown of the key spans and why they exist, with a focus on the two chat spans and two execute_event_loop_cycle spans, and how they tie together.

What Happened?

The user instruction was: "greet rahul using tool"

Strands Agent, powered by Claude 3 Sonnet, interpreted this and:

  • Planned a toolUse of the greet tool with input "rahul".
  • Executed the tool.
  • Used the result to send back a human-friendly message.

This simple interaction resulted in a two-turn conversation, captured in two event loop cycles.

Trace Summary (at a glance)

| Span Name | Duration (s) | Key Events | Role |
|---|---|---|---|
| execute_event_loop_cycle #1 | 2.28 | toolUse, toolResult | Plans and executes tool |
| chat #1 | 2.15 | LLM decides: use greet on "rahul" | Claude generates plan |
| execute_tool greet | 0.13 | Input: "rahul" → Output: "Hello, rahul" | Tool executes |
| execute_event_loop_cycle #2 | 3.06 | Final reply from model | Uses tool result to complete |
| chat #2 | 3.06 | LLM returns final message | Claude wraps up |
| invoke_agent | 5.33 | All agent activity spans | Wraps both cycles |
| agent.run | 5.33 | Full request lifecycle | Top-level root span |


agent.run (INTERNAL)
└── invoke_agent "Strands Agents" (CLIENT)
    ├── execute_event_loop_cycle (INTERNAL)
    │   ├── chat (LLM planning) (CLIENT)
    │   └── execute_tool greet (INTERNAL)
    └── execute_event_loop_cycle (INTERNAL)
        └── chat (LLM response) (CLIENT)

Why Two execute_event_loop_cycle?

Strands Agent works in planning cycles — each one wraps:

  • Observation of the current context (user input, tool results)
  • A model call (chat)
  • Optional tool execution (execute_tool)
  • A choice of whether to continue, end, or act again

Cycle 1:

  • Interprets "greet rahul using tool"
  • Decides to call greet({ name: "rahul" })
  • Tool responds with "Hello, rahul"

Cycle 2:

  • Receives the tool result
  • Generates a final assistant message

Why Two chat Spans?

Each chat span is a model call. Here’s how they differ:

chat #1:

  • Input:
    • Raw user message
    • Full conversation history: previous user inputs, assistant replies, tool calls (toolUse), and tool results (toolResult) are automatically replayed in the context window so the model has memory of the ongoing session.
    • Tool schemas: Strands injects all available tools (from gen_ai.agent.tools) into the prompt as structured JSON function specs. Even if only one tool is used, the model sees them all so it can reason about the best fit. The spec for the greet tool:

      {
        "name": "greet",
        "description": "Greets a person by name.",
        "parameters": {
          "type": "object",
          "properties": {
            "name": { "type": "string", "description": "The name of the person to greet." }
          },
          "required": ["name"]
        }
      }

  • Decision: The model (e.g. Claude or GPT) decides which tool to use based on the prompt and context. In this case, it chooses the greet tool.
  • Output: Tool plan (toolUseId = tooluse_jN8iuxwKRC6eji43nu64OQ):

      [
        {
          "text": "I'll greet Rahul using the greet tool right away."
        },
        {
          "toolUse": {
            "toolUseId": "tooluse_jN8iuxwKRC6eji43nu64OQ",
            "name": "greet",
            "input": {
              "name": "rahul"
            }
          }
        }
      ]

chat #2:

  • Input: Includes the toolResult as context
  • Output: Final message to the user

This separation is deliberate: tool planning and result-based reply generation are split for clarity, control, and extensibility.

Final Thoughts

This trace, from a seemingly trivial agent command, illustrates the powerful architecture of Strands Agent:

  • Planning is explicit and looped (via execute_event_loop_cycle)
  • Decisions and actions are clearly separated (chat, execute_tool)
  • Full observability is built-in with OpenTelemetry

If you're designing agents or trying to debug them, traces like this are your best lens into what’s really going on — and how LLMs, planning logic, and tools interact.