May 8, 2026

From Swagger Overload to a Single Capable Agent

My first multi-agent architecture looked clean on paper.

I had specialized agents, each responsible for a specific domain and equipped with the full Swagger specification for the APIs it owned. An orchestrator sat on top, routing questions between agents and coordinating responses.

It seemed logical. It didn’t scale.

What Went Wrong

Swagger specifications are massive.

A single service can expose 40+ endpoints, most of which an agent will never use. Feeding the full spec into an LLM’s context created several problems at once:

  • Massive token consumption
  • Ambiguity between similar endpoints
  • Increased reasoning complexity
  • Operational guidance buried under schema noise

The architecture was technically sophisticated, but operationally fragile.

The Capability Registry

We replaced Swagger injection with a capability registry.

Instead of giving agents entire API specs, we indexed individual callable capabilities in a vector store. Each capability represented one executable action:

  • an HTTP endpoint
  • a SQL query
  • a native tool
  • or another callable operation

When the agent needed to act, it performed a semantic lookup and retrieved only the most relevant capability.

Each result contained just enough information to execute:

  • tool_name
  • base_url
  • path
  • parameters
  • agent_notes

The agent_notes field became the operational brain of the system — guidance on when to use something, edge cases, and what assumptions not to make.

The LLM no longer had to reason across dozens of irrelevant endpoints. It only saw the capability it needed.
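A minimal sketch of what such a registry can look like, with naive word-overlap scoring standing in for the vector-store similarity search. The capability names, endpoints, and notes below are invented for illustration, not taken from the actual system:

```python
from dataclasses import dataclass, field

@dataclass
class Capability:
    """One executable action indexed in the registry."""
    tool_name: str
    description: str              # embedded for semantic retrieval in the real system
    base_url: str = ""
    path: str = ""
    parameters: dict = field(default_factory=dict)
    agent_notes: str = ""         # operational guidance: when to use it, edge cases

REGISTRY = [
    Capability(
        tool_name="get_order_status",
        description="look up the current status of a customer order",
        base_url="https://orders.example.internal",
        path="/v1/orders/{order_id}/status",
        parameters={"order_id": "string"},
        agent_notes="Status questions only; never assume the order exists.",
    ),
    Capability(
        tool_name="cancel_order",
        description="cancel a customer order that has not shipped",
        base_url="https://orders.example.internal",
        path="/v1/orders/{order_id}/cancel",
        parameters={"order_id": "string"},
        agent_notes="Only valid before shipment; confirm intent with the user first.",
    ),
]

def lookup(query: str) -> Capability:
    # Stand-in for the vector-store query: score by word overlap.
    words = set(query.lower().split())
    return max(REGISTRY, key=lambda c: len(words & set(c.description.split())))
```

The point of the sketch is the shape of the result, not the scoring: a lookup returns exactly one capability record, and only that record reaches the model's context.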

Any Backend, One Interface

Because the registry abstracted invocation types, every backend looked identical during discovery.

HTTP capabilities returned:

  • base_url
  • path

SQL capabilities returned:

  • database location
  • query template

Native tools returned:

  • tool_name

One discovery pattern. One execution model.
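A sketch of the unified execution model this enables: a single `execute` function that branches on whichever fields the registry returned. The handlers here are illustrative stubs that format a description of the call, not real HTTP or SQL clients:

```python
def execute(capability: dict, args: dict) -> str:
    # One execution model: dispatch on the fields the registry returned.
    if "path" in capability:                       # HTTP capability
        return f"HTTP {capability['base_url']}{capability['path'].format(**args)}"
    if "query_template" in capability:             # SQL capability
        return f"SQL@{capability['database']}: {capability['query_template'].format(**args)}"
    # Native tool: only a tool_name is needed.
    return f"native:{capability['tool_name']}({args})"
```

Because every backend is reduced to the same record-in, result-out shape, adding a new invocation type means adding one branch here, not a new agent.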

From Many Agents to One

Once capability discovery became dynamic, the need for multiple specialized agents started disappearing.

Previously, specialization existed because each agent required carefully curated prompts containing only its domain knowledge.

With runtime capability lookup, the registry became the knowledge layer.

We collapsed the system into a single agent with access to the full registry.

The result:

  • no routing errors
  • lower latency
  • simpler debugging
  • one deployment
  • one trace stream

The Trade-Off

The capability registry is not free.

It must stay synchronized with live systems. Capabilities need curation. Writing high-quality agent_notes takes discipline.

But that cost already existed — previously paid through wasted tokens, routing failures, and difficult debugging sessions.

The Bigger Lesson

One well-configured agent with the right capability registry turned out to be far more powerful — and dramatically simpler to operate — than a fleet of narrowly scoped agents.

The breakthrough wasn’t adding more agents.

It was reducing what the agent needed to know at any given moment.

Apr 12, 2026

Beyond Last-K Turns: Building Memory That Actually Thinks

Every multi-turn AI agent needs memory. The simplest implementation is obvious: load the last N turns of conversation before each call, then append the new turn after.
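That baseline can be sketched in a few lines; this `LastKMemory` is a hypothetical illustration of the pattern, not the implementation discussed in the post:

```python
from collections import deque

class LastKMemory:
    """Naive sliding-window memory: keep only the most recent k turns."""

    def __init__(self, k: int = 10):
        self.turns = deque(maxlen=k)   # older turns fall off automatically

    def append(self, role: str, content: str) -> None:
        self.turns.append({"role": role, "content": content})

    def context(self):
        # What gets prepended to the next model call.
        return list(self.turns)
```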

LangGraph and Agent Frameworks: Using the Right Tool for the Job

There's a common trap when building AI-powered pipelines: reaching for an agentic framework because the problem feels “intelligent,” even when the solution is fundamentally deterministic. This post walks through a document ingestion system where that mistake shows up—and what the right mental model looks like.


The System: Ingesting Documents at Scale

The pipeline processes documents at scale—loading files from storage, extracting structured metadata via an LLM, enriching that metadata against external systems, and indexing everything into a vector store and document store for downstream retrieval.

The flow looks like this:

Object storage / local filesystem

list_documents

[per document]
load → classify → chunk → embed → extract_metadata → enrich → store → archive

Simple enough on paper. The complexity comes from two questions:

  1. How do you orchestrate deterministic steps cleanly?
  2. Where does the LLM fit in—and how?

The system uses two patterns to answer these: a graph-based workflow engine for orchestration and agent-based execution for LLM-driven tasks. Understanding when to use each is key.


LangGraph: When the Path Is Known

LangGraph is a workflow engine built on top of LangChain. Its core primitive is a directed graph where nodes are Python functions and edges define allowed transitions. State flows through the graph as a typed dictionary.

Here’s a simplified version of the ingestion graph:

from langgraph.graph import END, StateGraph

workflow = StateGraph(dict)

workflow.add_node("load_document", load_document)
workflow.add_node("classify_document", classify_document)
workflow.add_node("chunk_document", chunk_document)
workflow.add_node("embed_chunks", embed_chunks_node)
workflow.add_node("extract_metadata", extract_metadata_node)
workflow.add_node("enrich_metadata", enrich_metadata_node)
workflow.add_node("store_embeddings", store_embeddings_node)
workflow.add_node("store_summary", store_summary_node)
workflow.add_node("archive_document", archive_document)
workflow.add_node("skip_document", skip_document)

workflow.set_entry_point("load_document")
workflow.add_edge("load_document", "classify_document")

workflow.add_conditional_edges(
    "classify_document",
    should_process,
    {"process": "chunk_document", "skip": "skip_document"},
)

workflow.add_edge("chunk_document", "embed_chunks")
workflow.add_edge("embed_chunks", "extract_metadata")
workflow.add_edge("extract_metadata", "enrich_metadata")
workflow.add_edge("enrich_metadata", "store_embeddings")
workflow.add_edge("store_embeddings", "store_summary")
workflow.add_edge("store_summary", "archive_document")
workflow.add_edge("archive_document", END)
workflow.add_edge("skip_document", END)

graph = workflow.compile()

What this gives you:

  • Explicit control flow: Every transition is defined in code.
  • Typed state management: Each node declares inputs and outputs.
  • Deterministic branching: Conditions are pure Python—no LLM needed.
  • Composability: Easy to wrap per-document flows into batch processing.

Mental model: Use LangGraph when you know the answer to “what happens next?”
If the pipeline topology is fixed, a deterministic DAG is the right tool.
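The composability point can be sketched as a thin batch wrapper around the compiled graph. Here `invoke` is passed in as a plain callable standing in for `graph.invoke`, so the sketch runs without LangGraph installed:

```python
def run_batch(invoke, documents):
    # Wrap the per-document flow in batch processing, collecting
    # per-document results and errors without halting the whole run.
    results, errors = [], []
    for doc in documents:
        try:
            results.append(invoke({"path": doc}))
        except Exception as exc:
            errors.append((doc, exc))
    return results, errors
```

Because the per-document graph is deterministic, failures are isolated per document and the batch loop stays trivial.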


Agent Frameworks: When the LLM Decides the Path

Agent frameworks introduce a different execution model: the LLM drives control flow by choosing tools, interpreting results, and deciding what to do next.

The Right Use: Orchestrator with Tools

At query time, an orchestrator agent can route user questions to specialized downstream components, each exposed as a tool.

Example pattern:

def build_tools():
    return [
        make_tool("query_domain_a"),
        make_tool("query_domain_b"),
        make_tool("synthesize_results"),
    ]

At runtime, the LLM decides:

  • Should it call one tool or multiple?
  • Does it need to combine results?
  • Does it need to resolve entities first?

This kind of routing depends on semantic understanding, not deterministic rules.

No static DAG can reliably express this.

Mental model: Use an agent when the path depends on meaning the LLM must interpret.


A Valid Use Case: Enrichment with Tool Interaction

In the enrichment step, an agent can call external systems (e.g., registries or APIs), interpret responses, and resolve ambiguity.

agent = Agent(
    model=model,
    system_prompt=prompt,
    tools=tools,
)

response = agent(prompt)

This is justified when:

  • Tool results may be ambiguous
  • Multiple calls may be needed
  • The LLM must reason about correctness

However, it’s worth monitoring: if it always becomes a single tool call, a simpler pattern may be better.


The Anti-Pattern: Agent as a Thin Wrapper

A common mistake is using an agent for simple, single-step tasks:

agent = Agent(
    model=model,
    system_prompt=prompt,
)

response = agent(chunk)
parsed = parse_json(response)

No tools. No iteration. No decision-making.

This is just a prompt → JSON call with unnecessary overhead.

Problems:

  • Added latency from agent loop setup
  • Repeated overhead for each chunk
  • Fragile parsing logic
  • No strong structure guarantees

The Better Approach: Structured LLM Calls

Use direct structured output instead:

from langchain_core.messages import HumanMessage, SystemMessage
from pydantic import BaseModel

class MySchema(BaseModel):
    # Illustrative schema; define whatever fields your extraction needs.
    title: str
    summary: str

llm = SomeLLM(model="...", temperature=0.2)
chain = llm.with_structured_output(MySchema)

result = chain.invoke([
    SystemMessage(content=system_prompt),
    HumanMessage(content=chunk),
])

Benefits:

  • Strong typing via schema validation
  • No manual parsing
  • Lower latency
  • Simpler execution model

The Decision Framework

Does control flow depend on meaning the LLM must interpret?

├─ NO → Use LangGraph (or plain code)
│       Fixed steps, deterministic branching
│       Examples: ETL, document pipelines
│
└─ YES → Does the LLM need tools or iteration?
         │
         ├─ NO → Use direct structured LLM call
         │       Prompt → structured output
         │       Examples: extraction, classification
         │
         └─ YES → Use an agent
                  Tool selection + reasoning loop
                  Examples: routing, research, disambiguation

When each layer does only its job, the system becomes simpler, faster, and easier to reason about.

Apr 11, 2026

Managing Tool Output: Avoiding Context Explosion in Agent Systems

 

While reviewing and optimizing agent execution, another important issue surfaced:

👉 Tool outputs can silently bloat the context

Even with perfect planning and parallel execution, performance can degrade if the data flowing into the model is too large.


🧠 The Problem: Context Growth Over Cycles

In agent workflows, especially with chaining:

Cycle 1 → tool output  
Cycle 2 → tool output + previous data  
Cycle 3 → tool output + accumulated data  

👉 Context keeps growing with each step


🚨 Why this is a problem

  • Large payloads (nested JSON, unused fields)

  • Duplicate data across steps

  • Irrelevant fields carried forward

Impact

  • Increased token usage

  • Slower LLM response time

  • Higher cost

  • Greater chance of confusion or incorrect field usage


🔍 Root Cause

Tools typically return:

  • full API responses

  • deeply nested structures

  • more data than required

The LLM then:

  • has to sift through everything

  • often carries forward unnecessary data


🚀 Improvements

1. Let the LLM discard unnecessary data (lightweight fix)

Instruct the model to:

  • extract only required fields

  • ignore irrelevant data

👉 Helps, but not always reliable for large payloads


2. Add intelligence at the tool layer (stronger fix)

Instead of returning raw responses:

  • Return only relevant fields

  • Flatten nested structures

  • Provide clean, minimal data

👉 Similar to how GraphQL works:

  • client specifies what it needs

  • response includes only that


✅ Target Pattern

Tool → minimal structured output → LLM → format response

Instead of:

Tool → large raw JSON → LLM → filter + format
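The GraphQL-style idea at the tool layer can be sketched as a field selector: the caller names the dotted paths it needs, and the tool returns a flat dict containing only those values. The payload shape and field paths below are invented for illustration:

```python
def select_fields(payload: dict, fields: list) -> dict:
    """Return only the requested dotted paths, flattened into one level."""
    out = {}
    for path in fields:
        node = payload
        for key in path.split("."):   # walk the nested structure
            node = node[key]
        out[path] = node
    return out
```

A tool wrapped this way hands the LLM a handful of flat keys instead of a deeply nested response, so nothing irrelevant gets carried forward into the next cycle.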

🎯 Final Thought

Efficient agents don’t just call the right tools —
they also control what data comes back from them.


From Reactive Chaos to Planned Parallelism: Optimizing a Bedrock Agent

Reviewing a Bedrock Agent: From “Works” to “Works Efficiently”

I recently reviewed a Bedrock agent that was functionally correct — it answered queries, used tools properly, and produced accurate results.

👉 But it wasn’t efficient.

This is a quick summary of what was happening and what improved.


🧠 Initial Behavior: Reactive Execution

The agent followed a step-by-step loop:

Cycle 1 → resolve date via tool  
Cycle 2 → fetch primary data  
Cycle 3 → fetch additional data  
Cycle 4 → fetch metadata  
Cycle 5 → final response  

What was happening?

  • No upfront planning

  • Data discovered incrementally

  • Each missing piece triggered another call

  • All operations were sequential


⏱️ Why a time tool existed

The agent handled queries like:

“first 10 days of last month”

Since LLMs don’t reliably know the current date, a time tool was added to:

  • ensure correct date calculations

  • avoid inconsistent outputs

👉 It solved correctness
👉 But added an extra cycle every time


🚨 Core Issue

The agent was reactive instead of planned

Do something → realize missing data → do more → repeat

Instead of:

Understand everything → execute once

🚀 Improvements

1. Provide current date directly

  • Removed dependency on time tool

  • Eliminated one full cycle


2. Upfront planning

The agent now:

  • identifies all required data

  • plans execution before acting


3. Parallel execution

Independent data is now fetched together instead of sequentially


4. Dependency awareness

  • Independent data → parallel

  • Dependent data → separate step only when required


✅ Final Execution Patterns

No dependency

Cycle 1 → fetch all data (parallel)  
Cycle 2 → final response  

With dependency

Cycle 1 → fetch base data (parallel)  
Cycle 2 → fetch dependent data  
Cycle 3 → final response  
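These patterns map naturally onto `asyncio.gather`: independent fetches run together, and the dependent fetch waits only on the result it actually needs. A minimal sketch with stubbed tool calls (the fetch names are invented):

```python
import asyncio

async def fetch(name: str) -> dict:
    # Stub for a tool call; a real tool would hit an API here.
    await asyncio.sleep(0)
    return {"source": name}

async def run_query():
    # Cycle 1: independent data fetched in parallel, not one per cycle.
    primary, additional, metadata = await asyncio.gather(
        fetch("primary"), fetch("additional"), fetch("metadata")
    )
    # Cycle 2: dependent data, a separate step only because it needs the base result.
    dependent = await fetch(f"details-for-{primary['source']}")
    return [primary, additional, metadata, dependent]

results = asyncio.run(run_query())
```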

Aug 30, 2025

Building Smarter AI Agents

When building production-ready AI agents, the model is just one part of the story. Equally important are the evaluation frameworks, logging/observability tools, lightweight client libraries, and prompt orchestration frameworks.

Here are the key packages I use, what they do, and other similar options in the ecosystem.


RAGAS (Retrieval Augmented Generation Assessment)

  • What it is:
    Ragas is a framework to evaluate retrieval-augmented generation (RAG) systems. It provides automatic metrics for faithfulness, answer relevance, retrieval precision/recall, and more.
  • Use case:
    When testing RAG pipelines, I can automatically score how well my system retrieves documents and whether the model’s answers stick to the evidence.
  • Similar libraries:
    DeepEval – generic LLM evaluation toolkit.
    TruLens – for evaluating and monitoring LLM apps (esp. RAG).
    Evalchemy – simpler eval DSL for agents and RAG pipelines.


Langfuse

  • What it is:
    Langfuse is an observability and logging platform for LLM applications. It captures traces, spans, prompts, model outputs, tool calls, and lets you replay & analyze runs.
  • Use case:
    I use it to debug agent workflows, track cost/latency, and visualize multi-tool execution. It’s like “OpenTelemetry for LLMs.”

LiteLLM

  • What it is:
    LiteLLM is a unified API wrapper for >100 LLM providers (OpenAI, Anthropic, Bedrock, Ollama, Azure, etc.). 
  • Use case:
    Lets me switch between models (Claude, GPT-4, Llama, etc.) without rewriting my code. Also supports rate-limiting, retries, logging, and cost tracking.


Aug 7, 2025

Deep Dive into Strands Agent Trace

When debugging or understanding LLM agents like Strands, tracing is critical. Recently, I ran a simple prompt — “greet rahul using tool” — and captured the trace emitted by Strands Agent using OpenTelemetry. Even though the use case was simple, the trace revealed the elegance of how the agent plans, executes, and finalizes its response in structured steps.

Here’s a breakdown of the key spans and why they exist, with a focus on the two chat spans and two execute_event_loop_cycle spans, and how they tie together.

What Happened?

The user instruction was: "greet rahul using tool"

Strands Agent, powered by Claude 3 Sonnet, interpreted this and:

  • Planned a toolUse of the greet tool with input "rahul".
  • Executed the tool.
  • Used the result to send back a human-friendly message.

This simple interaction resulted in a two-turn conversation, captured in two event loop cycles.
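For reference, the traced `greet` tool is just a one-parameter function. In Strands it would additionally carry the SDK's tool decorator; that is omitted here so the sketch stays dependency-free:

```python
def greet(name: str) -> str:
    # Matches the traced toolResult: input "rahul" produces "Hello, rahul".
    return f"Hello, {name}"
```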

Trace Summary (at a glance)

Span Name                    Duration (s)   Key Events                                Role
execute_event_loop_cycle #1  2.28           toolUse, toolResult                       Plans and executes tool
chat #1                      2.15           LLM decides: use greet on "rahul"         Claude generates plan
execute_tool greet           0.13           Input: "rahul" → Output: "Hello, rahul"   Tool executes
execute_event_loop_cycle #2  3.06           Final reply from model                    Uses tool result to complete
chat #2                      3.06           LLM returns final message                 Claude wraps up
invoke_agent                 5.33           All agent activity spans                  Wraps both cycles
agent.run                    5.33           Full request lifecycle                    Top-level root span


agent.run (INTERNAL)
└── invoke_agent "Strands Agents" (CLIENT)
    ├── execute_event_loop_cycle (INTERNAL)
    │   ├── chat (LLM planning) (CLIENT)
    │   └── execute_tool greet (INTERNAL)
    └── execute_event_loop_cycle (INTERNAL)
        └── chat (LLM response) (CLIENT)

Why Two execute_event_loop_cycle?

Strands Agent works in planning cycles — each one wraps:

  • Observation of the current context (user input, tool results)
  • A model call (chat)
  • Optional tool execution (execute_tool)
  • A choice of whether to continue, end, or act again

Cycle 1:

  • Interprets "greet rahul using tool"
  • Decides to call greet({ name: "rahul" })
  • Tool responds with "Hello, rahul"

Cycle 2:

  • Receives the tool result
  • Generates a final assistant message from the result

Why Two chat Spans?

Each chat span is a model call. Here’s how they differ:

chat #1:

  • Input: 
    • Raw user message
    • Full conversation history: Previous user inputs, assistant replies, tool calls (toolUse) and tool results (toolResult) are automatically replayed in the context window so the model has memory of the ongoing session.
    • Tool schemas: Strands injects all available tools (from gen_ai.agent.tools) into the prompt as structured JSON function specs. Even if only one tool is used, the model sees them all so it can reason about the best fit.
    • {
        "name": "greet",
        "description": "Greets a person by name.",
        "parameters": {
          "type": "object",
          "properties": {
            "name": { "type": "string", "description": "The name of the person to greet." }
          },
          "required": ["name"]
        }
      }
      
  • Decision: The model (e.g. Claude or GPT) then decides which tool to use based on the prompt and context; in this case, the greet tool
  • Output: Tool plan: toolUseId = tooluse_jN8iuxwKRC6eji43nu64OQ
  • [
      {
        "text": "I'll greet Rahul using the greet tool right away."
      },
      {
        "toolUse": {
          "toolUseId": "tooluse_jN8iuxwKRC6eji43nu64OQ",
          "name": "greet",
          "input": {
            "name": "rahul"
          }
        }
      }
    ]
    

chat #2:

  • Input: Includes the toolResult as context
  • Output: Final message to the user

This separation is deliberate: tool planning and result-based reply generation are split for clarity, control, and extensibility.

Final Thoughts

This trace, from a seemingly trivial agent command, illustrates the powerful architecture of Strands Agent:

  • Planning is explicit and looped (via execute_event_loop_cycle)
  • Decisions and actions are clearly separated (chat, execute_tool)
  • Full observability is built-in with OpenTelemetry

If you're designing agents or trying to debug them, traces like this are your best lens into what’s really going on — and how LLMs, planning logic, and tools interact.