Apr 12, 2026

Beyond Last-K Turns: Building Memory That Actually Thinks

Every multi-turn AI agent needs memory. The simplest implementation is obvious: load the last N turns of conversation before each call, then append the new turn after.
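A minimal sketch of that baseline, using a fixed-size window (class and method names are illustrative):

```python
from collections import deque

class WindowMemory:
    """Keeps only the most recent N turns of a conversation."""

    def __init__(self, max_turns: int = 10):
        # Old turns fall off automatically once the window is full.
        self.turns = deque(maxlen=max_turns)

    def append(self, role: str, content: str) -> None:
        self.turns.append({"role": role, "content": content})

    def context(self) -> list[dict]:
        """Messages to prepend to the next model call."""
        return list(self.turns)

memory = WindowMemory(max_turns=3)
for i in range(5):
    memory.append("user", f"message {i}")
# Only the last 3 turns survive in memory.context().
```

The obvious cost: anything older than `max_turns` vanishes entirely, which is exactly the limitation the title is pointing at.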

LangGraph and Agent Frameworks: Using the Right Tool for the Job

There's a common trap when building AI-powered pipelines: reaching for an agentic framework because the problem feels “intelligent,” even when the solution is fundamentally deterministic. This post walks through a document ingestion system where that mistake shows up—and what the right mental model looks like.


The System: Ingesting Documents at Scale

The pipeline processes documents at scale—loading files from storage, extracting structured metadata via an LLM, enriching that metadata against external systems, and indexing everything into a vector store and document store for downstream retrieval.

The flow looks like this:

Object storage / local filesystem

list_documents

[per document]
load → classify → chunk → embed → extract_metadata → enrich → store → archive

Simple enough on paper. The complexity comes from two questions:

  1. How do you orchestrate deterministic steps cleanly?
  2. Where does the LLM fit in—and how?

The system uses two patterns to answer these: a graph-based workflow engine for orchestration and agent-based execution for LLM-driven tasks. Understanding when to use each is key.


LangGraph: When the Path Is Known

LangGraph is a workflow engine built on top of LangChain. Its core primitive is a directed graph where nodes are Python functions and edges define allowed transitions. State flows through the graph as a typed dictionary.

Here’s a simplified version of the ingestion graph:

from langgraph.graph import END, StateGraph

workflow = StateGraph(dict)

workflow.add_node("load_document", load_document)
workflow.add_node("classify_document", classify_document)
workflow.add_node("chunk_document", chunk_document)
workflow.add_node("embed_chunks", embed_chunks_node)
workflow.add_node("extract_metadata", extract_metadata_node)
workflow.add_node("enrich_metadata", enrich_metadata_node)
workflow.add_node("store_embeddings", store_embeddings_node)
workflow.add_node("store_summary", store_summary_node)
workflow.add_node("archive_document", archive_document)
workflow.add_node("skip_document", skip_document)

workflow.set_entry_point("load_document")
workflow.add_edge("load_document", "classify_document")

workflow.add_conditional_edges(
    "classify_document",
    should_process,
    {"process": "chunk_document", "skip": "skip_document"},
)

workflow.add_edge("chunk_document", "embed_chunks")
workflow.add_edge("embed_chunks", "extract_metadata")
workflow.add_edge("extract_metadata", "enrich_metadata")
workflow.add_edge("enrich_metadata", "store_embeddings")
workflow.add_edge("store_embeddings", "store_summary")
workflow.add_edge("store_summary", "archive_document")
workflow.add_edge("archive_document", END)
workflow.add_edge("skip_document", END)

graph = workflow.compile()

What this gives you:

  • Explicit control flow: Every transition is defined in code.
  • Typed state management: Each node declares inputs and outputs.
  • Deterministic branching: Conditions are pure Python—no LLM needed.
  • Composability: Easy to wrap per-document flows into batch processing.

Mental model: Use LangGraph when you know the answer to “what happens next?”
If the pipeline topology is fixed, a deterministic DAG is the right tool.


Agent Frameworks: When the LLM Decides the Path

Agent frameworks introduce a different execution model: the LLM drives control flow by choosing tools, interpreting results, and deciding what to do next.

The Right Use: Orchestrator with Tools

At query time, an orchestrator agent can route user questions to specialized downstream components, each exposed as a tool.

Example pattern:

def build_tools():
    return [
        make_tool("query_domain_a"),
        make_tool("query_domain_b"),
        make_tool("synthesize_results"),
    ]

At runtime, the LLM decides:

  • Should it call one tool or multiple?
  • Does it need to combine results?
  • Does it need to resolve entities first?

This kind of routing depends on semantic understanding, not deterministic rules.

No static DAG can reliably express this.

Mental model: Use an agent when the path depends on meaning the LLM must interpret.
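A sketch of that agent loop, with `call_llm` standing in for any chat model that returns either a tool request or a final answer (all names here are illustrative, not a specific framework's API):

```python
def run_orchestrator(question, tools, call_llm, max_steps=5):
    """Reason→act loop: the LLM picks the next tool until it can answer."""
    history = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        decision = call_llm(history, tools)        # LLM chooses the next action
        if decision["type"] == "final":
            return decision["text"]                # done: return the answer
        tool = tools[decision["name"]]
        result = tool(**decision.get("args", {}))  # execute the chosen tool
        history.append({"role": "tool", "content": str(result)})
    return "max steps reached"
```

The control flow lives inside the model's decisions, not in a static graph; the code only enforces a step budget.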


A Valid Use Case: Enrichment with Tool Interaction

In the enrichment step, an agent can call external systems (e.g., registries or APIs), interpret responses, and resolve ambiguity.

agent = Agent(
    model=model,
    system_prompt=prompt,
    tools=tools,
)

response = agent(prompt)

This is justified when:

  • Tool results may be ambiguous
  • Multiple calls may be needed
  • The LLM must reason about correctness

However, it’s worth monitoring: if the agent consistently ends up making a single tool call, a simpler pattern may be better.


The Anti-Pattern: Agent as a Thin Wrapper

A common mistake is using an agent for simple, single-step tasks:

agent = Agent(
    model=model,
    system_prompt=prompt,
)

response = agent(chunk)
parsed = parse_json(response)

No tools. No iteration. No decision-making.

This is just a prompt → JSON call with unnecessary overhead.

Problems:

  • Added latency from agent loop setup
  • Repeated overhead for each chunk
  • Fragile parsing logic
  • No strong structure guarantees

The Better Approach: Structured LLM Calls

Use direct structured output instead:

from langchain_core.messages import HumanMessage, SystemMessage

llm = SomeLLM(model="...", temperature=0.2)
chain = llm.with_structured_output(MySchema)

result = chain.invoke([
    SystemMessage(content=system_prompt),
    HumanMessage(content=chunk),
])
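For reference, the MySchema passed to with_structured_output can be a Pydantic model or a TypedDict; a hypothetical TypedDict version (field names are illustrative):

```python
from typing import TypedDict

class MySchema(TypedDict):
    # Hypothetical extraction target for a document chunk.
    title: str
    doc_type: str        # e.g. "invoice", "contract", "report"
    keywords: list[str]
```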

Benefits:

  • Strong typing via schema validation
  • No manual parsing
  • Lower latency
  • Simpler execution model

The Decision Framework

Does control flow depend on meaning the LLM must interpret?

├─ NO → Use LangGraph (or plain code)
│ Fixed steps, deterministic branching
│ Examples: ETL, document pipelines

└─ YES → Does the LLM need tools or iteration?

├─ NO → Use direct structured LLM call
│ Prompt → structured output
│ Examples: extraction, classification

└─ YES → Use an agent
Tool selection + reasoning loop
Examples: routing, research, disambiguation

When each layer does only its job, the system becomes simpler, faster, and easier to reason about.
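The same decision tree, encoded as a tiny helper:

```python
def pick_pattern(meaning_dependent: bool, needs_tools_or_iteration: bool) -> str:
    """Encodes the decision framework above."""
    if not meaning_dependent:
        return "langgraph-or-plain-code"   # fixed steps, deterministic branching
    if not needs_tools_or_iteration:
        return "structured-llm-call"       # prompt → structured output
    return "agent"                         # tool selection + reasoning loop
```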

Apr 11, 2026

Managing Tool Output: Avoiding Context Explosion in Agent Systems

 

While reviewing and optimizing agent execution, another important issue surfaced:

👉 Tool outputs can silently bloat the context

Even with perfect planning and parallel execution, performance can degrade if the data flowing into the model is too large.


🧠 The Problem: Context Growth Over Cycles

In agent workflows, especially with chaining:

Cycle 1 → tool output  
Cycle 2 → tool output + previous data  
Cycle 3 → tool output + accumulated data  

👉 Context keeps growing with each step


🚨 Why this is a problem

  • Large payloads (nested JSON, unused fields)

  • Duplicate data across steps

  • Irrelevant fields carried forward

Impact

  • Increased token usage

  • Slower LLM response time

  • Higher cost

  • Greater chance of confusion or incorrect field usage


🔍 Root Cause

Tools typically return:

  • full API responses

  • deeply nested structures

  • more data than required

The LLM then:

  • has to sift through everything

  • often carries forward unnecessary data


🚀 Improvements

1. Let the LLM discard unnecessary data (lightweight fix)

Instruct the model to:

  • extract only required fields

  • ignore irrelevant data

👉 Helps, but not always reliable for large payloads


2. Add intelligence at the tool layer (stronger fix)

Instead of returning raw responses:

  • Return only relevant fields

  • Flatten nested structures

  • Provide clean, minimal data

👉 Similar to how GraphQL works:

  • client specifies what it needs

  • response includes only that
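A sketch of that tool-layer filtering, using a dotted-path field selector so the caller declares what it needs, GraphQL-style (the function name and data shapes are illustrative):

```python
def slim(payload: dict, fields: dict) -> dict:
    """Return only the requested fields from a nested API response.

    `fields` maps output names to dotted paths into the payload;
    everything not asked for is dropped before it reaches the LLM.
    """
    out = {}
    for name, path in fields.items():
        value = payload
        for key in path.split("."):
            value = value.get(key) if isinstance(value, dict) else None
        out[name] = value
    return out

raw = {"order": {"id": "o-1",
                 "customer": {"name": "Ada"},
                 "audit": {"trace": "huge nested blob the LLM never needs"}}}
minimal = slim(raw, {"order_id": "order.id", "customer": "order.customer.name"})
```

The tool returns `minimal`, not `raw`, so the context only ever sees the two fields that matter.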


✅ Target Pattern

Tool → minimal structured output → LLM → format response

Instead of:

Tool → large raw JSON → LLM → filter + format

🎯 Final Thought

Efficient agents don’t just call the right tools —
they also control what data comes back from them.


From Reactive Chaos to Planned Parallelism: Optimizing a Bedrock Agent

Reviewing a Bedrock Agent: From “Works” to “Works Efficiently”

I recently reviewed a Bedrock agent that was functionally correct — it answered queries, used tools properly, and produced accurate results.

👉 But it wasn’t efficient.

This is a quick summary of what was happening and what improved.


🧠 Initial Behavior: Reactive Execution

The agent followed a step-by-step loop:

Cycle 1 → resolve date via tool  
Cycle 2 → fetch primary data  
Cycle 3 → fetch additional data  
Cycle 4 → fetch metadata  
Cycle 5 → final response  

What was happening?

  • No upfront planning

  • Data discovered incrementally

  • Each missing piece triggered another call

  • All operations were sequential


⏱️ Why a time tool existed

The agent handled queries like:

“first 10 days of last month”

Since LLMs don’t reliably know the current date, a time tool was added to:

  • ensure correct date calculations

  • avoid inconsistent outputs

👉 It solved correctness
👉 But added an extra cycle every time


🚨 Core Issue

The agent was reactive instead of planned

Do something → realize missing data → do more → repeat

Instead of:

Understand everything → execute once

🚀 Improvements

1. Provide current date directly

  • Removed dependency on time tool

  • Eliminated one full cycle
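A minimal sketch of the fix: inject the date into the system prompt, so relative-date arithmetic stays deterministic (prompt wording is illustrative):

```python
from datetime import date, timedelta

today = date.today()
system_prompt = (
    "You are a reporting assistant.\n"
    f"Current date: {today.isoformat()}\n"
    "Resolve relative dates (like 'last month') against this date."
)

# The same arithmetic the model previously needed a tool round-trip for,
# e.g. "first 10 days of last month", done in plain code:
first_of_this_month = today.replace(day=1)
last_month_start = (first_of_this_month - timedelta(days=1)).replace(day=1)
window = (last_month_start, last_month_start + timedelta(days=9))
```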


2. Upfront planning

The agent now:

  • identifies all required data

  • plans execution before acting


3. Parallel execution

Independent data is now fetched together instead of sequentially


4. Dependency awareness

  • Independent data → parallel

  • Dependent data → separate step only when required


✅ Final Execution Patterns

No dependency

Cycle 1 → fetch all data (parallel)  
Cycle 2 → final response  

With dependency

Cycle 1 → fetch base data (parallel)  
Cycle 2 → fetch dependent data  
Cycle 3 → final response  
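The "no dependency" pattern maps directly onto concurrent execution; a toy sketch with stand-in fetchers (names are illustrative):

```python
import asyncio

# Stand-ins for independent tool calls.
async def fetch_primary():  return {"primary": 1}
async def fetch_extra():    return {"extra": 2}
async def fetch_metadata(): return {"meta": 3}

async def cycle_one():
    # All independent fetches run concurrently in a single cycle,
    # instead of one tool call per cycle.
    parts = await asyncio.gather(fetch_primary(), fetch_extra(), fetch_metadata())
    return {k: v for part in parts for k, v in part.items()}

data = asyncio.run(cycle_one())
```

With a dependency, only the dependent fetch moves to a second `gather` step; everything else still runs in parallel.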

Aug 30, 2025

Building Smarter AI Agents

When building production-ready AI agents, the model is just one part of the story. Equally important are the evaluation frameworks, logging/observability tools, lightweight client libraries, and prompt orchestration frameworks.

Here are the key packages I use, what they do, and other similar options in the ecosystem.


Ragas (Retrieval-Augmented Generation Assessment)

  • What it is:
    Ragas is a framework to evaluate retrieval-augmented generation (RAG) systems. It provides automatic metrics for faithfulness, answer relevance, retrieval precision/recall, and more.
  • Use case:
    When testing RAG pipelines, I can automatically score how well my system retrieves documents and whether the model’s answers stick to the evidence.
  • Similar libraries:
    DeepEval – generic LLM evaluation toolkit.
    TruLens – for evaluating and monitoring LLM apps (esp. RAG).
    Evalchemy – simpler eval DSL for agents and RAG pipelines.


Langfuse

  • What it is:
    Langfuse is an observability and logging platform for LLM applications. It captures traces, spans, prompts, model outputs, tool calls, and lets you replay & analyze runs.
  • Use case:
    I use it to debug agent workflows, track cost/latency, and visualize multi-tool execution. It’s like “OpenTelemetry for LLMs.”

LiteLLM

  • What it is:
    LiteLLM is a unified API wrapper for >100 LLM providers (OpenAI, Anthropic, Bedrock, Ollama, Azure, etc.). 
  • Use case:
    Lets me switch between models (Claude, GPT-4, Llama, etc.) without rewriting my code. Also supports rate-limiting, retries, logging, and cost tracking.


Aug 7, 2025

Deep Dive into Strands Agent Trace

When debugging or understanding LLM agents like Strands, tracing is critical. Recently, I ran a simple prompt — “greet rahul using tool” — and captured the trace emitted by Strands Agent using OpenTelemetry. Even though the use case was simple, the trace revealed the elegance of how the agent plans, executes, and finalizes its response in structured steps.

Here’s a breakdown of the key spans and why they exist, with a focus on the two chat spans and two execute_event_loop_cycle spans, and how they tie together.

What Happened?

The user instruction was: "greet rahul using tool"

Strands Agent, powered by Claude 3 Sonnet, interpreted this and:

  • Planned a toolUse of the greet tool with input "rahul".
  • Executed the tool.
  • Used the result to send back a human-friendly message.

This simple interaction resulted in a two-turn conversation, captured in two event loop cycles.

Trace Summary (at a glance)

Span name                   | Duration (s) | Key events                              | Role
----------------------------|--------------|-----------------------------------------|------------------------------
execute_event_loop_cycle #1 | 2.28         | toolUse, toolResult                     | Plans and executes tool
chat #1                     | 2.15         | LLM decides: use greet on "rahul"       | Claude generates plan
execute_tool greet          | 0.13         | Input: "rahul" → Output: "Hello, rahul" | Tool executes
execute_event_loop_cycle #2 | 3.06         | Final reply from model                  | Uses tool result to complete
chat #2                     | 3.06         | LLM returns final message               | Claude wraps up
invoke_agent                | 5.33         | All agent activity spans                | Wraps both cycles
agent.run                   | 5.33         | Full request lifecycle                  | Top-level root span


agent.run (INTERNAL)

└── invoke_agent "Strands Agents" (CLIENT)

    ├── execute_event_loop_cycle (INTERNAL)

    │   ├── chat (LLM planning) (CLIENT)

    │   └── execute_tool greet (INTERNAL)

    └── execute_event_loop_cycle (INTERNAL)

        └── chat (LLM response) (CLIENT)

Why Two execute_event_loop_cycle?

Strands Agent works in planning cycles — each one wraps:

  • Observation of the current context (user input, tool results)
  • A model call (chat)
  • Optional tool execution (execute_tool)
  • A choice of whether to continue, end, or act again
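The cycle can be sketched as a plain loop (illustrative only, not the actual Strands implementation):

```python
def event_loop(model_call, tools, messages, max_cycles=10):
    """One iteration per execute_event_loop_cycle span."""
    for _ in range(max_cycles):
        reply = model_call(messages)                 # the `chat` span
        messages.append(reply)
        if "toolUse" not in reply:                   # no action requested: done
            return reply["text"]
        use = reply["toolUse"]
        result = tools[use["name"]](**use["input"])  # the `execute_tool` span
        messages.append({"toolResult": result})      # observed on the next cycle
```

Running the greet scenario through this sketch produces exactly two cycles: one that plans and runs the tool, and one that turns the tool result into the final message.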

Cycle 1:

  • Interprets "greet rahul using tool"
  • Decides to call greet({ name: "rahul" })
  • Tool responds with "Hello, rahul"

Cycle 2:

  • Receives the tool result
  • Generates the final assistant message returned to the user

Why Two chat Spans?

Each chat span is a model call. Here’s how they differ:

chat #1:

  • Input: 
    • Raw user message
    • Full conversation history: Previous user inputs, assistant replies, tool calls (toolUse) and tool results (toolResult) are automatically replayed in the context window so the model has memory of the ongoing session.
    • Tool schemas: Strands injects all available tools (from gen_ai.agent.tools) into the prompt as structured JSON function specs. Even if only one tool is used, the model sees them all so it can reason about the best fit.
    • {
        "name": "greet",
        "description": "Greets a person by name.",
        "parameters": {
          "type": "object",
          "properties": {
            "name": { "type": "string", "description": "The name of the person to greet." }
          },
          "required": ["name"]
        }
      }
      
  • Decision: The model (e.g., Claude or GPT) then decides which tool to use based on the prompt and context; in this case, the greet tool
  • Output: Tool plan: toolUseId = tooluse_jN8iuxwKRC6eji43nu64OQ
  • [
      {
        "text": "I'll greet Rahul using the greet tool right away."
      },
      {
        "toolUse": {
          "toolUseId": "tooluse_jN8iuxwKRC6eji43nu64OQ",
          "name": "greet",
          "input": {
            "name": "rahul"
          }
        }
      }
    ]
    

chat #2:

  • Input: Includes the toolResult as context
  • Output: Final message to the user

This separation is deliberate: tool planning and result-based reply generation are split for clarity, control, and extensibility.

Final Thoughts

This trace, from a seemingly trivial agent command, illustrates the powerful architecture of Strands Agent:

  • Planning is explicit and looped (via execute_event_loop_cycle)
  • Decisions and actions are clearly separated (chat, execute_tool)
  • Full observability is built-in with OpenTelemetry

If you're designing agents or trying to debug them, traces like this are your best lens into what’s really going on — and how LLMs, planning logic, and tools interact.

Jun 29, 2025

Agentic AI Framework

Building an AI agent that can reason and use tools requires more than just a powerful LLM. Ever wondered what's happening behind the scenes of a conversational AI agent? This post breaks down four approaches (the OpenAI Responses API, AWS Bedrock Agents, Strands Agents, and LangGraph) to help you understand where your agent's conversation state lives, how tools are managed, and which solution offers the most flexibility.

OpenAI Responses API

  1. The OpenAI Responses API is a unified interface for building powerful, agent-like applications. It's an evolution of Chat Completions, which has no server-side state, so with Chat Completions you must resend history on every call.
      client.responses.create(
        model="gpt-4o-mini",
        input=[
            {
                "role": "system",
                "content": "You are a helpful assistant. Use tools when needed.",
            },
            {"role": "user", "content": user_question},
        ],
        tools=tools,
        parallel_tool_calls=True,
    )
    
  2. Conversation state lives on OpenAI's servers (for OpenAI-hosted models) when you pass previous_response_id. If you point your OpenAI client at a proxy (e.g., LiteLLM), state (if any) is maintained by the proxy, not OpenAI. Pass previous_response_id to link to prior turns without resending them.
    client.responses.create(
        model="gpt-4o-mini",
        input=tool_outputs,
        tools=tools,  # keep tools if follow-up calls might happen
        previous_response_id=resp1.id,  # <— important
    )
  3. Each turn re-bills the effective prompt (prior items + new items). You may get prompt caching discounts for repeated prefixes.
  4. You execute tools and feed results back in a follow-up call.

    request → model returns function_call → you run the tool(s) → send function_call_output → repeat until no more tool calls → final answer.
      if response_output["type"] == "function_call":
          function_name = response_output["name"]
          function_args = response_output["arguments"]
      if response_output["type"] == "message":
          # an assistant message with content blocks
          ...
  5. SDK provides built-in tracing & run history
  6. Native only to OpenAI (and Azure OpenAI). For other LLMs you’d need a proxy that emulates the Responses API.


AWS Bedrock Agents

  1. Fully managed AWS service, configured via console
  2. When you invoke a Bedrock Agent (via API or console), AWS establishes a runtime session for that user/conversation. The conversation history (prior user inputs, model responses, tool invocations, intermediate results) is stored on AWS infrastructure associated with that sessionId. When the agent calls a tool, the outputs are persisted in session state.
  3. Pay-per-use on AWS (per token + infra integration). Costs tied to Bedrock pricing. AWS runtime decides what minimal state to pass back into the LLM — e.g., compressed summaries, selected tool outputs, prior reasoning steps. You don’t control (or see) the exact serialization, but the idea is that AWS optimizes the context window management for you.
  4. Tools are configured in AWS console (e.g., Lambdas, Step Functions). Execution handled natively by Bedrock runtime.
  5. Integrated with AWS CloudWatch/X-Ray
  6. Bedrock-hosted models (Anthropic, Llama, Claude, Mistral, etc.)

Strands Agent

  1. A Python agent runtime (SDK) that runs in your process. You bring any model (OpenAI, Bedrock via LiteLLM, etc.), and Strands orchestrates prompts, tools, and streaming.
        agent = Agent(
            model=model, tools=tools, system_prompt=system_prompt
        )
        answer = agent(user_input)
        
  2. Conversation state lives in your app. Strands holds the working memory/trace during a run; you decide what to persist (DB, Redis, files). If your underlying model/proxy also supports server state, you can choose to use it, but Strands doesn’t require it.
  3. You only pay for tokens you actually send to the underlying model. No separate “state storage” cost; total cost depends on how much context you include.
  4. You register functions (schemas); Strands drives the reason→act→observe loop, runs tools, and feeds results back to the model. Parallelization is under your control.
  5. Rich observability (structured logs, OpenTelemetry)
  6. Vendor-agnostic. Use OpenAI, Bedrock (Claude), local models, etc. via adapters (e.g., LiteLLMModel). 

LangGraph

  1. A Python agent graph runtime (SDK) that runs in your process. You model the agent as a graph (nodes = LLM/tool/human steps; edges = control flow). Use prebuilt agents like create_react_agent or compose your own nodes/routers. Works fine with OpenAI, Bedrock, local LLMs, etc.
  2. Conversation state lives in your app via LangGraph checkpointers. You pass a thread_id and a checkpointer (in-memory, SQLite/Postgres, or the hosted LangGraph Platform). LangGraph restores prior turns/working memory automatically. If your model/proxy has server state, you can use it, but LangGraph doesn’t require it—you choose what to persist (messages, summaries, tool outputs).
    agent = create_react_agent(
        model=llm,
        tools=tools,
        prompt="You are a helpful assistant. Use the tools provided to answer questions. If you don't know the answer, use your tools.",
    )
    
    # Define the graph with a state machine.
    workflow = StateGraph(AgentState)
    workflow.add_node("agent", agent)
    workflow.add_edge(START, "agent")
    workflow.add_edge(
        "agent", END
    )  # In a simple case, the agent node can go directly to END
    
    # Compile the graph
    app = workflow.compile()      
  3. You only pay for tokens sent to the underlying model. There’s no separate “state storage” cost from LangGraph itself. Your total cost depends on how much context you rehydrate per turn (and any DB/Platform you choose for persistence).
  4. You register tools (functions/schemas) and LangGraph drives the reason → act → observe loop. Tools can be simple Python callables or LangChain @tools. Prebuilt ReAct agents or custom graphs will invoke tools and feed results back to the model, with support for loops, branching, retries, timeouts, and parallelization via concurrent branches/map nodes—under your control.
  5. Rich observability. LangGraph Studio (local or Platform) provides a visual graph, step-level inputs/outputs, token/cost traces, and checkpoint “time-travel” to replay from any step. Plays well with your logging/metrics stack.
  6. Vendor-agnostic. Use OpenAI, AWS Bedrock (Claude), Google, local/Ollama, etc., through LangChain adapters; swap models without rewriting your graph.

What “good” looks like for an agent framework

Your bar should be simple: pick the stack that’s composable, observable, portable, and cheap to change. 

  • Plug-and-play with the rest of your stack
    Must integrate cleanly with eval (Ragas/DeepEval), observability (Langfuse/Helicone/OTel), and your data/vectors (pgvector, Weaviate, Pinecone, Redis), without adapters that fight each other.
  • Standard, first-class observability
    Step-level traces, token/cost accounting, latency/error breakdowns, replay/time-travel, export to OpenTelemetry. If you can’t answer “what happened and why?” in one place, it won’t survive prod.
  • Model-agnostic
    Swap OpenAI ↔ Bedrock/Anthropic ↔ local (Ollama/vLLM) with minimal code changes.
    • Model routing / cascades - Use a small/fast model for easy cases; fallback to Claude/GPT only when needed.
    • Distillation - Have a big model generate labeled data, then train an open, smaller model (e.g., 7–13B) on it. You own/serve this smaller model (and can quantize it) for big savings.
  • Cloud-agnostic
    Run anywhere (local, k8s, AWS/Azure/GCP). No hard vendor lock-in for core logic. If you leave a cloud, your agent should come with you.
  • Lightweight & composable (“LEGO-style”)
    Small primitives you can rearrange. Clear boundaries between reason → act → observe, easy to add/remove tools, and simple to test.
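For example, a routing cascade like the one described above can start as a few lines (the model names and the hardness heuristic here are purely illustrative):

```python
def route(question: str, call_model) -> str:
    """Try a small model first; escalate only when the query looks hard."""
    def looks_hard(q: str) -> bool:
        # Toy heuristic; production routers use classifiers or embeddings.
        return len(q.split()) > 30 or "analyze" in q.lower()

    model = "big-model" if looks_hard(question) else "small-fast-model"
    return call_model(model, question)
```

Because `call_model` is injected, the same router works unchanged against OpenAI, Bedrock, or a local model, which is the model-agnostic property the checklist asks for.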