
Aug 30, 2025

Building Smarter AI Agents

When building production-ready AI agents, the model is just one part of the story. Equally important are the evaluation frameworks, logging/observability tools, lightweight client libraries, and prompt orchestration frameworks.

Here are the key packages I use, what they do, and other similar options in the ecosystem.


Ragas (Retrieval-Augmented Generation Assessment)

  • What it is:
    Ragas is a framework to evaluate retrieval-augmented generation (RAG) systems. It provides automatic metrics for faithfulness, answer relevance, retrieval precision/recall, and more.
  • Use case:
    When testing RAG pipelines, I can automatically score how well my system retrieves documents and whether the model’s answers stick to the evidence (see the sketch after this list).
  • Similar libraries:
    DeepEval – generic LLM evaluation toolkit.
    TruLens – for evaluating and monitoring LLM apps (esp. RAG).
    Evalchemy – simpler eval DSL for agents and RAG pipelines.
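
Here’s a minimal sketch of what a Ragas run looks like. Column names and metric imports vary across Ragas versions (this follows the classic 0.1-style API), it assumes an OpenAI key is configured for the judge model, and the sample data below is made up.

    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import answer_relevancy, context_precision, faithfulness

    # Toy example: one question, the model's answer, the retrieved contexts, and a reference.
    data = {
        "question": ["Who introduced the Transformer architecture?"],
        "answer": ["Vaswani et al., in the 2017 paper 'Attention Is All You Need'."],
        "contexts": [["'Attention Is All You Need' (Vaswani et al., 2017) introduced the Transformer."]],
        "ground_truth": ["Vaswani et al., 2017"],
    }

    result = evaluate(
        Dataset.from_dict(data),
        metrics=[faithfulness, answer_relevancy, context_precision],
    )
    print(result)  # per-metric scores between 0 and 1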


Langfuse

  • What it is:
    Langfuse is an observability and logging platform for LLM applications. It captures traces, spans, prompts, model outputs, tool calls, and lets you replay & analyze runs.
  • Use case:
    I use it to debug agent workflows, track cost/latency, and visualize multi-tool execution. It’s like “OpenTelemetry for LLMs.”
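
A minimal sketch of decorator-based tracing with the Langfuse Python SDK (the import path varies by SDK version; LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, and LANGFUSE_HOST are assumed to be set):

    from langfuse.decorators import observe  # in newer SDKs: from langfuse import observe

    @observe()  # records this call as a trace; nested @observe functions become child spans
    def answer(question: str) -> str:
        # call your model and tools here; inputs/outputs are captured automatically
        return "stub answer"

    answer("greet rahul using tool")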

LiteLLM

  • What it is:
    LiteLLM is a unified API wrapper for >100 LLM providers (OpenAI, Anthropic, Bedrock, Ollama, Azure, etc.). 
  • Use case:
    Lets me switch between models (Claude, GPT-4, Llama, etc.) without rewriting my code. Also supports rate-limiting, retries, logging, and cost tracking.
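
A minimal sketch (provider credentials are assumed to be set as environment variables; the model strings are examples):

    from litellm import completion

    # Same call shape for every provider; only the model string changes,
    # e.g. "claude-3-5-sonnet-20240620", "bedrock/anthropic.claude-3-sonnet-20240229-v1:0", "ollama/llama3".
    response = completion(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Say hello to Rahul."}],
    )
    print(response.choices[0].message.content)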


Aug 7, 2025

Deep Dive into Strands Agent Trace

When debugging or understanding LLM agents like Strands, tracing is critical. Recently, I ran a simple prompt — “greet rahul using tool” — and captured the trace emitted by Strands Agent using OpenTelemetry. Even though the use case was simple, the trace revealed the elegance of how the agent plans, executes, and finalizes its response in structured steps.

Here’s a breakdown of the key spans and why they exist, with a focus on the two chat spans and two execute_event_loop_cycle spans, and how they tie together.

What Happened?

The user instruction was: "greet rahul using tool"

Strands Agent, powered by Claude 3 Sonnet, interpreted this and:

  • Planned a toolUse of the greet tool with input "rahul".
  • Executed the tool.
  • Used the result to send back a human-friendly message.

This simple interaction resulted in a two-turn conversation, captured in two event loop cycles.
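
For reference, the setup that produces a trace like this is small. A rough sketch with the strands-agents SDK (model configuration and the OpenTelemetry exporter setup are omitted; the SDK's default model is assumed):

    from strands import Agent, tool

    @tool
    def greet(name: str) -> str:
        """Greets a person by name."""
        return f"Hello, {name}"

    # Spans like the ones below are emitted via OpenTelemetry once an exporter is configured.
    agent = Agent(tools=[greet])
    agent("greet rahul using tool")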

Trace Summary (at a glance)

Span Name | Duration (s) | Key Events | Role
execute_event_loop_cycle #1 | 2.28 | toolUse, toolResult | Plans and executes tool
chat #1 | 2.15 | LLM decides: use greet on "rahul" | Claude generates plan
execute_tool greet | 0.13 | Input: "rahul" → Output: "Hello, rahul" | Tool executes
execute_event_loop_cycle #2 | 3.06 | Final reply from model | Uses tool result to complete
chat #2 | 3.06 | LLM returns final message | Claude wraps up
invoke_agent | 5.33 | All agent activity spans | Wraps both cycles
agent.run | 5.33 | Full request lifecycle | Top-level root span


agent.run (INTERNAL)
└── invoke_agent "Strands Agents" (CLIENT)
    ├── execute_event_loop_cycle (INTERNAL)
    │   ├── chat (LLM planning) (CLIENT)
    │   └── execute_tool greet (INTERNAL)
    └── execute_event_loop_cycle (INTERNAL)
        └── chat (LLM response) (CLIENT)

Why Two execute_event_loop_cycle?

Strands Agent works in planning cycles — each one wraps:

  • Observation of the current context (user input, tool results)
  • A model call (chat)
  • Optional tool execution (execute_tool)
  • A choice of whether to continue, end, or act again

Cycle 1:

  • Interprets "greet rahul using tool"
  • Decides to call greet({ name: "rahul" })
  • Tool responds with "Hello, rahul"

Cycle 2:

  • Receives the tool result
  • Generates the final assistant message for the user
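
In pseudocode, the loop that produces these cycles looks roughly like this (a simplified illustration, not the actual Strands implementation; chat and tool_result are hypothetical helpers):

    def event_loop(messages, tools):
        while True:                              # each iteration = one execute_event_loop_cycle span
            reply = chat(messages, tools)        # the chat span: one model call
            messages.append(reply)
            if not reply.tool_uses:              # no toolUse -> the reply is the final answer
                return reply
            for tool_use in reply.tool_uses:     # the execute_tool span(s)
                output = tools[tool_use.name](**tool_use.input)
                messages.append(tool_result(tool_use.tool_use_id, output))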

Why Two chat Spans?

Each chat span is a model call. Here’s how they differ:

chat #1:

  • Input: 
    • Raw user message
    • Full conversation history: Previous user inputs, assistant replies, tool calls (toolUse) and tool results (toolResult) are automatically replayed in the context window so the model has memory of the ongoing session.
    • Tool schemas: Strands injects all available tools (from gen_ai.agent.tools) into the prompt as structured JSON function specs. Even if only one tool is used, the model sees them all so it can reason about the best fit.
    • {
        "name": "greet",
        "description": "Greets a person by name.",
        "parameters": {
          "type": "object",
          "properties": {
            "name": { "type": "string", "description": "The name of the person to greet." }
          },
          "required": ["name"]
        }
      }
      
  • Decision: The model (e.g., Claude or GPT) then decides which tool to use based on the prompt and context; in this case it picks the greet tool
  • Output: Tool plan: toolUseId = tooluse_jN8iuxwKRC6eji43nu64OQ
  • [
      {
        "text": "I'll greet Rahul using the greet tool right away."
      },
      {
        "toolUse": {
          "toolUseId": "tooluse_jN8iuxwKRC6eji43nu64OQ",
          "name": "greet",
          "input": {
            "name": "rahul"
          }
        }
      }
    ]
    

chat #2:

  • Input: Includes the toolResult as context
  • Output: Final message to the user

This separation is deliberate: tool planning and result-based reply generation are split for clarity, control, and extensibility.

Final Thoughts

This trace, from a seemingly trivial agent command, illustrates the powerful architecture of Strands Agent:

  • Planning is explicit and looped (via execute_event_loop_cycle)
  • Decisions and actions are clearly separated (chat, execute_tool)
  • Full observability is built-in with OpenTelemetry

If you're designing agents or trying to debug them, traces like this are your best lens into what’s really going on — and how LLMs, planning logic, and tools interact.

Jun 29, 2025

Agentic AI Framework

Building an AI agent that can reason and use tools requires more than just a powerful LLM. Ever wondered what's happening behind the scenes of a conversational AI agent? This post breaks down four approaches (OpenAI Responses API, AWS Bedrock Agents, Strands, and LangGraph) to help you understand where your agent's conversation state lives, how tools are managed, and which solution offers the most flexibility.

OpenAI Responses API

  1. The OpenAI Responses API is a unified interface for building powerful, agent-like applications. It's an evolution of Chat Completions, which has no server-side state, so with Chat Completions you must resend the history on every turn.
      from openai import OpenAI

      client = OpenAI()
      resp1 = client.responses.create(
          model="gpt-4o-mini",
          input=[
              {
                  "role": "system",
                  "content": "You are a helpful assistant. Use tools when needed.",
              },
              {"role": "user", "content": user_question},
          ],
          tools=tools,
          parallel_tool_calls=True,
      )
    
  2. Conversation state lives on OpenAI’s servers (for OpenAI-hosted models) when you pass previous_response_id. If you point your OpenAI client at a proxy (e.g., LiteLLM), state (if any) is maintained by the proxy, not OpenAI. Pass previous_response_id to link to prior turns without resending them.
      resp2 = client.responses.create(
          model="gpt-4o-mini",
          input=tool_outputs,
          tools=tools,  # keep tools if follow-up calls might happen
          previous_response_id=resp1.id,  # <— important
      )
  3. Each turn re-bills the effective prompt (prior items + new items). You may get prompt caching discounts for repeated prefixes.
  4. You execute tools and feed results back in a follow-up call.

    request → model returns function_call → you run the tool(s) → send function_call_output → repeat until no more tool calls → final answer.
      if response_output["type"] == "function_call":
          function_name = response_output["name"]
          function_args = response_output["arguments"]
      if response_output["type"] == "message":
          # an assistant message with content blocks
          ...
  5. SDK provides built-in tracing & run history
  6. Native only to OpenAI (and Azure OpenAI). For other LLMs you’d need a proxy that emulates the Responses API.


AWS Bedrock Agents

  1. Fully managed AWS service, configured via console
  2. When you invoke a Bedrock Agent (via API or console), AWS establishes a runtime session for that user/conversation. The conversation history (prior user inputs, model responses, tool invocations, intermediate results) is stored on AWS infrastructure associated with that sessionId. When the agent calls a tool, the outputs are persisted in session state (see the invoke sketch after this list).
  3. Pay-per-use on AWS (per token + infra integration). Costs tied to Bedrock pricing. AWS runtime decides what minimal state to pass back into the LLM — e.g., compressed summaries, selected tool outputs, prior reasoning steps. You don’t control (or see) the exact serialization, but the idea is that AWS optimizes the context window management for you.
  4. Tools are configured in AWS console (e.g., Lambdas, Step Functions). Execution handled natively by Bedrock runtime.
  5. Integrated with AWS CloudWatch/X-Ray
  6. Bedrock-hosted models only (Anthropic Claude, Meta Llama, Mistral, etc.)
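
For reference, invoking a configured Bedrock Agent from code is a thin call. A rough boto3 sketch (the agent and alias IDs are placeholders you would copy from the Bedrock console):

    import uuid
    import boto3

    client = boto3.client("bedrock-agent-runtime")
    response = client.invoke_agent(
        agentId="AGENT_ID",               # placeholder
        agentAliasId="AGENT_ALIAS_ID",    # placeholder
        sessionId=str(uuid.uuid4()),      # AWS keeps conversation state keyed by this sessionId
        inputText="greet rahul using tool",
    )
    # The response streams events; concatenate the chunk bytes for the final text.
    answer = "".join(
        event["chunk"]["bytes"].decode("utf-8")
        for event in response["completion"]
        if "chunk" in event
    )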

Strands Agent

  1. A Python agent runtime (SDK) that runs in your process. You bring any model (OpenAI, Bedrock via LiteLLM, etc.), and Strands orchestrates prompts, tools, and streaming.
        agent = Agent(
            model=model, tools=tools, system_prompt=system_prompt
        )
        answer = agent(user_input)
        
  2. Conversation state lives in your app. Strands holds the working memory/trace during a run; you decide what to persist (DB, Redis, files). If your underlying model/proxy also supports server state, you can choose to use it, but Strands doesn’t require it.
  3. You only pay for tokens you actually send to the underlying model. No separate “state storage” cost; total cost depends on how much context you include.
  4. You register functions (schemas); Strands drives the reason → act → observe loop, runs tools, and feeds results back to the model. Parallelization is under your control.
  5. Rich observability (structured logs, OpenTelemetry)
  6. Vendor-agnostic. Use OpenAI, Bedrock (Claude), local models, etc. via adapters (e.g., LiteLLMModel). 

LangGraph

  1. A Python agent graph runtime (SDK) that runs in your process. You model the agent as a graph (nodes = LLM/tool/human steps; edges = control flow). Use prebuilt agents like create_react_agent or compose your own nodes/routers. Works fine with OpenAI, Bedrock, local LLMs, etc.
  2. Conversation state lives in your app via LangGraph checkpointers. You pass a thread_id and a checkpointer (in-memory, SQLite/Postgres, or the hosted LangGraph Platform). LangGraph restores prior turns/working memory automatically. If your model/proxy has server state, you can use it, but LangGraph doesn’t require it—you choose what to persist (messages, summaries, tool outputs).
    from langgraph.graph import StateGraph, MessagesState, START, END
    from langgraph.prebuilt import create_react_agent

    # A simple message-list state; swap in your own TypedDict if you need more fields.
    AgentState = MessagesState

    agent = create_react_agent(
        model=llm,
        tools=tools,
        prompt="You are a helpful assistant. Use the tools provided to answer questions. If you don't know the answer, use your tools.",
    )

    # Define the graph with a state machine.
    workflow = StateGraph(AgentState)
    workflow.add_node("agent", agent)
    workflow.add_edge(START, "agent")
    workflow.add_edge(
        "agent", END
    )  # In a simple case, the agent node can go directly to END

    # Compile the graph
    app = workflow.compile()
  3. You only pay for tokens sent to the underlying model. There’s no separate “state storage” cost from LangGraph itself. Your total cost depends on how much context you rehydrate per turn (and any DB/Platform you choose for persistence).
  4. You register tools (functions/schemas) and LangGraph drives the reason → act → observe loop. Tools can be simple Python callables or LangChain @tools. Prebuilt ReAct agents or custom graphs will invoke tools and feed results back to the model, with support for loops, branching, retries, timeouts, and parallelization via concurrent branches/map nodes—under your control.
  5. Rich observability. LangGraph Studio (local or Platform) provides a visual graph, step-level inputs/outputs, token/cost traces, and checkpoint “time-travel” to replay from any step. Plays well with your logging/metrics stack.
  6. Vendor-agnostic. Use OpenAI, AWS Bedrock (Claude), Google, local/Ollama, etc., through LangChain adapters; swap models without rewriting your graph.

What “good” looks like for an agent framework

Your bar should be simple: pick the stack that’s composable, observable, portable, and cheap to change. 

  • Plug-and-play with the rest of your stack
    Must integrate cleanly with eval (Ragas/DeepEval), observability (Langfuse/Helicone/OTel), and your data/vectors (pgvector, Weaviate, Pinecone, Redis), without adapters that fight each other.
  • Standard, first-class observability
    Step-level traces, token/cost accounting, latency/error breakdowns, replay/time-travel, export to OpenTelemetry. If you can’t answer “what happened and why?” in one place, it won’t survive prod.
  • Model-agnostic
    Swap OpenAI ↔ Bedrock/Anthropic ↔ local (Ollama/vLLM) with minimal code changes.
    • Model routing / cascades - Use a small/fast model for easy cases; fall back to Claude/GPT only when needed (see the sketch after this list).
    • Distillation - Have a big model generate labeled data, then train an open, smaller model (e.g., 7–13B) on it. You own/serve this smaller model (and can quantize it) for big savings.
  • Cloud-agnostic
    Run anywhere (local, k8s, AWS/Azure/GCP). No hard vendor lock-in for core logic. If you leave a cloud, your agent should come with you.
  • Lightweight & composable (“LEGO-style”)
    Small primitives you can rearrange. Clear boundaries between reason → act → observe, easy to add/remove tools, and simple to test.
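
A rough sketch of a model cascade (the model IDs and the escalation rule are illustrative placeholders; real routers use classifiers, logprobs, or eval scores):

    from litellm import completion

    SMALL, LARGE = "gpt-4o-mini", "claude-3-5-sonnet-20240620"  # example model IDs

    def ask(prompt: str) -> str:
        """Try the small model first; escalate to the large model only if it punts."""
        msgs = [{"role": "user", "content": prompt}]
        draft = completion(model=SMALL, messages=msgs).choices[0].message.content
        if "i don't know" in draft.lower() or len(draft) < 20:  # naive escalation heuristic
            return completion(model=LARGE, messages=msgs).choices[0].message.content
        return draft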



Jun 22, 2025

How AI Agents Work: LLMs, Tool Use, MCP, and Agent Frameworks

 As large language models (LLMs) become increasingly capable, AI agents have emerged as powerful systems that combine language understanding with real-world action. But what exactly is an AI agent? How do LLMs fit into the picture? And how can developers build agents that are modular, secure, and adaptable?

Let’s break it down—from the fundamentals of LLM-powered agents to protocols like MCP and frameworks like Strands and LangGraph.


What Is an AI Agent?

An AI agent is a system designed to execute tasks on behalf of a user. It combines a reasoning engine (typically an LLM) with an action layer (tools, APIs, databases, etc.) to understand instructions and carry out operations.

In this setup, the LLM acts as the agent’s “brain.” It interprets the user’s goal, breaks it down into logical steps, and decides which tools are needed to fulfill the task. The agent, in turn, sends the user’s goal to the LLM—along with a list of available tools such as vector search APIs, HTTP endpoints, or email services.

The LLM plans the workflow and returns instructions: which tool to call, what parameters to pass, and in what sequence to proceed. The agent executes those tool calls, collects results, and loops back to the LLM for further planning. This iterative loop continues until the task is fully completed.
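
Stripped to its essentials, that loop looks something like this (call_llm, describe, and the plan object are placeholders, not any specific framework's API):

    def run_agent(goal: str, tools: dict) -> str:
        """Generic reason -> act -> observe loop."""
        history = [{"role": "user", "content": goal}]
        while True:
            plan = call_llm(history, tool_specs=describe(tools))   # LLM decides the next step
            if plan.tool_name is None:                             # no tool needed: task is done
                return plan.text
            observation = tools[plan.tool_name](**plan.arguments)  # agent executes the tool call
            history.append({"role": "tool", "content": str(observation)})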

Importantly, agents maintain context over time—tracking prior steps, user input, and intermediate outputs—enabling them to handle complex, multi-turn tasks with coherence and adaptability.


Strands: A Model-Driven Agent Framework

The Strands Agent follows a model-driven approach, where the LLM is in charge of the logic and flow.

Instead of writing hardcoded logic like if this, then do that, the developer provides the LLM with:

  • Clear system and user prompts

  • A list of tools the agent can access

  • The overall task context

The LLM uses its reasoning and planning capabilities to decide which tools to call, how to call them, and in what sequence. This makes the agent dynamic and adaptive, rather than rigidly tied to predefined control paths.

In Strands, the agent's core responsibility is to execute tool calls, maintain memory, and facilitate the LLM's decisions. The LLM, in turn, drives the workflow using instructions encoded in each step.


A Program-Driven Flow Engine (LangGraph)

LangGraph is a state machine framework built on top of LangChain that allows developers to define agent workflows as directed graphs. Each node in the graph represents a function—often LLM-powered—and edges define how data flows from one node to the next.

By default, LangGraph follows a program-driven approach. The developer defines:

  • The graph structure (workflow)

  • The behavior of each node (LLM call, tool call, decision logic)

  • The conditions that determine transitions between nodes

While LLMs can be used inside nodes to reason or generate text, they do not control the overall execution flow. That logic is handled programmatically, making LangGraph ideal for scenarios where control, reliability, and testing are critical.
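
A small sketch of that program-driven control flow (call_model and run_tools are placeholder node functions you would write yourself):

    from langgraph.graph import StateGraph, MessagesState, START, END

    def needs_tool(state: MessagesState) -> str:
        # Plain Python decides the next node; the LLM only fills in the messages.
        last = state["messages"][-1]
        return "tools" if getattr(last, "tool_calls", None) else "end"

    workflow = StateGraph(MessagesState)
    workflow.add_node("agent", call_model)
    workflow.add_node("tools", run_tools)
    workflow.add_edge(START, "agent")
    workflow.add_conditional_edges("agent", needs_tool, {"tools": "tools", "end": END})
    workflow.add_edge("tools", "agent")
    app = workflow.compile()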


Program-Driven Agents Without a Framework

Not all agents need a dedicated framework. In many cases, developers can build lightweight, program-driven agents using plain code and selective use of LLMs.

In this approach:

  • The developer writes the full control logic

  • The LLM is used at specific points—for summarization, classification, interpretation, etc.

  • All tool interactions (e.g., API calls, database queries) are handled directly in code

  • The LLM does not control which tools to use or what happens next

This model gives developers maximum control and is well-suited for building LLM-in-the-loop systems, where the language model acts more like a helper than a planner.
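
A small sketch of this pattern (the routing helpers are hypothetical; any LLM client would do in place of litellm):

    from litellm import completion

    def handle_ticket(ticket_text: str) -> None:
        """Developer-written control flow; the LLM is consulted only for classification."""
        label = completion(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": f"Classify as 'billing' or 'technical': {ticket_text}"}],
        ).choices[0].message.content.strip().lower()

        # The code, not the model, decides what happens next.
        if "billing" in label:
            route_to_billing_queue(ticket_text)    # hypothetical helper
        else:
            route_to_technical_queue(ticket_text)  # hypothetical helper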


MCP: A Standard Protocol for Tool Use

As agents grow more sophisticated, one challenge becomes clear: How do you standardize how agents call tools, especially across different frameworks or LLMs?

That’s where MCP (Model Context Protocol) comes in.

MCP standardizes how an AI agent interacts with tools, providing a consistent interface for invoking external systems. Whether the agent is built in Python, JavaScript, or another environment—and whether it uses GPT-4, Claude, or another LLM—MCP allows all of them to access the same tools in a uniform way.

An MCP server can also enforce important security and operational rules: access controls, rate limits, input/output validation, and more. Developers can build reusable libraries of MCP-compatible tools that plug seamlessly into any agent. Without MCP, every tool would require a custom integration, making development slower, error-prone, and hard to scale.
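
As a concrete example, here is a rough sketch of a tool server built with the official MCP Python SDK (package name mcp; the server name is arbitrary). Any MCP-capable agent or client can then discover and call greet over the protocol:

    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("greeting-tools")

    @mcp.tool()
    def greet(name: str) -> str:
        """Greets a person by name."""
        return f"Hello, {name}"

    if __name__ == "__main__":
        mcp.run()  # serves the tool over stdio so any MCP client can connect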