Building an AI agent that can reason and use tools requires more than just a powerful LLM. Ever wondered what's happening behind the scenes of a conversational AI agent? This post compares four approaches so you can see where your agent's conversation state lives, how tools are managed, and which option offers the most flexibility.
OpenAI Responses API
- The OpenAI Responses API is a unified interface for building powerful, agent-like applications. It's an evolution of Chat Completions, which has no server-side state and therefore requires you to resend the full history on every turn.
from openai import OpenAI

client = OpenAI()

resp1 = client.responses.create(
    model="gpt-4o-mini",
    input=[
        {
            "role": "system",
            "content": "You are a helpful assistant. Use tools when needed.",
        },
        {"role": "user", "content": user_question},
    ],
    tools=tools,
    parallel_tool_calls=True,
)
- Conversation state lives on OpenAI’s servers (for OpenAI-hosted models) when you pass previous_response_id. If you point your OpenAI client at a proxy (e.g., LiteLLM), state (if any) is maintained by the proxy, not OpenAI. Pass previous_response_id to link to prior turns without resending them.
resp2 = client.responses.create(
    model="gpt-4o-mini",
    input=tool_outputs,
    tools=tools,                    # keep tools if follow-up calls might happen
    previous_response_id=resp1.id,  # <-- important: links this call to the prior turn
)
- Each turn re-bills the effective prompt (prior items + new items). You may get prompt caching discounts for repeated prefixes.
- You execute tools and feed results back in a follow-up call.
request → model returns function_call
→ you run the tool(s) → send function_call_output
→ repeat until no more tool calls → final answer.

Each output item tells you what to do next:

if response_output["type"] == "function_call":
    function_name = response_output["name"]
    function_args = response_output["arguments"]  # arguments arrive as a JSON string
if response_output["type"] == "message":
    ...  # an assistant message with content blocks
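A minimal sketch of that loop, assuming tools is a list of function-tool schemas and run_tool is your own (hypothetical) dispatcher; the field names follow the Responses API's function_call / function_call_output items:

import json

def answer_with_tools(client, user_question, tools, run_tool):
    """Drive the Responses API tool loop until the model stops calling tools."""
    resp = client.responses.create(
        model="gpt-4o-mini",
        input=[{"role": "user", "content": user_question}],
        tools=tools,
    )
    while True:
        tool_outputs = []
        for item in resp.output:
            if item.type == "function_call":
                args = json.loads(item.arguments)   # arguments arrive as a JSON string
                result = run_tool(item.name, args)  # your own dispatcher (assumed)
                tool_outputs.append({
                    "type": "function_call_output",
                    "call_id": item.call_id,
                    "output": json.dumps(result),
                })
        if not tool_outputs:          # no more tool calls -> final answer
            return resp.output_text
        resp = client.responses.create(   # follow-up turn, linked server-side
            model="gpt-4o-mini",
            input=tool_outputs,
            tools=tools,
            previous_response_id=resp.id,
        )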
- SDK provides built-in tracing & run history
- Native only to OpenAI (and Azure OpenAI). For other LLMs you’d need a proxy that emulates the Responses API.
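If you go the proxy route, the client-side change is just the base URL; a sketch assuming a LiteLLM proxy running locally on port 4000 (the URL and key are placeholders):

from openai import OpenAI

# Point the same SDK at a proxy that emulates the Responses API.
# Any server-side state then lives in the proxy, not with OpenAI.
client = OpenAI(
    base_url="http://localhost:4000",  # hypothetical LiteLLM proxy endpoint
    api_key="sk-proxy-key",            # placeholder key configured on the proxy
)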
AWS Bedrock Agents
- Fully managed AWS service, configured via console
- When you invoke a Bedrock Agent (via API or console), AWS establishes a runtime session for that user/conversation. The conversation history (prior user inputs, model responses, tool invocations, intermediate results) is stored on AWS infrastructure associated with that sessionId. When the agent calls a tool, the outputs are persisted in session state.
- Pay-per-use on AWS (per token + infra integration). Costs tied to Bedrock pricing. The AWS runtime decides what minimal state to pass back into the LLM, e.g., compressed summaries, selected tool outputs, prior reasoning steps. You don’t control (or see) the exact serialization; the idea is that AWS optimizes context-window management for you.
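A minimal sketch of invoking an agent with boto3 and reusing the same sessionId across turns; the agent/alias IDs and region are placeholders:

import boto3

runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

def ask_agent(text, session_id):
    # Reusing the same session_id across calls keeps the conversation state server-side.
    response = runtime.invoke_agent(
        agentId="AGENT_ID",             # placeholder
        agentAliasId="AGENT_ALIAS_ID",  # placeholder
        sessionId=session_id,
        inputText=text,
    )
    # The completion comes back as an event stream of chunks.
    return "".join(
        event["chunk"]["bytes"].decode("utf-8")
        for event in response["completion"]
        if "chunk" in event
    )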
- Tools are configured in AWS console (e.g., Lambdas, Step Functions). Execution handled natively by Bedrock runtime.
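For a Lambda-backed action group, the handler receives the agent's tool call and returns a structured result. A rough, assumption-heavy sketch of the function-details style payload (verify the exact event/response shape against the Bedrock docs; get_weather is a hypothetical tool):

def lambda_handler(event, context):
    # Bedrock passes the action group, function name, and the parameters it extracted.
    function_name = event.get("function")
    params = {p["name"]: p["value"] for p in event.get("parameters", [])}

    if function_name == "get_weather":   # hypothetical tool
        body = f"Sunny in {params.get('city', 'unknown')}"
    else:
        body = "Unknown function"

    # The result is wrapped in the response envelope Bedrock expects.
    return {
        "messageVersion": "1.0",
        "response": {
            "actionGroup": event.get("actionGroup"),
            "function": function_name,
            "functionResponse": {
                "responseBody": {"TEXT": {"body": body}}
            },
        },
    }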
- Integrated with AWS CloudWatch/X-Ray
- Bedrock-hosted models only (Anthropic Claude, Meta Llama, Mistral, etc.)
Strands Agent
- A Python agent runtime (SDK) that runs in your process. You bring any model (OpenAI, Bedrock via LiteLLM, etc.), and Strands orchestrates prompts, tools, and streaming.
from strands import Agent

agent = Agent(
    model=model,
    tools=tools,
    system_prompt=system_prompt,
)
answer = agent(user_input)
- Conversation state lives in your app. Strands holds the working memory/trace during a run; you decide what to persist (DB, Redis, files). If your underlying model/proxy also supports server state, you can choose to use it, but Strands doesn’t require it.
- You only pay for tokens you actually send to the underlying model. No separate “state storage” cost; total cost depends on how much context you include.
- You register functions (schemas); Strands drives the reason → act → observe loop, runs tools, and feeds results back to the model. Parallelization is under your control.
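Registering a tool is roughly a decorated function; a minimal sketch assuming the strands @tool decorator and a hypothetical get_weather helper:

from strands import Agent, tool

@tool
def get_weather(city: str) -> str:
    """Return the current weather for a city."""  # docstring is used as the tool description
    return f"Sunny in {city}"                     # stubbed result for illustration

agent = Agent(model=model, tools=[get_weather])   # `model` as configured in the snippet above
answer = agent("What's the weather in Paris?")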
- Rich observability (structured logs, OpenTelemetry)
- Vendor-agnostic. Use OpenAI, Bedrock (Claude), local models, etc. via adapters (e.g., LiteLLMModel).
LangGraph
- A Python agent graph runtime (SDK) that runs in your process. You model the agent as a graph (nodes = LLM/tool/human steps; edges = control flow). Use prebuilt agents like create_react_agent or compose your own nodes/routers. Works fine with OpenAI, Bedrock, local LLMs, etc.
- Conversation state lives in your app via LangGraph checkpointers. You pass a thread_id and a checkpointer (in-memory, SQLite/Postgres, or the hosted LangGraph Platform). LangGraph restores prior turns/working memory automatically. If your model/proxy has server state, you can use it, but LangGraph doesn’t require it—you choose what to persist (messages, summaries, tool outputs).
from langgraph.prebuilt import create_react_agent
from langgraph.graph import StateGraph, START, END

agent = create_react_agent(
    model=llm,
    tools=tools,
    prompt=(
        "You are a helpful assistant. Use the tools provided to answer questions. "
        "If you don't know the answer, use your tools."
    ),
)

# Define the graph with a state machine (AgentState is your state schema).
workflow = StateGraph(AgentState)
workflow.add_node("agent", agent)
workflow.add_edge(START, "agent")
workflow.add_edge("agent", END)  # In a simple case, the agent node can go directly to END.

# Compile the graph
app = workflow.compile()
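To get the persistence described above, compile with a checkpointer and pass a thread_id per conversation; a minimal sketch using the in-memory checkpointer (swap in a SQLite/Postgres saver for real persistence), assuming the workflow from the snippet above:

from langgraph.checkpoint.memory import MemorySaver

# Compile the same workflow with a checkpointer so prior turns are restored per thread.
app = workflow.compile(checkpointer=MemorySaver())

config = {"configurable": {"thread_id": "user-123"}}  # one thread per conversation

# First turn
app.invoke({"messages": [("user", "What's the weather in Paris?")]}, config)
# A later turn on the same thread_id: LangGraph rehydrates the earlier messages.
result = app.invoke({"messages": [("user", "And tomorrow?")]}, config)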
- You only pay for tokens sent to the underlying model. There’s no separate “state storage” cost from LangGraph itself. Your total cost depends on how much context you rehydrate per turn (and any DB/Platform you choose for persistence).
- You register tools (functions/schemas) and LangGraph drives the reason → act → observe loop. Tools can be simple Python callables or LangChain @tools. Prebuilt ReAct agents or custom graphs will invoke tools and feed results back to the model, with support for loops, branching, retries, timeouts, and parallelization via concurrent branches/map nodes—under your control.
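A tool can be as small as a decorated function; a sketch using LangChain's @tool with a trivial multiply example:

from langchain_core.tools import tool

@tool
def multiply(a: int, b: int) -> int:
    """Multiply two integers."""  # docstring becomes the tool description

    return a * b

agent = create_react_agent(model=llm, tools=[multiply])  # `llm` as configured above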
- Rich observability. LangGraph Studio (local or Platform) provides a visual graph, step-level inputs/outputs, token/cost traces, and checkpoint “time-travel” to replay from any step. Plays well with your logging/metrics stack.
- Vendor-agnostic. Use OpenAI, AWS Bedrock (Claude), Google, local/Ollama, etc., through LangChain adapters; swap models without rewriting your graph.
What “good” looks like for an agent framework
Your bar should be simple: pick the stack that’s composable, observable, portable, and cheap to change.
- Plug-and-play with the rest of your stack: Must integrate cleanly with eval (Ragas/DeepEval), observability (Langfuse/Helicone/OTel), and your data/vectors (pgvector, Weaviate, Pinecone, Redis), without adapters that fight each other.
- Standard, first-class observability: Step-level traces, token/cost accounting, latency/error breakdowns, replay/time-travel, export to OpenTelemetry. If you can’t answer “what happened and why?” in one place, it won’t survive prod.
- Model-agnostic: Swap OpenAI ↔ Bedrock/Anthropic ↔ local (Ollama/vLLM) with minimal code changes.
- Model routing / cascades: Use a small/fast model for easy cases; fall back to Claude/GPT only when needed (see the sketch after this list).
- Distillation: Have a big model generate labeled data, then train an open, smaller model (e.g., 7–13B) on it. You own/serve this smaller model (and can quantize it) for big savings.
- Cloud-agnostic: Run anywhere (local, k8s, AWS/Azure/GCP). No hard vendor lock-in for core logic. If you leave a cloud, your agent should come with you.
- Lightweight & composable (“LEGO-style”): Small primitives you can rearrange. Clear boundaries between reason → act → observe, easy to add/remove tools, and simple to test.
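A routing cascade can be as simple as a wrapper that tries a cheap model first and escalates when the answer looks weak; a rough sketch assuming hypothetical call_small / call_large clients and a confidence judge you supply:

def cascade(question, call_small, call_large, confidence):
    """Try the small/fast model first; escalate only when its answer looks weak."""
    draft = call_small(question)            # e.g., a local 7-13B model or a mini-tier model
    if confidence(question, draft) >= 0.8:  # threshold is a tunable assumption
        return draft
    return call_large(question)             # fall back to Claude/GPT for hard cases

# Usage: plug in your own model clients and a judge (heuristic or LLM-based).
# answer = cascade(q, call_small=small_llm, call_large=big_llm, confidence=judge)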