Apr 12, 2026

Beyond Last-K Turns: Building Memory That Actually Thinks

Every multi-turn AI agent needs memory. The simplest implementation is obvious: load the last N turns of conversation before each call, then append the new turn after.
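That naive last-K approach fits in a few lines; a minimal sketch (class and method names here are illustrative, not from any particular framework):

```python
from collections import deque


class LastKMemory:
    """Naive conversation memory: keep only the most recent K turns."""

    def __init__(self, k: int = 10):
        self.turns = deque(maxlen=k)  # older turns silently fall off the front

    def load(self) -> list:
        # Returned before each model call as the conversation context.
        return list(self.turns)

    def append(self, role: str, content: str) -> None:
        # Called after each call to record the new turn.
        self.turns.append({"role": role, "content": content})
```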

LangGraph and Agent Frameworks: Using the Right Tool for the Job

There's a common trap when building AI-powered pipelines: reaching for an agentic framework because the problem feels “intelligent,” even when the solution is fundamentally deterministic. This post walks through a document ingestion system where that mistake shows up—and what the right mental model looks like.


The System: Ingesting Documents at Scale

The pipeline processes documents at scale—loading files from storage, extracting structured metadata via an LLM, enriching that metadata against external systems, and indexing everything into a vector store and document store for downstream retrieval.

The flow looks like this:

Object storage / local filesystem

list_documents

[per document]
load → classify → chunk → embed → extract_metadata → enrich → store → archive

Simple enough on paper. The complexity comes from two questions:

  1. How do you orchestrate deterministic steps cleanly?
  2. Where does the LLM fit in—and how?

The system uses two patterns to answer these: a graph-based workflow engine for orchestration and agent-based execution for LLM-driven tasks. Understanding when to use each is key.


LangGraph: When the Path Is Known

LangGraph is a workflow engine built on top of LangChain. Its core primitive is a directed graph where nodes are Python functions and edges define allowed transitions. State flows through the graph as a typed dictionary.

Here’s a simplified version of the ingestion graph:

from typing import TypedDict

from langgraph.graph import END, StateGraph

class IngestionState(TypedDict, total=False):
    # Fields are illustrative; the real state carries document payloads and metadata.
    path: str
    chunks: list
    metadata: dict

workflow = StateGraph(IngestionState)

workflow.add_node("load_document", load_document)
workflow.add_node("classify_document", classify_document)
workflow.add_node("chunk_document", chunk_document)
workflow.add_node("embed_chunks", embed_chunks_node)
workflow.add_node("extract_metadata", extract_metadata_node)
workflow.add_node("enrich_metadata", enrich_metadata_node)
workflow.add_node("store_embeddings", store_embeddings_node)
workflow.add_node("store_summary", store_summary_node)
workflow.add_node("archive_document", archive_document)
workflow.add_node("skip_document", skip_document)

workflow.set_entry_point("load_document")
workflow.add_edge("load_document", "classify_document")

workflow.add_conditional_edges(
    "classify_document",
    should_process,
    {"process": "chunk_document", "skip": "skip_document"},
)

workflow.add_edge("chunk_document", "embed_chunks")
workflow.add_edge("embed_chunks", "extract_metadata")
workflow.add_edge("extract_metadata", "enrich_metadata")
workflow.add_edge("enrich_metadata", "store_embeddings")
workflow.add_edge("store_embeddings", "store_summary")
workflow.add_edge("store_summary", "archive_document")
workflow.add_edge("archive_document", END)
workflow.add_edge("skip_document", END)

graph = workflow.compile()

What this gives you:

  • Explicit control flow: Every transition is defined in code.
  • Typed state management: Each node declares inputs and outputs.
  • Deterministic branching: Conditions are pure Python—no LLM needed.
  • Composability: Easy to wrap per-document flows into batch processing.

Mental model: Use LangGraph when you know the answer to “what happens next?”
If the pipeline topology is fixed, a deterministic DAG is the right tool.
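The composability point is concrete: once compiled, the graph is just a callable, so batch processing is a plain loop around it. A sketch (the error-handling policy and the `path` state key are illustrative assumptions):

```python
def run_batch(graph, paths):
    """Run the per-document graph over a batch, collecting results."""
    results = []
    for path in paths:
        # Each document gets a fresh state; one failure doesn't stop the batch.
        try:
            results.append(graph.invoke({"path": path}))
        except Exception as exc:
            results.append({"path": path, "error": str(exc)})
    return results
```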


Agent Frameworks: When the LLM Decides the Path

Agent frameworks introduce a different execution model: the LLM drives control flow by choosing tools, interpreting results, and deciding what to do next.

The Right Use: Orchestrator with Tools

At query time, an orchestrator agent can route user questions to specialized downstream components, each exposed as a tool.

Example pattern:

def build_tools():
    return [
        make_tool("query_domain_a"),
        make_tool("query_domain_b"),
        make_tool("synthesize_results"),
    ]

At runtime, the LLM decides:

  • Should it call one tool or multiple?
  • Does it need to combine results?
  • Does it need to resolve entities first?

This kind of routing depends on semantic understanding, not deterministic rules.

No static DAG can reliably express this.

Mental model: Use an agent when the path depends on meaning the LLM must interpret.
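Under the hood, the agent execution model is a loop the framework runs for you. A minimal sketch, with a stubbed `llm_decide` standing in for the model's tool-choice step (all names here are illustrative):

```python
def run_agent(llm_decide, tools, question, max_steps=5):
    """Minimal agent loop: the LLM picks the next tool until it can answer."""
    history = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        decision = llm_decide(history)  # a tool call or a final answer
        if decision["type"] == "final":
            return decision["content"]
        # Execute the chosen tool and feed the result back into the context.
        result = tools[decision["tool"]](decision["args"])
        history.append({"role": "tool", "tool": decision["tool"], "content": result})
    raise RuntimeError("agent did not converge")
```

The branching lives inside `llm_decide`, not in the graph topology, which is exactly why a static DAG cannot express it.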


A Valid Use Case: Enrichment with Tool Interaction

In the enrichment step, an agent can call external systems (e.g., registries or APIs), interpret responses, and resolve ambiguity.

agent = Agent(
    model=model,
    system_prompt=system_prompt,
    tools=tools,
)

response = agent(enrichment_request)

This is justified when:

  • Tool results may be ambiguous
  • Multiple calls may be needed
  • The LLM must reason about correctness

It’s worth monitoring in production, though: if the agent ends up always making exactly one tool call, a direct structured call may be the better pattern.


The Anti-Pattern: Agent as a Thin Wrapper

A common mistake is using an agent for simple, single-step tasks:

agent = Agent(
    model=model,
    system_prompt=prompt,
)

response = agent(chunk)
parsed = parse_json(response)

No tools. No iteration. No decision-making.

This is just a prompt → JSON call with unnecessary overhead.

Problems:

  • Added latency from agent loop setup
  • Repeated overhead for each chunk
  • Fragile parsing logic
  • No strong structure guarantees

The Better Approach: Structured LLM Calls

Use direct structured output instead:

from langchain_core.messages import HumanMessage, SystemMessage

# SomeLLM stands in for any chat model that supports structured output.
llm = SomeLLM(model="...", temperature=0.2)
chain = llm.with_structured_output(MySchema)  # MySchema: Pydantic model or TypedDict

result = chain.invoke([
    SystemMessage(content=system_prompt),
    HumanMessage(content=chunk),
])

Benefits:

  • Strong typing via schema validation
  • No manual parsing
  • Lower latency
  • Simpler execution model
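The `MySchema` above can be a Pydantic model or a `TypedDict`; a hypothetical metadata schema (field names are illustrative) might look like:

```python
from typing import TypedDict


class MySchema(TypedDict):
    """Structured metadata extracted from a document chunk."""
    title: str          # document title, if present
    doc_type: str       # e.g. "invoice", "contract", "report"
    entities: list[str] # named entities found in the chunk
```

`with_structured_output` uses the schema to constrain the model's output, so validation replaces manual JSON parsing.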

The Decision Framework

Does control flow depend on meaning the LLM must interpret?

├─ NO → Use LangGraph (or plain code)
│       Fixed steps, deterministic branching
│       Examples: ETL, document pipelines
│
└─ YES → Does the LLM need tools or iteration?

    ├─ NO → Use direct structured LLM call
    │       Prompt → structured output
    │       Examples: extraction, classification
    │
    └─ YES → Use an agent
             Tool selection + reasoning loop
             Examples: routing, research, disambiguation

When each layer does only its job, the system becomes simpler, faster, and easier to reason about.

Apr 11, 2026

Managing Tool Output: Avoiding Context Explosion in Agent Systems


While reviewing and optimizing agent execution, another important issue surfaced:

👉 Tool outputs can silently bloat the context

Even with perfect planning and parallel execution, performance can degrade if the data flowing into the model is too large.


🧠 The Problem: Context Growth Over Cycles

In agent workflows, especially with chaining:

Cycle 1 → tool output  
Cycle 2 → tool output + previous data  
Cycle 3 → tool output + accumulated data  

👉 Context keeps growing with each step


🚨 Why this is a problem

  • Large payloads (nested JSON, unused fields)

  • Duplicate data across steps

  • Irrelevant fields carried forward

Impact

  • Increased token usage

  • Slower LLM response time

  • Higher cost

  • Greater chance of confusion or incorrect field usage


🔍 Root Cause

Tools typically return:

  • full API responses

  • deeply nested structures

  • more data than required

The LLM then:

  • has to sift through everything

  • often carries forward unnecessary data


🚀 Improvements

1. Let the LLM discard unnecessary data (lightweight fix)

Instruct the model to:

  • extract only required fields

  • ignore irrelevant data

👉 Helps, but not always reliable for large payloads
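This fix is purely prompt-side; one way to apply it is to append a standing instruction to the system prompt (the wording below is illustrative, not a tested prompt):

```python
TOOL_OUTPUT_POLICY = (
    "After each tool call, extract only the fields needed to answer the "
    "user's question and restate them briefly. Do not copy raw tool JSON "
    "into your reasoning or your final answer."
)


def with_output_policy(system_prompt: str) -> str:
    """Append the tool-output discipline instruction to a system prompt."""
    return system_prompt + "\n\n" + TOOL_OUTPUT_POLICY
```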


2. Add intelligence at the tool layer (stronger fix)

Instead of returning raw responses:

  • Return only relevant fields

  • Flatten nested structures

  • Provide clean, minimal data

👉 Similar to how GraphQL works:

  • client specifies what it needs

  • response includes only that
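At the tool layer this can be as simple as projecting the raw API response onto the fields the agent actually needs. A sketch, assuming a plain nested-dict payload (the dotted-path convention and field names are hypothetical):

```python
def project_fields(payload: dict, fields: list[str]) -> dict:
    """Flatten a nested API response down to only the requested dotted paths."""
    out = {}
    for path in fields:
        node = payload
        for key in path.split("."):
            if not isinstance(node, dict) or key not in node:
                node = None  # path missing: drop it rather than fail
                break
            node = node[key]
        if node is not None:
            out[path] = node
    return out
```

A tool then returns `project_fields(api_response, ["order.id", "order.status"])` instead of the full payload, mirroring GraphQL's client-specified selection.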


✅ Target Pattern

Tool → minimal structured output → LLM → format response

Instead of:

Tool → large raw JSON → LLM → filter + format

🎯 Final Thought

Efficient agents don’t just call the right tools —
they also control what data comes back from them.


From Reactive Chaos to Planned Parallelism: Optimizing a Bedrock Agent

Reviewing a Bedrock Agent: From “Works” to “Works Efficiently”

I recently reviewed a Bedrock agent that was functionally correct — it answered queries, used tools properly, and produced accurate results.

👉 But it wasn’t efficient.

This is a quick summary of what was happening and what improved.


🧠 Initial Behavior: Reactive Execution

The agent followed a step-by-step loop:

Cycle 1 → resolve date via tool  
Cycle 2 → fetch primary data  
Cycle 3 → fetch additional data  
Cycle 4 → fetch metadata  
Cycle 5 → final response  

What was happening?

  • No upfront planning

  • Data discovered incrementally

  • Each missing piece triggered another call

  • All operations were sequential


⏱️ Why a time tool existed

The agent handled queries like:

“first 10 days of last month”

Since LLMs don’t reliably know the current date, a time tool was added to:

  • ensure correct date calculations

  • avoid inconsistent outputs

👉 It solved correctness
👉 But added an extra cycle every time


🚨 Core Issue

The agent was reactive instead of planned

Do something → realize missing data → do more → repeat

Instead of:

Understand everything → execute once

🚀 Improvements

1. Provide current date directly

  • Removed dependency on time tool

  • Eliminated one full cycle
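Injecting the date is a one-line change to prompt construction; a sketch (the prompt wording is illustrative):

```python
from datetime import date


def build_system_prompt(base_prompt: str) -> str:
    """Prepend today's date so the model never needs a time tool."""
    return f"Today's date is {date.today().isoformat()}.\n\n{base_prompt}"
```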


2. Upfront planning

The agent now:

  • identifies all required data

  • plans execution before acting


3. Parallel execution

Independent data is now fetched together instead of sequentially
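Issuing the independent calls together can be sketched with `asyncio.gather`; this is a generic pattern, not the Bedrock API itself, and the tool names are illustrative:

```python
import asyncio


async def fetch_all(tools: dict, requests: list) -> list:
    """Run independent tool calls concurrently instead of one per cycle."""
    tasks = [tools[name](args) for name, args in requests]
    return await asyncio.gather(*tasks)
```

Dependent data is then fetched in a second round, using the results of the first.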


4. Dependency awareness

  • Independent data → parallel

  • Dependent data → separate step only when required


✅ Final Execution Patterns

No dependency

Cycle 1 → fetch all data (parallel)  
Cycle 2 → final response  

With dependency

Cycle 1 → fetch base data (parallel)  
Cycle 2 → fetch dependent data  
Cycle 3 → final response