Intermediate Optimization

Context Window Management

~30 min to implement 📦 Requires: Dakera v0.11+

Even with 200k-token context windows, naively dumping all recalled memories into every prompt is expensive, slow, and degrades response quality. The right approach is selective loading: score each candidate memory by a composite of relevance, recency, and importance, then fill the memory budget starting from the top — cutting off when the budget is exhausted. This pattern gives you maximum signal in minimum tokens.

Prerequisites
  • Dakera instance running (quickstart)
  • SDK installed: pip install dakera / npm i @dakera-ai/dakera
  • Understanding of your LLM's token limits and cost structure
  • A token counting strategy — either a dedicated tokenizer library or character-based approximation (1 token ≈ 4 characters)
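
As one way to set up the token counting strategy from the last prerequisite, the sketch below prefers the tiktoken library when it is available and falls back to the 4-characters-per-token approximation. The `cl100k_base` encoding and the helper names `approx_tokens` and `count_tokens` are assumptions for illustration; pick the encoding that matches your model.

```python
# Token counter sketch: exact counts via tiktoken when available,
# otherwise the rough "1 token is about 4 characters" approximation.
try:
    import tiktoken

    # Assumption: cl100k_base; choose the encoding that matches your model.
    _ENCODING = tiktoken.get_encoding("cl100k_base")
except Exception:
    # tiktoken is not installed, or its encoding data could not be loaded.
    _ENCODING = None


def approx_tokens(text: str) -> int:
    """Character-based estimate: roughly 4 characters per token, minimum 1."""
    return max(1, len(text) // 4)


def count_tokens(text: str) -> int:
    """Use the tokenizer when present, else fall back to the estimate."""
    if _ENCODING is not None:
        return max(1, len(_ENCODING.encode(text)))
    return approx_tokens(text)


if __name__ == "__main__":
    sample = "User prefers Rust for systems code."
    print(approx_tokens(sample), count_tokens(sample))
```

Keeping both functions behind one `count_tokens` entry point means the rest of the budgeting code does not need to know which strategy is active.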

The Problem

Production agents with persistent memory face a compounding pressure: memory stores grow over time, but context windows don't. Three concrete failure modes emerge:

  • Prompt cost explosion: An agent with 200 stored memories that naively injects all of them into every prompt spends 40,000+ tokens per call on context alone. At roughly $3 per million input tokens, that's about $0.12/call just for memories, before the actual user message and response.
  • Context dilution: When 200 memories are packed into a prompt, the LLM's effective attention on any individual memory drops. High-quality, critical memories compete for attention with low-relevance noise. Response quality degrades measurably once context passes 60% utilization.
  • Latency amplification: Every additional 1,000 tokens in the context window adds ~120ms to Time to First Token (TTFT) on most cloud providers. An unmanaged memory injection adding 20k tokens means 2.4 extra seconds of latency per response.
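
The cost and latency figures above are simple linear estimates, and a small helper makes them easy to rerun with your own numbers. This is a rough sketch: the function name `context_overhead` is hypothetical, and the price of $3 per million input tokens is an assumption chosen because it reproduces the $0.12 figure above. Substitute your provider's actual rate and measured latency.

```python
def context_overhead(
    tokens: int, usd_per_million_tokens: float, ms_per_1k_tokens: float
) -> tuple[float, float]:
    """Return (extra cost in USD, extra TTFT in seconds) for injected context."""
    cost_usd = tokens / 1_000_000 * usd_per_million_tokens
    latency_s = tokens / 1_000 * ms_per_1k_tokens / 1_000
    return cost_usd, latency_s


# 40,000 tokens of memories at an assumed $3 per million input tokens
cost, _ = context_overhead(40_000, usd_per_million_tokens=3.0, ms_per_1k_tokens=120)
# 20,000 extra tokens at ~120 ms per 1,000 tokens
_, latency = context_overhead(20_000, usd_per_million_tokens=3.0, ms_per_1k_tokens=120)

print(f"${cost:.2f} per call, +{latency:.1f}s TTFT")  # $0.12 per call, +2.4s TTFT
```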

Architecture

This pattern implements a priority queue with a token budget cap. Dakera recalls a broad candidate set (top_k=50) ranked by semantic relevance. Each candidate is then scored by a composite function: score = (relevance_score × W_r) + (importance × W_i) + (recency_factor × W_rc). Memories are added to the context in score order until the memory token budget is exhausted. Critical memories (lessons, active goals) are pinned to always be included.
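
To make the composite formula concrete, here is a small worked example using the default weights (0.5, 0.3, 0.2). The two candidate memories and their factor values are hypothetical; the point is only to show that a recent, important memory can outrank one that is more semantically relevant.

```python
# Default weights from this guide: relevance 50%, importance 30%, recency 20%
WEIGHTS = {"relevance": 0.5, "importance": 0.3, "recency": 0.2}


def composite_score(relevance: float, importance: float, recency: float) -> float:
    """score = relevance * W_r + importance * W_i + recency * W_rc"""
    return round(
        relevance * WEIGHTS["relevance"]
        + importance * WEIGHTS["importance"]
        + recency * WEIGHTS["recency"],
        4,
    )


# Hypothetical candidates: (label, relevance, importance, recency_factor)
candidates = [
    ("old but on-topic note", 0.90, 0.40, 0.10),
    ("recent, important preference", 0.70, 0.90, 0.95),
]

ranked = sorted(candidates, key=lambda c: composite_score(*c[1:]), reverse=True)
for label, *factors in ranked:
    print(f"{composite_score(*factors):.2f}  {label}")
# 0.81  recent, important preference
# 0.59  old but on-topic note
```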

Context window budget allocation (128k tokens):
  • System prompt: 8k
  • Pinned memories (goals + lessons): 12k
  • Top-K memories (scored + trimmed to budget): 24k
  • Conversation history (last 4 turns): 20k
  • Reserved (user message + response): 64k

Memory priority queue (scored, trimmed at the 24k budget):
  • #1, score=0.96: "User is Maya Chen, Staff Engineer" (380 tokens)
  • #2, score=0.88: "Working on distributed tracing system" (290 tokens)
  • #3, score=0.77: "Prefers Rust for systems code" (240 tokens)
  • #4, score=0.61: TRIMMED (budget exhausted at 24k)

Context window budget allocation: system prompt + pinned memories + scored top-K memories + conversation history = total prompt. The scored memory block is trimmed when its cumulative token count hits the memory budget ceiling.

Composite scoring and budget fill process:
  • Recall broad: top_k=50, min_importance=0.3
  • Score each: relevance × 0.5 + importance × 0.3 + recency × 0.2, then sort descending
  • Budget fill: add top memories until the budget is hit
  • Inject into prompt: context block
Pinned memories (goals, lessons) are always included before scored memories, regardless of score.

Composite scoring and budget fill: broad recall provides candidates; composite scoring ranks them; budget fill adds memories from top until token limit is hit. Pinned memories are always included first.

Implementation Steps

  • Define your token budget allocation
    Before writing any code, map out your context window allocation: system prompt (6–10k tokens), pinned memories — goals and lessons (8–15k), scored memories (15–30k depending on model), conversation history (15–20k), user message (2–5k), response reservation (20–40k). Your scored memory budget = total window - all other allocations. Document this and keep it as a named constant.
  • Implement the composite scoring function
    Dakera returns a score (semantic relevance) with each memory. Combine it with importance from the memory metadata and a recency factor computed from created_at: recency = max(0, 1 - hours_since_created / decay_hours). Default weights: relevance 50%, importance 30%, recency 20%. Tune these for your use case — long-document analysis agents may want to reduce recency weight.
  • Add budget-aware trimming with token counting
    Sort memories by composite score descending. Iterate through the list, adding each memory's token estimate to a running total. Stop adding once the next memory would push the total past your memory budget. Use len(content) / 4 as a rough character-to-token approximation for fast local counting, or integrate the tiktoken library for precise counting in production.
  • Pin critical memories outside the scored budget
    Some memories must always be included regardless of relevance score: active goals with near deadlines, behavioral lessons from corrections, identity facts (who the user is). Recall these with separate queries using min_importance=0.90 and add them to the prompt first — they don't count against the scored memory budget. Keep pinned memory allocation tight (under 15k tokens total).
  • Monitor hit rates and tune regularly
    Track what percentage of your recalled memories actually make it into the final context (the hit rate). A hit rate below 60% means your broad recall is returning too many irrelevant results — increase min_importance to 0.5 or reduce top_k. A hit rate above 95% means you may be leaving useful memories on the table — increase your memory budget or lower min_importance.
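
The hit-rate thresholds from the last step can be turned into a simple check. This is a minimal sketch with hypothetical helper names; it takes the `scored_selected` and `scored_trimmed` counts that `build_memory_context` reports in its stats and applies the 60% and 95% thresholds described above.

```python
def hit_rate(selected: int, trimmed: int) -> float:
    """Share of recalled candidates that made it into the final context."""
    recalled = selected + trimmed
    return selected / recalled if recalled else 0.0


def tuning_hint(rate: float) -> str:
    """Map a hit rate to the tuning advice from the step above."""
    if rate < 0.60:
        return "raise min_importance or reduce top_k"
    if rate > 0.95:
        return "increase the memory budget or lower min_importance"
    return "no change needed"


# Example: 14 of 50 candidates fit into the budget
rate = hit_rate(selected=14, trimmed=36)
print(f"{rate:.0%}: {tuning_hint(rate)}")  # 28%: raise min_importance or reduce top_k
```

Logging this value per request, rather than computing it once, makes drift visible as the memory store grows.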

Tip: Reserve 30-40% of context for response

A common mistake is maximizing memory injection at the expense of response quality. For a 128k context model, reserve at least 30k tokens (about 23%) for a typical response; for long-form content generation, reserve 40–60k (roughly 31–47%), which is where the 30–40% guideline applies. Stuffing memories to 90% of context degrades output coherence and increases hallucination rates measurably. More memory is not always better.
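
One way to keep the reservation honest is to derive the scored-memory budget from the other allocations instead of hard-coding it. The sketch below uses the 128k example allocation from this guide; the function name `scored_memory_budget` is hypothetical.

```python
CONTEXT_WINDOW = 128_000

# Fixed allocations in tokens, mirroring the 128k example budget in this guide
ALLOCATIONS = {
    "system_prompt": 8_000,
    "pinned_memories": 12_000,
    "conversation": 18_000,
    "user_message": 3_000,
    "response_reserved": 67_000,
}


def scored_memory_budget(window: int, allocations: dict[str, int]) -> int:
    """Scored-memory budget = total window minus every other allocation."""
    remaining = window - sum(allocations.values())
    if remaining < 0:
        raise ValueError(f"allocations exceed the window by {-remaining} tokens")
    return remaining


SCORED_MEMORY_BUDGET = scored_memory_budget(CONTEXT_WINDOW, ALLOCATIONS)
response_share = ALLOCATIONS["response_reserved"] / CONTEXT_WINDOW

print(SCORED_MEMORY_BUDGET, f"{response_share:.0%}")  # 20000 52%
```

Raising any other allocation then shrinks the scored-memory budget automatically, and over-allocation fails loudly instead of silently overflowing the window.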

Implementation

cURL

# Step 1: Broad recall for candidate set
curl "http://localhost:3300/v1/memory/recall?agent_id=doc-analysis-agent&query=document+analysis+context+user+preferences+project&top_k=50&min_importance=0.3" \
  -H "Authorization: Bearer dk-..."

# Step 2: Separate query for pinned memories (high importance critical context)
curl "http://localhost:3300/v1/memory/recall?agent_id=doc-analysis-agent&query=active+goals+lessons+user+identity+critical+context&top_k=10&min_importance=0.9" \
  -H "Authorization: Bearer dk-..."

# Notes on token counting:
# Dakera returns content strings -- count tokens client-side
# Budget allocation example for 128k model:
#   system_prompt: 8,000 tokens
#   pinned_memories: 12,000 tokens
#   scored_memories: 20,000 tokens
#   conversation: 18,000 tokens
#   user_message: 3,000 tokens
#   response_reserved: 67,000 tokens
#   TOTAL: 128,000 tokens

Python

from dakera import DakeraClient
from datetime import datetime, timezone
from typing import NamedTuple
import math

client = DakeraClient(base_url="http://localhost:3300", api_key="dk-...")

# Token budget configuration for 128k context model
TOKEN_BUDGET = {
    "system_prompt": 8_000,
    "pinned_memories": 12_000,
    "scored_memories": 20_000,  # ← Our managed budget
    "conversation": 18_000,
    "user_message": 3_000,
    "response_reserved": 67_000,
}

# Scoring weights — tune for your use case
SCORING_WEIGHTS = {
    "relevance": 0.50,   # Dakera semantic similarity score
    "importance": 0.30,  # User-defined importance 0.0-1.0
    "recency": 0.20,     # Decay factor from created_at
}
RECENCY_DECAY_HOURS = 168  # 7 days: memories older than this get a recency factor of 0

class ScoredMemory(NamedTuple):
    content: str
    composite_score: float
    token_estimate: int
    memory_id: str

def estimate_tokens(text: str) -> int:
    """Conservative character-based token estimate. Use tiktoken for precision."""
    return max(1, len(text) // 4)

def compute_recency_factor(created_at_iso: str | None) -> float:
    """Returns 1.0 for fresh memories, decaying toward 0.0 over RECENCY_DECAY_HOURS."""
    if not created_at_iso:
        return 0.5  # Unknown age — neutral recency
    try:
        created = datetime.fromisoformat(created_at_iso.replace('Z', '+00:00'))
        now = datetime.now(timezone.utc)
        hours_old = (now - created).total_seconds() / 3600
        return max(0.0, 1.0 - hours_old / RECENCY_DECAY_HOURS)
    except (ValueError, TypeError):
        return 0.5

def score_memory(mem: dict, relevance_score: float) -> ScoredMemory:
    """Compute composite score for a single recalled memory."""
    importance = mem.get("importance", 0.5)
    recency = compute_recency_factor(mem.get("created_at"))

    composite = (
        relevance_score * SCORING_WEIGHTS["relevance"]
        + importance * SCORING_WEIGHTS["importance"]
        + recency * SCORING_WEIGHTS["recency"]
    )

    return ScoredMemory(
        content=mem["content"],
        composite_score=round(composite, 4),
        token_estimate=estimate_tokens(mem["content"]),
        memory_id=mem.get("id", "")
    )

def build_memory_context(
    agent_id: str,
    query: str,
    memory_budget_tokens: int = TOKEN_BUDGET["scored_memories"]
) -> dict:
    """
    Recall, score, and trim memories to fit within the token budget.
    Returns context string + usage statistics.
    """
    # Step 1: Pin critical memories (goals, lessons, identity)
    pinned_result = client.recall(
        agent_id=agent_id,
        query="active goals lessons critical identity constraints",
        top_k=10,
        min_importance=0.90
    )
    pinned_memories = pinned_result.get("memories", [])
    pinned_tokens = sum(estimate_tokens(m["content"]) for m in pinned_memories)

    # Step 2: Broad recall for scored candidates
    candidate_result = client.recall(
        agent_id=agent_id,
        query=query,
        top_k=50,
        min_importance=0.30
    )

    # Step 3: Score all candidates
    # This example approximates relevance from rank position rather than reading
    # a per-memory score: it assumes results arrive most-relevant first and maps
    # rank 1 to 1.0, decaying linearly toward 0.4 for the last result.
    scored: list[ScoredMemory] = []
    candidates = candidate_result.get("memories", [])
    for i, mem in enumerate(candidates):
        approx_relevance = 1.0 - (i / max(len(candidates), 1)) * 0.6
        scored.append(score_memory(mem, approx_relevance))

    # Step 4: Sort by composite score
    scored.sort(key=lambda m: m.composite_score, reverse=True)

    # Step 5: Fill budget greedily
    selected: list[ScoredMemory] = []
    tokens_used = 0
    trimmed_count = 0

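    for mem in scored:
        if tokens_used + mem.token_estimate > memory_budget_tokens:
            # Doesn't fit: count it as trimmed, but keep scanning because a
            # smaller, lower-ranked memory may still fit in the remaining budget.
            trimmed_count += 1
            continue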
        selected.append(mem)
        tokens_used += mem.token_estimate

    # Step 6: Build context strings
    pinned_context = "\n".join(m["content"] for m in pinned_memories)
    scored_context = "\n".join(m.content for m in selected)

    full_context = ""
    if pinned_context:
        full_context += f"[Critical context]\n{pinned_context}\n\n"
    if scored_context:
        full_context += f"[Relevant context]\n{scored_context}"

    return {
        "context": full_context,
        "stats": {
            "pinned_memories": len(pinned_memories),
            "pinned_tokens": pinned_tokens,
            "scored_selected": len(selected),
            "scored_trimmed": trimmed_count,
            "scored_tokens": tokens_used,
            "total_tokens": pinned_tokens + tokens_used,
            "budget_utilization": round(tokens_used / memory_budget_tokens, 3),
        }
    }

# --- Usage in a long document analysis agent ---
def analyze_document(agent_id: str, document_text: str, question: str) -> str:
    # Build memory context scoped to the question
    context = build_memory_context(agent_id, query=question, memory_budget_tokens=18_000)

    stats = context["stats"]
    print(f"Memory context: {stats['scored_selected']} memories, "
          f"{stats['total_tokens']:,} tokens, "
          f"{stats['budget_utilization']:.0%} budget used, "
          f"{stats['scored_trimmed']} memories trimmed")

    system_prompt = f"""You are a document analysis assistant.

{context['context']}

Document excerpt:
{document_text[:10000]}"""

    # return llm.complete(system=system_prompt, user=question)
    return f"[Context loaded: {stats['total_tokens']} tokens] Ready to answer: {question}"

result = analyze_document("doc-analysis-agent", "...", "What are the key risk factors?")
print(result)

TypeScript

import { DakeraClient } from '@dakera-ai/dakera';

const client = new DakeraClient({ baseUrl: 'http://localhost:3300', apiKey: 'dk-...' });

const TOKEN_BUDGET = {
  systemPrompt: 8_000,
  pinnedMemories: 12_000,
  scoredMemories: 20_000,
  conversation: 18_000,
  userMessage: 3_000,
  responseReserved: 67_000,
};

const SCORING_WEIGHTS = { relevance: 0.5, importance: 0.3, recency: 0.2 };
const RECENCY_DECAY_HOURS = 168; // 7 days

function estimateTokens(text: string): number {
  return Math.max(1, Math.floor(text.length / 4));
}

function computeRecencyFactor(createdAt?: string): number {
  if (!createdAt) return 0.5;
  const hoursOld = (Date.now() - new Date(createdAt).getTime()) / 3_600_000;
  return Math.max(0, 1 - hoursOld / RECENCY_DECAY_HOURS);
}

interface ScoredMemory {
  content: string;
  compositeScore: number;
  tokenEstimate: number;
  memoryId: string;
}

function scoreMemory(mem: { content: string; importance?: number; created_at?: string; id?: string }, rankPosition: number, totalMemories: number): ScoredMemory {
  const approxRelevance = 1.0 - (rankPosition / Math.max(totalMemories, 1)) * 0.6;
  const importance = mem.importance ?? 0.5;
  const recency = computeRecencyFactor(mem.created_at);

  const composite =
    approxRelevance * SCORING_WEIGHTS.relevance +
    importance * SCORING_WEIGHTS.importance +
    recency * SCORING_WEIGHTS.recency;

  return {
    content: mem.content,
    compositeScore: Math.round(composite * 10000) / 10000,
    tokenEstimate: estimateTokens(mem.content),
    memoryId: mem.id ?? '',
  };
}

async function buildMemoryContext(
  agentId: string,
  query: string,
  memoryBudget = TOKEN_BUDGET.scoredMemories
): Promise<{ context: string; stats: Record<string, number> }> {
  // Pinned memories (always included)
  const pinnedResult = await client.recall(agentId, 'active goals lessons critical identity', {
    top_k: 10,
    min_importance: 0.9,
  });
  const pinnedTokens = pinnedResult.memories.reduce((sum, m) => sum + estimateTokens(m.content), 0);

  // Broad candidate recall
  const candidates = await client.recall(agentId, query, { top_k: 50, min_importance: 0.3 });

  // Score and sort
  const scored: ScoredMemory[] = candidates.memories
    .map((m, i) => scoreMemory(m, i, candidates.memories.length))
    .sort((a, b) => b.compositeScore - a.compositeScore);

  // Budget fill
  const selected: ScoredMemory[] = [];
  let tokensUsed = 0;
  let trimmedCount = 0;

  for (const mem of scored) {
    if (tokensUsed + mem.tokenEstimate > memoryBudget) { trimmedCount++; continue; }
    selected.push(mem);
    tokensUsed += mem.tokenEstimate;
  }

  const pinnedBlock = pinnedResult.memories.map(m => m.content).join('\n');
  const scoredBlock = selected.map(m => m.content).join('\n');
  const context = [
    pinnedBlock ? `[Critical context]\n${pinnedBlock}` : '',
    scoredBlock ? `[Relevant context]\n${scoredBlock}` : '',
  ].filter(Boolean).join('\n\n');

  return {
    context,
    stats: {
      pinnedCount: pinnedResult.memories.length,
      pinnedTokens,
      selectedCount: selected.length,
      trimmedCount,
      scoredTokens: tokensUsed,
      totalTokens: pinnedTokens + tokensUsed,
      budgetUtilization: Math.round((tokensUsed / memoryBudget) * 1000) / 1000,
    },
  };
}

// Usage
const { context, stats } = await buildMemoryContext('doc-analysis-agent', 'key risk factors in contract', 18_000);
console.log(`Loaded ${stats.selectedCount} memories, ${stats.totalTokens} tokens (${(stats.budgetUtilization * 100).toFixed(0)}% of budget)`);

Rust

use dakera_rs::{Client, RecallRequest};
use std::time::{SystemTime, UNIX_EPOCH};

let client = Client::new("http://localhost:3300", "dk-...");

const MEMORY_BUDGET_TOKENS: usize = 20_000;
const RECENCY_DECAY_HOURS: f64 = 168.0;

fn estimate_tokens(text: &str) -> usize {
    (text.len() / 4).max(1)
}

fn compute_recency(hours_old: f64) -> f64 {
    (1.0 - hours_old / RECENCY_DECAY_HOURS).max(0.0)
}

// Broad candidate recall
let candidates = client.recall("doc-analysis-agent", RecallRequest {
    query: "document analysis context user preferences".into(),
    top_k: Some(50),
    min_importance: Some(0.3),
    ..Default::default()
}).await?;

// Score and rank memories
let mut scored: Vec<(f64, &str, usize)> = candidates.memories.iter()
    .enumerate()
    .map(|(i, mem)| {
        let total = candidates.memories.len() as f64;
        let relevance = 1.0 - (i as f64 / total.max(1.0)) * 0.6;
        let importance = mem.importance.unwrap_or(0.5) as f64;
        let recency = compute_recency(24.0); // Simplified -- use actual created_at in production
        let score = relevance * 0.5 + importance * 0.3 + recency * 0.2;
        let tokens = estimate_tokens(&mem.content);
        (score, mem.content.as_str(), tokens)
    })
    .collect();

scored.sort_by(|a, b| b.0.partial_cmp(&a.0).unwrap());

// Budget fill
let mut context_parts: Vec<&str> = Vec::new();
let mut tokens_used = 0usize;
let mut trimmed = 0usize;

for (_, content, tokens) in &scored {
    if tokens_used + tokens > MEMORY_BUDGET_TOKENS {
        trimmed += 1;
        continue;
    }
    context_parts.push(content);
    tokens_used += tokens;
}

let memory_context = context_parts.join("\n");
println!("Selected {} memories, {} tokens, {} trimmed", context_parts.len(), tokens_used, trimmed);

Go

client := dakera.NewClient("http://localhost:3300", "dk-...")
ctx := context.Background()

const memoryBudgetTokens = 20_000

func estimateTokens(text string) int {
    t := len(text) / 4
    if t < 1 { return 1 }
    return t
}

// Broad recall
candidates, _ := client.Recall(ctx, "doc-analysis-agent", dakera.RecallRequest{
    Query:         "document analysis context user preferences",
    TopK:          50,
    MinImportance: 0.3,
})

type scoredMem struct {
    content string
    score   float64
    tokens  int
}

total := float64(len(candidates.Memories))
scored := make([]scoredMem, len(candidates.Memories))
for i, mem := range candidates.Memories {
    relevance := 1.0 - float64(i)/math.Max(total, 1)*0.6
    importance := mem.Importance
    recency := 0.7 // Simplified -- compute from created_at in production
    compositeScore := relevance*0.5 + importance*0.3 + recency*0.2
    scored[i] = scoredMem{mem.Content, compositeScore, estimateTokens(mem.Content)}
}

sort.Slice(scored, func(i, j int) bool { return scored[i].score > scored[j].score })

// Budget fill
var selected []string
tokensUsed, trimmed := 0, 0
for _, m := range scored {
    if tokensUsed+m.tokens > memoryBudgetTokens {
        trimmed++
        continue
    }
    selected = append(selected, m.content)
    tokensUsed += m.tokens
}

memoryContext := strings.Join(selected, "\n")
fmt.Printf("Selected %d memories, %d tokens, %d trimmed\n", len(selected), tokensUsed, trimmed)

Real-World Scenario: Long Document Analysis Agent

A legal research agent analyzes contracts and regulatory documents. It serves 50 attorneys who each have months of interaction history. Naively loading all memories per request was costing $0.18/call and adding 3.2 seconds of latency. After implementing context window management:

  • Memory budget defined: 20k tokens for scored memories, 10k for pinned (active matters, user preferences), 20k for conversation history, 40k reserved for long-form output.
  • Scoring weights: Relevance 60%, importance 30%, recency 10%. Legal context is less time-sensitive — earlier case notes remain highly relevant.
  • Cost reduction: Average tokens per call dropped from 87k to 41k — a 53% reduction. Monthly API costs fell from $8,200 to $3,900.
  • Quality improvement: Attorney satisfaction scores improved from 3.8 to 4.5/5 because irrelevant context from unrelated matters no longer polluted responses.
  • Hit rate monitoring: The team tracks what percentage of pinned memories are always included (target: 100%) and scored memory hit rate (target: 70–85%). When hit rate drops below 70%, they increase min_importance.

Before / After Memory State

Before: Naive Memory Injection
// All 180 memories injected:
// Total: 87,400 tokens
// Cost per call: $0.18
// TTFT: 4.1 seconds
// Response quality: 3.8/5

// Context is 68% memories: the LLM can't focus on the actual question
// High hallucination rate on complex queries

After: Budget-Managed Injection
// Budget-managed context:
// Pinned: 8 memories, 9.2k tokens
// Scored: 14 memories, 19.8k tokens
// Total memory: 29k tokens
// Cost per call: $0.08 (-53%)
// TTFT: 1.9 seconds (-54%)
// Response quality: 4.5/5
// Budget utilization: 81%

SDK Method Reference

Method | SDK | Purpose in this pattern
recall(agent_id, query, top_k, min_importance) | Python | Broad candidate recall + pinned memory retrieval
batch_recall(request) | Python | Parallel recall for pinned + scored in one call
recall(agentId, query, {top_k, min_importance}) | TypeScript | Candidate and pinned recall
batchRecall(request) | TypeScript | Parallel retrieval for efficiency
client.recall("agent", RecallRequest{...}).await? | Rust | Async broad candidate recall
client.Recall(ctx, "agent", RecallRequest{...}) | Go | Recall with min_importance filter

Edge Cases and Gotchas

  • Recency weight traps: Setting recency weight too high (above 0.4) causes older but highly relevant memories to be trimmed in favor of recent but less relevant ones. For domains with stable knowledge (legal facts, technical specifications), keep recency below 0.2.
  • Token budget creep: If your pinned memory set grows unbounded (every goal, every lesson, forever), the pinned allocation will eventually crowd out scored memories. Implement a maximum for pinned memories — cap at 10–12 entries. Archive or compress lower-priority pinned memories using the memory compression pattern.
  • Parallel recall conflicts: If you run two simultaneous recall queries (pinned + scored) in parallel using batch_recall(), the same memory may appear in both results. Deduplicate by memory ID before building the final context string to avoid doubling token consumption.
  • Token counting mismatch: The character-to-token ratio varies significantly across languages (Chinese, Japanese, and Korean use approximately 1 character per token, not 4). If your users interact in multiple languages, use a dedicated tokenizer rather than the character approximation to avoid budget overflows.
  • High-importance low-relevance memories: A memory with importance=0.95 but near-zero relevance score to the current query will rank highly due to the importance weight. For memories that should only surface in specific contexts (e.g., a medical allergy note for a healthcare agent), add domain tags and filter by tag before scoring to prevent context pollution.
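
Two of the gotchas above, unbounded pinned sets and duplicates across parallel recalls, can be handled in one merge step before the context string is built. This is a minimal sketch: the helper name `merge_pinned_and_scored` is hypothetical, it assumes each memory is a dict with an `id` field (as in the Python example earlier), and it assumes the pinned list is already ordered by priority.

```python
def merge_pinned_and_scored(
    pinned: list[dict], scored: list[dict], max_pinned: int = 12
) -> tuple[list[dict], list[dict]]:
    """Cap the pinned set, then drop scored memories that duplicate it."""
    kept_pinned = pinned[:max_pinned]  # assumes pinned is already priority-ordered
    seen_ids = {m["id"] for m in kept_pinned if m.get("id")}

    kept_scored = []
    for mem in scored:
        mem_id = mem.get("id")
        if mem_id and mem_id in seen_ids:
            continue  # already in the context once; don't pay for it twice
        if mem_id:
            seen_ids.add(mem_id)
        kept_scored.append(mem)
    return kept_pinned, kept_scored


pinned = [{"id": "m1", "content": "Goal: ship tracing v2"}]
scored = [
    {"id": "m1", "content": "Goal: ship tracing v2"},
    {"id": "m2", "content": "Prefers Rust for systems code"},
]
kept_pinned, kept_scored = merge_pinned_and_scored(pinned, scored)
print([m["id"] for m in kept_scored])  # ['m2']
```

Running the merge before budget fill means the token estimate for the scored block never counts a memory that is already pinned.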

Warning: Never inject unscored raw recall into production prompts

It's tempting to skip the scoring step and just inject top_k=10 results directly. This works at small scale but breaks under load: as your memory store grows, Dakera returns increasingly diverse results for the same query. Without scoring and trimming, you lose control over what enters the context window. Always score, always budget — even in development.

Performance Considerations

  • -53%: typical token cost reduction vs naive injection
  • ~18ms: scoring overhead for 50 candidates (Python)
  • 70-85%: target memory budget utilization rate

Advanced Configuration: Dynamic weight adjustment by query type

Different query types benefit from different scoring weights. Implement a query classifier that selects the right weight configuration:

WEIGHT_PROFILES = {
    "factual":      {"relevance": 0.70, "importance": 0.20, "recency": 0.10},
    "temporal":     {"relevance": 0.40, "importance": 0.20, "recency": 0.40},
    "preference":   {"relevance": 0.50, "importance": 0.40, "recency": 0.10},
    "emotional":    {"relevance": 0.40, "importance": 0.30, "recency": 0.30},
    "default":      {"relevance": 0.50, "importance": 0.30, "recency": 0.20},
}

def classify_query_type(query: str) -> str:
    """Simple keyword-based query classifier (substring match; first matching profile wins)."""
    q = query.lower()
    if any(w in q for w in ["when", "date", "recently", "latest", "yesterday"]):
        return "temporal"
    if any(w in q for w in ["prefer", "like", "want", "style", "format"]):
        return "preference"
    if any(w in q for w in ["feel", "emotion", "frustrated", "happy", "stress"]):
        return "emotional"
    if any(w in q for w in ["what", "who", "how many", "define", "explain"]):
        return "factual"
    return "default"

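def build_memory_context_dynamic(agent_id: str, query: str, budget: int) -> dict:
    query_type = classify_query_type(query)
    weights = WEIGHT_PROFILES[query_type]
    # build_memory_context_with_weights is not defined in this guide: it stands for
    # a variant of build_memory_context that takes the scoring weights as a parameter
    # instead of reading the module-level SCORING_WEIGHTS.
    return build_memory_context_with_weights(agent_id, query, budget, weights)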
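
The weight-aware variant referenced above could be built around a selection step like the following. This is a rough sketch of the scoring and budget-fill half only, without the recall calls: the function name `select_with_weights` and the candidate shape (precomputed `relevance`, `importance`, and `recency` values in [0, 1] plus a `content` string) are assumptions for illustration.

```python
def select_with_weights(
    candidates: list[dict], budget_tokens: int, weights: dict[str, float]
) -> list[dict]:
    """Score candidates with the given weights, then greedily fill the budget."""

    def score(mem: dict) -> float:
        return (
            mem["relevance"] * weights["relevance"]
            + mem["importance"] * weights["importance"]
            + mem["recency"] * weights["recency"]
        )

    selected, used = [], 0
    for mem in sorted(candidates, key=score, reverse=True):
        tokens = max(1, len(mem["content"]) // 4)  # ~4 characters per token
        if used + tokens > budget_tokens:
            continue  # skip what doesn't fit; a smaller memory may still fit
        selected.append(mem)
        used += tokens
    return selected


# Hypothetical candidates, 10 estimated tokens each
old_fact = {"content": "a" * 40, "relevance": 0.9, "importance": 0.5, "recency": 0.0}
new_note = {"content": "b" * 40, "relevance": 0.5, "importance": 0.5, "recency": 1.0}

factual = {"relevance": 0.70, "importance": 0.20, "recency": 0.10}
temporal = {"relevance": 0.40, "importance": 0.20, "recency": 0.40}

# With room for only one memory, the profile decides which one survives.
print(select_with_weights([old_fact, new_note], 10, factual) == [old_fact])   # True
print(select_with_weights([old_fact, new_note], 10, temporal) == [new_note])  # True
```

Passing the weights in as an argument, rather than reading a module-level constant, is what lets the classifier swap profiles per query.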
