Intermediate Architecture

RAG-Augmented Memory

⏱ ~35 min to implement 📦 Requires: Dakera v0.11+

Pure RAG knows your documents but forgets your users. Pure memory knows your users but not your documents. Combine both in a single retrieval pipeline so your agent is simultaneously accurate and personal.


Prerequisites

A running Dakera server (see the Quickstart guide)
A document chunking pipeline, or an existing vector store with content to index
LLM API access (Anthropic Claude, OpenAI GPT-4, etc.) for response generation
The Dakera SDK for Python, TypeScript, Rust, or Go

The Problem with Choosing Between RAG and Memory

Traditional RAG retrieves document chunks on every query — great for factual accuracy over a static knowledge base, but it has no awareness of who is asking, what they asked before, or what they already know. Conversely, a pure persistent memory system is personal and accumulates context over time, but has no access to your internal documentation or knowledge base.

Enterprise knowledge base assistants need both. A support agent should know that this particular user has a Pro subscription, already tried rebooting, and prefers concise answers — while simultaneously retrieving the correct troubleshooting steps from official documentation.

Dakera's approach: one store, two retrieval paths

Rather than maintaining a separate vector store for documents and a memory system for users, Dakera indexes both in the same store. Document chunks are stored as memories with metadata marking their source. The recall() call returns agent memories and document chunks ranked together by relevance — no merge code to write.

Architecture: Hybrid Retrieval Pipeline

On every query, two parallel retrieval paths execute and their results are ranked together before injection into the LLM prompt. The full pipeline runs from the user query, through memory recall and document chunk retrieval, to the merge step and finally the LLM response.

Latency Breakdown

Stage	p50
Memory recall (agent history + prefs)	12ms
Document chunk retrieval	80ms
Merge, deduplicate, rank results	5ms
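
Because the two retrieval paths run concurrently, the wall-clock cost is roughly the slower path plus the merge step, not the sum of all three. The sketch below illustrates this with `asyncio.sleep` standing in for the two recall calls. `fake_recall`, `timed_parallel`, and the delays are illustrative placeholders, not Dakera APIs or measurements.

```python
import asyncio
import time


async def fake_recall(name: str, delay_s: float) -> str:
    """Stand-in for a network call; sleeps instead of querying a server."""
    await asyncio.sleep(delay_s)
    return name


async def timed_parallel(delays: dict[str, float]) -> tuple[list[str], float]:
    """Run all simulated recalls concurrently; return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = await asyncio.gather(*(fake_recall(n, d) for n, d in delays.items()))
    return list(results), time.perf_counter() - start


if __name__ == "__main__":
    # Delays are scaled up from the table's figures so the effect is easy to see.
    results, elapsed = asyncio.run(timed_parallel({"memory": 0.03, "docs": 0.2}))
    print(results, f"{elapsed:.3f}s")  # elapsed tracks the slower call, not the sum
```

The same reasoning is why the parallel-recall row in the performance table later on sits close to the document-recall row rather than near the sum of both paths.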

Memory Update Loop: Learning from RAG Results

When the RAG pipeline surfaces a fact new to the agent's memory — a policy change, a product update, a user correction — that fact should be stored back into agent memory so future queries answer faster without re-retrieval. This "memory crystallization" loop is the key differentiator of RAG-augmented memory over plain RAG.

Tip: use semantic deduplication before crystallizing

Before storing a learned fact to memory, call search_memories() to check if a similar fact already exists. Dakera's hybrid search will surface near-duplicates with a cosine similarity score. Only store if the top result scores below 0.85 — this prevents memory bloat from near-identical document chunks being crystallized repeatedly.

Real-World Scenario: Enterprise Knowledge Base Assistant

An internal knowledge base assistant at a SaaS company handles HR policy questions, IT support, and product documentation queries from employees. The critical requirement: every answer must be both factually accurate (from official docs) and personally aware (remembering this employee's role, past questions, and preferences).

1. Index company documentation at startup

Chunk all policy documents, runbooks, and product guides into 400-600 token segments. Store each chunk as a memory with memory_type="semantic" and metadata marking the source document, version, and department. Dakera indexes them in the same store as user memories.

2. Store user context on first interaction

When an employee first uses the assistant, store their role, department, technical level, and communication preferences as high-importance semantic memories scoped to their agent ID. These persist across all future sessions.

3. Run parallel retrieval on every query

Execute two concurrent recall() calls: one filtered to user memories (preferences, past questions, their specific environment), and one filtered to documentation chunks. Run them in parallel so that the total added latency is that of the slower call, not the sum.

4. Merge, rank, and trim to token budget

Sort all retrieved results by relevance score. Apply a token budget (typically 2000 tokens for context). Prioritize user-memory results over generic doc chunks when scores are close: personalization wins ties.

5. Crystallize new facts learned from successful answers

After the LLM generates a high-confidence answer that includes a specific policy fact, store that condensed fact back into the user's memory with a TTL matching the document's expected update frequency. Future identical or similar questions skip doc retrieval entirely.
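
The "personalization wins ties" rule in the merge step applies when scores are close, not only when they are exactly equal. One way to express that is a small score bonus for user-memory results, sketched below. The function name `rank_with_personal_bias` and the `margin` value are assumptions for illustration; tune the margin against your own relevance scores.

```python
def rank_with_personal_bias(items: list[dict], margin: float = 0.02) -> list[dict]:
    """Sort by score descending, letting memory items win when a doc leads by less than `margin`."""

    def key(item: dict) -> float:
        # Memory items get a small bonus; doc items are ranked on raw score.
        bonus = margin if item["source"] == "memory" else 0.0
        return item["score"] + bonus

    return sorted(items, key=key, reverse=True)


if __name__ == "__main__":
    results = [
        {"content": "SSO setup guide", "score": 0.81, "source": "doc"},
        {"content": "Alice uses Okta", "score": 0.80, "source": "memory"},
    ]
    for item in rank_with_personal_bias(results):
        print(item["source"], item["content"])
```

With `margin=0` this reduces to a plain score sort, which makes the effect of the bias easy to A/B test.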


Implementation

# 1. Index a document chunk as a memory
curl -X POST http://localhost:3300/v1/memory/store \
  -H "Authorization: Bearer dk-..." \
  -H "Content-Type: application/json" \
  -d '{
    "agent_id": "kb-assistant",
    "content": "SSO is available on Pro and Enterprise plans. To enable, go to Settings > Security > Single Sign-On and upload your IdP metadata XML.",
    "importance": 0.85,
    "memory_type": "semantic",
    "tags": ["doc", "sso", "security", "settings"],
    "metadata": {
      "source": "doc",
      "doc_id": "help-sso-setup",
      "version": "2024-Q4",
      "department": "IT"
    }
  }'

# 2. Store user context
curl -X POST http://localhost:3300/v1/memory/store \
  -H "Authorization: Bearer dk-..." \
  -H "Content-Type: application/json" \
  -d '{
    "agent_id": "kb-assistant",
    "content": "User is a Systems Administrator on the Pro plan. Prefers step-by-step instructions with CLI examples where available.",
    "importance": 0.95,
    "memory_type": "semantic",
    "tags": ["user-profile", "user:alice"]
  }'

# 3. Recall combines both — sorted by relevance
curl "http://localhost:3300/v1/memory/recall?agent_id=kb-assistant&query=how+do+I+enable+SSO&top_k=8" \
  -H "Authorization: Bearer dk-..."

# 4. Recall only docs (for explicit doc-only path)
curl "http://localhost:3300/v1/memory/recall?agent_id=kb-assistant&query=SSO+setup&top_k=5&tags=doc" \
  -H "Authorization: Bearer dk-..."

import asyncio
from dakera import DakeraClient

client = DakeraClient(base_url="http://localhost:3300", api_key="dk-...")

# ── Step 1: Index documents at startup ──────────────────────────────────────

def index_document(doc_id: str, chunks: list[str], department: str, version: str):
    """Index a document as a series of memory chunks."""
    for chunk in chunks:
        client.store_memory(
            agent_id="kb-assistant",
            content=chunk,
            importance=0.82,
            memory_type="semantic",
            # doc_id and version are carried as tags (a convention for this
            # sketch, mirroring "user:alice") so stale chunks can be found later
            tags=["doc", department.lower(), f"doc:{doc_id}", f"version:{version}"],
            # TTL: auto-expire doc chunks after 90 days, force re-index on update
            ttl_seconds=90 * 24 * 3600
        )

# Index your SSO documentation
sso_chunks = [
    "SSO (Single Sign-On) is available on Pro and Enterprise plans only. Starter plan users cannot enable SSO.",
    "To enable SSO, navigate to Settings > Security > Single Sign-On. Upload your IdP metadata XML. Supported IdPs: Okta, Azure AD, Google Workspace, OneLogin.",
    "After uploading IdP metadata, test the SSO flow with a sandbox account before enabling for your entire organization. SSO enforcement will lock out users without IdP accounts.",
]
index_document("help-sso-setup", sso_chunks, "IT", "2024-Q4")

# ── Step 2: Store user preferences on first interaction ─────────────────────

client.store_memory(
    agent_id="kb-assistant",
    content="Alice Chen is a Systems Administrator on the Pro plan. Prefers step-by-step instructions. Has SAML experience. Uses Okta as IdP.",
    importance=0.95,
    memory_type="semantic",
    tags=["user-profile", "user:alice"]
)

# ── Step 3: Parallel retrieval on each query ────────────────────────────────

async def retrieve_context(query: str, user_tag: str) -> dict:
    """Run memory recall and doc recall in parallel."""
    # Both calls to the same Dakera instance; run concurrently
    mem_task = asyncio.to_thread(
        client.recall,
        agent_id="kb-assistant",
        query=query,
        top_k=4,
        min_importance=0.7,
        tags=[user_tag]  # user-specific memories only
    )
    doc_task = asyncio.to_thread(
        client.recall,
        agent_id="kb-assistant",
        query=query,
        top_k=6,
        min_importance=0.6,
        tags=["doc"]  # documentation chunks only
    )
    mem_results, doc_results = await asyncio.gather(mem_task, doc_task)
    return {"memories": mem_results["memories"], "docs": doc_results["memories"]}

# ── Step 4: Merge and rank results ──────────────────────────────────────────

def merge_context(memories: list, docs: list, token_budget: int = 2000) -> str:
    """Merge memory + doc results, rank by score, trim to token budget."""
    # Tag source for the LLM
    tagged = [
        {"content": m["content"], "score": m["score"], "source": "memory"}
        for m in memories
    ] + [
        {"content": d["content"], "score": d["score"], "source": "doc"}
        for d in docs
    ]
    # Sort by score descending; user memories win ties (memory source sorted first)
    tagged.sort(key=lambda x: (x["score"], x["source"] == "memory"), reverse=True)

    context_parts = []
    token_count = 0
    for item in tagged:
        tokens = len(item["content"].split()) * 1.3  # rough estimate
        if token_count + tokens > token_budget:
            break
        prefix = "[USER CONTEXT]" if item["source"] == "memory" else "[DOCUMENTATION]"
        context_parts.append(f"{prefix}\n{item['content']}")
        token_count += tokens

    return "\n\n".join(context_parts)

# ── Step 5: Crystallize learned facts back to memory ────────────────────────

def crystallize_fact(agent_id: str, fact: str, user_tag: str, ttl_days: int = 30):
    """Store a learned fact from RAG results into persistent memory."""
    # Check for near-duplicate first
    existing = client.search_memories(agent_id=agent_id, query=fact)
    if existing["memories"] and existing["memories"][0]["score"] > 0.85:
        return  # Already known — skip

    client.store_memory(
        agent_id=agent_id,
        content=fact,
        importance=0.75,
        memory_type="semantic",
        tags=["learned", "doc-derived", user_tag],
        ttl_seconds=ttl_days * 24 * 3600
    )

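# Example usage
async def answer_question(user_query: str):
    context_data = await retrieve_context(user_query, "user:alice")
    context_str = merge_context(context_data["memories"], context_data["docs"])

    # Build system prompt with merged context
    system_prompt = f"""You are a knowledge base assistant.
Use the context below to answer accurately and personally.

{context_str}"""

    # ... call your LLM here with system_prompt and user_query ...
    # After the response, crystallize high-value facts. The client is
    # synchronous, so each write runs in a worker thread to keep the event
    # loop free. In practice, pass a condensed fact extracted from the answer:
    # a verbatim chunk already lives in the same store, so the near-duplicate
    # check in crystallize_fact will likely match it and skip the write.
    for doc in context_data["docs"]:
        if doc["score"] > 0.90:  # High-confidence relevant doc chunk
            await asyncio.to_thread(
                crystallize_fact, "kb-assistant", doc["content"], "user:alice"
            )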

import { DakeraClient } from '@dakera-ai/dakera';
import Anthropic from '@anthropic-ai/sdk';

const client = new DakeraClient({ baseUrl: 'http://localhost:3300', apiKey: 'dk-...' });
const anthropic = new Anthropic();

// ── Index documentation chunks ──────────────────────────────────────────────

async function indexDocument(docId: string, chunks: string[], tags: string[]) {
  await Promise.all(chunks.map((chunk, i) =>
    client.storeMemory('kb-assistant', {
      content: chunk,
      importance: 0.82,
      memoryType: 'semantic',
      tags: ['doc', ...tags],
      ttl_seconds: 90 * 24 * 3600,
    })
  ));
}

// Index SSO docs
await indexDocument('help-sso-setup', [
  'SSO is available on Pro and Enterprise plans. Go to Settings > Security > SSO to enable.',
  'Supported IdPs: Okta, Azure AD, Google Workspace, OneLogin. Upload your IdP metadata XML.',
  'Test with a sandbox account before enforcing SSO organization-wide to avoid lockouts.',
], ['it', 'sso', 'security']);

// ── Store user profile ──────────────────────────────────────────────────────

await client.storeMemory('kb-assistant', {
  content: 'Alice Chen is a Systems Administrator on Pro plan. Uses Okta IdP. Prefers step-by-step instructions.',
  importance: 0.95,
  memoryType: 'semantic',
  tags: ['user-profile', 'user:alice'],
});

// ── Parallel retrieval ──────────────────────────────────────────────────────

async function retrieveContext(query: string, userTag: string) {
  const [memResults, docResults] = await Promise.all([
    client.recall('kb-assistant', query, { top_k: 4, min_importance: 0.7, memory_type: 'semantic' }),
    client.recall('kb-assistant', query, { top_k: 6, min_importance: 0.6, memory_type: 'semantic' }),
  ]);

  return {
    memories: memResults.memories.filter((m: any) => m.tags?.includes(userTag)),
    docs: docResults.memories.filter((m: any) => m.tags?.includes('doc')),
  };
}

// ── Merge results ───────────────────────────────────────────────────────────

function mergeContext(
  memories: any[],
  docs: any[],
  tokenBudget = 2000
): string {
  const tagged = [
    ...memories.map(m => ({ ...m, source: 'memory' as const })),
    ...docs.map(d => ({ ...d, source: 'doc' as const })),
  ].sort((a, b) => b.score - a.score);

  const parts: string[] = [];
  let tokens = 0;
  for (const item of tagged) {
    const est = item.content.split(' ').length * 1.3;
    if (tokens + est > tokenBudget) break;
    const prefix = item.source === 'memory' ? '[USER CONTEXT]' : '[DOCUMENTATION]';
    parts.push(`${prefix}\n${item.content}`);
    tokens += est;
  }
  return parts.join('\n\n');
}

// ── Crystallize facts ───────────────────────────────────────────────────────

async function crystallizeFact(fact: string, userTag: string) {
  const existing = await client.searchMemories('kb-assistant', fact, { top_k: 1 });
  if (existing.memories[0]?.score > 0.85) return; // Already known

  await client.storeMemory('kb-assistant', {
    content: fact,
    importance: 0.75,
    memoryType: 'semantic',
    tags: ['learned', 'doc-derived', userTag],
    ttl_seconds: 30 * 24 * 3600,
  });
}

// ── Full pipeline ───────────────────────────────────────────────────────────

async function answerQuestion(userQuery: string) {
  const { memories, docs } = await retrieveContext(userQuery, 'user:alice');
  const context = mergeContext(memories, docs);

  const response = await anthropic.messages.create({
    model: 'claude-sonnet-4-6',
    max_tokens: 1024,
    system: `You are a knowledge base assistant. Use the context below.

${context}`,
    messages: [{ role: 'user', content: userQuery }],
  });

  // Crystallize highly relevant doc chunks for faster future recall
  for (const doc of docs.filter((d: any) => d.score > 0.90)) {
    await crystallizeFact(doc.content, 'user:alice');
  }

  return response.content[0].type === 'text' ? response.content[0].text : '';
}

use dakera_rs::{Client, StoreMemoryRequest, RecallRequest};
use tokio::join;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::new("http://localhost:3300", "dk-...");

    // Index document chunks
    client.store_memory("kb-assistant", StoreMemoryRequest {
        content: "SSO is available on Pro and Enterprise plans. Settings > Security > SSO.".into(),
        importance: Some(0.82),
        memory_type: Some("semantic".into()),
        tags: Some(vec!["doc".into(), "sso".into()]),
        ttl_seconds: Some(90 * 24 * 3600),
        ..Default::default()
    }).await?;

    // Store user context
    client.store_memory("kb-assistant", StoreMemoryRequest {
        content: "Alice is a sysadmin on Pro plan, uses Okta, prefers step-by-step guides.".into(),
        importance: Some(0.95),
        memory_type: Some("semantic".into()),
        tags: Some(vec!["user-profile".into(), "user:alice".into()]),
        ..Default::default()
    }).await?;

    // Parallel recall: user memories + doc chunks.
    // NOTE: neither request filters by tag as written, so both return the same
    // mixed pool. Add the tag filter your SDK version exposes (the REST API
    // accepts `tags=doc`) so each path returns only its own kind of memory.
    let query = "How do I enable SSO?";
    let (mem_results, doc_results) = join!(
        client.recall("kb-assistant", RecallRequest {
            query: query.into(),
            top_k: Some(4),
            min_importance: Some(0.7),
            ..Default::default()
        }),
        client.recall("kb-assistant", RecallRequest {
            query: query.into(),
            top_k: Some(6),
            min_importance: Some(0.6),
            ..Default::default()
        })
    );

    let (mem, docs) = (mem_results?, doc_results?);

    // Build context string
    let mut context = String::new();
    for m in &mem.memories {
        context.push_str(&format!("[USER CONTEXT]\n{}\n\n", m.content));
    }
    for d in &docs.memories {
        context.push_str(&format!("[DOCUMENTATION]\n{}\n\n", d.content));
    }

    println!("Context for LLM:\n{}", context);
    Ok(())
}

package main

import (
    "context"
    "fmt"
    "sync"
    dakera "github.com/dakera-ai/dakera-go"
)

func main() {
    client := dakera.NewClient("http://localhost:3300", "dk-...")
    ctx := context.Background()

    // Index documentation chunk
    client.StoreMemory(ctx, "kb-assistant", dakera.StoreMemoryRequest{
        Content:    "SSO available on Pro and Enterprise. Settings > Security > SSO. Upload IdP metadata XML.",
        Importance: 0.82,
        MemoryType: "semantic",
        Tags:       []string{"doc", "sso"},
        TTLSeconds: 90 * 24 * 3600,
    })

    // Store user profile
    client.StoreMemory(ctx, "kb-assistant", dakera.StoreMemoryRequest{
        Content:    "Alice Chen: sysadmin, Pro plan, Okta IdP, prefers step-by-step instructions.",
        Importance: 0.95,
        MemoryType: "semantic",
        Tags:       []string{"user-profile", "user:alice"},
    })

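    // Parallel retrieval.
    // NOTE: neither request filters by tag as written, so both return the same
    // mixed pool. Add the tag filter your SDK version exposes (the REST API
    // accepts tags=doc) so each path returns only its own kind of memory.
    var (
        memResults *dakera.RecallResponse
        docResults *dakera.RecallResponse
        memErr     error
        docErr     error
        wg         sync.WaitGroup
    )
    query := "How do I enable SSO?"

    wg.Add(2)
    go func() {
        defer wg.Done()
        memResults, memErr = client.Recall(ctx, "kb-assistant", dakera.RecallRequest{
            Query:         query,
            TopK:          4,
            MinImportance: 0.7,
        })
    }()
    go func() {
        defer wg.Done()
        docResults, docErr = client.Recall(ctx, "kb-assistant", dakera.RecallRequest{
            Query:         query,
            TopK:          6,
            MinImportance: 0.6,
        })
    }()
    wg.Wait()

    // Stop early on failure instead of dereferencing a nil response below
    if memErr != nil || docErr != nil {
        fmt.Println("recall failed:", memErr, docErr)
        return
    }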

    // Build merged context
    contextStr := ""
    for _, m := range memResults.Memories {
        contextStr += fmt.Sprintf("[USER CONTEXT]\n%s\n\n", m.Content)
    }
    for _, d := range docResults.Memories {
        contextStr += fmt.Sprintf("[DOCUMENTATION]\n%s\n\n", d.Content)
    }
    fmt.Println("Merged context:", contextStr)
}

Before / After: Pure RAG vs. RAG + Dakera Memory

Before: Pure RAG only

Query: "How do I enable SSO?"

Retrieved chunks:
- SSO setup guide (generic)
- SSO troubleshooting (generic)
- IdP metadata format spec

Generated answer:
"To enable SSO, go to Settings >
Security > Single Sign-On and
upload your IdP metadata XML.
Supported IdPs: Okta, Azure AD,
Google Workspace, OneLogin."

Problems:
- Doesn't know Alice is already
  on Pro plan (can't confirm access)
- Doesn't know she uses Okta
  (misses Okta-specific steps)
- Same generic answer every time
- No memory of past questions
  (may answer the same thing 5x)

After: RAG + Dakera Memory

Query: "How do I enable SSO?"

Memory recall:
- Alice: Pro plan, Okta IdP,
  prefers step-by-step guides
- Alice asked about SSO last week;
  showed her the Settings page

Doc recall:
- SSO setup guide (generic)
- Okta-specific SAML config steps
- Pro plan feature confirmation

Generated answer:
"Since you're on Pro plan, SSO is
available. For Okta specifically:
1. In Okta Admin, create a new
   SAML 2.0 application
2. Download the metadata XML
3. In Dakera Settings > Security >
   SSO, upload the XML
Last time we spoke about this you
were on the Settings page — the
SSO tab is in the left nav."

Personal, accurate, and contextual.

Edge Cases

1. Conflicting RAG vs. Memory Facts

A policy document might say "SSO requires Enterprise plan" while a crystallized memory from 6 months ago says "SSO is available on Pro plan" (because the policy changed). When scores are close and sources conflict, always prefer the document chunk — it reflects the current source of truth. Flag the stale memory for update using update_importance() to demote it.

# Detect conflict: doc says X, memory says Y on same topic
# Demote stale memory importance so doc wins future rankings
client.update_importance(
    agent_id="kb-assistant",
    memory_id="mem_stale_sso_policy",
    importance=0.2  # demote so doc chunk ranks higher
)

2. Document Staleness After Updates

Documents change. Crystallized memories derived from outdated chunks persist until their TTL expires or you force-update them. Set TTL on all doc-derived memories to match your document update cadence. For fast-changing docs (weekly releases), use 7-day TTLs. For stable policy docs, 90 days is safe.

3. Chunking Strategy Affects Recall Quality

Chunks that are too small (under 100 tokens) lack enough context for accurate semantic matching. Chunks that are too large (over 800 tokens) dilute the embedding and return irrelevant sentences. The optimal range is 350–550 tokens with 50-token overlaps between adjacent chunks to prevent context loss at boundaries.

4. Token Budget Overflow

When both memory recall and doc retrieval return high-scoring results, the merged context can overflow the LLM's context window. Implement strict token budgets: allocate 40% to user memories (high personalization value), 50% to doc chunks (factual grounding), and reserve 10% for the system prompt and user message.
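
As a rough sketch of that split, the helpers below divide a total budget 40/50/10 and then fill each bucket with the highest-scoring items that fit. The function names are hypothetical, and the token estimate reuses the words × 1.3 heuristic from the merge step; swap in a real tokenizer if you need accurate counts.

```python
def split_budget(total_tokens: int) -> dict[str, int]:
    """Split a token budget 40/50/10 between memories, docs, and prompt overhead."""
    memory = total_tokens * 40 // 100
    docs = total_tokens * 50 // 100
    return {"memory": memory, "docs": docs, "reserved": total_tokens - memory - docs}


def estimate_tokens(text: str) -> int:
    """Rough estimate: about 1.3 tokens per whitespace-separated word, rounded up."""
    return int(len(text.split()) * 1.3) + 1


def take_within_budget(items: list[dict], budget: int) -> list[dict]:
    """Keep the highest-scoring items whose estimated tokens fit in `budget`."""
    kept, used = [], 0
    for item in sorted(items, key=lambda x: x["score"], reverse=True):
        cost = estimate_tokens(item["content"])
        if used + cost > budget:
            continue  # this item does not fit; a smaller, lower-ranked one still might
        kept.append(item)
        used += cost
    return kept


if __name__ == "__main__":
    budget = split_budget(2000)
    print(budget)
```

Skipping an oversized item instead of stopping at it is a design choice: it uses the budget more fully, at the cost of occasionally including a lower-ranked chunk ahead of a higher-ranked one that was too large.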

5. Cold Start: No User Memory Yet

On a user's first interaction, memory recall returns nothing. Fall back gracefully to doc-only retrieval with a wider top_k. After the first turn, store a minimal user profile memory to bootstrap personalization for the second turn. Never show degraded behavior to the user — the transition should be seamless.
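
A minimal sketch of that fallback, assuming the two retrieval paths are passed in as plain callables (`recall_user` and `recall_docs` are hypothetical wrappers around whatever recall calls you already have, not SDK functions):

```python
from typing import Callable

# (query, top_k) -> list of result dicts
Recall = Callable[[str, int], list[dict]]


def retrieve_with_cold_start(
    query: str,
    recall_user: Recall,
    recall_docs: Recall,
    user_top_k: int = 4,
    doc_top_k: int = 6,
    cold_doc_top_k: int = 12,
) -> dict:
    """Fetch user memories first; widen doc retrieval when none exist yet."""
    memories = recall_user(query, user_top_k)
    cold = len(memories) == 0
    docs = recall_docs(query, cold_doc_top_k if cold else doc_top_k)
    return {"memories": memories, "docs": docs, "cold_start": cold}


if __name__ == "__main__":
    fake_docs = [{"content": f"doc chunk {i}"} for i in range(20)]
    result = retrieve_with_cold_start(
        "How do I enable SSO?",
        recall_user=lambda q, k: [],            # brand-new user: no memories yet
        recall_docs=lambda q, k: fake_docs[:k],
    )
    print(result["cold_start"], len(result["docs"]))
```

This version runs the two lookups in sequence for clarity. If you want to keep them parallel, one option is to always request the wider top_k for docs and trim afterwards once you know whether any user memories came back.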

Performance Considerations

Operation	p50	p99	Optimization
Memory recall (user context)	12ms	28ms	Tag filter reduces search space
Doc chunk recall (1k chunks indexed)	80ms	160ms	Hybrid BM25+vector search
Doc chunk recall (50k chunks indexed)	95ms	220ms	Sub-linear scaling with HNSW index
Parallel recall (both paths)	82ms	165ms	Parallelism eliminates additive cost
Merge + deduplicate + rank	5ms	12ms	In-process; no network hop
Crystallization (store_memory)	18ms	40ms	Async post-response for zero UX impact

Crystallize asynchronously

Always run the crystallization store_memory() call after returning the response to the user, not before. It adds 18-40ms to the user-facing latency if you block on it. Queue it as a background task — the user should never wait for memory writes.
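
One way to queue the write in an asyncio application is to schedule it as a task and return the reply immediately. In the sketch below, `slow_store` is a hypothetical stand-in for the memory write and `answer` is a placeholder handler; neither is part of the Dakera SDK.

```python
import asyncio

# Hold references so pending tasks are not garbage-collected before they finish.
_background: set[asyncio.Task] = set()


async def slow_store(fact: str, sink: list[str]) -> None:
    """Hypothetical stand-in for a memory write; replace with your real store call."""
    await asyncio.sleep(0.03)
    sink.append(fact)


async def answer(query: str, sink: list[str]) -> str:
    """Return the reply immediately and let the memory write finish afterwards."""
    reply = f"answer to: {query}"  # placeholder for the LLM call
    task = asyncio.create_task(slow_store(f"fact from: {query}", sink))
    _background.add(task)
    task.add_done_callback(_background.discard)
    return reply


async def drain() -> None:
    """Wait for outstanding writes, for example during graceful shutdown."""
    await asyncio.gather(*_background)
```

With a synchronous client, the body of the task can wrap the blocking call in `asyncio.to_thread`. Calling `drain()` at shutdown avoids losing writes that were still in flight.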

SDK Reference

Method	SDK	Purpose
`store_memory(agent_id, content, importance, memory_type, tags, ttl_seconds)`	Python	Index doc chunks or store user memories
`storeMemory(agentId, {content, importance, memoryType, tags, ttl_seconds})`	TypeScript	Index doc chunks or store user memories
`recall(agent_id, query, top_k, min_importance, tags)`	Python	Retrieve ranked memories + doc chunks, optionally filtered by tags
`recall(agentId, query, {top_k, min_importance, memory_type})`	TypeScript	Retrieve ranked memories + doc chunks
`search_memories(agent_id, query)`	Python	Semantic search to detect near-duplicates before crystallization
`searchMemories(agentId, query, {top_k})`	TypeScript	Semantic search for deduplication check
`update_importance(agent_id, memory_id, importance)`	Python	Demote stale crystallized facts when docs are updated
`updateImportance(agentId, request)`	TypeScript	Demote stale memories
`forget(agent_id, memory_id)`	Python	Remove outdated doc chunks when document is replaced
`batch_recall(request)`	Python	Recall from multiple agent IDs in one call for multi-user scenarios

Advanced Configuration: Chunking, TTL, and Index Tuning

Recommended chunking parameters

from dakera import DakeraClient
import tiktoken

client = DakeraClient(base_url="http://localhost:3300", api_key="dk-...")
enc = tiktoken.get_encoding("cl100k_base")

def chunk_document(text: str, chunk_tokens: int = 450, overlap_tokens: int = 50):
    """Chunk text with overlap to prevent context boundary loss."""
    if not 0 <= overlap_tokens < chunk_tokens:
        raise ValueError("overlap_tokens must be >= 0 and smaller than chunk_tokens")
    tokens = enc.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + chunk_tokens, len(tokens))
        chunks.append(enc.decode(tokens[start:end]))
        if end == len(tokens):
            break  # last chunk reached; avoid emitting a tail that is pure overlap
        start += chunk_tokens - overlap_tokens  # slide with overlap
    return chunks

TTL strategy by document type

# Changelog / release notes: short TTL — changes frequently
client.store_memory("kb-assistant", content=chunk, importance=0.8, ttl_seconds=7*24*3600)

# Policy documents: medium TTL — quarterly updates
client.store_memory("kb-assistant", content=chunk, importance=0.85, ttl_seconds=90*24*3600)

# Core product docs: long TTL — stable unless major version change
client.store_memory("kb-assistant", content=chunk, importance=0.85, ttl_seconds=365*24*3600)

# Crystallized facts learned from high-score RAG results: 30 days
client.store_memory("kb-assistant", content=fact, importance=0.75, ttl_seconds=30*24*3600)

Hybrid search tuning

# In docker/.env: tune BM25 vs vector weight for doc retrieval
# Higher BM25 weight = better for keyword-heavy technical docs
# Higher vector weight = better for conversational / semantic queries
DAKERA_HYBRID_BM25_WEIGHT=0.4
DAKERA_HYBRID_VECTOR_WEIGHT=0.6
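
To build intuition for what these two weights do, the sketch below assumes the simplest possible fusion: a weighted sum of a normalized BM25 score and a normalized vector score. This is an assumption for illustration only; how Dakera actually combines the two signals is not described here, and `blend` and `rank` are hypothetical helpers.

```python
def blend(bm25: float, vector: float, w_bm25: float = 0.4, w_vector: float = 0.6) -> float:
    """Weighted sum of two scores that are each already normalized to [0, 1]."""
    return w_bm25 * bm25 + w_vector * vector


def rank(candidates: dict[str, tuple[float, float]], w_bm25: float, w_vector: float) -> list[str]:
    """Order candidate ids by blended score, highest first.

    `candidates` maps an id to its (bm25, vector) score pair.
    """
    return sorted(
        candidates,
        key=lambda cid: blend(*candidates[cid], w_bm25=w_bm25, w_vector=w_vector),
        reverse=True,
    )


if __name__ == "__main__":
    candidates = {
        "exact-keyword": (0.9, 0.3),  # strong term match, weak semantic match
        "paraphrase": (0.2, 0.8),     # weak term match, strong semantic match
    }
    print(rank(candidates, 0.4, 0.6))
    print(rank(candidates, 0.7, 0.3))
```

Under this assumption, raising the BM25 weight moves the exact-keyword chunk ahead of the paraphrase, which matches the guidance in the comments above for keyword-heavy technical docs.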

Combine your docs with your users' context

Dakera makes it trivial to index document chunks alongside agent memories and retrieve both in a single ranked call — no separate vector store required.
