RAG-Augmented Memory
Pure RAG knows your documents but forgets your users. Pure memory knows your users but not your documents. Combine both in a single retrieval pipeline so your agent is simultaneously accurate and personal.
- Running Dakera server (see Quickstart guide)
- Document chunking pipeline or existing vector store with content to index
- LLM API access (Anthropic Claude, OpenAI GPT-4, etc.) for response generation
- Dakera Python, TypeScript, Rust, or Go SDK installed
The Problem with Choosing Between RAG and Memory
Traditional RAG retrieves document chunks on every query — great for factual accuracy over a static knowledge base, but it has no awareness of who is asking, what they asked before, or what they already know. Conversely, a pure persistent memory system is personal and accumulates context over time, but has no access to your internal documentation or knowledge base.
Enterprise knowledge base assistants need both. A support agent should know that this particular user has a Pro subscription, already tried rebooting, and prefers concise answers — while simultaneously retrieving the correct troubleshooting steps from official documentation.
Rather than maintaining a separate vector store for documents and a memory system for users, Dakera indexes both in the same store. Document chunks are stored as memories with metadata marking their source. The recall() call returns agent memories and document chunks ranked together by relevance — no merge code to write.
Architecture: Hybrid Retrieval Pipeline
On every query, two parallel retrieval paths execute and their results are ranked together before injection into the LLM prompt. [Diagram: full pipeline from user query to LLM response.]
Latency Breakdown
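To illustrate why two retrieval paths run in parallel cost roughly the slower path rather than the sum, here is a minimal sketch that simulates both paths with sleeps. `fake_recall` and `timed_parallel` are hypothetical stand-ins, not Dakera SDK calls, and the delays are placeholders rather than measurements.

```python
import asyncio
import time


async def fake_recall(label: str, delay_s: float) -> str:
    """Stand-in for one retrieval path; sleeps instead of calling a server."""
    await asyncio.sleep(delay_s)
    return label


async def timed_parallel(mem_delay_s: float, doc_delay_s: float) -> tuple[list[str], float]:
    """Run both simulated paths concurrently and report wall-clock seconds."""
    start = time.perf_counter()
    results = await asyncio.gather(
        fake_recall("memories", mem_delay_s),
        fake_recall("docs", doc_delay_s),
    )
    return list(results), time.perf_counter() - start
```

Calling `asyncio.run(timed_parallel(0.2, 0.4))` should take close to 0.4 seconds, not 0.6, because `asyncio.gather` waits on both coroutines at once.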
Memory Update Loop: Learning from RAG Results
When the RAG pipeline surfaces a fact new to the agent's memory — a policy change, a product update, a user correction — that fact should be stored back into agent memory so future queries answer faster without re-retrieval. This "memory crystallization" loop is the key differentiator of RAG-augmented memory over plain RAG.
Before storing a learned fact to memory, call search_memories() to check if a similar fact already exists. Dakera's hybrid search will surface near-duplicates with a cosine similarity score. Only store if the top result scores below 0.85 — this prevents memory bloat from near-identical document chunks being crystallized repeatedly.
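As a rough illustration of what the 0.85 threshold measures, the sketch below computes cosine similarity between two embedding vectors and applies the store-or-skip rule. Both helper names are hypothetical, and the vectors are toy values; in practice the score comes back from the search call rather than being computed by hand.

```python
import math
from typing import Optional


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def should_store(top_score: Optional[float], threshold: float = 0.85) -> bool:
    """Store only when no existing memory scores at or above the threshold."""
    return top_score is None or top_score < threshold
```

Passing `None` covers the case where the search returns no results at all, so the first fact on a topic is always stored.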
Real-World Scenario: Enterprise Knowledge Base Assistant
An internal knowledge base assistant at a SaaS company handles HR policy questions, IT support, and product documentation queries from employees. The critical requirement: every answer must be both factually accurate (from official docs) and personally aware (remembering this employee's role, past questions, and preferences).
- Index company documentation at startup. Chunk all policy documents, runbooks, and product guides into 400-600 token segments. Store each chunk as a memory with memory_type="semantic" and metadata marking the source document, version, and department. Dakera indexes them in the same store as user memories.
- Store user context on first interaction. When an employee first uses the assistant, store their role, department, technical level, and communication preferences as high-importance semantic memories scoped to their agent ID. These persist across all future sessions.
- Run parallel retrieval on every query. Execute two concurrent recall() calls: one filtered to user memories (preferences, past questions, their specific environment), and one filtered to documentation chunks. Run them in parallel, so the total added latency is the slower of the two, not the sum.
- Merge, rank, and trim to token budget. Sort all retrieved results by relevance score. Apply a token budget (typically 2000 tokens for context). Prioritize user-memory results over generic doc chunks when scores are close: personalization wins ties.
- Crystallize new facts learned from successful answers. After the LLM generates a high-confidence answer that includes a specific policy fact, store that condensed fact back into the user's memory with a TTL matching the document's expected update frequency. Future identical or similar questions skip doc retrieval entirely.
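One way to make "personalization wins ties" apply to scores that are close rather than exactly equal is to give user memories a small ranking bonus. This is a sketch under that assumption; `rank_with_memory_bonus` and the 0.02 bonus are illustrative choices, not part of the Dakera SDK.

```python
def rank_with_memory_bonus(items: list[dict], bonus: float = 0.02) -> list[dict]:
    """Sort by score descending, letting user memories win near-ties.

    Each item needs a "score" and a "source" of "memory" or "doc".
    A memory outranks a doc chunk whenever the doc leads by less than `bonus`.
    """
    def effective_score(item: dict) -> float:
        return item["score"] + (bonus if item["source"] == "memory" else 0.0)

    return sorted(items, key=effective_score, reverse=True)
```

An additive bonus keeps the ordering transitive, which a pairwise "within tolerance" comparison does not guarantee.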
Implementation
cURL
# 1. Index a document chunk as a memory
curl -X POST http://localhost:3300/v1/memory/store \
  -H "Authorization: Bearer dk-..." \
  -H "Content-Type: application/json" \
  -d '{
    "agent_id": "kb-assistant",
    "content": "SSO is available on Pro and Enterprise plans. To enable, go to Settings > Security > Single Sign-On and upload your IdP metadata XML.",
    "importance": 0.85,
    "memory_type": "semantic",
    "tags": ["doc", "sso", "security", "settings"],
    "metadata": {
      "source": "doc",
      "doc_id": "help-sso-setup",
      "version": "2024-Q4",
      "department": "IT"
    }
  }'

# 2. Store user context
curl -X POST http://localhost:3300/v1/memory/store \
  -H "Authorization: Bearer dk-..." \
  -H "Content-Type: application/json" \
  -d '{
    "agent_id": "kb-assistant",
    "content": "User is a Systems Administrator on the Pro plan. Prefers step-by-step instructions with CLI examples where available.",
    "importance": 0.95,
    "memory_type": "semantic",
    "tags": ["user-profile", "user:alice"]
  }'

# 3. Recall combines both, sorted by relevance
curl "http://localhost:3300/v1/memory/recall?agent_id=kb-assistant&query=how+do+I+enable+SSO&top_k=8" \
  -H "Authorization: Bearer dk-..."

# 4. Recall only docs (for explicit doc-only path)
curl "http://localhost:3300/v1/memory/recall?agent_id=kb-assistant&query=SSO+setup&top_k=5&tags=doc" \
  -H "Authorization: Bearer dk-..."

Python
import asyncio
from dakera import DakeraClient

client = DakeraClient(base_url="http://localhost:3300", api_key="dk-...")

# ── Step 1: Index documents at startup ──────────────────────────────────────
def index_document(doc_id: str, chunks: list[str], department: str, version: str):
    """Index a document as a series of memory chunks."""
    # Note: doc_id and version are accepted but not stored by this call.
    for chunk in chunks:
        client.store_memory(
            agent_id="kb-assistant",
            content=chunk,
            importance=0.82,
            memory_type="semantic",
            tags=["doc", department.lower()],
            # TTL: auto-expire doc chunks after 90 days, force re-index on update
            ttl_seconds=90 * 24 * 3600,
        )

# Index your SSO documentation
sso_chunks = [
    "SSO (Single Sign-On) is available on Pro and Enterprise plans only. Starter plan users cannot enable SSO.",
    "To enable SSO, navigate to Settings > Security > Single Sign-On. Upload your IdP metadata XML. Supported IdPs: Okta, Azure AD, Google Workspace, OneLogin.",
    "After uploading IdP metadata, test the SSO flow with a sandbox account before enabling for your entire organization. SSO enforcement will lock out users without IdP accounts.",
]
index_document("help-sso-setup", sso_chunks, "IT", "2024-Q4")

# ── Step 2: Store user preferences on first interaction ─────────────────────
client.store_memory(
    agent_id="kb-assistant",
    content="Alice Chen is a Systems Administrator on the Pro plan. Prefers step-by-step instructions. Has SAML experience. Uses Okta as IdP.",
    importance=0.95,
    memory_type="semantic",
    tags=["user-profile", "user:alice"],
)

# ── Step 3: Parallel retrieval on each query ────────────────────────────────
async def retrieve_context(query: str, user_tag: str) -> dict:
    """Run memory recall and doc recall in parallel."""
    # Both calls go to the same Dakera instance; the blocking client runs in
    # worker threads so the two requests overlap.
    mem_task = asyncio.to_thread(
        client.recall,
        agent_id="kb-assistant",
        query=query,
        top_k=4,
        min_importance=0.7,
        tags=[user_tag],  # user-specific memories only
    )
    doc_task = asyncio.to_thread(
        client.recall,
        agent_id="kb-assistant",
        query=query,
        top_k=6,
        min_importance=0.6,
        tags=["doc"],  # documentation chunks only
    )
    mem_results, doc_results = await asyncio.gather(mem_task, doc_task)
    return {"memories": mem_results["memories"], "docs": doc_results["memories"]}

# ── Step 4: Merge and rank results ──────────────────────────────────────────
def merge_context(memories: list, docs: list, token_budget: int = 2000) -> str:
    """Merge memory + doc results, rank by score, trim to token budget."""
    # Tag source for the LLM
    tagged = [
        {"content": m["content"], "score": m["score"], "source": "memory"}
        for m in memories
    ] + [
        {"content": d["content"], "score": d["score"], "source": "doc"}
        for d in docs
    ]
    # Sort by score descending; user memories win exact ties (memory source sorted first)
    tagged.sort(key=lambda x: (x["score"], x["source"] == "memory"), reverse=True)

    context_parts = []
    token_count = 0
    for item in tagged:
        tokens = len(item["content"].split()) * 1.3  # rough estimate
        if token_count + tokens > token_budget:
            break
        prefix = "[USER CONTEXT]" if item["source"] == "memory" else "[DOCUMENTATION]"
        context_parts.append(f"{prefix}\n{item['content']}")
        token_count += tokens
    return "\n".join(context_parts)

# ── Step 5: Crystallize learned facts back to memory ────────────────────────
def crystallize_fact(agent_id: str, fact: str, user_tag: str, ttl_days: int = 30):
    """Store a learned fact from RAG results into persistent memory."""
    # Check for near-duplicate first
    existing = client.search_memories(agent_id=agent_id, query=fact)
    if existing["memories"] and existing["memories"][0]["score"] > 0.85:
        return  # Already known, so skip
    client.store_memory(
        agent_id=agent_id,
        content=fact,
        importance=0.75,
        memory_type="semantic",
        tags=["learned", "doc-derived", user_tag],
        ttl_seconds=ttl_days * 24 * 3600,
    )

# Example usage
async def answer_question(user_query: str):
    context_data = await retrieve_context(user_query, "user:alice")
    context_str = merge_context(context_data["memories"], context_data["docs"])
    # Build system prompt with merged context
    system_prompt = (
        "You are a knowledge base assistant.\n"
        "Use the context below to answer accurately and personally.\n"
        f"{context_str}"
    )
    # ... call your LLM here ...
    # After response, crystallize high-value facts:
    for doc in context_data["docs"]:
        if doc["score"] > 0.90:  # High-confidence relevant doc chunk
            crystallize_fact("kb-assistant", doc["content"], "user:alice")

TypeScript
import { DakeraClient } from '@dakera-ai/dakera';
import Anthropic from '@anthropic-ai/sdk';

const client = new DakeraClient({ baseUrl: 'http://localhost:3300', apiKey: 'dk-...' });
const anthropic = new Anthropic();

// ── Index documentation chunks ──────────────────────────────────────────────
async function indexDocument(docId: string, chunks: string[], tags: string[]) {
  // Note: docId is accepted but not stored by this call.
  await Promise.all(chunks.map((chunk) =>
    client.storeMemory('kb-assistant', {
      content: chunk,
      importance: 0.82,
      memoryType: 'semantic',
      tags: ['doc', ...tags],
      ttl_seconds: 90 * 24 * 3600,
    })
  ));
}

// Index SSO docs
await indexDocument('help-sso-setup', [
  'SSO is available on Pro and Enterprise plans. Go to Settings > Security > SSO to enable.',
  'Supported IdPs: Okta, Azure AD, Google Workspace, OneLogin. Upload your IdP metadata XML.',
  'Test with a sandbox account before enforcing SSO organization-wide to avoid lockouts.',
], ['it', 'sso', 'security']);

// ── Store user profile ──────────────────────────────────────────────────────
await client.storeMemory('kb-assistant', {
  content: 'Alice Chen is a Systems Administrator on Pro plan. Uses Okta IdP. Prefers step-by-step instructions.',
  importance: 0.95,
  memoryType: 'semantic',
  tags: ['user-profile', 'user:alice'],
});

// ── Parallel retrieval ──────────────────────────────────────────────────────
async function retrieveContext(query: string, userTag: string) {
  // Both recalls run concurrently; results are split by tag on the client side.
  const [memResults, docResults] = await Promise.all([
    client.recall('kb-assistant', query, { top_k: 4, min_importance: 0.7, memory_type: 'semantic' }),
    client.recall('kb-assistant', query, { top_k: 6, min_importance: 0.6, memory_type: 'semantic' }),
  ]);
  return {
    memories: memResults.memories.filter((m: any) => m.tags?.includes(userTag)),
    docs: docResults.memories.filter((m: any) => m.tags?.includes('doc')),
  };
}

// ── Merge results ───────────────────────────────────────────────────────────
function mergeContext(
  memories: any[],
  docs: any[],
  tokenBudget = 2000
): string {
  const tagged = [
    ...memories.map(m => ({ ...m, source: 'memory' as const })),
    ...docs.map(d => ({ ...d, source: 'doc' as const })),
  ].sort((a, b) => b.score - a.score);

  const parts: string[] = [];
  let tokens = 0;
  for (const item of tagged) {
    const est = item.content.split(' ').length * 1.3; // rough estimate
    if (tokens + est > tokenBudget) break;
    const prefix = item.source === 'memory' ? '[USER CONTEXT]' : '[DOCUMENTATION]';
    parts.push(`${prefix}\n${item.content}`);
    tokens += est;
  }
  return parts.join('\n');
}

// ── Crystallize facts ───────────────────────────────────────────────────────
async function crystallizeFact(fact: string, userTag: string) {
  const existing = await client.searchMemories('kb-assistant', fact, { top_k: 1 });
  if (existing.memories[0]?.score > 0.85) return; // Already known
  await client.storeMemory('kb-assistant', {
    content: fact,
    importance: 0.75,
    memoryType: 'semantic',
    tags: ['learned', 'doc-derived', userTag],
    ttl_seconds: 30 * 24 * 3600,
  });
}

// ── Full pipeline ───────────────────────────────────────────────────────────
async function answerQuestion(userQuery: string) {
  const { memories, docs } = await retrieveContext(userQuery, 'user:alice');
  const context = mergeContext(memories, docs);
  const response = await anthropic.messages.create({
    model: 'claude-sonnet-4-6',
    max_tokens: 1024,
    system: `You are a knowledge base assistant. Use the context below.\n${context}`,
    messages: [{ role: 'user', content: userQuery }],
  });
  // Crystallize highly relevant doc chunks for faster future recall
  for (const doc of docs.filter((d: any) => d.score > 0.90)) {
    await crystallizeFact(doc.content, 'user:alice');
  }
  return response.content[0].type === 'text' ? response.content[0].text : '';
}

Rust
use dakera_rs::{Client, StoreMemoryRequest, RecallRequest};
use tokio::join;

// Wrapped in an async main so that `.await?` compiles; this assumes the
// client's error type implements std::error::Error.
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::new("http://localhost:3300", "dk-...");

    // Index document chunks
    client.store_memory("kb-assistant", StoreMemoryRequest {
        content: "SSO is available on Pro and Enterprise plans. Settings > Security > SSO.".into(),
        importance: Some(0.82),
        memory_type: Some("semantic".into()),
        tags: Some(vec!["doc".into(), "sso".into()]),
        ttl_seconds: Some(90 * 24 * 3600),
        ..Default::default()
    }).await?;

    // Store user context
    client.store_memory("kb-assistant", StoreMemoryRequest {
        content: "Alice is a sysadmin on Pro plan, uses Okta, prefers step-by-step guides.".into(),
        importance: Some(0.95),
        memory_type: Some("semantic".into()),
        tags: Some(vec!["user-profile".into(), "user:alice".into()]),
        ..Default::default()
    }).await?;

    // Parallel recall: user memories + doc chunks.
    // Note: neither request filters by tag here; filter the results by tag
    // (as in the TypeScript example) to separate user memories from doc chunks.
    let query = "How do I enable SSO?";
    let (mem_results, doc_results) = join!(
        client.recall("kb-assistant", RecallRequest {
            query: query.into(),
            top_k: Some(4),
            min_importance: Some(0.7),
            ..Default::default()
        }),
        client.recall("kb-assistant", RecallRequest {
            query: query.into(),
            top_k: Some(6),
            min_importance: Some(0.6),
            ..Default::default()
        })
    );
    let (mem, docs) = (mem_results?, doc_results?);

    // Build context string
    let mut context = String::new();
    for m in &mem.memories {
        context.push_str(&format!("[USER CONTEXT]\n{}\n", m.content));
    }
    for d in &docs.memories {
        context.push_str(&format!("[DOCUMENTATION]\n{}\n", d.content));
    }
    println!("Context for LLM:\n{}", context);
    Ok(())
}

Go
package main
import (
	"context"
	"fmt"
	"sync"

	dakera "github.com/dakera-ai/dakera-go"
)

func main() {
	client := dakera.NewClient("http://localhost:3300", "dk-...")
	ctx := context.Background()

	// Index documentation chunk
	client.StoreMemory(ctx, "kb-assistant", dakera.StoreMemoryRequest{
		Content:    "SSO available on Pro and Enterprise. Settings > Security > SSO. Upload IdP metadata XML.",
		Importance: 0.82,
		MemoryType: "semantic",
		Tags:       []string{"doc", "sso"},
		TTLSeconds: 90 * 24 * 3600,
	})

	// Store user profile
	client.StoreMemory(ctx, "kb-assistant", dakera.StoreMemoryRequest{
		Content:    "Alice Chen: sysadmin, Pro plan, Okta IdP, prefers step-by-step instructions.",
		Importance: 0.95,
		MemoryType: "semantic",
		Tags:       []string{"user-profile", "user:alice"},
	})

	// Parallel retrieval: each goroutine writes only its own result and error
	var (
		memResults, docResults *dakera.RecallResponse
		memErr, docErr         error
		wg                     sync.WaitGroup
	)
	query := "How do I enable SSO?"
	wg.Add(2)
	go func() {
		defer wg.Done()
		memResults, memErr = client.Recall(ctx, "kb-assistant", dakera.RecallRequest{
			Query:         query,
			TopK:          4,
			MinImportance: 0.7,
		})
	}()
	go func() {
		defer wg.Done()
		docResults, docErr = client.Recall(ctx, "kb-assistant", dakera.RecallRequest{
			Query:         query,
			TopK:          6,
			MinImportance: 0.6,
		})
	}()
	wg.Wait()

	// Stop early on failure instead of dereferencing a nil response
	if memErr != nil || docErr != nil {
		fmt.Println("recall failed:", memErr, docErr)
		return
	}

	// Build merged context
	contextStr := ""
	for _, m := range memResults.Memories {
		contextStr += fmt.Sprintf("[USER CONTEXT]\n%s\n", m.Content)
	}
	for _, d := range docResults.Memories {
		contextStr += fmt.Sprintf("[DOCUMENTATION]\n%s\n", d.Content)
	}
	fmt.Println("Merged context:", contextStr)
}

Before / After: Pure RAG vs. RAG + Dakera Memory
Pure RAG
Query: "How do I enable SSO?"
Retrieved chunks:
- SSO setup guide (generic)
- SSO troubleshooting (generic)
- IdP metadata format spec
Generated answer:
"To enable SSO, go to Settings > Security > Single Sign-On and upload your IdP metadata XML. Supported IdPs: Okta, Azure AD, Google Workspace, OneLogin."
Problems:
- Doesn't know Alice is already on Pro plan (no upsell friction)
- Doesn't know she uses Okta (misses Okta-specific steps)
- Same generic answer every time
- No memory of past questions (may answer the same thing 5x)

RAG + Dakera Memory
Query: "How do I enable SSO?"
Memory recall:
- Alice: Pro plan, Okta IdP, prefers step-by-step guides
- Alice asked about SSO last week; showed her the Settings page
Doc recall:
- SSO setup guide (generic)
- Okta-specific SAML config steps
- Pro plan feature confirmation
Generated answer:
"Since you're on Pro plan, SSO is available. For Okta specifically:
1. In Okta Admin, create a new SAML 2.0 application
2. Download the metadata XML
3. In Dakera Settings > Security > SSO, upload the XML
Last time we spoke about this you were on the Settings page; the SSO tab is in the left nav."
Personal, accurate, and contextual.
Edge Cases
1. Conflicting RAG vs. Memory Facts
A policy document might say "SSO requires Enterprise plan" while a crystallized memory from 6 months ago says "SSO is available on Pro plan" (because the policy changed). When scores are close and sources conflict, always prefer the document chunk — it reflects the current source of truth. Flag the stale memory for update using update_importance() to demote it.
# Detect conflict: doc says X, memory says Y on same topic
# Demote stale memory importance so doc wins future rankings
client.update_importance(
    agent_id="kb-assistant",
    memory_id="mem_stale_sso_policy",
    importance=0.2,  # demote so doc chunk ranks higher
)
2. Document Staleness After Updates
Documents change. Crystallized memories derived from outdated chunks persist until their TTL expires or you force-update them. Set TTL on all doc-derived memories to match your document update cadence. For fast-changing docs (weekly releases), use 7-day TTLs. For stable policy docs, 90 days is safe.
3. Chunking Strategy Affects Recall Quality
Chunks that are too small (under 100 tokens) lack enough context for accurate semantic matching. Chunks that are too large (over 800 tokens) dilute the embedding and return irrelevant sentences. The optimal range is 350–550 tokens with 50-token overlaps between adjacent chunks to prevent context loss at boundaries.
4. Token Budget Overflow
When both memory recall and doc retrieval return high-scoring results, the merged context can overflow the LLM's context window. Implement strict token budgets: allocate 40% to user memories (high personalization value), 50% to doc chunks (factual grounding), and reserve 10% for the system prompt and user message.
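The 40/50/10 split can be sketched as a small allocator plus a per-source trim. The function names are hypothetical, and the words-times-1.3 token estimate is the same rough heuristic used in the implementation examples; swap in a real tokenizer for production use.

```python
def split_budget(total_tokens: int, memory_share: float = 0.40,
                 doc_share: float = 0.50) -> dict[str, int]:
    """Divide a token budget; whatever is left is reserved for the prompt."""
    memory = round(total_tokens * memory_share)
    docs = round(total_tokens * doc_share)
    return {"memory": memory, "docs": docs, "reserved": total_tokens - memory - docs}


def estimate_tokens(text: str) -> float:
    """Rough estimate: about 1.3 tokens per whitespace-separated word."""
    return len(text.split()) * 1.3


def take_within_budget(items: list[dict], budget: int) -> list[dict]:
    """Keep the highest-scoring items until the next one would exceed the budget."""
    kept: list[dict] = []
    used = 0.0
    for item in sorted(items, key=lambda x: x["score"], reverse=True):
        cost = estimate_tokens(item["content"])
        if used + cost > budget:
            break
        kept.append(item)
        used += cost
    return kept
```

Trimming each source against its own share before merging keeps a burst of high-scoring doc chunks from crowding out every user memory, and vice versa.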
5. Cold Start: No User Memory Yet
On a user's first interaction, memory recall returns nothing. Fall back gracefully to doc-only retrieval with a wider top_k. After the first turn, store a minimal user profile memory to bootstrap personalization for the second turn. Never show degraded behavior to the user — the transition should be seamless.
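A minimal sketch of the cold-start fallback, assuming retrieval is injected as a plain callable so the logic can be exercised without a server. `retrieve_with_cold_start`, the `recall` callable's signature, and the wider `top_k` of 10 are all illustrative assumptions rather than Dakera SDK features.

```python
from typing import Callable

# Hypothetical signature: (query, top_k, tags) -> list of result dicts
RecallFn = Callable[[str, int, list[str]], list[dict]]


def retrieve_with_cold_start(query: str, user_tag: str, recall: RecallFn,
                             doc_top_k: int = 6, cold_start_top_k: int = 10) -> dict:
    """Recall user memories first; widen the doc search when none exist."""
    memories = recall(query, 4, [user_tag])
    cold_start = not memories
    docs = recall(query, cold_start_top_k if cold_start else doc_top_k, ["doc"])
    return {"memories": memories, "docs": docs, "cold_start": cold_start}
```

The `cold_start` flag gives the caller a hook to store a minimal profile memory after the first turn. This version runs the two recalls sequentially for clarity, which trades away the parallelism described earlier.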
Performance Considerations
| Operation | p50 | p99 | Optimization |
|---|---|---|---|
| Memory recall (user context) | 12ms | 28ms | Tag filter reduces search space |
| Doc chunk recall (1k chunks indexed) | 80ms | 160ms | Hybrid BM25+vector search |
| Doc chunk recall (50k chunks indexed) | 95ms | 220ms | Sub-linear scaling with HNSW index |
| Parallel recall (both paths) | 82ms | 165ms | Parallelism eliminates additive cost |
| Merge + deduplicate + rank | 5ms | 12ms | In-process; no network hop |
| Crystallization (store_memory) | 18ms | 40ms | Async post-response for zero UX impact |
Always run the crystallization store_memory() call after returning the response to the user, not before. It adds 18-40ms to the user-facing latency if you block on it. Queue it as a background task — the user should never wait for memory writes.
SDK Reference
| Method | SDK | Purpose |
|---|---|---|
| store_memory(agent_id, content, importance, memory_type, tags, ttl_seconds) | Python | Index doc chunks or store user memories |
| storeMemory(agentId, {content, importance, memoryType, tags, ttl_seconds}) | TypeScript | Index doc chunks or store user memories |
| recall(agent_id, query, top_k, min_importance) | Python | Retrieve ranked memories + doc chunks |
| recall(agentId, query, {top_k, min_importance, memory_type}) | TypeScript | Retrieve ranked memories + doc chunks |
| search_memories(agent_id, query) | Python | Semantic search to detect near-duplicates before crystallization |
| searchMemories(agentId, query, {top_k}) | TypeScript | Semantic search for deduplication check |
| update_importance(agent_id, memory_id, importance) | Python | Demote stale crystallized facts when docs are updated |
| updateImportance(agentId, request) | TypeScript | Demote stale memories |
| forget(agent_id, memory_id) | Python | Remove outdated doc chunks when document is replaced |
| batch_recall(request) | Python | Recall from multiple agent IDs in one call for multi-user scenarios |
Advanced Configuration: Chunking, TTL, and Index Tuning
Recommended chunking parameters
from dakera import DakeraClient
import tiktoken

client = DakeraClient(base_url="http://localhost:3300", api_key="dk-...")
enc = tiktoken.get_encoding("cl100k_base")

def chunk_document(text: str, chunk_tokens: int = 450, overlap_tokens: int = 50):
    """Chunk text with overlap to prevent context boundary loss."""
    # An overlap as large as the chunk would never advance the window
    if not 0 <= overlap_tokens < chunk_tokens:
        raise ValueError("overlap_tokens must be non-negative and smaller than chunk_tokens")
    tokens = enc.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + chunk_tokens, len(tokens))
        chunks.append(enc.decode(tokens[start:end]))
        if end == len(tokens):
            break  # last chunk reached; avoid a trailing chunk that is pure overlap
        start += chunk_tokens - overlap_tokens  # slide with overlap
    return chunks
TTL strategy by document type
# Changelog / release notes: short TTL — changes frequently
client.store_memory("kb-assistant", content=chunk, importance=0.8, ttl_seconds=7*24*3600)
# Policy documents: medium TTL — quarterly updates
client.store_memory("kb-assistant", content=chunk, importance=0.85, ttl_seconds=90*24*3600)
# Core product docs: long TTL — stable unless major version change
client.store_memory("kb-assistant", content=chunk, importance=0.85, ttl_seconds=365*24*3600)
# Crystallized facts learned from high-score RAG results: 30 days
client.store_memory("kb-assistant", content=fact, importance=0.75, ttl_seconds=30*24*3600)
Hybrid search tuning
# In docker/.env: tune BM25 vs vector weight for doc retrieval
# Higher BM25 weight = better for keyword-heavy technical docs
# Higher vector weight = better for conversational / semantic queries
DAKERA_HYBRID_BM25_WEIGHT=0.4
DAKERA_HYBRID_VECTOR_WEIGHT=0.6
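To build intuition for what these two weights trade off, the sketch below assumes the final score is a plain weighted sum of a normalized BM25 score and a normalized vector score. That fusion formula is an assumption for illustration; how Dakera actually combines the two signals is not stated here.

```python
def hybrid_score(bm25: float, vector: float,
                 bm25_weight: float = 0.4, vector_weight: float = 0.6) -> float:
    """Weighted sum of two scores, each assumed to be normalized to [0, 1]."""
    return bm25_weight * bm25 + vector_weight * vector


# Toy results: one strong on exact keywords, one strong on semantic similarity
keyword_hit = {"bm25": 0.9, "vector": 0.3}
semantic_hit = {"bm25": 0.2, "vector": 0.8}


def winner(bm25_weight: float, vector_weight: float) -> str:
    """Name the toy result that ranks first under the given weights."""
    k = hybrid_score(keyword_hit["bm25"], keyword_hit["vector"], bm25_weight, vector_weight)
    s = hybrid_score(semantic_hit["bm25"], semantic_hit["vector"], bm25_weight, vector_weight)
    return "keyword_hit" if k > s else "semantic_hit"
```

Under this assumption, the default 0.4/0.6 weights rank the semantic match first, while flipping to 0.7/0.3 ranks the keyword match first, which matches the guidance in the comments above.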
Combine your docs with your users' context
Dakera makes it trivial to index document chunks alongside agent memories and retrieve both in a single ranked call — no separate vector store required.