LoCoMo Benchmark · v0.11.54

Dakera on LoCoMo: 87.6%
Full methodology, category breakdown, and reproduction steps

Dakera scores 87.6% on the full LoCoMo benchmark — 50 sessions, 1,540 questions, no LLM reranking or synthesis. Here is every number, broken down by category, with the evaluation code so you can verify it yourself.

Dakera v0.11.54: 87.6% overall (full dataset, 50 sessions, 1,540 questions, no LLM reranking)
Single-hop (Cat1): 86.9% (direct fact retrieval)
Multi-hop (Cat2): 85.4% (cross-session reasoning chains)
Open-domain (Cat4): 91.0% (mixed topics and entities)

Four question categories

LoCoMo tests four distinct recall challenges. Dakera leads on three; temporal inference is our hardest category and an active area of improvement.

Single-hop recall (Cat1): 86.9%
Direct facts from recent or distant sessions. Tests faithful storage and retrieval, the baseline for any memory system.

Open-domain recall (Cat4): 91.0%
Free-form and mixed recall questions spanning multiple topics and entities. Tests the breadth and versatility of the retrieval system. Improved by hybrid retrieval and knowledge graph traversal.

Multi-hop reasoning (Cat2): 85.4%
Questions requiring inference across two or more stored memories. Tests whether recall chains compound correctly. Entity graph traversal enables multi-hop inference.

Temporal inference (Cat3): 73.9%
Questions about sequences, durations, and "before/after" relationships. Tests temporal reasoning across stored memories. The hardest category industry-wide and an active roadmap item.
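Hybrid retrieval of the kind mentioned above is commonly implemented by fusing a lexical (BM25) ranking with a vector (HNSW) ranking. The sketch below uses reciprocal rank fusion (RRF) over toy rankings; it illustrates the general technique, not Dakera's actual scoring code, and the memory IDs are made up.

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: combine several ranked lists of memory ids.
    Each id scores sum(1 / (k + rank)) over every list it appears in."""
    scores = {}
    for ranking in rankings:
        for rank, mem_id in enumerate(ranking, start=1):
            scores[mem_id] = scores.get(mem_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# Toy example: memories found by both retrievers rise above
# memories found by only one of them.
bm25_hits = ["m1", "m2", "m3"]    # lexical ranking
vector_hits = ["m4", "m2", "m1"]  # embedding ranking
fused = rrf_fuse([bm25_hits, vector_hits])
```

RRF is a common fusion choice because it needs no score normalization between the two retrievers, only their ranks.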

Dakera's LoCoMo scores — full breakdown

All numbers below are Dakera's own results on the full LoCoMo dataset (v0.11.54, May 2026). Reproducible via dakera-bench.

Category | Score | Questions | Description
Overall | 87.6% | 1,540 (full dataset, 50 sessions) | Standard single-pass evaluation, no LLM reranking
Cat1 — Single-hop recall | 86.9% | 282 | Direct facts from recent or distant sessions
Cat2 — Multi-hop reasoning | 85.4% | 321 | Cross-session reasoning chains requiring multiple recall steps
Cat3 — Temporal inference | 73.9% | 92 | Time-anchored questions — our most challenging category, actively improving
Cat4 — Open-domain | 91.0% | 841 | Mixed topics and entities spanning multiple sessions

Evaluation methodology

All scores use the full LoCoMo dataset — 50 simulated long-term conversations, 1,540 questions. We do not use sampled subsets. Retrieval is standard single-pass: no LLM reranking and no post-processing synthesis step. Results are scored by an LLM judge using the standard LoCoMo evaluation framework.

Version: Dakera v0.11.54, evaluated May 2026. The benchmark harness is open source at github.com/Dakera-AI/dakera-bench. Run it yourself with one command.

How we evaluate

We publish our evaluation methodology in full so you can reproduce, audit, or challenge the results.

Dataset: Full LoCoMo benchmark — 50 simulated long-term conversations, 1,540 questions across all four categories. We do not use sampled subsets. Two percentage points on a 100-question eval is statistical noise; on 1,540 questions it represents real signal.
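The sample-size claim can be made concrete with the standard error of a binomial proportion, sqrt(p(1-p)/n). This back-of-envelope check is ours, not part of the official harness:

```python
import math

def stderr(p, n):
    """Standard error of an observed accuracy p over n binary-scored questions."""
    return math.sqrt(p * (1 - p) / n)

p = 0.876
# A ~95% confidence half-width is roughly 1.96 standard errors.
half_width_100 = 1.96 * stderr(p, 100)    # ≈ 0.065, i.e. ±6.5 points
half_width_1540 = 1.96 * stderr(p, 1540)  # ≈ 0.016, i.e. ±1.6 points
```

At n = 100 a two-point gap sits well inside the ±6.5-point band; at n = 1,540 it sits outside the ±1.6-point band.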

1. Ingest conversations — Each LoCoMo conversation is ingested into Dakera via the POST /v1/memory API. Session boundaries are preserved. No preprocessing or summarization.
2. Run recall queries — Each of the 1,540 questions is issued as a POST /v1/recall query. Dakera returns its top-k memories using HNSW + BM25 hybrid retrieval with temporal re-ranking.
3. LLM judge scoring — Retrieved memories are passed to an LLM judge alongside the original question and the ground-truth answer. The judge scores recall accuracy on a binary correct/incorrect basis. We use the standard LoCoMo evaluation prompt.
4. Aggregate and report — Scores are aggregated per category and overall. The 87.6% figure is the unweighted mean across all 1,540 questions. Category-level scores are approximate (±1%) due to question distribution variance.
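The steps above can be sketched end to end. The endpoint paths (POST /v1/memory, POST /v1/recall) come from this page; the client helper, payload field names, and injectable judge are illustrative assumptions, not the published dakera-bench harness.

```python
import json
from urllib import request

DAKERA_URL = "http://localhost:8080"  # assumed local deployment

def post(path, payload, send=None):
    """POST a JSON payload; `send` can be stubbed out for offline testing."""
    if send is None:
        def send(url, body):
            req = request.Request(url, data=body,
                                  headers={"Content-Type": "application/json"})
            return json.load(request.urlopen(req))
    return send(DAKERA_URL + path, json.dumps(payload).encode())

def ingest(sessions, send=None):
    # Step 1: ingest each conversation, preserving session boundaries.
    for session in sessions:
        post("/v1/memory", {"session_id": session["id"],
                            "turns": session["turns"]}, send)

def evaluate(questions, judge, send=None):
    # Steps 2-4: recall, judge each answer binary, aggregate per category.
    per_cat, correct = {}, 0
    for q in questions:
        memories = post("/v1/recall", {"query": q["question"], "top_k": 10}, send)
        ok = judge(q["question"], q["answer"], memories)  # 1 correct, 0 incorrect
        correct += ok
        hits = per_cat.setdefault(q["category"], [0, 0])
        hits[0] += ok
        hits[1] += 1
    overall = correct / len(questions)
    return overall, {cat: h / n for cat, (h, n) in per_cat.items()}
```

`overall` is the unweighted mean over all judged questions, matching step 4.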

The full evaluation script and dataset ingestion pipeline are documented in our benchmark methodology post. Reproducibility is a first-class requirement — if you find a discrepancy, open an issue on GitHub.

What 87.6% means for your agent

Benchmark scores translate directly to agent behavior. Memory recall failures cause agents to repeat questions, forget context, and give inconsistent answers.

Fewer repeated questions

At 87.6% recall, your agent remembers what the user told it — across sessions. No "as I mentioned earlier" failures.

Long-horizon context

91.0% open-domain recall means agents carry context across days, weeks, and months — not just within a single conversation.

Production latency

Sub-10ms recall at P99, with no post-retrieval LLM reranking pass. Your agent gets the right memory fast enough to use it in real time.
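P99 figures like this are easy to check against your own deployment. A minimal sketch: the timing loop and nearest-rank percentile below are ours, and `recall_fn` stands in for whatever client call hits your instance's recall endpoint.

```python
import math
import time

def p99(latencies_ms):
    """99th percentile by the nearest-rank method."""
    ordered = sorted(latencies_ms)
    rank = max(0, math.ceil(0.99 * len(ordered)) - 1)
    return ordered[rank]

def measure(recall_fn, queries):
    """Time each recall call and report the P99 latency in milliseconds."""
    samples = []
    for q in queries:
        start = time.perf_counter()
        recall_fn(q)  # e.g. a POST /v1/recall call against your instance
        samples.append((time.perf_counter() - start) * 1000.0)
    return p99(samples)
```

Run it over the full 1,540-question set so the tail of the distribution is actually populated; P99 over a handful of queries is meaningless.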

Run the benchmark yourself

Dakera is open core — the engine, SDKs, CLI, and MCP server are MIT-licensed and self-hostable. Spin up a local instance, ingest the LoCoMo dataset, and reproduce these results against your own deployment.