LoCoMo Benchmark · v0.11.54

Dakera on LoCoMo: 87.6%
Full methodology, category breakdown, and reproduction steps

Dakera scores 87.6% on the full LoCoMo benchmark — 50 sessions, 1,540 questions, no LLM reranking or synthesis. Here is every number, broken down by category, with the evaluation code so you can verify it yourself.

Dakera v0.11.54: 87.6% overall (full dataset, 50 sessions, 1,540 questions, no LLM reranking)
Single-hop (Cat1): 86.9% (direct fact retrieval)
Multi-hop (Cat2): 85.4% (cross-session reasoning chains)
Open-domain (Cat4): 91.0% (mixed topics and entities)

Four question categories

LoCoMo tests four distinct recall challenges. Dakera leads on three; temporal inference is our hardest category and an active area of improvement.

Single-hop recall (Cat1): 86.9%
Direct facts from recent or distant sessions. Tests faithful storage and retrieval, the baseline for any memory system.

Open-domain recall (Cat4): 91.0%
Free-form and mixed recall questions spanning multiple topics and entities. Tests the breadth and versatility of the retrieval system. Improved by hybrid retrieval and knowledge graph traversal.

Multi-hop reasoning (Cat2): 85.4%
Questions requiring inference across two or more stored memories. Tests whether recall chains compound correctly. Entity graph traversal enables multi-hop inference.

Temporal inference (Cat3): 73.9%
Questions about sequences, durations, and "before/after" relationships. Tests temporal reasoning across stored memories. The hardest category industry-wide and an active roadmap item.
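Hybrid retrieval of the kind mentioned above is commonly implemented by fusing a lexical (BM25) ranking with a vector (HNSW) ranking. The sketch below uses reciprocal rank fusion (RRF) over toy rankings; it illustrates the general technique, not Dakera's actual scoring code, and the memory IDs are made up.

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: combine several ranked lists of memory ids.
    Each id scores sum(1 / (k + rank)) over every list it appears in."""
    scores = {}
    for ranking in rankings:
        for rank, mem_id in enumerate(ranking, start=1):
            scores[mem_id] = scores.get(mem_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# Toy example: memories found by both retrievers rise above
# memories found by only one of them.
bm25_hits = ["m1", "m2", "m3"]    # lexical ranking
vector_hits = ["m4", "m2", "m1"]  # embedding ranking
fused = rrf_fuse([bm25_hits, vector_hits])
```

RRF is a common fusion choice because it needs no score normalization between the two retrievers, only their ranks.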

Dakera's LoCoMo scores — full breakdown

All numbers below are Dakera's own results on the full LoCoMo dataset (v0.11.54, May 2026). Reproducible via dakera-bench.

Category | Score | Questions | Description
Overall | 87.6% | 1,540 (full dataset, 50 sessions) | Standard single-pass evaluation, no LLM reranking
Cat1 — Single-hop recall | 86.9% | 282 | Direct facts from recent or distant sessions
Cat2 — Multi-hop reasoning | 85.4% | 321 | Cross-session reasoning chains requiring multiple recall steps
Cat3 — Temporal inference | 73.9% | 92 | Time-anchored questions — our most challenging category, actively improving
Cat4 — Open-domain | 91.0% | 841 | Mixed topics and entities spanning multiple sessions

Evaluation methodology

All scores use the full LoCoMo dataset — 50 simulated long-term conversations, 1,540 questions. We do not use sampled subsets. Retrieval is standard single-pass: no LLM reranking and no post-processing synthesis step. Results are scored by an LLM judge using the standard LoCoMo evaluation framework.

Version: Dakera v0.11.54, evaluated May 2026. The benchmark harness is open source at github.com/Dakera-AI/dakera-bench. Run it yourself with one command.

How we evaluate

We publish our evaluation methodology in full so you can reproduce, audit, or challenge the results.

Dataset: Full LoCoMo benchmark — 50 simulated long-term conversations, 1,540 questions across all four categories. We do not use sampled subsets. Two percentage points on a 100-question eval is statistical noise; on 1,540 questions it represents real signal.
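The sample-size claim can be made concrete with the standard error of a binomial proportion, sqrt(p(1-p)/n). This back-of-envelope check is ours, not part of the official harness:

```python
import math

def stderr(p, n):
    """Standard error of an observed accuracy p over n binary-scored questions."""
    return math.sqrt(p * (1 - p) / n)

p = 0.876
# A ~95% confidence half-width is roughly 1.96 standard errors.
half_width_100 = 1.96 * stderr(p, 100)    # ≈ 0.065, i.e. ±6.5 points
half_width_1540 = 1.96 * stderr(p, 1540)  # ≈ 0.016, i.e. ±1.6 points
```

At n = 100 a two-point gap sits well inside the ±6.5-point band; at n = 1,540 it sits outside the ±1.6-point band.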

1. Ingest conversations — Each LoCoMo conversation is ingested into Dakera via the POST /v1/memory API. Session boundaries are preserved. No preprocessing or summarization.
2. Run recall queries — Each of the 1,540 questions is issued as a POST /v1/recall query. Dakera returns its top-k memories using HNSW + BM25 hybrid retrieval with temporal re-ranking.
3. LLM judge scoring — Retrieved memories are passed to an LLM judge alongside the original question and the ground-truth answer. The judge scores recall accuracy on a binary correct/incorrect basis. We use the standard LoCoMo evaluation prompt.
4. Aggregate and report — Scores are aggregated per category and overall. The 87.6% figure is the unweighted mean across all 1,540 questions. Category-level scores are approximate (±1%) due to question distribution variance.
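The steps above can be sketched end to end. The endpoint paths (POST /v1/memory, POST /v1/recall) come from this page; the client helper, payload field names, and injectable judge are illustrative assumptions, not the published dakera-bench harness.

```python
import json
from urllib import request

DAKERA_URL = "http://localhost:8080"  # assumed local deployment

def post(path, payload, send=None):
    """POST a JSON payload; `send` can be stubbed out for offline testing."""
    if send is None:
        def send(url, body):
            req = request.Request(url, data=body,
                                  headers={"Content-Type": "application/json"})
            return json.load(request.urlopen(req))
    return send(DAKERA_URL + path, json.dumps(payload).encode())

def ingest(sessions, send=None):
    # Step 1: ingest each conversation, preserving session boundaries.
    for session in sessions:
        post("/v1/memory", {"session_id": session["id"],
                            "turns": session["turns"]}, send)

def evaluate(questions, judge, send=None):
    # Steps 2-4: recall, judge each answer binary, aggregate per category.
    per_cat, correct = {}, 0
    for q in questions:
        memories = post("/v1/recall", {"query": q["question"], "top_k": 10}, send)
        ok = judge(q["question"], q["answer"], memories)  # 1 correct, 0 incorrect
        correct += ok
        hits = per_cat.setdefault(q["category"], [0, 0])
        hits[0] += ok
        hits[1] += 1
    overall = correct / len(questions)
    return overall, {cat: h / n for cat, (h, n) in per_cat.items()}
```

`overall` is the unweighted mean over all judged questions, matching step 4.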

The full evaluation script and dataset ingestion pipeline are documented in our benchmark methodology post. Reproducibility is a first-class requirement — if you find a discrepancy, open an issue on GitHub.

What 87.6% means for your agent

Benchmark scores translate directly to agent behavior. Memory recall failures cause agents to repeat questions, forget context, and give inconsistent answers.

Fewer repeated questions

At 87.6% recall, your agent remembers what the user told it — across sessions. No "as I mentioned earlier" failures.

Long-horizon context

91.0% open-domain recall means agents carry context across days, weeks, and months — not just within a single conversation.

Production latency

Sub-10ms recall at P99, with no post-retrieval LLM reranking pass. Your agent gets the right memory fast enough to use it in real time.
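P99 figures like this are easy to check against your own deployment. A minimal sketch: the timing loop and nearest-rank percentile below are ours, and `recall_fn` stands in for whatever client call hits your instance's recall endpoint.

```python
import math
import time

def p99(latencies_ms):
    """99th percentile by the nearest-rank method."""
    ordered = sorted(latencies_ms)
    rank = max(0, math.ceil(0.99 * len(ordered)) - 1)
    return ordered[rank]

def measure(recall_fn, queries):
    """Time each recall call and report the P99 latency in milliseconds."""
    samples = []
    for q in queries:
        start = time.perf_counter()
        recall_fn(q)  # e.g. a POST /v1/recall call against your instance
        samples.append((time.perf_counter() - start) * 1000.0)
    return p99(samples)
```

Run it over the full 1,540-question set so the tail of the distribution is actually populated; P99 over a handful of queries is meaningless.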

Run the benchmark yourself

Dakera is open core — the engine, SDKs, CLI, and MCP server are MIT-licensed and self-hostable. Spin up a local instance, ingest the LoCoMo dataset, and reproduce these results against your own deployment.