Tool Usage Learning
Agents with access to multiple tools make costly mistakes — calling expensive external APIs when a cached lookup suffices, using slow tools when fast alternatives exist, or repeatedly failing with the wrong tool type. This pattern teaches your agent to remember what works by persisting every tool invocation outcome and recalling successful strategies before each new decision.
- Dakera instance running (quickstart)
- SDK installed: pip install dakera / npm i @dakera-ai/dakera
- An agent with two or more tools available (search, code execution, database lookup, etc.)
- A mechanism to capture tool execution results (success/failure, latency, output quality)
The Problem
Without memory, multi-tool agents are stateless selectors — they pick tools using only the current prompt and model priors. This leads to three failure modes:
- Repeated failures: The agent tries the same failing tool strategy across sessions because it has no record of prior failures.
- Cost inefficiency: A web search API costs 100x more than a cached knowledge recall, but the agent can't learn which queries are best served locally.
- Latency spikes: Without knowing that database_lookup returns in 30ms while web_scraper takes 8 seconds, the agent can't optimize for speed.
The result is an agent that never improves — it makes the same suboptimal choices on day 100 that it made on day 1.
How It Works
Every tool invocation produces a structured outcome record: the tool name, the task type that triggered it, the result quality, and the latency. These outcomes are stored in Dakera with importance scores proportional to the quality of the result. Before choosing a tool for a new task, the agent recalls the most relevant prior outcomes and uses the success-rate pattern to inform selection.
Tool success rate tracker: outcomes recalled for "deployment tasks" rank tools by historical effectiveness before each selection decision.
Feedback-to-memory loop: every tool execution writes a structured outcome record that improves the next tool selection for similar tasks.
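As a rough sketch of the aggregation described above, the snippet below turns a list of recalled outcome records into per-tool success rates. The `Outcome` dataclass and its field names are illustrative assumptions, not part of the Dakera SDK; in practice the fields would come from whatever metadata you stored with each memory.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """Hypothetical in-memory shape of one recalled tool outcome."""
    tool: str
    task_type: str
    success: bool
    latency_ms: int


def success_rates(outcomes: list[Outcome], task_type: str) -> dict[str, float]:
    """Return the fraction of successful invocations per tool for one task type."""
    totals: dict[str, int] = defaultdict(int)
    wins: dict[str, int] = defaultdict(int)
    for o in outcomes:
        if o.task_type != task_type:
            continue  # ignore outcomes recorded for other kinds of tasks
        totals[o.tool] += 1
        if o.success:
            wins[o.tool] += 1
    return {tool: wins[tool] / totals[tool] for tool in totals}


# Sample history (made-up values for illustration only)
history = [
    Outcome("kubectl_deploy", "deploy", True, 1900),
    Outcome("kubectl_deploy", "deploy", True, 2100),
    Outcome("ansible_run", "deploy", False, 45000),
    Outcome("ansible_run", "deploy", True, 30000),
    Outcome("web_search", "search", True, 800),
]
rates = success_rates(history, "deploy")
ranked = sorted(rates, key=rates.get, reverse=True)  # best tool first
```

Filtering by task type before counting keeps a tool's record on one kind of task from leaking into decisions about another.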
Implementation Steps
- Instrument tool calls to capture structured outcomes: Wrap every tool invocation with timing and result capture. Record success/failure, latency in milliseconds, and a short description of the outcome. This is the raw signal your memory will learn from.
- Store outcomes with importance proportional to result quality: Successful, fast outcomes get importance 0.8–0.95. Failed or slow outcomes get 0.4–0.6, low enough not to dominate recall but high enough to serve as negative signal. Always include the task type in the content for semantic search.
- Recall tool history before each decision: Before selecting a tool, call recall() with a query describing the current task type. Retrieve the top 5–10 prior outcomes. Parse the memories to compute per-tool success rates and average latencies.
- Score and rank available tools: Compute a composite score per tool: score = success_rate * 0.7 + speed_score * 0.3, where speed_score = max(0, 1 - avg_latency_ms / 60_000), so faster tools score higher. Present the top-ranked tool to the LLM as the recommended selection, but let the model override if the task has unusual characteristics.
- Decay stale outcomes over time: Tools change: a previously broken API gets fixed, a once-fast service degrades. Use ttl_seconds (e.g. 30 days = 2,592,000s) on outcome records so the agent naturally forgets stale patterns and re-evaluates tools after significant time.
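The first step, instrumenting tool calls, might look roughly like the wrapper below. It is a minimal sketch: `run_instrumented` and the `Recorder` callback signature are hypothetical names introduced here, and an in-memory list stands in for the real store call so the example runs without a Dakera instance.

```python
import time
from typing import Any, Callable, Optional

# Hypothetical recorder signature: (tool, task_type, outcome, notes, latency_ms, error)
Recorder = Callable[[str, str, str, str, int, Optional[str]], None]


def run_instrumented(
    tool: str,
    task_type: str,
    fn: Callable[..., Any],
    recorder: Recorder,
    *args: Any,
    **kwargs: Any,
) -> Any:
    """Call fn, time it, and report success or failure to recorder."""
    start = time.perf_counter()
    try:
        result = fn(*args, **kwargs)
    except Exception as exc:
        latency_ms = int((time.perf_counter() - start) * 1000)
        recorder(tool, task_type, "failure",
                 f"{tool} raised {type(exc).__name__}", latency_ms, str(exc))
        raise  # the caller still sees the original error
    latency_ms = int((time.perf_counter() - start) * 1000)
    recorder(tool, task_type, "success", f"{tool} completed", latency_ms, None)
    return result


# In-memory recorder standing in for the real memory store
log: list[tuple] = []


def fake_recorder(tool, task_type, outcome, notes, latency_ms, error):
    log.append((tool, task_type, outcome, error))


value = run_instrumented(
    "database_lookup", "lookup", lambda key: key.upper(), fake_recorder, "abc"
)


def broken_tool():
    raise TimeoutError("no response")


try:
    run_instrumented("web_scraper", "lookup", broken_tool, fake_recorder)
except TimeoutError:
    pass  # failure was recorded before the error propagated
```

Passing the recorder in as a callback keeps the timing logic independent of the storage backend, so the same wrapper can be unit-tested without network access.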
Store deployment outcomes under agent_id="devops-deploy", search outcomes under agent_id="devops-search". This prevents cross-contamination where deployment success rates pollute search tool recommendations, and keeps recall fast by limiting the memory pool per query.
Implementation
# Store a successful tool outcome
curl -X POST http://localhost:3300/v1/memory/store \
-H "Authorization: Bearer dk-..." \
-H "Content-Type: application/json" \
-d '{
"agent_id": "devops-agent",
"content": "kubectl_deploy tool succeeded for production deployment task. Deployed v2.4.1 to prod cluster in 1.9s. Zero errors.",
"importance": 0.88,
"memory_type": "semantic",
"tags": ["tool:kubectl_deploy", "task:deploy", "outcome:success"],
"metadata": {
"tool": "kubectl_deploy",
"task_type": "production_deploy",
"outcome": "success",
"latency_ms": 1900,
"environment": "production"
}
}'
# Store a failure outcome (lower importance -- still useful as negative signal)
curl -X POST http://localhost:3300/v1/memory/store \
-H "Authorization: Bearer dk-..." \
-H "Content-Type: application/json" \
-d '{
"agent_id": "devops-agent",
"content": "ansible_run tool failed for production deployment task. Timeout after 45s. SSH key rotation may have broken connectivity.",
"importance": 0.45,
"memory_type": "semantic",
"tags": ["tool:ansible_run", "task:deploy", "outcome:failure"],
"metadata": {
"tool": "ansible_run",
"task_type": "production_deploy",
"outcome": "failure",
"latency_ms": 45000,
"error": "SSH_TIMEOUT"
}
}'
# Recall tool outcomes before choosing tool for a deployment task
curl "http://localhost:3300/v1/memory/recall?agent_id=devops-agent&query=tool+outcomes+for+production+deployment+tasks&top_k=8&min_importance=0.3" \
-H "Authorization: Bearer dk-..."

import time
from typing import Optional

from dakera import DakeraClient

client = DakeraClient(base_url="http://localhost:3300", api_key="dk-...")


def record_tool_outcome(
    agent_id: str,
    tool: str,
    task_type: str,
    outcome: str,  # "success" | "failure" | "partial"
    notes: str,
    latency_ms: int,
    error: Optional[str] = None,
) -> None:
    """Persist a tool invocation outcome for future learning."""
    # Success = high importance, failure = low (still recalled as negative signal)
    importance_map = {"success": 0.88, "partial": 0.65, "failure": 0.42}
    importance = importance_map.get(outcome, 0.6)
    content = (
        f"{tool} tool {outcome} for {task_type} task. "
        f"{notes}. Latency: {latency_ms}ms."
    )
    metadata = {"tool": tool, "task_type": task_type, "outcome": outcome, "latency_ms": latency_ms}
    if error:
        content += f" Error: {error}."
        metadata["error"] = error
    client.store_memory(
        agent_id=agent_id,
        content=content,
        importance=importance,
        memory_type="semantic",
        tags=[f"tool:{tool}", f"task:{task_type}", f"outcome:{outcome}"],
        # get_best_tool() reads these fields back. Assumption: store_memory accepts a
        # metadata argument mirroring the "metadata" field of the REST payload above.
        metadata=metadata,
        ttl_seconds=2_592_000,  # 30 days -- stale patterns decay naturally
    )


def get_best_tool(agent_id: str, task_type: str, available_tools: list[str]) -> str:
    """Recall tool history and return the highest-scoring tool for this task type."""
    memories = client.recall(
        agent_id=agent_id,
        query=f"tool outcomes for {task_type} tasks",
        top_k=10,
        min_importance=0.3,
    )
    # Aggregate success rates per tool
    tool_stats: dict[str, dict] = {
        t: {"successes": 0, "total": 0, "total_latency": 0} for t in available_tools
    }
    for mem in memories.get("memories", []):
        meta = mem.get("metadata") or {}
        tool_name = meta.get("tool")
        if tool_name not in tool_stats:
            continue  # outcome for a tool that is not currently available
        tool_stats[tool_name]["total"] += 1
        if meta.get("outcome") == "success":
            tool_stats[tool_name]["successes"] += 1
        tool_stats[tool_name]["total_latency"] += meta.get("latency_ms", 5000)
    # Score: 70% success rate + 30% speed factor (normalized)
    best_tool = available_tools[0]
    best_score = -1.0
    for tool, stats in tool_stats.items():
        if stats["total"] == 0:
            continue  # No history -- let model decide
        success_rate = stats["successes"] / stats["total"]
        avg_latency = stats["total_latency"] / stats["total"]
        speed_score = max(0, 1 - (avg_latency / 60_000))  # normalize to 60s max
        score = success_rate * 0.7 + speed_score * 0.3
        if score > best_score:
            best_score = score
            best_tool = tool
    return best_tool


# --- Usage in a DevOps agent ---
TOOLS = ["kubectl_deploy", "ansible_run", "terraform_apply", "manual_ssh"]


def deploy_to_production(service: str, version: str) -> dict:
    """Deploy a service using the historically most effective tool."""
    best = get_best_tool("devops-agent", "production_deploy", TOOLS)
    print(f"Using {best} based on historical success rate")
    start = time.time()
    try:
        # result = execute_tool(best, service=service, version=version)
        result = {"status": "success", "deployed": f"{service}:{version}"}
        latency_ms = int((time.time() - start) * 1000)
        record_tool_outcome(
            "devops-agent", best, "production_deploy",
            "success",
            f"Deployed {service} v{version} to production cluster",
            latency_ms,
        )
        return result
    except Exception as e:
        latency_ms = int((time.time() - start) * 1000)
        record_tool_outcome(
            "devops-agent", best, "production_deploy",
            "failure",
            f"Failed to deploy {service} v{version}",
            latency_ms,
            error=str(e),
        )
        raise

import { DakeraClient } from '@dakera-ai/dakera';
const client = new DakeraClient({ baseUrl: 'http://localhost:3300', apiKey: 'dk-...' });

type Outcome = 'success' | 'failure' | 'partial';

interface ToolOutcome {
  tool: string;
  taskType: string;
  outcome: Outcome;
  notes: string;
  latencyMs: number;
  error?: string;
}

async function recordToolOutcome(agentId: string, o: ToolOutcome): Promise<void> {
  const importanceMap: Record<Outcome, number> = {
    success: 0.88,
    partial: 0.65,
    failure: 0.42,
  };
  const content = [
    `${o.tool} tool ${o.outcome} for ${o.taskType} task.`,
    o.notes,
    `Latency: ${o.latencyMs}ms.`,
    o.error ? `Error: ${o.error}.` : '',
  ].filter(Boolean).join(' ');
  await client.storeMemory(agentId, {
    content,
    importance: importanceMap[o.outcome],
    memoryType: 'semantic',
    tags: [`tool:${o.tool}`, `task:${o.taskType}`, `outcome:${o.outcome}`],
    // getBestTool() reads these fields back. Assumption: storeMemory accepts a
    // metadata object mirroring the "metadata" field of the REST payload.
    metadata: { tool: o.tool, task_type: o.taskType, outcome: o.outcome, latency_ms: o.latencyMs },
    ttl_seconds: 2_592_000, // 30 days
  });
}

async function getBestTool(
  agentId: string,
  taskType: string,
  availableTools: string[]
): Promise<string> {
  const result = await client.recall(
    agentId,
    `tool outcomes for ${taskType} tasks`,
    { top_k: 10, min_importance: 0.3 }
  );
  const stats: Record<string, { successes: number; total: number; totalLatency: number }> = {};
  for (const tool of availableTools) {
    stats[tool] = { successes: 0, total: 0, totalLatency: 0 };
  }
  for (const mem of result.memories) {
    const tool = mem.metadata?.tool as string;
    if (!stats[tool]) continue; // outcome for a tool that is not currently available
    stats[tool].total++;
    if (mem.metadata?.outcome === 'success') stats[tool].successes++;
    stats[tool].totalLatency += (mem.metadata?.latency_ms as number) ?? 5000;
  }
  // Score: 70% success rate + 30% speed factor (normalized to 60s max)
  let bestTool = availableTools[0];
  let bestScore = -1;
  for (const [tool, s] of Object.entries(stats)) {
    if (s.total === 0) continue; // no history for this tool
    const successRate = s.successes / s.total;
    const speedScore = Math.max(0, 1 - s.totalLatency / s.total / 60_000);
    const score = successRate * 0.7 + speedScore * 0.3;
    if (score > bestScore) { bestScore = score; bestTool = tool; }
  }
  return bestTool;
}

// --- Usage ---
const tools = ['kubectl_deploy', 'ansible_run', 'terraform_apply'];

async function deployToProduction(service: string, version: string) {
  const best = await getBestTool('devops-agent', 'production_deploy', tools);
  console.log(`Using ${best} (highest historical success rate)`);
  const start = Date.now();
  try {
    // const result = await executeTool(best, { service, version });
    const latencyMs = Date.now() - start;
    await recordToolOutcome('devops-agent', {
      tool: best,
      taskType: 'production_deploy',
      outcome: 'success',
      notes: `Deployed ${service} v${version} to production`,
      latencyMs,
    });
  } catch (err) {
    const latencyMs = Date.now() - start;
    await recordToolOutcome('devops-agent', {
      tool: best,
      taskType: 'production_deploy',
      outcome: 'failure',
      notes: `Failed to deploy ${service} v${version}`,
      latencyMs,
      error: String(err),
    });
    throw err;
  }
}

use dakera_rs::{Client, StoreMemoryRequest, RecallRequest};
use std::collections::HashMap;

async fn record_tool_outcome(
    client: &Client,
    agent_id: &str,
    tool: &str,
    task_type: &str,
    outcome: &str,
    notes: &str,
    latency_ms: u64,
) -> anyhow::Result<()> {
    let importance = match outcome {
        "success" => 0.88,
        "partial" => 0.65,
        _ => 0.42,
    };
    let content = format!(
        "{} tool {} for {} task. {}. Latency: {}ms.",
        tool, outcome, task_type, notes, latency_ms
    );
    client.store_memory(agent_id, StoreMemoryRequest {
        content,
        importance: Some(importance),
        memory_type: "semantic".into(),
        tags: vec![
            format!("tool:{}", tool),
            format!("task:{}", task_type),
            format!("outcome:{}", outcome),
        ],
        ttl_seconds: Some(2_592_000),
        // NOTE: the scoring code below reads tool, outcome, and latency_ms from
        // metadata, so also populate the request's metadata field here (its exact
        // type is not shown in this guide).
        ..Default::default()
    }).await?;
    Ok(())
}

// Recall and score tools before selection
async fn select_best_tool(client: &Client) -> anyhow::Result<String> {
    let memories = client.recall("devops-agent", RecallRequest {
        query: "tool outcomes for production deployment tasks".into(),
        top_k: Some(10),
        min_importance: Some(0.3),
        ..Default::default()
    }).await?;

    // Aggregate per-tool statistics: (successes, total, total_latency_ms)
    let mut tool_stats: HashMap<String, (u32, u32, u64)> = HashMap::new();
    for mem in &memories.memories {
        if let Some(meta) = &mem.metadata {
            let tool = meta.get("tool").and_then(|v| v.as_str()).unwrap_or("unknown");
            let outcome = meta.get("outcome").and_then(|v| v.as_str()).unwrap_or("");
            let latency = meta.get("latency_ms").and_then(|v| v.as_u64()).unwrap_or(5000);
            let entry = tool_stats.entry(tool.to_string()).or_insert((0, 0, 0));
            entry.1 += 1;
            entry.2 += latency;
            if outcome == "success" { entry.0 += 1; }
        }
    }

    // Choose best tool by composite score
    let best_tool = tool_stats.iter()
        .filter(|(_, (_, total, _))| *total > 0)
        .map(|(tool, (successes, total, total_latency))| {
            let success_rate = *successes as f64 / *total as f64;
            let avg_latency = *total_latency as f64 / *total as f64;
            let speed_score = (1.0 - avg_latency / 60_000.0).max(0.0);
            let score = success_rate * 0.7 + speed_score * 0.3;
            (tool, score)
        })
        .max_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
        .map(|(tool, _)| tool.clone())
        .unwrap_or_else(|| "kubectl_deploy".to_string()); // fallback when there is no history
    println!("Selecting {} based on historical outcomes", best_tool);
    Ok(best_tool)
}

// Usage, from inside an async context:
// let client = Client::new("http://localhost:3300", "dk-...");
// let best_tool = select_best_tool(&client).await?;

package main
import (
	"context"
	"fmt"

	dakera "github.com/dakera-ai/dakera-go"
)

func recordToolOutcome(ctx context.Context, client *dakera.Client, agentID, tool, taskType, outcome, notes string, latencyMs int64) error {
	importanceMap := map[string]float64{
		"success": 0.88,
		"partial": 0.65,
		"failure": 0.42,
	}
	importance, ok := importanceMap[outcome]
	if !ok {
		importance = 0.5
	}
	content := fmt.Sprintf("%s tool %s for %s task. %s. Latency: %dms.", tool, outcome, taskType, notes, latencyMs)
	_, err := client.StoreMemory(ctx, agentID, dakera.StoreMemoryRequest{
		Content:    content,
		Importance: importance,
		MemoryType: "semantic",
		Tags:       []string{fmt.Sprintf("tool:%s", tool), fmt.Sprintf("task:%s", taskType), fmt.Sprintf("outcome:%s", outcome)},
		TTLSeconds: 2_592_000,
		Metadata: map[string]interface{}{
			"tool":       tool,
			"task_type":  taskType,
			"outcome":    outcome,
			"latency_ms": latencyMs,
		},
	})
	return err
}

func getBestTool(ctx context.Context, client *dakera.Client, agentID, taskType string, tools []string) string {
	result, err := client.Recall(ctx, agentID, dakera.RecallRequest{
		Query: fmt.Sprintf("tool outcomes for %s tasks", taskType),
		TopK:  10,
	})
	if err != nil {
		return tools[0] // recall failed: fall back to the first configured tool
	}
	type stats struct {
		successes, total int
		totalLatency     int64
	}
	toolStats := map[string]*stats{}
	for _, t := range tools {
		toolStats[t] = &stats{}
	}
	for _, mem := range result.Memories {
		tool, _ := mem.Metadata["tool"].(string)
		if _, ok := toolStats[tool]; !ok {
			continue // outcome for a tool that is not currently available
		}
		toolStats[tool].total++
		if outcome, _ := mem.Metadata["outcome"].(string); outcome == "success" {
			toolStats[tool].successes++
		}
		// JSON numbers decode as float64
		if lat, ok := mem.Metadata["latency_ms"].(float64); ok {
			toolStats[tool].totalLatency += int64(lat)
		}
	}
	// Score: 70% success rate + 30% speed factor (normalized to 60s max)
	best, bestScore := tools[0], -1.0
	for tool, s := range toolStats {
		if s.total == 0 {
			continue // no history for this tool
		}
		successRate := float64(s.successes) / float64(s.total)
		avgLatency := float64(s.totalLatency) / float64(s.total)
		speedScore := max(0, 1-avgLatency/60_000)
		score := successRate*0.7 + speedScore*0.3
		if score > bestScore {
			bestScore = score
			best = tool
		}
	}
	return best
}

func max(a, b float64) float64 {
	if a > b {
		return a
	}
	return b
}
Real-World Scenario: DevOps Agent
A DevOps automation agent at a mid-size SaaS company manages deployments across 12 microservices. It starts with three deployment tools, kubectl_deploy, ansible_run, and terraform_apply, and gains a fourth, helm_upgrade, in week 4.
Without tool learning, the agent defaults to ansible_run (its training prior) even though the company migrated to Kubernetes six months ago. With tool learning:
- Week 1: Agent stores 47 deployment outcomes. kubectl_deploy accumulates a 91% success rate; ansible_run drops to 34% after repeated SSH timeout failures.
- Week 2: The agent automatically routes all production deploys to kubectl_deploy. Mean deployment time drops from 28s to 4.2s.
- Week 4: A new tool helm_upgrade is added for chart-based services. The agent starts routing helm chart deployments correctly within 8 invocations as it learns the new specialization.
- Month 3: terraform_apply records fail-then-recover patterns on Fridays (change freeze). The agent learns to prefer other tools on Friday afternoons without any manual configuration.
Before / After Memory State
// Memory store: empty
// Agent must pick tools
// based on LLM priors alone
// Every session resets:
// "Which tool should I use
// for production deploys?"
// No record of:
// - ansible_run failing 66%
// - kubectl_deploy at 91%
// - Friday change freeze
// Result: random/prior-based
// tool selection, repeated
// failures, $2,400/mo in
// failed API retries
// Memory store contains:
{
"content": "kubectl_deploy succeeded for prod deploy. Latency 1.9s.",
"importance": 0.88,
"tags": ["tool:kubectl_deploy", "outcome:success"]
}
{
"content": "ansible_run failed for prod deploy. SSH timeout 45s.",
"importance": 0.42,
"tags": ["tool:ansible_run", "outcome:failure"]
}
// Agent recalls 8 outcomes,
// computes: kubectl=91% success
// Selects kubectl automatically.
// Deployment time: 4.2s avg
SDK Method Reference
| Method | SDK | Purpose in this pattern |
|---|---|---|
| store_memory(agent_id, content, importance, memory_type, tags, ttl_seconds) | Python | Persist tool outcome with structured metadata |
| recall(agent_id, query, top_k, min_importance) | Python | Retrieve past tool outcomes for a task type |
| storeMemory(agentId, {content, importance, memoryType, tags, ttl_seconds}) | TypeScript | Persist tool outcome with TTL for decay |
| recall(agentId, query, {top_k, min_importance}) | TypeScript | Retrieve historical outcomes before tool selection |
| client.store_memory("agent", StoreMemoryRequest{...}).await? | Rust | Async outcome persistence with tags |
| client.recall("agent", RecallRequest{...}).await? | Rust | Async recall of tool history |
| client.StoreMemory(ctx, "agent", StoreMemoryRequest{...}) | Go | Store outcome with metadata map |
| client.Recall(ctx, "agent", RecallRequest{...}) | Go | Recall tool history for scoring |
Edge Cases and Gotchas
- Cold start (no history yet): When an agent has fewer than 3 outcomes per tool, success rates are statistically unreliable. Default to uniform random selection or your configured prior until you have at least 5 data points per tool.
- Environment drift: A tool that worked in staging may fail in production. Always include the environment (staging, production) in both the content and metadata. Query separately: "tool outcomes for production_deploy tasks" vs "staging_deploy".
- Correlated failures: Multiple tools can fail simultaneously due to an infrastructure outage (not tool quality). If all tools show failure outcomes within a 30-minute window, flag this as an infrastructure event rather than updating individual tool scores.
- Tool retirement: When you decommission a tool, use forget() or batch_forget() to remove its historical outcomes. Stale memories of a retired tool will pollute recall results until their TTL expires.
- Importance inflation: If every successful outcome gets importance 0.9, the memory store fills with high-importance entries and recall becomes less discriminating. Vary importance by outcome quality: a 500ms success scores higher than a 55-second success.
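One way to avoid importance inflation is to scale a success's importance by how fast it was. The function below is a sketch under stated assumptions: `outcome_importance` is a hypothetical helper, the 0.80 to 0.95 range comes from the implementation steps, the fixed values for partial and failed outcomes mirror the importance map in the Python example, and the 60-second cap matches the normalization used in the scoring code.

```python
def outcome_importance(outcome: str, latency_ms: int, max_latency_ms: int = 60_000) -> float:
    """Scale importance by speed so that not every success lands on the same value."""
    if outcome != "success":
        # Fixed values for non-successes, as in the importance map used earlier
        return {"partial": 0.65, "failure": 0.42}.get(outcome, 0.6)
    # 1.0 means instant, 0.0 means at or beyond the latency cap
    speed = max(0.0, 1.0 - latency_ms / max_latency_ms)
    return round(0.80 + 0.15 * speed, 3)  # successes span 0.80 to 0.95


fast = outcome_importance("success", 500)      # 500ms success
slow = outcome_importance("success", 55_000)   # 55-second success
```

A linear ramp is the simplest choice; if most of your tools finish in under a second, a logarithmic scale over latency may spread the values more usefully.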
Tool learning improves efficiency, but for operations with irreversible consequences (database drops, production deletions), always require explicit human confirmation regardless of historical success rates. Memory tells you what worked before — it cannot predict novel failure modes.
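A guard along these lines can enforce that rule regardless of what the recalled history recommends. This is only a sketch: the tool names in `IRREVERSIBLE_TOOLS` and the `confirm` callback are hypothetical, and a real deployment would wire `confirm` to an approval workflow rather than a lambda.

```python
from typing import Callable

# Hypothetical set of tools whose effects cannot be undone
IRREVERSIBLE_TOOLS = {"drop_database", "delete_production_volume"}


def choose_with_guard(recommended: str, confirm: Callable[[str], bool]) -> str:
    """Return the recommended tool, but require approval for irreversible ones."""
    if recommended in IRREVERSIBLE_TOOLS and not confirm(recommended):
        raise PermissionError(f"{recommended} requires human confirmation")
    return recommended


# Reversible tools pass straight through, even if the approver would say no
safe_choice = choose_with_guard("kubectl_deploy", lambda tool: False)

# Irreversible tools are blocked unless the approver says yes
try:
    choose_with_guard("drop_database", lambda tool: False)
    blocked = False
except PermissionError:
    blocked = True
```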
Performance Considerations
- Recall adds ~12ms to task processing at top_k=10, negligible vs. any real tool execution time.
- After ~50 stored outcomes across 3–5 tools, selection accuracy stabilizes above 85% in benchmarks.
- Use min_importance=0.3 to include failure signals; dropping below 0.3 returns mostly noise.
- Cap top_k at 15–20 for tool selection; higher values add latency without improving decision quality.
- Set ttl_seconds=2_592_000 (30 days) on all tool outcome memories to prevent outdated patterns from persisting indefinitely.
Advanced Configuration: Weighted scoring and A/B testing
For production DevOps agents, you can implement multi-armed bandit-style exploration to occasionally test lower-ranked tools and discover improvements:
import random

EXPLORATION_RATE = 0.1  # 10% of the time, try a non-optimal tool


def select_tool_with_exploration(agent_id: str, task_type: str, tools: list) -> str:
    best = get_best_tool(agent_id, task_type, tools)
    if random.random() < EXPLORATION_RATE:
        # Explore: pick a random tool (excluding the known best)
        others = [t for t in tools if t != best]
        return random.choice(others) if others else best
    # Exploit: use the historically best tool
    return best


# Separate scoring weights for latency-sensitive tasks
def get_latency_optimized_tool(agent_id: str, task_type: str, tools: list) -> str:
    """When speed matters more than reliability, shift the scoring weight."""
    # Modify scoring: 40% success rate + 60% speed (vs default 70/30)
    # Implement by adjusting your score calculation in get_best_tool
    raise NotImplementedError("stub: adjust the weights in get_best_tool")


# Use a separate agent_id per team to segregate memory (avoid cross-team contamination)
def record_team_outcome(team: str, **kwargs):
    record_tool_outcome(
        agent_id=f"devops-{team}",  # "devops-platform", "devops-backend"
        **kwargs
    )
The exploration rate of 10% ensures you discover tool improvements within ~20 invocations while maintaining reliable performance for the other 90%.
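The weighted-scoring idea can be sketched independently of the SDK by making the success weight a parameter. In the example below, `score_tools` and `pick` are hypothetical helpers, the `stats` dictionary uses the same keys as the per-tool statistics in the Python example, and the sample numbers are invented to show how a 40/60 weighting can flip the ranking.

```python
import random
from typing import Optional


def score_tools(
    stats: dict[str, dict[str, float]],
    success_weight: float = 0.7,
    max_latency_ms: float = 60_000,
) -> dict[str, float]:
    """Composite score per tool; speed gets whatever weight success_weight leaves over."""
    scores: dict[str, float] = {}
    for tool, s in stats.items():
        if s["total"] == 0:
            continue  # no history for this tool
        success_rate = s["successes"] / s["total"]
        avg_latency = s["total_latency"] / s["total"]
        speed_score = max(0.0, 1.0 - avg_latency / max_latency_ms)
        scores[tool] = success_rate * success_weight + speed_score * (1.0 - success_weight)
    return scores


def pick(scores: dict[str, float], epsilon: float = 0.1,
         rng: Optional[random.Random] = None) -> str:
    """Epsilon-greedy choice: usually the top scorer, occasionally another tool."""
    rng = rng or random.Random()
    best = max(scores, key=scores.get)
    others = [t for t in scores if t != best]
    if others and rng.random() < epsilon:
        return rng.choice(others)  # explore
    return best  # exploit


# Invented sample statistics: one reliable but slow tool, one fast but flaky tool
stats = {
    "reliable_slow": {"successes": 9, "total": 10, "total_latency": 300_000},
    "fast_flaky": {"successes": 6, "total": 10, "total_latency": 20_000},
}
default_scores = score_tools(stats)                      # 70% success, 30% speed
latency_scores = score_tools(stats, success_weight=0.4)  # 40% success, 60% speed
default_best = max(default_scores, key=default_scores.get)
latency_best = max(latency_scores, key=latency_scores.get)
```

Note that `pick` expects at least one scored tool; with no history at all, fall back to your configured prior before calling it.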