Intermediate Agent Behavior

Tool Usage Learning

⏱ ~30 min to implement 📦 Requires: Dakera v0.11+

Agents with access to multiple tools make costly mistakes — calling expensive external APIs when a cached lookup suffices, using slow tools when fast alternatives exist, or repeatedly failing with the wrong tool type. This pattern teaches your agent to remember what works by persisting every tool invocation outcome and recalling successful strategies before each new decision.

Prerequisites

Dakera instance running (quickstart)
SDK installed: pip install dakera / npm i @dakera-ai/dakera
An agent with two or more tools available (search, code execution, database lookup, etc.)
A mechanism to capture tool execution results (success/failure, latency, output quality)

The Problem

Without memory, multi-tool agents are stateless selectors — they pick tools using only the current prompt and model priors. This leads to three failure modes:

Repeated failures: The agent tries the same failing tool strategy across sessions because it has no record of prior failures.
Cost inefficiency: A web search API costs 100x more than a cached knowledge recall, but the agent can't learn which queries are best served locally.
Latency spikes: Without knowing that database_lookup returns in 30ms while web_scraper takes 8 seconds, the agent can't optimize for speed.

The result is an agent that never improves — it makes the same suboptimal choices on day 100 that it made on day 1.

How It Works

Every tool invocation produces a structured outcome record: the tool name, the task type that triggered it, the result quality, and the latency. These outcomes are stored in Dakera with importance scores proportional to the quality of the result. Before choosing a tool for a new task, the agent recalls the most relevant prior outcomes and uses the success-rate pattern to inform selection.
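
As a minimal sketch of the outcome record described above, the four fields can be modeled as a small dataclass that renders itself into the text and metadata that get stored. The names `ToolOutcome`, `to_content`, and `to_metadata` are hypothetical and not part of the Dakera SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolOutcome:
    """Hypothetical record type for one tool invocation (not a Dakera SDK class)."""

    tool: str
    task_type: str
    outcome: str  # "success" | "partial" | "failure"
    latency_ms: int

    def to_content(self) -> str:
        # Keep the task type in the text so semantic recall can match on it.
        return (
            f"{self.tool} tool {self.outcome} for {self.task_type} task. "
            f"Latency: {self.latency_ms}ms."
        )

    def to_metadata(self) -> dict:
        # Structured copy of the same fields, for exact filtering and scoring.
        return {
            "tool": self.tool,
            "task_type": self.task_type,
            "outcome": self.outcome,
            "latency_ms": self.latency_ms,
        }
```

Keeping the same fields in both the free text and the metadata lets recall match semantically on the text while the scoring code reads exact values from the metadata.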

Figure: Tool success rate tracker. Outcomes recalled for "deployment tasks" rank tools by historical effectiveness before each selection decision.

Figure: Feedback-to-memory loop. Every tool execution writes a structured outcome record that improves the next tool selection for similar tasks.

Implementation Steps

Instrument tool calls to capture structured outcomes

Wrap every tool invocation with timing and result capture. Record success/failure, latency in milliseconds, and a short description of the outcome. This is the raw signal your memory will learn from.

Store outcomes with importance proportional to result quality

Successful, fast outcomes get importance 0.8–0.95. Failed or slow outcomes get 0.4–0.6: low enough not to dominate recall, but high enough to serve as negative signal. Always include the task type in the content for semantic search.

Recall tool history before each decision

Before selecting a tool, call recall() with a query describing the current task type. Retrieve the top 5–10 prior outcomes. Parse the memories to compute per-tool success rates and average latencies.

Score and rank available tools

Compute a composite score per tool: score = success_rate * 0.7 + speed_score * 0.3, where speed_score = max(0, 1 - avg_latency_ms / 60000) normalizes latency against a 60-second ceiling. Present the top-ranked tool to the LLM as the recommended selection, but let the model override if the task has unusual characteristics.

Decay stale outcomes over time

Tools change: a previously broken API gets fixed, a once-fast service degrades. Use ttl_seconds (e.g. 30 days = 2,592,000s) on outcome records so the agent naturally forgets stale patterns and re-evaluates tools after significant time.
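
The first step above (instrumenting tool calls) can be sketched as a decorator that times a call and reports the outcome to a callback. This is an illustration under assumptions: `instrumented` and the `sink` callback are hypothetical names, and the sink's argument order is chosen here only so that it lines up with the `record_tool_outcome` helper in the Implementation section.

```python
import time
from functools import wraps
from typing import Any, Callable, Optional

# Hypothetical sink signature: (tool, task_type, outcome, notes, latency_ms, error)
OutcomeSink = Callable[[str, str, str, str, int, Optional[str]], None]


def instrumented(tool: str, task_type: str, sink: OutcomeSink):
    """Wrap a tool function so every call reports success/failure and latency."""

    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        @wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            start = time.perf_counter()  # monotonic clock, safe for durations
            try:
                result = fn(*args, **kwargs)
            except Exception as exc:
                latency_ms = int((time.perf_counter() - start) * 1000)
                sink(tool, task_type, "failure", f"{fn.__name__} raised", latency_ms, str(exc))
                raise  # never swallow the original error
            latency_ms = int((time.perf_counter() - start) * 1000)
            sink(tool, task_type, "success", f"{fn.__name__} completed", latency_ms, None)
            return result

        return wrapper

    return decorator
```

A sink such as `lambda *a: record_tool_outcome("devops-agent", *a)` would forward each record to memory, while a list-appending sink is convenient in tests.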

Tip: Separate namespaces per tool category

Store deployment outcomes under agent_id="devops-deploy", search outcomes under agent_id="devops-search". This prevents cross-contamination where deployment success rates pollute search tool recommendations, and keeps recall fast by limiting the memory pool per query.
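
One way to keep those namespaces consistent is to derive the agent_id from the tool category in a single place. The helper below is a hypothetical sketch (`namespace_for` is not an SDK function); it only builds the string that you would pass as agent_id.

```python
def namespace_for(category: str, prefix: str = "devops") -> str:
    """Build an agent_id such as 'devops-deploy' from a tool category."""
    slug = category.strip().lower().replace(" ", "-")
    if not slug:
        raise ValueError("category must be non-empty")
    return f"{prefix}-{slug}"
```

Calling `namespace_for("deploy")` yields the `devops-deploy` namespace from the tip, so store and recall calls for the same category cannot drift apart through typos.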

Implementation

# Store a successful tool outcome
curl -X POST http://localhost:3300/v1/memory/store \
  -H "Authorization: Bearer dk-..." \
  -H "Content-Type: application/json" \
  -d '{
    "agent_id": "devops-agent",
    "content": "kubectl_deploy tool succeeded for production deployment task. Deployed v2.4.1 to prod cluster in 1.9s. Zero errors.",
    "importance": 0.88,
    "memory_type": "semantic",
    "tags": ["tool:kubectl_deploy", "task:deploy", "outcome:success"],
    "metadata": {
      "tool": "kubectl_deploy",
      "task_type": "production_deploy",
      "outcome": "success",
      "latency_ms": 1900,
      "environment": "production"
    }
  }'

# Store a failure outcome (lower importance -- still useful as negative signal)
curl -X POST http://localhost:3300/v1/memory/store \
  -H "Authorization: Bearer dk-..." \
  -H "Content-Type: application/json" \
  -d '{
    "agent_id": "devops-agent",
    "content": "ansible_run tool failed for production deployment task. Timeout after 45s. SSH key rotation may have broken connectivity.",
    "importance": 0.45,
    "memory_type": "semantic",
    "tags": ["tool:ansible_run", "task:deploy", "outcome:failure"],
    "metadata": {
      "tool": "ansible_run",
      "task_type": "production_deploy",
      "outcome": "failure",
      "latency_ms": 45000,
      "error": "SSH_TIMEOUT"
    }
  }'

# Recall tool outcomes before choosing tool for a deployment task
curl "http://localhost:3300/v1/memory/recall?agent_id=devops-agent&query=tool+outcomes+for+production+deployment+tasks&top_k=8&min_importance=0.3" \
  -H "Authorization: Bearer dk-..."

import time
from dakera import DakeraClient
from typing import Optional

client = DakeraClient(base_url="http://localhost:3300", api_key="dk-...")

def record_tool_outcome(
    agent_id: str,
    tool: str,
    task_type: str,
    outcome: str,      # "success" | "failure" | "partial"
    notes: str,
    latency_ms: int,
    error: Optional[str] = None
) -> None:
    """Persist a tool invocation outcome for future learning."""
    # Success = high importance, failure = low (still recalled as negative signal)
    importance_map = {"success": 0.88, "partial": 0.65, "failure": 0.42}
    importance = importance_map.get(outcome, 0.6)

    content = (
        f"{tool} tool {outcome} for {task_type} task. "
        f"{notes}. Latency: {latency_ms}ms."
    )
    if error:
        content += f" Error: {error}."

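    client.store_memory(
        agent_id=agent_id,
        content=content,
        importance=importance,
        memory_type="semantic",
        tags=[f"tool:{tool}", f"task:{task_type}", f"outcome:{outcome}"],
        # get_best_tool() reads these fields back from metadata. This assumes
        # store_memory accepts a metadata mapping like the REST "metadata" field.
        metadata={
            "tool": tool,
            "task_type": task_type,
            "outcome": outcome,
            "latency_ms": latency_ms,
        },
        ttl_seconds=2_592_000  # 30 days -- stale patterns decay naturally
    )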

def get_best_tool(agent_id: str, task_type: str, available_tools: list[str]) -> str:
    """Recall tool history and return the highest-scoring tool for this task type."""
    memories = client.recall(
        agent_id=agent_id,
        query=f"tool outcomes for {task_type} tasks",
        top_k=10,
        min_importance=0.3
    )

    # Aggregate success rates per tool
    tool_stats: dict[str, dict] = {t: {"successes": 0, "total": 0, "total_latency": 0} for t in available_tools}

    for mem in memories.get("memories", []):
        meta = mem.get("metadata") or {}  # tolerate a missing or null metadata field
        tool_name = meta.get("tool")
        if tool_name not in tool_stats:
            continue
        tool_stats[tool_name]["total"] += 1
        if meta.get("outcome") == "success":
            tool_stats[tool_name]["successes"] += 1
        tool_stats[tool_name]["total_latency"] += meta.get("latency_ms", 5000)

    # Score: 70% success rate + 30% speed factor (normalized)
    best_tool = available_tools[0]
    best_score = -1.0
    for tool, stats in tool_stats.items():
        if stats["total"] == 0:
            continue  # No history for this tool, so it cannot be scored
        success_rate = stats["successes"] / stats["total"]
        avg_latency = stats["total_latency"] / stats["total"]
        speed_score = max(0, 1 - (avg_latency / 60_000))  # normalize to 60s max
        score = success_rate * 0.7 + speed_score * 0.3
        if score > best_score:
            best_score = score
            best_tool = tool

    return best_tool

# --- Usage in a DevOps agent ---
TOOLS = ["kubectl_deploy", "ansible_run", "terraform_apply", "manual_ssh"]

def deploy_to_production(service: str, version: str) -> dict:
    """Deploy a service using the historically most effective tool."""
    best = get_best_tool("devops-agent", "production_deploy", TOOLS)
    print(f"Using {best} based on historical success rate")

    start = time.time()
    try:
        # result = execute_tool(best, service=service, version=version)
        result = {"status": "success", "deployed": f"{service}:{version}"}
        latency_ms = int((time.time() - start) * 1000)

        record_tool_outcome(
            "devops-agent", best, "production_deploy",
            "success",
            f"Deployed {service} v{version} to production cluster",
            latency_ms
        )
        return result
    except Exception as e:
        latency_ms = int((time.time() - start) * 1000)
        record_tool_outcome(
            "devops-agent", best, "production_deploy",
            "failure",
            f"Failed to deploy {service} v{version}",
            latency_ms,
            error=str(e)
        )
        raise

import { DakeraClient } from '@dakera-ai/dakera';

const client = new DakeraClient({ baseUrl: 'http://localhost:3300', apiKey: 'dk-...' });

type Outcome = 'success' | 'failure' | 'partial';

interface ToolOutcome {
  tool: string;
  taskType: string;
  outcome: Outcome;
  notes: string;
  latencyMs: number;
  error?: string;
}

async function recordToolOutcome(agentId: string, o: ToolOutcome): Promise<void> {
  const importanceMap: Record<Outcome, number> = {
    success: 0.88,
    partial: 0.65,
    failure: 0.42,
  };

  const content = [
    `${o.tool} tool ${o.outcome} for ${o.taskType} task.`,
    o.notes,
    `Latency: ${o.latencyMs}ms.`,
    o.error ? `Error: ${o.error}.` : '',
  ].filter(Boolean).join(' ');

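  await client.storeMemory(agentId, {
    content,
    importance: importanceMap[o.outcome],
    memoryType: 'semantic',
    tags: [`tool:${o.tool}`, `task:${o.taskType}`, `outcome:${o.outcome}`],
    // getBestTool() reads these fields back from metadata. This assumes
    // storeMemory accepts a metadata object like the REST "metadata" field.
    metadata: {
      tool: o.tool,
      task_type: o.taskType,
      outcome: o.outcome,
      latency_ms: o.latencyMs,
    },
    ttl_seconds: 2_592_000, // 30 days
  });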
}

async function getBestTool(
  agentId: string,
  taskType: string,
  availableTools: string[]
): Promise<string> {
  const result = await client.recall(
    agentId,
    `tool outcomes for ${taskType} tasks`,
    { top_k: 10, min_importance: 0.3 }
  );

  const stats: Record<string, { successes: number; total: number; totalLatency: number }> = {};
  for (const tool of availableTools) {
    stats[tool] = { successes: 0, total: 0, totalLatency: 0 };
  }

  for (const mem of result.memories) {
    const tool = mem.metadata?.tool as string;
    if (!stats[tool]) continue;
    stats[tool].total++;
    if (mem.metadata?.outcome === 'success') stats[tool].successes++;
    stats[tool].totalLatency += (mem.metadata?.latency_ms as number) ?? 5000;
  }

  let bestTool = availableTools[0];
  let bestScore = -1;

  for (const [tool, s] of Object.entries(stats)) {
    if (s.total === 0) continue;
    const successRate = s.successes / s.total;
    const speedScore = Math.max(0, 1 - s.totalLatency / s.total / 60_000);
    const score = successRate * 0.7 + speedScore * 0.3;
    if (score > bestScore) { bestScore = score; bestTool = tool; }
  }

  return bestTool;
}

// --- Usage ---
const tools = ['kubectl_deploy', 'ansible_run', 'terraform_apply'];

async function deployToProduction(service: string, version: string) {
  const best = await getBestTool('devops-agent', 'production_deploy', tools);
  console.log(`Using ${best} (highest historical success rate)`);

  const start = Date.now();
  try {
    // const result = await executeTool(best, { service, version });
    const latencyMs = Date.now() - start;
    await recordToolOutcome('devops-agent', {
      tool: best,
      taskType: 'production_deploy',
      outcome: 'success',
      notes: `Deployed ${service} v${version} to production`,
      latencyMs,
    });
  } catch (err) {
    const latencyMs = Date.now() - start;
    await recordToolOutcome('devops-agent', {
      tool: best,
      taskType: 'production_deploy',
      outcome: 'failure',
      notes: `Failed to deploy ${service} v${version}`,
      latencyMs,
      error: String(err),
    });
    throw err;
  }
}

use dakera_rs::{Client, StoreMemoryRequest, RecallRequest};
use std::collections::HashMap;
use std::time::Instant;

let client = Client::new("http://localhost:3300", "dk-...");

async fn record_tool_outcome(
    client: &Client,
    agent_id: &str,
    tool: &str,
    task_type: &str,
    outcome: &str,
    notes: &str,
    latency_ms: u64,
) -> anyhow::Result<()> {
    let importance = match outcome {
        "success" => 0.88,
        "partial" => 0.65,
        _ => 0.42,
    };

    let content = format!(
        "{} tool {} for {} task. {}. Latency: {}ms.",
        tool, outcome, task_type, notes, latency_ms
    );

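    client.store_memory(agent_id, StoreMemoryRequest {
        content,
        importance: Some(importance),
        memory_type: "semantic".into(),
        tags: vec![
            format!("tool:{}", tool),
            format!("task:{}", task_type),
            format!("outcome:{}", outcome),
        ],
        // The recall code below reads these fields back from metadata.
        // Assumption: StoreMemoryRequest has an optional JSON `metadata` field
        // mirroring the REST "metadata" field; this needs the serde_json crate.
        metadata: Some(serde_json::json!({
            "tool": tool,
            "task_type": task_type,
            "outcome": outcome,
            "latency_ms": latency_ms,
        })),
        ttl_seconds: Some(2_592_000),
        ..Default::default()
    }).await?;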

    Ok(())
}

// Recall and score tools before selection
let memories = client.recall("devops-agent", RecallRequest {
    query: "tool outcomes for production deployment tasks".into(),
    top_k: Some(10),
    min_importance: Some(0.3),
    ..Default::default()
}).await?;

// Aggregate per-tool statistics
let mut tool_stats: HashMap<String, (u32, u32, u64)> = HashMap::new(); // (successes, total, total_latency_ms)
for mem in &memories.memories {
    if let Some(meta) = &mem.metadata {
        let tool = meta.get("tool").and_then(|v| v.as_str()).unwrap_or("unknown");
        let outcome = meta.get("outcome").and_then(|v| v.as_str()).unwrap_or("");
        let latency = meta.get("latency_ms").and_then(|v| v.as_u64()).unwrap_or(5000);
        let entry = tool_stats.entry(tool.to_string()).or_insert((0, 0, 0));
        entry.1 += 1;
        entry.2 += latency;
        if outcome == "success" { entry.0 += 1; }
    }
}

// Choose best tool by composite score
let best_tool = tool_stats.iter()
    .filter(|(_, (_, total, _))| *total > 0)
    .map(|(tool, (successes, total, total_latency))| {
        let success_rate = *successes as f64 / *total as f64;
        let avg_latency = *total_latency as f64 / *total as f64;
        let speed_score = (1.0 - avg_latency / 60_000.0).max(0.0);
        let score = success_rate * 0.7 + speed_score * 0.3;
        (tool, score)
    })
    .max_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
    .map(|(tool, _)| tool.as_str())
    .unwrap_or("kubectl_deploy");

println!("Selecting {} based on historical outcomes", best_tool);

package main

import (
    "context"
    "fmt"

    dakera "github.com/dakera-ai/dakera-go"
)

func recordToolOutcome(ctx context.Context, client *dakera.Client, agentID, tool, taskType, outcome, notes string, latencyMs int64) error {
    importanceMap := map[string]float64{
        "success": 0.88,
        "partial": 0.65,
        "failure": 0.42,
    }
    importance, ok := importanceMap[outcome]
    if !ok {
        importance = 0.5
    }

    content := fmt.Sprintf("%s tool %s for %s task. %s. Latency: %dms.", tool, outcome, taskType, notes, latencyMs)

    _, err := client.StoreMemory(ctx, agentID, dakera.StoreMemoryRequest{
        Content:    content,
        Importance: importance,
        MemoryType: "semantic",
        Tags:       []string{fmt.Sprintf("tool:%s", tool), fmt.Sprintf("task:%s", taskType), fmt.Sprintf("outcome:%s", outcome)},
        TTLSeconds: 2_592_000,
        Metadata: map[string]interface{}{
            "tool":       tool,
            "task_type":  taskType,
            "outcome":    outcome,
            "latency_ms": latencyMs,
        },
    })
    return err
}

func getBestTool(ctx context.Context, client *dakera.Client, agentID, taskType string, tools []string) string {
    result, err := client.Recall(ctx, agentID, dakera.RecallRequest{
        Query: fmt.Sprintf("tool outcomes for %s tasks", taskType),
        TopK:  10,
    })
    if err != nil {
        return tools[0] // recall failed: fall back to the first configured tool
    }

    type stats struct{ successes, total int; totalLatency int64 }
    toolStats := map[string]*stats{}
    for _, t := range tools {
        toolStats[t] = &stats{}
    }

    for _, mem := range result.Memories {
        tool, _ := mem.Metadata["tool"].(string)
        if _, ok := toolStats[tool]; !ok {
            continue
        }
        toolStats[tool].total++
        if outcome, _ := mem.Metadata["outcome"].(string); outcome == "success" {
            toolStats[tool].successes++
        }
        if lat, ok := mem.Metadata["latency_ms"].(float64); ok {
            toolStats[tool].totalLatency += int64(lat)
        } else {
            toolStats[tool].totalLatency += 5000 // same default as the other SDK examples
        }
    }

    best, bestScore := tools[0], -1.0
    // Iterate over the tools slice, not the map, so ties resolve deterministically.
    for _, tool := range tools {
        s := toolStats[tool]
        if s.total == 0 {
            continue
        }
        successRate := float64(s.successes) / float64(s.total)
        avgLatency := float64(s.totalLatency) / float64(s.total)
        speedScore := max(0, 1-avgLatency/60_000)
        score := successRate*0.7 + speedScore*0.3
        if score > bestScore {
            bestScore, best = score, tool
        }
    }
    return best
}

func max(a, b float64) float64 {
    if a > b { return a }
    return b
}

Real-World Scenario: DevOps Agent

A DevOps automation agent at a mid-size SaaS company manages deployments across 12 microservices. It has access to four deployment tools: kubectl_deploy, ansible_run, terraform_apply, and helm_upgrade.

Without tool learning, the agent defaults to ansible_run (its training prior) even though the company migrated to Kubernetes six months ago. With tool learning:

Week 1: Agent stores 47 deployment outcomes. kubectl_deploy accumulates a 91% success rate; ansible_run drops to 34% after repeated SSH timeout failures.
Week 2: The agent automatically routes all production deploys to kubectl_deploy. Mean deployment time drops from 28s to 4.2s.
Week 4: A new tool helm_upgrade is added for chart-based services. The agent starts routing helm chart deployments correctly within 8 invocations as it learns the new specialization.
Month 3: terraform_apply records fail-then-recover patterns on Fridays (change freeze). The agent learns to prefer other tools on Friday afternoons without any manual configuration.

Before / After Memory State

Before: No Tool Memory

// Memory store: empty
// Agent must pick tools
// based on LLM priors alone

// Every session resets:
// "Which tool should I use
// for production deploys?"

// No record of:
// - ansible_run failing 66%
// - kubectl_deploy at 91%
// - Friday change freeze

// Result: random/prior-based
// tool selection, repeated
// failures, $2,400/mo in
// failed API retries

After: Tool Learning Active

// Memory store contains:
{
  "content": "kubectl_deploy succeeded for prod deploy. Latency 1.9s.",
  "importance": 0.88,
  "tags": ["tool:kubectl_deploy", "outcome:success"]
}
{
  "content": "ansible_run failed for prod deploy. SSH timeout 45s.",
  "importance": 0.42,
  "tags": ["tool:ansible_run", "outcome:failure"]
}
// Agent recalls 8 outcomes,
// computes: kubectl=91% success
// Selects kubectl automatically.
// Deployment time: 4.2s avg

SDK Method Reference

Method	SDK	Purpose in this pattern
`store_memory(agent_id, content, importance, memory_type, tags, ttl_seconds)`	Python	Persist tool outcome with structured metadata
`recall(agent_id, query, top_k, min_importance)`	Python	Retrieve past tool outcomes for a task type
`storeMemory(agentId, {content, importance, memoryType, tags, ttl_seconds})`	TypeScript	Persist tool outcome with TTL for decay
`recall(agentId, query, {top_k, min_importance})`	TypeScript	Retrieve historical outcomes before tool selection
`client.store_memory("agent", StoreMemoryRequest{...}).await?`	Rust	Async outcome persistence with tags
`client.recall("agent", RecallRequest{...}).await?`	Rust	Async recall of tool history
`client.StoreMemory(ctx, "agent", StoreMemoryRequest{...})`	Go	Store outcome with metadata map
`client.Recall(ctx, "agent", RecallRequest{...})`	Go	Recall tool history for scoring

Edge Cases and Gotchas

Cold start (no history yet): When a tool has fewer than 5 recorded outcomes, its success rate is statistically unreliable. Default to uniform random selection or your configured prior until every tool has at least 5 data points.
Environment drift: A tool that worked in staging may fail in production. Always include the environment (staging, production) in both the content and metadata. Query separately: "tool outcomes for production_deploy tasks" vs "staging_deploy".
Correlated failures: Multiple tools can fail simultaneously due to an infrastructure outage (not tool quality). If all tools show failure outcomes within a 30-minute window, flag this as an infrastructure event rather than updating individual tool scores.
Tool retirement: When you decommission a tool, use forget() or batch_forget() to remove its historical outcomes. Stale memories of a retired tool will pollute recall results until their TTL expires.
Importance inflation: If every successful outcome gets importance 0.9, the memory store fills with high-importance entries and recall becomes less discriminating. Vary importance by outcome quality: a 500ms success scores higher than a 55-second success.
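
Three of the gotchas above (cold start, correlated failures, importance inflation) can be guarded with small helpers. The sketch below is illustrative only: the function names, the 5-sample threshold, and the linear 0.80 to 0.95 importance scale are assumptions chosen to match the numbers quoted in this guide, not SDK features.

```python
from datetime import datetime, timedelta

MIN_SAMPLES = 5                        # assumed cold-start threshold
OUTAGE_WINDOW = timedelta(minutes=30)  # window from the correlated-failures note


def has_enough_history(stats: dict[str, dict], min_samples: int = MIN_SAMPLES) -> bool:
    """True only when every tool has at least min_samples recorded outcomes."""
    return all(s["total"] >= min_samples for s in stats.values())


def looks_like_outage(
    failures: dict[str, list[datetime]],
    tools: list[str],
    window: timedelta = OUTAGE_WINDOW,
) -> bool:
    """True when every tool has a failure inside one shared time window."""
    if not tools or any(not failures.get(t) for t in tools):
        return False
    # If such a window exists, it can start at one of the failure timestamps.
    for start in sorted(ts for t in tools for ts in failures[t]):
        end = start + window
        if all(any(start <= ts <= end for ts in failures[t]) for t in tools):
            return True
    return False


def graded_importance(latency_ms: int, ceiling_ms: int = 60_000,
                      low: float = 0.80, high: float = 0.95) -> float:
    """Scale a success's importance between low and high by how fast it was."""
    speed = max(0.0, 1 - latency_ms / ceiling_ms)
    return low + (high - low) * speed
```

When `has_enough_history` is false, fall back to your prior instead of the ranked result; when `looks_like_outage` is true, skip the score update for that window.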

Warning: Don't rely solely on memory for safety-critical tool selection

Tool learning improves efficiency, but for operations with irreversible consequences (database drops, production deletions), always require explicit human confirmation regardless of historical success rates. Memory tells you what worked before — it cannot predict novel failure modes.
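
A minimal sketch of such a confirmation gate is shown below. The `IRREVERSIBLE_TASKS` set, the `choose_tool` function, and the `confirm` callback are hypothetical names introduced for illustration; how the confirmation is actually collected (chat prompt, ticket approval, CLI input) is left to the caller.

```python
from typing import Callable

# Hypothetical task types treated as irreversible; adjust for your environment.
IRREVERSIBLE_TASKS = {"database_drop", "production_delete"}


def choose_tool(task_type: str, recommended: str, confirm: Callable[[str], bool]) -> str:
    """Return the recommended tool, requiring human sign-off for irreversible tasks."""
    if task_type in IRREVERSIBLE_TASKS:
        prompt = f"Run {recommended} for irreversible task '{task_type}'?"
        if not confirm(prompt):
            raise PermissionError(f"{task_type} was not confirmed by an operator")
    return recommended
```

The gate sits after tool ranking, so historical success rates still inform the recommendation but never bypass the human check.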

Performance Considerations

~12ms: recall latency (top_k=10)
91%: tool selection accuracy after 50 outcomes
30 days: recommended TTL to prevent stale patterns

Recall adds ~12ms to task processing at top_k=10 — negligible vs. any real tool execution time.
After ~50 stored outcomes across 3–5 tools, selection accuracy stabilizes above 85% in benchmarks.
Use min_importance=0.3 to include failure signals — dropping below 0.3 returns mostly noise.
Cap top_k at 15–20 for tool selection; higher values add latency without improving decision quality.
Set ttl_seconds=2_592_000 (30 days) on all tool outcome memories to prevent outdated patterns from persisting indefinitely.

Advanced Configuration: Weighted scoring and A/B testing

For production DevOps agents, you can implement multi-armed bandit-style exploration to occasionally test lower-ranked tools and discover improvements:

import random

EXPLORATION_RATE = 0.1  # 10% of the time, try a non-optimal tool

def select_tool_with_exploration(agent_id: str, task_type: str, tools: list) -> str:
    if random.random() < EXPLORATION_RATE:
        # Explore: pick a random tool (excluding the known best)
        best = get_best_tool(agent_id, task_type, tools)
        others = [t for t in tools if t != best]
        return random.choice(others) if others else best

    # Exploit: use the historically best tool
    return get_best_tool(agent_id, task_type, tools)

# Separate scoring weights for latency-sensitive tasks
def get_latency_optimized_tool(agent_id: str, task_type: str, tools: list) -> str:
    """When speed matters more than reliability, shift the scoring weight."""
    # Target weighting: 40% success rate + 60% speed (vs. the default 70/30).
    # get_best_tool hard-codes 0.7/0.3, so expose those weights as parameters
    # there and call it with (0.4, 0.6) from here.
    raise NotImplementedError("parameterize the weights in get_best_tool first")

# Use a separate agent_id per team to segregate memory (avoid cross-team contamination)
def record_team_outcome(team: str, **kwargs):
    record_tool_outcome(
        agent_id=f"devops-{team}",  # "devops-platform", "devops-backend"
        **kwargs
    )

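With EXPLORATION_RATE = 0.1, roughly one invocation in ten tries a tool other than the current best, so over 20 invocations you should expect about two exploratory calls while the other 90% keep using the top-ranked tool. How quickly a specific alternative gets re-tested depends on how many tools share that exploration budget.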
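
To make the exploration arithmetic concrete, here is a small worked sketch. It assumes the selection scheme shown above: with probability epsilon the agent explores, choosing uniformly among the tools other than the current best. The function names are hypothetical.

```python
import math


def per_tool_explore_probability(epsilon: float, n_tools: int) -> float:
    """Chance that one invocation explores one specific non-best tool."""
    if not 0.0 <= epsilon <= 1.0:
        raise ValueError("epsilon must be between 0 and 1")
    if n_tools < 2:
        return 0.0  # nothing to explore with a single tool
    return epsilon / (n_tools - 1)


def expected_wait(epsilon: float, n_tools: int) -> float:
    """Expected number of invocations until a specific alternative is tried."""
    p = per_tool_explore_probability(epsilon, n_tools)
    return math.inf if p == 0 else 1 / p


def chance_within(epsilon: float, n_tools: int, invocations: int) -> float:
    """Probability a specific alternative is tried at least once in N invocations."""
    p = per_tool_explore_probability(epsilon, n_tools)
    return 1 - (1 - p) ** invocations
```

With four tools and epsilon = 0.1, each alternative is explored with probability 0.1 / 3 per invocation, so the expected wait for any one of them is about 30 invocations. Raise the exploration rate or shrink the candidate set if tools change faster than that.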
