// Tool Use / Function Calling
The complete engineering guide — from mental model to production architecture
01. Concept Explanation
LLMs generate text — that is their only primitive operation. They cannot query databases, call APIs, read files, or perform any real-world action. Tool use (also called function calling) bridges this gap by letting the model declare intent to call a function, which your code then executes.
The Core Mental Model
Think of the LLM as a very smart dispatcher in a microservices architecture. It reads the user's request, decides which backend service to call, and formats the request payload — but your code actually executes the call and returns the result.
User prompt -> LLM reasons -> outputs structured tool call (name + args)
-> Your code executes the function -> result fed back to LLM
-> LLM generates final response incorporating the result
The model does NOT run code. It produces a JSON object like:
{
  "type": "tool_use",
  "name": "get_weather",
  "input": {"city": "Tokyo", "unit": "celsius"}
}
Your application parses this, calls the real get_weather() function, and sends the result back. The model then incorporates that result into its natural language answer.
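Parsing and dispatching that block on the application side reduces to a name-to-function lookup. A minimal sketch (the get_weather stub and registry here are hypothetical, for illustration only):

```python
# Hypothetical tool implementation -- a real system would call a weather API.
def get_weather(city: str, unit: str = "celsius") -> str:
    return f"22C, Partly Cloudy in {city}"

# Registry mapping the model-facing tool name to the real function.
TOOL_REGISTRY = {"get_weather": get_weather}

def dispatch(tool_call: dict) -> str:
    """Look up the function the model asked for and call it with its args."""
    fn = TOOL_REGISTRY[tool_call["name"]]
    return fn(**tool_call["input"])

call = {"type": "tool_use", "name": "get_weather",
        "input": {"city": "Tokyo", "unit": "celsius"}}
print(dispatch(call))  # 22C, Partly Cloudy in Tokyo
```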
Tool Use vs. RAG vs. Prompting
| Approach | When data is injected | Who decides what data | LLM's role |
| Prompting | Before generation, statically | Developer hardcodes it | Reason over fixed context |
| RAG | Before generation, dynamically | Retrieval system (embeddings) | Reason over retrieved context |
| Tool Use | During generation, on demand | The LLM itself decides | Decide what to fetch, then reason |
RAG retrieves context before the model runs. Tool use happens during generation — the model decides mid-response that it needs external data and explicitly requests it. This is the fundamental difference: tool use gives the model agency.
The Analogy for Backend Developers
If you have built REST APIs, you already understand tool use:
Tool Definition = OpenAPI/Swagger spec (what the endpoint does)
Tool Implementation = Route handler (the actual business logic)
Tool Call = HTTP request (model calls the endpoint)
Tool Result = HTTP response (data sent back to model)
Agentic Loop = Saga/Workflow orchestrator (multi-step coordination)
02. Why It Matters in Real Systems
From Text Generator to Action Taker
Without tool use, an LLM is a fancy autocomplete — it can only generate text based on its training data. With tool use, it becomes an agent that can observe and act on the real world.
User: "What's my order status?"
LLM: "I don't have access to your order information."
User: "What's my order status?"
LLM: [calls search_orders(email="user@example.com")]
LLM: "Your keyboard (ORD-1001) was delivered, and your
USB hub (ORD-1042) is shipped."
RAG vs. Tool Use — Different Problems
| Scenario | RAG | Tool Use | Why |
| "What's our refund policy?" | Yes | No | Static knowledge, retrieve from docs |
| "Process a refund for order #1234" | No | Yes | Requires action + real-time data |
| "What's the weather in Tokyo?" | No | Yes | Real-time data, not in training set |
| "Summarize our API documentation" | Yes | No | Existing documents, no action needed |
| "Find similar tickets, then create a Jira issue" | Yes | Yes | Needs both retrieval AND action |
Where Companies Use Tool Use
| Company Type | Tool Use Application |
| Customer support | Look up order status, initiate refunds, update accounts |
| Code assistants | Read files, run tests, search codebases, apply edits |
| Data analysts | Query SQL databases, run calculations, generate charts |
| Booking platforms | Search flights, check availability, make reservations |
| DevOps copilots | Check deploys, roll back releases, scale infrastructure |
| Financial services | Retrieve balances, execute trades, generate compliance reports |
Business impact:
Accuracy: no hallucinated facts when tools can look them up.
Freshness: real-time data from APIs.
Action: the system can do things, not just say things.
Composability: small, tested functions combine into complex workflows.
Cost efficiency: one LLM + tools replaces building custom UIs for every operation.
03. Internal Mechanics
How the Model "Learns" to Call Tools
During fine-tuning (SFT + RLHF), models are trained on thousands of examples where the correct response is a structured tool call rather than plain text. The model learns a decision boundary: "When the user asks about weather, emit a tool_use block instead of guessing the temperature." This is essentially classification — the model classifies each turn as either "answer directly" or "call tool X with arguments Y."
What the Model Sees (Context Injection)
Tool definitions are not stored anywhere special: when you send tool schemas, the API serializes them into the system portion of the prompt, inside the model's context window. The model sees something like:
You have access to the following tools:
Tool: get_weather
Description: Get current weather for a city...
Parameters:
- city (string, required): City name
- unit (string, optional): celsius or fahrenheit
When you need to use a tool, respond with a tool_use block.
This means tool definitions consume input tokens on every API call. Ten tools with verbose descriptions can easily add 2,000-3,000 tokens to every request.
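You can sanity-check this overhead before shipping. A rough sketch, assuming the common ~4-characters-per-token heuristic (the real tokenizer will differ, so treat this as an estimate only):

```python
import json

def estimate_tool_tokens(tool_definitions: list[dict]) -> int:
    """Rough token estimate for tool schemas: ~4 chars per token for JSON text."""
    serialized = json.dumps(tool_definitions)
    return len(serialized) // 4

tools = [{
    "name": "get_weather",
    "description": "Get current weather for a specific city. Returns temperature...",
    "input_schema": {"type": "object", "properties": {
        "city": {"type": "string"}, "unit": {"type": "string"}}},
}]
print(estimate_tool_tokens(tools))  # small for one tool; scales linearly with count
```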
The API Protocol Flow
Step 1: Define tools as JSON schemas (name, description, input_schema)
Step 2: Send user message + tool definitions to API
POST /messages { model, tools: [...], messages: [{role: "user", ...}] }
Step 3: Model returns stop_reason: "tool_use"
content: [{type: "text", text: "I'll check..."}, {type: "tool_use", id, name, input}]
Step 4: YOUR CODE executes the function
result = get_weather("Tokyo", "celsius") -> "22C, Partly Cloudy"
Step 5: Send tool_result back to the model
messages: [..., {role: "user", content: [{type: "tool_result", tool_use_id, content}]}]
Step 6: Model generates final natural language response
stop_reason: "end_turn"
The Agentic Loop
Every tool-use interaction follows this precise loop. The model never executes code — it expresses intent, and your runtime fulfills it.
1
-> API Request
messages: [user_msg] + tools: [schemas]
You send the user query AND tool definitions
|
2
<- Model Response
stop_reason: "tool_use"
content: [{type: "tool_use", name, id, input}]
Model declares: "I want to call X with args Y"
|
REPEAT until stop_reason = "end_turn"
3
Your Code Executes
dispatch(tool_name, tool_input)
-> DB query, API call, calculation, etc.
This is the security/validation boundary
|
4
-> Feed Result Back
messages: [..., {tool_result, tool_use_id, content}]
Send execution output back to the model
|
5
<- Model Synthesizes or Calls Again
Either: stop_reason: "end_turn" -> final answer
Or: stop_reason: "tool_use" -> another call (loop)
Parallel Tool Calls
Claude can emit multiple tool_use blocks in a single response. This is the model saying "I need data from multiple tools simultaneously." You must execute ALL tool calls and send ALL results back in a single user message. This is a significant performance optimization — instead of separate loop iterations, the model gets multiple results in one round trip.
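Since the calls are independent, they can also run concurrently. A sketch using asyncio.gather, with a hypothetical async execute_tool stub standing in for real DB/API calls:

```python
import asyncio

async def execute_tool(name: str, args: dict) -> str:
    # Hypothetical async executor -- replace with real DB queries / API calls.
    await asyncio.sleep(0.01)
    return f"{name} result"

async def run_parallel(tool_use_blocks: list[dict]) -> list[dict]:
    """Execute every tool_use block concurrently; return ONE result list."""
    outputs = await asyncio.gather(
        *(execute_tool(b["name"], b["input"]) for b in tool_use_blocks)
    )
    # All results go back in a single user message, each tagged with its id.
    return [
        {"type": "tool_result", "tool_use_id": b["id"], "content": out}
        for b, out in zip(tool_use_blocks, outputs)
    ]

blocks = [
    {"id": "tu_1", "name": "get_weather", "input": {"city": "Tokyo"}},
    {"id": "tu_2", "name": "search_orders", "input": {"email": "a@b.com"}},
]
results = asyncio.run(run_parallel(blocks))
```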
The tool_choice Parameter
tool_choice={"type": "auto"}
tool_choice={"type": "any"}
tool_choice={"type": "tool", "name": "get_weather"}
Use "any" or specific tool when you know a tool call is needed and don't want the model to skip it. Useful in pipelines where the LLM's job is specifically to extract structured data for a known tool.
Message Structure Rules
The Claude API enforces strict rules for tool use:
1. Messages must alternate: user -> assistant -> user -> assistant
2. A tool_use block must appear in an assistant message
3. A tool_result block must appear in the user message that immediately follows
4. Every tool_result must reference a tool_use_id from the preceding assistant message
5. Every tool_use block must have a corresponding tool_result
Violating any of these rules produces an API error. This is the most common source of bugs when implementing tool use.
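Rules 4 and 5 are easy to check mechanically before each API call. A small validator sketch, assuming the message shapes shown above:

```python
def validate_tool_pairing(messages: list[dict]) -> list[str]:
    """Check that every tool_use id has a tool_result in the next user message."""
    errors = []
    for i, msg in enumerate(messages):
        if msg["role"] != "assistant" or not isinstance(msg["content"], list):
            continue
        use_ids = {b["id"] for b in msg["content"]
                   if isinstance(b, dict) and b.get("type") == "tool_use"}
        if not use_ids:
            continue
        next_msg = messages[i + 1] if i + 1 < len(messages) else None
        result_ids = set()
        if next_msg and next_msg["role"] == "user" and isinstance(next_msg["content"], list):
            result_ids = {b["tool_use_id"] for b in next_msg["content"]
                          if isinstance(b, dict) and b.get("type") == "tool_result"}
        for missing in use_ids - result_ids:
            errors.append(f"tool_use {missing} has no tool_result")
    return errors
```

Run this in development mode before every request; an empty list means the pairing rules hold.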
04. Practical Example — E-Commerce Support Agent
Problem: Build a support agent that can look up orders, check weather for delivery estimates, and perform calculations — handling multi-step requests autonomously.
User says: "I'm alice@example.com. Can you check my orders and calculate the total cost of everything?"
Model thinks: "I need to look up Alice's orders first"
Model emits: tool_use -> search_orders(email="alice@example.com")
Executes search_orders -> returns order list with prices
Model sees: ORD-1001 ($149.99), ORD-1042 ($49.99)
Model thinks: "Now I need to calculate the total"
Model emits: tool_use -> calculate(expression="149.99 + 49.99")
Executes calculate -> returns "199.98"
Model sees: calculation result
Model generates: "Hi Alice! You have 2 orders:
1. Mechanical Keyboard (ORD-1001) -- $149.99, delivered
2. USB-C Hub (ORD-1042) -- $49.99, shipped
Your total is $199.98."
stop_reason: "end_turn"
Key engineering decisions driven by tool use understanding:
| Decision | Reasoning |
| Separate tool definitions from implementations | Mirrors OpenAPI spec vs handler — testable independently |
| Max iterations guard on the loop | Prevents runaway costs if the model loops endlessly |
| Log every tool call with input/output | Essential for debugging when the model calls the wrong tool |
| Return tool call history in the response | Client-side observability — users can see what happened |
| Execute tools sequentially, not concurrently | Simpler error handling; parallelize only when needed |
05. Hands-on Implementation
Project Structure
03-tool-use/
app.py # Standalone demo — agentic loop with tool use
tools.py # Tool definitions (JSON schemas) + implementations
Setup
pip install -r requirements.txt
cp .env.example .env
cd 03-tool-use
python app.py
Key Implementation Pattern
The core of every tool-use system is this loop. Here is the minimal version:
MAX_ROUNDS = 10  # loop guard -- prevents runaway costs

for _ in range(MAX_ROUNDS):
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        tools=TOOLS,
        messages=messages,
    )
    if response.stop_reason == "tool_use":
        # Preserve the FULL assistant content (text + tool_use blocks)
        messages.append({"role": "assistant", "content": response.content})
        results = []
        for block in response.content:
            if block.type == "tool_use":
                output = execute_tool(block.name, block.input)
                results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": output,
                })
        # ALL tool results go back in a single user message
        messages.append({"role": "user", "content": results})
    elif response.stop_reason == "end_turn":
        return response.content
File Walkthrough: tools.py
Why it is a separate file: This mirrors the backend pattern of separating API specifications from handlers. The tool definitions (JSON schemas) are what the model sees. The implementations are what your code executes. You can test them independently.
Tool Definitions (what the model sees)
TOOL_DEFINITIONS = [
    {
        "name": "get_weather",
        "description": (
            "Get current weather for a specific city. Returns temperature, "
            "condition, and humidity. Use this when the user asks about "
            "weather, temperature, or outdoor conditions in a location."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name, e.g. 'Tokyo'"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
]
Critical details:
description is the most important field — the model reads it to decide WHEN to call the tool. Include "Use this when..." clauses.
input_schema uses standard JSON Schema. required tells the model which parameters it must provide.
enum constrains the model to specific values — prevents inventing invalid options.
Parameter description with examples ("City name, e.g. 'Tokyo'") helps the model fill in the right values.
Security: The Calculator Tool
import math
import re

SAFE_MATH = {"sqrt": math.sqrt, "abs": abs, "round": round, "min": min, "max": max}

def _calculate(params: dict) -> str:
    expression = params["expression"]
    allowed_chars = set("0123456789+-*/.(,) ")
    allowed_words = {"sqrt", "abs", "round", "min", "max"}
    # Reject any character outside the arithmetic whitelist
    if not set(re.sub(r"[a-zA-Z_]+", "", expression)) <= allowed_chars:
        return "Error: Unsafe characters in expression"
    # Reject any identifier that is not a whitelisted math function
    words = set(re.findall(r"[a-zA-Z_]+", expression))
    if not words.issubset(allowed_words):
        unsafe = words - allowed_words
        return f"Error: Unsafe operations not allowed: {unsafe}"
    result = eval(expression, {"__builtins__": {}}, SAFE_MATH)
    return f"Result: {result}"
Critical lesson: The model's output is untrusted input from a security perspective. If the model generates "__import__('os').system('rm -rf /')" as a calculation expression and you pass it to eval(), you have given the LLM code execution. Always sanitize, whitelist, and sandbox.
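A safer alternative to character whitelisting is to walk the expression's AST and allow only arithmetic nodes, so there is no eval() at all. A sketch:

```python
import ast
import operator

# Whitelisted operators; any other node (names, calls, attributes) is rejected.
_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.USub: operator.neg,
}

def safe_calculate(expression: str) -> float:
    """Evaluate arithmetic by walking the AST -- no code execution possible."""
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"Disallowed expression node: {type(node).__name__}")
    return _eval(ast.parse(expression, mode="eval"))

print(safe_calculate("149.99 + 49.99"))
# safe_calculate("__import__('os').system('...')") raises ValueError: the
# Call node is not in the whitelist, so it is never executed.
```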
06. System Design — Production Architecture
In production, the tool-use loop sits behind an orchestration layer with these concerns:
┌─────────────┐ ┌──────────────────┐ ┌─────────────────┐
│ Client │────>│ API Gateway │────>│ Orchestrator │
│ (Frontend) │ │ (Auth, Rate │ │ (Agentic Loop) │
│ │<────│ Limiting) │<────│ │
└─────────────┘ └──────────────────┘ └────────┬────────┘
│
┌───────────────────────────┼────────────────┐
│ │ │
┌─────v─────┐ ┌───────v──────┐ ┌─────v──────┐
│ Tool │ │ Tool │ │ Tool │
│ Registry │ │ Validator │ │ Logger │
│ │ │ (Schema + │ │ (Audit │
│ name->fn │ │ Auth) │ │ Trail) │
└─────┬─────┘ └──────────────┘ └────────────┘
│
┌─────────────┼─────────────┐
│ │ │
┌─────v────┐ ┌─────v────┐ ┌──────v─────┐
│ Product │ │ Shipping │ │ Order │
│ Service │ │ Service │ │ Service │
│ (DB) │ │ (API) │ │ (Write) │
└──────────┘ └──────────┘ └────────────┘
Production Components
1. Tool Dispatcher — The Critical Middleware
This sits between the model and your tools. It handles authorization, rate limiting, input validation, timeout enforcement, audit logging, and error handling with user-friendly messages.
class ToolDispatcher:
    async def execute(self, tool_name, tool_input, user_context):
        self.authorize(user_context, tool_name)        # may this user call this tool?
        self.validate(tool_name, tool_input)           # schema + business rules
        result = await self.run_with_timeout(tool_name, tool_input)
        self.audit_log(user_context, tool_name, tool_input, result)
        return result
2. Idempotency for Mutating Tools
The model might call the same tool twice due to retries or loop errors. Write operations must be idempotent:
async def initiate_refund(params):
    # Same order + amount -> same key -> a retry can never double-refund
    idempotency_key = f"refund-{params['order_id']}-{params['amount']}"
    existing = await refund_store.get(idempotency_key)
    if existing:
        return f"Refund already processed: {existing.refund_id}"
    refund = await payment_service.refund(
        order_id=params["order_id"],
        amount=params["amount"],
        idempotency_key=idempotency_key,
    )
    return f"Refund {refund.id} processed: ${refund.amount}"
3. Timeout Budgets
Each agentic loop iteration is an API call (~1-3s) plus tool execution time. You need a global deadline:
import time

async def chat_with_deadline(message, deadline_seconds=30, max_iterations=10):
    deadline = time.time() + deadline_seconds
    for iteration in range(max_iterations):
        remaining = deadline - time.time()
        if remaining <= 0:
            return "Taking too long. Here's what I have so far..."
        response = await client.messages.create(
            model=MODEL, tools=TOOL_DEFINITIONS, messages=messages,
            # Shrink the output budget as the deadline approaches
            max_tokens=min(1024, int(remaining * 50)),
        )
        # ... then the standard agentic loop body
4. Observability — Your Debugging Lifeline
Log every tool call. When the model calls the wrong tool, these logs are the only way to debug:
{
  "event": "tool_call",
  "request_id": "req_abc123",
  "iteration": 2,
  "tool_name": "search_orders",
  "tool_input": {"email": "alice@example.com"},
  "tool_latency_ms": 34,
  "cumulative_input_tokens": 1847
}
Dashboard metrics to track: tool call rate per tool, tool selection accuracy, average iterations per request (trending up = prompt problem), tool execution latency, loop timeout rate.
5. Graceful Degradation
If a tool fails, send a clear error message back to the model. It can often recover:
try:
    result = await execute_tool(block.name, block.input)
except DatabaseConnectionError:
    result = (
        "Error: The order database is temporarily unavailable. "
        "Please let the user know and suggest they try again."
    )
except ExternalAPIError as e:
    result = f"Error: The weather service returned: {e.message}"
The model will incorporate the error into its response naturally: "I'm sorry, I wasn't able to check your orders right now because our order system is temporarily unavailable."
07. Common Pitfalls
Pitfall 1 (important): Vague tool descriptions
Descriptions like "get data" are useless. The model picks tools based on descriptions — be specific about what each tool does, when to use it, what it returns, and when NOT to use it. Include "Use this when..." and "Do NOT use this for..." clauses.
Pitfall 2 (critical): Dropping content from the assistant message
Only keeping tool_use blocks and dropping text blocks corrupts the conversation context. The assistant message may contain BOTH text and tool_use blocks. Always preserve the entire content list: messages.append({"role": "assistant", "content": response.content})
Pitfall 3 (critical): Mismatching tool_use_id
Each tool_result must reference the exact tool_use_id from the corresponding tool_use block. Hardcoding or mismatching IDs causes the API to reject your request. Always use block.id from the tool_use block.
Pitfall 4 (important): No loop guard
Without a MAX_ITERATIONS guard, the model can loop forever — calling the same tool with the same arguments, or cycling between tools without making progress. Each iteration costs tokens. Always cap iterations and add cost tracking.
Pitfall 5 (critical): Passing model output to eval()
The model's output is untrusted input. It could generate __import__('os').system('rm -rf /'). Even with a "well-behaved" model, adversarial user prompts can manipulate it into generating malicious tool arguments. Whitelist allowed operations or use a sandboxed math parser.
Pitfall 6 (important): Too many tools
Registering 50+ tools in every request wastes tokens (50 tools can add 5,000+ tokens per API call), degrades selection accuracy, and fills the context window faster. Fix: group into categories with two-stage routing, use embeddings for dynamic top-K selection, or split into specialized agents.
Pitfall 7 (important): Stripping tool messages from history
Removing tool_use and tool_result messages to "save tokens" makes the model lose track of what data it already has. It will re-call tools unnecessarily or hallucinate previous results. If you need to reduce context size, summarize the tool results rather than removing them.
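A crude but safe version of this uses truncation as a stand-in for LLM summarization: the message structure and tool_use_id pairing stay intact, only the payload shrinks. A sketch (MAX_RESULT_CHARS is an arbitrary example budget):

```python
MAX_RESULT_CHARS = 500  # hypothetical per-result budget

def compact_tool_results(messages: list[dict]) -> list[dict]:
    """Shrink oversized tool_result payloads instead of deleting messages."""
    compacted = []
    for msg in messages:
        if msg["role"] == "user" and isinstance(msg["content"], list):
            new_content = []
            for block in msg["content"]:
                if (isinstance(block, dict)
                        and block.get("type") == "tool_result"
                        and len(str(block.get("content", ""))) > MAX_RESULT_CHARS):
                    # Keep id and structure; truncate only the content payload
                    block = {**block,
                             "content": str(block["content"])[:MAX_RESULT_CHARS]
                             + " ...[truncated]"}
                new_content.append(block)
            msg = {**msg, "content": new_content}
        compacted.append(msg)
    return compacted
```

In production you would replace the truncation with a cheap summarization call, but the invariant is the same: never break the tool_use/tool_result pairing.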
Pitfall 8 (tip): Not handling parallel tool calls
A single response can contain multiple tool_use blocks. If you only process the first one and break, the model gets partial results and may behave unpredictably. Process ALL tool_use blocks and return ALL results in one user message.
08. Advanced Topics
Explore these next, in recommended order:
8.1 Streaming with Tool Use
Stream text to the user but buffer tool_use blocks until they are complete before executing. Users expect real-time responses, but tool execution must happen between streaming segments.
8.2 MCP (Model Context Protocol)
Anthropic's open standard for connecting LLMs to external tools. MCP servers expose tools via a standardized protocol; clients discover and call tools dynamically. Enables tool interoperability across different LLM applications.
8.3 Dynamic Tool Loading
Embed tool descriptions, match against the user's query with cosine similarity, load the top-K most relevant tools. Reduces context usage and improves selection accuracy.
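A toy sketch of the selection step, using word overlap as a stand-in for embedding cosine similarity (a real system would embed the query and descriptions with an embedding model):

```python
def score(query: str, description: str) -> float:
    """Toy relevance score: word overlap. Real systems use embedding similarity."""
    q, d = set(query.lower().split()), set(description.lower().split())
    return len(q & d) / max(len(q), 1)

def select_tools(query: str, tools: list[dict], k: int = 3) -> list[dict]:
    """Return the top-K most relevant tool definitions for this query."""
    return sorted(tools, key=lambda t: score(query, t["description"]), reverse=True)[:k]

tools = [
    {"name": "get_weather", "description": "get current weather for a city"},
    {"name": "search_orders", "description": "search customer orders by email"},
    {"name": "calculate", "description": "evaluate a math expression"},
]
print(select_tools("what is the weather in tokyo", tools, k=1)[0]["name"])  # get_weather
```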
8.4 Human-in-the-Loop for Dangerous Tools
For tools with side effects (refunds, deletes, deployments), pause and get human approval before executing. The model decides what to do; a human decides whether to let it.
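A minimal sketch of such a gate (the DANGEROUS_TOOLS set and the return shape are illustrative, not a standard API):

```python
# Example set of tools that mutate state and therefore require approval.
DANGEROUS_TOOLS = {"initiate_refund", "place_order", "delete_account"}

def gate_tool_call(tool_name: str, tool_input: dict, approved: bool) -> dict:
    """Intercept side-effecting tools; execute only with human approval."""
    if tool_name in DANGEROUS_TOOLS and not approved:
        # Returned as the tool_result, so the model can tell the user it is waiting.
        return {"status": "pending_approval",
                "message": f"{tool_name} requires human confirmation before it runs."}
    return {"status": "executed",
            "message": f"{tool_name} ran with {tool_input}"}
```

The model still decides what to do; the gate decides whether it is allowed to happen.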
8.5 Multi-Agent Tool Delegation
Supervisor agent routes to specialist agents (database tools, API tools, calculation tools). Reduces per-agent tool count, improves specialization, enables parallel execution.
8.6 Tool Result Caching
Hash tool name + input, check cache before executing, store with TTL. Subtlety: semantic caching with embeddings for varied phrasings of the same query.
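A sketch of the exact-match variant, keyed on a hash of the tool name plus canonicalized input, with a TTL:

```python
import hashlib
import json
import time

class ToolResultCache:
    """Cache tool results keyed on hash(tool name + sorted JSON input), with TTL."""
    def __init__(self, ttl_seconds: float = 60.0):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, str]] = {}

    def _key(self, name: str, args: dict) -> str:
        # sort_keys makes {"a":1,"b":2} and {"b":2,"a":1} hash identically
        payload = json.dumps({"name": name, "args": args}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def get(self, name: str, args: dict):
        entry = self._store.get(self._key(name, args))
        if entry and time.time() - entry[0] < self.ttl:
            return entry[1]
        return None  # miss or expired

    def put(self, name: str, args: dict, result: str):
        self._store[self._key(name, args)] = (time.time(), result)
```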
8.7 Retry and Fallback Strategies
Retry with exponential backoff for transient errors. Fallback to secondary sources. Circuit breaker if a tool fails repeatedly. Always send clear error messages back to the model.
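A sketch of the backoff wrapper (the transient exception set is an example; tune it to your tools' real failure modes):

```python
import random
import time

def retry_with_backoff(fn, max_attempts=4, base_delay=0.5, transient=(TimeoutError,)):
    """Retry transient failures with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except transient:
            if attempt == max_attempts - 1:
                raise  # exhausted -- surface the error to the dispatcher
            # 2^attempt growth, randomized to avoid thundering-herd retries
            delay = base_delay * (2 ** attempt) * (0.5 + random.random() / 2)
            time.sleep(delay)
```

Non-transient errors (bad arguments, authorization failures) should not be retried; send those straight back to the model as error tool_results instead.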
8.8 Tool Use Evaluation and Testing
Measure tool selection accuracy, argument accuracy, end-to-end evaluation, and regression testing when tool descriptions change.
09. Coding Exercises
Exercise 1 — Guardrailed Tools
Add a permission system: mark place_order as requiring explicit user confirmation. Build middleware that intercepts write-operation tool calls and returns a confirmation prompt instead of executing.
Exercise 2 — Parallel Tool Calls
Create a "compare products" flow: when a user says "compare keyboard vs mouse pricing with shipping to Japan," the model should call lookup_product and calculate_shipping for both products. Handle the parallel tool calls correctly.
Exercise 3 — Tool Observability
Build a logging middleware that records every tool call with: timestamp, tool name, input args, output, latency, and token usage. Expose it via a /debug/traces endpoint that shows the full execution trace for any conversation.
10. Architect-Level Questions
Q1: The model keeps calling the wrong tool. It uses get_weather when the user asks about orders. How do you debug and fix this?
Think: check tool descriptions for ambiguity, add "use this when..." / "do NOT use this for..." clauses, log the full message history, isolate with tool_choice, reduce tool count, check system prompt for conflicting instructions.
Q2: You're building a customer support agent with 40 tools. Response quality drops as you add more tools. What's your architecture?
Think: two-tier approach (category router with 5-6 meta-tools, then category-specific tools), embedding-based dynamic selection, multi-agent with specialized tool sets, dynamic loading with MCP.
Q3: A tool call mutates state (processes a refund), but the model's final response fails due to a timeout. The refund went through but the user sees an error. How do you handle this?
Think: idempotency keys, two-phase execution (pending then confirm), durable tool execution log keyed by request ID, compensating actions, never retry mutating calls blindly.
Q4: How does tool use affect token costs, and what strategies reduce spend in a high-volume system?
Think: tool definitions are injected as input tokens on EVERY API call in the loop. 10 tools x 5 iterations = definitions counted 5 times. Strategies: minimize descriptions, dynamic top-K loading (60-80% savings), tool_choice when known, cache results, better system prompts, model routing.
Q5: Compare tool use with RAG. When would you use each, and when would you combine them?
Think: RAG = passive retrieval before generation (knowledge questions, static docs). Tool use = active during generation (actions, live data). Combine: "Am I eligible for a refund on order #1234?" — RAG retrieves the refund policy, tool use fetches the order data, model combines both to answer.
11. Quick Reference Card
┌──────────────────────────────────────────────────────────────┐
│ TOOL USE ENGINEERING CHEATSHEET │
├──────────────────────────────────────────────────────────────┤
│ │
│ TOOL DEFINITION: │
│ name: unique identifier │
│ description: MOST IMPORTANT -- drives selection accuracy │
│ input_schema: JSON Schema for parameters │
│ │
│ AGENTIC LOOP: │
│ while not done and under budget: │
│ response = llm(messages, tools) │
│ if end_turn -> return response │
│ if tool_use -> execute -> append results -> loop │
│ │
│ MESSAGE RULES: │
│ * Alternate user <-> assistant │
│ * Append FULL assistant content (text + tool_use) │
│ * Match tool_use_id in every tool_result │
│ * Send ALL tool_results in one user message │
│ │
│ tool_choice OPTIONS: │
│ auto -> model decides (default) │
│ any -> must use some tool │
│ tool -> must use specific tool │
│ │
│ SECURITY: │
│ * Model output is UNTRUSTED INPUT │
│ * Never eval() model-generated code │
│ * Validate tool arguments before executing │
│ * Idempotency keys for mutating operations │
│ │
│ COST FORMULA: │
│ tool_tokens = num_tools x avg_tokens_per_tool │
│ loop_cost = iterations x (tool_tokens + history_tokens) │
│ Minimize: fewer tools, fewer iterations, cache results │
│ │
│ GOLDEN RULES: │
│ 1. Description quality = selection accuracy │
│ 2. Always guard the loop with max_iterations │
│ 3. Append FULL assistant message, not just tool blocks │
│ 4. Handle parallel tool calls (multiple tool_use blocks) │
│ 5. Log every tool call -- it's your debugging lifeline │
│ │
└──────────────────────────────────────────────────────────────┘