// Tool Use / Function Calling
The complete engineering guide — from mental model to production architecture
01. Concept Explanation
LLMs generate text — that is their only primitive operation. They cannot query databases, call APIs, read files, or perform any real-world action. Tool use (also called function calling) bridges this gap by letting the model declare intent to call a function, which your code then executes.
The Core Mental Model
Think of the LLM as a very smart dispatcher in a microservices architecture. It reads the user's request, decides which backend service to call, and formats the request payload — but your code actually executes the call and returns the result.
User prompt -> LLM reasons -> outputs structured tool call (name + args)
-> Your code executes the function -> result fed back to LLM
-> LLM generates final response incorporating the result
The model does NOT run code. It produces a JSON object like:
{
  "type": "tool_use",
  "name": "get_weather",
  "input": {"city": "Tokyo", "unit": "celsius"}
}
Your application parses this, calls the real get_weather() function, and sends the result back. The model then incorporates that result into its natural language answer.
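Parsing and dispatching that block on the application side reduces to a name-to-function lookup. A minimal sketch (the get_weather stub and registry here are hypothetical, for illustration only):

```python
# Hypothetical tool implementation -- a real system would call a weather API.
def get_weather(city: str, unit: str = "celsius") -> str:
    return f"22C, Partly Cloudy in {city}"

# Registry mapping the model-facing tool name to the real function.
TOOL_REGISTRY = {"get_weather": get_weather}

def dispatch(tool_call: dict) -> str:
    """Look up the function the model asked for and call it with its args."""
    fn = TOOL_REGISTRY[tool_call["name"]]
    return fn(**tool_call["input"])

call = {"type": "tool_use", "name": "get_weather",
        "input": {"city": "Tokyo", "unit": "celsius"}}
print(dispatch(call))  # 22C, Partly Cloudy in Tokyo
```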
Tool Use vs. RAG vs. Prompting
| Approach | When data is injected | Who decides what data | LLM's role |
| Prompting | Before generation, statically | Developer hardcodes it | Reason over fixed context |
| RAG | Before generation, dynamically | Retrieval system (embeddings) | Reason over retrieved context |
| Tool Use | During generation, on demand | The LLM itself decides | Decide what to fetch, then reason |
RAG retrieves context before the model runs. Tool use happens during generation — the model decides mid-response that it needs external data and explicitly requests it. This is the fundamental difference: tool use gives the model agency.
The Analogy for Backend Developers
If you have built REST APIs, you already understand tool use:
Tool Definition = OpenAPI/Swagger spec (what the endpoint does)
Tool Implementation = Route handler (the actual business logic)
Tool Call = HTTP request (model calls the endpoint)
Tool Result = HTTP response (data sent back to model)
Agentic Loop = Saga/Workflow orchestrator (multi-step coordination)
02. Why It Matters in Real Systems
From Text Generator to Action Taker
Without tool use, an LLM is a fancy autocomplete — it can only generate text based on its training data. With tool use, it becomes an agent that can observe and act on the real world.
User: "What's my order status?"
LLM: "I don't have access to your order information."
User: "What's my order status?"
LLM: [calls search_orders(email="user@example.com")]
LLM: "Your keyboard (ORD-1001) was delivered, and your
USB hub (ORD-1042) is shipped."
RAG vs. Tool Use — Different Problems
| Scenario | RAG | Tool Use | Why |
| "What's our refund policy?" | Yes | No | Static knowledge, retrieve from docs |
| "Process a refund for order #1234" | No | Yes | Requires action + real-time data |
| "What's the weather in Tokyo?" | No | Yes | Real-time data, not in training set |
| "Summarize our API documentation" | Yes | No | Existing documents, no action needed |
| "Find similar tickets, then create a Jira issue" | Yes | Yes | Needs both retrieval AND action |
Where Companies Use Tool Use
| Company Type | Tool Use Application |
| Customer support | Look up order status, initiate refunds, update accounts |
| Code assistants | Read files, run tests, search codebases, apply edits |
| Data analysts | Query SQL databases, run calculations, generate charts |
| Booking platforms | Search flights, check availability, make reservations |
| DevOps copilots | Check deploys, roll back releases, scale infrastructure |
| Financial services | Retrieve balances, execute trades, generate compliance reports |
Business impact:
Accuracy: no hallucinated facts when tools can look them up.
Freshness: real-time data from APIs.
Action: the system can do things, not just say things.
Composability: small, tested functions combine into complex workflows.
Cost efficiency: one LLM + tools replaces building custom UIs for every operation.
03. Internal Mechanics
How the Model "Learns" to Call Tools
During fine-tuning (SFT + RLHF), models are trained on thousands of examples where the correct response is a structured tool call rather than plain text. The model learns a decision boundary: "When the user asks about weather, emit a tool_use block instead of guessing the temperature." This is essentially classification — the model classifies each turn as either "answer directly" or "call tool X with arguments Y."
What the Model Sees (Context Injection)
Tool definitions are not stored anywhere special: when you send tool schemas, the API serializes them into the system portion of the prompt, inside the model's context window. The model sees something like:
You have access to the following tools:
Tool: get_weather
Description: Get current weather for a city...
Parameters:
- city (string, required): City name
- unit (string, optional): celsius or fahrenheit
When you need to use a tool, respond with a tool_use block.
This means tool definitions consume input tokens on every API call. Ten tools with verbose descriptions can easily add 2,000-3,000 tokens to every request.
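You can sanity-check this overhead before shipping. A rough sketch, assuming the common ~4-characters-per-token heuristic (the real tokenizer will differ, so treat this as an estimate only):

```python
import json

def estimate_tool_tokens(tool_definitions: list[dict]) -> int:
    """Rough token estimate for tool schemas: ~4 chars per token for JSON text."""
    serialized = json.dumps(tool_definitions)
    return len(serialized) // 4

tools = [{
    "name": "get_weather",
    "description": "Get current weather for a specific city. Returns temperature...",
    "input_schema": {"type": "object", "properties": {
        "city": {"type": "string"}, "unit": {"type": "string"}}},
}]
print(estimate_tool_tokens(tools))  # small for one tool; scales linearly with count
```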
The API Protocol Flow
Step 1: Define tools as JSON schemas (name, description, input_schema)
Step 2: Send user message + tool definitions to API
POST /messages { model, tools: [...], messages: [{role: "user", ...}] }
Step 3: Model returns stop_reason: "tool_use"
content: [{type: "text", text: "I'll check..."}, {type: "tool_use", id, name, input}]
Step 4: YOUR CODE executes the function
result = get_weather("Tokyo", "celsius") -> "22C, Partly Cloudy"
Step 5: Send tool_result back to the model
messages: [..., {role: "user", content: [{type: "tool_result", tool_use_id, content}]}]
Step 6: Model generates final natural language response
stop_reason: "end_turn"
The Agentic Loop
Every tool-use interaction follows this precise loop. The model never executes code — it expresses intent, and your runtime fulfills it.
1
-> API Request
messages: [user_msg] + tools: [schemas]
You send the user query AND tool definitions
|
2
<- Model Response
stop_reason: "tool_use"
content: [{type: "tool_use", name, id, input}]
Model declares: "I want to call X with args Y"
|
REPEAT until stop_reason = "end_turn"
3
Your Code Executes
dispatch(tool_name, tool_input)
-> DB query, API call, calculation, etc.
This is the security/validation boundary
|
4
-> Feed Result Back
messages: [..., {tool_result, tool_use_id, content}]
Send execution output back to the model
|
5
<- Model Synthesizes or Calls Again
Either: stop_reason: "end_turn" -> final answer
Or: stop_reason: "tool_use" -> another call (loop)
Parallel Tool Calls
Claude can emit multiple tool_use blocks in a single response. This is the model saying "I need data from multiple tools simultaneously." You must execute ALL tool calls and send ALL results back in a single user message. This is a significant performance optimization — instead of separate loop iterations, the model gets multiple results in one round trip.
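Since the calls are independent, they can also run concurrently. A sketch using asyncio.gather, with a hypothetical async execute_tool stub standing in for real DB/API calls:

```python
import asyncio

async def execute_tool(name: str, args: dict) -> str:
    # Hypothetical async executor -- replace with real DB queries / API calls.
    await asyncio.sleep(0.01)
    return f"{name} result"

async def run_parallel(tool_use_blocks: list[dict]) -> list[dict]:
    """Execute every tool_use block concurrently; return ONE result list."""
    outputs = await asyncio.gather(
        *(execute_tool(b["name"], b["input"]) for b in tool_use_blocks)
    )
    # All results go back in a single user message, each tagged with its id.
    return [
        {"type": "tool_result", "tool_use_id": b["id"], "content": out}
        for b, out in zip(tool_use_blocks, outputs)
    ]

blocks = [
    {"id": "tu_1", "name": "get_weather", "input": {"city": "Tokyo"}},
    {"id": "tu_2", "name": "search_orders", "input": {"email": "a@b.com"}},
]
results = asyncio.run(run_parallel(blocks))
```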
The tool_choice Parameter
tool_choice={"type": "auto"}
tool_choice={"type": "any"}
tool_choice={"type": "tool", "name": "get_weather"}
Use "any" or specific tool when you know a tool call is needed and don't want the model to skip it. Useful in pipelines where the LLM's job is specifically to extract structured data for a known tool.
Message Structure Rules
The Claude API enforces strict rules for tool use:
1. Messages must alternate: user -> assistant -> user -> assistant
2. A tool_use block must appear in an assistant message
3. A tool_result block must appear in the user message that immediately follows
4. Every tool_result must reference a tool_use_id from the preceding assistant message
5. Every tool_use block must have a corresponding tool_result
Violating any of these rules produces an API error. This is the most common source of bugs when implementing tool use.
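Rules 4 and 5 are easy to check mechanically before each API call. A small validator sketch, assuming the message shapes shown above:

```python
def validate_tool_pairing(messages: list[dict]) -> list[str]:
    """Check that every tool_use id has a tool_result in the next user message."""
    errors = []
    for i, msg in enumerate(messages):
        if msg["role"] != "assistant" or not isinstance(msg["content"], list):
            continue
        use_ids = {b["id"] for b in msg["content"]
                   if isinstance(b, dict) and b.get("type") == "tool_use"}
        if not use_ids:
            continue
        next_msg = messages[i + 1] if i + 1 < len(messages) else None
        result_ids = set()
        if next_msg and next_msg["role"] == "user" and isinstance(next_msg["content"], list):
            result_ids = {b["tool_use_id"] for b in next_msg["content"]
                          if isinstance(b, dict) and b.get("type") == "tool_result"}
        for missing in use_ids - result_ids:
            errors.append(f"tool_use {missing} has no tool_result")
    return errors
```

Run this in development mode before every request; an empty list means the pairing rules hold.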
04. Practical Example — E-Commerce Support Agent
Problem: Build a support agent that can look up orders, check weather for delivery estimates, and perform calculations — handling multi-step requests autonomously.
User says: "I'm alice@example.com. Can you check my orders and calculate the total cost of everything?"
Model thinks: "I need to look up Alice's orders first"
Model emits: tool_use -> search_orders(email="alice@example.com")
Executes search_orders -> returns order list with prices
Model sees: ORD-1001 ($149.99), ORD-1042 ($49.99)
Model thinks: "Now I need to calculate the total"
Model emits: tool_use -> calculate(expression="149.99 + 49.99")
Executes calculate -> returns "199.98"
Model sees: calculation result
Model generates: "Hi Alice! You have 2 orders:
1. Mechanical Keyboard (ORD-1001) -- $149.99, delivered
2. USB-C Hub (ORD-1042) -- $49.99, shipped
Your total is $199.98."
stop_reason: "end_turn"
Key engineering decisions driven by tool use understanding:
| Decision | Reasoning |
| Separate tool definitions from implementations | Mirrors OpenAPI spec vs handler — testable independently |
| Max iterations guard on the loop | Prevents runaway costs if the model loops endlessly |
| Log every tool call with input/output | Essential for debugging when the model calls the wrong tool |
| Return tool call history in the response | Client-side observability — users can see what happened |
| Execute tools sequentially, not concurrently | Simpler error handling; parallelize only when needed |
05. Hands-on Implementation
Project Structure
03-tool-use/
app.py # Standalone demo — agentic loop with tool use
tools.py # Tool definitions (JSON schemas) + implementations
Setup
pip install -r requirements.txt
cp .env.example .env
cd 03-tool-use
python app.py
Key Implementation Pattern
The core of every tool-use system is this loop. Here is the minimal version:
MAX_ROUNDS = 10  # loop guard -- prevents runaway costs

for _ in range(MAX_ROUNDS):
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        tools=TOOLS,
        messages=messages,
    )
    if response.stop_reason == "tool_use":
        # Preserve the FULL assistant content (text + tool_use blocks)
        messages.append({"role": "assistant", "content": response.content})
        results = []
        for block in response.content:
            if block.type == "tool_use":
                output = execute_tool(block.name, block.input)
                results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": output,
                })
        # ALL tool results go back in a single user message
        messages.append({"role": "user", "content": results})
    elif response.stop_reason == "end_turn":
        return response.content
File Walkthrough: tools.py
Why it is a separate file: This mirrors the backend pattern of separating API specifications from handlers. The tool definitions (JSON schemas) are what the model sees. The implementations are what your code executes. You can test them independently.
Tool Definitions (what the model sees)
TOOL_DEFINITIONS = [
    {
        "name": "get_weather",
        "description": (
            "Get current weather for a specific city. Returns temperature, "
            "condition, and humidity. Use this when the user asks about "
            "weather, temperature, or outdoor conditions in a location."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name, e.g. 'Tokyo'"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
]
Critical details:
description is the most important field — the model reads it to decide WHEN to call the tool. Include "Use this when..." clauses.
input_schema uses standard JSON Schema. required tells the model which parameters it must provide.
enum constrains the model to specific values — prevents inventing invalid options.
Parameter description with examples ("City name, e.g. 'Tokyo'") helps the model fill in the right values.
Security: The Calculator Tool
import math
import re

SAFE_MATH = {"sqrt": math.sqrt, "abs": abs, "round": round, "min": min, "max": max}

def _calculate(params: dict) -> str:
    expression = params["expression"]
    allowed_chars = set("0123456789+-*/.(,) ")
    allowed_words = {"sqrt", "abs", "round", "min", "max"}
    # Reject any character outside the arithmetic whitelist
    if not set(re.sub(r"[a-zA-Z_]+", "", expression)) <= allowed_chars:
        return "Error: Unsafe characters in expression"
    # Reject any identifier that is not a whitelisted math function
    words = set(re.findall(r"[a-zA-Z_]+", expression))
    if not words.issubset(allowed_words):
        unsafe = words - allowed_words
        return f"Error: Unsafe operations not allowed: {unsafe}"
    result = eval(expression, {"__builtins__": {}}, SAFE_MATH)
    return f"Result: {result}"
Critical lesson: The model's output is untrusted input from a security perspective. If the model generates "__import__('os').system('rm -rf /')" as a calculation expression and you pass it to eval(), you have given the LLM code execution. Always sanitize, whitelist, and sandbox.
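A safer alternative to character whitelisting is to walk the expression's AST and allow only arithmetic nodes, so there is no eval() at all. A sketch:

```python
import ast
import operator

# Whitelisted operators; any other node (names, calls, attributes) is rejected.
_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.USub: operator.neg,
}

def safe_calculate(expression: str) -> float:
    """Evaluate arithmetic by walking the AST -- no code execution possible."""
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"Disallowed expression node: {type(node).__name__}")
    return _eval(ast.parse(expression, mode="eval"))

print(safe_calculate("149.99 + 49.99"))
# safe_calculate("__import__('os').system('...')") raises ValueError: the
# Call node is not in the whitelist, so it is never executed.
```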
06. System Design — Production Architecture
In production, the tool-use loop sits behind an orchestration layer with these concerns:
┌─────────────┐ ┌──────────────────┐ ┌─────────────────┐
│ Client │────>│ API Gateway │────>│ Orchestrator │
│ (Frontend) │ │ (Auth, Rate │ │ (Agentic Loop) │
│ │<────│ Limiting) │<────│ │
└─────────────┘ └──────────────────┘ └────────┬────────┘
│
┌───────────────────────────┼────────────────┐
│ │ │
┌─────v─────┐ ┌───────v──────┐ ┌─────v──────┐
│ Tool │ │ Tool │ │ Tool │
│ Registry │ │ Validator │ │ Logger │
│ │ │ (Schema + │ │ (Audit │
│ name->fn │ │ Auth) │ │ Trail) │
└─────┬─────┘ └──────────────┘ └────────────┘
│
┌─────────────┼─────────────┐
│ │ │
┌─────v────┐ ┌─────v────┐ ┌──────v─────┐
│ Product │ │ Shipping │ │ Order │
│ Service │ │ Service │ │ Service │
│ (DB) │ │ (API) │ │ (Write) │
└──────────┘ └──────────┘ └────────────┘
Production Components
1. Tool Dispatcher — The Critical Middleware
This sits between the model and your tools. It handles authorization, rate limiting, input validation, timeout enforcement, audit logging, and error handling with user-friendly messages.
class ToolDispatcher:
    async def execute(self, tool_name, tool_input, user_context):
        self.authorize(user_context, tool_name)        # may this user call this tool?
        self.validate(tool_name, tool_input)           # schema + business rules
        result = await self.run_with_timeout(tool_name, tool_input)
        self.audit_log(user_context, tool_name, tool_input, result)
        return result
2. Idempotency for Mutating Tools
The model might call the same tool twice due to retries or loop errors. Write operations must be idempotent:
async def initiate_refund(params):
    # Same order + amount -> same key -> a retry can never double-refund
    idempotency_key = f"refund-{params['order_id']}-{params['amount']}"
    existing = await refund_store.get(idempotency_key)
    if existing:
        return f"Refund already processed: {existing.refund_id}"
    refund = await payment_service.refund(
        order_id=params["order_id"],
        amount=params["amount"],
        idempotency_key=idempotency_key,
    )
    return f"Refund {refund.id} processed: ${refund.amount}"
3. Timeout Budgets
Each agentic loop iteration is an API call (~1-3s) plus tool execution time. You need a global deadline:
import time

async def chat_with_deadline(message, deadline_seconds=30, max_iterations=10):
    deadline = time.time() + deadline_seconds
    for iteration in range(max_iterations):
        remaining = deadline - time.time()
        if remaining <= 0:
            return "Taking too long. Here's what I have so far..."
        response = await client.messages.create(
            model=MODEL, tools=TOOL_DEFINITIONS, messages=messages,
            # Shrink the output budget as the deadline approaches
            max_tokens=min(1024, int(remaining * 50)),
        )
        # ... then the standard agentic loop body
4. Observability — Your Debugging Lifeline
Log every tool call. When the model calls the wrong tool, these logs are the only way to debug:
{
  "event": "tool_call",
  "request_id": "req_abc123",
  "iteration": 2,
  "tool_name": "search_orders",
  "tool_input": {"email": "alice@example.com"},
  "tool_latency_ms": 34,
  "cumulative_input_tokens": 1847
}
Dashboard metrics to track: tool call rate per tool, tool selection accuracy, average iterations per request (trending up = prompt problem), tool execution latency, loop timeout rate.
5. Graceful Degradation
If a tool fails, send a clear error message back to the model. It can often recover:
try:
    result = await execute_tool(block.name, block.input)
except DatabaseConnectionError:
    result = (
        "Error: The order database is temporarily unavailable. "
        "Please let the user know and suggest they try again."
    )
except ExternalAPIError as e:
    result = f"Error: The weather service returned: {e.message}"
The model will incorporate the error into its response naturally: "I'm sorry, I wasn't able to check your orders right now because our order system is temporarily unavailable."
07. Common Pitfalls
Pitfall 1 (important): Vague tool descriptions
Descriptions like "get data" are useless. The model picks tools based on descriptions — be specific about what each tool does, when to use it, what it returns, and when NOT to use it. Include "Use this when..." and "Do NOT use this for..." clauses.
Pitfall 2 (critical): Dropping content from the assistant message
Only keeping tool_use blocks and dropping text blocks corrupts the conversation context. The assistant message may contain BOTH text and tool_use blocks. Always preserve the entire content list: messages.append({"role": "assistant", "content": response.content})
Pitfall 3 (critical): Mismatching tool_use_id
Each tool_result must reference the exact tool_use_id from the corresponding tool_use block. Hardcoding or mismatching IDs causes the API to reject your request. Always use block.id from the tool_use block.
Pitfall 4 (important): No loop guard
Without a MAX_ITERATIONS guard, the model can loop forever — calling the same tool with the same arguments, or cycling between tools without making progress. Each iteration costs tokens. Always cap iterations and add cost tracking.
Pitfall 5 (critical): Passing model output to eval()
The model's output is untrusted input. It could generate __import__('os').system('rm -rf /'). Even with a "well-behaved" model, adversarial user prompts can manipulate it into generating malicious tool arguments. Whitelist allowed operations or use a sandboxed math parser.
Pitfall 6 (important): Too many tools
Registering 50+ tools in every request wastes tokens (50 tools can add 5,000+ tokens per API call), degrades selection accuracy, and fills the context window faster. Fix: group into categories with two-stage routing, use embeddings for dynamic top-K selection, or split into specialized agents.
Pitfall 7 (important): Stripping tool messages from history
Removing tool_use and tool_result messages to "save tokens" makes the model lose track of what data it already has. It will re-call tools unnecessarily or hallucinate previous results. If you need to reduce context size, summarize the tool results rather than removing them.
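A crude but safe version of this uses truncation as a stand-in for LLM summarization: the message structure and tool_use_id pairing stay intact, only the payload shrinks. A sketch (MAX_RESULT_CHARS is an arbitrary example budget):

```python
MAX_RESULT_CHARS = 500  # hypothetical per-result budget

def compact_tool_results(messages: list[dict]) -> list[dict]:
    """Shrink oversized tool_result payloads instead of deleting messages."""
    compacted = []
    for msg in messages:
        if msg["role"] == "user" and isinstance(msg["content"], list):
            new_content = []
            for block in msg["content"]:
                if (isinstance(block, dict)
                        and block.get("type") == "tool_result"
                        and len(str(block.get("content", ""))) > MAX_RESULT_CHARS):
                    # Keep id and structure; truncate only the content payload
                    block = {**block,
                             "content": str(block["content"])[:MAX_RESULT_CHARS]
                             + " ...[truncated]"}
                new_content.append(block)
            msg = {**msg, "content": new_content}
        compacted.append(msg)
    return compacted
```

In production you would replace the truncation with a cheap summarization call, but the invariant is the same: never break the tool_use/tool_result pairing.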
Pitfall 8 (tip): Not handling parallel tool calls
A single response can contain multiple tool_use blocks. If you only process the first one and break, the model gets partial results and may behave unpredictably. Process ALL tool_use blocks and return ALL results in one user message.
08. Advanced Topics
Explore these next, in recommended order:
8.1 Streaming with Tool Use
Stream text to the user but buffer tool_use blocks until they are complete before executing. Users expect real-time responses, but tool execution must happen between streaming segments.
8.2 MCP (Model Context Protocol)
Anthropic's open standard for connecting LLMs to external tools. MCP servers expose tools via a standardized protocol; clients discover and call tools dynamically. Enables tool interoperability across different LLM applications.
8.3 Dynamic Tool Loading
Embed tool descriptions, match against the user's query with cosine similarity, load the top-K most relevant tools. Reduces context usage and improves selection accuracy.
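A toy sketch of the selection step, using word overlap as a stand-in for embedding cosine similarity (a real system would embed the query and descriptions with an embedding model):

```python
def score(query: str, description: str) -> float:
    """Toy relevance score: word overlap. Real systems use embedding similarity."""
    q, d = set(query.lower().split()), set(description.lower().split())
    return len(q & d) / max(len(q), 1)

def select_tools(query: str, tools: list[dict], k: int = 3) -> list[dict]:
    """Return the top-K most relevant tool definitions for this query."""
    return sorted(tools, key=lambda t: score(query, t["description"]), reverse=True)[:k]

tools = [
    {"name": "get_weather", "description": "get current weather for a city"},
    {"name": "search_orders", "description": "search customer orders by email"},
    {"name": "calculate", "description": "evaluate a math expression"},
]
print(select_tools("what is the weather in tokyo", tools, k=1)[0]["name"])  # get_weather
```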
8.4 Human-in-the-Loop for Dangerous Tools
For tools with side effects (refunds, deletes, deployments), pause and get human approval before executing. The model decides what to do; a human decides whether to let it.
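A minimal sketch of such a gate (the DANGEROUS_TOOLS set and the return shape are illustrative, not a standard API):

```python
# Example set of tools that mutate state and therefore require approval.
DANGEROUS_TOOLS = {"initiate_refund", "place_order", "delete_account"}

def gate_tool_call(tool_name: str, tool_input: dict, approved: bool) -> dict:
    """Intercept side-effecting tools; execute only with human approval."""
    if tool_name in DANGEROUS_TOOLS and not approved:
        # Returned as the tool_result, so the model can tell the user it is waiting.
        return {"status": "pending_approval",
                "message": f"{tool_name} requires human confirmation before it runs."}
    return {"status": "executed",
            "message": f"{tool_name} ran with {tool_input}"}
```

The model still decides what to do; the gate decides whether it is allowed to happen.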
8.5 Multi-Agent Tool Delegation
Supervisor agent routes to specialist agents (database tools, API tools, calculation tools). Reduces per-agent tool count, improves specialization, enables parallel execution.
8.6 Tool Result Caching
Hash tool name + input, check cache before executing, store with TTL. Subtlety: semantic caching with embeddings for varied phrasings of the same query.
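A sketch of the exact-match variant, keyed on a hash of the tool name plus canonicalized input, with a TTL:

```python
import hashlib
import json
import time

class ToolResultCache:
    """Cache tool results keyed on hash(tool name + sorted JSON input), with TTL."""
    def __init__(self, ttl_seconds: float = 60.0):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, str]] = {}

    def _key(self, name: str, args: dict) -> str:
        # sort_keys makes {"a":1,"b":2} and {"b":2,"a":1} hash identically
        payload = json.dumps({"name": name, "args": args}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def get(self, name: str, args: dict):
        entry = self._store.get(self._key(name, args))
        if entry and time.time() - entry[0] < self.ttl:
            return entry[1]
        return None  # miss or expired

    def put(self, name: str, args: dict, result: str):
        self._store[self._key(name, args)] = (time.time(), result)
```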
8.7 Retry and Fallback Strategies
Retry with exponential backoff for transient errors. Fallback to secondary sources. Circuit breaker if a tool fails repeatedly. Always send clear error messages back to the model.
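A sketch of the backoff wrapper (the transient exception set is an example; tune it to your tools' real failure modes):

```python
import random
import time

def retry_with_backoff(fn, max_attempts=4, base_delay=0.5, transient=(TimeoutError,)):
    """Retry transient failures with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except transient:
            if attempt == max_attempts - 1:
                raise  # exhausted -- surface the error to the dispatcher
            # 2^attempt growth, randomized to avoid thundering-herd retries
            delay = base_delay * (2 ** attempt) * (0.5 + random.random() / 2)
            time.sleep(delay)
```

Non-transient errors (bad arguments, authorization failures) should not be retried; send those straight back to the model as error tool_results instead.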
8.8 Tool Use Evaluation and Testing
Measure tool selection accuracy, argument accuracy, end-to-end evaluation, and regression testing when tool descriptions change.
09. Coding Exercises
Exercise 1 — Guardrailed Tools
Add a permission system: mark place_order as requiring explicit user confirmation. Build middleware that intercepts write-operation tool calls and returns a confirmation prompt instead of executing.
Exercise 2 — Parallel Tool Calls
Create a "compare products" flow: when a user says "compare keyboard vs mouse pricing with shipping to Japan," the model should call lookup_product and calculate_shipping for both products. Handle the parallel tool calls correctly.
Exercise 3 — Tool Observability
Build a logging middleware that records every tool call with: timestamp, tool name, input args, output, latency, and token usage. Expose it via a /debug/traces endpoint that shows the full execution trace for any conversation.
10. Architect-Level Questions
Q1: The model keeps calling the wrong tool. It uses get_weather when the user asks about orders. How do you debug and fix this?
Think: check tool descriptions for ambiguity, add "use this when..." / "do NOT use this for..." clauses, log the full message history, isolate with tool_choice, reduce tool count, check system prompt for conflicting instructions.
Q2: You're building a customer support agent with 40 tools. Response quality drops as you add more tools. What's your architecture?
Think: two-tier approach (category router with 5-6 meta-tools, then category-specific tools), embedding-based dynamic selection, multi-agent with specialized tool sets, dynamic loading with MCP.
Q3: A tool call mutates state (processes a refund), but the model's final response fails due to a timeout. The refund went through but the user sees an error. How do you handle this?
Think: idempotency keys, two-phase execution (pending then confirm), durable tool execution log keyed by request ID, compensating actions, never retry mutating calls blindly.
Q4: How does tool use affect token costs, and what strategies reduce spend in a high-volume system?
Think: tool definitions are injected as input tokens on EVERY API call in the loop. 10 tools x 5 iterations = definitions counted 5 times. Strategies: minimize descriptions, dynamic top-K loading (60-80% savings), tool_choice when known, cache results, better system prompts, model routing.
Q5: Compare tool use with RAG. When would you use each, and when would you combine them?
Think: RAG = passive retrieval before generation (knowledge questions, static docs). Tool use = active during generation (actions, live data). Combine: "Am I eligible for a refund on order #1234?" — RAG retrieves the refund policy, tool use fetches the order data, model combines both to answer.
11. Quick Reference Card
┌──────────────────────────────────────────────────────────────┐
│ TOOL USE ENGINEERING CHEATSHEET │
├──────────────────────────────────────────────────────────────┤
│ │
│ TOOL DEFINITION: │
│ name: unique identifier │
│ description: MOST IMPORTANT -- drives selection accuracy │
│ input_schema: JSON Schema for parameters │
│ │
│ AGENTIC LOOP: │
│ while not done and under budget: │
│ response = llm(messages, tools) │
│ if end_turn -> return response │
│ if tool_use -> execute -> append results -> loop │
│ │
│ MESSAGE RULES: │
│ * Alternate user <-> assistant │
│ * Append FULL assistant content (text + tool_use) │
│ * Match tool_use_id in every tool_result │
│ * Send ALL tool_results in one user message │
│ │
│ tool_choice OPTIONS: │
│ auto -> model decides (default) │
│ any -> must use some tool │
│ tool -> must use specific tool │
│ │
│ SECURITY: │
│ * Model output is UNTRUSTED INPUT │
│ * Never eval() model-generated code │
│ * Validate tool arguments before executing │
│ * Idempotency keys for mutating operations │
│ │
│ COST FORMULA: │
│ tool_tokens = num_tools x avg_tokens_per_tool │
│ loop_cost = iterations x (tool_tokens + history_tokens) │
│ Minimize: fewer tools, fewer iterations, cache results │
│ │
│ GOLDEN RULES: │
│ 1. Description quality = selection accuracy │
│ 2. Always guard the loop with max_iterations │
│ 3. Append FULL assistant message, not just tool blocks │
│ 4. Handle parallel tool calls (multiple tool_use blocks) │
│ 5. Log every tool call -- it's your debugging lifeline │
│ │
└──────────────────────────────────────────────────────────────┘