agent-tool-reasoning
Agent Reasoning · stabletoolbench · rigorous codebase
Description
LLM Agent Tool-Use Reasoning Strategy
Objective
Design a better search/reasoning strategy for an LLM-based tool-use agent. Your code goes in the search() method of custom_search.py.
Background
StableToolBench evaluates LLM agents on multi-step tool use tasks. Given a user query and a set of tool APIs, the agent must decide which tools to call, with what arguments, and in what order to arrive at a final answer.
The search strategy controls how the agent explores the action space:
- Greedy chain (CoT): Call LLM, execute tool, repeat. No backtracking. Simple but gets stuck on errors.
- DFS with ranking: Generate multiple children, use LLM to rank them, expand best first. Backtracks on failure. More robust but costs extra LLM queries for ranking.
- DFSDT: Generate one child and immediately recurse depth-first. Backtrack a fixed number of steps on failure. Balances exploration against cost.
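The difference between a greedy chain and a backtracking search can be seen on a toy example. The sketch below is not StableToolBench code: `TREE`, the action names, and both functions are hypothetical, and the depth-first variant backtracks to the nearest untried sibling rather than a fixed number of steps as DFSDT does.

```python
from typing import Dict, List, Optional

# Hypothetical toy action tree: each key maps to the actions a model would
# propose from that state, in proposal order. States that are not keys are leaves.
TREE: Dict[str, List[str]] = {
    "root": ["search_api", "lookup_api"],
    "search_api": ["error"],
    "lookup_api": ["format", "answer"],
    "format": ["error"],
}


def greedy_chain(start: str) -> Optional[List[str]]:
    """Follow the first proposal at every state and never backtrack."""
    path = [start]
    while path[-1] in TREE:
        path.append(TREE[path[-1]][0])
    return path if path[-1] == "answer" else None


def dfs_backtrack(state: str, budget: List[int]) -> Optional[List[str]]:
    """Depth-first search that tries the next sibling when a branch fails.

    budget is a one-element list so that all recursive calls share the
    remaining number of expansions.
    """
    if state == "answer":
        return [state]
    for child in TREE.get(state, []):
        if budget[0] <= 0:
            return None  # out of budget: give up
        budget[0] -= 1
        tail = dfs_backtrack(child, budget)
        if tail is not None:
            return [state] + tail
    return None


print(greedy_chain("root"))         # the first proposal leads to an error leaf
print(dfs_backtrack("root", [10]))  # backtracking recovers a path to "answer"
```

On this tree the greedy chain commits to `search_api` and fails, while the backtracking search spends more expansions and reaches the answer, which mirrors the pass rate versus query count trade-off described above.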
What you can modify
The search(self, root_node) method in custom_search.py (the editable region). You have access to:
- self._step(node) -- one LLM call + tool execution, returns new leaf nodes
- self._add_diversity_prompt(node) -- encourages different actions when re-expanding
- self._rank_nodes(candidates) -- LLM pairwise ranking (costs extra queries)
- Tree state: self.query_count, self.max_query_count, self.terminal_node, etc.
- Node properties: node.is_terminal, node.pruned, node.observation_code, node.get_depth()
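As a rough illustration of how these pieces might fit together, here is a budgeted depth-first search with a per-node retry limit. Everything outside the names listed above is an assumption: `StubNode` and `StubSearch` are hypothetical stand-ins that replay a fixed script instead of calling an LLM, `terminal_node` is assumed to be a list, the `max_depth` and `max_attempts` parameters are invented for the sketch, and the boolean return value is illustrative only.

```python
from typing import List, Optional


class StubNode:
    """Hypothetical stand-in exposing only the node properties listed above."""

    def __init__(self, parent: Optional["StubNode"] = None,
                 is_terminal: bool = False, pruned: bool = False) -> None:
        self.parent = parent
        self.is_terminal = is_terminal
        self.pruned = pruned
        self.observation_code = 0  # code meanings are not specified here

    def get_depth(self) -> int:
        return 0 if self.parent is None else self.parent.get_depth() + 1


class StubSearch:
    """Hypothetical harness. The real helpers call an LLM and a tool;
    these stubs replay a fixed script so the control flow can be run."""

    def __init__(self, script: List[str], max_query_count: int) -> None:
        self.script = list(script)  # one of "ok", "fail", "final" per step
        self.query_count = 0
        self.max_query_count = max_query_count
        self.terminal_node: List[StubNode] = []  # assumed to be a list
        self.diversity_requests = 0

    def _step(self, node: StubNode) -> List[StubNode]:
        self.query_count += 1
        outcome = self.script.pop(0) if self.script else "fail"
        return [StubNode(node, is_terminal=(outcome == "final"),
                         pruned=(outcome == "fail"))]

    def _add_diversity_prompt(self, node: StubNode) -> None:
        self.diversity_requests += 1

    def search(self, root_node: StubNode, max_depth: int = 6,
               max_attempts: int = 2) -> bool:
        """Budgeted depth-first search with a per-node retry limit."""
        stack: List[list] = [[root_node, 0]]  # [node, expansions tried so far]
        while stack and self.query_count < self.max_query_count:
            frame = stack[-1]
            node, tried = frame
            if node.get_depth() >= max_depth or tried >= max_attempts:
                stack.pop()  # node exhausted: backtrack to its parent
                continue
            if tried > 0:
                self._add_diversity_prompt(node)  # ask for a different action
            frame[1] = tried + 1
            for leaf in self._step(node):
                if leaf.is_terminal:
                    self.terminal_node.append(leaf)
                    return True
                if not leaf.pruned:
                    stack.append([leaf, 0])
        return False


agent = StubSearch(["ok", "fail", "ok", "final"], max_query_count=10)
found = agent.search(StubNode())
print(found, agent.query_count, agent.diversity_requests)
```

The sketch skips `_rank_nodes` entirely, so every query goes to expansion rather than ranking. Whether that trade is worthwhile on the real benchmark is something the evaluation would have to show.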
Evaluation metrics
- pass_rate: Fraction of queries where the agent produces a valid final answer (higher is better)
- avg_queries: Average number of LLM queries per task (lower is better for efficiency)
- give_up_rate: Fraction of queries where the agent gives up (lower is better)
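The three metrics are simple averages over tasks. The sketch below computes them from per-task records; the `TaskResult` fields and the `summarize` helper are hypothetical names chosen for illustration, not the benchmark's actual result schema.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TaskResult:
    """Hypothetical per-task record; field names are illustrative."""
    valid_final_answer: bool
    gave_up: bool
    llm_queries: int


def summarize(results: List[TaskResult]) -> Dict[str, float]:
    """Aggregate per-task records into the three metrics defined above."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to summarize")
    return {
        "pass_rate": sum(r.valid_final_answer for r in results) / n,
        "avg_queries": sum(r.llm_queries for r in results) / n,
        "give_up_rate": sum(r.gave_up for r in results) / n,
    }


# Made-up records, used only to exercise the function.
demo = [
    TaskResult(True, False, 4),
    TaskResult(False, True, 10),
    TaskResult(True, False, 6),
    TaskResult(False, False, 20),
]
print(summarize(demo))
```

In this made-up sample, the fourth task neither passes nor gives up, so under these definitions pass_rate and give_up_rate do not have to sum to one.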
Code
custom_search.py
"""CustomSearch: Editable search algorithm for StableToolBench.

This module implements a search strategy for LLM-based tool-use agents.
The agent receives a user query and a set of tool APIs, then must decide
which tools to call, with what arguments, and in what order.

The `search()` method is the editable region — modify it to implement
your own search/reasoning strategy (e.g., beam search, MCTS, adaptive
backtracking, best-first search, iterative deepening, etc.).

Helper methods `_step()` and `_rank_nodes()` are provided and should
NOT be modified.
"""

import re
Additional context files (read-only):
stabletoolbench/toolbench/inference/Tree/Tree.py
Results
| Model | Type | pass rate ↑ | avg queries ↓ | give up rate ↓ | sopr G1-instruction ↑ |
|---|---|---|---|---|---|
| dfs_ranked@deepseek-chat | baseline | 0.902 | 26.900 | 0.031 | 0.621 |
| dfs_ranked@qwen-2.5-72b | baseline | 0.945 | 26.300 | 0.006 | 0.603 |
| dfs_ranked@qwen-2.5-7b | baseline | 0.663 | 45.100 | 0.037 | 0.364 |
| dfsdt@deepseek-chat | baseline | 0.945 | 7.800 | 0.025 | 0.566 |
| dfsdt@qwen-2.5-72b | baseline | 0.969 | 6.600 | 0.000 | 0.569 |
| dfsdt@qwen-2.5-7b | baseline | 0.853 | 17.500 | 0.061 | 0.366 |
| greedy_chain@deepseek-chat | baseline | 0.742 | 4.100 | 0.098 | 0.513 |
| greedy_chain@qwen-2.5-72b | baseline | 0.761 | 3.300 | 0.055 | 0.484 |
| greedy_chain@qwen-2.5-7b | baseline | 0.583 | 6.200 | 0.055 | 0.222 |