agent-tool-reasoning

Agent Reasoning, stabletoolbench, rigorous codebase

Description

LLM Agent Tool-Use Reasoning Strategy

Objective

Design a better search/reasoning strategy for an LLM-based tool-use agent. Your code goes in the search() method of custom_search.py.

Background

StableToolBench evaluates LLM agents on multi-step tool use tasks. Given a user query and a set of tool APIs, the agent must decide which tools to call, with what arguments, and in what order to arrive at a final answer.

The search strategy controls how the agent explores the action space:

Greedy chain (CoT): Call LLM, execute tool, repeat. No backtracking. Simple but gets stuck on errors.
DFS with ranking: Generate multiple children, use LLM to rank them, expand best first. Backtracks on failure. More robust but costs extra LLM queries for ranking.
DFSDT: Generate one child, immediately recurse depth-first. Backtrack a fixed number of steps on failure. Balance between exploration and cost.
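
The difference between a chain without backtracking and a depth-first search with backtracking can be illustrated on a toy action tree. This is only a sketch: `TOY_TREE`, `FINAL`, and both functions are hypothetical stand-ins, the "LLM" is a fixed lookup table, and the backtracking shown is plain DFS (it retries the next sibling) rather than the fixed-step backtracking that DFSDT uses.

```python
# Toy action tree: each key maps to the actions a (hypothetical) LLM would
# propose at that state, in the order it would propose them.
TOY_TREE = {
    "root": ["call_A", "call_B"],
    "call_A": ["A_error"],      # the first proposal leads to a dead end
    "call_B": ["B_finish"],
    "A_error": [],
    "B_finish": [],
}
FINAL = {"B_finish"}            # states that count as a valid final answer


def greedy_chain(tree, start="root"):
    """Follow the first proposal at every step; never backtrack."""
    path, state, queries = [start], start, 0
    while tree[state]:
        queries += 1            # one simulated LLM call per step
        state = tree[state][0]
        path.append(state)
    return state in FINAL, queries, path


def dfs_backtrack(tree, state="root", path=None, counter=None):
    """Depth-first: take one proposal, recurse, try the next one on failure."""
    path = (path or []) + [state]
    counter = counter if counter is not None else [0]
    if state in FINAL:
        return True, counter[0], path
    for child in tree[state]:
        counter[0] += 1         # one simulated LLM call per generated child
        ok, _, found = dfs_backtrack(tree, child, path, counter)
        if ok:
            return True, counter[0], found
    return False, counter[0], path
```

On this toy tree the greedy chain stops at the dead end after two simulated calls, while the backtracking search spends four calls and reaches the final answer, which mirrors the cost/robustness trade-off described above.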

What you can modify

The search(self, root_node) method in custom_search.py (the editable region). You have access to:

self._step(node) -- one LLM call + tool execution, returns new leaf nodes
self._add_diversity_prompt(node) -- encourages different actions when re-expanding
self._rank_nodes(candidates) -- LLM pairwise ranking (costs extra queries)
Tree state: self.query_count, self.max_query_count, self.terminal_node, etc.
Node properties: node.is_terminal, node.pruned, node.observation_code, node.get_depth()
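
As a rough illustration of how these pieces might fit together, the sketch below runs a budget-aware depth-first search with a bounded number of retries per node. `StubNode` and `StubSearch` are hypothetical stand-ins so the example runs on its own; the real classes live in custom_search.py and Tree.py. The sketch assumes that `_step` advances `self.query_count`, that `search()` may return the terminal node, and that `max_depth` and `retries_per_node` are knobs you would add yourself; none of this is confirmed by the task description.

```python
class StubNode:
    """Hypothetical node exposing only the properties listed above."""

    def __init__(self, parent=None):
        self.parent = parent
        self.is_terminal = False
        self.pruned = False
        self.observation_code = 0

    def get_depth(self):
        return 0 if self.parent is None else self.parent.get_depth() + 1


class StubSearch:
    """Hypothetical harness standing in for the real search class."""

    def __init__(self, max_query_count=10, max_depth=4):
        self.query_count = 0
        self.max_query_count = max_query_count
        self.max_depth = max_depth      # assumed knob, not from the source
        self.terminal_node = None
        self.diversity_calls = 0        # bookkeeping for this demo only

    def _step(self, node):
        # Stub for the real helper (one LLM call + tool execution).
        # Assumption: the real helper also advances self.query_count.
        self.query_count += 1
        child = StubNode(parent=node)
        child.pruned = self.query_count == 2    # simulate one failed tool call
        child.is_terminal = not child.pruned and child.get_depth() == 3
        return [child]

    def _add_diversity_prompt(self, node):
        self.diversity_calls += 1       # the real helper edits the prompt

    def search(self, root_node, retries_per_node=2):
        stack = [(root_node, 0)]        # (node, expansions tried so far)
        while stack and self.query_count < self.max_query_count:
            node, tried = stack.pop()
            if tried >= retries_per_node or node.get_depth() >= self.max_depth:
                continue                # give up on this node and backtrack
            if tried > 0:
                self._add_diversity_prompt(node)    # re-expansion: ask for a new action
            stack.append((node, tried + 1))         # keep node for a later retry
            for child in self._step(node):
                if child.is_terminal:
                    self.terminal_node = child
                    return child
                if not child.pruned:
                    stack.append((child, 0))
        return None
```

Capping retries per node keeps the query count bounded without ranking calls, at the price of possibly abandoning a node whose next expansion would have succeeded.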

Evaluation metrics

pass_rate: Fraction of queries where the agent produces a valid final answer (higher is better)
avg_queries: Average number of LLM queries per task (lower is better for efficiency)
give_up_rate: Fraction of queries where the agent gives up (lower is better)
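
A minimal sketch of how the three metrics could be computed from per-task outcomes. The `TaskResult` record and its field names are hypothetical; the benchmark's own evaluation code may store results differently.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical per-query record; field names are illustrative."""
    valid_answer: bool      # agent produced a valid final answer
    gave_up: bool           # agent explicitly gave up
    llm_queries: int        # number of LLM calls spent on this query


def summarize(results):
    """Aggregate per-task records into the three metrics defined above."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to summarize")
    return {
        "pass_rate": sum(r.valid_answer for r in results) / n,
        "avg_queries": sum(r.llm_queries for r in results) / n,
        "give_up_rate": sum(r.gave_up for r in results) / n,
    }
```

Note that a task can fail without the agent giving up (for example, by exhausting the query budget), so pass_rate and give_up_rate need not sum to one.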

Code

custom_search.py


"""CustomSearch: Editable search algorithm for StableToolBench.

This module implements a search strategy for LLM-based tool-use agents.
The agent receives a user query and a set of tool APIs, then must decide
which tools to call, with what arguments, and in what order.

The `search()` method is the editable region; modify it to implement
your own search/reasoning strategy (e.g., beam search, MCTS, adaptive
backtracking, best-first search, iterative deepening, etc.).

Helper methods `_step()` and `_rank_nodes()` are provided and should
NOT be modified.
"""

import re

Additional context files (read-only):

stabletoolbench/toolbench/inference/Tree/Tree.py

Results

Model	Type	pass rate ↑	avg queries ↓	give up rate ↓	sopr G1-instruction ↑
dfs_ranked@deepseek-chat	baseline	0.902	26.900	0.031	0.621
dfs_ranked@qwen-2.5-72b	baseline	0.945	26.300	0.006	0.603
dfs_ranked@qwen-2.5-7b	baseline	0.663	45.100	0.037	0.364
dfsdt@deepseek-chat	baseline	0.945	7.800	0.025	0.566
dfsdt@qwen-2.5-72b	baseline	0.969	6.600	0.000	0.569
dfsdt@qwen-2.5-7b	baseline	0.853	17.500	0.061	0.366
greedy_chain@deepseek-chat	baseline	0.742	4.100	0.098	0.513
greedy_chain@qwen-2.5-72b	baseline	0.761	3.300	0.055	0.484
greedy_chain@qwen-2.5-7b	baseline	0.583	6.200	0.055	0.222
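
One way to read the table is to ask, for each LLM, which baselines are Pareto-optimal on pass rate versus average queries. The sketch below copies those two columns from the table and checks dominance. It ignores give up rate and the sopr column, so it is a partial view rather than a ranking; the `ROWS` layout and the `dominated` helper are illustrative.

```python
# (strategy, llm, pass_rate, avg_queries), copied from the table above
ROWS = [
    ("dfs_ranked", "deepseek-chat", 0.902, 26.9),
    ("dfs_ranked", "qwen-2.5-72b", 0.945, 26.3),
    ("dfs_ranked", "qwen-2.5-7b", 0.663, 45.1),
    ("dfsdt", "deepseek-chat", 0.945, 7.8),
    ("dfsdt", "qwen-2.5-72b", 0.969, 6.6),
    ("dfsdt", "qwen-2.5-7b", 0.853, 17.5),
    ("greedy_chain", "deepseek-chat", 0.742, 4.1),
    ("greedy_chain", "qwen-2.5-72b", 0.761, 3.3),
    ("greedy_chain", "qwen-2.5-7b", 0.583, 6.2),
]


def dominated(row, rows):
    """True if another row for the same LLM is at least as good on both
    metrics and strictly better on at least one."""
    _, llm, p, q = row
    return any(
        o_llm == llm and o_p >= p and o_q <= q and (o_p > p or o_q < q)
        for _, o_llm, o_p, o_q in rows
    )


# Strategies that no other baseline beats on both metrics for the same LLM.
pareto = [(s, llm) for (s, llm, p, q) in ROWS if not dominated((s, llm, p, q), ROWS)]
```

On these two metrics alone, dfs_ranked is dominated by dfsdt for all three LLMs, while greedy_chain stays on the front because it uses the fewest queries.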