agent-tool-reasoning

Agent Reasoning, stabletoolbench, rigorous codebase

Description

LLM Agent Tool-Use Reasoning Strategy

Objective

Design a better search/reasoning strategy for an LLM-based tool-use agent. Your code goes in the search() method of custom_search.py.

Background

StableToolBench evaluates LLM agents on multi-step tool-use tasks. Given a user query and a set of tool APIs, the agent must decide which tools to call, with what arguments, and in what order, to arrive at a final answer.

The search strategy controls how the agent explores the action space:

  • Greedy chain (CoT): Call the LLM, execute the tool, repeat. No backtracking. Simple, but it gets stuck on errors.
  • DFS with ranking: Generate multiple children, use the LLM to rank them, and expand the best first. Backtracks on failure. More robust, but ranking costs extra LLM queries.
  • DFSDT: Generate one child and immediately recurse depth-first. On failure, backtrack a fixed number of steps. Balances exploration against cost.
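
The difference between a chain without backtracking and a depth-first search that backtracks can be illustrated on a toy action tree. The sketch below is not StableToolBench code: `TOY_TREE`, `propose`, `greedy_chain`, and `dfs_backtrack` are hypothetical names, `propose` stands in for one LLM call, and the DFS shown is a generic one rather than the exact DFSDT procedure.

```python
from typing import List, Optional, Tuple

Path = Tuple[str, ...]

# Hypothetical toy action space: each path maps to the actions proposed next.
# A path ending in "finish" is a valid answer; a path with no entry is a dead end.
TOY_TREE = {
    (): ["search_flights", "search_hotels"],
    ("search_flights",): ["book"],      # "book" leads to a dead end
    ("search_hotels",): ["finish"],
}


def propose(path: Path) -> List[str]:
    """Stand-in for one LLM call: candidate next actions, best guess first."""
    return TOY_TREE.get(path, [])


def is_answer(path: Path) -> bool:
    return bool(path) and path[-1] == "finish"


def greedy_chain(max_steps: int = 10) -> Tuple[Optional[Path], int]:
    """Follow the first proposal at every step and never backtrack."""
    path: Path = ()
    queries = 0
    for _ in range(max_steps):
        if is_answer(path):
            return path, queries
        options = propose(path)
        queries += 1
        if not options:                 # dead end: the chain cannot recover
            return None, queries
        path = path + (options[0],)
    return None, queries


def dfs_backtrack(max_queries: int = 10) -> Tuple[Optional[Path], int]:
    """Depth-first search that falls back to untried siblings on a dead end."""
    stack: List[Path] = [()]
    queries = 0
    while stack and queries < max_queries:
        path = stack.pop()
        if is_answer(path):
            return path, queries
        options = propose(path)
        queries += 1
        # Push in reverse so the first proposal is expanded first.
        for action in reversed(options):
            stack.append(path + (action,))
    return None, queries
```

On this toy tree the greedy chain commits to `search_flights` and fails after three queries, while the backtracking search spends one more query and reaches `finish` through `search_hotels`.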

What you can modify

You can modify the search(self, root_node) method in custom_search.py (the editable region). Inside it, you have access to:

  • self._step(node) -- one LLM call + tool execution, returns new leaf nodes
  • self._add_diversity_prompt(node) -- encourages different actions when re-expanding
  • self._rank_nodes(candidates) -- LLM pairwise ranking (costs extra queries)
  • Tree state: self.query_count, self.max_query_count, self.terminal_node, etc.
  • Node properties: node.is_terminal, node.pruned, node.observation_code, node.get_depth()
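
As a rough illustration of how these pieces might fit together, the sketch below runs a budgeted depth-first `search()` against stub classes. `StubNode` and `StubTree` are hypothetical stand-ins written only so the example runs on its own. The behaviours assumed here (that `_step` increments `query_count`, that `terminal_node` is a list, that `search()` returns the terminal node) are assumptions, not the confirmed behaviour of the real classes in Tree.py or custom_search.py.

```python
from typing import Dict, List, Optional, Tuple


class StubNode:
    """Hypothetical node exposing only the properties listed above."""

    def __init__(self, name: str, parent: Optional["StubNode"] = None,
                 is_terminal: bool = False, pruned: bool = False):
        self.name = name
        self.parent = parent
        self.is_terminal = is_terminal
        self.pruned = pruned

    def get_depth(self) -> int:
        return 0 if self.parent is None else self.parent.get_depth() + 1


class StubTree:
    """Hypothetical search class; the real _step calls an LLM and a tool."""

    def __init__(self, script: Dict[str, List[Tuple[str, bool, bool]]],
                 max_query_count: int = 10):
        self.script = script            # node name -> (child name, is_terminal, pruned)
        self.query_count = 0
        self.max_query_count = max_query_count
        self.terminal_node: List[StubNode] = []   # assumption: a list of answers

    def _step(self, node: StubNode) -> List[StubNode]:
        self.query_count += 1           # assumption: one query per step
        return [StubNode(n, node, t, p) for n, t, p in self.script.get(node.name, [])]

    def search(self, root_node: StubNode, max_depth: int = 8) -> Optional[StubNode]:
        """Budgeted DFS: skip pruned nodes, stop at the first valid terminal."""
        stack = [root_node]
        while stack and self.query_count < self.max_query_count:
            node = stack.pop()
            if node.pruned or node.get_depth() >= max_depth:
                continue                # backtrack to the next untried node
            children = self._step(node)
            for child in children:
                if child.is_terminal and not child.pruned:
                    self.terminal_node.append(child)
                    return child
            stack.extend(reversed(children))   # expand the first child next
        return None


# Usage on a scripted tree: branch "a" ends in a pruned node, "b" succeeds.
tree = StubTree({
    "root": [("a", False, False), ("b", False, False)],
    "a": [("a1", False, True)],
    "b": [("done", True, False)],
})
result = tree.search(StubNode("root"))
```

The budget check against `max_query_count` sits in the loop condition so that the search stops cleanly once the query allowance is spent, whichever branch it is in.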

Evaluation metrics

  • pass_rate: Fraction of queries where the agent produces a valid final answer (higher is better)
  • avg_queries: Average number of LLM queries per task (lower is better for efficiency)
  • give_up_rate: Fraction of queries where the agent gives up (lower is better)
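
The three metrics are simple averages over per-task outcomes. A minimal sketch follows, assuming each task yields a record with a pass flag, a give-up flag, and a query count; the `TaskResult` type and its field names are hypothetical and are not taken from the benchmark's evaluation code.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class TaskResult:
    """Hypothetical per-task record; field names are illustrative only."""
    valid_answer: bool   # the agent produced a valid final answer
    gave_up: bool        # the agent explicitly gave up
    queries: int         # number of LLM queries spent on the task


def summarize(results: Iterable[TaskResult]) -> Dict[str, float]:
    """Aggregate per-task records into the three reported metrics."""
    records = list(results)
    if not records:
        raise ValueError("no task results to summarize")
    n = len(records)
    return {
        "pass_rate": sum(r.valid_answer for r in records) / n,
        "avg_queries": sum(r.queries for r in records) / n,
        "give_up_rate": sum(r.gave_up for r in records) / n,
    }


summary = summarize([
    TaskResult(valid_answer=True, gave_up=False, queries=4),
    TaskResult(valid_answer=True, gave_up=False, queries=6),
    TaskResult(valid_answer=False, gave_up=True, queries=10),
    TaskResult(valid_answer=False, gave_up=False, queries=12),
])
```

The last record in the usage example neither passes nor gives up. The sketch assumes such tasks can exist (for instance, a task that runs out of query budget), so `pass_rate` and `give_up_rate` need not sum to 1.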

Code

custom_search.py
"""CustomSearch: Editable search algorithm for StableToolBench.

This module implements a search strategy for LLM-based tool-use agents.
The agent receives a user query and a set of tool APIs, then must decide
which tools to call, with what arguments, and in what order.

The `search()` method is the editable region; modify it to implement
your own search/reasoning strategy (e.g., beam search, MCTS, adaptive
backtracking, best-first search, iterative deepening, etc.).

Helper methods `_step()` and `_rank_nodes()` are provided and should
NOT be modified.
"""

import re

Additional context files (read-only):

  • stabletoolbench/toolbench/inference/Tree/Tree.py

Results

Model                         Type      pass_rate  avg_queries  give_up_rate  sopr (G1-instruction)
dfs_ranked@deepseek-chat      baseline  0.902      26.900       0.031         0.621
dfs_ranked@qwen-2.5-72b       baseline  0.945      26.300       0.006         0.603
dfs_ranked@qwen-2.5-7b        baseline  0.663      45.100       0.037         0.364
dfsdt@deepseek-chat           baseline  0.945      7.800        0.025         0.566
dfsdt@qwen-2.5-72b            baseline  0.969      6.600        0.000         0.569
dfsdt@qwen-2.5-7b             baseline  0.853      17.500       0.061         0.366
greedy_chain@deepseek-chat    baseline  0.742      4.100        0.098         0.513
greedy_chain@qwen-2.5-72b     baseline  0.761      3.300        0.055         0.484
greedy_chain@qwen-2.5-7b      baseline  0.583      6.200        0.055         0.222
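
One way to read the baselines is to ask, for each model, which strategies are not beaten on both pass rate and query cost at once. The sketch below copies the pass rate and average query figures from the results above and computes that per-model Pareto front. The helper names are hypothetical, and the comparison deliberately ignores the give-up rate and sopr columns, so it is only a partial view.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (strategy, model, pass_rate, avg_queries), copied from the results above.
Row = Tuple[str, str, float, float]
ROWS: List[Row] = [
    ("dfs_ranked", "deepseek-chat", 0.902, 26.9),
    ("dfs_ranked", "qwen-2.5-72b", 0.945, 26.3),
    ("dfs_ranked", "qwen-2.5-7b", 0.663, 45.1),
    ("dfsdt", "deepseek-chat", 0.945, 7.8),
    ("dfsdt", "qwen-2.5-72b", 0.969, 6.6),
    ("dfsdt", "qwen-2.5-7b", 0.853, 17.5),
    ("greedy_chain", "deepseek-chat", 0.742, 4.1),
    ("greedy_chain", "qwen-2.5-72b", 0.761, 3.3),
    ("greedy_chain", "qwen-2.5-7b", 0.583, 6.2),
]


def dominates(a: Row, b: Row) -> bool:
    """True if a is at least as good as b on both axes and better on one."""
    no_worse = a[2] >= b[2] and a[3] <= b[3]
    better = a[2] > b[2] or a[3] < b[3]
    return no_worse and better


def pareto_by_model(rows: List[Row]) -> Dict[str, List[str]]:
    """For each model, list the strategies no other strategy dominates."""
    by_model: Dict[str, List[Row]] = defaultdict(list)
    for row in rows:
        by_model[row[1]].append(row)
    return {
        model: sorted(r[0] for r in group
                      if not any(dominates(other, r) for other in group))
        for model, group in by_model.items()
    }


fronts = pareto_by_model(ROWS)
```

On these two axes, `dfs_ranked` is dominated by `dfsdt` for all three models, while `greedy_chain` stays on the front only because it uses the fewest queries. Taking sopr into account could change this picture, since `dfs_ranked` has the higher sopr on deepseek-chat and qwen-2.5-72b.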