ToolMisuseBench: A deterministic benchmark for tool-augmented agents

(huggingface.co)

1 point | by akgitrepos 10 hours ago

2 comments

akgitrepos 10 hours ago
ToolMisuseBench is a deterministic, offline benchmark dataset for evaluating tool-using agents under realistic failure conditions, including schema misuse, execution failures, interface drift, and recovery under budget constraints.
This dataset is intended for reproducible evaluation of agent tool-use behavior, not for training a general-purpose language model.
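To give a rough feel for what "deterministic" and "recovery under budget constraints" can mean in practice, here is a minimal sketch. It is not the ToolMisuseBench API or data format: `FlakyTool`, `run_episode`, the `query` argument, and the error strings are all hypothetical names made up for illustration. The idea is only that a seeded tool simulator plus a fixed call budget gives the same trace on every run.

```python
import random
from dataclasses import dataclass


@dataclass
class ToolResult:
    ok: bool
    detail: str


class FlakyTool:
    """Hypothetical tool: validates its arguments, then fails pseudo-randomly.

    All randomness comes from a seeded random.Random instance, so the
    sequence of failures is identical for a given seed.
    """

    def __init__(self, seed: int, fail_rate: float = 0.5):
        self.rng = random.Random(seed)
        self.fail_rate = fail_rate

    def call(self, args: dict) -> ToolResult:
        # Schema misuse: 'query' must be a string.
        if not isinstance(args.get("query"), str):
            return ToolResult(False, "schema_error: 'query' must be a string")
        # Execution failure: injected from the seeded generator.
        if self.rng.random() < self.fail_rate:
            return ToolResult(False, "execution_error: transient failure")
        return ToolResult(True, "result for " + args["query"])


def run_episode(seed: int, max_calls: int = 3):
    """Run a toy agent against the tool with a budget of max_calls calls."""
    tool = FlakyTool(seed)
    trace = []
    args = {"query": 42}  # deliberately wrong type on the first call
    for _ in range(max_calls):
        result = tool.call(args)
        trace.append(result.detail)
        if result.ok:
            return True, trace
        if result.detail.startswith("schema_error"):
            # Recovery step: repair the argument type, then retry.
            args = {"query": str(args["query"])}
    return False, trace  # budget exhausted without success


if __name__ == "__main__":
    print(run_episode(seed=7))
    print(run_episode(seed=7))  # same seed, same trace
```

The first call always fails on the schema check, so a budget of one call can never succeed; larger budgets leave room for the repair and for retries after injected failures.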
akgitrepos 10 hours ago
GitHub Repo: https://github.com/akgitrepos/toolmisusebench