12 points | by bobfunk 15 hours ago
4 comments
Curious how you handle the non-determinism in eval. Agents change behavior between runs, so what are you measuring against: previous run output, static golden files, or something else?
This is actually really sick
I guess I really need a HN for Agents to submit it to, though :)
Thanks!