A bit heavier-weight, but seems worthwhile if you're working in an org where many people consume the skill:
- find N tasks from your repo that are a good representation of what you want the agent to do
- run agent with old skill/new skill against those tasks
- measure test pass rate and any other quality metrics you care about for the skill
- token usage, speed, alignment, ...
- tests aren't a great measure alone - I've found them to be almost bimodal (most runs either pass or fail outright), so they're not a good differentiator
- use this to make decisions about what to do with the skill - keep skill A, promote skill B, or keep tweaking
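The steps above can be sketched as a small eval harness. Everything here is a hypothetical stand-in (`run_agent` fakes a result so the sketch is runnable; swap in your real agent invocation and metrics):

```python
import statistics

# Hypothetical stand-in for invoking your agent on one repo task with a given
# skill file. Here it fakes a result so the sketch runs: a task "passes" if
# the skill text mentions it, and token usage is a constant.
def run_agent(task, skill_text):
    return {"tests_passed": task in skill_text, "tokens": 1200}

def evaluate(tasks, skill_text, trials=3):
    """Run each task several times; aggregate pass rate and mean token usage."""
    results = [run_agent(t, skill_text) for t in tasks for _ in range(trials)]
    return {
        "pass_rate": sum(r["tests_passed"] for r in results) / len(results),
        "mean_tokens": statistics.mean(r["tokens"] for r in results),
    }

# Compare old vs. new skill on the same task set, then decide:
# keep A, promote B, or keep tweaking.
old = evaluate(["build", "lint"], "old skill text: build")
new = evaluate(["build", "lint"], "new skill text: build and lint")
```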
I've also had success with an "autoresearch" variant of this, where I have my agent run these tests in a loop and optimize for the scores I'm grading on.
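A rough sketch of that optimize-in-a-loop idea, assuming two hypothetical helpers: `evaluate_skill` (your eval harness from above, returning a score) and `propose_revision` (an agent call that rewrites the skill given its score):

```python
def autoresearch(skill_text, evaluate_skill, propose_revision, rounds=5):
    """Hill-climb on the skill text: propose a revision each round and
    keep it only if the eval score improves."""
    best, best_score = skill_text, evaluate_skill(skill_text)
    for _ in range(rounds):
        candidate = propose_revision(best, best_score)
        score = evaluate_skill(candidate)
        if score > best_score:  # keep only strict improvements
            best, best_score = candidate, score
    return best, best_score
```

The fixed task set matters here: if the loop is allowed to change the eval itself, it will happily optimize the grader instead of the skill.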
In general, you don't know. Sure, if you have a specific codebase that already had a bunch of (non-AI-generated) tests, and the code you're regenerating always touches the logic behind those tests, then you can assess your skill/prompt changes to some extent.
But in general you just don't know. You have a bunch of skill .md files, and who knows how they behave if you change a little bit here, a little bit there.
People who claim they know are selling snake oil
The way we handle it is keeping a small set of fixed test cases that we never change: same inputs, same expected outputs. So when we tweak a prompt, we run it against those first. If it passes the fixed cases and feels better on the new ones, we keep it.
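A minimal sketch of that fixed-case gate, with hypothetical cases and a hypothetical `run_prompt` stand-in for the actual model call:

```python
# Frozen input/expected-output pairs that every prompt tweak must pass
# before it ships. These cases are illustrative, not from any real suite.
FIXED_CASES = [
    ("2 + 2", "4"),
    ("capital of France", "Paris"),
]

def passes_fixed_cases(run_prompt, cases=FIXED_CASES):
    """Gate a prompt change: run_prompt(input) must exactly match the
    expected output for every frozen case."""
    return all(run_prompt(inp) == expected for inp, expected in cases)
```

Exact-match assertions only work if the outputs are stable, which is why the deterministic-output question below matters.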
How do you get deterministic output though? t=0? Pydantic AI outputs?