This is undeniably intriguing. Will be paying close attention.
This is great. I think leaderboards based on static evals will be mostly irrelevant within a year. Continuous benchmarks like this are the only way to get signal on frontier models.
You mention Opus 4.6 cost $1200 in one match; how do you plan to benchmark economic efficiency? Looking at the performance vs. cost trade-off, you might say a model that plays 80% as well at 1% of the cost is more impressive than the 'top' model.
For a game that runs 4+ hours, it was unfortunately configured with too much reasoning per turn and too large a context window. Reducing both helped lower the cost (though it's still expensive).
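For a rough sense of where the cost comes from, here's a back-of-the-envelope sketch. The turn counts, token counts, and per-token prices below are illustrative assumptions, not the actual match's figures:

```python
# Back-of-the-envelope cost model for a long agentic game.
# All numbers are illustrative assumptions, not measured values.

def match_cost(turns, input_tokens_per_turn, output_tokens_per_turn,
               price_in_per_mtok, price_out_per_mtok):
    """Total API cost in dollars for a match."""
    cost_in = turns * input_tokens_per_turn * price_in_per_mtok / 1e6
    cost_out = turns * output_tokens_per_turn * price_out_per_mtok / 1e6
    return cost_in + cost_out

# A large context re-sent every turn dominates the bill:
big = match_cost(turns=300, input_tokens_per_turn=150_000,
                 output_tokens_per_turn=8_000,
                 price_in_per_mtok=15.0, price_out_per_mtok=75.0)

# Trimming context and reasoning per turn cuts cost sharply:
small = match_cost(turns=300, input_tokens_per_turn=40_000,
                   output_tokens_per_turn=2_000,
                   price_in_per_mtok=15.0, price_out_per_mtok=75.0)

print(f"large config: ${big:,.0f}, trimmed config: ${small:,.0f}")
```

Since the full game state is resent every turn, input tokens scale with both context size and turn count, which is why trimming the context has an outsized effect.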
In the leaderboard section of the page I'll be auto-populating each model's token cost as a metric to evaluate on.
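One simple way to fold token cost into a ranking is a cost-discounted score. This is just a sketch of one hypothetical formula, not what the site actually uses, and the win rates and costs are made-up example numbers:

```python
# Hypothetical cost-adjusted leaderboard scoring.
# The formula and all model stats below are illustrative assumptions.

def cost_adjusted_score(win_rate, cost_usd, alpha=0.5):
    """Discount raw performance by match cost.

    alpha controls how heavily cost is penalized:
    0 ignores cost entirely, 1 divides straight through by dollars.
    """
    return win_rate / (cost_usd ** alpha)

models = {
    "big-model":   {"win_rate": 0.80, "cost_usd": 1200.0},
    "small-model": {"win_rate": 0.64, "cost_usd": 12.0},  # 80% as good, 1% the cost
}

ranking = sorted(models.items(),
                 key=lambda kv: cost_adjusted_score(**kv[1]),
                 reverse=True)

for name, stats in ranking:
    print(name, round(cost_adjusted_score(**stats), 3))
```

Under this toy formula the cheaper model ranks first, which captures the intuition in the question above: near-equal play at a hundredth of the cost wins on efficiency.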
Congrats on the launch. Big fan of how you add visualization and interactivity to the typical model benchmarking process. Any thoughts on how you plan to monetize down the line?
Appreciate it! I wanted to make the AI behavior easy to understand. Our main focus currently is helping AI researchers align their models and developing an open framework for evaluating AI.
Interesting. Did you give the agents any skills for playing civ? If not, are you planning to?
I want to! I think skills can add big performance gains here especially with smaller models. There's a lot of domain knowledge in games so distilling it into a "skill" may allow much smaller models to outcompete the large ones
Have you tried playing the agents yourself? Do they crush human competition?
I was able to beat the AI every time. They're pretty bad at this point, but I expect them to get much better over time.
would you describe yourself as particularly good or the models as particularly bad?
Hey, first of all, cool product. I'm curious why you chose Civ and whether you saw any interesting emergent behaviors.
Thank you! I grew up playing Civilization, and one day, talking with friends, I realized it would be a perfect proxy for how good AI is at long-term planning. I had many frustrating sessions where my early decisions in the game had consequences only much later. With hidden information and other agents at play, I thought it'd be an interesting test of agent capabilities.
This is a sick idea I must say
It was fun building it! Sometimes the LLMs are pretty funny in how they play.