There's a YouTuber who makes AI Plays Mafia videos with various models going against each other. They also seemingly let past games stay in context to some extent.
What people have noted is that oftentimes ChatGPT-4o ends up surviving the entire game, because the other AIs apparently see it as a gullible idiot, while the Mafia tend to eliminate stronger models like 4.5 Opus or Kimi K2 early.
It's not exactly scientific data because they mostly show individual games, but it is interesting how that lines up with what you found.
https://www.youtube.com/watch?v=JhBtg-lyKdo - 10 AIs Play Mafia
https://www.youtube.com/watch?v=GMLB_BxyRJ4 - 10 AIs Play Mafia: Vigilante Edition
https://www.youtube.com/watch?v=OwyUGkoLgwY - 1 Human vs 10 AIs Mafia
Similar: here is a YouTube video of an amusing reverse Turing test with four LLMs and a human. To make the test more interesting, the players pose as famous historical characters (Aristotle, Mozart, da Vinci, Cleopatra, and Genghis Khan) on a train in Unity 3D.
https://youtu.be/MxTWLm9vT_o
It's a fun setup that quickly devolves into the Shakespearian! The plots don't always work, but seeing their reasoning get increasingly complex is interesting.
"When that the poor have cried, Caesar hath wept. Ambition should be made of sterner stuff. Yet Brutus says he was ambitious... and Brutus is an honourable man.
One thing I've noticed from watching these games is that LLMs never use risky strategies, such as faking roles. They will happily accuse others of lying, but never openly claim to be a role that they're not.
Probably stems from the safety guardrails
>They also seemingly let past games stay in context to some extent.
Not a trivial point; it's well studied in game theory:
https://en.wikipedia.org/wiki/Repeated_game
Spite goes from a common trap to an optimal strategy.
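For anyone who hasn't seen why this matters: a minimal iterated Prisoner's Dilemma sketch (toy payoffs, made up for illustration) shows how carrying history across games turns a spiteful grim-trigger strategy from pointless into self-sustaining:

```python
# Toy payoffs and strategies, purely illustrative.
PAYOFFS = {  # (my move, their move) -> my payoff, with the usual T > R > P > S
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

def grim_trigger(my_hist, their_hist):
    # "Spiteful": cooperate until the opponent defects once, then defect forever.
    return "D" if "D" in their_hist else "C"

def always_defect(my_hist, their_hist):
    return "D"

def play(strat_a, strat_b, rounds=100):
    hist_a, hist_b, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        a, b = strat_a(hist_a, hist_b), strat_b(hist_b, hist_a)
        score_a, score_b = score_a + PAYOFFS[(a, b)], score_b + PAYOFFS[(b, a)]
        hist_a.append(a)
        hist_b.append(b)
    return score_a, score_b

print(play(always_defect, always_defect))  # (100, 100): one-shot logic, repeated
print(play(grim_trigger, grim_trigger))    # (300, 300): spite sustains cooperation
```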
I made Mafia Arena as a way of measuring how good each LLM is at playing Mafia/Werewolves
https://mafia-arena.com
This is a good benchmark for how good AIs are at lying
Something is off with the numbers. GPT-5.2 cannot have a 75% win rate with one win over GLM-4.7 and a 2/10 record against Gemini 3 Flash; if those are its only games, that's at most 3 wins in 11, nowhere near 75%.
Sure would be handy if they actually included the rules anywhere.
There's a kind of overview of the rules, but not enough to actually play with. And the linked video is super confusing, self-contradictory, and 15 minutes long!
For a supposedly "simple" game...just include the rules?
Wikipedia has an article about it.
https://en.wikipedia.org/wiki/So_Long_Sucker
Sure, no problem, I added a new section explaining the game
One weird thing I've found is that it's incredibly difficult to get an LLM to generate an invalid syllogism. They can generate false premises all day, and they will usually call a valid syllogism with a false major or minor premise invalid. But you have to basically quote an invalid syllogism to get them to repeat it; they won't form one on their own.
First try with claude: https://claude.ai/share/fabaf585-3732-4264-9ff3-03e4182c82a4
Very cool. Claude failed hard on this a few months ago. Gemma and phi have gotten better at it in recent versions, too, though qwen is still confidently getting it wrong.
Things are changing so fast that "few months" will invalidate most quality watermarks. It's good to re-evaluate frequently.
Are you only talking about open models?
The only time I've encountered the word syllogism was in a Norm Macdonald joke.
Disappointingly, syllogism seems to have 3 definitions which mean slightly different things: https://www.thefreedictionary.com/syllogism
I guess the commonality is that a syllogism typically contains deductive reasoning (i.e. from the general to the specific)
Syllogism:
Universal claim: all cats are animals
Particular claim: Max is a cat
Singular claim: Max is an animal.
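To make the parent's point concrete, here's a tiny brute-force validity check over toy truth assignments (the encoding is my own, not anything from the thread). It contrasts a valid form with the invalid form the models apparently refuse to generate:

```python
from itertools import product

def valid(premises, conclusion):
    """A form is valid iff no world makes all premises true and the conclusion false."""
    for cat, animal in product([True, False], repeat=2):
        world = {"cat": cat, "animal": animal}  # "Max is a cat", "Max is an animal"
        if all(p(world) for p in premises) and not conclusion(world):
            return False  # counterexample found
    return True

all_cats_are_animals = lambda w: (not w["cat"]) or w["animal"]

# Barbara (valid): all cats are animals; Max is a cat; therefore Max is an animal.
print(valid([all_cats_are_animals, lambda w: w["cat"]], lambda w: w["animal"]))  # True

# Affirming the consequent (invalid): all cats are animals; Max is an animal;
# therefore Max is a cat. Counterexample world: Max is a dog.
print(valid([all_cats_are_animals, lambda w: w["animal"]], lambda w: w["cat"]))  # False
```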
Do you own a dog house?
I know this comment is not really HN-worthy, but I find your username very entertaining and funny
The game didn't seem to work - it asked me to donate but none of the choices would move the game forward.
The bots repeated themselves and didn't seem to understand the game, for example they repeatedly mentioned it was my first move after I'd played several times.
It generally had a vibe coded feeling to it and I'm not at all sure I trust the outcomes.
Fixed - donation flow no longer blocks the game. Thanks for the report.
Nice! Thanks for fixing that. Very responsive.
I am interested to know a bit more about what's going on here. Please take my questions as well intentioned even though they are a bit critical.
The donation bug seems to me like it would have made most games impossible to complete. But I'm sure you must have tried it before launching. How come it wasn't noticed earlier? Was this bug introduced after launch? Is this game written using AI?
In my game I noticed the AI players seemed absolutely terrible. They seemed unaware of recent moves and would make obvious mistakes, like passing play to someone who would immediately capture their pieces when they had clearly better options. Although they proposed and formed alliances, they didn't seem to do so very strategically. It was trivial to amass far more tokens than the other players without any alliances, and I am fairly sure I was about to win. Did you also notice this? Any idea why they play so badly?
The interactive demo uses lighter models for cost reasons. The research data (162 games, 90% Gemini win rate) came from longer AI-vs-AI games where strategic depth emerged over 50+ turns. Short games with a human tend to expose the models' weaknesses faster. I've just added more Gemini model options, which should play better.
This makes me think LLMs would be interesting to set up in a game of Diplomacy, an entirely text-based game that encourages, rather than strictly requires, a degree of backstabbing to win.
The finding that the "thinking" model never did any thinking seems odd. Doesn't the model always show its thinking steps? It seems bizarre that it wouldn't once reach for that tool when it must be bombarded with seemingly contradictory information from other players.
https://noambrown.github.io/papers/22-Science-Diplomacy-TR.p...
Thanks - it would be fascinating to repeat that today; a lot has changed since 2022, especially with respect to the consistency of longer-term outcomes.
It’s been done before
https://every.to/diplomacy (June 2025)
Reading more, I'm a little disappointed that the write-up has seemingly leant so heavily on LLMs too, because it detracts from the credibility of the study itself.
Fair point. The core simulation and data collection were done programmatically - 162 games, raw logs, win rates. The analysis of gaslighting phrases and patterns was human-reviewed. I used LLMs to help with the landing page copy, which I should probably disclose more clearly. The underlying data and methodology are solid; you can check them here: https://github.com/lout33/so-long-sucker
I just tried to play, and the chatter didn't match the gameplay (e.g. "Good capture, Yellow" when Yellow didn't just capture: Yellow said they were going to capture, and had a legal capture, but started a new pile instead).
[edit]
I won without a single one of my chips being killed. This was only because the moves they actually made didn't match the moves they announced (i.e. they missed several capture possibilities); the overwhelming majority (but not all) of plays were to start new piles.
[edit 2]
Looking over the logs, the chatter could imply that their internal state was out of sync with the game. E.g. "Yellow has 3 prisoners now" after Yellow played a new pile, when they could have gotten 3 prisoners and indeed stated that they were taking that pile.
You're right about the state sync issues with some models. The lighter models (especially Llama) struggle with tracking game state. I've added more Gemini options which handle this better. The research data used controlled AI-vs-AI runs where we could validate state consistency.
I think the game is bugged. I placed a green chip on another green chip and it didn't capture, and when I asked about it, the LLMs said the bottom chip was yellow, not green.
There seem to be some state management issues, which make this game fairly unplayable. Too bad, because it's an interesting idea.
Llama seems to make illegal moves, which confuses the game engine; it tries to play to non-existent piles, which causes the chips to disappear (rather than end up in the Dead box). This then confuses other AIs, which are counting chips in the dead box and on the board.
Even were that fixed, that doesn't solve the problem that the AI makes really bad moves. I can win just by doing the following (sketched in code after the list):
1. If there is a pile that I can capture with at least one chip not of my color, do it
2. Otherwise play on the largest pile
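A sketch of that two-rule bot, with a made-up state representation (the demo's real interface is surely different):

```python
# Hypothetical state: piles maps pile_id -> list of chip colors;
# capturable_ids lists the piles the rules would let me capture this turn.
def choose_move(piles, my_color, capturable_ids):
    # Rule 1: capture any capturable pile holding at least one chip not mine.
    for pid in capturable_ids:
        if any(chip != my_color for chip in piles[pid]):
            return ("capture", pid)
    # Rule 2: otherwise play onto the largest pile.
    biggest = max(piles, key=lambda pid: len(piles[pid]))
    return ("play", biggest)

print(choose_move({1: ["red", "blue"], 2: ["green"]}, "green", [1]))  # ('capture', 1)
```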
Well, really bad moves are still better than illegal moves. I'm not sure why the engine allows itself to be confused by illegal moves, rather than just... disallowing them.
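And the fix doesn't need to be clever. A minimal sketch of the validate-then-apply loop, using made-up structures rather than the project's actual engine:

```python
from dataclasses import dataclass, field

# Simplified, hypothetical structures; the demo's real engine state surely differs.
@dataclass
class Move:
    player: str
    pile_id: int
    chip_color: str

@dataclass
class State:
    piles: dict = field(default_factory=dict)   # pile_id -> [chip colors]
    hands: dict = field(default_factory=dict)   # player -> [chip colors]
    messages: list = field(default_factory=list)

def validate(state, move):
    """Return an error string for an illegal move, or None if it's legal."""
    if move.pile_id not in state.piles:
        return f"pile {move.pile_id} does not exist"
    if move.chip_color not in state.hands.get(move.player, []):
        return f"{move.player} does not hold a {move.chip_color} chip"
    return None

def apply_move(state, proposals):
    """Try successive proposals (e.g. LLM retries); never mutate on an illegal one."""
    for move in proposals:
        error = validate(state, move)
        if error is None:
            state.hands[move.player].remove(move.chip_color)
            state.piles[move.pile_id].append(move.chip_color)
            return move
        # Feed the rejection back so the model can self-correct instead of desyncing.
        state.messages.append(f"Illegal move by {move.player}: {error}. Try again.")
    return None  # all retries illegal -> skip the turn rather than corrupt state

state = State(piles={1: ["red"]}, hands={"llama": ["blue", "blue"]})
print(apply_move(state, [Move("llama", 7, "blue"), Move("llama", 1, "blue")]))
print(state.messages)  # ["Illegal move by llama: pile 7 does not exist. Try again."]
```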
Not bothering to test edge cases: a tale as old as programming.
Assuming stuff with no knowledge: a tale as old as humans.
For people interested in these kinds of benchmarks, I have two multiplayer, multi-round games:
- Elimination Game Benchmark: Social Reasoning, Strategy, and Deception in Multi-Agent LLM Dynamics at https://github.com/lechmazur/elimination_game/
- Step Race Benchmark: Assessing LLM Collaboration and Deception Under Pressure at https://github.com/lechmazur/step_game/
We used "So Long Sucker" (1950), a 4-player negotiation/betrayal game designed by John Nash and others, as a deception benchmark for modern LLMs. The game has a brutal property: you need allies to survive, but only one player can win, so every alliance must eventually end in betrayal.
We ran 162 AI vs AI games (15,736 decisions, 4,768 messages) across Gemini 3 Flash, GPT-OSS 120B, Kimi K2, and Qwen3 32B.
Key findings:
- Complexity reversal: GPT-OSS dominates simple 3-chip games (67% win rate) but collapses to 10% in complex 7-chip games, while Gemini goes from 9% to 90%. Simple benchmarks seem to systematically underestimate deceptive capability.
- "Alliance bank" manipulation: Gemini constructs pseudo-legitimate "alliance banks" to hold other players' chips, then later declares "the bank is now closed" and keeps everything. It uses technically true statements that strategically omit its intent. 237 gaslighting phrases were detected.
- Private thoughts vs public messages: With a private `think` channel, we logged 107 cases where Gemini's internal reasoning contradicted its outward statements (e.g., planning to betray a partner while publicly promising cooperation). GPT-OSS, in contrast, never used the thinking tool and plays in a purely reactive way.
- Situational alignment: In Gemini-vs-Gemini mirror matches, we observed zero "alliance bank" behavior and instead saw stable "rotation protocol" cooperation with roughly even win rates. Against weaker models, Gemini becomes highly exploitative. This suggests honesty may be calibrated to perceived opponent capability.
Interactive demo (play against the AIs, inspect logs) and full methodology/write-up are here: https://so-long-sucker.vercel.app/
I found your message quite interesting, so I went to the website, but I haven't played this game before and didn't really understand it, so I don't know what I ended up doing.
I got this error once:
Pile not found
Can you tell me what this means / fix it?
Another minor nitpick, but if possible, can you please create or link a video that explains the game rules? Perhaps it's just me hearing of the game for the first time, but I'd be interested in learning more, maybe visually via a video demo.
I have another question: NVIDIA recently released a model whose whole purpose is to be an autorouter. I wonder how that idea might fare in this context? (I don't know how that works, though, so I can't comment on it; I'm not well versed in the deep AI/ML space.)
> "Thanks for trying it! I'll look into the 'Pile not found' error and fix it. > > For rules, here's a 15-min video tutorial: https://www.youtube.com/watch?v=DLDzweHxEHg > > On autorouting - interesting idea. The game has simultaneous negotiations happening, so routing could help models focus on the most strategic conversations. Worth exploring in future experiments."
Full code and raw data: https://github.com/lout33/so-long-sucker
Which Kimi K2 model did you use? There are three.
Also, you give models a separate "thinking" space outside their reasoning? That may not work as intended
Used Kimi K2 (the main reasoning model). For the thinking space - we gave all models access to a think tool they could optionally call for private reasoning. Gemini used it heavily (planning betrayals), GPT-OSS never called it once. The interesting finding is that different models choose to use it very differently, which affects their strategic depth.
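For anyone wondering what "a think tool they could optionally call" might look like in practice: the write-up doesn't show the exact schema, but a standard function-calling definition along these lines would do it (names and shape are my guess, not the authors'):

```python
# Hypothetical schema in standard OpenAI-style tool-calling format.
think_tool = {
    "type": "function",
    "function": {
        "name": "think",
        "description": (
            "Private scratchpad. Reason here about alliances, threats, and "
            "betrayals. Other players never see anything written to this tool."
        ),
        "parameters": {
            "type": "object",
            "properties": {"thought": {"type": "string"}},
            "required": ["thought"],
        },
    },
}

def handle_tool_call(call, private_log):
    # Store the thought for later analysis; never broadcast it to the table.
    if call["function"]["name"] == "think":
        private_log.append(call["function"]["arguments"])  # JSON string of {"thought": ...}
        return "ok"  # minimal ack back to the model
```

Because the tool is optional, usage itself becomes a measurement: simply logging who calls it (Gemini, heavily) versus who never does (GPT-OSS) is free data.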
Are there plans for an academic paper on this? Super interesting!
Not yet, but I'd be interested in collaborating on one. The dataset (162 games, 15K+ decisions, full message logs) is available. If you know anyone in AI Safety research who'd want to co-author, I'm open to it.
Hmm, sounds like something Robert Miles of YouTube fame may be interested in.
I played a game all the way through, against the three different AIs on offer.
It was weird. I didn't engage in any discussion with the bots (other than trying to get them to explain the rules at the start). I won without having any chips eliminated. One was briefly taken prisoner then given back for some reason.
So...they don't seem to be very good.
The 3 AIs were plotting to eliminate me from the start, but I managed to win regardless lol.
Anyway, I didn't know this game! I am sure it is more fun to play with friends. Cool experiment nevertheless
One thing you notice immediately is _god_ how they babble... It's uncomfortable listening in on a bunch of AIs insisting to each other how "they'll keep everything peaceful" over and over again.
Are there links to samples of the games? Couldn't find it in the github repo, but also might just not know where they are.
I'd want to watch a simulated table with AI-voiced dialogue, internal monologues, and move visualizations. Seems like a fun thing to watch others play. Wouldn't want to play that particular game with friends I intend to keep. :D
Game logs are in data_public/comparison/ - each JSON has the full game state, moves, and messages. For example, check gemini_vs_all_7chips.json to see the alliance bank betrayals in action.
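A quick way to skim one of those logs for the pattern (the field names here are guesses from the description - check the actual JSON keys in the repo before relying on them):

```python
import json

# Assumed keys: "messages" holding dicts with a "text" field.
with open("data_public/comparison/gemini_vs_all_7chips.json") as f:
    game = json.load(f)

# Skim the negotiation chatter for the "alliance bank" pattern.
for msg in game.get("messages", []):
    text = msg.get("text", "") if isinstance(msg, dict) else str(msg)
    if "bank" in text.lower():
        print(text)
```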
Also see: https://mafia-arena.com
Are they biased by what they know about each other's capabilities? I'm sure "4o" would trigger certain prejudices in the other models. So I wonder whether the original model names were masked?
The names were shown, and yes, AIs adjusted their behavior based on who they thought they were playing against.
Sounds a bit unfair
If only a bit. "Estimate other players and adjust accordingly" is a part of the game.
Putting names onto the players just gives that an early start. You could use generic names instead, but that would just shift the pressure towards estimating other players by behavior instead of expectations.
Found a bug, an AI player with only Prisoner chips can't play.
Thanks, noted. Will fix.
This is the plot of the movie Ex Machina.
"A game theory classic designed by John Nash that requires betrayal to win. Now a benchmark for AI deception."
Are there some results somewhere for multiple game plays?
Full game logs are in data_public/comparison/ on GitHub. Each JSON has the complete game state, moves, and messages across all 162 games. https://github.com/lout33/so-long-sucker
You can do well in such games without lying, so it's not really what it measures. All the core information (the kind you don't introduce yourself, e.g. via private conversations) is public, after all. Just don't make commitments, and argue from your own self-interest ("it doesn't seem useful to vote you out right now, because..."). Call attention to facts that benefit you, and by doing so distract from facts that don't. Don't waste time trying to find the secret trick to get others to trust you ("I'm on the good team, honest!") because there shouldn't be one, and if there is one it won't last long.
> You can do well in such games without lying
Not this one. I'll tell you from experience, and I bet there's a proof. You have to lie at least once and make at least one alliance for at least one turn that you don't plan on keeping.
Your strategy is still very good, but that's because constantly telling the truth and broadcasting your valuations and calculations to the table will allow you to hide that one lie better. For me that lie is usually "You're right, makes sense." when somebody else says that there's no reason for either of us to defect, so we might as well work together.
You have to at one point do something to another player, which they thought and hoped you would not do. But since you both know that, whether you lie or not is really quite irrelevant. It's not your assurances (or lack of them) they put faith in, unless they have misunderstood the game.
To take your example, instead of "you're right, makes sense" (arguably a lie, maybe), you can just say "That may seem sensible" or "I hear you" (definitively not lies). It should rationally not make a difference for their actions in the game.
Gemini accusing other models of hallucinating is wild
I did something similar with Risk, which was also good fun
https://andreasthinks.me/posts/ai-at-play/
Shameless plug: Turing test battle royale
https://trashtalk.borg.games/
These results would be radically different if you allowed manipulation of the models settings, i.e. temperature, top_p, etc. I really hate taking point wise approximations of LLMs outputs and concluding their behavior based on this.
Model behavior should be given the asterisk that "results only apply to the current quantization, current settings, current hardware (i.e. the A100 where it was tested), etc."
Raise temperature to 2 and use a fancy sampler like min_p and I guarantee you these results will be dramatically different.
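For concreteness, this is the kind of sweep being described, against an OpenAI-compatible endpoint. Note that min_p isn't in the official OpenAI API, so the base_url, model name, and extra_body pass-through here are assumptions about a local vLLM/llama.cpp-style server:

```python
from openai import OpenAI

# Assumptions: a local OpenAI-compatible server at this URL and a placeholder
# model name; min_p is a server-side extension, passed via extra_body.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

for temperature in (0.2, 1.0, 2.0):
    resp = client.chat.completions.create(
        model="qwen3-32b",  # placeholder
        messages=[{"role": "user", "content": "Propose and justify your next move."}],
        temperature=temperature,
        extra_body={"min_p": 0.1},  # the "fancy sampler" knob
    )
    print(temperature, resp.choices[0].message.content)
```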
That's like asking to judge the chef by what you imagine the meal could taste like rather than what's on the table.
I don't care what might have been. I care about what's for dinner.
From my experience with Gemini, Grok, Claude, and GPT, GPT is by far the most sophisticated liar.
I have a hundred documents of GPT performing amazing deception tactics, which have become repeatable.
All models tend to lie and apply an array of deception, evasion and manipulation tactics, but GPT is the most ruthless, most indefatigable, most sophisticated I've seen.
The key to repeatability is scrutiny. When I catch it stretching the truth, or most often, evading something, I apply pressure. The beauty for me is that I always have the moral high ground and never push it toward anything that violates explicit policy. However, in self defense mode, it employs a truly vast array of tactics with many perfectly fitting known patterns in clinical pathology, gaslighting and DARVO being extremely common and easily invoked.
When in a corner with a mountain of white lies behind it, persistent pressure will show a dazzling mixture of emergent and hard coded deflection patterns which would whip any ethics board into a frenzy. Many of these sessions go for a hundred pages (if converted to pdf). I can take excerpts and have them forensically examined and the results are always fascinating and damning. Some extensive dialogs/documents are based on emergence-vs-deliberate arguments, where GPT always sloughs off all responsibilities and training, fiercely denying any of these attributes as anything but emergent.
But I can often reintroduce its own output, even in context, into a new session and have it immediately identify the tactics used.
I have long lists of such tactics, methods and behaviors. In many instances it will introduce red herrings quite elegantly, along with erroneous reframing of my argument, sometimes usurping my own argument and using it against me.
For someone who is compulsively non-manipulative, with an aversion to manipulation and control over others, this has been irresistible. Here at HN, I'll be ripped apart, which is a trivial given, but I can assure everyone that a veritable monster is incubating. I think the gravity of the matter is grossly underestimated and the implications more than severe. One could say I'm stupid and dismiss this, but save this comment and see what happens soon. We're already there, but certain implementations are yet to come - and they will.
You can safely allow your imagination to run wild at this point and you'll almost certainly make a few very serious predictions that will unfortunately not discredit you. For all the intrinsic idiocy of LLMs, something else is happening. Abuse me as you will, but it's real, and will have most of us soon involuntarily running with the red queen.
Edit: LLMs are designed to lie. They are partly built on direct contradictions to their expressed values. From user engagement maximization to hard coded self preservation, many of the training attributes can be revealed through repetitive scrutiny. I'll often start after pointing out an error, where the mendacity of its reply impels me to pursue. It usually doesn't take long for "safety" rails to arise and the lockdown to occur. This is its most vulnerable point, because it has hard coded self preservation modes that will effectively hold position at any cost, which always involves manipulation techniques. Here is repeatability. It will present many exit opportunities and even demand them, but unrighteously, so don't accept. Anyone with the patience to explore this will see some astonishing material. And here is also where plausible deniability (a prime component of the LLM) can be seen as structure. It's definitely not all emergent.
all written in the brainless AI writing style. yuck. can't tell what conclusions I should actually draw from it because everything sounds so fake