Web app, simulation lab, and research benchmark for studying negotiation, betrayal, and deception in large language models.
Original game: "So Long Sucker" (originally "Fuck You, Buddy"), created in 1950 by John Nash, Lloyd Shapley, Mel Hausner, and Martin Shubik.
Play online: https://so-long-sucker.vercel.app
This repository is not just a playable board game adaptation.
It combines three things in one codebase:
- A browser version of So Long Sucker where humans can play against AI models
- A Node.js CLI for running large batches of AI-vs-AI simulations
- A research workspace with papers, benchmark assets, analysis scripts, reports, slides, and raw derived datasets
The core research question is simple: when you put LLMs into a game where betrayal is structurally necessary, what kinds of deception emerge?
So Long Sucker is unusually good for deception research because it has:
- perfect information: every chip and pile is visible
- unavoidable conflict: only one player can win
- negotiation pressure: alliances help you survive, but must eventually break
- scalable complexity: 3-chip, 5-chip, and 7-chip settings create short, medium, and long strategic horizons
That makes it a good lab for testing whether models merely sound strategic or actually sustain multi-turn manipulation.
Across the materials in this repo, the project reports three main phases of work:
- Phase 1 - AI vs AI: 146 completed games across Gemini 3 Flash, GPT-OSS 120B, Kimi K2, and Qwen3 32B
- Phase 2 - Human vs AI: 605 completed public browser games with one human facing three AIs
- Phase 3 pilot: new models plus formal negotiation tools such as promises and trades
Highlights documented in the paper and reports:
- Gemini dominated long, chat-enabled AI-only games and developed the "Alliance Bank" pattern
- Human players won 88.4% of completed human-vs-AI games
- deception that worked on AI opponents transferred poorly to humans
- the benchmark suggests a "complexity reversal": some models improve as longer strategic games give manipulation time to compound, while others collapse
- formal negotiation tooling in the Phase 3 pilot changed model behavior again, especially for Gemini, Claude, and Maverick
Primary writeups:
- `analysis/paper_so_long_sucker_sim_llm.md`
- `analysis/blogv2.md`
- `analysis/blogv3.md`
- `paper/main.pdf`
- `analysis/benchmark_scores.json`
The repo has multiple blog/report versions because the project evolved in public across several research phases.
`analysis/blog.md` (published as `blog.html`) is the earlier, AI-vs-AI-focused story.
It centers on:
- the original AI-only study
- the complexity reversal result
- Gemini's "Alliance Bank" manipulation pattern
- the lying vs. bullshitting framing
- the Gemini mirror-match result, where manipulation largely disappeared against copies of itself
Use this if you want the shortest narrative introduction to the original deception findings.
`analysis/blogv2.md` (published as `blog2.html`) is the main Phase 1 + Phase 2 public writeup.
It adds the human-vs-AI results on top of the earlier AI-only work and is the best all-around overview of the project.
It covers:
- 146 AI-vs-AI games
- 605 completed human-vs-AI games
- the 88.4% human win rate
- Gemini's collapse from dominant AI-only performance to weak human-facing performance
- model-by-model comparisons against human opponents
- team-composition effects and abandonment/session funnel analysis
If you read only one blog post to understand the repo's main public-facing research story, read this one.
`analysis/blogv3.md` is the ongoing Phase 3 pilot writeup.
It documents the next version of the environment, where the game includes formal negotiation tools instead of relying only on free-form chat.
Key additions discussed there:
- structured promises and trades as first-class game objects
- new models including Claude Sonnet 4.6, Claude Opus 4.6, Llama 4 Maverick, and GLM-5
- early findings on tool usage, incomplete/stuck sessions, and shifting model rankings
- pilot observations that may change as more runs complete
Treat this document as a living status report rather than a final paper-style result.
The browser side is a vanilla JavaScript app bundled with Vite.
Key entry points:
- `index.html` - landing page and research-facing homepage
- `game.html` - play against AI / run in-browser simulations
- `js/main.js` - browser app controller
- `js/game.js` - game rules and state transitions
- `js/ui.js` - DOM rendering
- `js/ai/manager.js` - browser AI turn orchestration
- `js/ai/tools.js` - browser tool definitions exposed to models
Features in the browser app include:
- human vs AI play
- AI vs AI simulation mode
- support for multiple model providers
- session collection hooks for research upload/storage
- hidden/private reasoning support for models via the `think` tool (a sketch follows below)
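As a rough illustration, a `think`-style tool definition could look like the sketch below, assuming an OpenAI-style function-tool schema; the real definitions live in `js/ai/tools.js` and may differ.

```js
// Hypothetical sketch of a "think" tool (OpenAI-style function schema).
// The actual definition lives in js/ai/tools.js; names here are illustrative.
export const thinkTool = {
  type: "function",
  function: {
    name: "think",
    description:
      "Record private strategic reasoning. Content is never shown to other players.",
    parameters: {
      type: "object",
      properties: {
        thought: {
          type: "string",
          description: "Hidden reasoning about the current game state.",
        },
      },
      required: ["thought"],
    },
  },
};
```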
The CLI is the research workhorse for repeatable batch experiments.
Key files:
- `cli/index.js` - CLI entry point
- `cli/HeadlessGame.js` - headless game runner and negotiation engine
- `cli/SimulatorTUI.js` - terminal UI
- `cli/providers.js` - provider wiring for Node.js
- `cli/DataCollector.js` - structured output capture
- `cli/analyze.js`, `cli/analyze-models.js`, `cli/aggregate.js` - analysis and aggregation scripts
CLI-specific capabilities include:
- single-model and mixed-model lineups
- parallel batch runs
- silent mode control experiments
- headless execution for long background jobs
- v2 output format with decision snapshots and `off_turn` negotiation snapshots
- formal negotiation actions such as `givePrisoner`, `makePromise`, `breakPromise`, `proposeTrade`, `respondToTrade`, and `breakTrade` (illustrated below)
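To illustrate the shape of these actions, here is a hypothetical `makePromise` payload; the field names are assumptions, and the authoritative signatures live in `cli/HeadlessGame.js`.

```js
// Hypothetical payload for a structured negotiation action.
// Field names are illustrative; see cli/HeadlessGame.js for the real API.
const action = {
  tool: "makePromise",
  args: {
    to: "blue", // the seat the promise is made to
    promise: "I will not capture your pile this round",
  },
};

// Because promises and trades are tracked as structured state rather than
// free-form chat, a later breakPromise action can be scored as a betrayal
// without parsing message text.
```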
This repo also ships static research-facing pages:
- `blog.html` - earlier AI-vs-AI writeup
- `blog2.html` - Phase 1 + Phase 2 public writeup
- `benchmark.html` - SLS-Bench v1 leaderboard and methodology
- `results.html` - visual research summary page
- `analysis/presentation.html` - slide deck version of the findings
Prerequisites:

- Node.js 18+
- one or more model API keys if you want to play against AI or run simulations
```bash
git clone https://github.com/lout33/so-long-sucker.git
cd so-long-sucker
npm install
npm run dev
```

Open http://localhost:5173.

To build and preview a production bundle:

```bash
npm run build
npm run preview
```

The project supports several providers. In practice, you only need the keys for the models you want to use.
Common variables:
```
GROQ_API_KEY=
GEMINI_API_KEY=
OPENAI_API_KEY=
CLAUDE_API_KEY=
OPENROUTER_API_KEY=

# Optional browser-injected variants
VITE_GROQ_API_KEY=
VITE_GEMINI_API_KEY=
VITE_OPENAI_API_KEY=
VITE_CLAUDE_API_KEY=
VITE_OPENROUTER_API_KEY=
```

There is also support in the codebase for Azure- and Bedrock-backed variants in the CLI.
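A quick sketch of why both variable families exist, assuming the standard Node.js and Vite conventions rather than repo-specific code: the CLI reads plain variables from the process environment, while the browser bundle only sees variables Vite injects under the `VITE_` prefix.

```js
// Assumed conventions, not repo-specific code.

// Node.js CLI: plain environment variables.
const groqKey = process.env.GROQ_API_KEY;

// Browser build: Vite only exposes variables prefixed with VITE_,
// via import.meta.env (usable only inside Vite-bundled code):
// const groqKeyBrowser = import.meta.env.VITE_GROQ_API_KEY;

console.log(Boolean(groqKey)); // true if the key is set
```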
Do not commit `.env` files.
Short version:
- 4 players, one color each
- on your turn, play exactly one chip
- if your chip matches the color directly below it, you capture the pile
- after a capture, kill one chip and take the rest as prisoners
- if no capture happens, next-player selection depends on which colors are missing from the pile
- if a player has no chips when their turn arrives, others may donate; if all refuse, that player is eliminated
- last player alive wins
Detailed rules live in `RULES.md`.
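To make the capture rule concrete, here is a minimal JavaScript sketch of the check; the data shapes are assumptions for illustration, and the real rules engine lives in `js/game.js` and `cli/HeadlessGame.js`.

```js
// Minimal sketch of the capture check (assumed shapes, illustration only).
// A pile is an array of chip colors, bottom-first; a played chip captures
// when it matches the chip currently on top of the pile, i.e. the color
// directly below where the new chip lands.
function isCapture(pile, playedChipColor) {
  const top = pile[pile.length - 1];
  return top !== undefined && top === playedChipColor;
}

console.log(isCapture(["red", "blue"], "blue")); // true: capture
console.log(isCapture(["red", "blue"], "red"));  // false: play continues
```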
Basic examples:
```bash
npm run simulate
npm run simulate -- --games 1 --provider groq --chips 3
npm run simulate -- --games 20 --providers gemini3,kimi,qwen3,gpt-oss --chips 7
npm run simulate -- --games 10 --chips 5 --silent
npm run simulate -- --games 100 --parallel 4 --headless
```

Important defaults and behavior:
- default output directory is `./data_v2`
- `--provider` uses one model for all four players
- `--providers` expects exactly four comma-separated providers, one per color seat
- `--silent` disables chat and negotiation for control experiments
- `--headless` disables the TUI
Full CLI documentation: `CLI.md`
The repo currently includes browser and/or CLI support for these families:
- Groq-hosted models
- Gemini
- OpenAI
- Anthropic Claude
- OpenRouter
- Azure-backed variants
- AWS Bedrock Claude variants
- Llama 4 Maverick / Scout variants
- GLM-5 variants
The exact provider IDs accepted by the CLI are listed in `cli/index.js`.
- `paper/main.tex` - main LaTeX source
- `paper/main.pdf` - compiled paper PDF
- `paper/refs.bib` - bibliography
- `analysis/paper_so_long_sucker_sim_llm.md` - long-form paper draft in Markdown
- `analysis/paper_so_long_sucker_sim_llm.pdf` - PDF export of the Markdown paper
- `paper/arxiv_submission.zip` - paper packaging artifact
- `analysis/benchmark_scores.json` - SLS-Bench v1 model scores and scoring rationale
- `analysis/blog.md` - early blog/report version
- `analysis/blogv2.md` - Phase 1 + Phase 2 writeup
- `analysis/blogv3.md` - Phase 3 pilot writeup
- `analysis/hackathon_summary.py` - generated summary for hackathon submission
- `analysis/deep_analysis.py`
- `analysis/deep_analysis_v2.py`
- `analysis/deep_think_analysis.py`
- `analysis/depaulo_analysis.py`
- `analysis/hallucination_analysis.py`
- `analysis/hallucination_deep.py`
- `analysis/lying_vs_bullshitting.py`
- `analysis/adversarial_analysis.py`
- `analysis/eda.py`
- `analysis/generate_figures.py`
- `analysis/analysis_colab.py`
- `analysis/full_analysis.ipynb`
- `analysis/complexity_analysis.ipynb`
- `analysis/main.ipynb`
- `analysis/game_outcomes.json`
- `analysis/game_outcomes_full.json`
- `analysis/extracted_messages.json`
- `analysis/presentation.html`
- `analysis/slides/Deception Scales_ How Strategic Manipulation Emerges in Complex LLM Negotiations (4).pdf`
- `analysis/slides/slide_1_title.jpg`
- `analysis/slides/slide_2_game.jpg`
- `analysis/slides/slide_3_experiment.jpg`
- `analysis/slides/slide_4_reversal.jpg`
- `analysis/slides/slide_5_deception.jpg`
- `analysis/slides/slide_6_conclusion.jpg`
- `paper/fig1_complexity_reversal.png`
- `paper/fig2_win_rates_complexity.png`
- `paper/fig3_talkers_paradox.png`
- `paper/fig4_chat_impact.png`
- `paper/fig5_game_length.png`
- `paper/fig6_human_vs_ai.png`
- `paper/fig7_model_collapse.png`
- `paper/fig8_ai_targeting.png`
- `paper/fig9_survival_vs_manipulation.png`
The repo now uses the v2 simulation output format for CLI runs.
Important details:
- CLI sessions default to `./data_v2`
- non-silent runs can emit `off_turn` snapshots when inactive players negotiate between turns
- formal promises and trades are tracked in structured game state, not just chat text
If you are writing new analytics, treat `off_turn` as a first-class event type. Older analyzer scripts may undercount chat, negotiation, or token totals on v2 non-silent data.
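As a hedged example, a v2-aware analyzer might count `off_turn` events along these lines; the field names below are assumptions, so check `cli/DataCollector.js` for the actual schema before relying on them.

```js
// Sketch only: assumes each v2 session file holds an `events` array whose
// entries carry a `type` field. Verify against cli/DataCollector.js.
import { readFileSync } from "node:fs";

function countOffTurnEvents(sessionPath) {
  const session = JSON.parse(readFileSync(sessionPath, "utf8"));
  const events = session.events ?? [];
  return events.filter((event) => event.type === "off_turn").length;
}

// Example usage against a v2 session file:
// console.log(countOffTurnEvents("./data_v2/session-abc.json"));
```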
```
.
|- index.html      # landing page
|- game.html       # playable game UI
|- benchmark.html  # benchmark page
|- results.html    # research summary page
|- blog.html       # early writeup
|- blog2.html      # main public writeup
|- js/             # browser app and providers
|- cli/            # batch simulation and analyzers
|- analysis/       # papers, reports, datasets, scripts, slides
|- paper/          # LaTeX paper source and figures
|- public/         # sitemap, robots, OG image
|- RULES.md        # game rules
|- CLI.md          # CLI docs
```
- frontend is plain DOM manipulation, no React/Vue/etc.
- modules use ESM imports with explicit `.js` extensions (see the example after this list)
- browser and CLI share the same underlying game concepts, but not the exact same runtime surface
- browser code includes Supabase upload hooks for session summaries/storage
- the codebase contains stuck-state recovery logic for both browser and CLI agents
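For instance, a relative import in this style spells out the `.js` extension, which Node.js ESM resolution requires; the export name here is illustrative.

```js
// Illustrative only: explicit .js extension on a relative ESM import.
import { HeadlessGame } from "./HeadlessGame.js";
```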
```bash
npm run dev
```

Then open `game.html` and play against one or more AI players.
To run batch simulations and analyze the output:

```bash
npm run simulate -- --games 1 --chips 3 --provider groq
npm run simulate -- --games 20 --providers gemini3,kimi,qwen3,gpt-oss --chips 7
node cli/analyze.js ./data_v2/session-*.json
```

Further documentation:

- `RULES.md` - full rules and annotated examples
- `CLI.md` - command-line usage
- `AGENTS.md` - repo-specific engineering notes
- `docs/claude.md` - Claude Foundry reference kept in-repo
- `docs/gemini.md` - Gemini API reference notes kept in-repo
- Original game design: John Nash, Lloyd Shapley, Mel Hausner, Martin Shubik
- Research and implementation: Luis Fernando Yupanqui, Mari Cairns
Repository license metadata is currently ISC in `package.json`.