We are building human simulators: foundation models that imitate how people think, feel, decide, and act across interactive scenarios. The next foundation model should not just answer humans, but simulate human-side behavior with realistic social grounding.
This repository is the training and evaluation codebase for the OdysSim project. It contains the full pipeline for behavioral foundation models, from midtraining on large-scale human-behavior data to task-specific RL, verbal-feedback post-training, expert distillation, and SOUL evaluation.
Papers: Reinforcing Human Behavior Simulation via Verbal Feedback | Building Foundation Models for Human Behavior Simulation
- Behavioral midtraining / SFT: train base models on large-scale human-behavior corpora such as OdysSim, with social grounding in the prompt format.
- Multi-turn, multi-agent RL: built on top of verl, with rollout loops for human-simulation training across interacting agents.
- Learning from verbal feedback: efficient support for verbal-feedback RL, forward distillation, and reverse/on-policy distillation from LLM-judge critiques.
- Unified SOUL evaluation suite: 20+ human-likeness tasks with training environments.
- Unified SFT/RL/evaluation framework: midtraining, post-training, and evaluation share one code path.
[2026/06/11] We released OdysSim.
[2026/05/20] We released Ditto.
| Model | Link |
|---|---|
| OSim-8B | OdysSim collection |
| Ditto-8B | sunweiwei/Ditto-8B |
Note: This repo is built on top of verl v0.7.0, with this patch applied to support multi-agent RL, on-policy distillation, and several model fixes.
Run inside the official verl 0.7.0 image verlai/verl:vllm012.latest.
- verl/: Core RL/SFT training infrastructure
- agents/: Agent rollout loops and task environments
- sft/: SFT and midtraining utilities
- recipe/ditto/: Frozen recipe for the Ditto paper
- plot/NeurIPS2026_user_sim_phase3/: OdysSim paper source
- data/: Local data directory
- run_sft.sh: Midtraining / SFT entry
- run_rl.sh: Per-task RL entry (GRPO or verbal-feedback RL)
- recipe/ditto/eval.sh: Eval-only entry across the SOUL suite
- train_sft.py: SFT trainer
- train_ppo.py: PPO/GRPO trainer
OdysSim release data:
| Split | Dataset |
|---|---|
| Midtraining | cmu-lti/osim-mid-training |
| Post-training | cmu-lti/osim-post-training |

    huggingface-cli download cmu-lti/osim-mid-training --repo-type dataset --local-dir data/osim_mid_training
    huggingface-cli download cmu-lti/osim-post-training --repo-type dataset --local-dir data/osim_post_training

Ditto / legacy task data used by the current run_rl.sh and recipe/ditto/eval.sh scripts:
| Split | Dataset |
|---|---|
| RL Train | sunweiwei/sim-rl-data |
| Eval | sunweiwei/sim-eval-data |

    huggingface-cli download sunweiwei/sim-rl-data --repo-type dataset --local-dir data/sim_rl_data
    huggingface-cli download sunweiwei/sim-eval-data --repo-type dataset --local-dir data/sim_eval_data

Each task has its own train and validation parquet files.
run_sft.sh is the entry point for SFT-style training and OdysSim
midtraining. By default it follows the paper setup: Qwen3-8B base,
16K-token prompts, 8K-token responses, batch size 1024, peak LR 1e-5,
4500 training steps, and lazy loading for the full OdysSim corpus.
After downloading cmu-lti/osim-mid-training into data/osim_mid_training,
the script auto-detects the train and validation shards. For a custom layout,
pass explicit globs through TRAIN_FILES and VAL_FILES.

    # Default: DATA_DIR=data/osim_mid_training
    bash run_sft.sh

    # Explicit shard layout
    TRAIN_FILES="data/osim_mid_training/train_shard_*.parquet" \
    VAL_FILES="data/osim_mid_training/val_shard_*.parquet" \
    bash run_sft.sh

Common overrides:

    DATA_DIR=data/osim_mid_training \
    ACTOR_MODEL_PATH=Qwen/Qwen3-8B \
    EXPERIMENT_NAME=osim-8b-mid \
    N_GPUS=8 \
    TOTAL_TRAINING_STEPS=4500 \
    bash run_sft.sh

Optional RL-style generative evaluation during SFT is disabled by default. To enable it, set RL_TEST_FILES and a positive RL_TEST_FREQ.
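
When passing explicit globs through TRAIN_FILES and VAL_FILES, it can help to confirm that each pattern matches at least one file before starting a long run. The following is a minimal sketch under stated assumptions: count_matches and require_shards are hypothetical helpers that are not part of this repository, and nothing here describes how run_sft.sh itself resolves shards.

```shell
# Hypothetical helpers (not part of this repository) for checking shard globs.

# Print how many existing files a glob pattern matches.
# The pattern is expanded unquoted on purpose, so paths containing
# spaces are not supported by this sketch.
count_matches() (
  n=0
  for f in $1; do
    # An unmatched glob stays as the literal pattern, which -e rejects.
    if [ -e "$f" ]; then
      n=$((n + 1))
    fi
  done
  echo "$n"
)

# Report the match count for a named pattern; fail if nothing matches.
require_shards() (
  name=$1
  pattern=$2
  n=$(count_matches "$pattern")
  if [ "$n" -eq 0 ]; then
    echo "error: $name='$pattern' matched no files" >&2
    exit 1
  fi
  echo "$name: $n file(s)"
)
```

Calling require_shards once for TRAIN_FILES and once for VAL_FILES before bash run_sft.sh would stop early on a mistyped pattern. Both functions use a subshell body, so their variables do not leak into the calling shell.
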
Post-training is per task. The agent_version setting in run_rl.sh selects
the objective:
- default = vanilla GRPO
- copy = verbal-feedback RL, as used by Ditto

When verbal-feedback RL is enabled, the training loop calls an OpenAI-compatible judge model for verbal critiques and rewrites, so set the API environment variables first:

    export OPENAI_API_KEY=...
    export OPENAI_BASE_URL=https://api.openai.com/v1/

Run one task:

    # Top-level script defaults to vanilla GRPO.
    bash run_rl.sh sotopia

    # Ditto recipe defaults to verbal-feedback RL.
    bash recipe/ditto/run_rl.sh sotopia

Supported tasks: sotopia, coser, lifechoices, userllm, mirrorbench, fantom, hitom, paratomi, mistakes, twinvoice, social_r1, behaviorchain, sim_math, sim_doc, humanual_{book,chat,email,news,opinion,politics}, alignx, socsci210, humanllm.
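
Two cheap checks can catch common mistakes before a per-task run starts: a misspelled task name and missing judge credentials. The sketch below is illustrative only. SUPPORTED_TASKS, is_supported_task, and require_env are hypothetical names that are not part of this repository; the task names are copied from the list above with the humanual_{...} shorthand written out, and whether run_rl.sh performs its own validation is not stated here.

```shell
# Hypothetical pre-launch checks; these helpers are not part of the repository.

# Task names copied from the supported-task list, with the six
# humanual_* variants written out explicitly.
SUPPORTED_TASKS="sotopia coser lifechoices userllm \
mirrorbench fantom hitom paratomi mistakes twinvoice \
social_r1 behaviorchain sim_math sim_doc \
humanual_book humanual_chat humanual_email \
humanual_news humanual_opinion humanual_politics \
alignx socsci210 humanllm"

# Succeed only if the argument exactly matches a supported task name.
is_supported_task() (
  for task in $SUPPORTED_TASKS; do
    if [ "$task" = "$1" ]; then
      exit 0
    fi
  done
  exit 1
)

# Succeed only if every named variable is exported and non-empty.
# printenv sees exported variables only, which is what a child
# process such as a training script would inherit.
require_env() (
  missing=0
  for name in "$@"; do
    if [ -z "$(printenv "$name")" ]; then
      echo "error: $name is not exported or is empty" >&2
      missing=1
    fi
  done
  exit "$missing"
)
```

For example, a wrapper could run is_supported_task on the requested task and require_env OPENAI_API_KEY OPENAI_BASE_URL, and call bash recipe/ditto/run_rl.sh only when both succeed.
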
recipe/ditto/eval.sh runs the full 27-task SOUL evaluation suite in two
modes: local for a checkpoint or open-source HF model via vLLM, and
api for any OpenAI-compatible endpoint.

    # Eval the released Ditto-8B checkpoint
    bash recipe/ditto/eval.sh local

    # Eval your own trained checkpoint
    ACTOR_MODEL_PATH=outputs/ditto-rl-sotopia/global_step_200 \
    bash recipe/ditto/eval.sh local

    # Eval an open-source HF model
    ACTOR_MODEL_PATH=Qwen/Qwen3-8B-Instruct \
    bash recipe/ditto/eval.sh local

    # Eval an API model
    OPENAI_AGENT_MODEL=gpt-5.4-mini \
    OPENAI_AGENT_BASE_URL=https://api.openai.com/v1/ \
    OPENAI_AGENT_API_KEY=$OPENAI_API_KEY \
    bash recipe/ditto/eval.sh api

    # Eval a local vLLM / SGLang server through an OpenAI-compatible endpoint
    OPENAI_AGENT_MODEL=Qwen3-8B-Instruct \
    OPENAI_AGENT_BASE_URL=http://localhost:8000/v1/ \
    OPENAI_AGENT_API_KEY=EMPTY \
    bash recipe/ditto/eval.sh api

@article{zhou2026odyssim,
title = {OdysSim: Building Foundation Models for Human Behavior Simulation},
author = {Zhou, Xuhui and Sun, Weiwei and Du, Weihua and Liu, Jiarui and Sun, Haojia and Ma, Qianou and Wu, Tongshuang and Yang, Yiming and Sap, Maarten},
year = {2026}
}
@article{sun2026ditto,
title = {Reinforcing Human Behavior Simulation via Verbal Feedback},
author = {Sun, Weiwei and Zhou, Xuhui and Liu, Jiarui and Du, Weihua and Sun, Haojia and Xie, Yiqing and Ma, Qianou and Chen, Sihao and Wan, Mengting and Yang, Longqi and Zhou, Pei and Wu, Sherry and Welleck, Sean and Neubig, Graham and Yang, Yiming and Sap, Maarten},
year = {2026},
eprint = {2605.20506},
archivePrefix = {arXiv},
url = {http://arxiv.org/abs/2605.20506}
}


