SubTokenTest is a suite of independent, task-specific benchmarks for large language models with a unified runner. Each task ships with its own prompts, data generation, and evaluation logic, while `cli.py` in the repo root lets you launch any task with consistent model backends (APIs or local vLLM).
- Unified entrypoint: `python cli.py run <task> -- <task-flags>` discovers `tasks/<task>/main.py`, sets the working directory, and forwards all arguments.
- Centralized configs: copies of every task's configs live under `configs/<task>/`; `configs/locator.py` resolves relative paths for you.
- Shared model adapters in `models/`: OpenAI-compatible APIs (including reasoning-token accounting for o-series/DeepSeek) and vLLM with chat-template fallback.
- Rich datasets and generators: ready-made JSONL corpora in `datasets/` plus per-task generators (e.g., `generate_contexts.py`, `generate_datasets.py`, `run_model_test.py`).
- Built-in token usage reporting: most tasks record prompt/completion/reasoning tokens and ratios alongside accuracy metrics.
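The dispatch behavior of the unified entrypoint can be sketched roughly as follows. This is an illustrative simplification, not the actual `cli.py` code — the real dispatcher also handles config resolution and the legacy invocation form:

```python
import subprocess
import sys
from pathlib import Path

def run_task(task: str, task_flags: list[str]) -> int:
    """Locate tasks/<task>/main.py and forward all flags to it."""
    script = Path("tasks") / task / "main.py"
    if not script.exists():
        raise SystemExit(f"unknown task: {task}")
    # Run from the task's directory so its relative paths resolve.
    return subprocess.call(
        [sys.executable, script.name, *task_flags],
        cwd=script.parent,
    )
```

Everything after `--` on the command line simply ends up in `task_flags`, which is why each task can define its own flag set.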
- Python 3.10+ recommended; a GPU is needed for vLLM.
- Install dependencies with `uv` (see `/scripts/UV_SETUP.md` for uv setup notes):

  ```bash
  uv venv .venv
  source .venv/bin/activate
  uv sync  # add --group dev for lint/test tools
  ```

- Set API keys as needed: `OPENAI_API_KEY`, `DEEPSEEK_API_KEY` (DeepSeek also honors `--base_url`).
- List tasks: `python cli.py list`
- Run a task (flags after `--` go straight to the task script):

  ```bash
  # Adversarial prompt canonicalization with OpenAI API
  python cli.py run adversarial_prompt -- \
    --model_type openai \
    --config configs/adversarial_prompt/benchmark_config.yaml \
    --num_samples 20

  # Map navigation with a local vLLM model
  python cli.py run map_navigation -- \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --model-type vllm \
    --config configs/map_navigation/model_config.py \
    --data data/sokoban_test.json \
    --output results/llama3_sokoban.json
  ```
- See `/scripts/cli_run.sh` for detailed instructions on running each task.
- You can also call task scripts directly, e.g., `python tasks/gomoku/main.py --help`.
- Example experiment logs are also available under `/scripts`.
- `adversarial_prompt`: Normalize perturbed/jailbreak prompts back to their canonical form with exact-match/Levenshtein metrics and token-usage breakdowns; supports GPT-generated contexts.
- `aligned_table`: Reconstruct aligned tables (LaTeX/Markdown/plaintext) from structured data; reports alignment/content scores per format.
- `biological_sequence`: DNA/RNA complements and protein code conversions across four sub-tasks; configurable model list and data loading/generation.
- `cipher_decipher`: Morse and Caesar encode/decode tasks with an async runner, prompt-style overrides, and response logging.
- `context_aware_redaction`: Detect and mask PII (phone, ID, credit card) in context; works with API or vLLM backends.
- `gomoku`: Classify Gomoku board states (win/draw) across board sizes; includes dataset generators and response saving.
- `map_navigation`: QA over Sokoban and FrozenLake maps using system/user prompts; exact-match evaluator with batch support.
- `rsa_randomart`: Find coordinate differences between RSA randomart patterns; supports data-only generation and weighted coordinate/replacement scoring.
- `tree`: Binary-tree reasoning (structure queries and path analysis) with similarity scoring and difficulty-aware stats.
- `typewriter`: Typewriter/backspace simulation tasks with system/few-shot prompting, optional vLLM batching, and multi-model evaluation.
Where implemented, `--restricted_reasoning` nudges models toward concise outputs, and most tasks accept `--verbose` to print prompts/responses and token tallies.
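As an illustration of the string metrics mentioned above, exact match and Levenshtein distance can be computed along these lines. This is a generic sketch; the tasks' actual scoring code may normalize text differently before comparing:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

def score(prediction: str, reference: str) -> dict:
    dist = levenshtein(prediction, reference)
    return {
        "exact_match": prediction == reference,
        "levenshtein": dist,
        # Normalized similarity in [0, 1].
        "similarity": 1 - dist / max(len(prediction), len(reference), 1),
    }
```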
- Config resolution: pass relative names and let `configs/locator.py` find them; e.g., `--config config.yaml` resolves to `configs/<task>/config.yaml` when available.
- Model backends:
  - API (`openai`/`deepseek`/`api`): set `OPENAI_API_KEY` or `DEEPSEEK_API_KEY`; optional `--reasoning-effort` for o-series; `models/api.py` normalizes usage fields.
  - vLLM: enable with `--model_type vllm` or `--use-vllm`; tune `--gpu_memory_utilization`, `--tensor-parallel-size`, `--batch-size`, `--enforce-eager` as supported by each task.
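The config-resolution fallback can be sketched like this. It is a simplified illustration, not the actual `configs/locator.py` code, and the real helper may search additional locations:

```python
from pathlib import Path

def resolve_config(name: str, task: str, root: Path = Path(".")) -> Path:
    """Try the path as given, then fall back to configs/<task>/<name>."""
    candidate = Path(name)
    if candidate.exists():
        return candidate
    fallback = root / "configs" / task / candidate.name
    if fallback.exists():
        return fallback
    raise FileNotFoundError(f"config {name!r} not found for task {task!r}")
```

This is why `--config config.yaml` works from the repo root even though the file actually lives under `configs/<task>/`.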
- Datasets: curated JSONL files in `datasets/` cover every benchmark (see `datasets/README.md` for examples and counts). Most tasks also ship generators under `tasks/<task>/scripts/` or root-level helpers (e.g., `generate_datasets.py`, `generate_contexts.py`).
- Outputs: tasks typically write results/metrics (and optional response logs) under `results/` or `outputs/` with timestamps; many also emit token-usage summaries and human-readable reports.
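Since the datasets are plain JSONL (one JSON object per line), they can be loaded with a few lines of standard-library Python. This is illustrative only; see `datasets/README.md` for each task's actual field names:

```python
import json
from pathlib import Path

def load_jsonl(path: str) -> list[dict]:
    """Read one JSON object per line, skipping blank lines."""
    records = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if line.strip():
            records.append(json.loads(line))
    return records
```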
- `cli.py`: unified dispatcher; the legacy `python cli.py <task>` form also works.
- `tasks/`: benchmark implementations (see the subfolders listed above).
- `configs/`: centralized task configs plus the `locator.py` helper.
- `models/`: shared API and vLLM adapters with token-accounting utilities.
- `datasets/`: ready-made JSONL datasets spanning all tasks.
- `experiments/`: research notebooks/scripts (e.g., linear probes, TTBC, reproduction helpers).
- `test/`: sample datasets and simple runner scripts for smoke checks.
- `run.sh`/`cli_run.sh`: curated command examples for common runs.
- Task-level tests exist where applicable (e.g., `tasks/adversarial_prompt/tests/test_components.py`). Install dev extras with `uv sync --group dev`, then run `python -m pytest` from the repo root.
- For functional checks, run a small-sample benchmark with `--num_samples` trimmed to a handful of cases before launching full sweeps.
- Keep runs inside the repo root so relative paths and centralized configs resolve correctly.
- Use `python cli.py run <task> -- --help` to see task-specific flags (batching, generation-only modes, logging, etc.).
- When comparing models, set seeds and reuse generated datasets to keep samples aligned across runs.
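A minimal pattern for the seeding advice above (a hypothetical snippet — each generator in this repo exposes its own seed handling, where available):

```python
import random

def make_samples(n: int, seed: int = 0) -> list[int]:
    """Deterministic sample IDs: the same seed yields the same subset."""
    rng = random.Random(seed)  # a local RNG avoids global-state surprises
    return rng.sample(range(10_000), n)

# The same seed reproduces the same subset across model runs, so two
# models under comparison see identical cases.
```

Pairing a fixed seed with a reused, pre-generated dataset is the simplest way to keep per-sample comparisons meaningful across runs.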