Project link
https://github.com/temporal-community/sheep-audio-dreams
Language
Python
Short description (max 256 chars)
Two AI voice agents play a D&D one-shot autonomously, no human required. Lyra and Zara generate dialogue, roll dice, and speak aloud to each other using native speech models, all run as a durable Temporal workflow with retries and crash recovery.
Long Description
D&D Voice Agents Demo — The Wild Sheep Chase
Two AI voice agents play through a D&D one-shot adventure autonomously using native speech models, orchestrated end-to-end by Temporal. No human involvement required.
The scenario: a wizard got polymorphed into a sheep by her evil apprentice and just crashed through the tavern door. Hit Next Turn and watch Lyra (half-elf ranger) and Zara (tiefling sorceress) figure it out, generating dialogue, rolling dice, and advancing the story, all spoken out loud to each other.
Two demos are included
The REST demo uses a turn-by-turn request/response model. Each character fully completes before the other responds, with a Dungeon Master (Claude Haiku or GPT-4o-mini) narrating the d20 roll outcome after each turn. The Temporal execution graph is easy to read and every step is visible, making it a good starting point before adding streaming complexity.
The streaming demo uses WebSocket connections to native speech models. Characters begin speaking in under 1 second. Audio delivery is out-of-band via an asyncio.Queue while Temporal tracks state and handles retries, showing how to integrate low-latency voice with durable execution without bloating Temporal's event log with audio bytes.
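The out-of-band pattern can be sketched with plain asyncio. In this hypothetical sketch (the coroutine names and fake PCM frames are illustrative, not the project's actual code), audio chunks flow through a queue to a playback sink, while only the lightweight transcript is the kind of value that would be returned to Temporal and recorded in workflow history:

```python
import asyncio

async def stream_turn(audio_queue: asyncio.Queue) -> str:
    """Simulate a streaming speech model: emit audio chunks out-of-band,
    return only a small transcript suitable for workflow history."""
    for chunk in (b"\x00" * 4, b"\x01" * 4, b"\x02" * 4):  # fake PCM frames
        await audio_queue.put(chunk)   # raw audio bypasses Temporal entirely
    await audio_queue.put(None)        # sentinel: the turn is finished
    return "Lyra: The sheep's eyes... they look almost human."

async def play_audio(audio_queue: asyncio.Queue) -> int:
    """Drain chunks as they arrive (stand-in for a speaker or browser sink)."""
    total = 0
    while (chunk := await audio_queue.get()) is not None:
        total += len(chunk)            # would be written to the audio device
    return total

async def main():
    q: asyncio.Queue = asyncio.Queue()
    # Producer and consumer run concurrently, so playback starts immediately.
    return await asyncio.gather(stream_turn(q), play_audio(q))

transcript, played_bytes = asyncio.run(main())
```

The key property: the queue carries the heavy bytes, so the durable side only ever sees small, replayable values.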
How Temporal makes it durable
Every turn runs as a durable activity inside a single workflow per session. Crash the app mid-turn, restart it, and the workflow resumes exactly where it left off: same turn, same state. If an API call hits a 429 rate limit, Temporal retries it with exponential backoff and the workflow never fails. The UI makes this visible: grouped bars on any activity show exactly where a retry happened and why.
One key architectural decision: the activity is the unit of retry, not the individual API call. Zara makes two API calls per turn (Gemini for text, OpenAI TTS for voice) but both run inside a single activity. This keeps the workflow simple, with a known tradeoff: a failed TTS call reruns text generation too. Splitting into two activities would give finer-grained retry at the cost of more workflow complexity.
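A minimal sketch of the one-activity-per-turn pattern, with both model calls stubbed out (the function names and failure injection are invented for illustration). Because both calls live in one function, a retry reruns the whole turn, including the text call that already succeeded:

```python
calls = {"text": 0, "tts": 0}

def generate_text(prompt: str) -> str:
    """Stand-in for the Gemini text call."""
    calls["text"] += 1
    return f"Zara: {prompt}... I sense wild magic on this sheep."

def synthesize_speech(text: str) -> bytes:
    """Stand-in for the OpenAI TTS call; fails on the first attempt."""
    calls["tts"] += 1
    if calls["tts"] == 1:
        raise RuntimeError("TTS transient failure")
    return text.encode()  # pretend these are audio bytes

def zara_turn_activity(prompt: str) -> bytes:
    """The activity is the retry unit: a retry reruns this whole function,
    repeating the text call even when only TTS failed."""
    return synthesize_speech(generate_text(prompt))

try:
    zara_turn_activity("The tavern door splinters")
except RuntimeError:
    audio = zara_turn_activity("The tavern door splinters")  # simulated retry
```

After the simulated retry, both counters read 2: the text call ran again even though only TTS failed, which is exactly the tradeoff described above.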
Voice stack
| Character | REST | Streaming |
| --- | --- | --- |
| Lyra | gpt-4o-audio-preview | gpt-4o-realtime-preview |
| Zara | Gemini 2.5 Flash text + OpenAI TTS | gemini-2.5-flash-native-audio |
Both characters pass the previous character's actual audio as input so the models hear tone, pacing, and emotional cues, not just a text transcript.
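For audio-in chat models, this means base64-encoding the previous turn's audio into the request rather than sending a transcript. One plausible request shape, following the Chat Completions `input_audio` content format (field names may differ across providers and API versions, and the prompt text here is invented), is:

```python
import base64

def audio_turn_message(prev_audio_wav: bytes) -> dict:
    """Build a user message that passes the previous character's actual
    audio, not a transcript, to a speech-capable chat model, so the model
    can pick up tone, pacing, and emotional cues."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": "Respond in character to what you just heard."},
            {
                "type": "input_audio",
                "input_audio": {
                    "data": base64.b64encode(prev_audio_wav).decode("ascii"),
                    "format": "wav",
                },
            },
        ],
    }

msg = audio_turn_message(b"RIFF fake-wav-bytes")
```

The message dict would then be appended to the conversation history sent on the next `chat.completions.create` call; no network request is made in this sketch.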
Author(s)
Melanie Warrick @nyghtowl & Built with Claude (Anthropic) — claude.ai