D&D Voice Agents Demo — The Wild Sheep Chase

### Project link

https://github.com/temporal-community/sheep-audio-dreams

### Language

Python

### Short description (max 256 chars)

Two AI voice agents play through a D&D one-shot adventure autonomously, no human required. Lyra and Zara generate dialogue, roll dice, and advance the story — all spoken out loud to each other using native speech models. The entire session runs as a durable Temporal workflow, showing how to orchestrate multi-model voice pipelines with automatic retries, crash recovery, and behavioral observability.


### Long Description

<html><head></head><body><h2>D&amp;D Voice Agents Demo — The Wild Sheep Chase</h2>
<p>Two AI voice agents play through a D&amp;D one-shot adventure autonomously using native speech models, orchestrated end-to-end by Temporal. No human involvement required.</p>
<p>The scenario: a wizard got polymorphed into a sheep by her evil apprentice and just crashed through the tavern door. Hit <strong>Next Turn</strong> and watch Lyra (half-elf ranger) and Zara (tiefling sorceress) figure it out, generating dialogue, rolling dice, and advancing the story, all spoken out loud to each other.</p>
<img width="2816" height="1536" alt="Image" src="https://github.com/user-attachments/assets/c597af1b-d4b4-46eb-85fc-936a3f69ff9d" />

<h2>Two demos are included</h2>
<p>The <strong>REST demo</strong> uses a turn-by-turn request/response model. Each character finishes its turn before the other responds, and a Dungeon Master (Claude Haiku or GPT-4o-mini) narrates the d20 roll outcome after each turn. The Temporal execution graph is easy to read and every step is visible, which makes this a good starting point before adding streaming complexity.</p>
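<p>The turn-by-turn flow described above can be sketched roughly as follows. This is a minimal illustration and not the repository's code: <code>speak</code>, <code>narrate</code>, <code>play</code>, and <code>SUCCESS_THRESHOLD</code> are hypothetical names, and the model calls are replaced by placeholder functions.</p>

```python
import random

CHARACTERS = ["Lyra", "Zara"]
SUCCESS_THRESHOLD = 10  # hypothetical difficulty value, not taken from the demo


def speak(character: str, heard: str) -> str:
    """Placeholder for the speech-model request; returns the character's line."""
    return f"{character} responds to: {heard!r}"


def narrate(character: str, roll: int) -> str:
    """Placeholder for the Dungeon Master model call."""
    outcome = "succeeds" if roll >= SUCCESS_THRESHOLD else "fails"
    return f"{character} rolls {roll} and {outcome}."


def play(turns: int, rng: random.Random) -> list:
    """Alternate characters; each turn fully completes before the next starts."""
    log = []
    heard = "A sheep crashes through the tavern door."
    for turn in range(turns):
        character = CHARACTERS[turn % len(CHARACTERS)]
        line = speak(character, heard)        # the character's complete response
        roll = rng.randint(1, 20)             # d20 roll for this turn
        narration = narrate(character, roll)  # DM narrates the outcome
        log.append(
            {"character": character, "line": line, "roll": roll, "narration": narration}
        )
        heard = line  # the next character responds to this line
    return log


if __name__ == "__main__":
    for entry in play(4, random.Random()):
        print(entry["line"])
        print("  DM:", entry["narration"])
```
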
<p>The <strong>streaming demo</strong> uses WebSocket connections to native speech models. Characters begin speaking in under 1 second. Audio is delivered out-of-band via an <code>asyncio.Queue</code> while Temporal tracks state and handles retries. This shows how to integrate low-latency voice with durable execution without bloating Temporal's event log with audio bytes.</p>
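<p>One way to picture the out-of-band pattern is the sketch below. It is an assumption-laden illustration using only the standard library: <code>fake_speech_stream</code> stands in for the WebSocket speech model, and <code>stream_turn</code>, <code>play_audio</code>, and <code>TurnResult</code> are hypothetical names. The point is that raw audio travels through the queue, while the function that would be the activity returns only a small summary.</p>

```python
import asyncio
from dataclasses import dataclass


@dataclass
class TurnResult:
    """Small summary: the only data that would reach the workflow history."""
    speaker: str
    chunk_count: int
    total_bytes: int


async def fake_speech_stream(n_chunks: int = 5, chunk_size: int = 320):
    """Stand-in for a WebSocket speech model that yields audio chunks."""
    for i in range(n_chunks):
        await asyncio.sleep(0)  # yield control, as a network read would
        yield bytes([i % 256]) * chunk_size


async def stream_turn(speaker: str, queue: asyncio.Queue) -> TurnResult:
    """What the activity body might do: push audio out-of-band, return metadata."""
    count = total = 0
    async for chunk in fake_speech_stream():
        await queue.put(chunk)  # audio bytes go to the queue, not the result
        count += 1
        total += len(chunk)
    await queue.put(None)  # sentinel: end of this turn's audio
    return TurnResult(speaker, count, total)


async def play_audio(queue: asyncio.Queue) -> int:
    """Consumer side (for example a UI bridge); here it only counts bytes."""
    received = 0
    while (chunk := await queue.get()) is not None:
        received += len(chunk)
    return received


async def main():
    queue = asyncio.Queue()
    return await asyncio.gather(stream_turn("Lyra", queue), play_audio(queue))


if __name__ == "__main__":
    result, played = asyncio.run(main())
    print(result, played)
```
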
<h2>How Temporal makes it durable</h2>
<p>Every turn runs as a durable activity inside a single workflow per session. Crash the app mid-turn, restart it, and the workflow resumes exactly where it left off: same turn, same state. If an API call hits a 429 rate limit, Temporal retries the activity with exponential backoff instead of failing the workflow. The UI makes this visible: grouped bars on any activity show exactly where a retry happened and why.</p>
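<p>To make the backoff behavior concrete, here is a small standard-library simulation. It is not Temporal's implementation and not the demo's configuration: <code>BackoffPolicy</code>, <code>RateLimited</code>, and <code>run_with_retries</code> are hypothetical names, and the interval values are illustrative. In the real project the retry policy would be set on the activity through the Temporal SDK.</p>

```python
from dataclasses import dataclass


class RateLimited(Exception):
    """Simulated HTTP 429 from a model API."""


@dataclass
class BackoffPolicy:
    initial_interval: float = 1.0      # seconds before the first retry
    backoff_coefficient: float = 2.0   # multiplier applied after each failure
    maximum_interval: float = 100.0    # cap on any single delay
    maximum_attempts: int = 5

    def delay_before_retry(self, attempt: int) -> float:
        """Delay in seconds after failed attempt number `attempt` (1-based)."""
        raw = self.initial_interval * self.backoff_coefficient ** (attempt - 1)
        return min(raw, self.maximum_interval)


def run_with_retries(call, policy: BackoffPolicy, sleep=lambda seconds: None):
    """Call `call` until it succeeds; return (result, delays that were applied)."""
    delays = []
    for attempt in range(1, policy.maximum_attempts + 1):
        try:
            return call(), delays
        except RateLimited:
            if attempt == policy.maximum_attempts:
                raise  # out of attempts: surface the failure
            delay = policy.delay_before_retry(attempt)
            delays.append(delay)
            sleep(delay)  # no-op by default so the sketch runs instantly


def make_flaky_call(failures: int):
    """Return a call that raises RateLimited `failures` times, then succeeds."""
    state = {"remaining": failures}

    def call() -> str:
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise RateLimited("429 Too Many Requests")
        return "ok"

    return call


if __name__ == "__main__":
    print(run_with_retries(make_flaky_call(2), BackoffPolicy()))
```
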
<p>One key architectural decision: the activity is the unit of retry, not the individual API call. Zara makes two API calls per turn (Gemini for text, OpenAI TTS for voice), but both run inside a single activity. This keeps the workflow simple at a known tradeoff: a failed TTS call reruns text generation too. Splitting into two activities would give finer-grained retry at the cost of more workflow complexity.</p>
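<p>The tradeoff can be illustrated by counting calls in a toy simulation. Everything here is a stand-in: <code>generate_text</code> and <code>synthesize</code> are stubs rather than the Gemini or OpenAI clients, and <code>retry</code> is a simplified loop rather than Temporal's retry mechanism. The TTS stub is rigged to fail once.</p>

```python
class TTSError(Exception):
    """Simulated text-to-speech failure."""


def make_stubs():
    """Fresh call counters plus stubbed text and TTS steps (TTS fails once)."""
    calls = {"text": 0, "tts": 0}

    def generate_text() -> str:
        calls["text"] += 1
        return "I cast Detect Magic on the sheep."

    def synthesize(text: str) -> bytes:
        calls["tts"] += 1
        if calls["tts"] == 1:
            raise TTSError("simulated TTS failure")
        return text.encode("utf-8")

    return calls, generate_text, synthesize


def retry(step, attempts: int = 3):
    """Simplified unit of retry: rerun the whole step on failure."""
    for attempt in range(1, attempts + 1):
        try:
            return step()
        except TTSError:
            if attempt == attempts:
                raise


def run_combined() -> dict:
    """Both calls in one retried unit, as in the single-activity design."""
    calls, generate_text, synthesize = make_stubs()
    retry(lambda: synthesize(generate_text()))
    return calls


def run_split() -> dict:
    """Each call is its own retried unit, as in a two-activity design."""
    calls, generate_text, synthesize = make_stubs()
    text = retry(generate_text)
    retry(lambda: synthesize(text))
    return calls


if __name__ == "__main__":
    print("combined:", run_combined())
    print("split:   ", run_split())
```

<p>In the combined version the single TTS failure causes text generation to run twice; in the split version it runs once, but the orchestration now has two steps to schedule and pass data between.</p>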
<h2>Voice stack</h2>

| Character | REST | Streaming |
| --- | --- | --- |
| Lyra | `gpt-4o-audio-preview` | `gpt-4o-realtime-preview` |
| Zara | Gemini 2.5 Flash text + OpenAI TTS | `gemini-2.5-flash-native-audio` |


<p>Both characters pass the previous character's actual audio as input so the models hear tone, pacing, and emotional cues, not just a text transcript.</p></body></html>
<img width="2816" height="1536" alt="Image" src="https://github.com/user-attachments/assets/4f3ae3ec-785a-48b0-a660-7d501499d033" />


### Author(s)

Melanie Warrick @nyghtowl & Built with Claude (Anthropic) — claude.ai
