D&D Voice Agents Demo — The Wild Sheep Chase #87

@nyghtowl

Project link

https://github.com/temporal-community/sheep-audio-dreams

Language

Python

Short description (max 256 chars)

Two AI voice agents play through a D&D one-shot adventure autonomously, no human required. Lyra and Zara generate dialogue, roll dice, and advance the story — all spoken out loud to each other using native speech models. The entire session runs as a durable Temporal workflow, showing how to orchestrate multi-model voice pipelines with automatic retries, crash recovery, and behavioral observability.

Long Description

D&D Voice Agents Demo — The Wild Sheep Chase

Two AI voice agents play through a D&D one-shot adventure autonomously using native speech models, orchestrated end-to-end by Temporal. No human involvement required.

The scenario: a wizard got polymorphed into a sheep by her evil apprentice and just crashed through the tavern door. Hit Next Turn and watch Lyra (half-elf ranger) and Zara (tiefling sorceress) figure it out, generating dialogue, rolling dice, and advancing the story, all spoken out loud to each other.

Two demos are included

The REST demo uses a turn-by-turn request/response model. Each character's turn completes fully before the other responds, and a Dungeon Master (Claude Haiku or GPT-4o-mini) narrates the d20 roll outcome after each turn. The Temporal execution graph is easy to read and every step is visible, which makes it a good starting point before adding streaming complexity.

The streaming demo uses WebSocket connections to native speech models. Characters begin speaking in under 1 second. Audio is delivered out-of-band through an asyncio.Queue while Temporal tracks state and handles retries. This shows how to integrate low-latency voice with durable execution without bloating Temporal's event log with audio bytes.
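The out-of-band pattern described above can be sketched with the standard library alone. This is a minimal illustration, not the project's code: `TurnResult`, `stream_turn`, `play_audio`, and the placeholder byte chunks are all hypothetical names and data, and no Temporal or WebSocket calls are made. The point is only that audio bytes travel through the queue while the function's return value, the part that would be recorded in workflow history, stays small.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class TurnResult:
    """Small summary of a turn; the only data that would reach workflow history."""
    speaker: str
    chunk_count: int
    total_bytes: int


async def stream_turn(speaker: str, chunks: list[bytes], audio_out: asyncio.Queue) -> TurnResult:
    """Hypothetical stand-in for a streaming activity: forward audio, return metadata."""
    total = 0
    for chunk in chunks:  # in a real pipeline these would arrive over a WebSocket
        await audio_out.put(chunk)  # out-of-band: goes to the player, not to Temporal
        total += len(chunk)
    await audio_out.put(None)  # sentinel marking the end of this turn's audio
    return TurnResult(speaker=speaker, chunk_count=len(chunks), total_bytes=total)


async def play_audio(audio_in: asyncio.Queue) -> bytes:
    """Hypothetical stand-in for the audio player: drain the queue until the sentinel."""
    played = bytearray()
    while (chunk := await audio_in.get()) is not None:
        played.extend(chunk)
    return bytes(played)


async def demo() -> tuple[TurnResult, bytes]:
    queue: asyncio.Queue = asyncio.Queue()
    fake_chunks = [b"\x00\x01" * 4, b"\x02\x03" * 4]  # placeholder bytes, not real audio
    # Producer and consumer run concurrently, so playback can start before the turn ends.
    result, played = await asyncio.gather(
        stream_turn("Lyra", fake_chunks, queue),
        play_audio(queue),
    )
    return result, played


if __name__ == "__main__":
    result, played = asyncio.run(demo())
    print(result, len(played))
```

Keeping the return value to a few fields is what keeps the event history small in this sketch: the queue carries every byte of audio, and the summary carries only counts.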

How Temporal makes it durable

Every turn runs as a durable activity inside a single workflow per session. Crash the app mid-turn, restart it, and the workflow resumes exactly where it left off: same turn, same state. If an API call hits a 429 rate limit, Temporal retries with exponential backoff and the workflow never fails. The UI makes this visible: grouped bars on any activity show exactly where a retry happened and why.
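To make the exponential backoff concrete, here is a small sketch of how retry delays grow. The function name `backoff_schedule` is hypothetical, and the default values (1 second initial interval, coefficient 2.0, 100 second cap) are assumptions meant to mirror the defaults Temporal documents for its retry policy; the settings this project actually uses are not stated here.

```python
def backoff_schedule(
    attempts: int,
    initial_interval: float = 1.0,      # assumed: seconds before the first retry
    backoff_coefficient: float = 2.0,   # assumed: multiplier applied after each retry
    maximum_interval: float = 100.0,    # assumed: cap on any single wait
) -> list[float]:
    """Return the wait (in seconds) before each retry: initial * coefficient**n, capped."""
    return [
        min(initial_interval * backoff_coefficient ** n, maximum_interval)
        for n in range(attempts)
    ]


if __name__ == "__main__":
    print(backoff_schedule(5))  # [1.0, 2.0, 4.0, 8.0, 16.0]
```

With these assumed values, a burst of 429 responses is retried after 1, 2, 4, 8, and 16 seconds, and the wait stops growing once it reaches the cap.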

One key architectural decision: the activity is the unit of retry, not the individual API call. Zara makes two API calls per turn (Gemini for text, OpenAI TTS for voice), but both run inside a single activity. This keeps the workflow simple at a known tradeoff: a failed TTS call reruns text generation too. Splitting into two activities would give finer-grained retry at the cost of more workflow complexity.
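The tradeoff can be seen in a toy simulation. Everything below is a hypothetical stand-in: `generate_text`, `synthesize_speech`, `zara_turn`, and `run_activity` are illustrative names, the failure is simulated, and the retry loop is a plain Python loop (no backoff) rather than Temporal's retry machinery. It shows only that when both calls share one retried unit, a TTS failure causes the text call to run again.

```python
from collections import Counter

calls = Counter()  # counts how often each simulated API is hit


class TTSError(Exception):
    """Hypothetical stand-in for a failed text-to-speech request."""


def generate_text(prompt: str) -> str:
    """Simulated text-generation call."""
    calls["text"] += 1
    return f"Zara says: {prompt}"


def synthesize_speech(text: str) -> bytes:
    """Simulated TTS call that fails on its first invocation."""
    calls["tts"] += 1
    if calls["tts"] == 1:
        raise TTSError("simulated TTS failure")
    return text.encode("utf-8")


def zara_turn(prompt: str) -> bytes:
    """One 'activity': both calls succeed together or the whole turn is retried."""
    return synthesize_speech(generate_text(prompt))


def run_activity(activity, *args, max_attempts: int = 3):
    """Toy retry loop standing in for activity-level retries."""
    for attempt in range(1, max_attempts + 1):
        try:
            return activity(*args)
        except TTSError:
            if attempt == max_attempts:
                raise


audio = run_activity(zara_turn, "I cast detect magic on the sheep.")
print(dict(calls))  # {'text': 2, 'tts': 2}
```

Text generation ran twice even though it succeeded the first time. Wrapping `generate_text` and `synthesize_speech` in separate retried units would bring the text count back to one, at the cost of passing the intermediate text between them.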

Voice stack

| Character | REST | Streaming |
| --- | --- | --- |
| Lyra | gpt-4o-audio-preview | gpt-4o-realtime-preview |
| Zara | Gemini 2.5 Flash text + OpenAI TTS | gemini-2.5-flash-native-audio |

Both characters pass the previous character's actual audio as input so the models hear tone, pacing, and emotional cues, not just a text transcript.

Author(s)

Melanie Warrick @nyghtowl & Built with Claude (Anthropic) — claude.ai
