
# 🎙️ Talk2Scene

Audio-driven intelligent animation generation — from dialogue to visual storytelling.

Python 3.11+ · Apache-2.0 · uv · Hydra · GPT-4o


Talk2Scene is an audio-driven intelligent animation tool that automatically parses voice dialogue files, recognizes text content and timestamps, and uses AI to recommend matching character stances (STA), expressions (EXP), actions (ACT), and backgrounds (BG), inserting CG illustrations at the right moments. It produces structured scene event data and composes preview videos that show AI characters performing dynamically across scenes.

Designed for content creators, educators, virtual streamers, and AI enthusiasts — Talk2Scene turns audio into engaging visual narratives for interview videos, AI interactive demos, educational presentations, and more.

## 💡 Why Talk2Scene

Manually composing visual scenes for dialogue-driven content is tedious and error-prone. Talk2Scene automates the entire workflow: feed in audio or a transcript, and the pipeline produces time-synced scene events — ready for browser playback or video export — without touching a single frame by hand.
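
For a concrete picture, a single scene event in the output JSONL might look like the line below. The field names are an illustrative guess rather than the tool's documented schema; the asset codes come from the sample table later in this README:

```json
{"start": 12.4, "end": 15.1, "text": "Welcome back to the lab!", "STA": "STA_Stand_Front", "EXP": "EXP_Smile_EyesClosed", "ACT": "ACT_WaveGreeting", "BG": "BG_Lab_Modern", "CG": null}
```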

## 🏗️ Architecture

```mermaid
flowchart LR
    A[Audio] --> B[Transcription\nWhisper / OpenAI API]
    T[Text JSONL] --> C
    B --> C[Scene Generation\nLLM]
    C --> D[JSONL Events]
    D --> E[Browser Viewer]
    D --> F[Static PNG Render]
    D --> G[Video Export\nffmpeg]
```

Scenes are composed from five layer types. Four of them stack bottom-up:

```mermaid
flowchart LR
    BG --> STA --> ACT --> EXP
```

The fifth type, a CG illustration, replaces the entire layered scene while active.

## 🖼️ Example Output

### Example Video

Example output video

### Rendered Scenes

Left: Basic scene (Lab + Stand Front + Neutral) · Center: Cafe scene (Cafe + Stand Front + Thinking) · Right: CG mode (Pandora's Tech)

### Asset Layers

Each scene is composed by stacking transparent asset layers on a background. Below is one sample from each category:

| Layer | Sample Code | Description |
| --- | --- | --- |
| 🌅 BG | `BG_Lab_Modern` | Background (opaque) |
| 🧍 STA | `STA_Stand_Front` | Stance / pose (transparent) |
| 🎭 EXP | `EXP_Smile_EyesClosed` | Expression overlay (transparent) |
| 🤚 ACT | `ACT_WaveGreeting` | Action overlay (transparent) |
| CG | `CG_PandorasTech` | Full-scene illustration (replaces all layers) |

## 📦 Install

> [!IMPORTANT]
> Requires Python 3.11+, uv, and FFmpeg.
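
If uv or FFmpeg are missing, they can be installed with the usual commands for your platform, for example:

```bash
# Install uv via its official installer script
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install FFmpeg (choose the line matching your OS)
sudo apt install ffmpeg   # Debian/Ubuntu
brew install ffmpeg       # macOS
```

Then sync the project environment: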

```bash
uv sync
```

Set your OpenAI API key:

```bash
export OPENAI_API_KEY="your-key"
```

## 🚀 Usage

```bash
uv run talk2scene --help
```

### 📝 Text Mode

Generate scenes from a pre-transcribed JSONL file:

```bash
uv run talk2scene mode=text io.input.text_file=path/to/transcript.jsonl
```
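
A transcript file holds one JSON object per line. The exact field names are not documented here, so treat the sketch below as an assumed shape (start/end in seconds plus the spoken text), not the authoritative schema:

```json
{"start": 0.0, "end": 2.4, "text": "Hello, and welcome to the show."}
{"start": 2.4, "end": 5.8, "text": "Today we are visiting the lab."}
```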

### 🎧 Batch Mode

Process an audio file end-to-end (place audio in input/):

```bash
uv run talk2scene mode=batch
```

### 🎬 Video Mode

Render a completed session into video:

```bash
uv run talk2scene mode=video session_id=SESSION_ID
```

### 📡 Stream Mode

Consume audio or pre-transcribed text from Redis in real time:

```bash
uv run talk2scene mode=stream
```
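
As a rough sketch of the producer side, pre-transcribed text could be pushed into Redis with redis-cli. The channel name and payload shape here are hypothetical; check the project documentation for the actual contract:

```bash
# Hypothetical channel and payload -- for illustration only
redis-cli PUBLISH talk2scene:transcript '{"start": 0.0, "end": 2.4, "text": "Hello!"}'
```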

## 📚 Documentation

Full documentation (English & Chinese) is available at discover304.top/talk2scene.

## 📬 Contact

## 📄 License

Licensed under the Apache License 2.0.
