feat: Add subtitle generation with speaker and timing information #1124
Closed
ypwcharles wants to merge 5 commits into SWivid:main from
Conversation
This commit introduces the capability to generate a JSON-formatted subtitle file synchronously with the audio synthesis.
- Added the `--output_subtitle_file` argument to specify the output subtitle file name.
- The feature is also supported through configuration files by adding the corresponding key.
- Updated the documentation and examples to cover the new feature.
Merge updates from the main branch
The previous text-splitting logic, which relied on a lookahead assertion, was unreliable: if the input did not start with a voice tag, the split failed and the entire text was synthesized with the default voice. This commit replaces that mechanism with a more robust split that uses a capturing group, correctly tokenizing the text on voice markers so each segment is associated with the intended voice.
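The fix described above can be sketched as follows. This is an illustrative reconstruction, not the PR's actual code: the tag format `[voice_name]` and the helper name `split_by_voice` are assumptions, but the key idea is real Python behavior — `re.split` with a capturing group keeps the captured markers in the result list, and text before the first tag still lands at index 0, so it can fall back to the default voice instead of swallowing the whole input.

```python
import re

def split_by_voice(text: str, default_voice: str = "main"):
    """Split text into (voice, segment) pairs using a capturing group,
    so the voice markers themselves survive in the token stream."""
    # Assumed tag format: [voice_name] — adapt the pattern to the real markers.
    tokens = re.split(r"\[(\w+)\]", text)
    segments = []
    # tokens[0] is any text before the first tag; keep it on the default voice.
    if tokens[0].strip():
        segments.append((default_voice, tokens[0].strip()))
    # After index 0, tokens alternate: captured voice name, then its text.
    for i in range(1, len(tokens), 2):
        voice, chunk = tokens[i], tokens[i + 1].strip()
        if chunk:
            segments.append((voice, chunk))
    return segments
```

Because the capturing group emits the marker itself, a leading untagged span no longer breaks the pairing — the failure mode the commit message describes.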
This commit improves the subtitle generation feature by:
1. Adding a "speaker" field to each subtitle entry, indicating which voice (e.g., "main", "town") synthesized the text.
2. Cleaning the subtitle text by removing leading/trailing quotation marks from the original text, providing a cleaner output.
Summary
This pull request introduces a highly valuable new feature to the inference CLI: the ability to generate a structured subtitle file (in JSON format) concurrently with the synthesized audio.
This provides users with precise timing, text, and speaker information for each audio segment, which is essential for applications like video production, transcription alignment, and content creation.
New Feature Details
A new command-line argument, --output_subtitle_file (short alias -j), has been added. When a filename is provided (e.g., subtitle.json), the script generates a JSON file containing an array of subtitle entries.
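Wiring such a flag is straightforward with `argparse`. The flag names below come from the PR itself; the rest of the parser surface is a minimal sketch, not the project's actual CLI definition:

```python
import argparse

parser = argparse.ArgumentParser(description="inference CLI (sketch)")
# Flag names from the PR; default None means "do not write subtitles".
parser.add_argument(
    "-j", "--output_subtitle_file", default=None,
    help="write a JSON subtitle file alongside the synthesized audio",
)
args = parser.parse_args(["-j", "subtitle.json"])
```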
Each entry in the JSON file is an object with the following fields:
"text": The cleaned-up text content of the audio segment.
"speaker": The speaker tag associated with the segment (e.g., "speaker1").
"time_begin": The start time of the segment in milliseconds.
"time_end": The end time of the segment in milliseconds.
"text_begin": The starting character offset of the segment's text within the full input text.
"text_end": The ending character offset of the segment's text.
How It Works
Calculate Duration: The script precisely calculates the duration of each synthesized audio chunk.
Maintain Timeline: It maintains a cumulative timeline to ensure the "time_begin" and "time_end" values for each consecutive segment are accurate.
Track Text Offsets: It also tracks character offsets to map the generated audio segments directly and accurately back to the source text.
Write to File: The final list of subtitle data is written to the specified JSON file using UTF-8 encoding to ensure proper handling of all character types.
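The four steps above can be sketched as a single function. This is an assumed reconstruction, not the PR's code: the segment tuple layout and function name are hypothetical, but the mechanics match the description — per-chunk duration from sample counts, a cumulative millisecond timeline, pass-through character offsets, and UTF-8 JSON output:

```python
import json

def write_subtitles(segments, sample_rate, path):
    """segments: iterable of (text, speaker, num_samples, text_begin, text_end).
    Writes a JSON array of subtitle entries to `path`."""
    entries, t_ms = [], 0
    for text, speaker, num_samples, c0, c1 in segments:
        # Step 1: duration of this audio chunk, in milliseconds.
        dur_ms = round(num_samples * 1000 / sample_rate)
        entries.append({
            "text": text, "speaker": speaker,
            # Step 2: cumulative timeline keeps consecutive segments contiguous.
            "time_begin": t_ms, "time_end": t_ms + dur_ms,
            # Step 3: character offsets map audio back to the source text.
            "text_begin": c0, "text_end": c1,
        })
        t_ms += dur_ms
    # Step 4: UTF-8 output so non-ASCII text is preserved verbatim.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(entries, f, ensure_ascii=False, indent=2)
```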
By providing critical metadata that was previously unavailable, this feature significantly enhances the project's overall value and utility.