Describe the bug
When using Flash 2.5 on Vertex AI for audio transcription via the `google-genai` package with batching enabled, the model repeatedly outputs the literal token `[unclear]`. This repetition consumes the entire `max_output_tokens` budget before transcription completes, causing the response to be truncated and resulting in invalid or incomplete JSON.
This behavior appears to be a recent regression. The same transcription pipeline was significantly more reliable approximately 1–1.5 months ago, with far fewer `[unclear]` repetitions and consistently complete JSON responses.
Environment
- Platform: Vertex AI
- Model: Flash 2.5
- Library: google-genai
- Task: Audio transcription with batching
- Response MIME type: `application/json`
- Response schema: Enabled
- Thinking mode: Disabled
Steps to reproduce
- Send an audio file via `file_uri` with a transcription prompt
- Enable structured JSON output using `response_schema`
- Set `max_output_tokens` appropriate for the expected transcription length
- Invoke Flash 2.5 on Vertex AI with batching
Expected behavior
- The model should avoid excessive repetition of `[unclear]`
- The model should complete transcription within the token budget
- The model should consistently return a valid JSON response conforming to the schema
Actual behavior
- The model repeatedly emits `[unclear]` segments
- Output tokens are exhausted before transcription completes
- JSON output is truncated or malformed
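For completeness, this is the kind of check we run on each batch response to confirm the failure mode (the sample string below is illustrative; only the counting and parsing logic matters):

```python
import json
import re

def diagnose(response_text: str, token_budget_hit: bool) -> dict:
    """Count [unclear] repetitions and check whether the JSON survived."""
    unclear_count = len(re.findall(r"\[unclear\]", response_text))
    try:
        json.loads(response_text)
        valid_json = True
    except json.JSONDecodeError:
        valid_json = False
    return {
        "unclear_count": unclear_count,
        "valid_json": valid_json,
        # Truncation = budget exhausted AND the JSON never closed.
        "truncated": token_budget_hit and not valid_json,
    }

# A truncated response that degenerated into [unclear] repetitions:
sample = '{"segments": [{"text": "[unclear] [unclear] [unclear]'
print(diagnose(sample, token_budget_hit=True))
```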
Code snippet
```python
parts = [
    {
        "file_data": {
            "file_uri": uri,
            "mime_type": self._get_mime_type(file_path),
        }
    },
    {"text": final_prompt},
]

generation_config = {
    "response_mime_type": "application/json",
    "temperature": transcription_config.TEMPERATURE,
    "max_output_tokens": transcription_config.get_max_output_tokens(model),
}

schema_class = get_transcription_result_class(model, phase)
if schema_class:
    if isinstance(schema_class, dict):
        generation_config["response_schema"] = schema_class
    else:
        # Pydantic model: dump to JSON schema, inline $refs, and drop $defs,
        # since the Vertex AI response_schema field does not accept them.
        schema_dict = schema_class.model_json_schema()
        schema_dict = self._resolve_json_schema_refs(schema_dict)
        schema_dict.pop("$defs", None)
        generation_config["response_schema"] = schema_dict

generation_config["thinking_config"] = {
    "thinking_budget": transcription_config.THINKING_BUDGET
}

instance = {
    "id": str(i - 1),
    "request": {
        "contents": [{"role": "user", "parts": parts}],
        "generation_config": generation_config,
    },
}
```
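Since `_resolve_json_schema_refs` isn't shown above, here is a minimal standalone sketch of what it does in our pipeline (the real helper is a method on the transcription class and may handle more cases; this version assumes local, non-circular `#/$defs/...` references only):

```python
def resolve_json_schema_refs(schema: dict) -> dict:
    """Recursively replace local {"$ref": "#/$defs/X"} entries with the
    referenced definition, leaving no $defs/$ref in the result.
    Assumes references are local and non-circular."""
    defs = schema.get("$defs", {})

    def inline(node):
        if isinstance(node, dict):
            ref = node.get("$ref")
            if isinstance(ref, str) and ref.startswith("#/$defs/"):
                return inline(defs[ref.split("/")[-1]])
            return {k: inline(v) for k, v in node.items() if k != "$defs"}
        if isinstance(node, list):
            return [inline(v) for v in node]
        return node

    return inline(schema)
```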
Additional context
- Increasing `max_output_tokens` doesn't reduce the issue.
- The regression has been observed consistently over the last 1–1.5 months.
Questions
- Is this a known regression in Flash 2.5 transcription behavior?
- Are there recommended mitigations to prevent token exhaustion due to repeated `[unclear]` output?
- Is Flash 2.5 currently recommended for transcription workloads on Vertex AI?
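In case it helps others hitting this, here is a post-processing sketch we've been experimenting with: it collapses runs of repeated `[unclear]` tokens and flags the response so the caller can retry. This is only a workaround for downstream handling, not a fix for the token exhaustion itself:

```python
import re

def collapse_unclear(text: str, max_run: int = 1) -> tuple[str, bool]:
    """Collapse runs of consecutive [unclear] tokens down to at most
    `max_run` occurrences; also report whether anything was collapsed
    so the caller can decide to retry the request."""
    pattern = re.compile(r"(?:\[unclear\]\s*){%d,}" % (max_run + 1))
    collapsed = pattern.sub("[unclear] " * max_run, text)
    return collapsed.strip(), collapsed != text
```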