Describe the bug
SpeechRecognizer with MODE_ADVANCED rejects valid PCM audio streamed via AudioSource.fromPfd() with ERROR_TYPE_INVALID_REQUEST when the audio begins with ~800ms of silence. The same code path succeeds for audio where speech starts immediately. The engine appears to apply an early silence-based rejection — it errors out in ~595ms, before the speech content at ~1000ms is even reached.
There is no documentation on audio content requirements (e.g., leading silence tolerance) or supported formats for AudioSource.fromPfd(). The official sample app only demonstrates live microphone input, not pre-recorded audio file input.
To Reproduce
- Decode two MP4 videos to raw PCM (16kHz, mono, 16-bit) using MediaExtractor + MediaCodec
- Stream the raw PCM bytes (no WAV header) through ParcelFileDescriptor.createPipe(), matching the official sample's pattern
- Pass AudioSource.fromPfd(pipe[0]) to SpeechRecognizer.startRecognition()
- Video A (speech from 0ms, source 48kHz): transcribes successfully
- Video B (speech from ~1000ms with ~800ms leading silence, source 44.1kHz): fails with ERROR_TYPE_INVALID_REQUEST in 595ms
val pipe = ParcelFileDescriptor.createPipe()

// Background thread: write raw PCM to pipe (no WAV header)
thread {
    FileOutputStream(pipe[1].fileDescriptor).use { fos ->
        fos.write(rawPcmBytes) // 16kHz, mono, 16-bit PCM
    }
    pipe[1].close()
}

// Collect recognition results
recognizer.startRecognition(
    speechRecognizerRequest { audioSource = AudioSource.fromPfd(pipe[0]) }
).collect { response ->
    when (response) {
        is SpeechRecognizerResponse.ErrorResponse -> {
            // Video B hits this immediately:
            // "Speech recognition engine is closed due to internal error: ERROR_TYPE_INVALID_REQUEST"
        }
        // ...
    }
}
Audio energy analysis (RMS per 200ms window) for the failing video:
0ms: RMS=1241 (brief decode artifact)
200ms: RMS=0 (silence)
400ms: RMS=5 (silence)
600ms: RMS=3 (silence)
800ms: RMS=367 (transition)
1000ms: RMS=2897 (speech starts here — but API already rejected at 595ms)
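For anyone who wants to run the same analysis on their own audio, per-window RMS values like the ones above could be computed roughly as follows. This is a minimal sketch, not the exact code used for the numbers in this report; it assumes 16-bit little-endian mono PCM at 16kHz, and `windowRms` is a hypothetical helper name, not part of ML Kit.

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder
import kotlin.math.sqrt

// Hypothetical helper: RMS of each fixed-size window of 16-bit little-endian mono PCM.
// The last window may be shorter than windowMs if the audio does not divide evenly.
fun windowRms(pcm: ByteArray, sampleRate: Int = 16_000, windowMs: Int = 200): List<Double> {
    val samplesPerWindow = sampleRate * windowMs / 1000
    val samples = ByteBuffer.wrap(pcm).order(ByteOrder.LITTLE_ENDIAN).asShortBuffer()
    val result = mutableListOf<Double>()
    var start = 0
    while (start < samples.limit()) {
        val end = minOf(start + samplesPerWindow, samples.limit())
        var sumSquares = 0.0
        for (i in start until end) {
            val s = samples.get(i).toDouble()
            sumSquares += s * s
        }
        result += sqrt(sumSquares / (end - start))
        start = end
    }
    return result
}
```

Calling `windowRms(rawPcmBytes)` returns one value per 200ms window, in order from the start of the audio.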
Workaround: Trimming leading silence with a custom energy-based VAD before piping to ML Kit resolves the issue — Video B transcribes successfully after trimming.
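A minimal sketch of the kind of energy-based trimming described in the workaround, not the exact code used. It assumes 16-bit little-endian mono PCM at 16kHz. The name `trimLeadingSilence`, the 20ms frame size, the RMS threshold of 300, and the five-frame minimum are all illustrative choices. Requiring several consecutive loud frames is intended to skip a short click such as the decode artifact at 0ms; whether it does depends on how long that artifact actually lasts.

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder
import kotlin.math.sqrt

// Hypothetical helper: drop audio before the first sustained run of "loud" frames.
// Assumes 16-bit little-endian mono PCM; threshold and frame sizes are illustrative.
fun trimLeadingSilence(
    pcm: ByteArray,
    sampleRate: Int = 16_000,
    frameMs: Int = 20,
    rmsThreshold: Double = 300.0,
    minLoudFrames: Int = 5, // require ~100 ms of energy so a short click is ignored
): ByteArray {
    val samples = ByteBuffer.wrap(pcm).order(ByteOrder.LITTLE_ENDIAN).asShortBuffer()
    val frameLen = sampleRate * frameMs / 1000
    val frameCount = samples.limit() / frameLen
    var loudRun = 0
    for (frame in 0 until frameCount) {
        var sumSquares = 0.0
        for (i in frame * frameLen until (frame + 1) * frameLen) {
            val s = samples.get(i).toDouble()
            sumSquares += s * s
        }
        val rms = sqrt(sumSquares / frameLen)
        loudRun = if (rms >= rmsThreshold) loudRun + 1 else 0
        if (loudRun == minLoudFrames) {
            val firstLoudFrame = frame - minLoudFrames + 1
            val byteOffset = firstLoudFrame * frameLen * 2 // 2 bytes per sample
            return pcm.copyOfRange(byteOffset, pcm.size)
        }
    }
    return pcm // no sustained speech found; leave the audio untouched
}
```

In the repro above, this would mean writing `trimLeadingSilence(rawPcmBytes)` to the pipe instead of `rawPcmBytes`.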
Expected behavior
The API should either:
- Wait for speech to appear in the audio stream before making an accept/reject decision, or
- Document the leading silence tolerance and audio content requirements for fromPfd()
Additionally, AudioSource.fromPfd() documentation should specify supported audio formats (raw PCM vs WAV, expected sample rate, channels, encoding).
SDK Info:
- SDK Name & Version: com.google.mlkit:genai-speech-recognition:1.0.0-alpha1
- Mode: SpeechRecognizerOptions.Mode.MODE_ADVANCED
Smartphone:
- Device: Pixel 10
- Device OS: Android 15
- AICore: Up to date, bootloader locked
Development Environment:
- Android Studio Ladybug
- Kotlin 2.1.0