[Bug report] GenAI Speech Recognition: ERROR_TYPE_INVALID_REQUEST when audio begins with silence (fromPfd)

**Describe the bug**
`SpeechRecognizer` with `MODE_ADVANCED` rejects valid PCM audio streamed via `AudioSource.fromPfd()` with `ERROR_TYPE_INVALID_REQUEST` when the audio begins with ~800ms of silence. The same code path succeeds for audio where speech starts immediately. The engine appears to apply an early silence-based rejection — it errors out in ~595ms, before the speech content at ~1000ms is even reached.
 
There is no documentation on audio content requirements (e.g., leading silence tolerance) or supported formats for `AudioSource.fromPfd()`. The [official sample app](https://github.com/googlesamples/mlkit/blob/master/android/speech/app/src/main/java/com/google/mlkit/genai/speech/demo/SpeechRecognitionActivity.kt) only demonstrates live microphone input, not pre-recorded audio file input.

**To Reproduce**
 
1. Decode two MP4 videos to raw PCM (16kHz, mono, 16-bit) using `MediaExtractor` + `MediaCodec`
2. Stream the raw PCM bytes (no WAV header) through `ParcelFileDescriptor.createPipe()` — matching the official sample's pattern
3. Pass `AudioSource.fromPfd(pipe[0])` to `SpeechRecognizer.startRecognition()`
4. Video A (speech from 0ms, source 48kHz): transcribes successfully
5. Video B (speech from ~1000ms with ~800ms leading silence, source 44.1kHz): fails with `ERROR_TYPE_INVALID_REQUEST` in 595ms
```kotlin
val pipe = ParcelFileDescriptor.createPipe()
 
// Background thread: write raw PCM to pipe (no WAV header)
thread {
    FileOutputStream(pipe[1].fileDescriptor).use { fos ->
        fos.write(rawPcmBytes) // 16kHz, mono, 16-bit PCM
    }
    pipe[1].close()
}
 
// Collect recognition results
recognizer.startRecognition(
    speechRecognizerRequest { audioSource = AudioSource.fromPfd(pipe[0]) }
).collect { response ->
    when (response) {
        is SpeechRecognizerResponse.ErrorResponse -> {
            // Video B hits this immediately:
            // "Speech recognition engine is closed due to internal error: ERROR_TYPE_INVALID_REQUEST"
        }
        // ...
    }
}
```
 
Audio energy analysis (RMS per 200ms window) for the failing video:
```
   0ms: RMS=1241  (brief decode artifact)
 200ms: RMS=0     (silence)
 400ms: RMS=5     (silence)
 600ms: RMS=3     (silence)
 800ms: RMS=367   (transition)
1000ms: RMS=2897  (speech starts here — but API already rejected at 595ms)
```
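
For anyone reproducing these measurements, a windowed RMS like the one above can be sketched as follows. This is a minimal sketch rather than the exact code used for the figures: it assumes 16-bit little-endian mono PCM, and `rmsPerWindow` is a hypothetical helper name.

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder
import kotlin.math.sqrt

// Hypothetical helper: RMS of each fixed-length window of raw PCM.
// Assumes 16-bit little-endian mono samples (an assumption, not an ML Kit requirement).
fun rmsPerWindow(pcm: ByteArray, sampleRate: Int = 16_000, windowMs: Int = 200): List<Double> {
    val samples = ByteBuffer.wrap(pcm).order(ByteOrder.LITTLE_ENDIAN).asShortBuffer()
    val windowSize = sampleRate * windowMs / 1000 // samples per window
    val result = mutableListOf<Double>()
    var start = 0
    while (start < samples.limit()) {
        // The last window may be shorter than windowSize.
        val end = minOf(start + windowSize, samples.limit())
        var sumSquares = 0.0
        for (i in start until end) {
            val s = samples.get(i).toDouble()
            sumSquares += s * s
        }
        result += sqrt(sumSquares / (end - start))
        start = end
    }
    return result
}
```

Each list element corresponds to one window, so index `i` covers the audio starting at `i * windowMs` milliseconds.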
 
**Workaround:** Trimming leading silence with a custom energy-based VAD before piping to ML Kit resolves the issue — Video B transcribes successfully after trimming.
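
A rough sketch of what such an energy-based trim might look like is shown below. It is illustrative only and not the exact VAD used here: `trimLeadingSilence` is a hypothetical helper, the threshold and window values are assumed and would need tuning per recording, and the input is assumed to be 16-bit little-endian mono PCM. It requires several consecutive loud windows before cutting, so that a brief artifact like the one at 0ms in the table above is not mistaken for speech.

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder
import kotlin.math.sqrt

// Hypothetical helper: drop audio before the first sustained run of loud windows.
// rmsThreshold and minActiveWindows are assumed values, not documented ML Kit limits.
fun trimLeadingSilence(
    pcm: ByteArray,
    sampleRate: Int = 16_000,
    windowMs: Int = 20,
    rmsThreshold: Double = 500.0,
    minActiveWindows: Int = 5
): ByteArray {
    val samples = ByteBuffer.wrap(pcm).order(ByteOrder.LITTLE_ENDIAN).asShortBuffer()
    val windowSize = sampleRate * windowMs / 1000 // samples per window
    var runStart = -1 // sample index where the current loud run began
    var runLength = 0 // number of consecutive loud windows seen so far
    var start = 0
    while (start + windowSize <= samples.limit()) {
        var sumSquares = 0.0
        for (i in start until start + windowSize) {
            val s = samples.get(i).toDouble()
            sumSquares += s * s
        }
        if (sqrt(sumSquares / windowSize) >= rmsThreshold) {
            if (runLength == 0) runStart = start
            runLength++
            if (runLength >= minActiveWindows) {
                // Two bytes per 16-bit sample.
                return pcm.copyOfRange(runStart * 2, pcm.size)
            }
        } else {
            runLength = 0 // a short burst does not count as speech
        }
        start += windowSize
    }
    return pcm // no sustained speech found; return the input unchanged
}
```

In the reproduction above, the trimmed array would be written to the pipe in place of `rawPcmBytes`.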
 
**Expected behavior**
 
The API should either:
- Wait for speech to appear in the audio stream before making an accept/reject decision, or
- Document the leading silence tolerance and audio content requirements for `fromPfd()`

Additionally, `AudioSource.fromPfd()` documentation should specify supported audio formats (raw PCM vs WAV, expected sample rate, channels, encoding).
 
**SDK Info:**
- SDK Name & Version: `com.google.mlkit:genai-speech-recognition:1.0.0-alpha1`
- Mode: `SpeechRecognizerOptions.Mode.MODE_ADVANCED`

**Smartphone:**
- Device: Pixel 10
- Device OS: Android 15
- AICore: Up to date, bootloader locked

**Development Environment:**
- Android Studio Ladybug
- Kotlin 2.1.0

