[Bug report] GenAI Speech Recognition: ERROR_TYPE_INVALID_REQUEST when audio begins with silence (fromPfd) #1027

@aB9s

Describe the bug
SpeechRecognizer with MODE_ADVANCED rejects valid PCM audio streamed via AudioSource.fromPfd() with ERROR_TYPE_INVALID_REQUEST when the audio begins with ~800 ms of silence. The same code path succeeds for audio where speech starts immediately. The engine appears to apply an early silence-based rejection: it errors out after ~595 ms, apparently before the speech content at ~1000 ms is reached.

There is no documentation on audio content requirements (e.g., leading silence tolerance) or supported formats for AudioSource.fromPfd(). The official sample app only demonstrates live microphone input, not pre-recorded audio file input.

To Reproduce

  1. Decode two MP4 videos to raw PCM (16kHz, mono, 16-bit) using MediaExtractor + MediaCodec
  2. Stream the raw PCM bytes (no WAV header) through ParcelFileDescriptor.createPipe() — matching the official sample's pattern
  3. Pass AudioSource.fromPfd(pipe[0]) to SpeechRecognizer.startRecognition()
  4. Video A (speech from 0ms, source 48kHz): transcribes successfully
  5. Video B (speech from ~1000ms with ~800ms leading silence, source 44.1kHz): fails with ERROR_TYPE_INVALID_REQUEST in 595ms

val pipe = ParcelFileDescriptor.createPipe()
 
// Background thread: write raw PCM to pipe (no WAV header)
thread {
    FileOutputStream(pipe[1].fileDescriptor).use { fos ->
        fos.write(rawPcmBytes) // 16kHz, mono, 16-bit PCM
    }
    pipe[1].close()
}
 
// Collect recognition results
recognizer.startRecognition(
    speechRecognizerRequest { audioSource = AudioSource.fromPfd(pipe[0]) }
).collect { response ->
    when (response) {
        is SpeechRecognizerResponse.ErrorResponse -> {
            // Video B hits this immediately:
            // "Speech recognition engine is closed due to internal error: ERROR_TYPE_INVALID_REQUEST"
        }
        // ...
    }
}

Audio energy analysis (RMS per 200ms window) for the failing video:

   0ms: RMS=1241  (brief decode artifact)
 200ms: RMS=0     (silence)
 400ms: RMS=5     (silence)
 600ms: RMS=3     (silence)
 800ms: RMS=367   (transition)
1000ms: RMS=2897  (speech starts here — but API already rejected at 595ms)
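
For anyone reproducing the analysis above, here is a minimal sketch of a per-window RMS computation, assuming 16-bit little-endian mono PCM at 16 kHz. `rmsPerWindow` is a hypothetical helper name, not part of ML Kit, and it may differ from how the numbers above were actually produced.

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder
import kotlin.math.sqrt

// Hypothetical helper (not an ML Kit API): RMS of each fixed-length window
// of 16-bit little-endian mono PCM. A trailing partial window is dropped.
fun rmsPerWindow(pcm: ByteArray, sampleRate: Int = 16_000, windowMs: Int = 200): List<Double> {
    val samples = ByteBuffer.wrap(pcm).order(ByteOrder.LITTLE_ENDIAN).asShortBuffer()
    val windowSamples = sampleRate * windowMs / 1000
    val result = ArrayList<Double>()
    var start = 0
    while (start + windowSamples <= samples.limit()) {
        var sumSquares = 0.0
        for (i in start until start + windowSamples) {
            val s = samples.get(i).toDouble()
            sumSquares += s * s
        }
        result.add(sqrt(sumSquares / windowSamples))
        start += windowSamples
    }
    return result
}
```

Calling `rmsPerWindow(rawPcmBytes)` yields one value per 200 ms window; printing each value next to `index * 200` gives a listing in the same shape as the one above.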

Workaround: Trimming leading silence with a custom energy-based VAD before piping the audio to ML Kit resolves the issue. Video B transcribes successfully after trimming.
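
A minimal sketch of one way to do such trimming, not the exact VAD used in this report. It assumes 16-bit little-endian mono PCM; the function name, the 20 ms window, the RMS threshold of 500, and the requirement of five consecutive loud windows are all illustrative assumptions. Requiring a sustained run is meant to skip a brief spike like the one at 0 ms in the table above, assuming the spike is shorter than the run.

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder
import kotlin.math.sqrt

// Hypothetical helper (not an ML Kit API): drops everything before the first
// run of `minLoudWindows` consecutive windows whose RMS reaches `rmsThreshold`.
// All default values are illustrative assumptions, not tuned constants.
fun trimLeadingSilence(
    pcm: ByteArray,
    sampleRate: Int = 16_000,
    windowMs: Int = 20,
    rmsThreshold: Double = 500.0,
    minLoudWindows: Int = 5,
): ByteArray {
    val samples = ByteBuffer.wrap(pcm).order(ByteOrder.LITTLE_ENDIAN).asShortBuffer()
    val windowSamples = sampleRate * windowMs / 1000
    val windowCount = samples.limit() / windowSamples
    var run = 0 // number of consecutive loud windows seen so far
    for (w in 0 until windowCount) {
        var sumSquares = 0.0
        for (i in w * windowSamples until (w + 1) * windowSamples) {
            val s = samples.get(i).toDouble()
            sumSquares += s * s
        }
        val loud = sqrt(sumSquares / windowSamples) >= rmsThreshold
        run = if (loud) run + 1 else 0
        if (run == minLoudWindows) {
            // Cut at the start of the run; 2 bytes per 16-bit sample.
            val firstLoudWindow = w - minLoudWindows + 1
            return pcm.copyOfRange(firstLoudWindow * windowSamples * 2, pcm.size)
        }
    }
    return pcm // no sustained loud run found: return the input unchanged
}
```

In the repro above, this would be applied as `fos.write(trimLeadingSilence(rawPcmBytes))`. Whether the engine tolerates the short remainder of silence that a coarser threshold might leave is not something this sketch can guarantee.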

Expected behavior

The API should either:

  • Wait for speech to appear in the audio stream before making an accept/reject decision, or
  • Document the leading silence tolerance and audio content requirements for fromPfd()

Additionally, AudioSource.fromPfd() documentation should specify supported audio formats (raw PCM vs WAV, expected sample rate, channels, encoding).

SDK Info:

  • SDK Name & Version: com.google.mlkit:genai-speech-recognition:1.0.0-alpha1
  • Mode: SpeechRecognizerOptions.Mode.MODE_ADVANCED

Smartphone:

  • Device: Pixel 10
  • Device OS: Android 15
  • AICore: Up to date, bootloader locked

Development Environment:

  • Android Studio Ladybug
  • Kotlin 2.1.0
