
Vertex AI: input_audio_transcription silently ignored - zero input_transcription events emitted #2348

@david-labs-ca

Description

Environment

  • Model: gemini-live-2.5-flash-native-audio
  • Endpoint: Vertex AI (vertexai=True, project-based auth via ADC)
  • SDK version: google-genai v1.70.0
  • Platform: Cloud Run (Python 3.11)

Configuration

from google import genai
from google.genai import types

client = genai.Client(vertexai=True)  # project/location resolved via ADC/env

config = types.LiveConnectConfig(
    response_modalities=["AUDIO"],
    input_audio_transcription=types.AudioTranscriptionConfig(),
    output_audio_transcription=types.AudioTranscriptionConfig(),
    # ... speech_config, tools, etc.
)

async with client.aio.live.connect(
    model="gemini-live-2.5-flash-native-audio", config=config
) as session:
    async for response in session.receive():
        sc = response.server_content
        # sc.output_transcription - works, text arrives reliably
        # sc.input_transcription  - NEVER populated, always None

Problem

input_audio_transcription=AudioTranscriptionConfig() is accepted in LiveConnectConfig without any error or warning, but the Vertex AI backend never emits input_transcription events in LiveServerContent.

  • output_transcription (model speech → text): works correctly
  • input_transcription (user speech → text): never arrives - zero events across hundreds of sessions over multiple days
  • No error, no warning, no rejection of the config - it is silently swallowed

The SDK has the field (LiveServerContent.input_transcription: Optional[Transcription]), the config type exists (AudioTranscriptionConfig), the documentation describes it - but on Vertex AI the feature simply does not function.
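A simple tally inside the receive loop makes the asymmetry concrete. The sketch below is illustrative and assumes it runs inside the connect context from the configuration above; the function name is ours, not the SDK's:

async def count_transcription_events(session) -> tuple[int, int]:
    # Tally input vs. output transcription events for one live session.
    input_events = output_events = 0
    async for response in session.receive():
        sc = response.server_content
        if sc and sc.input_transcription:
            input_events += 1   # observed: stays at 0 on Vertex AI
        if sc and sc.output_transcription:
            output_events += 1  # observed: increments reliably
    return input_events, output_events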

Impact

We run a production full-duplex voice AI assistant on Vertex AI. We architected our turn management system around input_transcription events:

  1. User transcription display - users cannot see their own speech in the UI
  2. Turn semaphore - our dead-air detection relies on input_transcription to call record_user_speech() (see the sketch after this list). Since the event never fires, the system cannot distinguish "user is silent" from "user is speaking"
  3. Cascade failure - broken turn detection triggered aggressive reconnection loops, wasting compute and degrading UX
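For context, the turn semaphore reduces to roughly the following. This is a simplified sketch: record_user_speech() is the real entry point, everything else here is illustrative:

import time

class TurnSemaphore:
    # Dead-air detector: separates "user is silent" from "user is speaking".

    def __init__(self, dead_air_seconds: float = 5.0) -> None:
        self.dead_air_seconds = dead_air_seconds
        self._last_user_speech = time.monotonic()

    def record_user_speech(self) -> None:
        # Meant to be driven by input_transcription events. On Vertex AI
        # those never arrive, so this is never called and every pause
        # looks like dead air.
        self._last_user_speech = time.monotonic()

    def is_dead_air(self) -> bool:
        return time.monotonic() - self._last_user_speech > self.dead_air_seconds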

We spent 8+ hours debugging what we initially thought was Gemini session instability, deploying four production hotfixes, before tracing it to this single missing feature. The fix was a one-line fallback to browser-side SpeechRecognition.

Expected behavior

If input_audio_transcription is set in config:

  • server_content.input_transcription should contain user speech text (same as output_transcription does for model speech)

If the feature is not supported on Vertex AI:

  • The config should be rejected with a clear error, not silently accepted
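Concretely, an eager check at connect time would have surfaced the limitation immediately. The error below is hypothetical; any wording that makes the unsupported field visible would do:

# Desired: connect() fails fast when the backend ignores the field,
# instead of opening the session and silently dropping it.
async with client.aio.live.connect(
    model="gemini-live-2.5-flash-native-audio", config=config
) as session:
    ...
# Hypothetical error:
#   ValueError: input_audio_transcription is not supported for
#   gemini-live-2.5-flash-native-audio on Vertex AI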

Related issues

This has been reported multiple times with no clear resolution on Vertex AI.

Ask

  1. Is input_audio_transcription actually supported on Vertex AI for gemini-live-2.5-flash-native-audio?
  2. If not, please reject the config or document the limitation clearly
  3. If it is supposed to work, what's the timeline for a fix?

Silently accepting a config option and then ignoring it is the worst possible developer experience - especially for paying enterprise customers building production infrastructure on your platform.

Labels

  • priority: p2 (Moderately-important priority. Fix may not be included in next release.)
  • type: bug (Error or flaw in code with unintended results or allowing sub-optimal usage patterns.)
