ASR Benchmark Report

Benchmark Insights


Model Limitations for Long Audio

General Purpose Models

For Gemma-4: Transcribing a long audio file in a single pass is not feasible. The Gemma-4 e4b model used for benchmarking has a context window of 128k tokens and consumes 25 tokens per second of audio, so a one-hour audio file requires 60 * 60 * 25 = 90,000 tokens for the audio alone. That leaves roughly 40,000 tokens, and once the additional tokens required for the model's thinking process are factored in, the remaining context is insufficient to transcribe the entire file at once. Chunking the audio and transcribing the chunks sequentially is a potential workaround, but it is significantly slower and was therefore not used.

The same limitation applies to other general-purpose models, such as Gemini 3.1 Pro and Gemini 3 Flash.
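
The token arithmetic for Gemma-4 can be checked with a short script. This is a minimal sketch, not the benchmark's code: the 25 tokens per second rate and the 128k window come from the report, while reading "128k" as exactly 128,000, the reserved-token budget, the chunk length, and all function names are assumptions made for illustration.

```python
import math

# Figures stated in the report for the Gemma-4 e4b model.
TOKENS_PER_AUDIO_SECOND = 25
# Assumption: "128k" is read as 128,000 tokens (it could also mean 131,072).
CONTEXT_WINDOW_TOKENS = 128_000


def audio_tokens(duration_s: int) -> int:
    """Tokens consumed by the audio alone for a clip of duration_s seconds."""
    return math.ceil(duration_s * TOKENS_PER_AUDIO_SECOND)


def max_chunk_seconds(reserved_tokens: int) -> int:
    """Longest audio chunk (in seconds) that fits in the context window after
    reserved_tokens are set aside for the prompt, thinking, and transcript.
    reserved_tokens is a hypothetical budget, not a measured value."""
    available = CONTEXT_WINDOW_TOKENS - reserved_tokens
    if available <= 0:
        raise ValueError("reserved_tokens exceeds the context window")
    return available // TOKENS_PER_AUDIO_SECOND


def plan_chunks(duration_s: int, chunk_s: int) -> list[tuple[int, int]]:
    """Split [0, duration_s) into consecutive (start, end) windows of at most
    chunk_s seconds, for sequential transcription."""
    if chunk_s <= 0:
        raise ValueError("chunk_s must be positive")
    return [
        (start, min(start + chunk_s, duration_s))
        for start in range(0, duration_s, chunk_s)
    ]


if __name__ == "__main__":
    one_hour = 60 * 60
    print(audio_tokens(one_hour))                     # audio-only token cost
    print(CONTEXT_WINDOW_TOKENS - audio_tokens(one_hour))  # tokens left over
    print(plan_chunks(one_hour, 1500))                # illustrative 25-minute chunks
```

The chunk planner only computes time boundaries. Actually cutting the audio and stitching the per-chunk transcripts back together is the slower sequential workflow that the benchmark chose not to use.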

Model Comparison

aggregate metrics


Sample-Level Results


Dataset Difficulty

easiest → hardest

Recommendations
