◈ Automatic Speech Recognition · Benchmark Report

Model Performance Analysis

Model Limitations for Long Audio
General Purpose Models
For Gemma-4, transcribing long audio files in a single pass is not feasible. The Gemma-4 e4b model used for benchmarking has a context window of 128k tokens and consumes 25 tokens per second of audio, so a one-hour audio file requires 60 * 60 * 25 = 90,000 tokens, roughly 70% of the window, for the audio alone. Once the tokens required for the model's thinking process are added, the available context is insufficient to transcribe the entire file at once. Chunking the audio and transcribing the chunks sequentially is a possible workaround, but it is significantly slower and was therefore not used.

The same limitation applies to other general-purpose models, such as Gemini 3.1 Pro and Gemini 3 Flash.
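The token arithmetic above can be sketched in a few lines of Python. This is an illustrative sketch, not the benchmark's code: the function names are hypothetical, "128k" is read as 128,000 tokens (an assumption), and the number of tokens reserved for thinking is a placeholder parameter because the report does not state it.

```python
import math

TOKENS_PER_SECOND = 25     # audio token rate stated in the report
CONTEXT_WINDOW = 128_000   # "128k" read as 128,000 tokens (assumption)


def audio_tokens(duration_s: float, tokens_per_second: int = TOKENS_PER_SECOND) -> int:
    """Tokens consumed by the audio input alone."""
    return math.ceil(duration_s * tokens_per_second)


def max_audio_seconds(context_window: int, reserved_tokens: int,
                      tokens_per_second: int = TOKENS_PER_SECOND) -> int:
    """Longest audio (in seconds) that fits once `reserved_tokens` are set
    aside for thinking and output. `reserved_tokens` is a hypothetical
    budget; the report gives no figure for it."""
    return max(0, (context_window - reserved_tokens) // tokens_per_second)


def chunk_spans(duration_s: int, chunk_s: int) -> list[tuple[int, int]]:
    """Split [0, duration_s) into consecutive (start, end) spans of at most
    chunk_s seconds, as the sequential chunking workaround would need."""
    if chunk_s <= 0:
        raise ValueError("chunk_s must be positive")
    return [(start, min(start + chunk_s, duration_s))
            for start in range(0, duration_s, chunk_s)]


if __name__ == "__main__":
    one_hour = 60 * 60
    print(audio_tokens(one_hour))                 # audio-only token cost
    print(CONTEXT_WINDOW - audio_tokens(one_hour))  # tokens left for everything else
```

With a chosen reserve, `max_audio_seconds` gives the chunk length to pass to `chunk_spans`; each span would then be transcribed in its own request, which is the slower sequential path the report chose not to use.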
Model Comparison
aggregate metrics
Sample-Level Results