Audio Samples (total evaluated)
Models Tested (transcription systems)
Variations (test conditions)
For Gemma-4: Transcribing long audio files in a single pass is not feasible. The Gemma-4 e4b model used for benchmarking has a context window of 128k tokens and consumes 25 tokens per second of audio, so a one-hour audio file requires 60 * 60 * 25 = 90,000 tokens for the audio alone, roughly 70% of the window. The remaining budget must also cover the prompt, the model's thinking process, and the transcript itself, which leaves too little context to transcribe the entire file at once. Chunking the audio and transcribing the chunks sequentially is a potential workaround, but it is significantly slower and was therefore not used.
The same limitation applies to other general-purpose models, such as Gemini 3.1 Pro and Gemini 3 Flash.
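The token arithmetic in this note can be checked with a short calculation. The following is a minimal sketch, not part of the benchmark code: the function names are hypothetical, "128k" is assumed to mean 128,000 tokens, and the 40,000-token reserve in the usage example is an illustrative figure rather than a measured one.

```python
import math

TOKENS_PER_SECOND = 25    # audio token rate stated in the note above
CONTEXT_WINDOW = 128_000  # assumption: "128k" read as 128,000 tokens


def audio_tokens(duration_seconds: float,
                 tokens_per_second: int = TOKENS_PER_SECOND) -> int:
    """Tokens consumed by the audio input alone."""
    return math.ceil(duration_seconds * tokens_per_second)


def remaining_budget(duration_seconds: float,
                     context_window: int = CONTEXT_WINDOW) -> int:
    """Tokens left for the prompt, thinking, and transcript output."""
    return context_window - audio_tokens(duration_seconds)


def max_audio_seconds(reserved_tokens: int,
                      context_window: int = CONTEXT_WINDOW,
                      tokens_per_second: int = TOKENS_PER_SECOND) -> int:
    """Longest audio that fits once `reserved_tokens` are set aside
    for everything other than the audio itself (hypothetical reserve)."""
    return max(0, (context_window - reserved_tokens) // tokens_per_second)


one_hour = 60 * 60
print(audio_tokens(one_hour))      # tokens for one hour of audio
print(remaining_budget(one_hour))  # tokens left for everything else
print(max_audio_seconds(40_000))   # audio length that fits with a 40k reserve
```

Under these assumptions, one hour of audio leaves 38,000 tokens for the prompt, thinking, and transcript. Reserving 40,000 tokens for those would cap a single request at 3,520 seconds of audio, just under 59 minutes.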