Alok Bishoyi

Whisper's Hidden Optimization

TL;DR: Speeding audio up to 1.5x before sending it to Whisper cuts costs by 33% while adding only a 2.4% word error rate.
This shouldn't work as well as it does.

Most people think of OpenAI's pricing as fixed. $0.006 per minute for Whisper transcription. You send audio, you get charged for the duration, end of story.

But there's a lever hidden in plain sight. Whisper charges by duration, not by the amount of speech. Speed up the audio before sending it, and you pay less.
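
For a rough sense of the math (my own back-of-envelope numbers, using the $0.006/minute rate above and a hypothetical one-hour file):

price_per_minute = 0.006   # Whisper's published per-minute rate
duration_minutes = 60      # hypothetical one-hour recording
speed = 1.5                # speedup applied before upload

billed_minutes = duration_minutes / speed            # 40 minutes of audio after speedup
normal_cost = duration_minutes * price_per_minute    # $0.36
sped_up_cost = billed_minutes * price_per_minute     # $0.24
print(f"{1 - sped_up_cost / normal_cost:.0%} cheaper")  # 33% cheaper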

George Mandis noticed this first. I wanted to know exactly how far you could push it.

Result: 1.5x speed reduces costs by 33% with only 2.4% additional word errors. 1.2x gives you 17% savings with just 1.6% errors.

The Test

I took 19 YouTube videos - mostly educational content like 3Blue1Brown and Kurzgesagt - and ran each through Whisper at different speeds. From 1.0x to 2.0x, in 0.1x increments. 209 experiments total.
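
If you want to reproduce the sweep, the variant-generation step looks roughly like this (a sketch; the file names and loop are mine, not the original scripts):

import subprocess

speeds = [round(1.0 + 0.1 * i, 1) for i in range(11)]  # 1.0x through 2.0x; 11 speeds x 19 videos = 209 runs

for speed in speeds:
    # atempo changes tempo without shifting pitch; a single instance accepts 0.5-2.0
    subprocess.run(
        ["ffmpeg", "-y", "-i", "input.wav", "-af", f"atempo={speed}", f"input_{speed}x.wav"],
        check=True,
    )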

To measure quality, I compared each speed-adjusted transcription against Whisper's own 1.0x transcription of the same video. That gives a clean word error rate relative to the model's baseline, without the noise of YouTube's auto-captions.
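
The post doesn't name its scoring tool; one common choice is the jiwer library, and the comparison would look roughly like this (file names are placeholders):

import jiwer

reference = open("transcript_1.0x.txt").read()    # Whisper's transcript of the original audio
hypothesis = open("transcript_1.5x.txt").read()   # Whisper's transcript of the sped-up audio

# Word error rate of the sped-up transcript relative to the 1.0x baseline
print(f"Additional WER: {jiwer.wer(reference, hypothesis):.1%}")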

The results were cleaner than expected:

1.1x: 9% savings, 2.2% additional errors
1.2x: 17% savings, 2.5% additional errors
1.3x: 23% savings, 2.3% additional errors
1.4x: 29% savings, 2.8% additional errors
1.5x: 33% savings, 2.4% additional errors

Then it gets worse fast. 2.0x gives you 50% savings but 8.1% errors on average, with some videos hitting 20%+ error rates.

[Chart: Quality Impact Across All Speeds]

Why It Works

Whisper was trained on audio at different speeds. Not intentionally, but because the internet is full of sped-up audio. Podcasts at 1.25x, YouTube videos at 1.5x, compressed conference calls.

So when you send it 1.5x audio, it's not really outside the training distribution. The model has seen this before.

The cliff at 2.0x is where you leave familiar territory. Beyond that, you're asking the model to handle speeds it rarely encountered during training.

[Chart: Cost vs Quality Tradeoff]

How to Use This

Before sending audio to Whisper, speed it up with FFmpeg:

ffmpeg -i input.wav -af "atempo=1.5" output.wav

That's it. The atempo filter preserves pitch while changing speed, so voices don't sound like chipmunks.
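
Putting it together, here's a minimal end-to-end sketch using the official openai Python package (file names are placeholders; this assumes ffmpeg is on your PATH and OPENAI_API_KEY is set):

import subprocess
from openai import OpenAI

# Speed the audio up 1.5x before upload; atempo keeps voices at their natural pitch
subprocess.run(
    ["ffmpeg", "-y", "-i", "input.wav", "-af", "atempo=1.5", "sped_up.wav"],
    check=True,
)

# Transcribe the shorter file; you're billed for its reduced duration
client = OpenAI()
with open("sped_up.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=audio_file)

print(transcript.text)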

The data is there. Pick whatever tradeoff makes sense for your use case.

What This Means

Most optimizations involve complex tradeoffs. Better caching, smarter algorithms, more efficient data structures. This one is just a single parameter change.

It works because pricing models sometimes have edges you can exploit. Whisper charges by time, not by the amount of work. So you can do the same work faster and pay less.

The surprising part is how robust the model is to this manipulation. 1.5x speed should break things more than it does. But machine learning models trained on internet data are apparently quite tolerant of internet-speed audio.

Dataset: 209 experiments across 19 educational videos. Complete analysis and source code available on GitHub.