Most people think of OpenAI's pricing as fixed. $0.006 per minute for Whisper transcription. You send audio, you get charged for the duration, end of story.
But there's a lever hidden in plain sight. Whisper charges by duration, not by the amount of speech. Speed up the audio before sending it, and you pay less.
George Mandis noticed this first. I wanted to know exactly how far you could push it.
I took 19 YouTube videos - mostly educational content like 3Blue1Brown and Kurzgesagt - and ran each through Whisper at different speeds. From 1.0x to 2.0x, in 0.1x increments. 209 experiments total.
To measure quality, I compared each speed-adjusted transcription against the normal 1.0x version. This gives a clean word error rate without the noise of YouTube's auto-captions.
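If you want to reproduce the comparison, it's a few lines. A minimal sketch using the jiwer package (my reconstruction, not the exact script behind these numbers; the file names are placeholders):

```python
# Score a sped-up transcript against the 1.0x transcript as reference.
# Requires `pip install jiwer`.
import jiwer

baseline = open("transcript_1.0x.txt").read()  # Whisper output at normal speed
speedup = open("transcript_1.5x.txt").read()   # Whisper output at 1.5x

# jiwer.wer(reference, hypothesis) returns word error rate as a fraction
print(f"Relative WER at 1.5x: {jiwer.wer(baseline, speedup):.1%}")
```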
The results were cleaner than expected. Error rates stay close to the 1.0x baseline through moderate speedups, then get worse fast: 2.0x gives you 50% savings but 8.1% errors on average, with some videos hitting 20%+ error rates.
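The savings side of that tradeoff is pure arithmetic: audio sped up by a factor s has 1/s of its original duration, so it costs 1/s as much. A quick sketch (the function name is mine):

```python
RATE_PER_MIN = 0.006  # Whisper API price per minute of audio

def cost(duration_min: float, speed: float = 1.0) -> float:
    """Billed cost of transcribing audio sped up by `speed`."""
    return (duration_min / speed) * RATE_PER_MIN

# A 60-minute recording at each speed:
for s in (1.0, 1.5, 2.0):
    print(f"{s}x: ${cost(60, s):.3f} ({1 - 1 / s:.0%} saved)")
# 1.0x: $0.360 (0% saved)
# 1.5x: $0.240 (33% saved)
# 2.0x: $0.180 (50% saved)
```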
Whisper was trained on audio at different speeds. Not intentionally, but because the internet is full of sped-up audio. Podcasts at 1.25x, YouTube videos at 1.5x, compressed conference calls.
So when you send it 1.5x audio, it's not really outside the training distribution. The model has seen this before.
The cliff at 2.0x is where you leave familiar territory. Beyond that, you're asking the model to handle speeds it rarely encountered during training.
Before sending audio to Whisper, speed it up with FFmpeg:
ffmpeg -i input.wav -af "atempo=1.5" output.wav
That's it. The atempo filter preserves pitch while changing speed, so voices don't sound like chipmunks.
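Put together, the whole pipeline is short. A sketch using subprocess and the OpenAI Python SDK; it assumes ffmpeg is on your PATH and OPENAI_API_KEY is set, and the helper name is mine:

```python
import subprocess
from openai import OpenAI

def transcribe_sped_up(path: str, speed: float = 1.5) -> str:
    """Speed up the audio with FFmpeg, then transcribe it with Whisper."""
    sped = f"sped_{speed}x.wav"  # intermediate file; name is arbitrary
    subprocess.run(
        ["ffmpeg", "-y", "-i", path, "-af", f"atempo={speed}", sped],
        check=True,
    )
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    with open(sped, "rb") as f:
        return client.audio.transcriptions.create(model="whisper-1", file=f).text

print(transcribe_sped_up("input.wav"))
```

One caveat: older FFmpeg builds cap atempo at 2.0 per filter instance, so to go faster on those you chain filters, e.g. -af "atempo=2.0,atempo=1.25" for 2.5x.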
The data is there. Pick whatever tradeoff makes sense for your use case.
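If you want that pick to be mechanical, here's a hypothetical helper. The function and the placeholder numbers are mine, except the 2.0x average reported above; fill in WERs measured on your own content:

```python
measurements = {
    1.0: 0.0,    # baseline: identical to itself by definition
    1.5: None,   # fill in from your own runs
    2.0: 0.081,  # average error rate reported above
}

def fastest_speed(wer_budget: float, data: dict[float, float | None]) -> float:
    """Fastest speed whose measured error rate fits the budget."""
    ok = [s for s, w in data.items() if w is not None and w <= wer_budget]
    return max(ok, default=1.0)

print(fastest_speed(0.05, measurements))  # -> 1.0 until 1.5x is measured
```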
Most optimizations demand real engineering: better caching, smarter algorithms, more efficient data structures. This one is a single parameter change.
It works because pricing models sometimes have edges you can exploit. Whisper charges by audio duration, not by the amount of work. Pack the same words into less time and you pay less.
The surprising part is how robust the model is to this manipulation. 1.5x speed should break things more than it does. But machine learning models trained on internet data are apparently quite tolerant of internet-speed audio.