OpenAI's transcription API can be used cheaply by speeding up audio data by 2x or 3x

OpenAI offers a variety of AI functions through its APIs, including one that transcribes audio data into text. Software engineer George Mandis reports that this transcription API can be used more cheaply by speeding up the audio before submitting it.
OpenAI Charges by the Minute, So Make the Minutes Shorter • George Mandis
https://george.mand.is/2025/06/openai-charges-by-the-minute-so-make-the-minutes-shorter/
The usage fees for OpenAI's transcription API are as follows. The fee per 1 million input tokens is $6 (about 864 yen) for the high-performance 'gpt-4o-transcribe' and $3 (about 432 yen) for the more modest 'gpt-4o-mini-transcribe'. The estimated cost per minute of audio is $0.006 (about 0.86 yen) for 'gpt-4o-transcribe' and $0.003 (about 0.43 yen) for 'gpt-4o-mini-transcribe'. Because the number of input tokens grows with the playback time, the transcription API can be used more cheaply by shortening the playback time of the audio and thereby reducing the number of tokens.

With Whisper, the price is set by playback time rather than by token: $0.006 (about 0.86 yen) per minute of audio. Here too, the shorter the playback time, the cheaper it becomes.
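
As a rough illustration of per-minute pricing, the sketch below estimates the fee for a recording before and after it is sped up. The function name `estimate_cost` and its parameters are hypothetical, the rate is the $0.006 per minute quoted above, and the 2,372-second duration is simply an example of a roughly 40-minute recording. Actual billing may round differently.

```python
def estimate_cost(duration_seconds: float, speed: float,
                  rate_per_minute: float = 0.006) -> float:
    """Estimate the fee in dollars for audio played back at `speed` times normal."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    # Speeding up by a factor of `speed` divides the playback time by that factor.
    billed_minutes = duration_seconds / speed / 60
    return billed_minutes * rate_per_minute


for speed in (1, 2, 3):
    print(f"{speed}x: ${estimate_cost(2372, speed):.4f}")
```

Under these assumptions the estimate falls from roughly $0.24 at 1x to roughly $0.12 at 2x and $0.08 at 3x, in direct proportion to the playback time.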

One way to shorten the playback time while preserving the content of the audio is to trim unnecessary parts such as the gaps between utterances. Mandis, however, succeeded in reducing costs without compromising transcription quality by speeding the audio up to two or three times normal speed without trimming anything.
Mandis originally intended to transcribe the audio of a lecture about 40 minutes long, but he was unable to transcribe it at 1x speed because it produced too many tokens. He therefore used ffmpeg to speed the audio up to 2x before transcribing it, and found that the result was both cheap and high in quality.
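
The article does not show the exact ffmpeg command that was used, so the following is only a minimal sketch of one way to speed up audio with ffmpeg's `atempo` audio filter. The helper names (`atempo_chain`, `build_ffmpeg_command`, `speed_up`) and the file names are hypothetical. The filter is chained in steps of at most 2.0 as a conservative assumption, since some ffmpeg builds restrict a single `atempo` instance to the 0.5 to 2.0 range.

```python
import subprocess


def atempo_chain(speed: float) -> str:
    """Build an ffmpeg audio filter string that changes tempo by `speed`."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    factors = []
    remaining = speed
    # Split the factor into steps within 0.5..2.0 for compatibility.
    while remaining > 2.0:
        factors.append(2.0)
        remaining /= 2.0
    while remaining < 0.5:
        factors.append(0.5)
        remaining /= 0.5
    factors.append(remaining)
    return ",".join(f"atempo={f:g}" for f in factors)


def build_ffmpeg_command(src: str, dst: str, speed: float) -> list[str]:
    """Return the ffmpeg argument list without running it."""
    return ["ffmpeg", "-i", src, "-filter:a", atempo_chain(speed), dst]


def speed_up(src: str, dst: str, speed: float) -> None:
    """Run ffmpeg; requires ffmpeg to be installed and on PATH."""
    subprocess.run(build_ffmpeg_command(src, dst, speed), check=True)


# Hypothetical file names, shown only to illustrate the generated command.
print(" ".join(build_ffmpeg_command("lecture.m4a", "lecture-3x.m4a", 3)))
```

Calling `speed_up("lecture.m4a", "lecture-3x.m4a", 3)` would then write a file one third as long, which can be submitted for transcription in place of the original.
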
Below is a table summarizing the number of tokens and the fees when the roughly 40-minute lecture audio was processed at two and three times normal speed.

| Playback speed | Playback time | Number of tokens | Input cost | Output cost |
|---|---|---|---|---|
| 1x speed | 2,372 seconds | Input not possible | Input not possible | Input not possible |
| 2x speed | 1,186 seconds | 11,856 | $0.07 (about 10.09 yen) | $0.02 (about 2.88 yen) |
| 3x speed | 791 seconds | 7,904 | $0.04 (about 5.76 yen) | $0.02 (about 2.88 yen) |

At 2x and 3x speed, the audio could be transcribed without any loss of quality, but at 4x speed the quality deteriorated rapidly, with the same phrases repeated over and over.
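
The figures in the table can be cross-checked with a little arithmetic. The sketch below, using hypothetical helper names, confirms that the playback times are the original 2,372 seconds divided by the speed factor, and computes the number of input tokens per second of submitted audio. The resulting rate is only an observation from this one test, not a documented property of the API.

```python
LECTURE_SECONDS = 2372

# speed factor -> (playback seconds, input tokens), copied from the table
TABLE = {2: (1186, 11856), 3: (791, 7904)}


def sped_up_seconds(original_seconds: float, speed: float) -> int:
    """Playback time after speeding up, rounded to whole seconds."""
    return round(original_seconds / speed)


def tokens_per_second(seconds: float, tokens: int) -> float:
    """Input tokens produced per second of submitted audio."""
    return tokens / seconds


for speed, (seconds, tokens) in TABLE.items():
    matches = sped_up_seconds(LECTURE_SECONDS, speed) == seconds
    rate = tokens_per_second(seconds, tokens)
    print(f"{speed}x: duration matches={matches}, {rate:.2f} tokens/second")
```

Both rows work out to roughly 10 input tokens per second of submitted audio, which is consistent with the token count, and therefore the input cost, shrinking in proportion to the playback time.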

Based on the results of the above tests, Mandis concluded that 'when using OpenAI's transcription API, processing the audio data at two or three times the speed can reduce costs.'
in Software, Posted by log1o_hf