Jun 27, 2025 10:33:00

OpenAI's transcription API can be used more cheaply by speeding up audio data 2x or 3x

OpenAI offers a variety of AI functions through APIs, including an API that transcribes voice data and outputs it as text. Regarding this transcription API, software engineer George Mandis reports that 'speeding up voice data by two or three times can reduce costs without compromising quality.'

OpenAI Charges by the Minute, So Make the Minutes Shorter • George Mandis
https://george.mand.is/2025/06/openai-charges-by-the-minute-so-make-the-minutes-shorter/

OpenAI's transcription API is priced as follows. Per 1 million input tokens, the high-performance 'gpt-4o-transcribe' costs $6 (about 864 yen) and the lighter 'gpt-4o-mini-transcribe' costs $3 (about 432 yen). The estimated cost per minute of voice data is $0.006 (about 0.86 yen) for 'gpt-4o-transcribe' and $0.003 (about 0.43 yen) for 'gpt-4o-mini-transcribe'. In other words, shortening the playback time of the voice data reduces the number of tokens, which makes the transcription API cheaper to use.

With Whisper, the price is set by the minute, not by the token: it costs $0.006 (about 0.86 yen) per minute of voice data. Here too, the shorter the playback time, the cheaper it becomes.
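The per-minute arithmetic can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, the 2372-second duration is the lecture recording discussed later in this article, and the sketch assumes the fee scales linearly with playback time with no rounding or minimum charge, which the article does not confirm.

```python
def estimated_cost_usd(duration_seconds: float, speed: float, usd_per_minute: float) -> float:
    """Estimate the fee for audio that is sped up before being uploaded.

    Assumption: the fee is strictly proportional to playback time.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    # Speeding audio up by `speed` divides its playback time by `speed`.
    billed_minutes = duration_seconds / speed / 60.0
    return billed_minutes * usd_per_minute


if __name__ == "__main__":
    # 2372 seconds is the 1x duration of the lecture in the table below;
    # $0.006 per minute is the rate quoted in the article.
    for speed in (1, 2, 3):
        cost = estimated_cost_usd(2372, speed, 0.006)
        print(f"{speed}x speed: ${cost:.4f}")
```

Under that assumption, doubling the speed halves the estimated fee and tripling it cuts the fee to a third.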

One way to shorten the playback time while preserving the content contained in the audio data is to 'trim unnecessary parts such as gaps between each speech,' but Mandis has succeeded in reducing costs without compromising transcription quality by 'processing the speech at twice or three times the normal speed without trimming.'

Mandis originally intended to transcribe 'audio data of about 40 minutes of lectures,' but he says that at 1x speed the file produced too many tokens to be transcribed. He therefore used ffmpeg to speed the audio data up to 2x before transcribing it, and found that this made it possible to transcribe it cheaply and with high quality.
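The article does not give the exact ffmpeg command Mandis used, so the following Python sketch is only one plausible way to do it. It relies on ffmpeg's 'atempo' audio filter, which changes tempo without changing pitch. Because some older ffmpeg builds limit a single 'atempo' stage to the range 0.5 to 2.0, the sketch chains several stages for larger factors. The function names and file names are hypothetical.

```python
import subprocess


def atempo_chain(speed: float) -> str:
    """Build an ffmpeg audio filter string that changes tempo by `speed`.

    Each stage is kept within 0.5..2.0 so the chain also works on
    ffmpeg builds that restrict a single atempo stage to that range.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    stages = []
    remaining = speed
    while remaining > 2.0:
        stages.append(2.0)
        remaining /= 2.0
    while remaining < 0.5:
        stages.append(0.5)
        remaining /= 0.5
    stages.append(remaining)
    return ",".join(f"atempo={stage:g}" for stage in stages)


def build_ffmpeg_command(src: str, dst: str, speed: float) -> list[str]:
    """Return the argument list for an ffmpeg call that speeds up `src`."""
    return ["ffmpeg", "-i", src, "-filter:a", atempo_chain(speed), dst]


def speed_up(src: str, dst: str, speed: float) -> None:
    """Run ffmpeg (assumed to be installed and on PATH)."""
    subprocess.run(build_ffmpeg_command(src, dst, speed), check=True)
```

For example, speed_up("lecture.m4a", "lecture-2x.m4a", 2.0) would write a double-speed copy that can then be uploaded to the transcription API in place of the original.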

Below is a table summarizing the number of tokens and fees when processing the 'audio data of a lecture of about 40 minutes' at twice or three times the speed.

Speed	Playback time	Number of tokens	Input cost	Output cost
1x speed	2,372 seconds	Could not be processed	Could not be processed	Could not be processed
2x speed	1,186 seconds	11,856	$0.07 (approx. 10.09 yen)	$0.02 (approx. 2.88 yen)
3x speed	791 seconds	7,904	$0.04 (approx. 5.76 yen)	$0.02 (approx. 2.88 yen)

At 2x and 3x speed, it was possible to transcribe without any loss of quality, but at 4x speed, the quality deteriorated rapidly and the same phrases were repeated over and over again.

Based on the results of the above tests, Mandis concluded that 'when using OpenAI's transcription API, processing the audio data at two or three times the speed can reduce costs.'

Jun 27, 2025 10:33:00 in Software, Posted by log1o_hf