How can I simulate real-time streaming transcription using OpenAI API? #2307
Unanswered
Santoshchodipilli
asked this question in Q&A
Replies: 1 comment · 2 replies
-
You're on the right track: emulating streaming transcription by chunking the audio is the best workaround available at the moment, since OpenAI's whisper-1 API only supports batch processing, not streaming. Let me describe it for you:
Method: Chunked Streaming Simulation
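For reference, here is a minimal sketch of the chunked approach. It assumes `openai>=1.0` and `pydub` (which needs ffmpeg) are installed, `OPENAI_API_KEY` is set in the environment, and `speech.mp3` is just a placeholder input file; the 10-second chunk length is illustrative, not a recommendation:

```python
# Minimal sketch: simulate "streaming" by splitting audio into chunks
# and transcribing each chunk with whisper-1 as soon as it is ready.
import io
from openai import OpenAI
from pydub import AudioSegment

client = OpenAI()  # reads OPENAI_API_KEY from the environment
CHUNK_MS = 10_000  # 10-second chunks; tune for latency vs. accuracy

audio = AudioSegment.from_file("speech.mp3")  # placeholder input file

for start in range(0, len(audio), CHUNK_MS):
    chunk = audio[start:start + CHUNK_MS]

    # The API needs a named file-like object so it can infer the format
    buf = io.BytesIO()
    chunk.export(buf, format="mp3")
    buf.name = "chunk.mp3"
    buf.seek(0)

    result = client.audio.transcriptions.create(model="whisper-1", file=buf)
    print(result.text, flush=True)  # partial transcript as each chunk finishes
```

Because each chunk is transcribed independently, words that straddle a chunk boundary can be dropped or garbled; overlapping consecutive chunks by a second or two and de-duplicating the overlapping text is a common mitigation.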
If you're interested, I can help you set up a full real-time transcription pipeline.
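As a rough illustration of what that setup could look like (untested against your project, and it assumes the `sounddevice` and `soundfile` packages plus a working default microphone), short blocks are recorded and then sent to whisper-1 one at a time:

```python
# Sketch: capture short blocks from the microphone and transcribe each one.
import io
import sounddevice as sd
import soundfile as sf
from openai import OpenAI

client = OpenAI()
SAMPLE_RATE = 16_000   # Hz
BLOCK_SECONDS = 5      # length of each recorded block; illustrative only

while True:
    # Record one block from the default microphone (blocks until done)
    frames = sd.rec(int(BLOCK_SECONDS * SAMPLE_RATE),
                    samplerate=SAMPLE_RATE, channels=1, dtype="int16")
    sd.wait()

    # Wrap the block as an in-memory WAV file for the API
    buf = io.BytesIO()
    sf.write(buf, frames, SAMPLE_RATE, format="WAV")
    buf.name = "block.wav"
    buf.seek(0)

    result = client.audio.transcriptions.create(model="whisper-1", file=buf)
    print(result.text, flush=True)
```

Note that this sketch records and transcribes sequentially, so audio arriving while a request is in flight is lost; a real pipeline would put recording and transcription on separate threads connected by a queue.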
-
I'm working on a project where I want to convert speech to text in real time using OpenAI's Whisper model. I see that Whisper's hosted API (whisper-1) currently only supports batch mode: sending a full audio file and receiving the full transcript.
I'm trying to achieve a streaming-like transcription experience, where I can start receiving partial transcriptions as audio is still being recorded or uploaded.
Is there a way to simulate streaming transcription using Whisper?
I'm using Python.
I considered chunking the audio into small parts and sending them sequentially.
Is that the best approach, or is there a better method?
Also, is there any public roadmap or timeline for when the official OpenAI Whisper API might support real-time streaming transcription?
Thanks in advance!