
Instant Voice-to-Text

Click a Button, Speak, and Instantly See Your Words Appear as Text

In this article, you’ll build a small but powerful web app in Python that records your voice in the browser and sends it to OpenAI’s Whisper model for transcription, all in under 5 minutes.

Imagine clicking a button, speaking, and instantly seeing your words appear as text!



What We’ll Build

A simple browser page with:

  • Record button – Gradio microphone input

  • Submit button – Sends audio to Whisper

  • Text output – Displays the transcription



Prerequisites

Before we start, make sure you have:

  • Python 3.8+

  • An OpenAI API key (create one at platform.openai.com)

  • Basic familiarity with a terminal



Step 1: Project Setup


1. Create a Project Folder

mkdir voice-to-text
cd voice-to-text
python -m venv venv

2. Activate Virtual Environment (Windows PowerShell)

.\venv\Scripts\activate


⚠️ If PowerShell blocks scripts, run this once:
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser

Then reactivate:

.\venv\Scripts\activate
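
If you're on macOS or Linux instead, activation is a single command:

source venv/bin/activate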

3. Create requirements.txt

Inside your project folder, create a file called requirements.txt and add:

openai
gradio
python-dotenv
soundfile

4. Install Dependencies

With your virtual environment activated, run:

pip install -r requirements.txt

This installs all your project dependencies at once, making setup quick and reproducible.
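
If you later want the setup pinned to the exact versions you tested with, freeze your environment back into the file:

pip freeze > requirements.txt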

5. Create a .env File

Create a file called .env in your project root:

OPENAI_API_KEY=sk-YourKeyHere

This keeps your API key secure and separate from your code.
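
To confirm the key is actually being picked up before wiring the whole app, a throwaway check script helps (call it check_env.py; the name is arbitrary):

import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current directory
print("Key loaded:", bool(os.environ.get("OPENAI_API_KEY")))

Run it with python check_env.py; it should print Key loaded: True.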

Step 2: The Code

Create a file app.py and add:

import os
from dotenv import load_dotenv
from openai import OpenAI
import gradio as gr
import tempfile
import soundfile as sf

load_dotenv()
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

def transcribe(audio):
    if audio is None:
        return "No audio recorded."

    # With type="filepath" Gradio hands us a path, but if the component is
    # ever switched to type="numpy" it returns a (sample_rate, data) tuple,
    # so write that to a temporary WAV file first.
    if isinstance(audio, (tuple, list)):
        sr, data = audio
        tmp = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
        sf.write(tmp.name, data, sr)
        filepath = tmp.name
    else:
        filepath = audio

    with open(filepath, "rb") as f:
        resp = client.audio.transcriptions.create(model="whisper-1", file=f)
    return resp.text

with gr.Blocks() as demo:
    gr.Markdown("## 🎤 Voice-to-Text Mini App")
    audio = gr.Audio(sources=["microphone"], type="filepath", label="Speak here")  # Gradio 4+ uses sources=[...]
    btn = gr.Button("Submit")
    output = gr.Textbox(label="Transcribed Text")
    btn.click(fn=transcribe, inputs=audio, outputs=output)

if __name__ == "__main__":
    demo.launch()
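
The transcription endpoint also accepts optional parameters. If you know which language you'll speak, passing it as a hint tends to improve accuracy, and response_format="text" returns a bare string instead of a response object. A hedged variant of the API call (the language code here is just an example):

resp = client.audio.transcriptions.create(
    model="whisper-1",
    file=f,
    language="en",           # ISO-639-1 hint for the spoken language
    response_format="text",  # plain string instead of a response object
)
# note: with response_format="text", resp is already a str,
# so return resp instead of resp.text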

Step 3: Run the App

python app.py

Open the local URL Gradio shows (usually http://127.0.0.1:7860). Click Record, speak, then click Submit. Your words appear instantly in the Textbox.
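
Want to demo it to someone who isn't at your machine? Gradio can tunnel a temporary public URL if you change the last line to:

demo.launch(share=True)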


Common Pitfalls

  • API Key Error: Make sure .env exists in the project root and load_dotenv() is called; the check script from Step 1 verifies this.

  • TypeError: ... unexpected keyword argument 'proxies': This comes from an outdated openai package. Uninstall it, then reinstall:

pip uninstall openai
pip install --upgrade openai

  • Gradio Errors: Check the terminal for the real traceback, or surface it in the browser as in the sketch below.
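
For that last point, a variant of transcribe wrapped in try/except shows the error message right in the Textbox. This is a quick sketch rather than production error handling, and it assumes type="filepath" so audio is always a path:

def transcribe(audio):
    if audio is None:
        return "No audio recorded."
    try:
        with open(audio, "rb") as f:
            resp = client.audio.transcriptions.create(model="whisper-1", file=f)
        return resp.text
    except Exception as e:
        # show the real error in the UI instead of a silent failure
        return f"Transcription failed: {e}"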


Next Steps

  • Save transcripts to a file or database (a starter sketch follows this list).

  • Add language detection or translation.

  • Deploy on Hugging Face Spaces or Render (set your OPENAI_API_KEY in their Secrets panel).
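
For the first idea, a minimal starting point (the transcripts.txt name is just an example) appends each result with a timestamp:

from datetime import datetime

def save_transcript(text, path="transcripts.txt"):
    # append each transcription on its own timestamped line
    with open(path, "a", encoding="utf-8") as f:
        f.write(f"{datetime.now().isoformat()}\t{text}\n")

Call save_transcript(resp.text) inside transcribe just before returning.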

