
Instant Voice-to-Text

Click a Button, Speak, and Instantly See Your Words Appear as Text

In this article, you’ll build a small but powerful web app in Python that records your voice in the browser and sends it to OpenAI’s Whisper model for transcription, all in under 5 minutes.

Imagine clicking a button, speaking, and instantly seeing your words appear as text!



What We’ll Build

A simple browser page with:

  • Record button – Gradio microphone input

  • Submit button – Sends audio to Whisper

  • Text output – Displays the transcription



Prerequisites

Before we start, make sure you have:

  • Python 3.8+

  • An OpenAI API key (create one at platform.openai.com)

  • Basic familiarity with a terminal



Step 1: Project Setup


1. Create a Project Folder

mkdir voice-to-text
cd voice-to-text
python -m venv venv

2. Activate Virtual Environment (Windows PowerShell)

.\venv\Scripts\activate


⚠️ If PowerShell blocks scripts, run this once:
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser

Then reactivate:

.\venv\Scripts\activate
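
If you're on macOS or Linux instead, activation is a single command:

source venv/bin/activate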

3. Create requirements.txt

Inside your project folder, create a file called requirements.txt and add:

openai
gradio
python-dotenv
soundfile

4. Install Dependencies

With your virtual environment activated, run:

pip install -r requirements.txt

This installs all your project dependencies at once, making setup quick and reproducible.
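
If you later want the setup pinned to the exact versions you tested with, freeze your environment back into the file:

pip freeze > requirements.txt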

5. Create a .env File

Create a file called .env in your project root:

OPENAI_API_KEY=sk-YourKeyHere

This keeps your API key secure and separate from your code.
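
To confirm the key is actually being picked up before wiring the whole app, a throwaway check script helps (call it check_env.py; the name is arbitrary):

import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current directory
print("Key loaded:", bool(os.environ.get("OPENAI_API_KEY")))

Run it with python check_env.py; it should print Key loaded: True.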

Step 2: The Code

Create a file app.py and add:

import os
from dotenv import load_dotenv
from openai import OpenAI
import gradio as gr
import tempfile
import soundfile as sf

load_dotenv()
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

def transcribe(audio):
    if audio is None:
        return "No audio recorded."

    # With type="filepath" Gradio hands us a path, but if the component is
    # ever switched to type="numpy" it returns a (sample_rate, data) tuple,
    # so write that to a temporary WAV file first.
    if isinstance(audio, (tuple, list)):
        sr, data = audio
        tmp = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
        sf.write(tmp.name, data, sr)
        filepath = tmp.name
    else:
        filepath = audio

    with open(filepath, "rb") as f:
        resp = client.audio.transcriptions.create(model="whisper-1", file=f)
    return resp.text

with gr.Blocks() as demo:
    gr.Markdown("## 🎤 Voice-to-Text Mini App")
    audio = gr.Audio(sources=["microphone"], type="filepath", label="Speak here")  # Gradio 4+ uses sources=[...]
    btn = gr.Button("Submit")
    output = gr.Textbox(label="Transcribed Text")
    btn.click(fn=transcribe, inputs=audio, outputs=output)

if __name__ == "__main__":
    demo.launch()
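
The transcription endpoint also accepts optional parameters. If you know which language you'll speak, passing it as a hint tends to improve accuracy, and response_format="text" returns a bare string instead of a response object. A hedged variant of the API call (the language code here is just an example):

resp = client.audio.transcriptions.create(
    model="whisper-1",
    file=f,
    language="en",           # ISO-639-1 hint for the spoken language
    response_format="text",  # plain string instead of a response object
)
# note: with response_format="text", resp is already a str,
# so return resp instead of resp.text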

Step 3: Run the App

python app.py

Open the local URL Gradio shows (usually http://127.0.0.1:7860). Click Record, speak, then click Submit. Your words appear instantly in the Textbox.
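
Want to demo it to someone who isn't at your machine? Gradio can tunnel a temporary public URL if you change the last line to:

demo.launch(share=True)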


Common Pitfalls

  • API Key Error: Make sure .env exists in the project root and load_dotenv() is called; the check script from Step 1 verifies this.

  • TypeError: ... unexpected keyword argument 'proxies': This comes from an outdated openai package. Uninstall it, then reinstall:

pip uninstall openai
pip install --upgrade openai

  • Gradio Errors: Check the terminal for the real traceback, or surface it in the browser as in the sketch below.
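
For that last point, a variant of transcribe wrapped in try/except shows the error message right in the Textbox. This is a quick sketch rather than production error handling, and it assumes type="filepath" so audio is always a path:

def transcribe(audio):
    if audio is None:
        return "No audio recorded."
    try:
        with open(audio, "rb") as f:
            resp = client.audio.transcriptions.create(model="whisper-1", file=f)
        return resp.text
    except Exception as e:
        # show the real error in the UI instead of a silent failure
        return f"Transcription failed: {e}"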


Next Steps

  • Save transcripts to a file or database (a starter sketch follows this list).

  • Add language detection or translation.

  • Deploy on Hugging Face Spaces or Render (set your OPENAI_API_KEY in their Secrets panel).
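
For the first idea, a minimal starting point (the transcripts.txt name is just an example) appends each result with a timestamp:

from datetime import datetime

def save_transcript(text, path="transcripts.txt"):
    # append each transcription on its own timestamped line
    with open(path, "a", encoding="utf-8") as f:
        f.write(f"{datetime.now().isoformat()}\t{text}\n")

Call save_transcript(resp.text) inside transcribe just before returning.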

