Transcribing audio files can be time-consuming, especially when processing hours of content. That's why I created this PowerShell script that leverages OpenAI's Whisper model with GPU acceleration to achieve 4-5x faster transcription speeds compared to CPU-only processing.

🚀 What This Project Does

This project provides a PowerShell script that transcribes audio files (MP3) to text using OpenAI's Whisper model with GPU acceleration via faster-whisper. Whether you're transcribing podcasts, lectures, meetings, or sermons, this tool makes the process fast and efficient.

✨ Key Features

  • 🎯 GPU Acceleration: Utilizes NVIDIA GPU for 4-5x faster transcription compared to CPU
  • 🔄 Automatic Fallback: Falls back to CPU if GPU is unavailable
  • 📦 Batch Processing: Processes multiple audio files in a single run
  • 🎓 High Accuracy: Uses the medium Whisper model, configured here for Dutch (any Whisper-supported language works)
  • ⚡ Easy to Use: Simple PowerShell script with minimal configuration

💻 Requirements

Hardware

  • NVIDIA GPU: GeForce RTX series (tested on an RTX 5070 Ti with 16GB VRAM)
  • VRAM: At least 5GB free for the medium model; 12GB recommended for the large models

Software

  • Windows 10/11
  • Python 3.10 or higher
  • NVIDIA GPU Drivers: Latest drivers from NVIDIA
  • PowerShell: Built-in on Windows

🛠️ Installation Guide

Step 1: Verify GPU Drivers

First, verify that your NVIDIA drivers are installed correctly by running:

nvidia-smi

This should display your GPU information. If not, download and install the latest drivers from NVIDIA's website.

Step 2: Create Virtual Environment

Navigate to the project directory and create a Python virtual environment:

cd "C:\Path\To\Your\Project"
python -m venv venv_gpu

Step 3: Activate Virtual Environment

.\venv_gpu\Scripts\Activate.ps1

Your prompt should now show (venv_gpu) indicating the virtual environment is active.

Step 4: Install PyTorch with CUDA 12.8

Important: For RTX 5070 Ti (Blackwell architecture) and newer GPUs, you need CUDA 12.8 support:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

For older GPUs (RTX 3000/4000 series), you can use CUDA 12.1:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

Step 5: Install faster-whisper

pip install faster-whisper

Step 6: Verify GPU Support

Verify that PyTorch can detect your GPU:

python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}'); print(f'CUDA version: {torch.version.cuda}'); print(f'GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else None}')"

Expected output:

CUDA available: True
CUDA version: 12.8
GPU: NVIDIA GeForce RTX 5070 Ti

⚙️ Configuration

Audio Files Location

By default, the script looks for MP3 files in:

C:\Path\To\Your\Audio\Files\

To change this, edit the $folder variable in transcribe.ps1:

$folder = "C:\Your\Custom\Path\"

Whisper Model Selection

The script uses the medium model by default. Available models include:

  • tiny - Fastest, least accurate (~1GB VRAM)
  • base - Fast, basic accuracy (~1GB VRAM)
  • small - Good balance (~2GB VRAM)
  • medium - High accuracy (~5GB VRAM) [Default]
  • large-v2 - Best accuracy (~10GB VRAM)
  • large-v3 - Latest, best accuracy (~10GB VRAM)
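If you are unsure which model your GPU can handle, the approximate VRAM figures above can drive a small chooser. The helper below is a hypothetical convenience function (not part of the script) that picks the most accurate model fitting a given VRAM budget:

```python
# Hypothetical helper: pick the largest Whisper model that fits a VRAM budget.
# The figures are the approximate values listed above (float16 inference).
MODEL_VRAM_GB = {
    "tiny": 1,
    "base": 1,
    "small": 2,
    "medium": 5,
    "large-v3": 10,
}

def pick_model(vram_gb: float) -> str:
    """Return the most accurate model whose approximate VRAM need fits."""
    best = "tiny"
    for name, need in MODEL_VRAM_GB.items():
        if need <= vram_gb:
            best = name  # dict preserves insertion order: later = more accurate
    return best

print(pick_model(6))   # medium
print(pick_model(16))  # large-v3
```

With 6GB of free VRAM this picks medium (the default used here); a 16GB card such as the RTX 5070 Ti can run large-v3.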

Language Configuration

The script is configured for Dutch (nl). To change the language, modify the transcribe line:

segments, info = model.transcribe(audio_file, language="nl", beam_size=5)

Supported languages include: en (English), nl (Dutch), de (German), fr (French), es (Spanish), and many more.
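When editing the language, it is easy to mistype a code. The lookup below is a hypothetical helper covering only the examples just listed; Whisper itself accepts many more languages:

```python
# Hypothetical lookup: language name -> ISO 639-1 code passed to Whisper.
# Only the examples mentioned above are included.
LANG_CODES = {
    "english": "en",
    "dutch": "nl",
    "german": "de",
    "french": "fr",
    "spanish": "es",
}

def lang_code(name: str) -> str:
    """Return the Whisper language code for a name, case-insensitively."""
    try:
        return LANG_CODES[name.strip().lower()]
    except KeyError:
        raise ValueError(f"Unknown language: {name!r}") from None

print(lang_code("Dutch"))  # nl
```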

🎯 Usage

Basic Usage

  1. Place your MP3 files in the configured folder
  2. Activate the virtual environment (if not already active):
    .\venv_gpu\Scripts\Activate.ps1
  3. Run the script:
    .\transcribe.ps1

Output

The script will:

  • Process each MP3 file in the folder
  • Create a .txt file with the same name as the audio file
  • Save the transcription in the same folder as the audio file
  • Display progress in the console with colored output

First Run

The first time you run the script, it will download the Whisper model (~1.5GB for the medium model). This only happens once, as the model is cached locally.

📝 Complete PowerShell Script

Here's the complete transcribe.ps1 script that powers the GPU-accelerated transcription:

# GPU-accelerated transcription using faster-whisper
# Folder containing audio files
$folder = "C:\Path\To\Your\Audio\Files\"

# Get all MP3 files in the folder
$files = Get-ChildItem -Path $folder -Filter *.mp3

# Python script to run faster-whisper
$pythonScript = @'
import sys
from faster_whisper import WhisperModel

# Load model with GPU support
# Options: tiny, base, small, medium, large-v2, large-v3
# device options: "cuda" for GPU, "cpu" for CPU
try:
    model = WhisperModel("medium", device="cuda", compute_type="float16")
    print("Using GPU acceleration")
except Exception as e:
    print(f"GPU not available ({e}), falling back to CPU")
    model = WhisperModel("medium", device="cpu", compute_type="int8")

audio_file = sys.argv[1]
output_file = sys.argv[2]

print(f"Transcribing: {audio_file}")

# Transcribe
segments, info = model.transcribe(audio_file, language="nl", beam_size=5)

print(f"Detected language '{info.language}' with probability {info.language_probability}")

# Write to file
with open(output_file, 'w', encoding='utf-8') as f:
    for segment in segments:
        f.write(segment.text + "\n")

print(f"Transcription saved to: {output_file}")
'@

# Save Python script temporarily
$tempPythonScript = "$env:TEMP\transcribe_gpu.py"
$pythonScript | Out-File -FilePath $tempPythonScript -Encoding UTF8

foreach ($file in $files) {
    Write-Host "Transcribing: $($file.Name)" -ForegroundColor Green
    
    # Output filename
    $outputFile = [System.IO.Path]::ChangeExtension($file.FullName, ".txt")
    
    # Run Python script
    python $tempPythonScript $file.FullName $outputFile
    
    Write-Host "Completed: $($file.Name)" -ForegroundColor Cyan
    Write-Host ""
}

# Cleanup
Remove-Item $tempPythonScript -ErrorAction SilentlyContinue

Write-Host "All files processed." -ForegroundColor Green

This script combines PowerShell for file handling with Python's faster-whisper library for the actual transcription, making it easy to batch process multiple audio files with GPU acceleration.

⚡ Performance

Speed Comparison

For a 60-minute audio file on RTX 5070 Ti:

Method                  Time         Speedup
whisper-cli.exe (CPU)   ~45-60 min   1x
faster-whisper (GPU)    ~8-12 min    4-5x
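The quoted speedup follows directly from those timings: dividing the CPU range by the GPU range gives roughly 3.8-7.5x, so 4-5x is a conservative mid-range figure. A quick sanity check of the arithmetic:

```python
# Sanity-check the speedup range implied by the timings above (in minutes).
cpu_min, cpu_max = 45, 60   # whisper-cli.exe on CPU
gpu_min, gpu_max = 8, 12    # faster-whisper on GPU

slowest = cpu_min / gpu_max  # worst case: fast CPU run vs slow GPU run
fastest = cpu_max / gpu_min  # best case: slow CPU run vs fast GPU run

print(f"speedup range: {slowest:.2f}x - {fastest:.2f}x")
```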

Memory Usage

Model      VRAM Usage   Accuracy
tiny       ~1GB         Basic
small      ~2GB         Good
medium     ~5GB         High
large-v3   ~10GB        Best

🔧 Troubleshooting

GPU Not Detected

Error: Library cublas64_12.dll is not found or cannot be loaded

Solution: Make sure you installed PyTorch with the correct CUDA version:

pip uninstall torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

Script Falls Back to CPU

If you see "GPU not available, falling back to CPU", check:

  • GPU drivers are installed: nvidia-smi
  • PyTorch can see GPU: python -c "import torch; print(torch.cuda.is_available())"
  • Virtual environment is activated
  • Correct CUDA version installed for your GPU

ModuleNotFoundError

Error: ModuleNotFoundError: No module named 'faster_whisper'

Solution: Make sure the virtual environment is activated and faster-whisper is installed:

.\venv_gpu\Scripts\Activate.ps1
pip install faster-whisper

Out of Memory Error

If you get CUDA out of memory errors:

  • Use a smaller model (small instead of medium)
  • Use compute_type="int8" instead of "float16"
  • Close other GPU-intensive applications

🎓 Advanced Configuration

Compute Type Options

For better performance or lower memory usage, you can adjust the compute type:

# Best accuracy, highest VRAM usage
model = WhisperModel("medium", device="cuda", compute_type="float32")

# Good balance (default)
model = WhisperModel("medium", device="cuda", compute_type="float16")

# Lower VRAM usage, slightly lower accuracy
model = WhisperModel("medium", device="cuda", compute_type="int8")

Beam Size

Adjust the beam size for transcription quality vs speed tradeoff:

# Faster, less accurate
segments, info = model.transcribe(audio_file, language="nl", beam_size=1)

# Balanced (default)
segments, info = model.transcribe(audio_file, language="nl", beam_size=5)

# Slower, more accurate
segments, info = model.transcribe(audio_file, language="nl", beam_size=10)

💡 Why faster-whisper?

faster-whisper is preferred over the standard OpenAI Whisper implementation because:

  • 4-5x faster transcription speed
  • Lower memory usage (both VRAM and RAM)
  • Same accuracy - uses identical models
  • Better GPU utilization through CTranslate2 optimization
  • Production-ready - widely used in real-world applications

🎯 Real-World Use Cases

  • Podcast Transcription: Quickly transcribe podcast episodes for show notes and accessibility
  • Meeting Minutes: Convert recorded meetings to searchable text documents
  • Lecture Notes: Transcribe educational content for students
  • Content Creation: Generate subtitles for video content
  • Research: Transcribe interviews and focus groups for qualitative analysis

📦 Dependencies

The project uses the following Python packages:

  • torch (with CUDA 12.8): PyTorch deep learning framework
  • faster-whisper: Optimized Whisper implementation
  • ctranslate2: Fast inference engine
  • av: Audio/video processing
  • onnxruntime: Runtime for ONNX models

💬 Final Thoughts

GPU-accelerated audio transcription has completely transformed my workflow. What used to take an hour now takes just 10-12 minutes, allowing me to process multiple audio files in the time it would have taken to transcribe a single file using CPU-only methods.

The combination of OpenAI's Whisper model and faster-whisper's optimization makes this an incredibly powerful tool for anyone who regularly needs to transcribe audio content. Whether you're a content creator, researcher, or just need to transcribe the occasional meeting, this solution offers both speed and accuracy.

Give it a try and experience the power of GPU-accelerated transcription yourself!