Transcribing audio files can be time-consuming, especially when processing hours of content. That's why I created this PowerShell script that leverages OpenAI's Whisper model with GPU acceleration to achieve 4-5x faster transcription speeds compared to CPU-only processing.

🚀 What This Project Does

This project provides a PowerShell script that transcribes audio files (MP3) to text using OpenAI's Whisper model with GPU acceleration via faster-whisper. Whether you're transcribing podcasts, lectures, meetings, or sermons, this tool makes the process fast and efficient.

✨ Key Features

  • 🎯 GPU Acceleration: Utilizes NVIDIA GPU for 4-5x faster transcription compared to CPU
  • 🔄 Automatic Fallback: Falls back to CPU if GPU is unavailable
  • 📦 Batch Processing: Processes multiple audio files in a single run
  • 🎓 High Accuracy: Uses the medium Whisper model, configured here for Dutch (any Whisper-supported language works)
  • ⚡ Easy to Use: Simple PowerShell script with minimal configuration

💻 Requirements

Hardware

  • NVIDIA GPU: GeForce RTX series (tested on an RTX 5070 Ti with 16GB VRAM)
  • VRAM: At least 5GB free for the medium model; 12GB recommended for the large models

Software

  • Windows 10/11
  • Python 3.10 or higher
  • NVIDIA GPU Drivers: Latest drivers from NVIDIA
  • PowerShell: Built-in on Windows

🛠️ Installation Guide

Step 1: Verify GPU Drivers

First, verify that your NVIDIA drivers are installed correctly by running:

nvidia-smi

This should display your GPU information. If not, download and install the latest drivers from NVIDIA's website.

Step 2: Create Virtual Environment

Navigate to the project directory and create a Python virtual environment:

cd "C:\Path\To\Your\Project"
python -m venv venv_gpu

Step 3: Activate Virtual Environment

.\venv_gpu\Scripts\Activate.ps1

Your prompt should now show (venv_gpu) indicating the virtual environment is active.

Step 4: Install PyTorch with CUDA 12.8

Important: For RTX 5070 Ti (Blackwell architecture) and newer GPUs, you need CUDA 12.8 support:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

For older GPUs (RTX 3000/4000 series), you can use CUDA 12.1:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

Step 5: Install faster-whisper

pip install faster-whisper

Step 6: Verify GPU Support

Verify that PyTorch can detect your GPU:

python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}'); print(f'CUDA version: {torch.version.cuda}'); print(f'GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else None}')"

Expected output:

CUDA available: True
CUDA version: 12.8
GPU: NVIDIA GeForce RTX 5070 Ti

⚙️ Configuration

Audio Files Location

By default, the script looks for MP3 files in:

C:\Path\To\Your\Audio\Files\

To change this, edit the $folder variable in transcribe.ps1:

$folder = "C:\Your\Custom\Path\"

Whisper Model Selection

The script uses the medium model by default. Available models include:

  • tiny - Fastest, least accurate (~1GB VRAM)
  • base - Fast, basic accuracy (~1GB VRAM)
  • small - Good balance (~2GB VRAM)
  • medium - High accuracy (~5GB VRAM) [Default]
  • large-v2 - Best accuracy (~10GB VRAM)
  • large-v3 - Latest, best accuracy (~10GB VRAM)
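If you are unsure which model your GPU can handle, the approximate VRAM figures above can drive a small chooser. The helper below is a hypothetical convenience function (not part of the script) that picks the most accurate model fitting a given VRAM budget:

```python
# Hypothetical helper: pick the largest Whisper model that fits a VRAM budget.
# The figures are the approximate values listed above (float16 inference).
MODEL_VRAM_GB = {
    "tiny": 1,
    "base": 1,
    "small": 2,
    "medium": 5,
    "large-v3": 10,
}

def pick_model(vram_gb: float) -> str:
    """Return the most accurate model whose approximate VRAM need fits."""
    best = "tiny"
    for name, need in MODEL_VRAM_GB.items():
        if need <= vram_gb:
            best = name  # dict preserves insertion order: later = more accurate
    return best

print(pick_model(6))   # medium
print(pick_model(16))  # large-v3
```

With 6GB of free VRAM this picks medium (the default used here); a 16GB card such as the RTX 5070 Ti can run large-v3.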

Language Configuration

The script is configured for Dutch (nl). To change the language, modify the transcribe line:

segments, info = model.transcribe(audio_file, language="nl", beam_size=5)

Supported languages include: en (English), nl (Dutch), de (German), fr (French), es (Spanish), and many more.
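When editing the language, it is easy to mistype a code. The lookup below is a hypothetical helper covering only the examples just listed; Whisper itself accepts many more languages:

```python
# Hypothetical lookup: language name -> ISO 639-1 code passed to Whisper.
# Only the examples mentioned above are included.
LANG_CODES = {
    "english": "en",
    "dutch": "nl",
    "german": "de",
    "french": "fr",
    "spanish": "es",
}

def lang_code(name: str) -> str:
    """Return the Whisper language code for a name, case-insensitively."""
    try:
        return LANG_CODES[name.strip().lower()]
    except KeyError:
        raise ValueError(f"Unknown language: {name!r}") from None

print(lang_code("Dutch"))  # nl
```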

🎯 Usage

Basic Usage

  1. Place your MP3 files in the configured folder
  2. Activate the virtual environment (if not already active):
    .\venv_gpu\Scripts\Activate.ps1
  3. Run the script:
    .\transcribe.ps1

Output

The script will:

  • Process each MP3 file in the folder
  • Create a .txt file with the same name as the audio file
  • Save the transcription in the same folder as the audio file
  • Display progress in the console with colored output

First Run

The first time you run the script, it will download the Whisper model (~1.5GB for the medium model). This only happens once, as the model is cached locally.

📝 Complete PowerShell Script

Here's the complete transcribe.ps1 script that powers the GPU-accelerated transcription:

# GPU-accelerated transcription using faster-whisper
# Folder containing audio files
$folder = "C:\Path\To\Your\Audio\Files\"

# Get all MP3 files in the folder
$files = Get-ChildItem -Path $folder -Filter *.mp3

# Python script to run faster-whisper
$pythonScript = @'
import sys
from faster_whisper import WhisperModel

# Load model with GPU support
# Options: tiny, base, small, medium, large-v2, large-v3
# device options: "cuda" for GPU, "cpu" for CPU
try:
    model = WhisperModel("medium", device="cuda", compute_type="float16")
    print("Using GPU acceleration")
except Exception as e:
    print(f"GPU not available ({e}), falling back to CPU")
    model = WhisperModel("medium", device="cpu", compute_type="int8")

audio_file = sys.argv[1]
output_file = sys.argv[2]

print(f"Transcribing: {audio_file}")

# Transcribe
segments, info = model.transcribe(audio_file, language="nl", beam_size=5)

print(f"Detected language '{info.language}' with probability {info.language_probability}")

# Write to file
with open(output_file, 'w', encoding='utf-8') as f:
    for segment in segments:
        f.write(segment.text + "\n")

print(f"Transcription saved to: {output_file}")
'@

# Save Python script temporarily
$tempPythonScript = "$env:TEMP\transcribe_gpu.py"
$pythonScript | Out-File -FilePath $tempPythonScript -Encoding UTF8

foreach ($file in $files) {
    Write-Host "Transcribing: $($file.Name)" -ForegroundColor Green
    
    # Output filename
    $outputFile = [System.IO.Path]::ChangeExtension($file.FullName, ".txt")
    
    # Run Python script
    python $tempPythonScript $file.FullName $outputFile
    
    Write-Host "Completed: $($file.Name)" -ForegroundColor Cyan
    Write-Host ""
}

# Cleanup
Remove-Item $tempPythonScript -ErrorAction SilentlyContinue

Write-Host "All files processed." -ForegroundColor Green

This script combines PowerShell for file handling with Python's faster-whisper library for the actual transcription, making it easy to batch process multiple audio files with GPU acceleration.

⚡ Performance

Speed Comparison

For a 60-minute audio file on RTX 5070 Ti:

Method                  Time         Speedup
whisper-cli.exe (CPU)   ~45-60 min   1x
faster-whisper (GPU)    ~8-12 min    4-5x
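The quoted speedup follows directly from those timings: dividing the CPU range by the GPU range gives roughly 3.8-7.5x, so 4-5x is a conservative mid-range figure. A quick sanity check of the arithmetic:

```python
# Sanity-check the speedup range implied by the timings above (in minutes).
cpu_min, cpu_max = 45, 60   # whisper-cli.exe on CPU
gpu_min, gpu_max = 8, 12    # faster-whisper on GPU

slowest = cpu_min / gpu_max  # worst case: fast CPU run vs slow GPU run
fastest = cpu_max / gpu_min  # best case: slow CPU run vs fast GPU run

print(f"speedup range: {slowest:.2f}x - {fastest:.2f}x")
```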

Memory Usage

Model      VRAM Usage   Accuracy
tiny       ~1GB         Basic
small      ~2GB         Good
medium     ~5GB         High
large-v3   ~10GB        Best

🔧 Troubleshooting

GPU Not Detected

Error: Library cublas64_12.dll is not found or cannot be loaded

Solution: Make sure you installed PyTorch with the correct CUDA version:

pip uninstall torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

Script Falls Back to CPU

If you see "GPU not available, falling back to CPU", check:

  • GPU drivers are installed: nvidia-smi
  • PyTorch can see GPU: python -c "import torch; print(torch.cuda.is_available())"
  • Virtual environment is activated
  • Correct CUDA version installed for your GPU

ModuleNotFoundError

Error: ModuleNotFoundError: No module named 'faster_whisper'

Solution: Make sure the virtual environment is activated and faster-whisper is installed:

.\venv_gpu\Scripts\Activate.ps1
pip install faster-whisper

Out of Memory Error

If you get CUDA out of memory errors:

  • Use a smaller model (small instead of medium)
  • Use compute_type="int8" instead of "float16"
  • Close other GPU-intensive applications

🎓 Advanced Configuration

Compute Type Options

For better performance or lower memory usage, you can adjust the compute type:

# Best accuracy, highest VRAM usage
model = WhisperModel("medium", device="cuda", compute_type="float32")

# Good balance (default)
model = WhisperModel("medium", device="cuda", compute_type="float16")

# Lower VRAM usage, slightly lower accuracy
model = WhisperModel("medium", device="cuda", compute_type="int8")

Beam Size

Adjust the beam size for transcription quality vs speed tradeoff:

# Faster, less accurate
segments, info = model.transcribe(audio_file, language="nl", beam_size=1)

# Balanced (default)
segments, info = model.transcribe(audio_file, language="nl", beam_size=5)

# Slower, more accurate
segments, info = model.transcribe(audio_file, language="nl", beam_size=10)

💡 Why faster-whisper?

faster-whisper is preferred over the standard OpenAI Whisper implementation because:

  • 4-5x faster transcription speed
  • Lower memory usage (both VRAM and RAM)
  • Same accuracy - uses identical models
  • Better GPU utilization through CTranslate2 optimization
  • Production-ready - widely used in real-world applications

🎯 Real-World Use Cases

  • Podcast Transcription: Quickly transcribe podcast episodes for show notes and accessibility
  • Meeting Minutes: Convert recorded meetings to searchable text documents
  • Lecture Notes: Transcribe educational content for students
  • Content Creation: Generate subtitles for video content
  • Research: Transcribe interviews and focus groups for qualitative analysis

📦 Dependencies

The project uses the following Python packages:

  • torch (with CUDA 12.8): PyTorch deep learning framework
  • faster-whisper: Optimized Whisper implementation
  • ctranslate2: Fast inference engine
  • av: Audio/video processing
  • onnxruntime: Runtime for ONNX models

💬 Final Thoughts

GPU-accelerated audio transcription has completely transformed my workflow. What used to take an hour now takes just 10-12 minutes, allowing me to process multiple audio files in the time it would have taken to transcribe a single file using CPU-only methods.

The combination of OpenAI's Whisper model and faster-whisper's optimization makes this an incredibly powerful tool for anyone who regularly needs to transcribe audio content. Whether you're a content creator, researcher, or just need to transcribe the occasional meeting, this solution offers both speed and accuracy.

Give it a try and experience the power of GPU-accelerated transcription yourself!