Transcribing audio files can be time-consuming, especially when processing hours of content. That's why I created this PowerShell script that leverages OpenAI's Whisper model with GPU acceleration to achieve 4-5x faster transcription speeds compared to CPU-only processing.
🚀 What This Project Does
This project provides a PowerShell script that transcribes audio files (MP3) to text using OpenAI's Whisper model with GPU acceleration via faster-whisper. Whether you're transcribing podcasts, lectures, meetings, or sermons, this tool makes the process fast and efficient.
✨ Key Features
- 🎯 GPU Acceleration: Utilizes NVIDIA GPU for 4-5x faster transcription compared to CPU
- 🔄 Automatic Fallback: Falls back to CPU if GPU is unavailable
- 📦 Batch Processing: Processes multiple audio files in a single run
- 🎓 High Accuracy: Uses the medium Whisper model with Dutch language support
- ⚡ Easy to Use: Simple PowerShell script with minimal configuration
💻 Requirements
Hardware
- NVIDIA GPU: GeForce RTX series (tested on an RTX 5070 Ti with 16GB VRAM)
- VRAM: At least 5GB free for the medium model; 12GB recommended for the large models
Software
- Windows 10/11
- Python 3.10 or higher
- NVIDIA GPU Drivers: Latest drivers from NVIDIA
- PowerShell: Built-in on Windows
🛠️ Installation Guide
Step 1: Verify GPU Drivers
First, verify that your NVIDIA drivers are installed correctly by running:
nvidia-smi
This should display your GPU information. If not, download and install the latest drivers from NVIDIA's website.
Step 2: Create Virtual Environment
Navigate to the project directory and create a Python virtual environment:
cd "C:\Path\To\Your\Project"
python -m venv venv_gpu
Step 3: Activate Virtual Environment
.\venv_gpu\Scripts\Activate.ps1
Your prompt should now show (venv_gpu) indicating the virtual environment is active.
Step 4: Install PyTorch with CUDA 12.8
Important: For RTX 5070 Ti (Blackwell architecture) and newer GPUs, you need CUDA 12.8 support:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
For older GPUs (RTX 3000/4000 series), you can use CUDA 12.1:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
Step 5: Install faster-whisper
pip install faster-whisper
Step 6: Verify GPU Support
Verify that PyTorch can detect your GPU:
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}'); print(f'CUDA version: {torch.version.cuda}'); print(f'GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else None}')"
Expected output:
CUDA available: True
CUDA version: 12.8
GPU: NVIDIA GeForce RTX 5070 Ti
⚙️ Configuration
Audio Files Location
By default, the script looks for MP3 files in:
C:\Path\To\Your\Audio\Files\
To change this, edit the $folder variable in transcribe.ps1:
$folder = "C:\Your\Custom\Path\"
Whisper Model Selection
The script uses the medium model by default. Available models include:
- tiny - Fastest, least accurate (~1GB VRAM)
- base - Fast, basic accuracy (~1GB VRAM)
- small - Good balance (~2GB VRAM)
- medium - High accuracy (~5GB VRAM) [Default]
- large-v2 - Best accuracy (~10GB VRAM)
- large-v3 - Latest, best accuracy (~10GB VRAM)
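If you are unsure which model your card can handle, the VRAM figures above can be turned into a small chooser. This is an illustrative sketch (the `pick_model` helper and the rounded float16 VRAM numbers are taken from the list above, not from the faster-whisper API):

```python
# Hypothetical helper: pick the largest Whisper model that fits in a
# given VRAM budget, using the approximate float16 figures listed above.
MODEL_VRAM_GB = {
    "tiny": 1,
    "base": 1,
    "small": 2,
    "medium": 5,
    "large-v3": 10,
}

def pick_model(available_vram_gb: float) -> str:
    """Return the most accurate model that fits in available_vram_gb."""
    # Iterate from most to least demanding and take the first that fits.
    for name in sorted(MODEL_VRAM_GB, key=MODEL_VRAM_GB.get, reverse=True):
        if MODEL_VRAM_GB[name] <= available_vram_gb:
            return name
    return "tiny"  # smallest model as a last resort

print(pick_model(12))  # a 12GB budget fits large-v3
print(pick_model(4))   # 4GB free fits up to small
```

The returned name can be passed straight to `WhisperModel(...)` in the script below.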
Language Configuration
The script is configured for Dutch (nl). To change the language, modify the transcribe line:
segments, info = model.transcribe(audio_file, language="nl", beam_size=5)
Supported languages include: en (English), nl (Dutch), de (German),
fr (French), es (Spanish), and many more.
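Whisper uses short ISO 639-1-style codes, and a typo here (e.g. "dutch" instead of "nl") is an easy mistake. A tiny validation sketch, using only the subset of codes mentioned above (the `check_language` helper and the partial code table are illustrative, not part of faster-whisper):

```python
# A few of the language codes Whisper accepts (ISO 639-1 style).
# This mapping is a small illustrative subset, not the full list.
WHISPER_LANGS = {
    "en": "English",
    "nl": "Dutch",
    "de": "German",
    "fr": "French",
    "es": "Spanish",
}

def check_language(code: str) -> str:
    """Validate a language code before passing it to model.transcribe()."""
    if code not in WHISPER_LANGS:
        raise ValueError(f"Unknown language code: {code!r}")
    return code
```

Alternatively, if you leave the language argument out entirely, faster-whisper will auto-detect it and report the result via info.language, which is what the script's "Detected language" message prints.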
🎯 Usage
Basic Usage
- Place your MP3 files in the configured folder
- Activate the virtual environment (if not already active):
.\venv_gpu\Scripts\Activate.ps1
- Run the script:
.\transcribe.ps1
Output
The script will:
- Process each MP3 file in the folder
- Create a .txt file with the same name as the audio file
- Save the transcription in the same folder as the audio file
- Display progress in the console with colored output
First Run
The first time you run the script, it will download the Whisper model (~1.5GB for medium model). This only happens once, as the model is cached locally.
📝 Complete PowerShell Script
Here's the complete transcribe.ps1 script that powers the GPU-accelerated transcription:
# GPU-accelerated transcription using faster-whisper

# Folder containing audio files
$folder = "C:\Path\To\Your\Audio\Files\"

# Get all supported audio files
$files = Get-ChildItem -Path $folder -Filter *.mp3

# Python script to run faster-whisper
$pythonScript = @'
import sys
from faster_whisper import WhisperModel

# Load model with GPU support
# Options: tiny, base, small, medium, large-v2, large-v3
# device options: "cuda" for GPU, "cpu" for CPU
try:
    model = WhisperModel("medium", device="cuda", compute_type="float16")
    print("Using GPU acceleration")
except Exception as e:
    print(f"GPU not available ({e}), falling back to CPU")
    model = WhisperModel("medium", device="cpu", compute_type="int8")

audio_file = sys.argv[1]
output_file = sys.argv[2]

print(f"Transcribing: {audio_file}")

# Transcribe
segments, info = model.transcribe(audio_file, language="nl", beam_size=5)
print(f"Detected language '{info.language}' with probability {info.language_probability}")

# Write to file
with open(output_file, 'w', encoding='utf-8') as f:
    for segment in segments:
        f.write(segment.text + "\n")

print(f"Transcription saved to: {output_file}")
'@

# Save Python script temporarily
$tempPythonScript = "$env:TEMP\transcribe_gpu.py"
$pythonScript | Out-File -FilePath $tempPythonScript -Encoding UTF8

foreach ($file in $files) {
    Write-Host "Transcribing: $($file.Name)" -ForegroundColor Green

    # Output filename
    $outputFile = [System.IO.Path]::ChangeExtension($file.FullName, ".txt")

    # Run Python script
    python $tempPythonScript $file.FullName $outputFile

    Write-Host "Completed: $($file.Name)" -ForegroundColor Cyan
    Write-Host ""
}

# Cleanup
Remove-Item $tempPythonScript -ErrorAction SilentlyContinue

Write-Host "All files processed." -ForegroundColor Green
This script combines PowerShell for file handling with Python's faster-whisper library for the actual transcription, making it easy to batch process multiple audio files with GPU acceleration.
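The script currently filters on *.mp3 only. Because faster-whisper decodes audio through PyAV, other common formats (WAV, M4A, FLAC) generally work as well. A sketch of a broader, case-insensitive file scan, assuming a hypothetical `find_audio_files` helper and an extension list you would adjust to taste:

```python
from pathlib import Path

# Extensions to pick up; faster-whisper decodes audio via PyAV, so common
# formats beyond MP3 usually work too. Adjust this set as needed.
AUDIO_EXTS = {".mp3", ".wav", ".m4a", ".flac"}

def find_audio_files(folder: str) -> list[Path]:
    """Return all audio files in folder, matched case-insensitively."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in AUDIO_EXTS
    )
```

In the PowerShell wrapper, the same effect can be achieved by filtering the Get-ChildItem results on extension instead of using a single -Filter *.mp3 pattern.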
⚡ Performance
Speed Comparison
For a 60-minute audio file on RTX 5070 Ti:
| Method | Time | Speed |
|---|---|---|
| whisper-cli.exe (CPU) | ~45-60 min | 1x |
| faster-whisper (GPU) | ~8-12 min | 4-5x |
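The arithmetic behind the table is simple: the CPU path runs at roughly real time, and the GPU path is about 4-5x faster. A back-of-the-envelope estimator (the `estimate_gpu_minutes` function is a hypothetical helper, not part of any library):

```python
# Rough estimate of GPU transcription time from the figures in the table:
# the CPU baseline runs at roughly 1x real time, the GPU path 4-5x faster.
def estimate_gpu_minutes(audio_minutes: float, speedup: float = 5.0) -> float:
    """Estimated wall-clock minutes to transcribe audio_minutes of audio."""
    cpu_minutes = audio_minutes  # CPU baseline: roughly real time
    return cpu_minutes / speedup

print(estimate_gpu_minutes(60))  # 12.0, consistent with the ~8-12 min row
```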
Memory Usage
| Model | VRAM Usage | Accuracy |
|---|---|---|
| tiny | ~1GB | Basic |
| small | ~2GB | Good |
| medium | ~5GB | High |
| large-v3 | ~10GB | Best |
🔧 Troubleshooting
GPU Not Detected
Error: Library cublas64_12.dll is not found or cannot be loaded
Solution: Make sure you installed PyTorch with the correct CUDA version:
pip uninstall torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
Script Falls Back to CPU
If you see "GPU not available, falling back to CPU", check:
- GPU drivers are installed: nvidia-smi
- PyTorch can see the GPU: python -c "import torch; print(torch.cuda.is_available())"
- The virtual environment is activated
- Correct CUDA version installed for your GPU
ModuleNotFoundError
Error: ModuleNotFoundError: No module named 'faster_whisper'
Solution: Make sure the virtual environment is activated and faster-whisper is installed:
.\venv_gpu\Scripts\Activate.ps1
pip install faster-whisper
Out of Memory Error
If you get CUDA out of memory errors:
- Use a smaller model (small instead of medium)
- Use compute_type="int8" instead of "float16"
- Close other GPU-intensive applications
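The advice above can be ordered into an explicit fallback chain: try the most capable configuration first and degrade gracefully when a load fails. A sketch, assuming a hypothetical `load_with_fallback` helper; in the real script, `loader` would be `faster_whisper.WhisperModel`, injected here so the chain itself stays testable without a GPU:

```python
# Hypothetical fallback chain implementing the troubleshooting advice:
# smaller compute type first, then a smaller model, then CPU.
FALLBACK_CONFIGS = [
    {"model": "medium", "device": "cuda", "compute_type": "float16"},
    {"model": "medium", "device": "cuda", "compute_type": "int8"},
    {"model": "small",  "device": "cuda", "compute_type": "int8"},
    {"model": "medium", "device": "cpu",  "compute_type": "int8"},
]

def load_with_fallback(loader):
    """Try each configuration in order, returning the first that loads."""
    last_error = None
    for cfg in FALLBACK_CONFIGS:
        try:
            model = loader(cfg["model"], device=cfg["device"],
                           compute_type=cfg["compute_type"])
            return model, cfg
        except Exception as error:  # e.g. CUDA out-of-memory
            last_error = error
    raise RuntimeError("All configurations failed") from last_error
```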
🎓 Advanced Configuration
Compute Type Options
For better performance or lower memory usage, you can adjust the compute type:
# Best accuracy, highest VRAM usage
model = WhisperModel("medium", device="cuda", compute_type="float32")
# Good balance (default)
model = WhisperModel("medium", device="cuda", compute_type="float16")
# Lower VRAM usage, slightly lower accuracy
model = WhisperModel("medium", device="cuda", compute_type="int8")
Beam Size
Adjust the beam size for transcription quality vs speed tradeoff:
# Faster, less accurate
segments, info = model.transcribe(audio_file, language="nl", beam_size=1)
# Balanced (default)
segments, info = model.transcribe(audio_file, language="nl", beam_size=5)
# Slower, more accurate
segments, info = model.transcribe(audio_file, language="nl", beam_size=10)
💡 Why faster-whisper?
faster-whisper is preferred over the standard OpenAI Whisper implementation because:
- 4-5x faster transcription speed
- Lower memory usage (both VRAM and RAM)
- Same accuracy - uses identical models
- Better GPU utilization through CTranslate2 optimization
- Production-ready - widely used in real-world applications
🎯 Real-World Use Cases
- Podcast Transcription: Quickly transcribe podcast episodes for show notes and accessibility
- Meeting Minutes: Convert recorded meetings to searchable text documents
- Lecture Notes: Transcribe educational content for students
- Content Creation: Generate subtitles for video content
- Research: Transcribe interviews and focus groups for qualitative analysis
📦 Dependencies
The project uses the following Python packages:
- torch (with CUDA 12.8): PyTorch deep learning framework
- faster-whisper: Optimized Whisper implementation
- ctranslate2: Fast inference engine
- av: Audio/video processing
- onnxruntime: Runtime for ONNX models
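For reference, a minimal unpinned requirements sketch covering the direct dependencies; ctranslate2, av, and onnxruntime are pulled in automatically as dependencies of faster-whisper, and torch must be installed from the CUDA index URL shown in Step 4 rather than from this file:

```
faster-whisper
torch
torchvision
torchaudio
```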
💬 Final Thoughts
GPU-accelerated audio transcription has completely transformed my workflow. What used to take an hour now takes just 10-12 minutes, allowing me to process multiple audio files in the time it would have taken to transcribe a single file using CPU-only methods.
The combination of OpenAI's Whisper model and faster-whisper's optimization makes this an incredibly powerful tool for anyone who regularly needs to transcribe audio content. Whether you're a content creator, researcher, or just need to transcribe the occasional meeting, this solution offers both speed and accuracy.
Give it a try and experience the power of GPU-accelerated transcription yourself!