Automated Video Generation Pipeline
A Python system that transforms voice memos into 4K videos using Whisper transcription, intelligent image matching, and cinematic effects
Overview
Built an end-to-end video generation pipeline that takes a voice recording and automatically produces a polished 4K video with matched stock footage, burned-in captions, Ken Burns effects, and intelligent audio ducking.
The Problem
Creating "faceless" content videos is tedious: record narration, transcribe it, find matching footage for each segment, add captions, apply effects, mix audio. A 5-minute video takes 3-4 hours of editing. Content creators need automation.
Architecture Decisions
Why Whisper for transcription?
- Word-level timestamps (critical for caption sync)
- Runs locally (no API costs at scale)
- High accuracy even with varied accents
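A minimal sketch of pulling word-level timestamps with the openai-whisper package (the model size and audio path are illustrative):

```python
import whisper  # pip install openai-whisper

# "base" trades accuracy for speed; the audio path is a placeholder.
model = whisper.load_model("base")
result = model.transcribe("voice_memo.wav", word_timestamps=True)

# Each segment carries per-word start/end times for caption sync.
for segment in result["segments"]:
    for word in segment["words"]:
        print(f"{word['start']:6.2f}s - {word['end']:6.2f}s  {word['word']}")
```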
Why TF-IDF for image matching?
- Fast similarity computation
- Works without training data
- Interpretable results (matches are easy to debug)
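A sketch of the matching idea with scikit-learn; the clip descriptions and transcript segment are made up, and the real pipeline also applies the finance-keyword expansion mentioned under Challenges:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stock-footage descriptions and one transcript segment.
clip_tags = [
    "stock market trading floor with charts",
    "calm forest river at sunrise",
    "city skyline traffic timelapse at night",
]
segment = "the market rallied while traders watched the charts"

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(clip_tags + [segment])

# Cosine similarity of the segment (last row) against every clip description.
scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
best = int(scores.argmax())
print(f"best match: {clip_tags[best]!r} (score {scores[best]:.2f})")
```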
Why MoviePy over raw FFmpeg?
- Pythonic API for complex compositions
- Built-in effects library
- Easier timeline manipulation
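For a flavor of the API, a toy composition in MoviePy 2.x (file names and the font path are placeholders); the same relative-positioning trick drives the caption burning described under Key Features:

```python
from moviepy import CompositeVideoClip, ImageClip, TextClip

background = ImageClip("scene.jpg").with_duration(5)
caption = (
    TextClip(text="Markets rallied today", font="DejaVuSans.ttf",
             font_size=72, color="white")
    .with_duration(5)
    # Center horizontally, anchor at 85% of frame height.
    .with_position(("center", 0.85), relative=True)
)
CompositeVideoClip([background, caption]).write_videofile("out.mp4", fps=30)
```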
Key Features
- Whisper Transcription: Word-level timestamps for precise caption sync
- TF-IDF Image Matching: Intelligent pairing of transcript segments to stock footage
- Ken Burns Effects: Automated zoom/pan animations (zoom in, zoom out, pan left/right)
- Caption Burning: Styled captions anchored at 85% of frame height, with word wrapping
- Audio Ducking: Music volume drops during speech and rises during silence (see the sketch after this list)
- 4K Output: Professional quality rendering
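The ducking itself is a gain envelope driven by the speech spans Whisper reports. A minimal sketch, assuming MoviePy 2.x (where `Clip.transform` replaced the old `fl`); the span list, file name, and gain levels are illustrative:

```python
import numpy as np
from moviepy import AudioFileClip

SPEECH_SPANS = [(0.8, 4.2), (5.0, 9.6)]  # (start, end) spans from the transcript
DUCKED, FULL = 0.2, 1.0                  # music gain during speech vs. silence

def gain_at(t):
    """Vectorized gain envelope: duck whenever a speech span is active."""
    t = np.atleast_1d(t)
    in_speech = np.zeros(t.shape, dtype=bool)
    for start, end in SPEECH_SPANS:
        in_speech |= (t >= start) & (t <= end)
    return np.where(in_speech, DUCKED, FULL)

def duck(get_frame, t):
    frame = get_frame(t)
    gain = gain_at(t)
    # Frames are (n_samples, n_channels) for array t, (n_channels,) for scalar t.
    return frame * (gain[:, None] if np.ndim(t) else gain[0])

music = AudioFileClip("music.mp3")
ducked = music.transform(duck)
```

A production version would also ramp the gain over ~100 ms around each boundary to avoid audible clicks.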
Challenges & Solutions
| Challenge | Solution |
|---|---|
| Caption sync drift | Used Whisper word timestamps with 50ms tolerance |
| Ken Burns jitter | Implemented easing functions for smooth animation |
| Memory on 4K render | Chunked processing with intermediate files |
| Image matching accuracy | Domain-specific keyword expansion (finance terms) |
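The jitter fix is easing: replacing the linear `t / duration` progress in the animation with an S-curve so velocity starts and ends at zero. The exact curve used isn't shown here; smoothstep is one common choice:

```python
def ease_in_out(t: float, duration: float) -> float:
    """Smoothstep easing: maps linear time to an S-curve in [0, 1]."""
    x = min(max(t / duration, 0.0), 1.0)
    return x * x * (3.0 - 2.0 * x)
```

In the Ken Burns sample below, this would replace each `t / duration` inside `params`.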
Results
- ✓ 5-minute video generated in ~8 minutes
- ✓ 90%+ image relevance rating
- ✓ Professional caption styling
- ✓ Broadcast-quality 4K output
Code Sample - Ken Burns Effect
```python
import numpy as np
from PIL import Image
from moviepy import VideoClip  # MoviePy 2.x ("transform" was "fl" in 1.x)


def apply_ken_burns(clip: VideoClip, effect: str, duration: float) -> VideoClip:
    """Apply a cinematic Ken Burns effect to an image clip."""
    w, h = clip.size
    # Horizontal slack available for panning when zoomed to 110%.
    margin = (w - w / 1.1) / 2

    if effect == "zoom_in":
        # Start at 100%, end at 120%
        def params(t):
            return 1 + 0.2 * t / duration, (w / 2, h / 2)
    elif effect == "zoom_out":
        # Start at 120%, end at 100%
        def params(t):
            return 1.2 - 0.2 * t / duration, (w / 2, h / 2)
    elif effect == "pan_left":
        # Pan the crop window from right to left at a fixed 110% zoom
        def params(t):
            return 1.1, (w / 2 + margin * (1 - 2 * t / duration), h / 2)
    elif effect == "pan_right":
        # Pan the crop window from left to right at a fixed 110% zoom
        def params(t):
            return 1.1, (w / 2 - margin * (1 - 2 * t / duration), h / 2)
    else:
        raise ValueError(f"Unknown Ken Burns effect: {effect!r}")

    def frame_filter(get_frame, t):
        scale, (cx, cy) = params(t)
        frame = get_frame(t)
        # Crop a window 1/scale the size of the frame around the center point...
        win_w, win_h = int(w / scale), int(h / scale)
        x0 = int(np.clip(cx - win_w / 2, 0, w - win_w))
        y0 = int(np.clip(cy - win_h / 2, 0, h - win_h))
        window = frame[y0:y0 + win_h, x0:x0 + win_w]
        # ...then scale it back to full resolution: the crop window is the camera.
        return np.array(Image.fromarray(window).resize((w, h), Image.LANCZOS))

    return clip.transform(frame_filter)
```

What I'd Do Differently
- Add AI-powered image generation for missing stock footage
- Implement B-roll detection for automatic cutaway timing
- Build a web UI for non-technical users