Automated Video Generation Pipeline

Python system transforming voice memos into 4K videos with Whisper transcription, intelligent image matching, and cinematic effects

Python · Whisper · MoviePy · FFmpeg · scikit-learn · TF-IDF · Automation

Overview

Built an end-to-end video generation pipeline that takes a voice recording and automatically produces a polished 4K video with matched stock footage, burned-in captions, Ken Burns effects, and intelligent audio ducking.

The Problem

Creating "faceless" content videos is tedious: record narration, transcribe it, find matching footage for each segment, add captions, apply effects, mix audio. A 5-minute video takes 3-4 hours of editing. Content creators need automation.

Architecture Decisions

Why Whisper for transcription?

Word-level timestamps (critical for caption sync). Runs locally (no API costs at scale). High accuracy even with varied accents.
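Word-level timestamps are what make precise caption chunking possible downstream. As a minimal sketch (function name and caption format are illustrative, not the pipeline's actual API), here is how Whisper-style word timestamps can be grouped into caption chunks:

```python
def group_words_into_captions(words, max_chars=32):
    """Group Whisper-style word timestamps into caption chunks.

    `words` is a list of dicts like {"word": "hello", "start": 0.0, "end": 0.4},
    the shape Whisper produces when word timestamps are enabled.
    """
    captions, current, length = [], [], 0
    for w in words:
        # Flush the current chunk once adding this word would exceed the limit
        if current and length + len(w["word"]) + 1 > max_chars:
            captions.append({
                "text": " ".join(x["word"] for x in current),
                "start": current[0]["start"],
                "end": current[-1]["end"],
            })
            current, length = [], 0
        current.append(w)
        length += len(w["word"]) + 1
    if current:
        captions.append({
            "text": " ".join(x["word"] for x in current),
            "start": current[0]["start"],
            "end": current[-1]["end"],
        })
    return captions
```

Each chunk carries the start of its first word and the end of its last, so captions stay locked to the audio.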

Why TF-IDF for image matching?

Fast similarity computation. Works without training data. Interpretable results (can debug matches).
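The pipeline uses scikit-learn for this, but the interpretability claim is easiest to see in a self-contained sketch of the underlying idea (names and weighting details here are illustrative, not the production code):

```python
import math
from collections import Counter

def tfidf_match(segment, footage_tags):
    """Return the index of the footage description most similar to a
    transcript segment, via TF-IDF weighted cosine similarity."""
    docs = [segment] + footage_tags
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    # Document frequency of each term, then smoothed inverse frequency
    df = Counter(term for doc in tokenized for term in set(doc))
    idf = {t: math.log(n / df[t]) + 1 for t in df}

    def vec(doc):
        tf = Counter(doc)
        return {t: tf[t] * idf[t] for t in tf}

    def cosine(a, b):
        dot = sum(a[t] * b[t] for t in a if t in b)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    query = vec(tokenized[0])
    scores = [cosine(query, vec(doc)) for doc in tokenized[1:]]
    return max(range(len(scores)), key=scores.__getitem__)
```

Because the score is a sum of per-term contributions, a surprising match can be debugged by inspecting which terms drove it.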

Why MoviePy over raw FFmpeg?

Pythonic API for complex compositions. Built-in effects library. Easier timeline manipulation.

Key Features

  • Whisper Transcription: Word-level timestamps for precise caption sync
  • TF-IDF Image Matching: Intelligent pairing of transcript segments to stock footage
  • Ken Burns Effects: Automated zoom/pan animations (zoom in, zoom out, pan left/right)
  • Caption Burning: Styled captions at 85% screen position with word wrapping
  • Audio Ducking: Music volume drops during speech, rises during silence
  • 4K Output: Professional quality rendering
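The ducking behaviour can be expressed as a time-varying gain on the music track: reduced during speech, full during silence, with a short fade at each boundary. A minimal sketch (segment format, gain levels, and fade length are illustrative):

```python
def ducking_gain(t, speech_segments, duck=0.2, full=1.0, fade=0.3):
    """Music gain at time t: `duck` during speech, `full` during silence,
    with a linear fade of `fade` seconds at each speech boundary."""
    for start, end in speech_segments:
        if start - fade <= t <= end + fade:
            if t < start:   # fading down into speech
                return full - (full - duck) * (t - (start - fade)) / fade
            if t > end:     # fading back up after speech
                return duck + (full - duck) * (t - end) / fade
            return duck     # fully ducked while speech is active
    return full
```

In MoviePy this kind of envelope can be applied by multiplying the music's samples by the gain at each timestamp.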

Challenges & Solutions

| Challenge | Solution |
| --- | --- |
| Caption sync drift | Used Whisper word timestamps with 50 ms tolerance |
| Ken Burns jitter | Implemented easing functions for smooth animation |
| Memory pressure on 4K renders | Chunked processing with intermediate files |
| Image matching accuracy | Domain-specific keyword expansion (finance terms) |
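The drift fix amounts to snapping an estimated caption boundary to the nearest Whisper word timestamp whenever it falls within the tolerance; a sketch of that idea (function name is illustrative):

```python
def snap_to_word_boundary(t, word_times, tolerance=0.05):
    """Snap a caption boundary `t` (seconds) to the nearest word timestamp
    if it lies within `tolerance` (50 ms by default); otherwise keep it."""
    nearest = min(word_times, key=lambda w: abs(w - t))
    return nearest if abs(nearest - t) <= tolerance else t
```

Boundaries inside the tolerance lock to real word onsets; ones further away are left alone rather than yanked to the wrong word.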

Results

  • 5-minute video generated in ~8 minutes
  • 90%+ image relevance rating
  • Professional caption styling
  • Broadcast-quality 4K output

Code Sample - Ken Burns Effect

```python
import numpy as np
from moviepy import VideoClip


def apply_ken_burns(clip: VideoClip, effect: str, duration: float) -> VideoClip:
    """Apply a cinematic Ken Burns effect to an image clip."""

    w, h = clip.size

    # Each branch maps time t to a (scale, center) pair.
    if effect == "zoom_in":
        # Start at 100%, end at 120%
        def params(t):
            return 1 + 0.2 * t / duration, (w / 2, h / 2)

    elif effect == "zoom_out":
        # Start at 120%, end at 100%
        def params(t):
            return 1.2 - 0.2 * t / duration, (w / 2, h / 2)

    elif effect == "pan_left":
        # Pan from right to left at a fixed 110% zoom
        def params(t):
            x_offset = (w * 0.1) * (1 - t / duration)
            return 1.1, (w / 2 + x_offset, h / 2)

    elif effect == "pan_right":
        # Pan from left to right at a fixed 110% zoom
        def params(t):
            x_offset = (w * 0.1) * (t / duration)
            return 1.1, (w / 2 - w * 0.1 + x_offset, h / 2)

    else:
        raise ValueError(f"Unknown Ken Burns effect: {effect!r}")

    def zoom_frame(get_frame, t):
        """Crop a scaled window around the center, then resize to full frame."""
        frame = get_frame(t)
        scale, (cx, cy) = params(t)
        crop_w, crop_h = int(w / scale), int(h / scale)
        x1 = int(np.clip(cx - crop_w / 2, 0, w - crop_w))
        y1 = int(np.clip(cy - crop_h / 2, 0, h - crop_h))
        window = frame[y1:y1 + crop_h, x1:x1 + crop_w]
        # Nearest-neighbour upscale back to the original resolution
        ys = np.linspace(0, crop_h - 1, h).astype(int)
        xs = np.linspace(0, crop_w - 1, w).astype(int)
        return window[ys][:, xs]

    return clip.transform(zoom_frame)
```
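The linear `t / duration` progress in the code above is what produced the jitter noted earlier; the fix was to run progress through an easing curve before computing scale and offset. A smoothstep ease-in-out, for example:

```python
def ease_in_out(progress):
    """Smoothstep easing: zero velocity at both ends, so zooms and pans
    accelerate and decelerate gently instead of starting abruptly."""
    p = min(max(progress, 0.0), 1.0)   # clamp to [0, 1]
    return p * p * (3 - 2 * p)
```

A zoom-in then becomes `scale = 1 + 0.2 * ease_in_out(t / duration)` instead of the raw linear ramp.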

What I'd Do Differently

  • Add AI-powered image generation for missing stock footage
  • Implement B-roll detection for automatic cutaway timing
  • Build web UI for non-technical users