Automated Video Generation Pipeline

Python system transforming voice memos into 4K videos with Whisper transcription, intelligent image matching, and cinematic effects

Python · Whisper · MoviePy · FFmpeg · scikit-learn · TF-IDF · Automation

Overview

Built an end-to-end video generation pipeline that takes a voice recording and automatically produces a polished 4K video with matched stock footage, burned-in captions, Ken Burns effects, and intelligent audio ducking.

The Problem

Creating "faceless" content videos is tedious: record narration, transcribe it, find matching footage for each segment, add captions, apply effects, mix audio. A 5-minute video takes 3-4 hours of editing. Content creators need automation.

Architecture Decisions

Why Whisper for transcription?

Word-level timestamps (critical for caption sync). Runs locally (no API costs at scale). High accuracy even with varied accents.
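Word-level timestamps are what make precise caption chunking possible downstream. As a minimal sketch (function name and caption format are illustrative, not the pipeline's actual API), here is how Whisper-style word timestamps can be grouped into caption chunks:

```python
def group_words_into_captions(words, max_chars=32):
    """Group Whisper-style word timestamps into caption chunks.

    `words` is a list of dicts like {"word": "hello", "start": 0.0, "end": 0.4},
    the shape Whisper produces when word timestamps are enabled.
    """
    captions, current, length = [], [], 0
    for w in words:
        # Flush the current chunk once adding this word would exceed the limit
        if current and length + len(w["word"]) + 1 > max_chars:
            captions.append({
                "text": " ".join(x["word"] for x in current),
                "start": current[0]["start"],
                "end": current[-1]["end"],
            })
            current, length = [], 0
        current.append(w)
        length += len(w["word"]) + 1
    if current:
        captions.append({
            "text": " ".join(x["word"] for x in current),
            "start": current[0]["start"],
            "end": current[-1]["end"],
        })
    return captions
```

Each chunk carries the start of its first word and the end of its last, so captions stay locked to the audio.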

Why TF-IDF for image matching?

Fast similarity computation. Works without training data. Interpretable results (can debug matches).
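The pipeline uses scikit-learn for this, but the interpretability claim is easiest to see in a self-contained sketch of the underlying idea (names and weighting details here are illustrative, not the production code):

```python
import math
from collections import Counter

def tfidf_match(segment, footage_tags):
    """Return the index of the footage description most similar to a
    transcript segment, via TF-IDF weighted cosine similarity."""
    docs = [segment] + footage_tags
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    # Document frequency of each term, then smoothed inverse frequency
    df = Counter(term for doc in tokenized for term in set(doc))
    idf = {t: math.log(n / df[t]) + 1 for t in df}

    def vec(doc):
        tf = Counter(doc)
        return {t: tf[t] * idf[t] for t in tf}

    def cosine(a, b):
        dot = sum(a[t] * b[t] for t in a if t in b)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    query = vec(tokenized[0])
    scores = [cosine(query, vec(doc)) for doc in tokenized[1:]]
    return max(range(len(scores)), key=scores.__getitem__)
```

Because the score is a sum of per-term contributions, a surprising match can be debugged by inspecting which terms drove it.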

Why MoviePy over raw FFmpeg?

Pythonic API for complex compositions. Built-in effects library. Easier timeline manipulation.

Key Features

  • Whisper Transcription: Word-level timestamps for precise caption sync
  • TF-IDF Image Matching: Intelligent pairing of transcript segments to stock footage
  • Ken Burns Effects: Automated zoom/pan animations (zoom in, zoom out, pan left/right)
  • Caption Burning: Styled captions at 85% screen position with word wrapping
  • Audio Ducking: Music volume drops during speech, rises during silence
  • 4K Output: Professional quality rendering
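The ducking behaviour can be expressed as a time-varying gain on the music track: reduced during speech, full during silence, with a short fade at each boundary. A minimal sketch (segment format, gain levels, and fade length are illustrative):

```python
def ducking_gain(t, speech_segments, duck=0.2, full=1.0, fade=0.3):
    """Music gain at time t: `duck` during speech, `full` during silence,
    with a linear fade of `fade` seconds at each speech boundary."""
    for start, end in speech_segments:
        if start - fade <= t <= end + fade:
            if t < start:   # fading down into speech
                return full - (full - duck) * (t - (start - fade)) / fade
            if t > end:     # fading back up after speech
                return duck + (full - duck) * (t - end) / fade
            return duck     # fully ducked while speech is active
    return full
```

In MoviePy this kind of envelope can be applied by multiplying the music's samples by the gain at each timestamp.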

Challenges & Solutions

| Challenge | Solution |
| --- | --- |
| Caption sync drift | Used Whisper word timestamps with 50 ms tolerance |
| Ken Burns jitter | Implemented easing functions for smooth animation |
| Memory pressure on 4K renders | Chunked processing with intermediate files |
| Image matching accuracy | Domain-specific keyword expansion (finance terms) |
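The drift fix amounts to snapping an estimated caption boundary to the nearest Whisper word timestamp whenever it falls within the tolerance; a sketch of that idea (function name is illustrative):

```python
def snap_to_word_boundary(t, word_times, tolerance=0.05):
    """Snap a caption boundary `t` (seconds) to the nearest word timestamp
    if it lies within `tolerance` (50 ms by default); otherwise keep it."""
    nearest = min(word_times, key=lambda w: abs(w - t))
    return nearest if abs(nearest - t) <= tolerance else t
```

Boundaries inside the tolerance lock to real word onsets; ones further away are left alone rather than yanked to the wrong word.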

Results

  • 5-minute video generated in ~8 minutes
  • 90%+ image relevance rating
  • Professional caption styling
  • Broadcast-quality 4K output

Code Sample - Ken Burns Effect

```python
import numpy as np
from moviepy import VideoClip


def apply_ken_burns(clip: VideoClip, effect: str, duration: float) -> VideoClip:
    """Apply a cinematic Ken Burns effect to an image clip."""

    w, h = clip.size

    # Each branch maps time t to a (scale, center) pair.
    if effect == "zoom_in":
        # Start at 100%, end at 120%
        def params(t):
            return 1 + 0.2 * t / duration, (w / 2, h / 2)

    elif effect == "zoom_out":
        # Start at 120%, end at 100%
        def params(t):
            return 1.2 - 0.2 * t / duration, (w / 2, h / 2)

    elif effect == "pan_left":
        # Pan from right to left at a fixed 110% zoom
        def params(t):
            x_offset = (w * 0.1) * (1 - t / duration)
            return 1.1, (w / 2 + x_offset, h / 2)

    elif effect == "pan_right":
        # Pan from left to right at a fixed 110% zoom
        def params(t):
            x_offset = (w * 0.1) * (t / duration)
            return 1.1, (w / 2 - w * 0.1 + x_offset, h / 2)

    else:
        raise ValueError(f"Unknown Ken Burns effect: {effect!r}")

    def zoom_frame(get_frame, t):
        """Crop a scaled window around the center, then resize to full frame."""
        frame = get_frame(t)
        scale, (cx, cy) = params(t)
        crop_w, crop_h = int(w / scale), int(h / scale)
        x1 = int(np.clip(cx - crop_w / 2, 0, w - crop_w))
        y1 = int(np.clip(cy - crop_h / 2, 0, h - crop_h))
        window = frame[y1:y1 + crop_h, x1:x1 + crop_w]
        # Nearest-neighbour upscale back to the original resolution
        ys = np.linspace(0, crop_h - 1, h).astype(int)
        xs = np.linspace(0, crop_w - 1, w).astype(int)
        return window[ys][:, xs]

    return clip.transform(zoom_frame)
```
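The linear `t / duration` progress in the code above is what produced the jitter noted earlier; the fix was to run progress through an easing curve before computing scale and offset. A smoothstep ease-in-out, for example:

```python
def ease_in_out(progress):
    """Smoothstep easing: zero velocity at both ends, so zooms and pans
    accelerate and decelerate gently instead of starting abruptly."""
    p = min(max(progress, 0.0), 1.0)   # clamp to [0, 1]
    return p * p * (3 - 2 * p)
```

A zoom-in then becomes `scale = 1 + 0.2 * ease_in_out(t / duration)` instead of the raw linear ramp.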

What I'd Do Differently

  • Add AI-powered image generation for missing stock footage
  • Implement B-roll detection for automatic cutaway timing
  • Build web UI for non-technical users