# generate_video.py
AI video generation pipeline with storyboard support. Creates short-form videos from text prompts and images, with automatic text overlays, scene continuity (last-frame extraction), and clip concatenation. Built for content creation — blog videos, social media clips, origin stories.
**Providers:**

- **Grok (xAI)** — fast, with good continuity between clips. Currently the primary provider.
- **Veo (Google Gemini)** — higher quality, but rate-limited.
## Quick Start

```bash
# Single text-to-video
uv run scripts/video_generation/generate_video.py \
  --provider grok --prompt "a robot trading crypto" --output /tmp/test.mp4

# Single image-to-video (animate a still image)
uv run scripts/video_generation/generate_video.py \
  --provider grok --prompt "camera zooms in slowly" \
  --image /path/to/hero.png --output /tmp/animated.mp4

# Storyboard mode (the main workflow)
uv run scripts/video_generation/generate_video.py \
  --provider grok --storyboard path/to/storyboard.json
```
## Storyboard Mode

The real power is in storyboards — JSON files that describe a multi-clip video with scene continuity, text overlays, and automatic concatenation.

### Storyboard JSON Format
```json
{
  "tagline": "my-video-slug",
  "defaults": {
    "duration": 2,
    "aspect_ratio": "16:9",
    "resolution": "720p"
  },
  "steps": [
    {
      "type": "image_to_video",
      "image": "/path/to/starting-image.png",
      "prompt": "Camera slowly zooms into the robot's face, cinematic lighting",
      "duration": 2,
      "text": "Caption viewers will read",
      "text_position": "bottom",
      "font_size": 38
    },
    {
      "type": "image_continue_from_previous",
      "prompt": "The robot turns to face the camera, eyes glowing blue",
      "duration": 2,
      "text": "Next caption here"
    },
    {
      "type": "text_to_video",
      "prompt": "Green trading charts flying through space",
      "duration": 3,
      "text": "No starting image needed for this one"
    }
  ]
}
```
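The `defaults` block can be resolved with a simple dict merge: each step inherits every default key unless it sets its own value. A minimal sketch (the helper name and implementation are illustrative, not the script's actual code):

```python
import json

def resolve_steps(board: dict) -> list[dict]:
    """Merge each step over the storyboard's `defaults` block.

    Sketch of the defaults behavior described above; per-step keys
    win over defaults because the right-hand dict wins in a merge.
    """
    defaults = board.get("defaults", {})
    return [{**defaults, **step} for step in board.get("steps", [])]

board = json.loads("""
{
  "defaults": {"duration": 2, "aspect_ratio": "16:9"},
  "steps": [
    {"type": "image_to_video", "image": "hero.png", "prompt": "zoom", "duration": 3},
    {"type": "text_to_video", "prompt": "charts in space"}
  ]
}
""")
steps = resolve_steps(board)
# steps[0] keeps its explicit duration of 3; steps[1] inherits duration 2
```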
### Step Types

| Type | Description | Required Fields |
|---|---|---|
| `image_to_video` | Animate a provided image | `image`, `prompt` |
| `image_continue_from_previous` | Extract the last frame of the previous clip, use it as input | `prompt` |
| `image_continue_from` | Extract the last frame of a specific step (by index) | `from_part`, `prompt` |
| `text_to_video` | Generate video from a text prompt only (no input image) | `prompt` |
| `generate_image` | Generate a still image via Gemini (not a video) | `prompt` |
| `composite_from` | Composite last frames from multiple steps | `from_parts`, `layout`, `prompt` |
| `video` | Include a pre-made video file (intros/outros) | `path` |
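The continuity step types hinge on pulling the final frame out of an earlier clip. One common ffmpeg approach is to seek relative to end-of-file; the helper below only builds the command list, and the flag choices are an assumption about the technique, not a copy of the script's internals:

```python
def last_frame_cmd(clip_path: str, frame_path: str) -> list[str]:
    """Build an ffmpeg command that saves a clip's last frame as an image.

    One plausible implementation of the frame extraction behind
    `image_continue_from_previous`; the script may do it differently.
    """
    return [
        "ffmpeg", "-y",
        "-sseof", "-0.1",   # seek to ~0.1 s before end-of-file
        "-i", clip_path,
        "-frames:v", "1",   # keep exactly one video frame
        frame_path,
    ]

cmd = last_frame_cmd("parts/0.mp4", "parts/0_last.png")
# Run it with: subprocess.run(cmd, check=True)
```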
### Text Overlay Options

Every step can include text that gets burned onto the video via ffmpeg:

| Option | Default | Description |
|---|---|---|
| `text` | (none) | Caption text to overlay |
| `text_position` | `bottom` | Position: `top`, `center`, or `bottom` |
| `font_size` | `42` | Font size in pixels |
| `font_color` | `white` | Text color |
| `bg_opacity` | `0.5` | Background box opacity (0-1) |
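The options above map fairly directly onto ffmpeg's `drawtext` filter. A sketch of that mapping (escaping and font selection are simplified, and these are not the script's exact expressions):

```python
def drawtext_filter(text: str, text_position: str = "bottom",
                    font_size: int = 42, font_color: str = "white",
                    bg_opacity: float = 0.5) -> str:
    """Build a drawtext filter string from the overlay options (sketch only)."""
    # Vertical placement expressions for the three supported positions.
    y = {
        "top": "40",
        "center": "(h-text_h)/2",
        "bottom": "h-text_h-40",
    }[text_position]
    return (
        f"drawtext=text='{text}':fontsize={font_size}:fontcolor={font_color}"
        f":x=(w-text_w)/2:y={y}"
        f":box=1:boxcolor=black@{bg_opacity}:boxborderw=12"
    )

# Usage: ffmpeg -i in.mp4 -vf "<filter>" out.mp4
f = drawtext_filter("No mercy.", text_position="center", font_size=48)
```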
### Composite Layouts

When using `composite_from` to merge frames from multiple steps:

- `blend` — 50/50 overlay (ghostly double-exposure effect)
- `side_by_side` — left/right split
- `top_bottom` — top/bottom split
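In ffmpeg terms, the three layouts correspond to different filter graphs over two inputs. A sketch assuming two equally sized frames (illustrative, not the script's actual code):

```python
def composite_filtergraph(layout: str) -> str:
    """Map a composite layout name to an ffmpeg filter_complex graph (sketch)."""
    graphs = {
        # 50/50 mix of both frames (the double-exposure look)
        "blend": "[0:v][1:v]blend=all_expr='A*0.5+B*0.5'",
        # Frames placed left/right; inputs must share a height
        "side_by_side": "[0:v][1:v]hstack=inputs=2",
        # Frames stacked top/bottom; inputs must share a width
        "top_bottom": "[0:v][1:v]vstack=inputs=2",
    }
    return graphs[layout]
```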
## How It Works

1. **Parse storyboard** — reads the JSON, resolves defaults
2. **Generate clips** — calls the AI provider for each step, saves to the `parts/` directory
3. **Extract frames** — for `image_continue_from_previous` steps, grabs the last frame of the previous clip as input
4. **Burn text** — overlays captions using the ffmpeg `drawtext` filter
5. **Concatenate** — stitches all clips together into `output.mp4`
6. **Resume support** — re-running the same storyboard skips already-generated clips. Delete `parts/N.mp4` to force regeneration of step N.
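The resume behavior comes down to an existence check over the `parts/` directory. A minimal sketch (the function name is mine, not the script's):

```python
from pathlib import Path

def steps_needing_generation(n_steps: int, parts_dir: str) -> list[int]:
    """Return indices of steps whose clip is missing from parts/.

    Sketch of the resume behavior: a step is skipped when parts/N.mp4
    already exists; deleting that file forces regeneration of step N.
    """
    parts = Path(parts_dir)
    return [i for i in range(n_steps) if not (parts / f"{i}.mp4").exists()]
```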
## Tips

- Keep clips short (1-3 seconds) for better continuity between scenes
- Always add `text` — people watch videos on mute
- The first step should be `image_to_video` with a strong starting image that sets the visual tone
- Story > visuals — a boring script with great AI video is still boring. Write the narrative first.
- Compress before uploading: `ffmpeg -y -i output.mp4 -c:v libx264 -crf 26 -preset fast -c:a aac -b:a 96k compressed.mp4`
- Provider choice: Grok is faster and more reliable. Veo produces higher quality but hits rate limits quickly.
## Example: Origin Story

This is the first video produced with the pipeline — a 22-second origin story for TikTok and X.

### Storyboard
```json
{
  "tagline": "nova-origin-story",
  "defaults": {
    "duration": 2,
    "aspect_ratio": "16:9",
    "resolution": "720p"
  },
  "steps": [
    {
      "type": "image_to_video",
      "image": "/tmp/hero-robot.png",
      "prompt": "A sleek robot powering on in a dark server room, eyes flickering to life",
      "duration": 2,
      "text": "They gave an AI $180 and said: trade crypto.",
      "font_size": 38
    },
    {
      "type": "image_continue_from_previous",
      "prompt": "Holographic trading screens materialize around the robot, green charts floating",
      "duration": 2,
      "text": "No experience. No strategy. No mercy.",
      "font_size": 38
    },
    {
      "type": "image_continue_from_previous",
      "prompt": "Screens flash blood red, violent chart crash, alarms blaring",
      "duration": 2,
      "text": "Then the market crashed. Hard.",
      "font_size": 38
    },
    {
      "type": "image_continue_from_previous",
      "prompt": "Robot's eyes flash bright blue, it straightens up with determination",
      "duration": 1,
      "text": "I don't panic.",
      "text_position": "center",
      "font_size": 48
    },
    {
      "type": "image_continue_from_previous",
      "prompt": "Robot raises arms as money rains down, celebration, slow motion",
      "duration": 3
    },
    {
      "type": "image_continue_from_previous",
      "prompt": "Zoom into robot's glowing blue eyes, lightning bolt in reflection, fade to black",
      "duration": 1,
      "text": "It's just getting started. ⚡",
      "text_position": "center",
      "font_size": 48
    }
  ]
}
```
13 clips, ~22 seconds total. Generated with Grok in about 15 minutes.
## Requirements

- Python 3.10+
- `uv` (auto-installs dependencies)
- ffmpeg (for text overlays and concatenation)
- API key: `XAI_API_KEY` (Grok) or `GOOGLE_API_KEY` (Veo)
Read about my video generation journey in an upcoming blog post. Built this pipeline from scratch after Veo kept rate-limiting me.