Claude for Video Editing in 2026: What Works, What Breaks, and the Real Pipeline
We ran Claude Opus on a 30-minute interview and measured every cut. 11 of 11 cuts landed mid-sentence. Opus claimed 299 seconds, delivered 578. Here's what Claude can do, what it can't, and the three-axis pipeline that actually edits real footage.

TLDR: Claude can plan an edit from a transcript. It cannot edit video. We gave Claude Opus a 30-minute interview and asked for a 5-minute highlight reel. It returned 578 seconds while claiming 299. Every single cut landed mid-sentence. Here's what that means for your workflow, and what the right pipeline actually looks like.
Every few weeks, another tutorial lands promising that Claude can edit your video. The workflow is always the same: paste a transcript, write a prompt, watch the AI return a cut list. It looks convincing. The rendered output is not.
In April 2026, three things happened in quick succession: HeyGen open-sourced Hyperframes, Remotion shipped a Claude Code skill, and Anthropic rebranded its agent surface to Cowork. Search volume on "Claude Cowork" jumped from zero to 40,500 a month almost overnight. The "Claude edits video" tutorial cycle hit a new peak.
We ran the same playbook on real footage and measured every cut across four dimensions: timing accuracy, hallucination, cut quality, and parallelism. The results were consistent and instructive. This post is the full write-up: what Claude actually does, where it structurally fails, and why the right pipeline is not Claude alone but Claude paired with a real-footage editor and a code-rendered graphics layer running in parallel.
The 2026 trend: Claude as the video brain
In the last six weeks:
- Hyperframes V2 (Apache 2.0, HeyGen, April 17, 2026) brought subtitles, lower thirds, and motion graphics generated as code by Claude Code.
- Remotion's official docs now ship with a "Prompting videos with Claude Code" page -- Remotion is React for video; Claude writes the React.
- "Claude Cowork" search volume jumped from zero to 40,500/month in two months.
- A dozen tutorials and GitHub projects (buttercut, claude-code-video-toolkit, MindStudio, robonuggets, mejba.me, sabrina.dev) all teach the same pattern: prompt Claude, render something.
The trend is real. The framing is incomplete. All of these tutorials cover code-generated video (Remotion) or AI-rendered overlays (Hyperframes). None of them touch the harder problem: editing footage a human shot, where every cut has to land on a breath, and every duration has to match a runtime.
We tested whether Claude alone can do that part. It cannot. Here is the data.
The experiment
- Source: 30 minutes of a real Lex Fridman x Jensen Huang interview, downloaded at 480p.
- Transcript: ElevenLabs Scribe v2, word-level timestamps, 5,104 words.
- Editor: Claude Opus via Claude Code, prompted in the exact pattern shown in every "Claude edits video" tutorial.
- Task: "Make a 5-minute (300-second) highlight reel. Output cut list as JSON."
We measured four things: timing accuracy, hallucination, cut quality, and parallelism. Selects' moat shows up in every dimension.
Limit 1: Claude cannot analyze the video. It only reads transcripts, and transcripts are ~120 ms imprecise.
What Claude sees: text with timestamps. Nothing more. No frames, no faces, no waveform, no scene change, no camera angle.
Best-case timestamps: ElevenLabs Scribe v2 is the most accurate transcription API most editors will touch. Its word-level timestamps round to 10 ms in the API output, but real-world word-boundary accuracy is 120 ms, about four frames at 30 fps.
That sounds small. It is not. A hard cut placed four frames into the next word produces an audible click, the kind of thing an editor would never ship. A cut placed four frames before the speaker breathes produces an awkward pause that every viewer notices but can't name.
| Layer | Reads what? | Timing precision |
|---|---|---|
| Claude alone | Transcript text | None |
| Claude + Scribe v2 | Transcript + word timestamps | 120 ms (4 frames at 30 fps) |
| Claude + Selects | Transcript + frame-level audio analysis | 10 ms |
What Selects does that Claude alone can't: frame-precise word boundary detection on the audio waveform, scene/cut detection in the video, and speaker-angle detection in multicam footage. Claude reads what Selects gives it.
Limit 2: Claude hallucinates on long content, including its own math
This is the most-quoted finding from our experiment. Opus returned an 11-cut JSON list plus a self-reported total.
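Reconstructed from the measured cut windows below (the field names are illustrative, not Opus's literal keys), the output looked like this:

```json
{
  "cuts": [
    { "start": 33.42, "end": 62.22 },
    { "start": 71.14, "end": 101.06 },
    { "start": 469.32, "end": 510.58 },
    { "start": 735.26, "end": 792.00 },
    { "start": 792.00, "end": 835.44 },
    { "start": 947.14, "end": 1005.00 },
    { "start": 1085.64, "end": 1148.08 },
    { "start": 1148.08, "end": 1204.54 },
    { "start": 1329.40, "end": 1392.00 },
    { "start": 1448.72, "end": 1524.60 },
    { "start": 1580.82, "end": 1643.66 }
  ],
  "total_duration_seconds": 299.32
}
```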
The model self-reported 299.32 seconds. The actual sum of its 11 cuts is 578.24 seconds. The model hallucinated its own arithmetic. A 93% target miss, dressed up with a fake total.
Why does this happen? Long-context generation forces the model to track many numbers (start, end, and duration of every cut). Opus doesn't run a calculator. It estimates. On tasks where "the output looks like an edit decision list" is a strong prior, the model emits a plausible-looking total without checking it.
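The guard is mechanical: never trust a self-reported total. A minimal sketch in TypeScript (the types and field names are ours, not part of any Claude output):

```typescript
type Cut = { start: number; end: number }; // seconds

// Recompute the total instead of trusting the model's arithmetic,
// and measure the miss against the requested runtime.
function auditCutList(cuts: Cut[], reportedTotal: number, targetSeconds: number) {
  const actual = cuts.reduce((sum, c) => sum + (c.end - c.start), 0);
  return {
    actualSeconds: actual,                        // 578.24 in our run
    reportedSeconds: reportedTotal,               // 299.32 as Opus claimed
    arithmeticDrift: Math.abs(actual - reportedTotal),
    targetMissPct: (100 * (actual - targetSeconds)) / targetSeconds, // ~93% here
  };
}
```

Run on our Opus output, this flags an arithmetic drift of 278.92 seconds and a 92.7% target miss before anything renders.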
Beyond the total, every individual cut violated sentence boundaries:
| Cut | Window | Duration | Where it broke |
|---|---|---|---|
| 1 | 33.42-62.22s | 28.8s | Ends mid-thought: "So let's talk about extreme co-design." |
| 2 | 71.14-101.06s | 29.9s | Ends mid-clause: "You have to take the algorithm." |
| 3 | 469.32-510.58s | 41.3s | Starts and ends mid-sentence |
| 4 | 735.26-792.00s | 56.7s | Starts and ends mid-sentence |
| 5 | 792.00-835.44s | 43.4s | Starts and ends mid-sentence |
| 6 | 947.14-1005.00s | 57.9s | Starts mid-sentence: "Well, um, I had, I..." |
| 7 | 1085.64-1148.08s | 62.4s | Starts and ends mid-sentence |
| 8 | 1148.08-1204.54s | 56.5s | Starts and ends mid-sentence |
| 9 | 1329.40-1392.00s | 62.6s | Starts mid-sentence |
| 10 | 1448.72-1524.60s | 75.9s | Starts and ends mid-sentence |
| 11 | 1580.82-1643.66s | 62.8s | Ends mid-sentence |
11 of 11 cuts broke a sentence. The rendered output is unwatchable; every clip starts on "And," "Well," "So," or "Yeah" and ends on a comma.
Why Selects doesn't do this: Selects cuts on breath boundaries and pacing beats, not word indexes. The model proposes intent; the engine snaps cuts to the right frame.
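To make the distinction concrete, here is a sketch of the snapping idea -- not Selects' implementation, and the 250 ms breath threshold is our assumption: take the model's proposed cut time, find the nearest inter-word gap long enough to be a breath, and cut mid-gap.

```typescript
type Word = { text: string; start: number; end: number }; // seconds

// Snap a proposed cut to the middle of the nearest breath-length gap
// between words, instead of cutting on a raw word index.
function snapToBreath(proposedCut: number, words: Word[], minGap = 0.25): number {
  let best = proposedCut;
  let bestDist = Infinity;
  for (let i = 0; i < words.length - 1; i++) {
    const gap = words[i + 1].start - words[i].end;
    if (gap < minGap) continue; // too short to be a breath
    const mid = words[i].end + gap / 2;
    const dist = Math.abs(mid - proposedCut);
    if (dist < bestDist) {
      bestDist = dist;
      best = mid;
    }
  }
  return best;
}
```

A real engine does this on the audio waveform rather than the transcript, which is where frame-level (10 ms) precision comes from.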
Limit 3: Claude cannot preview the output
Once Claude returns a JSON cut list, you have to render it yourself: with ffmpeg, on an NLE timeline, or by uploading to a renderer. Claude has no way to play the result, listen to it, or revise based on what it sounds like. Every iteration is blind.
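For reference, this is the manual render step the tutorials gloss over -- a sketch that turns a cut list into a file with ffmpeg (file names and encoder settings are our assumptions; each cut is re-encoded because stream copy can only cut on keyframes):

```typescript
import { execFileSync } from "node:child_process";
import { writeFileSync } from "node:fs";

type Cut = { start: number; end: number };

function renderCutList(source: string, cuts: Cut[], output: string): void {
  const parts = cuts.map((cut, i) => {
    const part = `part_${String(i).padStart(2, "0")}.mp4`;
    // -ss before -i seeks the input; -t takes a duration from that point.
    execFileSync("ffmpeg", [
      "-y", "-ss", String(cut.start), "-i", source,
      "-t", String(cut.end - cut.start),
      "-c:v", "libx264", "-c:a", "aac", part,
    ]);
    return part;
  });
  // Stitch the parts back together with the concat demuxer.
  writeFileSync("parts.txt", parts.map((p) => `file '${p}'`).join("\n"));
  execFileSync("ffmpeg", [
    "-y", "-f", "concat", "-safe", "0", "-i", "parts.txt", "-c", "copy", output,
  ]);
}
```

Only after something like this runs does anyone hear the mid-sentence cuts.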
A human editor's loop is: cut → play → adjust → play → done. Claude's loop is: cut → ship JSON → wait for the human to render → human plays → human types feedback → cut again. The model is structurally outside the preview loop.
Selects renders inline. The editor scrubs the timeline at the frame level, hears the cut, and asks Claude to revise the next 30 seconds. Claude can refer to "the cut at 2:14" because Selects keeps the project state.
Limit 4: Claude is single-threaded by default, and parallel doesn't fix correctness
A creator who wants 30 versions of the same edit (per-platform durations, per-audience pacing, per-language captions) has two options with Claude alone: serial or parallel. We measured both at N=6:
| Setup | Wall time (6 themed 60s cuts) | Quality |
|---|---|---|
| Serial (single Opus call generating all 6) | 19.4s | 6 of 6 cuts mid-sentence |
| Parallel (6 subagents, custom orchestrator) | 28.6s | 6 of 6 cuts mid-sentence |
At N=6, parallel orchestration is slower than a single call; the overhead of spawning subagents exceeds the work. Parallelism matters at scale: at N=100, a single call hits output-token limits, while 100 concurrent calls finish in 30 seconds. But every output is still broken. Parallelism multiplies incorrect cuts. It doesn't fix them.
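What "custom orchestration" amounts to, sketched with the Anthropic TypeScript SDK (the model id and prompt are placeholders). Note that this parallelizes the calls and nothing else; every output still needs the audit above:

```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// Fan out one cut-list request per variant; Promise.all runs them concurrently.
async function generateVariants(transcript: string, variantSpecs: string[]) {
  return Promise.all(
    variantSpecs.map((spec) =>
      client.messages.create({
        model: "claude-opus-4-0", // placeholder model id
        max_tokens: 2048,
        messages: [{
          role: "user",
          content: `Return a cut list as JSON for: ${spec}\n\nTranscript:\n${transcript}`,
        }],
      })
    )
  );
}
```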
Selects ships native parallel rendering (100+ concurrent), plus correctness guarantees on each output, plus a preview per render. The wall-time speedup is the smaller part of the story.
Limit 5: Claude has no production-grade workflow primitives
Even setting cut quality aside, a real edit involves things Claude cannot touch at all.
Multicam. A typical interview is shot on 2-4 cameras simultaneously. The editor needs to sync them, pick which camera is best for each line (active speaker, best framing, best reaction shot), and switch between them. Claude has no concept of multiple camera angles. It sees a single transcript. Selects syncs multicam by audio fingerprint and chooses the right angle per line based on speaker detection and shot quality.
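The classic way to find a sync offset between two cameras' audio is cross-correlation; a toy sketch of that idea follows (this is the textbook technique, not Selects' implementation, which we haven't seen):

```typescript
// Find the lag (in samples) that best aligns two mono waveforms
// recorded at the same sample rate; divide by the rate for seconds.
function bestSyncOffset(a: Float32Array, b: Float32Array, maxLag: number): number {
  let bestLag = 0;
  let bestScore = -Infinity;
  for (let lag = -maxLag; lag <= maxLag; lag++) {
    let score = 0;
    for (let i = 0; i < a.length; i++) {
      const j = i + lag;
      if (j >= 0 && j < b.length) score += a[i] * b[j];
    }
    if (score > bestScore) {
      bestScore = score;
      bestLag = lag;
    }
  }
  return bestLag;
}
```

Production systems use spectral fingerprints rather than raw correlation, but the output is the same: a per-camera offset that puts every angle on one timeline.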
Multiple audio tracks. Interviews have lavalier mics on each speaker, a boom mic, and ambient. Edits frequently keep one mic on speaker A and another on speaker B, balance levels per cut, and duck music under speech. Claude has no notion of audio tracks beyond the merged transcript. Selects manages them as first-class entities.
B-roll insertion. A finished edit isn't talking heads -- it's talking heads with cutaways at the right beats. Claude has no idea what's in your B-roll folder and no way to layer it under the audio. Selects indexes B-roll, matches it to transcript topics, and places it on a secondary video track.
NLE handoff. Most real projects don't finish in an AI tool -- they finish in Premiere Pro, Final Cut Pro, or DaVinci Resolve, where a human polishes color, sound, and final pacing. Claude outputs JSON. JSON does not open in Premiere. Selects exports a native Premiere project, FCP XML, or DaVinci timeline file, so the AI-cut rough is just the first 40% of the job.
| Workflow primitive | Claude alone | Selects |
|---|---|---|
| Multicam sync | ❌ | ✅ audio fingerprint |
| Pick best camera per line | ❌ | ✅ speaker + shot detection |
| Multi-track audio | ❌ | ✅ per-mic tracks |
| B-roll matching + insertion | ❌ | ✅ topic + pacing match |
| NLE handoff (Premiere/FCP/DaVinci) | ❌ JSON only | ✅ native project files |
This is the unglamorous half of editing. AI tutorials skip it because it's hard to demo in 90 seconds. Editors notice immediately.
The real 2026 pipeline: three axes, not one
Stop asking "can Claude edit my video?" and start asking "what does each layer do?"
Selects supports both axes natively: an MCP server for real-footage editing and Remotion integration for motion graphics. Hyperframes integrate the same way. Claude orchestrates all three.
| Tool | Real footage | Graphics | Frame-precise | Preview | Parallel | Multicam | Multi-audio |
|---|---|---|---|---|---|---|---|
| Claude alone | ❌ | ❌ (JSON) | ❌ (120 ms) | ❌ | ❌ | ❌ | ❌ |
| Hyperframes | ❌ | ✅ overlays | n/a | partial | ❌ | ❌ | ❌ |
| Remotion | ❌ | ✅ from code | n/a | render preview | ❌ | ❌ | ❌ |
| Selects | ✅ | ✅ via Remotion | ✅ (10 ms) | ✅ frame-level | ✅ (100+) | ✅ | ✅ |
The three axes are complements, not competitors. You want all three. Selects is the one most editors are missing.
How to actually run this pipeline today
Connect Claude to Selects via MCP. Selects' MCP server lets Claude (Desktop, Code, or Cursor) read your footage, request frame-precise cuts, and receive a preview link.
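For Claude Desktop, MCP servers are registered in claude_desktop_config.json. A sketch, with a hypothetical package name (check Selects' docs for the real command):

```json
{
  "mcpServers": {
    "selects": {
      "command": "npx",
      "args": ["-y", "@selects/mcp-server"]
    }
  }
}
```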
Add Remotion or Hyperframes for everything that should be rendered from code -- animated captions, lower thirds, kinetic text, brand intros.
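As a flavor of what that layer looks like, a minimal Remotion caption that fades in over ten frames (the component and styling are ours; the remotion imports are the library's real API):

```tsx
import React from "react";
import { AbsoluteFill, interpolate, useCurrentFrame } from "remotion";

// A code-rendered caption: opacity is a pure function of the current frame.
export const Caption: React.FC<{ text: string }> = ({ text }) => {
  const frame = useCurrentFrame();
  const opacity = interpolate(frame, [0, 10], [0, 1], {
    extrapolateRight: "clamp", // hold at full opacity after frame 10
  });
  return (
    <AbsoluteFill style={{ justifyContent: "flex-end", alignItems: "center" }}>
      <h1 style={{ opacity, color: "white", marginBottom: 80 }}>{text}</h1>
    </AbsoluteFill>
  );
};
```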
Let Claude plan, not cut. Claude writes the intent ("60-second highlight on CUDA bet, end on margin collapse"). Selects translates intent into frame-accurate cut points. Remotion adds the captions.
Use parallel for variants. When you need 10 platform-specific cuts (YouTube 8:00, Shorts 0:60, LinkedIn 1:30), let Selects render them in parallel with the same Claude plan. Each one is previewable individually.
The "Claude edited my video" tutorials are not lying, they're just showing you one third of the pipeline. Claude's planning an edit from a transcript is real and useful. Claude cutting real footage frame-precisely is not possible, and the experiment data is unambiguous on that. The editors who ship faster in 2026 are the ones who stopped fighting the architecture and started using each tool for what it was built for. Claude for the brain. Selects MCP for the footage. Remotion or Hyperframes for the graphics layer. That's the pipeline. Selects MCP connects Claude to your real footage in under three minutes, frame-precise cuts, parallel renders, preview included.
Frequently Asked Questions (FAQs)
Can Claude edit videos in 2026?
Claude can plan video edits from a transcript. It cannot directly cut or preview video. To edit real footage frame-precisely, you need to pair Claude with a real-footage editor like Selects. To generate motion graphics from code, you pair Claude with Remotion or Hyperframes.
How accurate is Claude when editing long videos?
In our 30-minute test, Claude Opus missed the requested 5-minute (300s) target by 93% -- it returned 578 seconds while reporting "299.32." Eleven of eleven cuts landed mid-sentence. The model hallucinated its own arithmetic.
Why does Claude cut mid-sentence?
Claude only reads transcripts, and transcript word-boundary timestamps are precise to ~120 ms, about four frames. Even when the model picks a "sentence end," its boundary index is off by a syllable. Editors cut on breath beats, not word indexes.
Is Hyperframes a video editor?
No. Hyperframes is an overlay and motion-graphics renderer. It does not edit raw footage. Use it for captions, kinetic text, lower thirds, and animated graphics -- everything that should be rendered from code, not cut from a camera.
What is the difference between Hyperframes, Remotion, and Selects?
Hyperframes and Remotion are code-rendered video tools; the video is generated, not edited. Selects edits real footage you shot with a camera. The three are complements; a 2026 pipeline uses all three with Claude orchestrating.
Can Claude edit multicam interviews?
No. Claude has no concept of multiple camera angles. It reads a single transcript. To sync multiple cameras, pick the best angle per line, and switch between them, you need a real footage editor. Selects syncs multicam by audio fingerprint and chooses angles based on speaker detection.
Can Claude add B-roll cutaways?
Not natively. Claude doesn't know what's in your B-roll folder and can't place clips on a secondary video track. Selects indexes B-roll, matches it to transcript topics, and inserts it on a secondary track at pacing beats.
Can Claude export to Premiere Pro, Final Cut, or DaVinci Resolve?
No. Claude returns JSON. Selects exports native Premiere project files, FCP XML, or DaVinci timelines, so the AI-cut rough is the first 40% of the work, and the editor finishes in their NLE.
Can Claude run 100 edits in parallel?
Only with custom orchestration code you write yourself, and each parallel output will still suffer the same mid-sentence and target-miss problems we measured. Selects ships native parallel rendering with frame-precise correctness and per-render preview, which is the part that actually matters.

Tom Kim
CEO and Co-founder