feat(skills): remotion-to-hyperframes eval harness (2/7) by jrusso1020 · Pull Request #507 · heygen-com/hyperframes

jrusso1020 · 2026-04-27T05:22:43Z

What

Eval harness for the remotion-to-hyperframes skill — three scripts that the skill calls into:

scripts/render_diff.sh — <baseline.mp4> <translated.mp4> → JSON summary { mean, min, p05, p95, frame_count, pass, threshold } of per-frame SSIM. Configurable threshold via R2HF_SSIM_THRESHOLD.
scripts/frame_strip.sh — <baseline.mp4> <translated.mp4> → side-by-side comparison strip (strip.png) at N evenly-spaced timestamps. For visual debugging when SSIM fails.
scripts/lint_source.py — <remotion-src-dir> → blockers / warnings / infos in human or JSON format. Blocks on the out-of-scope patterns the skill refuses to translate (useState, useEffect with deps, async calculateMetadata, @remotion/lambda imports, third-party React UI libraries).

Why

This is #2 in a 7-PR stack (#506 is #1).

Building the eval before writing the skill prompt — the rule from skill-creator's iteration loop. Without a deterministic measure of "did this translation work", the skill is tuned on vibes. The harness lands first so PR3+ can use it to validate every fixture as it's added.

The harness deliberately accepts paths to already-rendered MP4s instead of running the renderers itself. Reasons:

The skill consumer may not have both Remotion + HyperFrames toolchains installed. The harness only needs ffmpeg and python3.
In CI, renders happen in Docker with pinned versions; the harness runs after.
Decoupling render from compare lets the skill orchestrator (PR 7) inject mocks for unit testing.

How

render_diff.sh uses ffmpeg's ssim filter with a stats_file to capture per-frame data, then a small inline Python parses All:N from each line into mean/min/p05/p95.
frame_strip.sh samples timestamps from 5%–95% of duration to skip fade-in/fade-out frames, which consistently bias SSIM in one direction.
lint_source.py is a single-file regex linter — TypeScript AST would be more correct, but ~200 lines of regex catches every blocker in the design and avoids pulling in tree-sitter-typescript as a runtime dep. If a real-world Remotion project misses a blocker, we add the regex (PR 6 will do this against the actual T1–T4 corpus).
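As a rough illustration of the stats parsing described for render_diff.sh, here is a minimal sketch of turning ssim stats lines into the JSON summary. The function names, the nearest-rank percentile method, and the rule that `pass` compares the mean against the threshold are assumptions; the PR text does not spell them out. The threshold is passed in explicitly here rather than read from R2HF_SSIM_THRESHOLD.

```python
import json
import math
import re

# Matches the aggregate score in one stats line, e.g.
# "n:2 Y:0.98 U:0.99 V:0.99 All:0.983333 (17.78)".
ALL_RE = re.compile(r"\bAll:([0-9.]+)")


def percentile(sorted_values, pct):
    """Nearest-rank percentile (assumed; the PR does not name a method)."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize_ssim(stats_text, threshold):
    """Build the summary dict from the text of an ssim stats file."""
    scores = [float(m.group(1)) for m in ALL_RE.finditer(stats_text)]
    if not scores:
        raise ValueError("no 'All:' values found in stats text")
    ordered = sorted(scores)
    mean = sum(scores) / len(scores)
    return {
        "mean": round(mean, 6),
        "min": ordered[0],
        "p05": percentile(ordered, 5),
        "p95": percentile(ordered, 95),
        "frame_count": len(scores),
        "pass": mean >= threshold,  # assumption: the gate compares the mean
        "threshold": threshold,
    }


sample = """\
n:1 Y:1.000000 U:1.000000 V:1.000000 All:1.000000 (inf)
n:2 Y:0.980000 U:0.990000 V:0.990000 All:0.983333 (17.781513)
n:3 Y:0.950000 U:0.970000 V:0.970000 All:0.956667 (13.631780)
"""
summary = summarize_ssim(sample, threshold=0.99)
print(json.dumps(summary))
```

Reporting min and p05 alongside the mean makes a short run of badly mismatched frames visible even when the average stays high.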

Stack

#506 (1/7) ← scaffold
this PR (2/7) ← eval harness
3/7 ← T1+T2 corpus (uses this harness)
4/7 ← T3 corpus (data-driven)
5/7 ← T4 corpus (escape-hatch fixtures)
6/7 ← references/*.md
7/7 ← SKILL.md body

Test plan

Unit tests added: scripts/tests/smoke.sh
Manual testing performed
Documentation updated (covered in PR 7)

scripts/tests/smoke.sh exercises every script against synthetic inputs:

==> smoke: render_diff.sh against identical inputs
    identical inputs → mean SSIM=1.0 (pass=True)
==> smoke: render_diff.sh against different inputs
    different inputs → mean SSIM=0.374791 (correctly failed at 0.99)
==> smoke: frame_strip.sh produces a strip
    strip.png written (161863 bytes)
==> smoke: lint_source.py on clean fixture (expect exit 0)
    clean.tsx → 0 blockers, 2 info findings
==> smoke: lint_source.py on blocker fixture (expect exit 1)
    blocker.tsx → 5 blockers detected (correctly refused)

✅ smoke tests passed

Two test fixtures:

clean.tsx — uses every supported Remotion API (useCurrentFrame, interpolate, spring, Sequence, AbsoluteFill, staticFile, Audio, Img). Lint returns 0 blockers, 2 info-level staticFile notices.
blocker.tsx — uses every blocker (useState, useEffect with deps, delayRender, @mui/material import, async calculateMetadata). Lint returns 5 blockers and recommends the runtime interop pattern.

Smoke test passes locally on a clean checkout. ffmpeg version: 4.4.x (system).

miguel-heygen

I found a blocker in the eval harness: the linter misses common Remotion syntax that the skill relies on for refusal decisions. Since this script is the front door for the whole translation workflow, these false negatives need coverage before this layer should merge.

miguel-heygen · 2026-04-27T20:05:35Z

+        # Multi-line bodies are common, so we use re.DOTALL and look for the
+        # closing `} , [ <something> ]` signature instead of trying to parse
+        # the body. The body itself can contain commas and nested closures.
+        re.compile(r"useEffect\s*\([\s\S]*?\}\s*,\s*\[[^\]]+\]", re.DOTALL),


This only catches useEffect calls whose first argument body contains a closing } before the dependency array. Common concise-arrow effects like useEffect(() => fetch("/x"), [frame]) return 0 blockers with the current linter, even though this rule is supposed to be the hard gate for side-effectful Remotion sources. I reproduced that locally against this script. Please add coverage for expression-bodied effects, and probably useLayoutEffect if the intent is to catch side-effect hooks generally.
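The false negative can be reproduced in isolation. This is a small sketch that runs the pattern quoted from the diff against a block-bodied and an expression-bodied effect; the two sample strings are illustrative and not taken from the fixtures.

```python
import re

# Pattern quoted from the diff under review.
ORIGINAL = re.compile(r"useEffect\s*\([\s\S]*?\}\s*,\s*\[[^\]]+\]", re.DOTALL)

block_bodied = "useEffect(() => { tick(frame); }, [frame]);"
expression_bodied = 'useEffect(() => fetch("/x"), [frame]);'

caught_block = bool(ORIGINAL.search(block_bodied))
caught_expression = bool(ORIGINAL.search(expression_bodied))

print(caught_block)       # True: the body ends in "}" before the deps array
print(caught_expression)  # False: no "}" precedes ", [frame]", so nothing matches
```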

miguel-heygen · 2026-04-27T20:05:35Z

+            findings.append(Finding(str(path), line, col, severity, rule, message, rec))
+
+    # Custom hook detection: any `function useXxx(` or `const useXxx = ` defined in this file.
+    for m in re.finditer(r"^\s*(?:function|const|let)\s+(use[A-Z]\w+)\b", src, re.MULTILINE):


This custom-hook detector misses the most common exported hook form: export const useFadeIn = (...) => { ... }. I tested that shape locally and it returns 0 warnings. Since T4 treats custom hooks as a required warning class, this should handle optional export and arrow-function declarations before relying on it in the skill workflow.

jrusso1020 · 2026-04-27T23:10:14Z

@miguel-heygen — addressed in the amended commit 2a309f24:

Fix 1: r2hf/use-effect-deps regex no longer anchors on }. New pattern matches expression-bodied effects (useEffect(() => fetch(...), [deps])) by anchoring on the , [<non-empty>] deps signature instead. Also now catches useLayoutEffect (same hook semantics, was missing).

re.compile(r"\buse(?:Layout)?Effect\s*\([\s\S]*?,\s*\[[^\]]+\]\s*\)", re.DOTALL)

Fix 2: custom-hook detector handles export const useFoo = .... Optional export (with optional default), and var added alongside function/const/let:

r"^\s*(?:export\s+(?:default\s+)?)?(?:function|const|let|var)\s+(use[A-Z]\w+)\b"

Smoke regression coverage in scripts/tests/fixtures/blocker.tsx: added an expression-bodied useEffect, a useLayoutEffect, and an export const useFadeMixed = ... custom hook. Smoke now reports 8 blockers (was 5), so future regex regressions on these specific cases will fail the smoke test.
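For comparison, a short sketch that runs the old and the widened custom-hook patterns over the same source. The sample declarations are made up for illustration, and re.MULTILINE is assumed to carry over from the original finditer call.

```python
import re

# Old pattern from the first review round, new pattern from Fix 2.
OLD = re.compile(r"^\s*(?:function|const|let)\s+(use[A-Z]\w+)\b", re.MULTILINE)
NEW = re.compile(
    r"^\s*(?:export\s+(?:default\s+)?)?(?:function|const|let|var)\s+(use[A-Z]\w+)\b",
    re.MULTILINE,
)

src = """\
function useLocal() {}
export const useFadeIn = (frame: number) => frame / 30;
export default function useSlide() {}
const user = 1;
"""

old_hits = OLD.findall(src)
new_hits = NEW.findall(src)
print(old_hits)  # ['useLocal']
print(new_hits)  # ['useLocal', 'useFadeIn', 'useSlide']
```

The `use[A-Z]` requirement keeps ordinary identifiers such as `user` out of the results in both versions.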

Bonus fix landing in this same amend: demoted r2hf/lambda-import from BLOCKER to WARNING — the lambda-only policy contradiction you flagged in #515/#516/#517 reduces to "no special case needed" once severity matches policy. Lambda config drops via the existing translate-after-dropping-warnings flow. See per-PR replies for the downstream changes.

miguel-heygen

The previous missing detections are fixed, but the new useEffect regex is now too broad and can false-positive clean mount-only effects by matching a later unrelated array argument.

miguel-heygen · 2026-04-27T23:14:52Z

+        # Anchor on the `, [<non-empty>]` deps signature with a lazy match for
+        # the function arg. Leading `[\s\S]*?` is greedy-bounded by the deps
+        # match itself, so we don't over-match across multiple useEffect calls.
+        re.compile(r"\buse(?:Layout)?Effect\s*\([\s\S]*?,\s*\[[^\]]+\]\s*\)", re.DOTALL),


This regex can span past the end of a clean useEffect(..., []) call and grab a later , [x]) from unrelated code, turning a mount-only effect into a blocker. I reproduced it with useEffect(() => { console.log('mounted'); }, []); const values = pick('x', [frame]);: the linter exits 1 and reports r2hf/use-effect-deps at the mount-only effect. The earlier missing expression-bodied effects are fixed, but this needs a bounded hook-call parser or at least coverage for empty-deps effects followed by later array arguments before the lint contract is reliable.
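The over-match is easy to see by printing what the pattern actually captures on the repro. A minimal sketch, using the regex from the amended commit and the repro string from the comment above:

```python
import re

# Pattern from the amended commit under review.
BROAD = re.compile(
    r"\buse(?:Layout)?Effect\s*\([\s\S]*?,\s*\[[^\]]+\]\s*\)", re.DOTALL
)

src = (
    "useEffect(() => { console.log('mounted'); }, []); "
    "const values = pick('x', [frame]);"
)

m = BROAD.search(src)
matched = m.group(0) if m else None
print(matched)
```

The empty `[]` cannot satisfy `[^\]]+`, so the lazy prefix keeps expanding until it reaches `, [frame])` in the unrelated `pick` call. The match therefore starts at the mount-only effect and ends inside the next statement.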

jrusso1020 · 2026-04-27T23:34:43Z

@miguel-heygen — addressed in 90845dd2:

The over-match was a fundamental regex limitation (no balanced-paren matching), so I replaced the regex with a proper paren-walk:

for m in re.finditer(r"\buse(?:Layout)?Effect\s*\(", src):
    end = _find_matching_paren(src, m.end() - 1)
    if end is None: continue
    call = src[m.start() : end + 1]
    m2 = re.search(r",\s*\[([^\]]*)\]\s*$", call[:-1])
    if m2 and m2.group(1).strip():
        # ... emit blocker

_find_matching_paren() is a small helper that tracks paren depth while skipping '…', "…", and `…` string literals. The check m2.group(1).strip() explicitly excludes empty [] deps.
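For readers without the diff open, here is a minimal sketch of what such a paren-walk could look like. It is not the code from 90845dd2: `find_matching_paren` and `effects_with_deps` are hypothetical names, and this version ignores comments and nested backticks inside template-literal `${...}` expressions.

```python
import re


def find_matching_paren(src, open_idx):
    """Return the index of the ')' closing the '(' at open_idx, or None."""
    depth = 0
    i = open_idx
    while i < len(src):
        ch = src[i]
        if ch in "'\"`":
            # Skip to the closing quote, stepping over backslash escapes.
            i += 1
            while i < len(src) and src[i] != ch:
                i += 2 if src[i] == "\\" else 1
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                return i
        i += 1
    return None  # unbalanced input


def effects_with_deps(src):
    """Return the text of each (layout) effect call with a non-empty deps array."""
    hits = []
    for m in re.finditer(r"\buse(?:Layout)?Effect\s*\(", src):
        end = find_matching_paren(src, m.end() - 1)
        if end is None:
            continue
        call = src[m.start():end + 1]
        # Only look at the last argument of this one call.
        deps = re.search(r",\s*\[([^\]]*)\]\s*$", call[:-1])
        if deps and deps.group(1).strip():
            hits.append(call)
    return hits


repro = 'useEffect(() => { console.log("mounted"); }, []);\nconst _picked = pick("x", [frame]);'
mixed = 'useEffect(() => fetch("/x"), [frame]);'
print(effects_with_deps(repro))  # []
print(effects_with_deps(mixed))
```

Bounding the search to a single call is what removes the false positive: the deps regex never sees text past the effect's own closing paren.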

Regression coverage added to clean.tsx — Miguel's exact repro is now part of the smoke fixture:

useEffect(() => { console.log("mounted"); }, []);
const _picked = pick("x", [frame]);

Verification — Miguel's repro returns 0 blockers (was 1 false-positive). A real useEffect(() => fetch(), [frame]) mixed in still correctly fires:

$ python3 lint_source.py /tmp/miguel-repro.tsx --json | jq -c '{blockers, warnings, infos}'
{"blockers":0,"warnings":0,"infos":0}

$ python3 lint_source.py /tmp/miguel-mixed.tsx --json | jq -c '.findings[].rule'
"r2hf/use-effect-deps"

Smoke still 8 blockers on blocker.tsx / 0 blockers on clean.tsx.

Downstream PRs cleanly rebased.

miguel-heygen

Re-checked the latest head against my requested-change thread. The exact false-positive repro now returns 0 blockers, the expression-bodied useEffect(..., [deps]) and exported custom-hook cases still detect correctly, and the smoke suite passes. The added fixture coverage addresses the regression, so this is good from my side.

Adds the deterministic eval primitives the skill calls into:

- scripts/render_diff.sh: SSIM diff between two MP4s, JSON summary, configurable threshold
- scripts/frame_strip.sh: side-by-side comparison strip for visual debugging
- scripts/lint_source.py: pre-translation lint over Remotion source (blocks/warnings/infos)

The harness is decoupled from the render pipeline: it accepts paths to already-rendered MP4s. The skill orchestrator (PR 7) drives both renders and feeds the outputs in. This keeps the harness usable in CI, in sandboxes, and on any machine that has ffmpeg without needing the full Remotion + HyperFrames toolchain.

Lint catches the patterns from the skill's out-of-scope list:

- useState / useReducer (state-machine driven animation)
- useEffect with deps (side effects)
- async calculateMetadata (Promise-returning composition metadata)
- @remotion/lambda imports
- third-party React UI libraries (MUI, Chakra, Mantine, antd, shadcn, Radix, NextUI)
- delayRender / useCallback / useMemo (warnings)
- staticFile / interpolateColors (info: translatable but flagged)

Smoke test (scripts/tests/smoke.sh) exercises all three scripts against synthetic inputs: identical ffmpeg testsrc videos pass at threshold 0.99, different ffmpeg testsrc videos fail at 0.99, frame_strip produces a strip.png, lint produces 0 blockers on a clean fixture and >=3 blockers on a fixture that uses useState + useEffect + MUI + async metadata.

Validated locally: smoke.sh exits 0.

jrusso1020 · 2026-04-28T00:00:55Z

Ran /simplify on the stack. Two structural cleanups landed in #507's amend (70e0b8bf):

1. RULES is now a uniform registry of Rule dataclasses with (rule_id, severity, matcher, message, recommendation).

The earlier shape was a list[tuple[Pattern, …]] for half the rules and three imperative blocks below the loop for the other half (use-effect-deps with paren-walk, custom-hook with name templating, third-party-react-ui with package matching). The docstring listed all rules as peers; the data structure didn't agree. Now every rule fits one shape:

@dataclass
class Rule:
    rule_id: str
    severity: str
    matcher: Callable[[str], Iterable[tuple[int, str | None]]]
    message: str
    recommendation: str

The matcher yields (offset, override_message) tuples — None for fixed messages, a templated string for rules that need to embed the matched text (custom hook name, package name). lint_file() is now ~25 lines instead of ~80.
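To make the registry shape concrete, here is a small sketch of two rules and the loop that consumes them. Only the Rule fields and the r2hf/lambda-import rule id come from this thread; `regex_matcher`, `custom_hook_matcher`, `lint_source`, the `example/custom-hook` id, and all message strings are hypothetical.

```python
from __future__ import annotations

import re
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Rule:
    rule_id: str
    severity: str
    matcher: Callable[[str], Iterable[tuple[int, str | None]]]
    message: str
    recommendation: str


def regex_matcher(pattern):
    """Wrap a plain regex as a matcher that always uses the rule's fixed message."""
    compiled = re.compile(pattern)

    def match(src):
        for m in compiled.finditer(src):
            yield m.start(), None

    return match


def custom_hook_matcher(src):
    """Yield a templated message that embeds the matched hook name."""
    pattern = r"^\s*(?:export\s+(?:default\s+)?)?(?:function|const|let|var)\s+(use[A-Z]\w+)\b"
    for m in re.finditer(pattern, src, re.MULTILINE):
        yield m.start(1), f"custom hook {m.group(1)} defined in this file"


RULES = [
    Rule("r2hf/lambda-import", "warning",
         regex_matcher(r"from\s+[\"']@remotion/lambda"),
         "@remotion/lambda import", "drop the lambda config before translating"),
    Rule("example/custom-hook", "warning", custom_hook_matcher,
         "custom hook defined", "inline the hook logic"),
]


def lint_source(src):
    """Run every rule and return (line, severity, rule_id, message) tuples."""
    findings = []
    for rule in RULES:
        for offset, override in rule.matcher(src):
            line = src.count("\n", 0, offset) + 1
            findings.append((line, rule.severity, rule.rule_id, override or rule.message))
    return findings


demo = 'import { x } from "@remotion/lambda";\nexport const useFadeIn = () => 1;\n'
findings = lint_source(demo)
print(findings)
```

Because every rule exposes the same matcher interface, the paren-walk and package-matching rules can sit in the same list as the plain regex rules.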

2. frame_strip.sh collapsed to a single ffmpeg invocation.

Previously: 3 ffmpeg processes per timestamp (extract baseline frame, extract translated frame, hstack the pair) + 1 final vstack — at 8 samples that's 25 ffmpeg startups (~150-300ms each on Linux).

Now: one ffmpeg with two inputs, a select='eq(n,F1)+eq(n,F2)+…' filter to pick the sampled frames, then hstack per pair and vstack of the rows in the same filter graph. Benchmark on synthetic input with 8 samples: 0.27s end-to-end (was ~1-2s for the same workload).
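A rough sketch of how the sampled frames and the single filter graph could be assembled. Both helper names are hypothetical, the 5%–95% window is applied to frame indices rather than timestamps, and the rows are stacked with ffmpeg's `tile` filter as a stand-in for the "vstack of the rows" step, so this is not the graph frame_strip.sh actually builds.

```python
def sample_frame_indices(frame_count, samples, lo=0.05, hi=0.95):
    """Evenly spaced, de-duplicated frame indices between lo and hi of the clip."""
    if samples < 1 or frame_count < 1:
        raise ValueError("need at least one sample and one frame")
    last = frame_count - 1
    if samples == 1:
        return [round(last * (lo + hi) / 2)]
    step = (hi - lo) / (samples - 1)
    picked = {round(last * (lo + i * step)) for i in range(samples)}
    return sorted(picked)


def build_filter_graph(indices):
    """Select the same frames from both inputs, pair them, stack the pairs."""
    select = "+".join(f"eq(n,{i})" for i in indices)
    return (
        f"[0:v]select='{select}'[a];"
        f"[1:v]select='{select}'[b];"
        # hstack pairs frame k of [a] with frame k of [b];
        # tile=1xN then stacks the N paired rows into one image.
        f"[a][b]hstack=inputs=2,tile=1x{len(indices)}[out]"
    )


indices = sample_frame_indices(frame_count=101, samples=3)
graph = build_filter_graph(indices)
print(indices)  # [5, 50, 95]
print(graph)
```

A string like this would be passed to ffmpeg as a `-filter_complex` argument with both MP4s as inputs and a single output frame mapped from `[out]`; the sketch has not been run against real renders.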

Smoke still green: 8 blockers on blocker.tsx, 0 blockers + 2 infos on clean.tsx, strip.png renders. Miguel's repro case unchanged.

This was referenced Apr 27, 2026

feat(skills): remotion-to-hyperframes corpus T1+T2 (3/7) #508

Merged

feat(skills): remotion-to-hyperframes corpus T3 (4/7) #509

Merged

jrusso1020 force-pushed the skill/r2hf-eval-harness branch from 51c058f to 00de7e4 April 27, 2026 18:40

This was referenced Apr 27, 2026

feat(skills): remotion-to-hyperframes corpus T4 (5/7) #515

Merged

feat(skills): remotion-to-hyperframes references (6/7) #516

Merged

feat(skills): remotion-to-hyperframes SKILL.md + orchestrator (7/7) #517

Merged

miguel-heygen approved these changes Apr 27, 2026

miguel-heygen requested changes Apr 27, 2026

jrusso1020 force-pushed the skill/r2hf-eval-harness branch from 00de7e4 to 2a309f2 April 27, 2026 23:03

jrusso1020 requested a review from miguel-heygen April 27, 2026 23:11

miguel-heygen requested changes Apr 27, 2026

jrusso1020 force-pushed the skill/r2hf-eval-harness branch from 2a309f2 to 90845dd April 27, 2026 23:31

jrusso1020 requested a review from miguel-heygen April 27, 2026 23:34

miguel-heygen approved these changes Apr 27, 2026

jrusso1020 force-pushed the skill/r2hf-eval-harness branch from 90845dd to 70e0b8b April 27, 2026 23:54

jrusso1020 requested a review from miguel-heygen April 28, 2026 00:01

jrusso1020 marked this pull request as ready for review April 28, 2026 00:29

jrusso1020 deleted the branch main April 28, 2026 05:12

jrusso1020 closed this Apr 28, 2026

jrusso1020 reopened this Apr 28, 2026

jrusso1020 changed the base branch from skill/r2hf-scaffold to main April 28, 2026 05:13

jrusso1020 merged commit b27385b into main Apr 28, 2026
26 checks passed

jrusso1020 deleted the skill/r2hf-eval-harness branch April 28, 2026 05:14
