
feat(skills): remotion-to-hyperframes eval harness (2/7) #507

Merged
jrusso1020 merged 1 commit into main from
skill/r2hf-eval-harness
Apr 28, 2026

@jrusso1020
Collaborator

What

Eval harness for the remotion-to-hyperframes skill — three scripts that the skill calls into:

  • scripts/render_diff.sh <baseline.mp4> <translated.mp4> → JSON summary { mean, min, p05, p95, frame_count, pass, threshold } of per-frame SSIM. Configurable threshold via R2HF_SSIM_THRESHOLD.
  • scripts/frame_strip.sh <baseline.mp4> <translated.mp4> → side-by-side comparison strip (strip.png) at N evenly-spaced timestamps. For visual debugging when SSIM fails.
  • scripts/lint_source.py <remotion-src-dir> → blockers / warnings / infos in human or JSON format. Blocks on the out-of-scope patterns the skill refuses to translate (useState, useEffect with deps, async calculateMetadata, @remotion/lambda imports, third-party React UI libraries).

Why

This is #2 in a 7-PR stack (#506 is #1).

Building the eval before writing the skill prompt — the rule from skill-creator's iteration loop. Without a deterministic measure of "did this translation work", the skill is tuned on vibes. The harness lands first so PR3+ can use it to validate every fixture as it's added.

The harness deliberately accepts paths to already-rendered MP4s instead of running the renderers itself. Reasons:

  1. The skill consumer may not have both Remotion + HyperFrames toolchains installed. The harness only needs ffmpeg and python3.
  2. In CI, renders happen in Docker with pinned versions; the harness runs after.
  3. Decoupling render from compare lets the skill orchestrator (PR 7) inject mocks for unit testing.

How

  • render_diff.sh uses ffmpeg's ssim filter with a stats_file to capture per-frame data, then a small inline Python script parses All:N from each line and reduces the values to mean/min/p05/p95.
  • frame_strip.sh samples timestamps from 5%–95% of duration to skip fade-in/fade-out artifacts, which consistently bias SSIM in one direction.
  • lint_source.py is a single-file regex linter — TypeScript AST would be more correct, but ~200 lines of regex catches every blocker in the design and avoids pulling in tree-sitter-typescript as a runtime dep. If a real-world Remotion project misses a blocker, we add the regex (PR 6 will do this against the actual T1–T4 corpus).
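
For readers who want to see the shape of the stats parsing described above, here is a minimal sketch. It assumes the usual ffmpeg ssim stats line layout (`n:1 Y:… U:… V:… All:… (…)`), a nearest-rank percentile, and a pass rule of mean >= threshold; the function names and the pass rule are assumptions, not taken from render_diff.sh.

```python
import re

# Matches the combined score on each line of the ssim stats file.
ALL_RE = re.compile(r"\bAll:([0-9.]+)")


def percentile(sorted_vals, q):
    """Nearest-rank percentile of a pre-sorted list (q is an integer 0..100)."""
    rank = max(1, (q * len(sorted_vals) + 99) // 100)  # integer ceil(q*n/100)
    return sorted_vals[rank - 1]


def summarize(stats_lines, threshold=0.99):
    """Reduce per-frame SSIM lines to the JSON-style summary fields."""
    scores = []
    for line in stats_lines:
        m = ALL_RE.search(line)
        if m:
            scores.append(float(m.group(1)))
    if not scores:
        raise ValueError("no All:<score> entries found")
    ordered = sorted(scores)
    mean = sum(scores) / len(scores)
    return {
        "mean": round(mean, 6),
        "min": ordered[0],
        "p05": percentile(ordered, 5),
        "p95": percentile(ordered, 95),
        "frame_count": len(scores),
        # Assumption: the verdict compares the mean against the threshold.
        "pass": mean >= threshold,
        "threshold": threshold,
    }


sample = [
    "n:1 Y:1.000000 U:1.000000 V:1.000000 All:1.000000 (inf)",
    "n:2 Y:0.980000 U:0.980000 V:0.980000 All:0.980000 (16.99)",
    "n:3 Y:0.990000 U:0.990000 V:0.990000 All:0.990000 (20.00)",
    "n:4 Y:0.970000 U:0.970000 V:0.970000 All:0.970000 (15.23)",
]
summary = summarize(sample)
print(summary)
```

In the real script the threshold would come from R2HF_SSIM_THRESHOLD rather than a default argument.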

Stack

#506 (1/7) ← scaffold
this PR (2/7) ← eval harness
3/7 ← T1+T2 corpus (uses this harness)
4/7 ← T3 corpus (data-driven)
5/7 ← T4 corpus (escape-hatch fixtures)
6/7 ← references/*.md
7/7 ← SKILL.md body

Test plan

  • Unit tests added: scripts/tests/smoke.sh
  • Manual testing performed
  • Documentation updated (covered in PR 7)

scripts/tests/smoke.sh exercises every script against synthetic inputs:

==> smoke: render_diff.sh against identical inputs
    identical inputs → mean SSIM=1.0 (pass=True)
==> smoke: render_diff.sh against different inputs
    different inputs → mean SSIM=0.374791 (correctly failed at 0.99)
==> smoke: frame_strip.sh produces a strip
    strip.png written (161863 bytes)
==> smoke: lint_source.py on clean fixture (expect exit 0)
    clean.tsx → 0 blockers, 2 info findings
==> smoke: lint_source.py on blocker fixture (expect exit 1)
    blocker.tsx → 5 blockers detected (correctly refused)

✅ smoke tests passed

Two test fixtures:

  • clean.tsx — uses every supported Remotion API (useCurrentFrame, interpolate, spring, Sequence, AbsoluteFill, staticFile, Audio, Img). Lint returns 0 blockers, 2 info-level staticFile notices.
  • blocker.tsx — uses every blocker (useState, useEffect with deps, delayRender, @mui/material import, async calculateMetadata). Lint returns 5 blockers and recommends the runtime interop pattern.

Smoke test passes locally on a clean checkout. ffmpeg version: 4.4.x (system).

Collaborator

@miguel-heygen miguel-heygen left a comment

I found a blocker in the eval harness: the linter misses common Remotion syntax that the skill relies on for refusal decisions. Since this script is the front door for the whole translation workflow, these false negatives need coverage before this layer should merge.

# Multi-line bodies are common, so we use re.DOTALL and look for the
# closing `} , [ <something> ]` signature instead of trying to parse
# the body. The body itself can contain commas and nested closures.
re.compile(r"useEffect\s*\([\s\S]*?\}\s*,\s*\[[^\]]+\]", re.DOTALL),

This only catches useEffect calls whose first argument body contains a closing } before the dependency array. Common concise-arrow effects like useEffect(() => fetch("/x"), [frame]) return 0 blockers with the current linter, even though this rule is supposed to be the hard gate for side-effectful Remotion sources. I reproduced that locally against this script. Please add coverage for expression-bodied effects, and probably useLayoutEffect if the intent is to catch side-effect hooks generally.
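
The false negative is easy to reproduce in isolation. The snippet below runs the pattern quoted above against three inputs; the sample source strings other than the fetch("/x") one are made up for illustration.

```python
import re

# Pattern quoted in the review thread above (first revision of the rule).
OLD_USE_EFFECT = re.compile(r"useEffect\s*\([\s\S]*?\}\s*,\s*\[[^\]]+\]", re.DOTALL)

block_bodied = "useEffect(() => { tick(frame); }, [frame]);"
expression_bodied = 'useEffect(() => fetch("/x"), [frame]);'
layout_effect = "useLayoutEffect(() => { measure(); }, [frame]);"

# Only the block-bodied form contains the `}` the pattern anchors on.
block_hit = OLD_USE_EFFECT.search(block_bodied) is not None
expression_hit = OLD_USE_EFFECT.search(expression_bodied) is not None
layout_hit = OLD_USE_EFFECT.search(layout_effect) is not None

print(block_hit, expression_hit, layout_hit)  # True False False
```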

findings.append(Finding(str(path), line, col, severity, rule, message, rec))

# Custom hook detection: any `function useXxx(` or `const useXxx = ` defined in this file.
for m in re.finditer(r"^\s*(?:function|const|let)\s+(use[A-Z]\w+)\b", src, re.MULTILINE):

This custom-hook detector misses the most common exported hook form: export const useFadeIn = (...) => { ... }. I tested that shape locally and it returns 0 warnings. Since T4 treats custom hooks as a required warning class, this should handle optional export and arrow-function declarations before relying on it in the skill workflow.

@jrusso1020 jrusso1020 force-pushed the skill/r2hf-eval-harness branch from 00de7e4 to 2a309f2 Compare April 27, 2026 23:03
@jrusso1020
Collaborator Author

jrusso1020 commented Apr 27, 2026

@miguel-heygen — addressed in the amended commit 2a309f24:

Fix 1: r2hf/use-effect-deps regex no longer anchors on }. New pattern matches expression-bodied effects (useEffect(() => fetch(...), [deps])) by anchoring on the , [<non-empty>] deps signature instead. Also now catches useLayoutEffect (same hook semantics, was missing).

re.compile(r"\buse(?:Layout)?Effect\s*\([\s\S]*?,\s*\[[^\]]+\]\s*\)", re.DOTALL)

Fix 2: the custom-hook detector now handles export const useFoo = .... It accepts an optional export (with optional default) and adds var alongside function/const/let:

r"^\s*(?:export\s+(?:default\s+)?)?(?:function|const|let|var)\s+(use[A-Z]\w+)\b"

Smoke regression coverage in scripts/tests/fixtures/blocker.tsx: added an expression-bodied useEffect, a useLayoutEffect, and an export const useFadeMixed = ... custom hook. Smoke now reports 8 blockers (was 5), so future regex regressions on these specific cases will fail the smoke test.
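
The before/after behaviour of the custom-hook pattern can be checked side by side. This sketch assumes the new pattern is compiled with re.MULTILINE like the old one; the sample source and hook names are invented for the comparison.

```python
import re

# Original detector, as quoted in the review.
OLD_HOOK = re.compile(r"^\s*(?:function|const|let)\s+(use[A-Z]\w+)\b", re.MULTILINE)
# Amended detector from Fix 2 (re.MULTILINE assumed, matching the original call).
NEW_HOOK = re.compile(
    r"^\s*(?:export\s+(?:default\s+)?)?(?:function|const|let|var)\s+(use[A-Z]\w+)\b",
    re.MULTILINE,
)

SAMPLE = """\
export const useFadeIn = (frame: number) => frame / 30;
function useSlide() { return 0; }
export default function useZoom() { return 1; }
const user = 1;
"""

old_names = OLD_HOOK.findall(SAMPLE)
new_names = NEW_HOOK.findall(SAMPLE)

print(old_names)  # ['useSlide']
print(new_names)  # ['useFadeIn', 'useSlide', 'useZoom']
```

The `const user` line is there to show that the `use[A-Z]` requirement keeps ordinary identifiers out of the results in both versions.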

Bonus fix landing in this same amend: demoted r2hf/lambda-import from BLOCKER to WARNING — the lambda-only policy contradiction you flagged in #515/#516/#517 reduces to "no special case needed" once severity matches policy. Lambda config drops via the existing translate-after-dropping-warnings flow. See per-PR replies for the downstream changes.

Collaborator

@miguel-heygen miguel-heygen left a comment

The previous missing detections are fixed, but the new useEffect regex is now too broad and can false-positive clean mount-only effects by matching a later unrelated array argument.

# Anchor on the `, [<non-empty>]` deps signature with a lazy match for
# the function arg. Leading `[\s\S]*?` is greedy-bounded by the deps
# match itself, so we don't over-match across multiple useEffect calls.
re.compile(r"\buse(?:Layout)?Effect\s*\([\s\S]*?,\s*\[[^\]]+\]\s*\)", re.DOTALL),

This regex can span past the end of a clean useEffect(..., []) call and grab a later , [x]) from unrelated code, turning a mount-only effect into a blocker. I reproduced it with useEffect(() => { console.log('mounted'); }, []); const values = pick('x', [frame]);: the linter exits 1 and reports r2hf/use-effect-deps at the mount-only effect. The earlier missing expression-bodied effects are fixed, but this needs a bounded hook-call parser or at least coverage for empty-deps effects followed by later array arguments before the lint contract is reliable.
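
The over-match can be demonstrated with the regex and repro string quoted above; nothing else is assumed.

```python
import re

# Second-revision pattern, as quoted in the review.
BROAD = re.compile(r"\buse(?:Layout)?Effect\s*\([\s\S]*?,\s*\[[^\]]+\]\s*\)", re.DOTALL)

repro = (
    "useEffect(() => { console.log('mounted'); }, []); "
    "const values = pick('x', [frame]);"
)

m = BROAD.search(repro)
matched = m.group(0) if m else None

# `, []` cannot satisfy `[^\]]+`, so the lazy prefix keeps growing until it
# reaches the unrelated `, [frame])` that belongs to pick(...).
print(matched)
```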

@jrusso1020 jrusso1020 force-pushed the skill/r2hf-eval-harness branch from 2a309f2 to 90845dd Compare April 27, 2026 23:31
@jrusso1020
Collaborator Author

@miguel-heygen — addressed in 90845dd2:

The over-match was a fundamental regex limitation (no balanced-paren matching), so I replaced the regex with a proper paren-walk:

for m in re.finditer(r"\buse(?:Layout)?Effect\s*\(", src):
    end = _find_matching_paren(src, m.end() - 1)
    if end is None: continue
    call = src[m.start() : end + 1]
    m2 = re.search(r",\s*\[([^\]]*)\]\s*$", call[:-1])
    if m2 and m2.group(1).strip():
        # ... emit blocker

_find_matching_paren() is a small helper that tracks paren depth while skipping '…', "…", and `…` string literals. The check m2.group(1).strip() explicitly excludes empty [] deps.
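
The helper's body is not shown in this thread, so the following is only a sketch of what a paren-walk of that description could look like. find_matching_paren and effects_with_deps are hypothetical names, and this version does not handle comments or backticks nested inside `${…}` template expressions.

```python
import re


def find_matching_paren(src, open_idx):
    """Return the index of the ')' that closes the '(' at open_idx, or None."""
    depth = 0
    i = open_idx
    n = len(src)
    while i < n:
        ch = src[i]
        if ch in "'\"`":
            # Skip the string literal, honouring backslash escapes.
            i += 1
            while i < n and src[i] != ch:
                i += 2 if src[i] == "\\" else 1
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                return i
        i += 1
    return None


def effects_with_deps(src):
    """Offsets of useEffect/useLayoutEffect calls with a non-empty deps array."""
    hits = []
    for m in re.finditer(r"\buse(?:Layout)?Effect\s*\(", src):
        end = find_matching_paren(src, m.end() - 1)
        if end is None:
            continue
        call = src[m.start():end + 1]
        deps = re.search(r",\s*\[([^\]]*)\]\s*$", call[:-1])
        if deps and deps.group(1).strip():
            hits.append(m.start())
    return hits


clean = "useEffect(() => { console.log('mounted'); }, []); const v = pick('x', [frame]);"
mixed = 'useEffect(() => fetch("/x"), [frame]);'
tricky = 'useEffect(() => log(")"), [a]);'

print(effects_with_deps(clean), effects_with_deps(mixed), effects_with_deps(tricky))
```

Bounding the search to the text of a single call is what removes the false positive: the deps regex never sees the later pick(...) arguments.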

Regression coverage added to clean.tsx — Miguel's exact repro is now part of the smoke fixture:

useEffect(() => { console.log("mounted"); }, []);
const _picked = pick("x", [frame]);

Verification — Miguel's repro returns 0 blockers (was 1 false-positive). A real useEffect(() => fetch(), [frame]) mixed in still correctly fires:

$ python3 lint_source.py /tmp/miguel-repro.tsx --json | jq -c '{blockers, warnings, infos}'
{"blockers":0,"warnings":0,"infos":0}

$ python3 lint_source.py /tmp/miguel-mixed.tsx --json | jq -c '.findings[].rule'
"r2hf/use-effect-deps"

Smoke still 8 blockers on blocker.tsx / 0 blockers on clean.tsx.

Downstream PRs cleanly rebased.

Collaborator

@miguel-heygen miguel-heygen left a comment

Re-checked the latest head against my requested-change thread. The exact false-positive repro now returns 0 blockers, the expression-bodied useEffect(..., [deps]) and exported custom-hook cases still detect correctly, and the smoke suite passes. The added fixture coverage addresses the regression, so this is good from my side.

Adds the deterministic eval primitives the skill calls into:

  scripts/render_diff.sh    SSIM diff between two MP4s, JSON summary, configurable threshold
  scripts/frame_strip.sh    side-by-side comparison strip for visual debugging
  scripts/lint_source.py    pre-translation lint over Remotion source — blocks/warnings/infos

The harness is decoupled from the render pipeline: it accepts paths to
already-rendered MP4s. The skill orchestrator (PR 7) drives both renders
and feeds the outputs in. This keeps the harness usable in CI, in
sandboxes, and on any machine that has ffmpeg without needing the full
Remotion + HyperFrames toolchain.

Lint catches the patterns from the skill's out-of-scope list:
- useState / useReducer (state-machine driven animation)
- useEffect with deps (side effects)
- async calculateMetadata (Promise-returning composition metadata)
- @remotion/lambda imports
- third-party React UI libraries (MUI, Chakra, Mantine, antd, shadcn, Radix, NextUI)
- delayRender / useCallback / useMemo (warnings)
- staticFile / interpolateColors (info — translatable but flagged)

Smoke test (scripts/tests/smoke.sh) exercises all three scripts against
synthetic inputs: identical ffmpeg testsrc videos pass at threshold 0.99,
different ffmpeg testsrc videos fail at 0.99, frame_strip produces a
strip.png, lint produces 0 blockers on a clean fixture and >=3 blockers
on a fixture that uses useState + useEffect + MUI + async metadata.

Validated locally: smoke.sh exits 0.
@jrusso1020 jrusso1020 force-pushed the skill/r2hf-eval-harness branch from 90845dd to 70e0b8b Compare April 27, 2026 23:54
@jrusso1020
Collaborator Author

Ran /simplify on the stack. Two structural cleanups landed in #507's amend (70e0b8bf):

1. RULES is now a uniform registry of Rule dataclasses with (rule_id, severity, matcher, message, recommendation).

The earlier shape was a list[tuple[Pattern, …]] for half the rules and three imperative blocks below the loop for the other half (use-effect-deps with paren-walk, custom-hook with name templating, third-party-react-ui with package matching). The docstring listed all rules as peers; the data structure didn't agree. Now every rule fits one shape:

@dataclass
class Rule:
    rule_id: str
    severity: str
    matcher: Callable[[str], Iterable[tuple[int, str | None]]]
    message: str
    recommendation: str

The matcher yields (offset, override_message) tuples — None for fixed messages, a templated string for rules that need to embed the matched text (custom hook name, package name). lint_file() is now ~25 lines instead of ~80.
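
To make the registry shape concrete, here is a small sketch with one fixed-message rule and one templated rule. Only the Rule fields come from the description above; the rule ids r2hf/use-state and r2hf/custom-hook, the messages, the helper names, and lint_source() are illustrative assumptions rather than the contents of lint_source.py.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Tuple


@dataclass
class Rule:
    rule_id: str
    severity: str
    matcher: Callable[[str], Iterable[Tuple[int, Optional[str]]]]
    message: str
    recommendation: str


def regex_matcher(pattern):
    """Fixed-message matcher: yields (offset, None) for every regex hit."""
    compiled = re.compile(pattern)

    def match(src):
        for m in compiled.finditer(src):
            yield m.start(), None

    return match


HOOK_RE = re.compile(
    r"^\s*(?:export\s+(?:default\s+)?)?(?:function|const|let|var)\s+(use[A-Z]\w+)\b",
    re.MULTILINE,
)


def custom_hook_matcher(src):
    """Templated matcher: embeds the matched hook name in the message."""
    for m in HOOK_RE.finditer(src):
        yield m.start(1), f"custom hook {m.group(1)} is defined in this file"


# Hypothetical registry entries; ids and wording are placeholders.
RULES = [
    Rule("r2hf/use-state", "blocker", regex_matcher(r"\buseState\s*[<(]"),
         "useState is out of scope", "derive the value from the current frame"),
    Rule("r2hf/custom-hook", "warning", custom_hook_matcher,
         "custom hook defined", "inline the hook before translating"),
]


def lint_source(src):
    """Run every rule through the same loop; no per-rule special cases."""
    findings = []
    for rule in RULES:
        for offset, override in rule.matcher(src):
            line = src.count("\n", 0, offset) + 1
            findings.append((line, rule.severity, rule.rule_id, override or rule.message))
    return sorted(findings)


findings = lint_source(
    "const [n, setN] = useState(0);\nexport const useFadeIn = () => 1;\n"
)
for f in findings:
    print(f)
```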

2. frame_strip.sh collapsed to a single ffmpeg invocation.

Previously: 3 ffmpeg processes per timestamp (extract baseline frame, extract translated frame, hstack the pair) + 1 final vstack — at 8 samples that's 25 ffmpeg startups (~150-300ms each on Linux).

Now: one ffmpeg with two inputs, a select='eq(n,F1)+eq(n,F2)+…' filter to pick the sampled frames, then hstack per pair and vstack of the rows in the same filter graph. Benchmark on synthetic input with 8 samples: 0.27s end-to-end (was ~1-2s for the same workload).
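
A rough illustration of how the sampled frame list and the select expression might be built. sample_frames and select_filter are hypothetical helper names, and working in frame indices rather than timestamps is an assumption; frame_strip.sh is a shell script and may compute these differently.

```python
def sample_frames(frame_count, samples=8, lo=0.05, hi=0.95):
    """Evenly spaced frame indices inside the [lo, hi] fraction of the clip."""
    if frame_count < 1 or samples < 1:
        raise ValueError("frame_count and samples must be positive")
    last = frame_count - 1
    if samples == 1:
        fractions = [(lo + hi) / 2]
    else:
        step = (hi - lo) / (samples - 1)
        fractions = [lo + i * step for i in range(samples)]
    return [round(f * last) for f in fractions]


def select_filter(frames):
    """Build a select='eq(n,F1)+eq(n,F2)+...' filter string for ffmpeg."""
    expr = "+".join(f"eq(n,{f})" for f in frames)
    # Single quotes keep the commas inside eq() from splitting the filter chain.
    return f"select='{expr}'"


frames = sample_frames(101, samples=3)
print(frames)                 # [5, 50, 95]
print(select_filter(frames))  # select='eq(n,5)+eq(n,50)+eq(n,95)'
```

On very short clips neighbouring samples can round to the same frame index, so a real implementation would likely need to deduplicate or clamp the sample count.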

Smoke still green: 8 blockers on blocker.tsx, 0 blockers + 2 infos on clean.tsx, strip.png renders. Miguel's repro case unchanged.

@jrusso1020 jrusso1020 marked this pull request as ready for review April 28, 2026 00:29
@jrusso1020 jrusso1020 deleted the branch main April 28, 2026 05:12
@jrusso1020 jrusso1020 closed this Apr 28, 2026
@jrusso1020 jrusso1020 reopened this Apr 28, 2026
@jrusso1020 jrusso1020 changed the base branch from skill/r2hf-scaffold to main April 28, 2026 05:13
@jrusso1020 jrusso1020 merged commit b27385b into main Apr 28, 2026
26 checks passed
@jrusso1020 jrusso1020 deleted the skill/r2hf-eval-harness branch April 28, 2026 05:14