feat(skills): remotion-to-hyperframes eval harness (2/7)#507
Conversation
51c058f to
00de7e4
Compare
miguel-heygen
left a comment
There was a problem hiding this comment.
I found a blocker in the eval harness: the linter misses common Remotion syntax that the skill relies on for refusal decisions. Since this script is the front door for the whole translation workflow, these false negatives need coverage before this layer should merge.
| # Multi-line bodies are common, so we use re.DOTALL and look for the | ||
| # closing `} , [ <something> ]` signature instead of trying to parse | ||
| # the body. The body itself can contain commas and nested closures. | ||
| re.compile(r"useEffect\s*\([\s\S]*?\}\s*,\s*\[[^\]]+\]", re.DOTALL), |
There was a problem hiding this comment.
This only catches useEffect calls whose first argument body contains a closing } before the dependency array. Common concise-arrow effects like useEffect(() => fetch("/x"), [frame]) return 0 blockers with the current linter, even though this rule is supposed to be the hard gate for side-effectful Remotion sources. I reproduced that locally against this script. Please add coverage for expression-bodied effects, and probably useLayoutEffect if the intent is to catch side-effect hooks generally.
| findings.append(Finding(str(path), line, col, severity, rule, message, rec)) | ||
|
|
||
| # Custom hook detection: any `function useXxx(` or `const useXxx = ` defined in this file. | ||
| for m in re.finditer(r"^\s*(?:function|const|let)\s+(use[A-Z]\w+)\b", src, re.MULTILINE): |
There was a problem hiding this comment.
This custom-hook detector misses the most common exported hook form: export const useFadeIn = (...) => { ... }. I tested that shape locally and it returns 0 warnings. Since T4 treats custom hooks as a required warning class, this should handle optional export and arrow-function declarations before relying on it in the skill workflow.
00de7e4 to
2a309f2
Compare
|
@miguel-heygen — addressed in the amended commit Fix 1: re.compile(r"\buse(?:Layout)?Effect\s*\([\s\S]*?,\s*\[[^\]]+\]\s*\)", re.DOTALL)Fix 2: custom-hook detector handles r"^\s*(?:export\s+(?:default\s+)?)?(?:function|const|let|var)\s+(use[A-Z]\w+)\b"Smoke regression coverage in Bonus fix landing in this same amend: demoted |
miguel-heygen
left a comment
There was a problem hiding this comment.
The previous missing detections are fixed, but the new useEffect regex is now too broad and can false-positive clean mount-only effects by matching a later unrelated array argument.
| # Anchor on the `, [<non-empty>]` deps signature with a lazy match for | ||
| # the function arg. Leading `[\s\S]*?` is greedy-bounded by the deps | ||
| # match itself, so we don't over-match across multiple useEffect calls. | ||
| re.compile(r"\buse(?:Layout)?Effect\s*\([\s\S]*?,\s*\[[^\]]+\]\s*\)", re.DOTALL), |
There was a problem hiding this comment.
This regex can span past the end of a clean useEffect(..., []) call and grab a later , [x]) from unrelated code, turning a mount-only effect into a blocker. I reproduced it with useEffect(() => { console.log('mounted'); }, []); const values = pick('x', [frame]);: the linter exits 1 and reports r2hf/use-effect-deps at the mount-only effect. The earlier missing expression-bodied effects are fixed, but this needs a bounded hook-call parser or at least coverage for empty-deps effects followed by later array arguments before the lint contract is reliable.
2a309f2 to
90845dd
Compare
|
@miguel-heygen — addressed in The over-match was a fundamental regex limitation (no balanced-paren matching), so I replaced the regex with a proper paren-walk: for m in re.finditer(r"\buse(?:Layout)?Effect\s*\(", src):
end = _find_matching_paren(src, m.end() - 1)
if end is None: continue
call = src[m.start() : end + 1]
m2 = re.search(r",\s*\[([^\]]*)\]\s*$", call[:-1])
if m2 and m2.group(1).strip():
# ... emit blocker
Regression coverage added to useEffect(() => { console.log("mounted"); }, []);
const _picked = pick("x", [frame]);Verification — Miguel's repro returns 0 blockers (was 1 false-positive). A real Smoke still 8 blockers on Downstream PRs cleanly rebased. |
miguel-heygen
left a comment
There was a problem hiding this comment.
Re-checked the latest head against my requested-change thread. The exact false-positive repro now returns 0 blockers, the expression-bodied useEffect(..., [deps]) and exported custom-hook cases still detect correctly, and the smoke suite passes. The added fixture coverage addresses the regression, so this is good from my side.
Adds the deterministic eval primitives the skill calls into: scripts/render_diff.sh SSIM diff between two MP4s, JSON summary, configurable threshold scripts/frame_strip.sh side-by-side comparison strip for visual debugging scripts/lint_source.py pre-translation lint over Remotion source — blocks/warnings/infos The harness is decoupled from the render pipeline: it accepts paths to already-rendered MP4s. The skill orchestrator (PR 7) drives both renders and feeds the outputs in. This keeps the harness usable in CI, in sandboxes, and on any machine that has ffmpeg without needing the full Remotion + HyperFrames toolchain. Lint catches the patterns from the skill's out-of-scope list: - useState / useReducer (state-machine driven animation) - useEffect with deps (side effects) - async calculateMetadata (Promise-returning composition metadata) - @remotion/lambda imports - third-party React UI libraries (MUI, Chakra, Mantine, antd, shadcn, Radix, NextUI) - delayRender / useCallback / useMemo (warnings) - staticFile / interpolateColors (info — translatable but flagged) Smoke test (scripts/tests/smoke.sh) exercises all three scripts against synthetic inputs: identical ffmpeg testsrc videos pass at threshold 0.99, different ffmpeg testsrc videos fail at 0.99, frame_strip produces a strip.png, lint produces 0 blockers on a clean fixture and >=3 blockers on a fixture that uses useState + useEffect + MUI + async metadata. Validated locally: smoke.sh exits 0.
90845dd to
70e0b8b
Compare
|
Ran /simplify on the stack. Two structural cleanups landed in #507's amend ( 1. The earlier shape was a @dataclass
class Rule:
rule_id: str
severity: str
matcher: Callable[[str], Iterable[tuple[int, str | None]]]
message: str
recommendation: strThe matcher yields 2. Previously: 3 ffmpeg processes per timestamp (extract baseline frame, extract translated frame, hstack the pair) + 1 final vstack — at 8 samples that's 25 ffmpeg startups (~150-300ms each on Linux). Now: one ffmpeg with two inputs, a Smoke still green: 8 blockers on |
What
Eval harness for the
remotion-to-hyperframesskill — three scripts that the skill calls into:scripts/render_diff.sh—<baseline.mp4> <translated.mp4>→ JSON summary{ mean, min, p05, p95, frame_count, pass, threshold }of per-frame SSIM. Configurable threshold viaR2HF_SSIM_THRESHOLD.scripts/frame_strip.sh—<baseline.mp4> <translated.mp4>→ side-by-side comparison strip (strip.png) at N evenly-spaced timestamps. For visual debugging when SSIM fails.scripts/lint_source.py—<remotion-src-dir>→ blockers / warnings / infos in human or JSON format. Blocks on the out-of-scope patterns the skill refuses to translate (useState, useEffect with deps, async calculateMetadata,@remotion/lambdaimports, third-party React UI libraries).Why
This is #2 in a 7-PR stack (#506 is #1).
Building the eval before writing the skill prompt — the rule from skill-creator's iteration loop. Without a deterministic measure of "did this translation work", the skill is tuned on vibes. The harness lands first so PR3+ can use it to validate every fixture as it's added.
The harness deliberately accepts paths to already-rendered MP4s instead of running the renderers itself. Reasons:
ffmpegandpython3.How
render_diff.shuses ffmpeg'sssimfilter with astats_fileto capture per-frame data, then a small inline Python parsesAll:Nfrom each line into mean/min/p05/p95.frame_strip.shsamples timestamps from 5%–95% of duration to skip fade-in/fade-out artifacts that always SSIM-bias one direction.lint_source.pyis a single-file regex linter — TypeScript AST would be more correct, but ~200 lines of regex catches every blocker in the design and avoids pulling intree-sitter-typescriptas a runtime dep. If a real-world Remotion project misses a blocker, we add the regex (PR 6 will do this against the actual T1–T4 corpus).Stack
#506 (1/7) ← scaffold
this PR (2/7) ← eval harness
3/7 ← T1+T2 corpus (uses this harness)
4/7 ← T3 corpus (data-driven)
5/7 ← T4 corpus (escape-hatch fixtures)
6/7 ← references/*.md
7/7 ← SKILL.md body
Test plan
scripts/tests/smoke.shscripts/tests/smoke.shexercises every script against synthetic inputs:Two test fixtures:
clean.tsx— uses every supported Remotion API (useCurrentFrame,interpolate,spring,Sequence,AbsoluteFill,staticFile,Audio,Img). Lint returns 0 blockers, 2 info-levelstaticFilenotices.blocker.tsx— uses every blocker (useState,useEffectwith deps,delayRender,@mui/materialimport, asynccalculateMetadata). Lint returns 5 blockers and recommends the runtime interop pattern.Smoke test passes locally on a clean checkout. ffmpeg version: 4.4.x (system).