feat(skills): remotion-to-hyperframes corpus T3 (4/7)#509

Merged
jrusso1020 merged 3 commits into
mainfrom
skill/r2hf-corpus-t3
Apr 28, 2026

@jrusso1020 jrusso1020 commented Apr 27, 2026

What

Tier 3 of the test corpus — a purpose-built data-driven fixture (option 2 from the stack discussion, not a port of PR #214's examples/remotion-full/) that exercises the realistic shape of a production Remotion composition.

Stargazed.tsx — 10 s @ 30 fps, 1280×720:

Sequence 0–3 s    TitleScene    (title + subtitle, spring + linear fade)
Sequence 3–7 s    StatsScene    (3 reused StatCards staggered 12 frames apart)
Sequence 7–10 s   OutroScene    (UnderlinedText with scaleX-from-left underline)

Custom React subcomponents:

  • StatCard(label, value, color, delayInFrames) — used 3× with different props
  • AnimatedNumber(from, to, durationInFrames) — frame-driven count-up
  • UnderlinedText(text, color) — text with scaleX underline reveal

End-to-end validation

Validated mean SSIM: 0.953 · threshold 0.90 · p05 0.938 (the threshold sits 0.038 below p05).

Frame-strip inspection: the count-up shows minor frame-level digit mismatches mid-animation (Remotion 913, HF 1032 around 3.5 s) but converges to identical final values. The two easing formulas (Remotion's manual 1 - (1-t)^3 and GSAP's power3.out) are mathematically equivalent, so the offset comes from sub-frame seek timing: GSAP's onUpdate callback does not fire at exactly the same point in time as Remotion's per-frame React render. No SSIM impact above the noise floor.
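To illustrate why identical easing can still show different digits mid-animation, here is a minimal sketch of a cubic ease-out count-up sampled a fraction of a frame apart. The helper names, the `Math.round` display rule, and the target value are assumptions for illustration, not taken from the fixture.

```typescript
// Cubic ease-out: 1 - (1 - t)^3, the curve the PR equates with GSAP's power3.out.
function easeOutCubic(t: number): number {
  const clamped = Math.min(1, Math.max(0, t));
  return 1 - Math.pow(1 - clamped, 3);
}

// Hypothetical helper: the number displayed at a (possibly fractional) frame.
function countUpValue(
  frame: number,
  from: number,
  to: number,
  durationInFrames: number,
): number {
  const progress = easeOutCubic(frame / durationInFrames);
  return Math.round(from + (to - from) * progress);
}

const target = 1200; // illustrative target, not a value from the fixture
const onFrame = countUpValue(15, 0, target, 45); // sampled exactly on frame 15
const halfFrameLater = countUpValue(15.5, 0, target, 45); // sampled half a frame later
const finalOnFrame = countUpValue(45, 0, target, 45);
const finalLate = countUpValue(45.5, 0, target, 45);

console.log({ onFrame, halfFrameLater, finalOnFrame, finalLate });
```

The mid-animation samples differ because the curve is steep there, while both end samples clamp to the target, which matches the converging behaviour described above.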

The wider threshold gap vs T1/T2 (0.04 vs 0.02 below p05) reflects T3's bigger approximation budget: 2 spring instances + count-up timing + font fallback on multiple text sizes (160 px title, 72 px stat number, 80 px outro). A mean SSIM below 0.90 indicates a structural mismatch (wrong durations, wrong stagger, missing prop wiring), not approximation drift.
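The relationship between the measured numbers and the gate can be sketched as follows. This is not the harness code: the nearest-rank percentile, the function names, and the assumption that the gate compares the mean against the threshold are illustrative. The T3 figures are the ones reported in this PR.

```typescript
interface SsimSummary {
  mean: number;
  min: number;
  p05: number;
}

// Summarize per-frame SSIM scores. Nearest-rank percentile is an assumed method.
function summarize(perFrame: number[]): SsimSummary {
  if (perFrame.length === 0) {
    throw new Error("need at least one frame score");
  }
  const sorted = [...perFrame].sort((a, b) => a - b);
  const mean = sorted.reduce((sum, v) => sum + v, 0) / sorted.length;
  const rank = Math.max(1, Math.ceil(sorted.length / 20)); // 5% of the frames
  return { mean, min: sorted[0], p05: sorted[rank - 1] };
}

// Assumed gate: the mean must reach the tier threshold.
function passesGate(summary: SsimSummary, threshold: number): boolean {
  return summary.mean >= threshold;
}

// Figures reported for T3 in this PR.
const t3: SsimSummary = { mean: 0.953, min: 0.927, p05: 0.938 };
const t3Threshold = 0.9;
const gapBelowP05 = t3.p05 - t3Threshold;

console.log(passesGate(t3, t3Threshold), gapBelowP05.toFixed(3));
```

Keeping the threshold a fixed distance below p05 means ordinary approximation drift stays above the gate, while a structural mismatch pulls the mean well below it.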

Why

This is #4 in the 7-PR stack and the largest tier in the corpus. T1 + T2 covered the basic API surface (Sequence, AbsoluteFill, interpolate, spring, Audio, Img, staticFile). T3 adds the shape of real production compositions:

  • <Composition schema={z.object({...})} defaultProps={...} /> — Zod-typed props
  • nested array prop (stats[]) materialized as repeated markup
  • custom React subcomponents reused with different props
  • frame-driven count-up animation (AnimatedNumber)
  • two different spring configs in the same composition
  • useVideoConfig() for fps

If a translation passes T3, the skill correctly handles the patterns 80% of real-world Remotion code uses.
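As a rough illustration of what "nested array prop materialized as repeated markup" can mean, the sketch below expands a `stats[]` array into one element per entry. The attribute names, the `stat-card` class, and the sample data are hypothetical; the fixture's hand-written index.html may look different.

```typescript
interface Stat {
  label: string;
  value: number;
  color: string;
}

// Escape characters that would break an HTML attribute value.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// One element per array entry, with the component's props carried as data-* attrs.
function renderStatCards(stats: Stat[], staggerFrames: number): string {
  return stats
    .map(
      (stat, i) =>
        `<div class="stat-card" data-label="${escapeHtml(stat.label)}" ` +
        `data-value="${stat.value}" data-color="${escapeHtml(stat.color)}" ` +
        `data-delay-frames="${i * staggerFrames}"></div>`,
    )
    .join("\n");
}

// Hypothetical sample data, not the fixture's defaultProps.
const sampleStats: Stat[] = [
  { label: "Stars", value: 1200, color: "#f5c542" },
  { label: "Forks", value: 340, color: "#42a5f5" },
  { label: "Contributors", value: 58, color: "#66bb6a" },
];

const cards = renderStatCards(sampleStats, 12);
console.log(cards);
```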

Why not port examples/remotion-full/ from PR #214? That fixture's HF half uses the runtime adapter pattern (Remotion's render pipeline running inside the HF page). Including it in the corpus would mix runtime-adapter idioms into the translation evaluation, which is the wrong target because the skill produces pure HTML+GSAP. A purpose-built fixture exercising the same APIs is more honest.

How

Translation choices documented in README.md and expected.json:

| Remotion | HyperFrames |
| --- | --- |
| `<Composition schema={z.object({...})} defaultProps={...} />` | `data-*` attrs on root `#stage` div |
| nested array prop (`stats[]`) | repeated HTML markup with per-instance `data-*` attrs |
| custom React subcomponent | inline repeated HTML using the component's prop interface as the template |
| `<AnimatedNumber from={0} to={value} dur={45} />` (cubic ease-out count-up) | tween on `{ v: 0 }` object with `onUpdate` rewriting `textContent`, ease `power3.out` |
| `spring({damping:12, stiffness:100})` | `back.out(1.4)` over ~0.7 s |
| `spring({damping:14, stiffness:90})` | `back.out(1.2)` over ~0.7 s |
| `delayInFrames={i * 12}` (per-instance) | GSAP timeline offset `(i * 0.4)` s |
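The last mapping above is a plain frames-to-seconds conversion at the composition's 30 fps. A minimal sketch, with hypothetical helper names:

```typescript
// Convert a frame count to seconds at a given frame rate.
function framesToSeconds(frames: number, fps: number): number {
  return frames / fps;
}

// Per-instance start offsets for `count` items staggered by `staggerFrames`.
function staggerOffsets(count: number, staggerFrames: number, fps: number): number[] {
  return Array.from({ length: count }, (_, i) => framesToSeconds(i * staggerFrames, fps));
}

// Three StatCards, 12 frames apart, at 30 fps: 0 s, 0.4 s, 0.8 s.
const offsets = staggerOffsets(3, 12, 30);
console.log(offsets);
```

On the GSAP side, each offset would typically be passed as the position argument when the card's tween is added to the timeline.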

Same Remotion config as PR #508: setVideoImageFormat("png") + setColorSpace("bt709") to match HF's yuv420p output.
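For reference, a remotion.config.ts carrying these two settings might look like the sketch below. It assumes the Remotion v4 `Config` object from `@remotion/cli/config`; the fixture's actual file is not shown in this conversation and may contain more.

```typescript
import { Config } from "@remotion/cli/config";

// Render intermediate frames as PNG rather than the default JPEG.
Config.setVideoImageFormat("png");

// Tag the output as BT.709 so it lines up with HF's yuv420p output.
Config.setColorSpace("bt709");
```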

Stack

#506 (1/7) — scaffold
#507 (2/7) — eval harness
#508 (3/7) — T1 + T2 fixtures
this PR (4/7) — T3 data-driven fixture
5/7 — T4 escape-hatch fixtures
6/7 — references/*.md (translation map)
7/7 — SKILL.md body + corpus orchestrator

Test plan

  • lint_source.py over T3 Remotion source: 9 files scanned, 0 blockers / 0 warnings / 0 infos
  • All fixture .tsx, .ts, .json, .md files pass oxfmt --check and oxlint
  • Lefthook pre-commit (lint + format + typecheck) passes
  • End-to-end render + SSIM diff (mean 0.953, ≥ 0.90 threshold)

@miguel-heygen miguel-heygen left a comment

T3 has the same threshold-contract drift as T1/T2. The fixture itself is useful, but the README and executable metadata need to agree.

to count from 0 to the target. `AnimatedNumber` itself derives the displayed
value from `useCurrentFrame()` + a manual `1 - (1 - t)^3` ease.

## The lossy parts (and why threshold = 0.85)

This section says the threshold is 0.85 and line 56 says below 0.85 is the structural-mismatch signal, but expected.json sets ssim_threshold to 0.90 and the PR body/final skill also cite 0.90. Please align this README with the threshold the orchestrator actually enforces.

@jrusso1020 jrusso1020 force-pushed the skill/r2hf-corpus-t1-t2 branch from 08fa028 to abaa743 Compare April 27, 2026 23:04
@jrusso1020 jrusso1020 force-pushed the skill/r2hf-corpus-t3 branch from 0697f35 to 7e93bf8 Compare April 27, 2026 23:05

jrusso1020 commented Apr 27, 2026

@miguel-heygen — addressed in the amended commit 7e93bf8f:

T3 README: The "## The lossy parts (and why threshold = 0.85)" header is now "= 0.90" (matching expected.json). The closing paragraph "A mean SSIM below 0.85 in T3 indicates a structural mismatch" → "below 0.90". Validated mean (0.953) appended so the reader sees the calibrated number alongside the gate.

grep -n "0\.85" over T3 returns no matches now.

@miguel-heygen miguel-heygen left a comment

Re-reviewed the latest head. The T3 README now uses the same 0.90 threshold as expected.json and the final skill, so my prior blocker is resolved.

@jrusso1020 jrusso1020 force-pushed the skill/r2hf-corpus-t1-t2 branch from abaa743 to 2649d6d Compare April 27, 2026 23:31
@jrusso1020 jrusso1020 force-pushed the skill/r2hf-corpus-t3 branch from 7e93bf8 to 2ab99c6 Compare April 27, 2026 23:31
Adds the deterministic eval primitives the skill calls into:

  scripts/render_diff.sh    SSIM diff between two MP4s, JSON summary, configurable threshold
  scripts/frame_strip.sh    side-by-side comparison strip for visual debugging
  scripts/lint_source.py    pre-translation lint over Remotion source — blocks/warnings/infos

The harness is decoupled from the render pipeline: it accepts paths to
already-rendered MP4s. The skill orchestrator (PR 7) drives both renders
and feeds the outputs in. This keeps the harness usable in CI, in
sandboxes, and on any machine that has ffmpeg without needing the full
Remotion + HyperFrames toolchain.

Lint catches the patterns from the skill's out-of-scope list:
- useState / useReducer (state-machine driven animation)
- useEffect with deps (side effects)
- async calculateMetadata (Promise-returning composition metadata)
- @remotion/lambda imports
- third-party React UI libraries (MUI, Chakra, Mantine, antd, shadcn, Radix, NextUI)
- delayRender / useCallback / useMemo (warnings)
- staticFile / interpolateColors (info — translatable but flagged)
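
The severity tiers above can be sketched as a small rule table. This is not scripts/lint_source.py (which is Python); the rule ids and regexes below are assumptions that cover only a subset of the listed patterns.

```typescript
type Severity = "blocker" | "warning" | "info";

interface Rule {
  id: string;
  severity: Severity;
  pattern: RegExp;
}

// Illustrative subset of the patterns listed above; the regexes are assumptions.
const RULES: Rule[] = [
  { id: "react-state", severity: "blocker", pattern: /\buse(State|Reducer)\s*[(<]/ },
  { id: "remotion-lambda", severity: "blocker", pattern: /from\s+["']@remotion\/lambda/ },
  { id: "delay-render", severity: "warning", pattern: /\bdelayRender\s*\(/ },
  { id: "static-file", severity: "info", pattern: /\bstaticFile\s*\(/ },
];

// Group the ids of matching rules by severity.
function lintSource(source: string): Record<Severity, string[]> {
  const findings: Record<Severity, string[]> = { blocker: [], warning: [], info: [] };
  for (const rule of RULES) {
    if (rule.pattern.test(source)) {
      findings[rule.severity].push(rule.id);
    }
  }
  return findings;
}

const sampleSource = `
import { staticFile } from "remotion";
const [open, setOpen] = useState(false);
const src = staticFile("logo.png");
`;

const report = lintSource(sampleSource);
console.log(report);
```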

Smoke test (scripts/tests/smoke.sh) exercises all three scripts against
synthetic inputs: identical ffmpeg testsrc videos pass at threshold 0.99,
different ffmpeg testsrc videos fail at 0.99, frame_strip produces a
strip.png, lint produces 0 blockers on a clean fixture and >=3 blockers
on a fixture that uses useState + useEffect + MUI + async metadata.

Validated locally: smoke.sh exits 0.
Adds the first two test fixtures the skill is graded against. Each fixture
ships:
  - remotion-src/  full Remotion project (package.json, src/, remotion.config.ts, tsconfig.json)
  - hf-src/        hand-translated HyperFrames composition (index.html)
  - expected.json  tier metadata + SSIM threshold + translation notes + measured validation
  - README.md      human walk-through of the translation choices
  - setup.sh       (T2 only) generates binary assets (PNG, WAV) via ffmpeg

T1 — title-card-fade
- 3 s @ 30 fps, 1280x720
- Single AbsoluteFill, single useCurrentFrame interpolate
  with multi-segment input [0,15,75,90] -> [0,1,1,0]
- Validated mean SSIM 0.974, threshold 0.95
  (~0.025 gap from font-fallback divergence between Remotion's bundled
   Chromium and HF's chrome-headless-shell)

T2 — title-image-outro
- 6 s @ 30 fps, 1280x720, three Sequences (TitleScene, ImageScene, OutroScene)
- Exercises spring, interpolate, Audio, Img, staticFile
- Spring -> GSAP back.out(1.4) translation
- Validated mean SSIM 0.985, threshold 0.95
  (translation came out cleaner than predicted; spring->back.out drift was
   smaller than the ~0.05 budget I'd expected)
- setup.sh generates a 200x200 blue PNG and a 6 s silent WAV via ffmpeg
  so binaries stay out of the repo

Calibration done end-to-end: rendered Remotion baseline + HF translation,
ran scripts/render_diff.sh, set thresholds ~0.02 below measured p05.

Critical Remotion config: setVideoImageFormat("png") + setColorSpace("bt709").
The default JPEG output writes yuvj420p (full-range) which costs ~0.05 SSIM
vs HF's yuv420p (limited-range). Both fixtures' remotion.config.ts encode
this so render_diff.sh measures translation fidelity, not encoder differences.

Both fixtures lint clean (0 blockers via scripts/lint_source.py).
T2 staticFile() references correctly flagged as info-level findings.

The fixtures are not yet wired into CI — that comes with PR 7's orchestrator.
For now, render and eval are documented in each README and run by hand.
@jrusso1020 jrusso1020 force-pushed the skill/r2hf-corpus-t1-t2 branch from 2649d6d to 9ff46d7 Compare April 27, 2026 23:54
Adds the data-driven tier — a purpose-built fixture (option 2 from the
stack discussion, not a port of PR #214's examples/remotion-full/) that
exercises the realistic shape of a production Remotion composition
without using the runtime adapter.

Stargazed.tsx (10s @ 30fps, 1280x720):
  Sequence 0-3s    TitleScene   (title + subtitle)
  Sequence 3-7s    StatsScene   (3 reused StatCards staggered 12 frames apart)
  Sequence 7-10s   OutroScene   (UnderlinedText with scaleX-from-left underline)

Composition shape exercises:
  - <Composition schema={z.object({...})} defaultProps={...} />
  - nested array prop (stats[]) materialized as repeated HTML
  - custom React subcomponents (StatCard, AnimatedNumber, UnderlinedText)
    reused with different props
  - per-instance delay via prop (delayInFrames -> GSAP timeline offset)
  - frame-driven count-up (AnimatedNumber, manual cubic ease-out)
  - two different spring configs in the same composition
    (damping:12 -> back.out(1.4), damping:14 -> back.out(1.2))
  - useCurrentFrame, useVideoConfig

Translation choices documented in README.md and expected.json:
  - Zod props -> data-* on root #stage div
  - Custom subcomponents inline as repeated HTML using prop interface
    as the template
  - AnimatedNumber's frame-driven count-up -> GSAP onUpdate tween on a
    { v: 0 } counter object, ease power3.out
  - Two different spring configs -> two different back.out overshoots
    (1.4 vs 1.2 approximates the damping difference)
  - delayInFrames={i * 12} -> GSAP offset (i * 0.4)s
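
To see how the two overshoot values differ, the sketch below samples the conventional "back" ease-out curve. The formula is the standard one and is assumed to match GSAP's back.out(s); Remotion's spring itself is not modelled here.

```typescript
// Conventional back ease-out with overshoot parameter s (assumed to match GSAP's back.out(s)).
function backOut(t: number, s: number): number {
  const p = t - 1;
  return p * p * ((s + 1) * p + s) + 1;
}

// Highest value the curve reaches on [0, 1], found by sampling.
function peakValue(s: number, samples = 1000): number {
  let peak = 0;
  for (let i = 0; i <= samples; i++) {
    peak = Math.max(peak, backOut(i / samples, s));
  }
  return peak;
}

const peakStrong = peakValue(1.4); // stands in for spring({damping:12, stiffness:100})
const peakSoft = peakValue(1.2); // stands in for spring({damping:14, stiffness:90})
console.log({ peakStrong, peakSoft });
```

A larger overshoot parameter gives a higher peak above 1, which is how the lower-damping spring ends up paired with the larger back.out value.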

Validated end-to-end: rendered Remotion baseline + HF translation, ran
scripts/render_diff.sh.
  measured mean SSIM 0.953
  measured min  SSIM 0.927
  measured p05  SSIM 0.938
  threshold 0.90 (~0.04 below p05)

The wider gap vs T1/T2 reflects T3's bigger approximation budget
(2 spring instances + count-up timing + font fallback on multiple text
sizes). Mean SSIM below 0.90 = structural mismatch (wrong durations,
wrong stagger, missing prop wiring), not approximation drift.

Same Remotion config as PR 3: setVideoImageFormat("png") +
setColorSpace("bt709") to match HF's yuv420p output.

Lint: 9 files scanned, 0 blockers / 0 warnings / 0 infos.
oxlint, oxfmt, typecheck all pass.

The fixture is not yet wired into CI; render + diff is documented in
README.md and runs by hand via the harness from PR 2. PR 7's orchestrator
will wire all four tiers into a CI eval run.
@jrusso1020 jrusso1020 force-pushed the skill/r2hf-corpus-t3 branch from 2ab99c6 to efa7164 Compare April 27, 2026 23:54
@jrusso1020 jrusso1020 marked this pull request as ready for review April 28, 2026 00:29
@jrusso1020 jrusso1020 changed the base branch from skill/r2hf-corpus-t1-t2 to main April 28, 2026 05:13
@jrusso1020 jrusso1020 merged commit 1294523 into main Apr 28, 2026
20 checks passed
@jrusso1020 jrusso1020 deleted the skill/r2hf-corpus-t3 branch April 28, 2026 05:17