fix(cli): use size-adaptive timeouts for publish uploads #635
miguel-heygen merged 2 commits into main
Conversation
The flat 30s timeout caused large project uploads to abort before the S3 PUT could complete. Compute timeout from archive size at ~500 KB/s with a 120s floor, so a 78 MB project gets ~164s and a 500 MB project gets ~17 min. Metadata requests keep the original 30s timeout.
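In sketch form, the computation looks like this (uploadTimeoutMs, PUBLISH_UPLOAD_BYTES_PER_SECOND, and the Math.ceil rounding are all quoted in the reviews below; the name of the floor constant is an assumption for illustration):

```ts
// Conservative assumed minimum throughput: 500 KB/s.
const PUBLISH_UPLOAD_BYTES_PER_SECOND = 500_000;
// Never time out an upload in under 2 minutes, however small.
const PUBLISH_UPLOAD_MIN_TIMEOUT_MS = 120_000; // name assumed for illustration

export function uploadTimeoutMs(byteLength: number): number {
  return Math.max(
    PUBLISH_UPLOAD_MIN_TIMEOUT_MS,
    Math.ceil((byteLength / PUBLISH_UPLOAD_BYTES_PER_SECOND) * 1000),
  );
}
```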
jrusso1020
left a comment
LGTM — sensible adaptive timeout
Clean separation of metadata-call timeout (30s, fixed) from upload-call timeout (size-derived, min 2 min). The uploadTimeoutMs formula is deliberately conservative: max(120s, byteLength / 500_000 * 1000) assumes a 500 KB/s minimum bandwidth (most consumer connections support multi-MB/s) and bottoms out at 2 minutes for tiny uploads.
Math sanity-check matches the PR description's table:
- 64 MB → max(120s, 128s) = 128s (vs old fixed 30s — would have timed out)
- 500 MB → max(120s, 1000s) = ~17 min
- 1 GB → max(120s, 2000s) = ~33 min
- 5 GB → max(120s, 10000s) = ~2.8 hours
Generous tail for slow connections — right default for a CLI publish where the user is presumably watching and can Ctrl-C if they want.
Symmetry with ef#36271
This PR's byteLength is whatever archive.buffer.byteLength resolves to on the user's machine — no upper bound. ef#36271 also removes the 64 MB cap on the server side. The two PRs are symmetric in being unbounded, which is internally consistent. My review on ef#36271 raises some adjacent concerns (zip-bomb guard removal, max_bytes=0 semantics) that are the more architecturally sensitive half of this pair — this CLI side is straightforward.
Non-blocking observations
A. No unit test for uploadTimeoutMs. Three values worth pinning:
```ts
expect(uploadTimeoutMs(0)).toBe(120_000); // floor at min
expect(uploadTimeoutMs(64 * 1024 * 1024)).toBeGreaterThan(120_000);
expect(uploadTimeoutMs(50 * 1024 * 1024)).toBe(120_000); // small enough to hit floor
```

Cheap, locks the formula's floor + slope. Without it, a future "let me adjust the timeout" refactor could silently regress to the old 30s without test signal.
B. The 500 KB/s assumption is documented in code only as the constant name (PUBLISH_UPLOAD_BYTES_PER_SECOND). A one-line comment explaining "conservative floor for slow / unstable connections" would help — most consumer connections are 5-50× faster, so a future reader might wonder if this is too generous.
C. AbortSignal upper bound. For genuinely huge uploads (e.g., 50 GB), the timeout becomes ~28 hours. JS AbortSignal.timeout should accept that, but at some point the practical / UX answer is "split into multipart" rather than "wait longer." Out of scope for this PR but worth a future thought: should the CLI auto-switch to multipart above some threshold?
D. No matching server-side timeout adjustment in ef#36271. If api-service / S3 has a 30s connection timeout for the staged-upload PUT, the CLI's longer timeout doesn't help — the server hangs up first. Probably fine because the S3 PUT is direct-to-S3 (not through api-service), but worth confirming.
Approving — the formula is correct, the conservative bandwidth floor is the right default for a CLI tool, and the symmetric unbounded shape matches ef#36271's intent.
— Review by Rames Jusso (pr-review)
vanceingalls
left a comment
Verdict: Comment — small, correct in spirit, but the math has a bug at the high end.
Paired with experiment-framework#36271 (server removes 64MB cap). This PR is useless without the server change but harmless if it lands first (old server still rejects >64MB at presign before any of this timeout code matters). New CLI + old server: clear server error. Old CLI + new server: 30s client-side abort on anything >~15MB — that's the bug this PR fixes. No coordinated merge required, but ship them close together.
Blockers
- blocker — packages/cli/src/utils/publishProject.ts:147-152 and the server's 30-min presigned URL expiry interact badly. The server (experiment-framework/.../project_logic.py: _PUBLISH_UPLOAD_EXPIRES_SECONDS = 30 * 60 = 1800s) signs the upload URL with a 30-minute TTL. Your timeout formula is bytes / 500_000 seconds. At 500 KB/s (your assumed rate) the URL expires at exactly 900 MB. Anything between ~900 MB and the new "5 GB" S3 single-PUT ceiling will hit a presigned-URL-expired 403 from S3 before your client timeout fires, and the user sees a cryptic S3 error (or your readErrorMessage 180-char truncation of one). Either:
  - Raise the client PUBLISH_UPLOAD_BYTES_PER_SECOND to a value that keeps the client timeout < server URL TTL (e.g. ~520 KB/s gives 1500s for ~780 MB — still leaves headroom), and document this coupling, OR
  - Bump server-side _PUBLISH_UPLOAD_EXPIRES_SECONDS so the URL outlives the worst-case client timeout, OR
  - Detect 403 from S3 and re-issue a fresh presigned URL via /upload and retry once (a one-retry shape is sketched after this list).

  Whichever you pick, this needs a decision before merge — the current setup creates a silent dead zone for ~900 MB to 5 GB uploads.
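One possible shape for the retry-once option. The two helpers are passed in as parameters because their real signatures aren't in this PR; everything here is illustrative:

```ts
type StagedUpload = { uploadUrl: string };

// Retry the S3 PUT once on 403, on the theory that the presigned URL
// expired mid-upload; a second 403 propagates to the caller.
async function putWithRefresh(
  archive: Uint8Array,
  requestStagedUpload: () => Promise<StagedUpload>, // hypothetical: calls /upload
  putToS3: (url: string, body: Uint8Array) => Promise<Response>, // hypothetical
): Promise<Response> {
  let res = await putToS3((await requestStagedUpload()).uploadUrl, archive);
  if (res.status === 403) {
    res = await putToS3((await requestStagedUpload()).uploadUrl, archive);
  }
  return res;
}
```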
Important
- important — packages/cli/src/utils/publishProject.ts:147-152 — 500 KB/s assumed throughput is pessimistic for typical broadband but optimistic for hotel/coffee-shop wifi. A 78 MB project on a 100 Mbps link uploads in ~10s; you give it 164s. A 500 MB project on a saturated 5 Mbps link needs 800s; you give it 1000s — fine. But a single hard-coded slope across two orders of magnitude of bandwidth is a blunt instrument. Two cleaner alternatives: (a) infinite/no client timeout once the PUT is in-flight (rely on TCP keepalive + cancellation only on user abort) — this is what aws s3 cp does; (b) progress-based timeout — reset the deadline on each chunk written, so a slow-but-progressing upload doesn't get killed (a sketch of (b) follows after this list). Option (a) is far simpler and matches user expectations.
- important — no test coverage. Zero tests for uploadTimeoutMs. Trivial cases worth covering: uploadTimeoutMs(0) === 120_000, monotonicity (uploadTimeoutMs(N+1) >= uploadTimeoutMs(N)), large-input behavior (uploadTimeoutMs(5 * 1024 * 1024 * 1024) doesn't overflow). Even a 5-line vitest is better than nothing. The PR description says "All 4 existing vitest tests pass" — fine, but none of them touch this function.
- important — the direct path uses the adaptive timeout too, but the server's direct path (/v1/hyperframes/projects/publish) is the multipart-upload-into-Flask path that fully buffers the request body in memory. With a 1 GB POST body, the AWS API gateway / nginx in front of the Python server is much more likely to have a body-size limit than the Python code itself. You're now telling the client to wait 33 minutes for an upload that the proxy will 413/504 in seconds. Verify the gateway limits before claiming this fix gets you to S3's 5 GB ceiling on the direct path. (The staged path is not affected by gateway body limits — it goes client → S3 directly.)
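A rough sketch of option (b), assuming a fetch-based PUT (Node 18+). The names putWithProgressTimeout and IDLE_TIMEOUT_MS are illustrative, not code from this PR:

```ts
const IDLE_TIMEOUT_MS = 60_000; // abort only after 60s with no forward progress

async function putWithProgressTimeout(url: string, body: Uint8Array): Promise<Response> {
  const controller = new AbortController();
  let timer = setTimeout(() => controller.abort(), IDLE_TIMEOUT_MS);
  const CHUNK = 64 * 1024;
  let offset = 0;

  // pull() only runs when the HTTP sink has consumed the previous chunk,
  // so each call is evidence of forward progress; reset the deadline then.
  const stream = new ReadableStream<Uint8Array>({
    pull(ctrl) {
      if (offset >= body.byteLength) {
        ctrl.close();
        return;
      }
      ctrl.enqueue(body.subarray(offset, offset + CHUNK));
      offset += CHUNK;
      clearTimeout(timer);
      timer = setTimeout(() => controller.abort(), IDLE_TIMEOUT_MS);
    },
  });

  try {
    return await fetch(url, {
      method: "PUT",
      body: stream,
      signal: controller.signal,
      // Node 18+ requires half-duplex for streamed request bodies.
      duplex: "half",
    } as RequestInit);
  } finally {
    clearTimeout(timer);
  }
}
```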
Nits
- nit — magic numbers should be named or commented. 500_000 is "500 KB/s assumed minimum throughput"; 120_000 is "2 minutes minimum even for tiny uploads". A one-line comment over each constant explains intent.
- nit — Math.ceil((byteLength / 500_000) * 1000) is the same as Math.ceil(byteLength / 500) for integer byteLength. Minor readability.
- nit — PUBLISH_METADATA_TIMEOUT_MS could keep the original name PUBLISH_REQUEST_TIMEOUT_MS since "request" was the original meaning — splitting into upload vs metadata is fine, just call out in the constant name what each is for ("PRESIGN/COMPLETE call timeout" vs "S3 PUT body timeout").
- nit — observability. A single console.error (or whatever the CLI's logger is) on timeout that includes the archive size and elapsed time would help debug the inevitable "publish hangs on huge project" reports (a minimal shape is sketched after this list).
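A minimal shape for that log line; the logger choice and message format are assumptions, not code from this PR:

```ts
// Emit archive size, elapsed time, and the computed limit on upload timeout,
// so "publish hangs on huge project" reports come with the numbers attached.
function logUploadTimeout(byteLength: number, startedAtMs: number, limitMs: number): void {
  console.error(
    `publish upload timed out after ${Date.now() - startedAtMs} ms ` +
      `(archive ${byteLength} bytes, timeout limit ${limitMs} ms)`,
  );
}
```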
Praise
- Splitting metadata timeout from upload timeout is the right factoring — those two operations have completely different latency profiles.
- 120s floor protects small uploads from getting unfairly killed if the formula goes weird.
- Diff is minimal and contained to one file.
— Vai
- Parse expires_in_seconds from the upload response and cap the S3 PUT timeout at TTL minus 30s safety margin — prevents the silent dead zone where the presigned URL expires before the client gives up
- Add unit tests for uploadTimeoutMs: floor at 120s, scales above 64 MB, always returns an integer
- Document the 500 KB/s bandwidth assumption
Addressed all review feedback in 55ea39e:
- Blocker — presigned URL TTL dead zone: Client now parses expires_in_seconds from the staged-upload response and caps the S3 PUT timeout at the URL TTL minus a 30s safety margin.
- Unit tests for uploadTimeoutMs: floor at 120s, scaling above it, integer return.
- 500 KB/s documentation: Added comment explaining it's a conservative floor for slow/unstable connections.
jrusso1020
left a comment
Re-review — LGTM, blockers addressed
The TTL-mismatch and missing-test concerns are cleanly resolved. Approving.
Verified
1. Presigned URL TTL is now respected. parseStagedUploadResponse reads expires_in_seconds from the staged-upload response (defaulting to 1800 if missing for backwards compat with old servers), and the S3 PUT timeout uses:
```ts
Math.min(uploadTimeoutMs(archive.buffer.byteLength), presignedUrlTtlMs)
```

…where presignedUrlTtlMs = stagedUpload.expiresInSeconds * 1000 - PUBLISH_METADATA_TIMEOUT_MS. The 30s subtraction is a generous buffer for metadata-call latency between "server signed the URL" and "client starts the PUT" — defends against signing-time clock skew and slow request paths. With ef#36271 bumping server TTL to 7200s, the client cap kicks in at ~3.6 GB (well above the 78 MB launch project that motivated the change). A consolidated sketch of the parse-and-clamp follows after item 3.
Vai's silent-dead-zone is closed — once the URL is past TTL, the client correctly aborts rather than continuing into a guaranteed S3 403.
2. uploadTimeoutMs is now exported and tested. Three new tests pin the contract:
```ts
expect(uploadTimeoutMs(0)).toBe(120_000); // floor
expect(uploadTimeoutMs(50 * 1024 * 1024)).toBe(120_000); // small still floors
expect(uploadTimeoutMs(64 * 1024 * 1024)).toBeGreaterThan(120_000); // scales
expect(uploadTimeoutMs(500 * 1024 * 1024)).toBeGreaterThan(900_000);
expect(Number.isInteger(uploadTimeoutMs(123_456))).toBe(true);
```

Locks both the floor + the scaling, plus the integer return type (against an accidental Math.ceil removal). Exactly the regression guard I'd asked for.
3. Bandwidth-floor comment landed:
```ts
// Conservative floor — most connections are faster, but this prevents
// premature aborts on slow/unstable networks (hotel wifi, tethering).
const PUBLISH_UPLOAD_BYTES_PER_SECOND = 500_000;
```

A future reader sees the rationale immediately. ✓
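Assembled from the excerpts above into one self-contained sketch; the StagedUpload shape and the helper name s3PutTimeoutMs are assumptions, and uploadTimeoutMs is the exported formula tested above:

```ts
import { uploadTimeoutMs } from "./publishProject"; // exported formula under review

const PUBLISH_METADATA_TIMEOUT_MS = 30_000; // fixed timeout for presign/complete calls

interface StagedUpload {
  uploadUrl: string;
  // Parsed from the response's expires_in_seconds field; defaults to 1800
  // when missing, for old-server backwards compat.
  expiresInSeconds: number;
}

// Cap the S3 PUT abort timeout at the presigned URL's remaining lifetime,
// minus a 30s margin for metadata-call latency and clock skew.
function s3PutTimeoutMs(byteLength: number, staged: StagedUpload): number {
  const presignedUrlTtlMs =
    staged.expiresInSeconds * 1000 - PUBLISH_METADATA_TIMEOUT_MS;
  return Math.min(uploadTimeoutMs(byteLength), presignedUrlTtlMs);
}
```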
Non-blocking observations
A. Default expiresInSeconds = 1800 for old-server backwards compat. Sensible — but worth noting that this mirrors the old server TTL. Once ef#36271's 7200s TTL is fully rolled out, the fallback could be tightened to 7200 too (or just removed entirely). Cheap follow-up.
B. AbortSignal upper bound. Same observation as before — for genuinely huge uploads (50 GB+), the timeout becomes hours. With the TTL cap now in place, this is naturally bounded by presignedUrlTtlMs (max 7170s ≈ 2 hours). So the multipart-upgrade question is now "do we ever expect to need >2 hours per chunk?" rather than "do we trust the user not to wait forever?" Better.
C. Direct-path (publishProjectArchiveDirect) doesn't get the TTL cap because it's a single POST to api-service — no presigned URL involved. The uploadTimeoutMs(archive.buffer.byteLength) cap there is the only ceiling, and it can grow unboundedly with archive size. If api-service / nginx / ALB has its own request-body timeout (typically 60-120s default), the direct path will hit the gateway 504 before the client gives up. Worth a quick verification — but probably acceptable since direct is the legacy fallback path.
Bottom line
All three of Vai's concerns + my non-blocking notes addressed. Stamping. Land alongside ef#36271 — together they close the upload-size gap end-to-end.
— Review by Rames Jusso (pr-review)
vanceingalls
left a comment
Verdict: Approve — blockers addressed, fix is clean.
Re-reviewed at 55ea39e against my prior review at c3a0b4e. Cross-ref: experiment-framework#36271 (server) — still at Comment there for unrelated server-side issues (direct-path RAM, unauthenticated routes), but the TTL coordination required for this PR is in place.
Resolved
- Blocker #1 — presigned URL TTL dead zone. ✅ Two-sided fix:
  - Server (ef#36271) bumped TTL 30min → 2h.
  - Client (publishProject.ts:79–80, 274–280) now parses expires_in_seconds from the staged-upload response and clamps the abort timeout to min(uploadTimeoutMs(bytes), expires_in_seconds * 1000 − 30s). With the 2h TTL, the only uploads where the cap bites are >~3.6 GB at 500 KB/s, already brushing S3's 5 GB single-PUT ceiling. Dead zone closed.
- Important #2 — unit tests for uploadTimeoutMs. ✅ Asserts in publishProject.test.ts:39–53 cover: floor at 120s for 0/50MB, scales above floor for 64MB+ and 500MB+, returns integer. Locks the formula's behavior at the three points I asked for.
- Important — 500 KB/s assumption documentation. ✅ Inline comment at publishProject.ts:10–11 explains "conservative floor for slow/unstable connections (hotel wifi, tethering)". Future maintainers will get the intent.
Still open (acknowledged, non-blocking)
- Important #3 — gateway/nginx GB-body verification on direct path. Not addressed in this PR. The staged path goes client → S3 directly so it's unaffected by api-service proxy limits. Direct path (/v1/hyperframes/projects/publish) is the residual concern, but that's a server-side question — calling it out on ef#36271 instead.
New findings since c3a0b4e
- nit — fallback default expires_in_seconds = 1800 at publishProject.ts:82 is now stale. With the server bumped to 7200s, this fallback only fires when (a) the field is missing, or (b) it's not a positive number. New CLI + new server: 7200 always. New CLI + old server: still 1800, still safe. No bug, but a comment like // matches old server default; new server returns 7200 would help debugging.
- nit — defensive guard on presignedUrlTtlMs. expires_in_seconds * 1000 − 30_000 could go negative if a future server returns TTL < 30s, and AbortSignal.timeout throws on a negative value. Server-side floor is 1800s today so this is theoretical, but a Math.max(60_000, …) floor would be free insurance (sketched after this list).
- praise — a del zip_bytes equivalent isn't needed here, but the discipline of capping the abort signal at the server's TTL (not just client-side bandwidth) is the right separation of concerns. Good fix shape.
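A minimal form of that guard; the function name is an assumption:

```ts
const PUBLISH_METADATA_TIMEOUT_MS = 30_000;

// Keep the abort budget positive even if a future server returns a TTL
// shorter than the 30s metadata margin.
function presignedUrlTtlMs(expiresInSeconds: number): number {
  return Math.max(60_000, expiresInSeconds * 1000 - PUBLISH_METADATA_TIMEOUT_MS);
}
```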
Coordination with ef#36271
- Server TTL bump (30min → 2h) is committed at the PR head — this CLI's TTL parsing has something to clamp against.
- Server response field expires_in_seconds is already in the DTO (HyperframesProjectUploadResponse.expires_in_seconds: int) — no breaking change to land this independently.
- Old CLI + new server: harmless (old CLI uses a fixed 30s timeout and aborts well within the 2h TTL on small uploads — no regression).
- New CLI + old server: harmless (CLI's 1800s fallback matches old server's actual TTL — no dead zone).
Recommendation
Ship it. The core formula is right, the TTL coupling closes the dead-zone blocker, and the tests are sufficient to catch silent regressions on the floor and slope. Approving.
— Vai
Summary
max(120s, bytes / 500 KB/s)

Context
With the backend size limit removed, large projects (78 MB+) need proportionally longer to upload. A 78 MB project now gets ~164s, a 500 MB project ~17 min. The old 30s timeout would abort any upload over ~15 MB at the assumed 500 KB/s rate.
Test plan