
fix(cli): use size-adaptive timeouts for publish uploads #635

Merged
miguel-heygen merged 2 commits into main from fix/publish-upload-timeout on May 5, 2026

Conversation

@miguel-heygen
Collaborator

Summary

  • Replace the flat 30s upload timeout with a size-adaptive calculation: max(120s, bytes / 500KB/s)
  • Metadata requests (presigned URL, complete) keep the original 30s timeout
  • Companion to the backend change removing the 64 MB upload limit in experiment-framework

Context

With the backend size limit removed, large projects (78 MB+) need proportionally longer to upload. A 78 MB project now gets ~164s, a 500 MB project ~17 min. The old 30s timeout would abort any upload over ~15 MB on a typical connection.
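
For readers outside the CLI repo, a minimal sketch of the formula described above. The constant PUBLISH_UPLOAD_BYTES_PER_SECOND is the name mentioned later in review; the floor constant's name here is illustrative, and the actual code in packages/cli/src/utils/publishProject.ts may differ in detail:

// Sketch only — mirrors the formula described in this PR, not the exact source.
const PUBLISH_UPLOAD_BYTES_PER_SECOND = 500_000; // conservative 500 KB/s floor
const PUBLISH_UPLOAD_MIN_TIMEOUT_MS = 120_000;   // illustrative name for the 2-minute floor

function uploadTimeoutMs(byteLength: number): number {
  const transferMs = Math.ceil((byteLength / PUBLISH_UPLOAD_BYTES_PER_SECOND) * 1000);
  return Math.max(PUBLISH_UPLOAD_MIN_TIMEOUT_MS, transferMs);
}

// 78 MiB → ~164s, 500 MiB → ~17.5 min, matching the figures above.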

Test plan

  • All 4 existing vitest tests pass
  • Build succeeds, no type errors
  • Lint + format pass (oxlint + oxfmt)
  • Timeout values verified for 10/78/200/500/1000 MB archives

The flat 30s timeout caused large project uploads to abort before the
S3 PUT could complete. Compute timeout from archive size at ~500 KB/s
with a 120s floor, so a 78 MB project gets ~164s and a 500 MB project
gets ~17 min. Metadata requests keep the original 30s timeout.
Collaborator

@jrusso1020 jrusso1020 left a comment


LGTM — sensible adaptive timeout

Clean separation of metadata-call timeout (30s, fixed) from upload-call timeout (size-derived, min 2 min). The uploadTimeoutMs formula is appropriately conservative: max(120s, byteLength / 500 KB/s), which assumes a 500 KB/s minimum bandwidth (very conservative — most consumer connections sustain several MB/s) and bottoms out at 2 minutes for tiny uploads.

Math sanity-check matches the PR description's table:

  • 64 MB → max(120s, 128s) = 128s (vs old fixed 30s — would have timed out)
  • 500 MB → max(120s, 1000s) = ~17 min
  • 1 GB → max(120s, 2000s) = ~33 min
  • 5 GB → max(120s, 10000s) = ~2.8 hours

Generous tail for slow connections — right default for a CLI publish where the user is presumably watching and can Ctrl-C if they want.

Symmetry with ef#36271

This PR's byteLength is whatever archive.buffer.byteLength resolves to on the user's machine — no upper bound. ef#36271 also removes the 64 MB cap on the server side. The two PRs are symmetric in being unbounded, which is internally consistent. My review on ef#36271 raises some adjacent concerns (zip-bomb guard removal, max_bytes=0 semantics) that are the more architecturally sensitive half of this pair — this CLI side is straightforward.

Non-blocking observations

A. No unit test for uploadTimeoutMs. Three values worth pinning:

expect(uploadTimeoutMs(0)).toBe(120_000);                  // floor at min
expect(uploadTimeoutMs(64 * 1024 * 1024)).toBeGreaterThan(120_000);
expect(uploadTimeoutMs(50 * 1024 * 1024)).toBe(120_000);   // small enough to hit floor

Cheap, locks the formula's floor + slope. Without it, a future "let me adjust the timeout" refactor could silently regress to the old 30s without test signal.

B. The 500 KB/s assumption is documented in code only as the constant name (PUBLISH_UPLOAD_BYTES_PER_SECOND). A one-line comment explaining "conservative floor for slow / unstable connections" would help — most consumer connections are 5-50× faster, so a future reader might wonder if this is too generous.

C. AbortSignal upper bound. For genuinely huge uploads (e.g., 50 GB), the timeout becomes ~28 hours. JS AbortSignal.timeout should accept that, but at some point the practical / UX answer is "split into multipart" rather than "wait longer." Out of scope for this PR but worth a future thought: should the CLI auto-switch to multipart above some threshold?

D. No matching server-side timeout adjustment in ef#36271. If api-service / S3 has a 30s connection timeout for the staged-upload PUT, the CLI's longer timeout doesn't help — the server hangs up first. Probably fine because the S3 PUT is direct-to-S3 (not through api-service), but worth confirming.

Approving — the formula is correct, the conservative bandwidth floor is the right default for a CLI tool, and the symmetric unbounded shape matches ef#36271's intent.

— Review by Rames Jusso (pr-review)

Collaborator

@vanceingalls vanceingalls left a comment


Verdict: Comment — small, correct in spirit, but the math has a bug at the high end.

Paired with experiment-framework#36271 (server removes 64MB cap). This PR is useless without the server change but harmless if it lands first (old server still rejects >64MB at presign before any of this timeout code matters). New CLI + old server: clear server error. Old CLI + new server: 30s client-side abort on anything >~15MB — that's the bug this PR fixes. No coordinated merge required, but ship them close together.

Blockers

  • blocker — packages/cli/src/utils/publishProject.ts:147-152 and the server's 30-min presigned URL expiry interact badly. The server (experiment-framework/.../project_logic.py:_PUBLISH_UPLOAD_EXPIRES_SECONDS = 30 * 60 = 1800s) signs the upload URL with a 30-minute TTL. Your timeout formula is bytes / 500_000 seconds. At 500 KB/s (your assumed rate) the URL expires at exactly 900 MB. Anything between ~900 MB and the new "5 GB" S3 single-PUT ceiling will hit a presigned-URL-expired 403 from S3 before your client timeout fires, and the user sees a cryptic S3 error (or your readErrorMessage 180-char truncation of one). Either:
    1. Raise client PUBLISH_UPLOAD_BYTES_PER_SECOND to a value that keeps the client timeout < server URL TTL (e.g. ~520 KB/s gives 1500s for a ~780 MB archive — still leaves headroom), and document this coupling, OR
    2. Bump server-side _PUBLISH_UPLOAD_EXPIRES_SECONDS so the URL outlives the worst-case client timeout, OR
    3. Detect 403 from S3 and re-issue a fresh presigned URL via /upload and retry once.
      Whichever you pick, this needs a decision before merge — the current setup creates a silent dead zone for ~900 MB to 5 GB uploads. (A sketch of what option 3 could look like follows this list.)
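
To make option 3 concrete, a hypothetical sketch — not in this PR; the helper names (requestPresignedUploadUrl, putToS3) are placeholders for whatever the CLI's actual helpers are:

// Hypothetical sketch of option 3: retry once with a fresh presigned URL if the
// PUT comes back 403 (likely an expired signature). Helper names are placeholders.
async function putWithRepresignRetry(
  archive: Uint8Array,
  requestPresignedUploadUrl: () => Promise<string>,
  putToS3: (url: string, body: Uint8Array) => Promise<Response>,
): Promise<Response> {
  // First attempt with a freshly issued URL.
  let response = await putToS3(await requestPresignedUploadUrl(), archive);
  if (response.status === 403) {
    // Signature likely expired; re-issue via /upload and retry exactly once.
    response = await putToS3(await requestPresignedUploadUrl(), archive);
  }
  return response;
}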

Important

  • important — packages/cli/src/utils/publishProject.ts:147-152 — 500 KB/s assumed throughput is pessimistic for typical broadband but optimistic for hotel/coffee-shop wifi. A 78 MB project on a 100 Mbps link uploads in ~10s; you give it 164s. A 500 MB project on a saturated 5 Mbps link needs 800s; you give it 1000s — fine. But a single hard-coded slope across two orders of magnitude of bandwidth is a blunt instrument. Two cleaner alternatives: (a) infinite/no client timeout once the PUT is in-flight (rely on TCP keepalive + cancellation only on user abort) — this is what aws s3 cp does; (b) progress-based timeout — reset the deadline on each chunk written, so a slow-but-progressing upload doesn't get killed. Option (a) is far simpler and matches user expectations. (A sketch of option (b) follows this list.)

  • important — no test coverage. Zero tests for uploadTimeoutMs. Trivial cases worth covering: uploadTimeoutMs(0) === 120_000, monotonicity (uploadTimeoutMs(N+1) >= uploadTimeoutMs(N)), large-input behavior (uploadTimeoutMs(5 * 1024 * 1024 * 1024) doesn't overflow). Even a 5-line vitest is better than nothing. The PR description says "All 4 existing vitest tests pass" — fine, but none of them touch this function.

  • important — the direct path uses the adaptive timeout too, but the server's direct path (/v1/hyperframes/projects/publish) is the multipart-upload-into-Flask route that fully buffers the request body in memory. With a 1 GB POST body, the AWS API gateway / nginx in front of the Python server is much more likely to have a body-size limit than the Python code itself. You're now telling the client to wait 33 minutes for an upload that the proxy will 413/504 in seconds. Verify the gateway limits before claiming this fix gets you to S3's 5 GB ceiling on the direct path. (The staged path is not affected by gateway body limits — it goes browser → S3 directly.)
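
As referenced in the first bullet above, a hypothetical sketch of option (b) — an idle-reset timeout that aborts only when no data moves for a while. This is not what the PR implements, and it assumes the archive can be streamed as the request body so progress is observable (streaming fetch bodies need duplex: "half" on Node 18+):

// Hypothetical option (b): reset the abort deadline each time the network pulls
// another chunk, so a slow-but-progressing upload is never killed.
function idleResetUploadBody(archive: Uint8Array, idleMs: number) {
  const controller = new AbortController();
  let timer = setTimeout(() => controller.abort(), idleMs);
  const chunkSize = 64 * 1024;
  let offset = 0;

  const body = new ReadableStream<Uint8Array>({
    pull(ctrl) {
      if (offset >= archive.byteLength) {
        clearTimeout(timer); // upload fully handed off; stop the idle clock
        ctrl.close();
        return;
      }
      // A pull means fetch wants more data — treat it as progress and reset the deadline.
      clearTimeout(timer);
      timer = setTimeout(() => controller.abort(), idleMs);
      ctrl.enqueue(archive.subarray(offset, offset + chunkSize));
      offset += chunkSize;
    },
    cancel() {
      clearTimeout(timer);
    },
  });

  return { body, signal: controller.signal };
}

// Usage sketch: fetch(presignedUrl, { method: "PUT", body, signal, duplex: "half" })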

Nits

  • nit — magic numbers should be named or commented. 500_000 is "500 KB/s assumed minimum throughput"; 120_000 is "2 minutes minimum even for tiny uploads". A one-line comment over each constant explains intent.
  • nit — Math.ceil((byteLength / 500_000) * 1000) is the same as Math.ceil(byteLength / 500) for integer byteLength. Minor readability.
  • nit — PUBLISH_METADATA_TIMEOUT_MS could keep the original name PUBLISH_REQUEST_TIMEOUT_MS since "request" was the original meaning — splitting into upload vs metadata is fine, just call out in the constant name what each is for ("PRESIGN/COMPLETE call timeout" vs "S3 PUT body timeout").
  • nit — observability. A single console.error (or whatever the CLI's logger is) on timeout that includes the archive size and elapsed time would help debug the inevitable "publish hangs on huge project" reports. (A sketch follows this list.)
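
A sketch of the shape that log could take — hypothetical, using console.error only because the CLI's actual logger isn't specified here:

// Hypothetical timeout log: surfaces archive size and elapsed time on abort.
function logUploadTimeout(byteLength: number, startedAtMs: number): void {
  const elapsedSeconds = Math.round((Date.now() - startedAtMs) / 1000);
  const sizeMb = (byteLength / (1024 * 1024)).toFixed(1);
  console.error(`publish upload aborted after ${elapsedSeconds}s (archive ${sizeMb} MB)`);
}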

Praise

  • Splitting metadata timeout from upload timeout is the right factoring — those two operations have completely different latency profiles.
  • 120s floor protects small uploads from getting unfairly killed if the formula goes weird.
  • Diff is minimal and contained to one file.

— Vai

- Parse expires_in_seconds from the upload response and cap the S3 PUT
  timeout at TTL minus 30s safety margin — prevents the silent dead
  zone where the presigned URL expires before the client gives up
- Add unit tests for uploadTimeoutMs: floor at 120s, scales above 64 MB,
  always returns an integer
- Document the 500 KB/s bandwidth assumption
@miguel-heygen
Collaborator Author

Addressed all review feedback in 55ea39e:

Blocker — presigned URL TTL dead zone: Client now parses expires_in_seconds from the upload response and caps the S3 PUT timeout at TTL − 30s safety margin. Combined with the server-side TTL bump (30 min → 2 hours in ef#36271), the dead zone is eliminated — uploads up to ~3.6 GB at 500 KB/s fit within the 2-hour window.

Unit tests for uploadTimeoutMs: Added 3 tests — floor at 120s for small files, scales above floor for 64 MB+, always returns an integer.

500 KB/s documentation: Added comment explaining it's a conservative floor for slow/unstable connections.
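
A sketch of how the cap described above fits together, assuming the field and constant names quoted in the reviews (expires_in_seconds, PUBLISH_METADATA_TIMEOUT_MS) and the uploadTimeoutMs formula sketched near the top of this thread; the names effectiveUploadTimeoutMs and DEFAULT_PRESIGNED_TTL_SECONDS are illustrative, and the committed code may differ:

const PUBLISH_METADATA_TIMEOUT_MS = 30_000;  // fixed timeout for presign/complete calls
const DEFAULT_PRESIGNED_TTL_SECONDS = 1_800; // fallback when an old server omits the field

// uploadTimeoutMs as sketched earlier in this thread.
function effectiveUploadTimeoutMs(
  byteLength: number,
  expiresInSeconds: number = DEFAULT_PRESIGNED_TTL_SECONDS,
): number {
  // Leave a 30s margin so the PUT begins well before the presigned URL expires.
  const presignedUrlTtlMs = expiresInSeconds * 1000 - PUBLISH_METADATA_TIMEOUT_MS;
  return Math.min(uploadTimeoutMs(byteLength), presignedUrlTtlMs);
}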

Collaborator

@jrusso1020 jrusso1020 left a comment


Re-review — LGTM, blockers addressed

The TTL-mismatch and missing-test concerns are cleanly resolved. Approving.

Verified

1. Presigned URL TTL is now respected. parseStagedUploadResponse reads expires_in_seconds from the staged-upload response (defaulting to 1800 if missing for backwards compat with old servers), and the S3 PUT timeout uses:

Math.min(uploadTimeoutMs(archive.buffer.byteLength), presignedUrlTtlMs)

…where presignedUrlTtlMs = stagedUpload.expiresInSeconds * 1000 - PUBLISH_METADATA_TIMEOUT_MS. The 30s subtraction is a generous buffer for metadata-call latency between "server signed the URL" and "client starts the PUT" — defends against signing-time clock skew and slow request paths. With ef#36271 bumping server TTL to 7200s, the client cap kicks in at ~3.6 GB (well above the 78 MB launch project that motivated the change).

Vai's silent-dead-zone is closed — once the URL is past TTL, the client correctly aborts rather than continuing into a guaranteed S3 403.

2. uploadTimeoutMs is now exported and tested. Three new tests pin the contract:

expect(uploadTimeoutMs(0)).toBe(120_000);                // floor
expect(uploadTimeoutMs(50 * 1024 * 1024)).toBe(120_000); // small still floors
expect(uploadTimeoutMs(64 * 1024 * 1024)).toBeGreaterThan(120_000); // scales
expect(uploadTimeoutMs(500 * 1024 * 1024)).toBeGreaterThan(900_000);
expect(Number.isInteger(uploadTimeoutMs(123_456))).toBe(true);

Locks both the floor + the scaling, plus the integer return type (against an accidental Math.ceil removal). Exactly the regression guard I'd asked for.

3. Bandwidth-floor comment landed:

// Conservative floor — most connections are faster, but this prevents
// premature aborts on slow/unstable networks (hotel wifi, tethering).
const PUBLISH_UPLOAD_BYTES_PER_SECOND = 500_000;

A future reader sees the rationale immediately. ✓

Non-blocking observations

A. Default expiresInSeconds = 1800 for old-server backwards compat. Sensible — but worth noting that this mirrors the old server TTL. Once ef#36271's 7200s TTL is fully rolled out, the fallback could be tightened to 7200 too (or just removed entirely). Cheap follow-up.

B. AbortSignal upper bound. Same observation as before — for genuinely huge uploads (50 GB+), the timeout becomes hours. With the TTL cap now in place, this is naturally bounded by presignedUrlTtlMs (max 7170s ≈ 2 hours). So the multipart-upgrade question is now "do we ever expect to need >2 hours per chunk?" rather than "do we trust the user not to wait forever?" Better.

C. Direct-path (publishProjectArchiveDirect) doesn't get the TTL cap because it's a single POST to api-service — no presigned URL involved. The uploadTimeoutMs(archive.buffer.byteLength) cap there is the only ceiling, and it can grow unboundedly with archive size. If api-service / nginx / ALB has its own request-body timeout (typically 60-120s default), the direct path will hit the gateway 504 before the client gives up. Worth a quick verification — but probably acceptable since direct is the legacy fallback path.

Bottom line

All three of Vai's concerns + my non-blocking notes addressed. Stamping. Land alongside ef#36271 — together they close the upload-size gap end-to-end.

— Review by Rames Jusso (pr-review)

@miguel-heygen miguel-heygen merged commit 21ec5f8 into main May 5, 2026
28 checks passed
@miguel-heygen miguel-heygen deleted the fix/publish-upload-timeout branch May 5, 2026 22:00
Collaborator

@vanceingalls vanceingalls left a comment


Verdict: Approve — blockers addressed, fix is clean.

Re-reviewed at 55ea39e against my prior review at c3a0b4e. Cross-ref: experiment-framework#36271 (server) — still at Comment there for unrelated server-side issues (direct-path RAM, unauthenticated routes), but the TTL coordination required for this PR is in place.

Resolved

  • Blocker #1 — presigned URL TTL dead zone. ✅ Two-sided fix:
    • Server (ef#36271) bumped TTL 30min → 2h.
    • Client (publishProject.ts:79–80, 274–280) now parses expires_in_seconds from the staged-upload response and clamps the abort timeout to min(uploadTimeoutMs(bytes), expires_in_seconds * 1000 − 30s). With the 2h TTL, the cap only bites above ~3.6 GB at 500 KB/s, which is already near S3's 5 GB single-PUT ceiling. Dead zone closed.
  • Important #2 — unit tests for uploadTimeoutMs. ✅ Three test cases in publishProject.test.ts:39–53: floor at 120s for 0/50MB, scales above floor for 64MB+ and 500MB+, returns integer. Locks the formula's behavior at the three points I asked for.
  • Important — 500 KB/s assumption documentation. ✅ Inline comment at publishProject.ts:10–11 explains "conservative floor for slow/unstable connections (hotel wifi, tethering)". Future maintainers will get the intent.

Still open (acknowledged, non-blocking)

  • Important #3 — gateway/nginx GB-body verification on direct path. Not addressed in this PR. The staged path goes browser → S3 directly so it's unaffected by api-service proxy limits. Direct path (/v1/hyperframes/projects/publish) is the residual concern, but that's a server-side question — calling it out on ef#36271 instead.

New findings since c3a0b4e

  • nit — fallback default expires_in_seconds = 1800 at publishProject.ts:82 is now stale. With the server bumped to 7200s, this fallback only fires when (a) the field is missing, or (b) it's not a positive number. New CLI + new server: 7200 always. New CLI + old server: still 1800, still safe. No bug, but a comment like // matches old server default; new server returns 7200 would help debugging.
  • nit — defensive guard on presignedUrlTtlMs. expires_in_seconds * 1000 − 30_000 could go negative if a future server returns TTL < 30s, and a negative delay to AbortSignal.timeout either throws or aborts immediately depending on the runtime. Server-side floor is 1800s today so this is theoretical, but a Math.max(60_000, …) floor would be free insurance (sketched after this list).
  • praise — del zip_bytes equivalent isn't needed here, but the discipline of capping the abort signal at the server's TTL (not just client-side bandwidth) is the right separation of concerns. Good fix shape.
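
The guard from the second nit above, as a small sketch — names are illustrative, reusing uploadTimeoutMs from the sketch near the top of this thread:

// Hypothetical guard: never hand AbortSignal.timeout a value below one minute,
// even if a future server returns a pathologically short TTL.
function guardedAbortDelayMs(byteLength: number, expiresInSeconds: number): number {
  const presignedUrlTtlMs = expiresInSeconds * 1000 - 30_000;
  return Math.max(60_000, Math.min(uploadTimeoutMs(byteLength), presignedUrlTtlMs));
}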

Coordination with ef#36271

  • Server TTL bump (30min → 2h) is committed at the PR head — this CLI's TTL parsing has something to clamp against.
  • Server response field expires_in_seconds is already in the DTO (HyperframesProjectUploadResponse.expires_in_seconds: int) — no breaking change to land this independently.
  • Old CLI + new server: harmless (old CLI still uses the fixed 30s timeout, which is comfortably within the 2h TTL — no regression).
  • New CLI + old server: harmless (CLI's 1800s fallback matches old server's actual TTL — no dead zone).

Recommendation

Ship it. The core formula is right, the TTL coupling closes the dead-zone blocker, and the tests are sufficient to catch silent regressions on the floor and slope. Approving.

— Vai

@miguel-heygen miguel-heygen restored the fix/publish-upload-timeout branch May 5, 2026 22:18