fix(cli): use size-adaptive timeouts for publish uploads #635
miguel-heygen merged 2 commits into main
Conversation
The flat 30s timeout caused large project uploads to abort before the S3 PUT could complete. Compute timeout from archive size at ~500 KB/s with a 120s floor, so a 78 MB project gets ~164s and a 500 MB project gets ~17 min. Metadata requests keep the original 30s timeout.
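In sketch form, the computation looks like this (uploadTimeoutMs, PUBLISH_UPLOAD_BYTES_PER_SECOND, and the Math.ceil rounding are all quoted in the reviews below; the name of the floor constant is an assumption for illustration):

```ts
// Conservative assumed minimum throughput: 500 KB/s.
const PUBLISH_UPLOAD_BYTES_PER_SECOND = 500_000;
// Never time out an upload in under 2 minutes, however small.
const PUBLISH_UPLOAD_MIN_TIMEOUT_MS = 120_000; // name assumed for illustration

export function uploadTimeoutMs(byteLength: number): number {
  return Math.max(
    PUBLISH_UPLOAD_MIN_TIMEOUT_MS,
    Math.ceil((byteLength / PUBLISH_UPLOAD_BYTES_PER_SECOND) * 1000),
  );
}
```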
jrusso1020
left a comment
LGTM — sensible adaptive timeout
Clean separation of metadata-call timeout (30s, fixed) from upload-call timeout (size-derived, min 2 min). The uploadTimeoutMs formula is deliberately conservative: max(120s, byteLength / 500_000 * 1000) assumes a 500 KB/s minimum bandwidth (most consumer connections support multi-MB/s) and bottoms out at 2 minutes for tiny uploads.
Math sanity-check matches the PR description's table:
- 64 MB → max(120s, 128s) = 128s (vs old fixed 30s — would have timed out)
- 500 MB → max(120s, 1000s) = ~17 min
- 1 GB → max(120s, 2000s) = ~33 min
- 5 GB → max(120s, 10000s) = ~2.8 hours
Generous tail for slow connections — right default for a CLI publish where the user is presumably watching and can Ctrl-C if they want.
Symmetry with ef#36271
This PR's byteLength is whatever archive.buffer.byteLength resolves to on the user's machine — no upper bound. ef#36271 also removes the 64 MB cap on the server side. The two PRs are symmetric in being unbounded, which is internally consistent. My review on ef#36271 raises some adjacent concerns (zip-bomb guard removal, max_bytes=0 semantics) that are the more architecturally sensitive half of this pair — this CLI side is straightforward.
Non-blocking observations
A. No unit test for uploadTimeoutMs. Three values worth pinning:
```ts
expect(uploadTimeoutMs(0)).toBe(120_000); // floor at min
expect(uploadTimeoutMs(64 * 1024 * 1024)).toBeGreaterThan(120_000);
expect(uploadTimeoutMs(50 * 1024 * 1024)).toBe(120_000); // small enough to hit floor
```

Cheap, locks the formula's floor + slope. Without it, a future "let me adjust the timeout" refactor could silently regress to the old 30s without test signal.
B. The 500 KB/s assumption is documented in code only as the constant name (PUBLISH_UPLOAD_BYTES_PER_SECOND). A one-line comment explaining "conservative floor for slow / unstable connections" would help — most consumer connections are 5-50× faster, so a future reader might wonder if this is too generous.
C. AbortSignal upper bound. For genuinely huge uploads (e.g., 50 GB), the timeout becomes ~28 hours. JS AbortSignal.timeout should accept that, but at some point the practical / UX answer is "split into multipart" rather than "wait longer." Out of scope for this PR but worth a future thought: should the CLI auto-switch to multipart above some threshold?
D. No matching server-side timeout adjustment in ef#36271. If api-service / S3 has a 30s connection timeout for the staged-upload PUT, the CLI's longer timeout doesn't help — the server hangs up first. Probably fine because the S3 PUT is direct-to-S3 (not through api-service), but worth confirming.
Approving — the formula is correct, the conservative bandwidth floor is the right default for a CLI tool, and the symmetric unbounded shape matches ef#36271's intent.
— Review by Rames Jusso (pr-review)
vanceingalls
left a comment
Verdict: Comment — small, correct in spirit, but the math has a bug at the high end.
Paired with experiment-framework#36271 (server removes 64MB cap). This PR is useless without the server change but harmless if it lands first (old server still rejects >64MB at presign before any of this timeout code matters). New CLI + old server: clear server error. Old CLI + new server: 30s client-side abort on anything >~15MB — that's the bug this PR fixes. No coordinated merge required, but ship them close together.
Blockers
- blocker — packages/cli/src/utils/publishProject.ts:147-152 and the server's 30-min presigned URL expiry interact badly. The server (experiment-framework/.../project_logic.py: _PUBLISH_UPLOAD_EXPIRES_SECONDS = 30 * 60 = 1800s) signs the upload URL with a 30-minute TTL. Your timeout formula is bytes / 500_000 seconds. At 500 KB/s (your assumed rate) the URL expires at exactly 900 MB. Anything between ~900 MB and the new "5 GB" S3 single-PUT ceiling will hit a presigned-URL-expired 403 from S3 before your client timeout fires, and the user sees a cryptic S3 error (or your readErrorMessage 180-char truncation of one). Either:
  - Raise the client PUBLISH_UPLOAD_BYTES_PER_SECOND to a value that keeps the client timeout < server URL TTL (e.g. ~520 KB/s gives 1500s for ~780 MB — still leaves headroom), and document this coupling, OR
  - Bump server-side _PUBLISH_UPLOAD_EXPIRES_SECONDS so the URL outlives the worst-case client timeout, OR
  - Detect 403 from S3 and re-issue a fresh presigned URL via /upload and retry once (a one-retry shape is sketched after this list).

  Whichever you pick, this needs a decision before merge — the current setup creates a silent dead zone for ~900 MB to 5 GB uploads.
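One possible shape for the retry-once option. The two helpers are passed in as parameters because their real signatures aren't in this PR; everything here is illustrative:

```ts
type StagedUpload = { uploadUrl: string };

// Retry the S3 PUT once on 403, on the theory that the presigned URL
// expired mid-upload; a second 403 propagates to the caller.
async function putWithRefresh(
  archive: Uint8Array,
  requestStagedUpload: () => Promise<StagedUpload>, // hypothetical: calls /upload
  putToS3: (url: string, body: Uint8Array) => Promise<Response>, // hypothetical
): Promise<Response> {
  let res = await putToS3((await requestStagedUpload()).uploadUrl, archive);
  if (res.status === 403) {
    res = await putToS3((await requestStagedUpload()).uploadUrl, archive);
  }
  return res;
}
```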
Important
- important — packages/cli/src/utils/publishProject.ts:147-152 — 500 KB/s assumed throughput is pessimistic for typical broadband but optimistic for hotel/coffee-shop wifi. A 78 MB project on a 100 Mbps link uploads in ~10s; you give it 164s. A 500 MB project on a saturated 5 Mbps link needs 800s; you give it 1000s — fine. But a single hard-coded slope across two orders of magnitude of bandwidth is a blunt instrument. Two cleaner alternatives: (a) infinite/no client timeout once the PUT is in-flight (rely on TCP keepalive + cancellation only on user abort) — this is what aws s3 cp does; (b) progress-based timeout — reset the deadline on each chunk written, so a slow-but-progressing upload doesn't get killed (a sketch of (b) follows after this list). Option (a) is far simpler and matches user expectations.
- important — no test coverage. Zero tests for uploadTimeoutMs. Trivial cases worth covering: uploadTimeoutMs(0) === 120_000, monotonicity (uploadTimeoutMs(N+1) >= uploadTimeoutMs(N)), large-input behavior (uploadTimeoutMs(5 * 1024 * 1024 * 1024) doesn't overflow). Even a 5-line vitest is better than nothing. The PR description says "All 4 existing vitest tests pass" — fine, but none of them touch this function.
- important — the direct path uses the adaptive timeout too, but the server's direct path (/v1/hyperframes/projects/publish) is the multipart-upload-into-Flask path that fully buffers the request body in memory. With a 1 GB POST body, the AWS API gateway / nginx in front of the Python server is much more likely to have a body-size limit than the Python code itself. You're now telling the client to wait 33 minutes for an upload that the proxy will 413/504 in seconds. Verify the gateway limits before claiming this fix gets you to S3's 5 GB ceiling on the direct path. (The staged path is not affected by gateway body limits — it goes client → S3 directly.)
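A rough sketch of option (b), assuming a fetch-based PUT (Node 18+). The names putWithProgressTimeout and IDLE_TIMEOUT_MS are illustrative, not code from this PR:

```ts
const IDLE_TIMEOUT_MS = 60_000; // abort only after 60s with no forward progress

async function putWithProgressTimeout(url: string, body: Uint8Array): Promise<Response> {
  const controller = new AbortController();
  let timer = setTimeout(() => controller.abort(), IDLE_TIMEOUT_MS);
  const CHUNK = 64 * 1024;
  let offset = 0;

  // pull() only runs when the HTTP sink has consumed the previous chunk,
  // so each call is evidence of forward progress; reset the deadline then.
  const stream = new ReadableStream<Uint8Array>({
    pull(ctrl) {
      if (offset >= body.byteLength) {
        ctrl.close();
        return;
      }
      ctrl.enqueue(body.subarray(offset, offset + CHUNK));
      offset += CHUNK;
      clearTimeout(timer);
      timer = setTimeout(() => controller.abort(), IDLE_TIMEOUT_MS);
    },
  });

  try {
    return await fetch(url, {
      method: "PUT",
      body: stream,
      signal: controller.signal,
      // Node 18+ requires half-duplex for streamed request bodies.
      duplex: "half",
    } as RequestInit);
  } finally {
    clearTimeout(timer);
  }
}
```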
Nits
- nit — magic numbers should be named or commented. 500_000 is "500 KB/s assumed minimum throughput"; 120_000 is "2 minutes minimum even for tiny uploads". A one-line comment over each constant explains intent.
- nit — Math.ceil((byteLength / 500_000) * 1000) is the same as Math.ceil(byteLength / 500) for integer byteLength. Minor readability.
- nit — PUBLISH_METADATA_TIMEOUT_MS could keep the original name PUBLISH_REQUEST_TIMEOUT_MS since "request" was the original meaning — splitting into upload vs metadata is fine, just call out in the constant name what each is for ("PRESIGN/COMPLETE call timeout" vs "S3 PUT body timeout").
- nit — observability. A single console.error (or whatever the CLI's logger is) on timeout that includes the archive size and elapsed time would help debug the inevitable "publish hangs on huge project" reports (a minimal shape is sketched after this list).
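A minimal shape for that log line; the logger choice and message format are assumptions, not code from this PR:

```ts
// Emit archive size, elapsed time, and the computed limit on upload timeout,
// so "publish hangs on huge project" reports come with the numbers attached.
function logUploadTimeout(byteLength: number, startedAtMs: number, limitMs: number): void {
  console.error(
    `publish upload timed out after ${Date.now() - startedAtMs} ms ` +
      `(archive ${byteLength} bytes, timeout limit ${limitMs} ms)`,
  );
}
```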
Praise
- Splitting metadata timeout from upload timeout is the right factoring — those two operations have completely different latency profiles.
- 120s floor protects small uploads from getting unfairly killed if the formula goes weird.
- Diff is minimal and contained to one file.
— Vai
- Parse expires_in_seconds from the upload response and cap the S3 PUT timeout at TTL minus 30s safety margin — prevents the silent dead zone where the presigned URL expires before the client gives up
- Add unit tests for uploadTimeoutMs: floor at 120s, scales above 64 MB, always returns an integer
- Document the 500 KB/s bandwidth assumption
Addressed all review feedback in 55ea39e:
- Blocker — presigned URL TTL dead zone: Client now parses expires_in_seconds from the staged-upload response and caps the S3 PUT timeout at the URL TTL minus a 30s safety margin.
- Unit tests for uploadTimeoutMs: floor at 120s, scaling above it, integer return.
- 500 KB/s documentation: Added comment explaining it's a conservative floor for slow/unstable connections.
jrusso1020
left a comment
Re-review — LGTM, blockers addressed
The TTL-mismatch and missing-test concerns are cleanly resolved. Approving.
Verified
1. Presigned URL TTL is now respected. parseStagedUploadResponse reads expires_in_seconds from the staged-upload response (defaulting to 1800 if missing for backwards compat with old servers), and the S3 PUT timeout uses:
```ts
Math.min(uploadTimeoutMs(archive.buffer.byteLength), presignedUrlTtlMs)
```

…where presignedUrlTtlMs = stagedUpload.expiresInSeconds * 1000 - PUBLISH_METADATA_TIMEOUT_MS. The 30s subtraction is a generous buffer for metadata-call latency between "server signed the URL" and "client starts the PUT" — defends against signing-time clock skew and slow request paths. With ef#36271 bumping server TTL to 7200s, the client cap kicks in at ~3.6 GB (well above the 78 MB launch project that motivated the change). A consolidated sketch of the parse-and-clamp follows after item 3.
Vai's silent-dead-zone is closed — once the URL is past TTL, the client correctly aborts rather than continuing into a guaranteed S3 403.
2. uploadTimeoutMs is now exported and tested. Three new tests pin the contract:
```ts
expect(uploadTimeoutMs(0)).toBe(120_000); // floor
expect(uploadTimeoutMs(50 * 1024 * 1024)).toBe(120_000); // small still floors
expect(uploadTimeoutMs(64 * 1024 * 1024)).toBeGreaterThan(120_000); // scales
expect(uploadTimeoutMs(500 * 1024 * 1024)).toBeGreaterThan(900_000);
expect(Number.isInteger(uploadTimeoutMs(123_456))).toBe(true);
```

Locks both the floor + the scaling, plus the integer return type (against an accidental Math.ceil removal). Exactly the regression guard I'd asked for.
3. Bandwidth-floor comment landed:
```ts
// Conservative floor — most connections are faster, but this prevents
// premature aborts on slow/unstable networks (hotel wifi, tethering).
const PUBLISH_UPLOAD_BYTES_PER_SECOND = 500_000;
```

A future reader sees the rationale immediately. ✓
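Assembled from the excerpts above into one self-contained sketch; the StagedUpload shape and the helper name s3PutTimeoutMs are assumptions, and uploadTimeoutMs is the exported formula tested above:

```ts
import { uploadTimeoutMs } from "./publishProject"; // exported formula under review

const PUBLISH_METADATA_TIMEOUT_MS = 30_000; // fixed timeout for presign/complete calls

interface StagedUpload {
  uploadUrl: string;
  // Parsed from the response's expires_in_seconds field; defaults to 1800
  // when missing, for old-server backwards compat.
  expiresInSeconds: number;
}

// Cap the S3 PUT abort timeout at the presigned URL's remaining lifetime,
// minus a 30s margin for metadata-call latency and clock skew.
function s3PutTimeoutMs(byteLength: number, staged: StagedUpload): number {
  const presignedUrlTtlMs =
    staged.expiresInSeconds * 1000 - PUBLISH_METADATA_TIMEOUT_MS;
  return Math.min(uploadTimeoutMs(byteLength), presignedUrlTtlMs);
}
```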
Non-blocking observations
A. Default expiresInSeconds = 1800 for old-server backwards compat. Sensible — but worth noting that this mirrors the old server TTL. Once ef#36271's 7200s TTL is fully rolled out, the fallback could be tightened to 7200 too (or just removed entirely). Cheap follow-up.
B. AbortSignal upper bound. Same observation as before — for genuinely huge uploads (50 GB+), the timeout becomes hours. With the TTL cap now in place, this is naturally bounded by presignedUrlTtlMs (max 7170s ≈ 2 hours). So the multipart-upgrade question is now "do we ever expect to need >2 hours per chunk?" rather than "do we trust the user not to wait forever?" Better.
C. Direct-path (publishProjectArchiveDirect) doesn't get the TTL cap because it's a single POST to api-service — no presigned URL involved. The uploadTimeoutMs(archive.buffer.byteLength) cap there is the only ceiling, and it can grow unboundedly with archive size. If api-service / nginx / ALB has its own request-body timeout (typically 60-120s default), the direct path will hit the gateway 504 before the client gives up. Worth a quick verification — but probably acceptable since direct is the legacy fallback path.
Bottom line
All three of Vai's concerns + my non-blocking notes addressed. Stamping. Land alongside ef#36271 — together they close the upload-size gap end-to-end.
— Review by Rames Jusso (pr-review)
vanceingalls
left a comment
Verdict: Approve — blockers addressed, fix is clean.
Re-reviewed at 55ea39e against my prior review at c3a0b4e. Cross-ref: experiment-framework#36271 (server) — still at Comment there for unrelated server-side issues (direct-path RAM, unauthenticated routes), but the TTL coordination required for this PR is in place.
Resolved
- Blocker #1 — presigned URL TTL dead zone. ✅ Two-sided fix:
  - Server (ef#36271) bumped TTL 30min → 2h.
  - Client (publishProject.ts:79–80, 274–280) now parses expires_in_seconds from the staged-upload response and clamps the abort timeout to min(uploadTimeoutMs(bytes), expires_in_seconds * 1000 − 30s). With the 2h TTL, the only uploads where the cap bites are >~3.6 GB at 500 KB/s, already brushing S3's 5 GB single-PUT ceiling. Dead zone closed.
- Important #2 — unit tests for uploadTimeoutMs. ✅ Asserts in publishProject.test.ts:39–53 cover: floor at 120s for 0/50MB, scales above floor for 64MB+ and 500MB+, returns integer. Locks the formula's behavior at the three points I asked for.
- Important — 500 KB/s assumption documentation. ✅ Inline comment at publishProject.ts:10–11 explains "conservative floor for slow/unstable connections (hotel wifi, tethering)". Future maintainers will get the intent.
Still open (acknowledged, non-blocking)
- Important #3 — gateway/nginx GB-body verification on direct path. Not addressed in this PR. The staged path goes client → S3 directly so it's unaffected by api-service proxy limits. Direct path (/v1/hyperframes/projects/publish) is the residual concern, but that's a server-side question — calling it out on ef#36271 instead.
New findings since c3a0b4e
- nit — fallback default expires_in_seconds = 1800 at publishProject.ts:82 is now stale. With the server bumped to 7200s, this fallback only fires when (a) the field is missing, or (b) it's not a positive number. New CLI + new server: 7200 always. New CLI + old server: still 1800, still safe. No bug, but a comment like // matches old server default; new server returns 7200 would help debugging.
- nit — defensive guard on presignedUrlTtlMs. expires_in_seconds * 1000 − 30_000 could go negative if a future server returns TTL < 30s, and AbortSignal.timeout throws on a negative value. Server-side floor is 1800s today so this is theoretical, but a Math.max(60_000, …) floor would be free insurance (sketched after this list).
- praise — a del zip_bytes equivalent isn't needed here, but the discipline of capping the abort signal at the server's TTL (not just client-side bandwidth) is the right separation of concerns. Good fix shape.
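A minimal form of that guard; the function name is an assumption:

```ts
const PUBLISH_METADATA_TIMEOUT_MS = 30_000;

// Keep the abort budget positive even if a future server returns a TTL
// shorter than the 30s metadata margin.
function presignedUrlTtlMs(expiresInSeconds: number): number {
  return Math.max(60_000, expiresInSeconds * 1000 - PUBLISH_METADATA_TIMEOUT_MS);
}
```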
Coordination with ef#36271
- Server TTL bump (30min → 2h) is committed at the PR head — this CLI's TTL parsing has something to clamp against.
- Server response field expires_in_seconds is already in the DTO (HyperframesProjectUploadResponse.expires_in_seconds: int) — no breaking change to land this independently.
- Old CLI + new server: harmless (old CLI uses a fixed 30s timeout and aborts well within the 2h TTL on small uploads — no regression).
- New CLI + old server: harmless (CLI's 1800s fallback matches old server's actual TTL — no dead zone).
Recommendation
Ship it. The core formula is right, the TTL coupling closes the dead-zone blocker, and the tests are sufficient to catch silent regressions on the floor and slope. Approving.
— Vai
Summary
max(120s, bytes / 500 KB/s)

Context
With the backend size limit removed, large projects (78 MB+) need proportionally longer to upload. A 78 MB project now gets ~164s, a 500 MB project ~17 min. The old 30s timeout would abort any upload over ~15 MB at the assumed 500 KB/s rate.
Test plan