Setup
- driver: zarr (kvstore: gcs)
- OpenMode: create | delete_existing
- A single client process opens the store once for write at job startup. No
concurrent writers on the same path.
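Concretely, the spec in play looks roughly like this (bucket, path, and array
metadata are placeholders for the real values):

```json
{
  "driver": "zarr",
  "kvstore": {"driver": "gcs", "bucket": "my-bucket", "path": "foo.zarr/"},
  "create": true,
  "delete_existing": true,
  "metadata": {"shape": [1000, 1000], "dtype": "<f4"}
}
```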
Symptom
tensorstore::Open intermittently returns ALREADY_EXISTS on the .zarray
write, even though delete_existing=true should remove any prior state
first. We see it on a small fraction of paths in long batch runs (~6 out of
several thousand). Retrying the same path a few seconds later succeeds, and
the same code path normally succeeds on the first attempt.
ALREADY_EXISTS: Error opening "zarr" driver: Error writing
gs://.../foo.zarr/.zarray
[source locations='tensorstore/internal/cache/kvs_backed_cache.h:220
tensorstore/driver/driver.cc:115']
Likely cause
A timing window inside delete_existing's implementation: the delete of
.zarray is acknowledged, but the create's existence check sees stale
metadata and fails. Plausibly a metadata-cache / consistency window in the
GCS kvstore layer.
Suggestion
Either (a) make delete_existing internally retry on ALREADY_EXISTS for a
short window, or (b) document that callers should retry. Right now the
failure mode looks like a code bug — create | delete_existing should be
atomic from the caller's perspective — rather than a transient
remote-storage hiccup.
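In the meantime, a caller-side workaround in the spirit of (b) might look like
the sketch below; the helper name, attempt count, and backoff are illustrative,
not part of the TensorStore API:

```python
import time

def open_with_retry(do_open, attempts=5, delay=0.5):
    """Retry an open that intermittently fails with ALREADY_EXISTS.

    do_open: zero-arg callable performing the TensorStore open, e.g.
    lambda: ts.open(spec).result(). Only errors whose message mentions
    ALREADY_EXISTS are retried; anything else propagates immediately.
    """
    for attempt in range(attempts):
        try:
            return do_open()
        except Exception as err:
            # Last attempt, or a different error: give up and re-raise.
            if "ALREADY_EXISTS" not in str(err) or attempt == attempts - 1:
                raise
            time.sleep(delay * (attempt + 1))  # linear backoff between attempts
```

Given that a single retry seconds later has always succeeded for us, a small
attempt count with sub-second backoff should be more than enough.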