Setup
- driver: zarr (kvstore: gcs)
- OpenMode: create | delete_existing
- A single client process opens the store once for write at job startup. No
concurrent writers on the same path.
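Concretely, the spec in play looks roughly like this (bucket, path, and array
metadata are placeholders for the real values):

```json
{
  "driver": "zarr",
  "kvstore": {"driver": "gcs", "bucket": "my-bucket", "path": "foo.zarr/"},
  "create": true,
  "delete_existing": true,
  "metadata": {"shape": [1000, 1000], "dtype": "<f4"}
}
```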
Symptom
tensorstore::Open intermittently returns ALREADY_EXISTS on the .zarray
write, even though delete_existing=true should remove any prior state
first. We see it on a small fraction of paths in long batch runs (~6 out of
several thousand). Retrying the same path a few seconds later succeeds, and
the same code path normally succeeds on the first attempt.
ALREADY_EXISTS: Error opening "zarr" driver: Error writing
gs://.../foo.zarr/.zarray
[source locations='tensorstore/internal/cache/kvs_backed_cache.h:220
tensorstore/driver/driver.cc:115']
Likely cause
A timing window inside delete_existing's implementation: the delete of
.zarray is acknowledged, but the create's existence check sees stale
metadata and fails. Plausibly a metadata-cache / consistency window in the
GCS kvstore layer.
Suggestion
Either (a) make delete_existing internally retry on ALREADY_EXISTS for a
short window, or (b) document that callers should retry. Right now the
failure mode looks like a code bug — create | delete_existing should be
atomic from the caller's perspective — rather than a transient
remote-storage hiccup.
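In the meantime, a caller-side workaround in the spirit of (b) might look like
the sketch below; the helper name, attempt count, and backoff are illustrative,
not part of the TensorStore API:

```python
import time

def open_with_retry(do_open, attempts=5, delay=0.5):
    """Retry an open that intermittently fails with ALREADY_EXISTS.

    do_open: zero-arg callable performing the TensorStore open, e.g.
    lambda: ts.open(spec).result(). Only errors whose message mentions
    ALREADY_EXISTS are retried; anything else propagates immediately.
    """
    for attempt in range(attempts):
        try:
            return do_open()
        except Exception as err:
            # Last attempt, or a different error: give up and re-raise.
            if "ALREADY_EXISTS" not in str(err) or attempt == attempts - 1:
                raise
            time.sleep(delay * (attempt + 1))  # linear backoff between attempts
```

Given that a single retry seconds later has always succeeded for us, a small
attempt count with sub-second backoff should be more than enough.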