[Bug]: Job stuck in submitted and run cannot stop when fleet is at capacity #3886

@jvstme

Steps to reproduce

  1. Create a cloud fleet with nodes: 1, or an SSH fleet with 1 node.
  2. Start a service with 1 replica on this fleet.
  3. Scale the service to 2 replicas.
  4. Stop the run.

Actual behaviour

  • After step 3, the second replica is stuck in submitted.
  • After step 4, the run is stuck in terminating and no jobs are being stopped.

NAME                  BACKEND          GPU  PRICE           STATUS       SUBMITTED
 test-service                           -    -               terminating  10 mins ago
    group=0 replica=0  aws (us-east-2)  -    $0.0006 (spot)  running      10 mins ago
            replica=1                   -    -               submitted    8 mins ago

Expected behaviour

  • After step 3, the second replica fails with FAILED_TO_START_DUE_TO_NO_CAPACITY.
  • After step 4, the run stops.
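
The expected behaviour above can be sketched as a toy state model. This is only an illustration of the two expectations, not dstack's implementation: the `Fleet`, `Job`, `assign`, and `stop_run` names are hypothetical, and only the `FAILED_TO_START_DUE_TO_NO_CAPACITY` reason and the status names come from this report.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class JobStatus(Enum):
    SUBMITTED = "submitted"
    RUNNING = "running"
    FAILED = "failed"
    TERMINATED = "terminated"


@dataclass
class Fleet:
    """Hypothetical stand-in for a fleet with a fixed number of nodes."""
    capacity: int
    busy: int = 0


@dataclass
class Job:
    """Hypothetical stand-in for one service replica."""
    name: str
    status: JobStatus = JobStatus.SUBMITTED
    termination_reason: Optional[str] = None


def assign(job: Job, fleet: Fleet) -> None:
    """Expectation 1: a full fleet fails the job instead of leaving it submitted."""
    if fleet.busy >= fleet.capacity:
        job.status = JobStatus.FAILED
        job.termination_reason = "FAILED_TO_START_DUE_TO_NO_CAPACITY"
        return
    fleet.busy += 1
    job.status = JobStatus.RUNNING


def stop_run(jobs: List[Job], fleet: Fleet) -> bool:
    """Expectation 2: stopping a run finishes every job, including submitted ones."""
    for job in jobs:
        if job.status is JobStatus.RUNNING:
            fleet.busy -= 1  # release the instance held by the running job
        if job.status in (JobStatus.SUBMITTED, JobStatus.RUNNING):
            job.status = JobStatus.TERMINATED
    # The run counts as stopped only when no job is left unfinished.
    return all(j.status in (JobStatus.FAILED, JobStatus.TERMINATED) for j in jobs)


if __name__ == "__main__":
    fleet = Fleet(capacity=1)
    replicas = [Job("test-service-0-0"), Job("test-service-0-1")]
    for replica in replicas:
        assign(replica, fleet)
    print([(r.name, r.status.value, r.termination_reason) for r in replicas])
    print("run stopped:", stop_run(replicas, fleet))
```

In the actual behaviour reported here, the second replica never leaves submitted, which in this model would correspond to `assign` returning without changing the job's status.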

dstack version

0.20.19

Server logs

[22:43:13] DEBUG    dstack._internal.server.background.pipeline_tasks.base:357 Processing jobs item 875865df-e75b-429c-80fc-7c9306ec487c                                      
           DEBUG    dstack._internal.server.background.pipeline_tasks.jobs_submitted:337 job(875865)test-service-0-1: assignment has started                                  
           DEBUG    dstack._internal.server.background.pipeline_tasks.jobs_submitted:591 job(875865)test-service-0-1: fleet test-fleet is full, retrying assignment           
           DEBUG    dstack._internal.server.background.pipeline_tasks.base:364 Processed jobs item 875865df-e75b-429c-80fc-7c9306ec487c in 0.040                              
[22:43:15] DEBUG    dstack._internal.server.background.pipeline_tasks.base:357 Processing runs item c0dc9825-8d9c-41fe-9b0f-2636ef128292                                      
           DEBUG    dstack._internal.server.background.pipeline_tasks.runs:797 Failed to lock run c0dc9825-8d9c-41fe-9b0f-2636ef128292 jobs. The run will be processed later. 
           DEBUG    dstack._internal.server.background.pipeline_tasks.base:364 Processed runs item c0dc9825-8d9c-41fe-9b0f-2636ef128292 in 0.018
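
To check how often these two messages recur in a longer server log, a small filter like the one below may help. It is a sketch that assumes every line follows the layout of the excerpt above (optional `[HH:MM:SS]` timestamp, level, `module:line`, message). The `count_signals` helper is hypothetical and is not part of dstack.

```python
import re
from collections import Counter

# Layout assumed from the excerpt: optional timestamp, level, module:line, message.
LOG_LINE = re.compile(
    r"^(?:\[(?P<time>\d{2}:\d{2}:\d{2})\])?\s*"
    r"(?P<level>[A-Z]+)\s+"
    r"(?P<module>[\w.]+):(?P<lineno>\d+)\s+"
    r"(?P<message>.*?)\s*$"
)

# Substrings copied verbatim from the log excerpt in this report.
SIGNALS = ("retrying assignment", "Failed to lock run")


def count_signals(lines):
    """Count log lines whose message contains one of the SIGNALS substrings."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that do not follow the assumed layout
        for signal in SIGNALS:
            if signal in match["message"]:
                counts[signal] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "           DEBUG    dstack._internal.server.background.pipeline_tasks.jobs_submitted:591 "
        "job(875865)test-service-0-1: fleet test-fleet is full, retrying assignment           ",
        "           DEBUG    dstack._internal.server.background.pipeline_tasks.runs:797 "
        "Failed to lock run c0dc9825-8d9c-41fe-9b0f-2636ef128292 jobs. The run will be processed later. ",
    ]
    print(count_signals(sample))
```

Counts for both messages that keep growing while the run stays in terminating would match the symptom described in this report.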

Additional information

Introduced in 0.20.19

Labels

bug (Something isn't working)