[Bug]: _submit_job_to_runner is unrecoverably broken once any runner API call after /api/submit failed #3740

@un-def

Steps to reproduce

Build dstack-runner with the following patch:

diff --git runner/internal/runner/api/http.go runner/internal/runner/api/http.go
index 34220acc6..dfd9db99c 100644
--- runner/internal/runner/api/http.go
+++ runner/internal/runner/api/http.go
@@ -130,6 +130,11 @@ func (s *Server) uploadCodePostHandler(w http.ResponseWriter, r *http.Request) (
 		return nil, &api.Error{Status: http.StatusConflict}
 	}

+	if !s.uploadCodeCalledOnce {
+		s.uploadCodeCalledOnce = true
+		return nil, &api.Error{Status: http.StatusInternalServerError}
+	}
+
 	r.Body = http.MaxBytesReader(w, r.Body, maxBodySize)

 	if err := s.executor.WriteRepoBlob(r.Body); err != nil {
diff --git runner/internal/runner/api/server.go runner/internal/runner/api/server.go
index 11b76d887..4872f38ef 100644
--- runner/internal/runner/api/server.go
+++ runner/internal/runner/api/server.go
@@ -27,6 +27,8 @@ type Server struct {
 	executor  executor.Executor
 	cancelRun context.CancelFunc

+	uploadCodeCalledOnce bool
+
 	metricsCollector *metrics.MetricsCollector

 	version string

The patch emulates a transient failure of the /api/upload_code handler (e.g., a network issue): the first call fails, and all subsequent calls succeed.

Deploy this build and submit a run as usual.

Actual behaviour

Once _submit_job_to_runner fails in any runner API call other than /api/submit, all subsequent attempts are doomed to fail: the previous attempt has already moved the runner's executor state to WaitCode or WaitRun, and /api/submit rejects the submission with 409 because the state is no longer WaitSubmit.
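The failure mode described above can be reproduced with a toy model. This is a minimal sketch, not the real runner or server code: `FakeRunner`, `ApiError`, `submit_job_to_runner`, and `attempt_statuses` are hypothetical names, and the state transitions are assumed from the description in this issue (WaitSubmit, then WaitCode, then WaitRun).

```python
from enum import Enum, auto


class State(Enum):
    WAIT_SUBMIT = auto()
    WAIT_CODE = auto()
    WAIT_RUN = auto()
    RUNNING = auto()


class ApiError(Exception):
    """Hypothetical stand-in for an HTTP error returned by the runner API."""

    def __init__(self, status: int) -> None:
        super().__init__(f"HTTP {status}")
        self.status = status


class FakeRunner:
    """Toy model of the executor state machine (assumed, not the real Go code)."""

    def __init__(self) -> None:
        self.state = State.WAIT_SUBMIT
        self.upload_failed_once = False

    def submit(self) -> None:
        # Mirrors the behaviour reported in the issue: 409 unless WaitSubmit.
        if self.state is not State.WAIT_SUBMIT:
            raise ApiError(409)
        self.state = State.WAIT_CODE

    def upload_code(self) -> None:
        if self.state is not State.WAIT_CODE:
            raise ApiError(409)
        # Same idea as the patch above: fail the first call only.
        if not self.upload_failed_once:
            self.upload_failed_once = True
            raise ApiError(500)
        self.state = State.WAIT_RUN

    def run(self) -> None:
        if self.state is not State.WAIT_RUN:
            raise ApiError(409)
        self.state = State.RUNNING


def submit_job_to_runner(runner: FakeRunner) -> None:
    # Each attempt restarts from /api/submit, as the server logs suggest.
    runner.submit()
    runner.upload_code()
    runner.run()


def attempt_statuses(runner: FakeRunner, attempts: int) -> list[int]:
    """Retry the whole submission and record the status of each attempt."""
    statuses = []
    for _ in range(attempts):
        try:
            submit_job_to_runner(runner)
            statuses.append(200)
        except ApiError as e:
            statuses.append(e.status)
    return statuses


print(attempt_statuses(FakeRunner(), 4))  # [500, 409, 409, 409]
```

The first attempt fails with 500 in the upload step and leaves the model in WaitCode, so every retry fails at the submit step with 409, which matches the pattern in the server logs below.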

If we ignore the 409 returned by the /api/submit call, the submission process recovers:

--- src/dstack/_internal/server/background/pipeline_tasks/jobs_running.py
+++ src/dstack/_internal/server/background/pipeline_tasks/jobs_running.py
@@ -1342,16 +1342,19 @@ def _submit_job_to_runner(
     if runner_client.healthcheck() is None:
         return _SubmitJobToRunnerResult(success=success_if_not_available)

-    runner_client.submit_job(
-        run=run,
-        job=job,
-        cluster_info=cluster_info,
-        # Do not send all the secrets since interpolation is already done by the server.
-        # TODO: Passing secrets may be necessary for filtering out secret values from logs.
-        secrets={},
-        repo_credentials=repo_credentials,
-        instance_env=instance_env,
-    )
+    try:
+        runner_client.submit_job(
+            run=run,
+            job=job,
+            cluster_info=cluster_info,
+            # Do not send all the secrets since interpolation is already done by the server.
+            # TODO: Passing secrets may be necessary for filtering out secret values from logs.
+            secrets={},
+            repo_credentials=repo_credentials,
+            instance_env=instance_env,
+        )
+    except Exception:
+        pass
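The diff above swallows every exception, which is enough to demonstrate recovery but would also hide unrelated failures. A narrower variant could ignore only the 409 status. The sketch below illustrates that idea on a toy model; `ApiError`, `FakeRunner`, and `submit_job_tolerating_conflict` are hypothetical names, not part of dstack, and treating 409 as "already submitted by an earlier attempt" is an assumption that the real fix would need to verify.

```python
class ApiError(Exception):
    """Hypothetical stand-in for whatever HTTP error the runner client raises."""

    def __init__(self, status: int) -> None:
        super().__init__(f"HTTP {status}")
        self.status = status


class FakeRunner:
    """Toy runner: submit -> upload_code -> run; the first upload fails once."""

    def __init__(self) -> None:
        self.state = "WaitSubmit"
        self.upload_failed_once = False

    def submit(self) -> None:
        if self.state != "WaitSubmit":
            raise ApiError(409)
        self.state = "WaitCode"

    def upload_code(self) -> None:
        if self.state != "WaitCode":
            raise ApiError(409)
        if not self.upload_failed_once:
            self.upload_failed_once = True
            raise ApiError(500)
        self.state = "WaitRun"

    def run(self) -> None:
        if self.state != "WaitRun":
            raise ApiError(409)
        self.state = "Running"


def submit_job_tolerating_conflict(runner: FakeRunner) -> None:
    try:
        runner.submit()
    except ApiError as e:
        if e.status != 409:
            raise  # Anything other than a conflict is still a real failure.
        # Assumption: 409 means an earlier attempt already submitted this job.
    runner.upload_code()
    runner.run()


runner = FakeRunner()
outcomes = []
for _ in range(2):
    try:
        submit_job_tolerating_conflict(runner)
        outcomes.append("ok")
    except ApiError as e:
        outcomes.append(e.status)

print(outcomes, runner.state)  # [500, 'ok'] Running
```

In this toy model, a failure in the run step would still leave the upload step returning 409 on retry, so tolerating the conflict in the submit step alone may not cover every case. Whether the real runner behaves the same way is not confirmed by this issue.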

Expected behaviour

No response

dstack version

0.20.15

Server logs

[12:00:33] DEBUG    dstack._internal.server.background.pipeline_tasks.base:357 Processing jobs item 741f2c6c-b400-46a1-a796-da44ebbed36b
           DEBUG    dstack._internal.server.background.pipeline_tasks.jobs_running:681 job(741f2c)task-0-0: process pulling job with shim, age=0:00:29.311805
           DEBUG    dstack._internal.server.background.pipeline_tasks.jobs_running:1329 job(741f2c)task-0-0: submitting job spec
           DEBUG    dstack._internal.server.background.pipeline_tasks.jobs_running:1330 job(741f2c)task-0-0: repo clone URL is None
           DEBUG    dstack._internal.server.background.pipeline_tasks.jobs_running:1355 job(741f2c)task-0-0: uploading file archive(s)
           DEBUG    dstack._internal.server.background.pipeline_tasks.jobs_running:1358 job(741f2c)task-0-0: uploading code
           DEBUG    dstack._internal.server.services.runner.ssh:106 Cannot connect to 192.168.122.75's API: 500 Server Error: Internal Server Error for url: http://localhost:39461/api/upload_code
           WARNING  dstack._internal.server.background.pipeline_tasks.jobs_running:906 job(741f2c)task-0-0: is unreachable, waiting for the instance to become reachable again, age=0:00:29.827807
           INFO     dstack._internal.server.services.events:205 Emitting event: Job became unreachable. Event targets: job(741f2c)task-0-0. Actor: system
[12:00:45] DEBUG    dstack._internal.server.background.pipeline_tasks.base:357 Processing jobs item 741f2c6c-b400-46a1-a796-da44ebbed36b
           DEBUG    dstack._internal.server.background.pipeline_tasks.jobs_running:681 job(741f2c)task-0-0: process pulling job with shim, age=0:00:41.777487
           DEBUG    dstack._internal.server.background.pipeline_tasks.jobs_running:1329 job(741f2c)task-0-0: submitting job spec
           DEBUG    dstack._internal.server.background.pipeline_tasks.jobs_running:1330 job(741f2c)task-0-0: repo clone URL is None
           DEBUG    dstack._internal.server.services.runner.ssh:106 Cannot connect to 192.168.122.75's API: 409 Client Error: Conflict for url: http://localhost:35811/api/submit
           WARNING  dstack._internal.server.background.pipeline_tasks.jobs_running:906 job(741f2c)task-0-0: is unreachable, waiting for the instance to become reachable again, age=0:00:42.311724
           <... and "409 Client Error" failure repeated again and again until provisioning timeout exceeded>

Additional information

No response
