Add instances option to target specific fleet instances by fededagos · Pull Request #3925 · dstackai/dstack

fededagos · 2026-06-01T21:02:39Z

Adds an instances option to run configurations (dev environments, tasks, services) that restricts a run to specific existing fleet instances.

Syntax

Long forms:

instances:
  - fleet: my-fleet
    instance: 3
  - name: my-fleet-1
  - hostname: worker-1

Short form for matching by instance name:

instances:
  - my-fleet-1

The fleet form also supports <project name>/<fleet name> for fleets from another project.
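
The accepted forms can be pictured as a small normalization step. The sketch below uses plain dataclasses rather than the PR's pydantic models; `NameSelector`, `HostnameSelector`, `FleetSelector`, and `normalize_selector` are hypothetical names chosen for illustration and are not part of dstack.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class NameSelector:
    name: str


@dataclass(frozen=True)
class HostnameSelector:
    hostname: str


@dataclass(frozen=True)
class FleetSelector:
    fleet: str  # "<fleet name>" or "<project name>/<fleet name>"
    instance: int


Selector = Union[NameSelector, HostnameSelector, FleetSelector]


def normalize_selector(raw: Union[str, dict]) -> Selector:
    """Turn one entry of `instances` into a typed selector (illustrative only)."""
    if isinstance(raw, str):
        # Short form: a bare string is treated as an instance name.
        return NameSelector(name=raw)
    keys = set(raw)
    if keys == {"name"}:
        return NameSelector(name=raw["name"])
    if keys == {"hostname"}:
        return HostnameSelector(hostname=raw["hostname"])
    if keys == {"fleet", "instance"}:
        return FleetSelector(fleet=raw["fleet"], instance=int(raw["instance"]))
    # Strict: mixed or unknown keys are rejected rather than guessed at.
    raise ValueError(f"Unsupported instance selector: {raw!r}")
```

Under these assumptions, `normalize_selector("my-fleet-1")` and `normalize_selector({"name": "my-fleet-1"})` yield the same selector, which is the sense in which the short form is only a shorthand.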

Behavior

instances has allow-list semantics: a run is placed only on a matching existing instance.
When instances is set, dstack never provisions new instances to satisfy the run.
If no matching instance is available, the run fails with a no-capacity error; retry can be used to wait for a selected busy instance to free up.
A run is rejected up front if it specifies fewer instances than required by its node count.
New-capacity backend offers are skipped when instances is set because they cannot satisfy the selector.
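
The allow-list behavior described above can be sketched as follows. This is a simplified model and not the PR's scheduling code: `Instance`, `place_run`, `NoCapacityError`, and the `"<new>"` marker are hypothetical, and only name matching is shown.

```python
from dataclasses import dataclass
from typing import List, Optional


class NoCapacityError(Exception):
    """Hypothetical stand-in for the run's no-capacity failure."""


@dataclass
class Instance:
    name: str
    busy: bool = False


def place_run(pool: List[Instance], allowed_names: Optional[List[str]]) -> str:
    """Return the name of the instance to use, or "<new>" for fresh capacity."""
    candidates = [i for i in pool if not i.busy]
    if allowed_names is not None:
        allowed = set(allowed_names)
        candidates = [i for i in candidates if i.name in allowed]
    if candidates:
        return candidates[0].name
    if allowed_names is not None:
        # With an allow-list, never fall back to provisioning a new instance.
        raise NoCapacityError("Failed to use specified instances")
    return "<new>"
```

In this model, a retry policy would simply call `place_run` again later, by which time a selected busy instance may have become idle.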

Implementation

Adds strict selector models to ProfileParams: name, hostname, and fleet + instance, while preserving the string shorthand as an instance-name selector.
Reuses the existing fleet/instance offer selection path and filters loaded instances by the selected instance selectors.
Supports both backend fleets and SSH fleets.
Supports qualified fleet references without broadening query complexity for the common unqualified case.
Keeps the change backward compatible by omitting unset instances for older client/server compatibility paths.
No DB schema change; the field is stored in the existing run/profile JSON.

Docs

Updated the shared fleet-management snippet and protips guide. The docs promote the explicit syntax first and keep the short instance-name syntax in a collapsible section.

Testing

uv run ruff check .
uv run pyright -p .
uv run pytest — 2607 passed, 1055 skipped
Local end-to-end testing with dstack server:
- Backend AWS fleet: baseline fleets plus all four instances syntaxes completed successfully.
- SSH fleet: created an EC2 instance from the dstack AWS AMI, configured it as an SSH fleet, and verified all four instances syntaxes completed successfully.
- Negative case: nonexistent instance selector failed with “Failed to use specified instances” and did not provision another backend instance.

AI Assistance

This PR includes AI-assisted changes. The original PR noted Claude Code assistance; follow-up schema, implementation review, tests, docs, and E2E verification were assisted by Codex.

Introduce an `instances` run profile option that pins a run to specific existing fleet instances (nodes). Each value matches an instance by its name (e.g. `my-fleet-0`) or by its hostname/IP address. When set, `filter_instances` keeps only matching instances and the job assignment phase never provisions new capacity to satisfy a node selector, terminating with a no-capacity error instead.

Reject runs that target fewer instances than the number of nodes they require, surfaced during planning via `validate_run_spec_and_set_defaults`. Exclude new-capacity backend offers from the run plan when `instances` is set, since they are never provisioned and would otherwise mislead the `dstack apply`/`dstack offer` output.

Add a 'Targeting specific instances' section to the shared fleets snippet (dev environments, tasks, services) and a corresponding tip in the protips guide.

Handle an explicit empty `instances` list consistently across the assignment gate, plan output, and instance filtering by checking `is not None` instead of truthiness, so an empty list targets existing instances only (rather than silently allowing new-capacity provisioning and showing unusable offers). Add regression tests ensuring the instance selector is applied on the multinode and shared-instances filter paths.
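
The difference between the truthiness check and the `is not None` check is easy to miss, so here is a minimal sketch of it. Both function names are hypothetical and only model the decision of whether new capacity may be provisioned.

```python
from typing import List, Optional


def may_provision_new_capacity_truthy(instances: Optional[List[str]]) -> bool:
    # Truthiness check: an empty list is falsy, so it is treated like "unset".
    return not instances


def may_provision_new_capacity(instances: Optional[List[str]]) -> bool:
    # Only an unset (None) option leaves new-capacity provisioning enabled.
    return instances is None
```

With the truthiness version, `instances: []` would behave like an absent option; with the `is None` version, it restricts the run to existing instances (of which none match).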

jvstme · 2026-06-10T11:50:15Z

+def _validate_fleet_instance_selector_fleet(v: str) -> str:
+    EntityReference.parse(v)
+    return v
+
+
+class FleetInstanceSelector(CoreModel):
+    fleet: Annotated[
+        str,
+        Field(
+            description=(
+                "The fleet name. For fleets owned by the current project, specify the fleet name."
+                " For a fleet from another project, specify `<project name>/<fleet name>`"
+            ),
+            min_length=1,
+        ),
+    ]
+    instance: Annotated[int, Field(description="The fleet instance number", ge=0)]
+
+    _validate_fleet = validator("fleet", allow_reuse=True)(_validate_fleet_instance_selector_fleet)


(nit) I would annotate fleet as EntityReference instead of str. That would:

remove the need for parsing it on each access;

allow the type checker to enforce correct usage across the codebase;

allow optionally using the verbose object notation in configurations, which would be consistent with other properties (1, 2):

instances:
  - fleet:
      project: main
      name: my-fleet
    instance: 0

And also add str as an option for this field in schema_extra

Annotated fleet as EntityReference, with the string shorthand parsed in a pre-validator and str kept in the schema via schema_extra, following the fleets field pattern. The object notation is now accepted in configurations.
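
A rough sketch of what "string shorthand parsed in a pre-validator" amounts to is shown below. `FleetRef` is a hypothetical stand-in for `EntityReference`, whose real implementation is not shown in this thread, and `coerce_fleet` is an assumed helper name; the actual PR does this inside pydantic validation.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class FleetRef:
    """Hypothetical stand-in for EntityReference; not dstack's actual class."""

    name: str
    project: Optional[str] = None

    @classmethod
    def parse(cls, value: str) -> "FleetRef":
        parts = value.split("/")
        if len(parts) == 1 and parts[0]:
            return cls(name=parts[0])
        if len(parts) == 2 and all(parts):
            return cls(project=parts[0], name=parts[1])
        raise ValueError(f"Invalid fleet reference: {value!r}")


def coerce_fleet(value: Union[str, dict, FleetRef]) -> FleetRef:
    """Accept the string shorthand or the object notation (illustrative)."""
    if isinstance(value, FleetRef):
        return value
    if isinstance(value, str):
        return FleetRef.parse(value)
    return FleetRef(name=value["name"], project=value.get("project"))
```

Parsing once at validation time means later code can read `fleet.project` and `fleet.name` directly instead of re-parsing a string on each access, which is the first benefit listed in the review comment.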

jvstme · 2026-06-10T11:59:37Z

+    _validate_fleet = validator("fleet", allow_reuse=True)(_validate_fleet_instance_selector_fleet)
+
+
+InstanceSelector = Union[InstanceNameSelector, InstanceHostnameSelector, FleetInstanceSelector]


(nit) Not described in the .dstack.yml references. Consider adding a section similar to volumes

Added instances sections to the dev-environment, task, and service references with tabs per selector type and a short-syntax note.

jvstme · 2026-06-10T12:29:06Z

+async def _load_fleet_project_if_needed(
+    session: AsyncSession,
+    fleet_model: Optional[FleetModel],
+) -> None:
+    if fleet_model is None or "project" not in sa_inspect(fleet_model).unloaded:
+        return
+    await session.execute(
+        select(FleetModel)
+        .where(FleetModel.id == fleet_model.id)
+        .options(joinedload(FleetModel.project))
+        .execution_options(populate_existing=True)
+    )


(nit) This is a rather unusual pattern for our codebase. Our typical pattern is to load all the required relationships when fetching the model from the database (in this case, in _load_submitted_job_context), which I think is preferred, as it avoids extra roundtrips to the database

Folded the fleet project load into the existing _load_submitted_job_context / _fetch_run_model_for_submitted_job queries and removed _load_fleet_project_if_needed.

jvstme · 2026-06-10T12:39:01Z

+def instance_matches_hostname_selector(
+    instance: InstanceModel, selector: InstanceHostnameSelector
+) -> bool:
+    candidates = set()
+    jpd = get_instance_provisioning_data(instance)
+    if jpd is not None and jpd.hostname is not None:
+        candidates.add(jpd.hostname.lower())
+    rci = get_instance_remote_connection_info(instance)
+    if rci is not None:
+        candidates.add(rci.host.lower())
+    return selector.hostname.lower() in candidates


Match by private_ip too? I would expect it based on InstanceHostnameSelector.hostname description.

The fleet instance hostname or IP address

The selector now also matches JobProvisioningData.internal_ip.
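
As a sketch of the resulting matching rule, assuming the three address sources mentioned in this thread (provisioning hostname, internal IP, and the SSH connection host): `InstanceInfo` and `matches_hostname` are hypothetical names, and the real code reads these values from `JobProvisioningData` and the remote connection info.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InstanceInfo:
    """Hypothetical flattened view of an instance's addresses."""

    hostname: Optional[str] = None     # from provisioning data
    internal_ip: Optional[str] = None  # from provisioning data
    ssh_host: Optional[str] = None     # from SSH fleet connection info


def matches_hostname(instance: InstanceInfo, wanted: str) -> bool:
    # Collect every known address, ignoring missing ones, and compare
    # case-insensitively as the original matcher does.
    candidates = {
        value.lower()
        for value in (instance.hostname, instance.internal_ip, instance.ssh_host)
        if value is not None
    }
    return wanted.lower() in candidates
```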

jvstme · 2026-06-10T13:04:40Z

+def instance_matches_fleet_instance_selector(
+    instance: InstanceModel,
+    selector: FleetInstanceSelector,
+    *,
+    project: Optional[ProjectModel] = None,
+    fleet: Optional[FleetModel] = None,
+) -> bool:
+    fleet_ref = EntityReference.parse(selector.fleet)
+
+    if fleet is None:
+        # Avoid triggering a lazy load in async code.
+        if "fleet" in sa_inspect(instance).unloaded or instance.fleet is None:
+            return False
+        fleet = instance.fleet
+
+    if fleet.name.lower() != fleet_ref.name.lower():
+        return False
+    if instance.instance_num != selector.instance:
+        return False
+
+    if fleet_ref.project is None:
+        if project is not None and fleet.project_id != project.id:
+            return False
+        return True
+
+    if "project" in sa_inspect(fleet).unloaded or fleet.project is None:
+        return False
+    return fleet.project.name.lower() == fleet_ref.project.lower()


(nit) Looks quite error-prone to me. The function relies on several data sources at once (fleet vs instance.fleet, project vs fleet.project vs instance.fleet.project), and silently returns a potentially incorrect result if the caller doesn't provide the right arguments (e.g., if fleet is not provided and instance.fleet is not loaded).

I.e., the correctness of the function depends on the caller following some implicit contracts. To me, it's quite difficult to tell whether all code paths follow these contracts and whether they will keep doing so in the future.

If possible, I'd prefer the function (and any dependent code paths) to only accept instance and selector and require all the necessary relationships to be loaded.

Reworked the contract: the matchers now take only the instance, the selector, and the current project (required — needed to interpret unqualified fleet references). The fleet argument and the unloaded-relationship fallbacks are gone: instance.fleet is populated at load time (set_committed_value — SQLAlchemy doesn't populate the reverse many-to-one when loading through FleetModel.instances) and fleet.project is always eager-loaded, so a missing relationship fails loudly instead of silently not matching. This also removed the conditional load_fleet_project plumbing.
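
The tightened contract can be illustrated with plain dataclasses in place of the SQLAlchemy models. All names below (`Project`, `Fleet`, `Instance`, `FleetSelector`, `matches_fleet_selector`) are hypothetical simplifications; the point is only that the matcher takes the instance, the selector, and the current project, and has no fallback for a missing relationship.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Project:
    id: int
    name: str


@dataclass(frozen=True)
class Fleet:
    name: str
    project: Project


@dataclass(frozen=True)
class Instance:
    instance_num: int
    fleet: Fleet  # required: callers must load it before matching


@dataclass(frozen=True)
class FleetSelector:
    fleet_name: str
    instance: int
    fleet_project: Optional[str] = None  # set only for "<project>/<fleet>"


def matches_fleet_selector(
    instance: Instance, selector: FleetSelector, current_project: Project
) -> bool:
    fleet = instance.fleet  # no "not loaded" fallback; a missing fleet is a bug
    if fleet.name.lower() != selector.fleet_name.lower():
        return False
    if instance.instance_num != selector.instance:
        return False
    if selector.fleet_project is None:
        # Unqualified reference: the fleet must belong to the current project.
        return fleet.project.id == current_project.id
    return fleet.project.name.lower() == selector.fleet_project.lower()
```

In this sketch, passing an instance whose `fleet` is missing raises an `AttributeError` instead of returning `False`, which mirrors the "fails loudly" behavior described above.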

jvstme · 2026-06-10T13:09:39Z

 ) -> List[InstanceModel]:
+    fleet_load = joinedload(InstanceModel.fleet)
+    if load_fleet_project:
+        fleet_load = fleet_load.joinedload(FleetModel.project)


(nit) .load_only()?

Added .load_only(ProjectModel.name) to the fleet project loads.

jvstme · 2026-06-10T13:29:06Z

+    # If `instances` is set, backend offers cannot satisfy the run. Otherwise,
+    # keep the existing optimization that skips backend requests when pool
+    # capacity is already enough.


(nit) The comment explains how this PR changes the code ("keep the existing optimization") instead of explaining what the code does. The reader won't understand the comment without seeing the previous version

Rewrote the comment to describe the current behavior.

jvstme · 2026-06-10T13:51:02Z

+    instances = run_spec.merged_profile.instances
+    if instances is not None:
+        nodes_required_num = get_nodes_required_num(run_spec)
+        if len(instances) < nodes_required_num:
+            raise ServerClientError(
+                f"`instances` specifies {len(instances)} instance(s)"
+                f" but the run requires {nodes_required_num} nodes."
+                " Specify at least as many instances as nodes."
+            )


Even if there are fewer instances than nodes_required_num, they may still be able to accommodate the run if they have enough blocks.

There are a few other places in the PR that appear not to take blocks into consideration (search for required_instance_offers)

Right about services: replicas can pack onto one instance with enough idle blocks, so the up-front check now applies only to multinode tasks, where each node takes a whole instance (min_blocks = total_blocks for multinode in get_shared_instances_with_offers, and multinode skips instances with any busy block). For the same reason the required_instance_offers comparisons should be blocks-safe: len(jobs_to_provision) > 1 only happens in the multinode master path, and each instance yields at most one offer (the block-size loop breaks on first match), so offer count equals distinct usable instances. Let me know if there's a path I'm missing.
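
A minimal sketch of the narrowed up-front check, assuming a multinode task is a task with more than one node: `check_instances_cover_nodes` and its parameters are hypothetical, and the real validation raises `ServerClientError` from the run spec.

```python
from typing import List, Optional


def check_instances_cover_nodes(
    instances: Optional[List[str]], run_type: str, nodes: int
) -> None:
    """Reject only multinode tasks that list fewer instances than nodes."""
    if instances is None:
        return
    if run_type != "task" or nodes <= 1:
        # Services and single-node runs may share an instance via blocks.
        return
    if len(instances) < nodes:
        raise ValueError(
            f"`instances` specifies {len(instances)} instance(s)"
            f" but the run requires {nodes} nodes."
            " Specify at least as many instances as nodes."
        )
```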

fededagos and others added 5 commits June 1, 2026 11:19

Document targeting specific fleet instances

d352153

Support strict instance selectors

b25e68a

peterschmidt85 changed the title ~~Add instances option to target specific fleet nodes~~ Add instances option to target specific fleet instances Jun 5, 2026

peterschmidt85 requested a review from jvstme June 5, 2026 16:26

jvstme reviewed Jun 10, 2026

fededagos added 7 commits June 10, 2026 12:05

Describe current behavior in skip_backend_offers comment

068ca75

Match instance internal IP in hostname selector

3f47cfc

Annotate FleetInstanceSelector.fleet as EntityReference

99a62df

Test invalid fleet selector references are rejected

f6a7c17

Require loaded relationships for instance selector matching

257cef6

Apply instances node-count validation only to multinode tasks

1d84eb1

Document instances in .dstack.yml references

2aa4324
