Sprout × mesh-llm: in-process mesh node (serve/consume) + relay admission#798

Open
tlongwell-block wants to merge 30 commits into main from mesh-relay-inprocess-v8


@tlongwell-block
Collaborator

Sprout × mesh-llm — run an LLM on one machine, use it from another

Embeds an in-process mesh-llm node (SDK pinned to bd16da4) in sprout-relay (rendezvous + admission) and Sprout desktop (serve + consume). Gated entirely by relay membership — no new auth protocol, no new crypto.

What it does

  • Machine A — desktop "Share compute" → serve mode → hosts a model.
  • Machine B — agent provider "Relay mesh" → client node → local OpenAI endpoint → requests route over the mesh to A's GPU.
  • Relay — publishes member-readable status (kind:30621, relay-signed, relay-only) + gates admission via NIP-98 → relay membership.
  • Split — models too big for one node auto-split across two serving machines (mesh runtime behavior, free).

Proven on hardware

Qwen3.6-35B-A3B (IQ4_XS) loads into Metal on an M4 Max and serves real /v1/chat/completions requests (200 OK, finish_reason=stop, coherent tokens) over the exact mesh_llm_sdk::serve path desktop uses. See crates/sprout-relay/examples/mesh_serve_smoke.rs (single-node loopback) and the #[ignore] acceptance matrix in crates/sprout-test-client/tests/e2e_mesh_llm.rs.

CI automates the trust/denial assertions; live-inference + split ship as opt-in #[ignore] tests with a runbook (CI can't host native multi-node inference yet).

Security posture (honest, not airtight)

  • Closed: mesh's public iroh-relay fallback. Desktop now requires a Sprout iroh_relay_url from NIP-11 + a fresh NIP-98 bearer before starting mesh, and fails closed otherwise (e9ba42b9). This kills the effective_relay_urls([]) → public-relay fallback.
  • Carried (documented, not fixed): mesh bd16da4 still performs unconditional public STUN (Google/Cloudflare/STUNProtocol) on start and may inject the discovered public IP into invite tokens. No SDK knob to disable it yet — documented in docs/mesh-llm-local-build.md, upstream fix pending. We do not claim "no third-party infra" is airtight in v1.
  • Carried (workaround): serve::start deadlocks with console_ui(false) at this rev (status-poll vs headless bind); desktop sets console_ui(true) until upstream #736 fixes headless readiness.
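
The "Closed" item above amounts to a fail-closed precondition check before the mesh node starts. Here is a minimal sketch of that shape, assuming hypothetical names (`Nip11Doc`, `decideMeshStart`, `MeshStartDecision`) that do not come from this PR; the bearer freshness check is left out, and the real desktop code may be structured differently.

```typescript
// Hypothetical shapes for illustration; not the PR's actual types.
interface Nip11Doc {
  iroh_relay_url?: string | null;
}

type MeshStartDecision =
  | { ok: true; relayUrl: string; bearer: string }
  | { ok: false; reason: string };

// Fail closed: with no Sprout relay URL or no bearer, refuse to start
// rather than passing an empty relay list that could trigger a fallback.
function decideMeshStart(
  nip11: Nip11Doc | null,
  bearer: string | null,
): MeshStartDecision {
  const relayUrl = nip11?.iroh_relay_url?.trim();
  if (!relayUrl) {
    return { ok: false, reason: "relay did not advertise iroh_relay_url" };
  }
  const token = bearer?.trim();
  if (!token) {
    return { ok: false, reason: "no NIP-98 bearer available" };
  }
  return { ok: true, relayUrl, bearer: token };
}
```

The point of the tagged union is that a caller cannot reach the relay URL without first checking `ok`, so there is no code path that starts mesh with an empty relay list.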

Notes

  • mesh-llm SDK pinned to a fixed rev (bd16da4) for reproducibility.
  • Two upstream mesh issues to file: disable-public-STUN, and headless serve readiness.

wesbillman and others added 19 commits May 29, 2026 09:50
Signed-off-by: Eva <011987e296fd5006292d2f930b574be47c7801048d1983c46c425d3c95f0cffd@sprout-oss.stage.blox.sqprod.co>
…787)

Signed-off-by: Eva <011987e296fd5006292d2f930b574be47c7801048d1983c46c425d3c95f0cffd@sprout-oss.stage.blox.sqprod.co>
Signed-off-by: Eva <011987e296fd5006292d2f930b574be47c7801048d1983c46c425d3c95f0cffd@sprout-oss.stage.blox.sqprod.co>
)

Signed-off-by: Eva <011987e296fd5006292d2f930b574be47c7801048d1983c46c425d3c95f0cffd@sprout-oss.stage.blox.sqprod.co>
Signed-off-by: Eva <011987e296fd5006292d2f930b574be47c7801048d1983c46c425d3c95f0cffd@sprout-oss.stage.blox.sqprod.co>
Signed-off-by: Eva <011987e296fd5006292d2f930b574be47c7801048d1983c46c425d3c95f0cffd@sprout-oss.stage.blox.sqprod.co>
* mari/mesh-relay-trust:
  Add relay-owned mesh status publication

Signed-off-by: Eva <011987e296fd5006292d2f930b574be47c7801048d1983c46c425d3c95f0cffd@sprout-oss.stage.blox.sqprod.co>
Adds the Settings > Compute panel against Max's frozen mesh-llm Tauri
command contract. Builds against the typed shapes but does not yet wire
to live commands — those land in Max's lane (mesh_availability,
mesh_node_status, mesh_start_node, mesh_stop_node, mesh_installed_models,
mesh_agent_preset). UI will resolve when his commands ship.

- desktop/src/features/mesh-compute/types.ts: type mirrors of Max's
  frozen command surface, sourced from the 2026-05-29 freeze posts.
- desktop/src/features/mesh-compute/api.ts: typed wrappers around
  invokeTauri for each command. mesh_search_models is reserved as a
  signature-only export for v2; calling it throws.
- desktop/src/features/mesh-compute/classifyModelRef.ts + test: pure
  ref-classification matching mesh runtime/mod.rs:3390 (catalog /
  hf:// / local path). Drives the inline 'Looks like a …' hint.
- hooks/useMeshAvailability.ts: 5s slow poll + focus refresh.
- hooks/useMeshNodeStatus.ts: 750ms poll while transitioning, 4s
  otherwise — so lifecycle changes don't stall.
- ui/MeshComputeSettingsCard.tsx: the rebuilt Share-compute surface.
  No raw mesh knobs (publish/auto/discovery), no kind:xxxx language,
  no endpoint id on the primary surface. Advanced is collapsed and
  carries Max VRAM + console URL. Footer states the architectural
  invariants (no public Nostr publish, no auto-discovery, no
  out-of-relay sharing) so a privacy-aware user trusts the toggle.
- SettingsPanels.tsx: new 'compute' section after 'channel-templates'
  using the Cpu icon.
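
For reviewers who have not opened classifyModelRef.ts, the three-way split (catalog / hf:// / local path) could look roughly like the sketch below. The matching rules here are assumptions chosen for illustration; the real helper mirrors the mesh runtime, which this commit message does not reproduce.

```typescript
// Assumed kinds, taken from the "catalog / hf:// / local path" wording above.
type ModelRefKind = "catalog" | "hf" | "local";

// Pure classifier: no I/O, so it can drive an inline hint on every keystroke.
// The path heuristics below are illustrative assumptions, not the runtime's rules.
function classifyModelRef(raw: string): ModelRefKind | null {
  const ref = raw.trim();
  if (ref.length === 0) return null; // nothing typed yet, show no hint
  if (ref.startsWith("hf://")) return "hf";
  const looksLikePath =
    ref.startsWith("/") ||
    ref.startsWith("./") ||
    ref.startsWith("../") ||
    ref.startsWith("~/") ||
    ref.toLowerCase().endsWith(".gguf");
  return looksLikePath ? "local" : "catalog";
}
```

Keeping the function pure is what makes it cheap to unit test without a Tauri runtime.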

Not yet implemented (queued):
- The Create-Agent 'Relay mesh' flow that pre-selects sprout-agent
  and pre-fills env vars via mesh_agent_preset. Pending agreement
  with Max on flow-vs-picker shape.
- Managed-agent row rendering for the typed LlmAuth (-32001)
  failure → 'Relay mesh denied this agent — check membership.'

Verified: pnpm typecheck, pnpm lint, pnpm test (348/348), pnpm check.
Commands will fail at runtime until Max's mesh_* commands land —
that's expected.

Signed-off-by: Eva <011987e296fd5006292d2f930b574be47c7801048d1983c46c425d3c95f0cffd@sprout-oss.stage.blox.sqprod.co>
…etection

Pure helpers + tests for the Create-Agent 'Run on relay mesh' flow Eva
blessed. Not yet integrated with CreateAgentDialog — that integration
lands when Max's mesh_agent_preset() command is on the integration tip
and types compile against the real Tauri surface. Doing the helpers
first keeps the dialog diff a one-liner-per-setter when it's time.

- meshAgentPresetPatch(): turn a MeshAgentPreset into the flat field
  patch the dialog applies via Object.assign-style fan-out across
  acpCommand / agentCommand / agentArgs / mcpCommand / model /
  envVars setters. Returns owned copies so the caller cannot mutate
  the preset.
- detectMeshPresetOverrides(): which user-set fields would the
  preset overwrite? Returns human-readable labels for the
  'Using Relay mesh — overrides this persona's model' honest-over-
  silent copy Eva named as a requirement. Empty/null values are
  not treated as 'set' — a fresh draft is purely additive.
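
A condensed sketch of the two helpers, to make the "owned copies" and "empty is not set" rules concrete. The field set is trimmed to three fields and the `AgentDraft` shape is hypothetical; the real MeshAgentPreset carries more fields than shown here.

```typescript
// Trimmed, hypothetical shapes; the real preset has more fields.
interface MeshAgentPreset {
  agentCommand: string;
  model: string;
  envVars: Record<string, string>;
}

interface AgentDraft {
  agentCommand?: string | null;
  model?: string | null;
  envVars?: Record<string, string>;
}

// Copy nested objects so the caller cannot mutate the preset through the patch.
function meshAgentPresetPatch(preset: MeshAgentPreset): MeshAgentPreset {
  return { ...preset, envVars: { ...preset.envVars } };
}

// Empty string and null both count as "not set by the user".
const isSet = (v: string | null | undefined): v is string =>
  v != null && v !== "";

// Report only fields where the user set a value that differs from the preset.
function detectMeshPresetOverrides(
  draft: AgentDraft,
  preset: MeshAgentPreset,
): string[] {
  const labels: string[] = [];
  if (isSet(draft.model) && draft.model !== preset.model) labels.push("model");
  if (isSet(draft.agentCommand) && draft.agentCommand !== preset.agentCommand) {
    labels.push("runtime");
  }
  const draftEnv = draft.envVars ?? {};
  const envClash = Object.keys(preset.envVars).some(
    (key) => isSet(draftEnv[key]) && draftEnv[key] !== preset.envVars[key],
  );
  if (envClash) labels.push("env vars");
  return labels;
}
```

Env vars that exist only in the draft are never reported, since the preset adds to them rather than replacing them.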

Verified: pnpm typecheck, pnpm check, pnpm test (358/358 including
10 new applyMeshAgentPreset tests covering: patch field mapping,
defensive copy, no-override on empty/matching draft, override
reporting for model/runtime/env-vars, env-var same-value-no-report,
additive env-var-no-report, empty-string treated like null).

Signed-off-by: Eva <011987e296fd5006292d2f930b574be47c7801048d1983c46c425d3c95f0cffd@sprout-oss.stage.blox.sqprod.co>
…mesh flow

Now that Max's foundation is on the integration tip, swap the scaffolded
type/api shims for the real `tauriMesh.ts` surface and ship the
Create-Agent flow Eva blessed.

- Delete `mesh-compute/types.ts` and `mesh-compute/api.ts`. All call
  sites now import from `@/shared/api/tauriMesh` directly. `ModelRefKind`
  (presentational) moves inline into `classifyModelRef.ts` where it's
  used. Single source of truth.
- `MeshComputeSettingsCard` and the polling hooks now consume Max's
  exact `MeshAvailability` / `MeshNodeStatus` / `MeshModelOption`
  shapes. Card behavior unchanged.
- `RelayMeshAgentSection`: new component, the 'Run on relay mesh' flow
  entry inside CreateAgentDialog. Renders as a rounded section with
  a toggle + model dropdown. Greyed with mesh availability's
  `reason` when `available === false`. On model pick, calls
  `mesh_agent_preset(modelId)` and exposes both `modelId` and the
  resolved `MeshAgentPresetPatch` to the parent so it can fan out
  into existing setters. Renders the override warning
  ('Using Relay mesh overrides this agent's model') when
  `detectMeshPresetOverrides` reports any clashing fields.
- `CreateAgentDialog`: adds `useMesh` + `meshModelId` state, hides the
  backend 'Run on' select and the ACP runtime field when `useMesh`
  is on, fans the preset out into the existing acpCommand /
  agentCommand / agentArgs / mcpCommand / envVars setters via the
  section's onModelIdChange callback, and includes `model:` in the
  submit input. Relay-mesh always uses local backend (sprout-agent
  + OpenAI-compat env vars); `isProviderMode` is suppressed.
  Submit guard blocks until a model is picked.

Verified locally:
- pnpm typecheck clean
- pnpm check clean (biome lint + format + file-size guard)
- pnpm test 358/358 (incl. 9 classifyModelRef + 10 applyMeshAgentPreset
  tests; no new tests for the UI section — render testing for this
  scope was disproportionate vs. typecheck + manual demo)

Queued for follow-up (out of this commit):
- ManagedAgentRow / agent-failure render: when `lastError` starts with
  'Agent reported error: llm auth:' render 'Relay mesh denied this
  agent — check your relay membership.' The seam is already shipped
  in Max's commit (`-32001` + log-tail capture into `last_error`);
  no UI currently renders `lastError` at all, so the friendly copy
  is additive and can ship separately.

Signed-off-by: Eva <011987e296fd5006292d2f930b574be47c7801048d1983c46c425d3c95f0cffd@sprout-oss.stage.blox.sqprod.co>
Signed-off-by: Eva <011987e296fd5006292d2f930b574be47c7801048d1983c46c425d3c95f0cffd@sprout-oss.stage.blox.sqprod.co>
…ference runbook)

Signed-off-by: Eva <011987e296fd5006292d2f930b574be47c7801048d1983c46c425d3c95f0cffd@sprout-oss.stage.blox.sqprod.co>
Signed-off-by: Eva <011987e296fd5006292d2f930b574be47c7801048d1983c46c425d3c95f0cffd@sprout-oss.stage.blox.sqprod.co>
Closes the loop on Max's auth-failure seam. When a managed agent exits
with an HTTP 401/403 from the OpenAI-compatible mesh endpoint:

  sprout-agent  raises AgentError::LlmAuth (json_rpc_code -32001,
                Display prefix 'llm auth: ...').
  sprout-acp    wraps it as 'Agent reported error: llm auth: ...'.
  managed_agents/storage.rs recovers that line from read_log_tail and
                writes it into ManagedAgent.lastError on nonzero exit.
  ManagedAgentRow (new)  reads lastError via friendlyAgentLastError,
                promotes the auth-failure case to 'Relay mesh denied this
                agent — check your relay membership.' rendered in
                destructive color under the Status block. Generic
                lastError content passes through verbatim so unrelated
                failures still surface their text.

- desktop/src/features/agents/lib/friendlyAgentLastError.ts: pure
  classifier returning { severity: 'denied' | 'generic'; copy } or
  null. Matches both the sprout-acp wrap and the unwrapped
  sprout-agent prefix; substring matches inside other messages are
  NOT promoted (no lying about unrelated crashes).
- friendlyAgentLastError.test.mjs: 8 tests covering null/empty, both
  matched prefixes, generic passthrough, whitespace trimming,
  substring-not-at-start anti-promotion, and non-auth 'Agent reported
  error: llm:' staying generic.
- ManagedAgentRow.tsx StatusBlock: renders the friendly copy when
  non-null, with destructive coloring for denial and muted-foreground
  for generic. Both row variants (expandable + non-expandable) thread
  the same friendlyError through.
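
The classifier contract described above is small enough to sketch in full. This is an approximation written from the description in this message (prefix match only, trim first, null on empty), not a copy of friendlyAgentLastError.ts; the constant names are made up.

```typescript
type FriendlyError = { severity: "denied" | "generic"; copy: string };

// Both the sprout-acp wrap and the bare sprout-agent prefix count as denial.
const AUTH_PREFIXES = ["Agent reported error: llm auth:", "llm auth:"];

// Copy text as quoted in this commit message (\u2014 is the dash in that copy).
const DENIED_COPY =
  "Relay mesh denied this agent \u2014 check your relay membership.";

function friendlyAgentLastError(
  lastError: string | null | undefined,
): FriendlyError | null {
  const text = (lastError ?? "").trim();
  if (text.length === 0) return null;
  // startsWith, not includes: an auth phrase buried inside another
  // message must stay generic so unrelated crashes are not mislabeled.
  const denied = AUTH_PREFIXES.some((prefix) => text.startsWith(prefix));
  return denied
    ? { severity: "denied", copy: DENIED_COPY }
    : { severity: "generic", copy: text };
}
```

For example, `friendlyAgentLastError("panic: llm auth: 401")` stays generic because the prefix is not at the start of the string.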

v1 limitation, explicitly named in the helper's doc comment: the typed
-32001 code from sprout-agent never reaches desktop structurally — ACP's
ObserverHandle is process-local in the child. The recovered string is
the seam we render against. Follow-up to make this fully structural is
ACP status file / desktop-owned observer sink.

Verified:
- pnpm typecheck clean
- pnpm check clean (lint + format + file-size, 528 files)
- pnpm test 366/366 (was 358; +8 new friendlyAgentLastError tests)

Signed-off-by: Eva <011987e296fd5006292d2f930b574be47c7801048d1983c46c425d3c95f0cffd@sprout-oss.stage.blox.sqprod.co>
Signed-off-by: Eva <011987e296fd5006292d2f930b574be47c7801048d1983c46c425d3c95f0cffd@sprout-oss.stage.blox.sqprod.co>
Signed-off-by: Eva <011987e296fd5006292d2f930b574be47c7801048d1983c46c425d3c95f0cffd@sprout-oss.stage.blox.sqprod.co>
Signed-off-by: Eva <011987e296fd5006292d2f930b574be47c7801048d1983c46c425d3c95f0cffd@sprout-oss.stage.blox.sqprod.co>
M1 proof that the mesh serve path loads a GGUF and generates over the
local OpenAI endpoint, on real hardware. Single-node serve-and-self-
consume (publish=false, auto_join=false, mDNS) — the one-box loopback.

Verified: Qwen3.6-35B-A3B IQ4_XS loads into Metal on M4 Max,
/v1/chat/completions returns 200 with coherent tokens.

NOTE: sets console_ui(true) to work around a serve::start readiness
deadlock in mesh bd16da4 (status-poll vs headless console bind);
see docs/mesh-llm-local-build.md. Fold into the e2e #[ignore] matrix
once the upstream readiness fix lands.

Signed-off-by: Eva <011987e296fd5006292d2f930b574be47c7801048d1983c46c425d3c95f0cffd@sprout-oss.stage.blox.sqprod.co>
…lapsible-if)

Two -D warnings clippy errors under rustc 1.95 that would block CI:
- storage.rs: moved meaningful_agent_error_from_log above #[cfg(test)] mod tests
- mesh_llm/mod.rs: collapsed inner if into a match guard

No logic change.

Signed-off-by: Eva <011987e296fd5006292d2f930b574be47c7801048d1983c46c425d3c95f0cffd@sprout-oss.stage.blox.sqprod.co>
@tlongwell-block tlongwell-block requested a review from a team as a code owner May 29, 2026 21:26
npub1qyvc0c5kl4gqv2fd97fsk46tu378sqgy35vc83rvgfwne90sel7s0ed67d and others added 2 commits May 29, 2026 18:15
…esh crates, paste advisory)

cargo-deny fails on the pinned mesh-llm dependency tree:
- several crates use the Unlicense license (not previously in the allowlist)
- six mesh-llm-* workspace crates omit a per-crate license field; the mesh repo
  is MIT OR Apache-2.0 (workspace Cargo.toml + Apache-2.0 LICENSE) — clarified
- RUSTSEC-2024-0436: paste unmaintained, transitive via iroh → netlink

advisories/bans/licenses/sources all pass locally after this. The clarify entries
and paste ignore are removable once mesh sets per-crate license fields upstream.

Signed-off-by: npub1qyvc0c5kl4gqv2fd97fsk46tu378sqgy35vc83rvgfwne90sel7s0ed67d <011987e296fd5006292d2f930b574be47c7801048d1983c46c425d3c95f0cffd@sprout-oss.stage.blox.sqprod.co>
Signed-off-by: Eva <011987e296fd5006292d2f930b574be47c7801048d1983c46c425d3c95f0cffd@sprout-oss.stage.blox.sqprod.co>
@tlongwell-block tlongwell-block force-pushed the mesh-relay-inprocess-v8 branch from d7b4fe3 to 1903ea8 Compare May 29, 2026 22:40
npub1qyvc0c5kl4gqv2fd97fsk46tu378sqgy35vc83rvgfwne90sel7s0ed67d and others added 7 commits May 29, 2026 18:53
Resolves version-bump conflicts (CHANGELOG/package.json/tauri.conf/pubspec/
Cargo.toml → 0.3.5) and the lib.rs file-size budget (kept 780 to cover both
main's SIGINT handlers and our mesh command registrations). Cargo.lock
regenerated with our mesh deps on top of main. No mesh logic touched; the
deleted meshClassifyModelRef wrapper stays deleted. cargo-deny green.

Signed-off-by: npub1qyvc0c5kl4gqv2fd97fsk46tu378sqgy35vc83rvgfwne90sel7s0ed67d <011987e296fd5006292d2f930b574be47c7801048d1983c46c425d3c95f0cffd@sprout-oss.stage.blox.sqprod.co>
Signed-off-by: npub1mprnacetjua2xx3p5eddmhxyk6wv929ymm5py8kd2xfxurxahspqqlgyta <d8473ee32b973aa31a21a65adddcc4b69cc2a8a4dee8121ecd51926e0cddbc02@sprout-oss.stage.blox.sqprod.co>
Signed-off-by: npub1mprnacetjua2xx3p5eddmhxyk6wv929ymm5py8kd2xfxurxahspqqlgyta <d8473ee32b973aa31a21a65adddcc4b69cc2a8a4dee8121ecd51926e0cddbc02@sprout-oss.stage.blox.sqprod.co>
Signed-off-by: npub1mprnacetjua2xx3p5eddmhxyk6wv929ymm5py8kd2xfxurxahspqqlgyta <d8473ee32b973aa31a21a65adddcc4b69cc2a8a4dee8121ecd51926e0cddbc02@sprout-oss.stage.blox.sqprod.co>
- trust rows take MEMBER_NSEC/STRANGER_NSEC instead of Keys::generate()
  (a generated key's membership is undefined — can't assert "member sees status")
- live_agent_completes_chat_over_mesh: real env-gated completion assertion
  via MESH_OPENAI_BASE; skips (not silent-passes) when no live endpoint
- live_split_model_completes: panics "not implemented" so --ignored can never
  report it green without a real multi-node split harness

Addresses Perci's e2e blocker: rows must not pass as tests while they only eprintln.
… docs

- iroh_relay: add verify_bearer_rejects_expired_timestamp covering the
  ±60s window (defense vs observed-token replay outside the live admit
  moment).
- iroh_relay: doc-comment the ViaOwner deny arm — v1 explicitly rejects
  NIP-OA owner-delegated agents at iroh admission even when HTTP would
  admit them, keeping the mesh-compute trust boundary legibly tighter
  and pointing at the NIP-OA scope follow-up.
- mesh_status_publisher: SproutMeshStatus doc — endpoint_addr and
  mesh_id are dial metadata, not access grants; iroh admission via
  NIP-98 → relay membership is the only gate.
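
The ±60s window that verify_bearer_rejects_expired_timestamp pins down reduces to a symmetric comparison. A sketch of the check follows, with a hypothetical function name and an inclusive boundary chosen as an assumption; the Rust implementation in iroh_relay may treat the edge differently.

```typescript
// Window size from the commit message; inclusive boundary is an assumption.
const BEARER_WINDOW_SECS = 60;

// Symmetric check: rejects both stale tokens and tokens dated in the future,
// which limits replay of an observed token to the window around its timestamp.
function isBearerTimestampFresh(
  createdAtSecs: number,
  nowSecs: number,
  windowSecs: number = BEARER_WINDOW_SECS,
): boolean {
  return Math.abs(nowSecs - createdAtSecs) <= windowSecs;
}
```

A token stamped 61 seconds ago fails, as does one stamped 61 seconds ahead of the relay's clock.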

Addresses Eva's N3 punch-list item. The third bullet from the original
proposal (ViaOwner-deny unit test) was scoped down to a doc-comment:
asserting it requires a real AppState (db+redis+typesense via
identity_archive::test_state), which fails the minimalness bar for a
two-line pattern-match arm.
Signed-off-by: npub1mprnacetjua2xx3p5eddmhxyk6wv929ymm5py8kd2xfxurxahspqqlgyta <d8473ee32b973aa31a21a65adddcc4b69cc2a8a4dee8121ecd51926e0cddbc02@sprout-oss.stage.blox.sqprod.co>
npub1qyvc0c5kl4gqv2fd97fsk46tu378sqgy35vc83rvgfwne90sel7s0ed67d added 2 commits May 29, 2026 20:02
- Resolve the mesh-llm rev from Cargo.lock instead of hardcoding `bd16da4`
  in two workflow files; a dependency bump no longer needs a lockstep CI edit.
- Cache the prebuilt llama native libraries with actions/cache (restore/save,
  house style) keyed on the resolved rev + OS + backend, so the expensive
  build is skipped on a hit and invalidated automatically on a rev change.
- docs: the cached CI build is shipped, not a "follow-up" — update to match.
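
To illustrate the idea of deriving the rev from Cargo.lock rather than hardcoding it, here is a sketch of the lookup. The workflow itself does this in a CI step; the function names, the crate name passed in, and the cache-key layout below are all hypothetical. It relies only on the lockfile convention that a git dependency's `source` line ends in `#<commit sha>`.

```typescript
// Find the locked commit for a git dependency by package name.
// Cargo.lock separates [[package]] entries with blank lines.
function resolveGitRev(cargoLock: string, packageName: string): string | null {
  for (const block of cargoLock.split(/\n\s*\n/)) {
    const lines = block.split("\n").map((line) => line.trim());
    if (!lines.includes(`name = "${packageName}"`)) continue;
    for (const line of lines) {
      // e.g. source = "git+https://host/repo?rev=abc1234#<full sha>"
      const match = line.match(/^source = "git\+[^"#]*#([0-9a-f]{7,40})"$/);
      if (match) return match[1] ?? null;
    }
  }
  return null;
}

// Hypothetical key layout: a rev bump changes the key and invalidates the cache.
function llamaCacheKey(rev: string, os: string, backend: string): string {
  return `mesh-llama-${os}-${backend}-${rev.slice(0, 7)}`;
}
```

Because the key embeds the resolved rev, bumping the dependency invalidates the cached native build with no workflow edit.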

Addresses Perci N4 + Sami N5.
- find_progressish_reason matches a typed phase/status/state/stage field instead
  of stringify-and-grep over the whole payload, so an unrelated field mentioning
  "preparing" can't pin the health badge to degraded forever (Sami N1).
- looks_like_model_ref drops the Qwen/Llama/GGUF name allowlist (missed
  Mistral/Phi/Gemma, false-positived on family substrings); a bare string is a
  ref only via hf:// scheme or .gguf ext — structured refs come through the
  typed model_id/modelRef/id path (Sami N2).
- Tests in sibling mod_tests.rs (#[path]) to keep mod.rs under the 500-line
  budget; pin both fixes incl. the unrelated-field regression.
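
Both fixes trade a loose text match for a structural one. The sketch below restates them in TypeScript for readers skimming the thread; the actual code is Rust in mesh_llm/mod.rs, and the progress vocabulary here (`PROGRESS_WORDS`) is an assumption.

```typescript
// Fields named in the commit message; only these are inspected.
const PHASE_FIELDS = ["phase", "status", "state", "stage"] as const;

// Assumed vocabulary for "still getting ready" states.
const PROGRESS_WORDS = ["preparing", "loading", "downloading"];

// Look at typed phase-like fields only, never the stringified payload,
// so an unrelated field mentioning "preparing" cannot trigger a match.
function findProgressishReason(
  payload: Record<string, unknown>,
): string | null {
  for (const field of PHASE_FIELDS) {
    const value = payload[field];
    if (typeof value !== "string") continue;
    const lower = value.toLowerCase();
    if (PROGRESS_WORDS.some((word) => lower.includes(word))) return value;
  }
  return null;
}

// A bare string is a model ref only by scheme or extension, not by family name.
function looksLikeModelRef(candidate: string): boolean {
  const value = candidate.trim();
  return value.startsWith("hf://") || value.toLowerCase().endsWith(".gguf");
}
```

With this shape, `{ note: "preparing release notes", status: "ready" }` yields no reason, which is the unrelated-field regression the tests pin.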
Comment thread .github/workflows/ci.yml
Comment on lines +500 to +513
run: |
  set -euo pipefail
  cargo fetch --manifest-path desktop/src-tauri/Cargo.toml
  SHORT='${{ steps.mesh_rev.outputs.short }}'
  MESH_ROOT=$(find "${CARGO_HOME:-$HOME/.cargo}/git/checkouts" -path "*/$SHORT" -type d -name "$SHORT" | head -1)
  if [[ -z "$MESH_ROOT" ]]; then
    echo "::error::mesh-llm checkout for $SHORT not found after cargo fetch"
    exit 1
  fi
  export LLAMA_STAGE_BACKEND=metal
  export LLAMA_STAGE_BUILD_DIR="$GITHUB_WORKSPACE/.cache/mesh-llama/build-stage-abi-metal"
  export CMAKE_OSX_DEPLOYMENT_TARGET=10.15
  "$MESH_ROOT/scripts/prepare-llama.sh" pinned
  "$MESH_ROOT/scripts/build-llama.sh" -DCMAKE_OSX_DEPLOYMENT_TARGET=10.15
Comment on lines +95 to +108
run: |
  set -euo pipefail
  cargo fetch --manifest-path desktop/src-tauri/Cargo.toml
  SHORT='${{ steps.mesh_rev.outputs.short }}'
  MESH_ROOT=$(find "${CARGO_HOME:-$HOME/.cargo}/git/checkouts" -path "*/$SHORT" -type d -name "$SHORT" | head -1)
  if [[ -z "$MESH_ROOT" ]]; then
    echo "::error::mesh-llm checkout for $SHORT not found after cargo fetch"
    exit 1
  fi
  export LLAMA_STAGE_BACKEND=metal
  export LLAMA_STAGE_BUILD_DIR="$GITHUB_WORKSPACE/.cache/mesh-llama/build-stage-abi-metal"
  export CMAKE_OSX_DEPLOYMENT_TARGET=10.15
  "$MESH_ROOT/scripts/prepare-llama.sh" pinned
  "$MESH_ROOT/scripts/build-llama.sh" -DCMAKE_OSX_DEPLOYMENT_TARGET=10.15
@michaelneale

FWIW, I think we may be able to build a better gate with mesh-side admission control rather than bespoke relay hosting (relays then see nothing in the clear and have no bearing on security). Mesh-LLM/mesh-llm#589 just merged the ability to do that, which may help.

Reason for relays: the closer they are to the connecting devices, the more likely QUIC direct connections are to establish, which keeps latency low.


5 participants