OTP supervision for multi-provider AI agents — an additive architectural exploration
T3Code's PR #581 centralizes
harness adapters behind a single HarnessService WebSocket server with event
projection via projector.ts. This is the right direction — a unified contract
across providers is exactly what multi-agent orchestration needs.
This page explores what happens when you take that same contract and add an OTP supervision layer underneath it. Not replacing the Node server — extending it with structural guarantees for the part that's hardest to get right: managing concurrent, crash-prone agent sessions.
A note on scope. The HarnessService abstraction in PR #581 is
itself still under discussion — Theo's own review comment asks what differentiates it from
the existing ProviderService / ProviderServiceLive layer. This
exploration treats the harness boundary as a useful architectural surface regardless of
where it ultimately lives in T3Code's stack. If the team consolidates
HarnessService back into ProviderService, the OTP mapping still
applies — it's the supervision boundary that matters, not the class name.
Anthropic gates subscription-based Claude Code access (a Claude Max subscription instead of per-usage API billing) behind its Agent SDK. The Agent SDK is implemented in Node; there is no Elixir version. So Claude and Codex must stay in Node, and that's non-negotiable.
This means any OTP layer has to be additive. It sits underneath the existing Node server, connected by a single WebSocket, handling supervision and lifecycle while Node retains SDK access, persistence, and TypeScript contracts.
This isn't a novel observation. George Guimarães documented the same convergence across LangGraph, AutoGen, CrewAI, and Langroid — every major agent framework is independently reinventing the actor model. Isolated state, message passing, supervision hierarchies, fault recovery: all solved problems on the BEAM since 1986.
T3Code's HarnessService centralization follows the same pattern. The question
isn't whether you need these primitives — you clearly do, that's what PR #581 is building.
The question is whether you build them in application code or get them from the runtime.
(Diagram) Green chips = concerns T3Code PR #581 is already solving. Dashed = structural gaps that OTP fills. Framework analysis via Guimarães (2026).
Guimarães argues that you can get about 70% of the way there with enough engineering — the remaining 30%, which includes preemptive scheduling, per-process garbage collection, and true fault isolation, requires runtime-level support.
Guimarães later qualified this claim, noting that the BEAM's advantages vary by workload and honestly naming the ecosystem gaps (immature LLM tooling, smaller talent pool, Python/Node-first SDKs). For T3Code specifically — a desktop tool, not a server — two of the four properties he cites (preemptive scheduling, hot code swapping) don't meaningfully apply. The two that do — per-process garbage collection and true fault isolation — are the ones this exploration focuses on.
The HarnessService pattern is that 70%. It gives you centralized lifecycle,
event projection, and a unified contract. What it can't give you — because Node's V8 doesn't
support it — is per-session crash isolation and independent memory containment. That's the
30% that an OTP layer fills.
| Concern | Owner | Status |
|---|---|---|
| Agent SDK access (Claude, Codex) | Node | stays |
| Canonical event types + contracts | Node (TypeScript) | stays |
| SQLite persistence | Node | stays |
| Browser / Electron WebSocket | Node | stays |
| Provider process supervision | Elixir (OTP) | new |
| Crash isolation | Elixir (BEAM heaps) | new |
| Per-session memory containment | Elixir (process heaps) | new |
| Event normalization | Elixir | new |
| Event projection + snapshot | Both | shared |
When you run 4+ concurrent agent sessions — especially with subagents — the V8 shared heap becomes a liability. One leaky or crashing session affects every other session sharing that process. OTP's per-process heaps make this structurally impossible.
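A minimal sketch of that isolation property, independent of any T3Code code: one "session" process grows a large heap and crashes, while its siblings report their own (still tiny) heaps afterward. Plain spawned processes stand in for the real session GenServers.

```elixir
defmodule IsolationDemo do
  def run do
    parent = self()

    # Three well-behaved "sessions" that report their own heap usage on demand.
    healthy =
      for i <- 1..3 do
        spawn(fn ->
          receive do
            :report ->
              {:memory, bytes} = Process.info(self(), :memory)
              send(parent, {:heap, i, bytes})
          end
        end)
      end

    # One leaky "session": grows a large structure on its own heap, then crashes.
    leaky =
      spawn(fn ->
        _big = Enum.map(1..1_000_000, & &1)
        exit(:boom)
      end)

    ref = Process.monitor(leaky)

    receive do
      {:DOWN, ^ref, :process, _pid, :boom} -> :ok
    end

    # The leaked megabytes died with the leaky process's private heap;
    # the survivors never shared it, so their heaps stay a few KB.
    for pid <- healthy, do: send(pid, :report)

    for _ <- 1..3 do
      receive do
        {:heap, _i, bytes} -> bytes
      end
    end
  end
end
```

The same experiment on a shared V8 heap has no equivalent: there is no per-session heap to measure, and no guarantee the leak dies with the session.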
Eight test types; single-run results, framed as architectural demonstrations rather than statistical claims. The structural properties (per-process heaps, supervision cleanup) are BEAM guarantees, not observations.
We scaled concurrent sessions from 5 to 200, all streaming simultaneously. From 5 to 100, both runtimes are functionally identical — you would not be able to tell them apart. At 150, Node's event loop lag creeps to 199ms. At 200, Node degrades sharply: latency jumps to 3.3 seconds, throughput drops (the event loop is so saturated it processes fewer events, not more), and event loop lag hits 1.2 seconds. Elixir at 200 sessions: 607ms latency, 18,000 events/sec, near-zero scheduler utilization, constant ~268KB per session.
This is the number that matters: if T3Code stays at 1–10 sessions, PR #581's Node approach is simpler and sufficient. If it grows into a control plane with concurrent subagent trees, the structural advantage materializes around 150 sessions — and it's not gradual. It's a cliff.
Codex in plan mode spawned 10 real subagents. Node oscillated 1.8–54 MB on the shared heap. Elixir held 54–63 MB flat while processing 14,918 events and 824 tool calls. Both completed all work — the difference is in predictability under load.
Elixir wins on isolation, observability, and per-session memory attribution. Node wins on raw throughput and SDK ecosystem access. Hence: two runtimes, each handling what it does best.
Adding a new provider to T3Code today means writing a TypeScript adapter, extending the
contracts, adding tests, and wiring it through the orchestration layer. In the OTP
architecture, providers that speak a CLI or HTTP protocol (and don't require a Node-specific
SDK) can be added as a single GenServer module + a clause in
SessionManager.provider_module/1, with no changes to the Node side. For
providers that do require Node SDKs — Claude (Agent SDK) and Codex (app server)
today — the Node adapter is still necessary. The OTP layer manages supervision and
lifecycle, but the SDK call itself stays in Node.
The savings are real but scoped: the more providers that communicate via standard protocols (CLI, HTTP, WebSocket), the more the OTP architecture reduces per-provider integration cost. For SDK-gated providers, both sides need code.
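To make the shape of that per-provider cost concrete, here is a hypothetical sketch. `SessionManager.provider_module/1` is the clause mentioned above; the module names (`Harness.*`), the `:gemini` example provider, and the event shapes are all invented for illustration, and the CLI interaction is stubbed so the sketch is self-contained (a real adapter would open a Port).

```elixir
defmodule Harness.Providers.Gemini do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(opts) do
    # Real code: Port.open({:spawn_executable, cli_path}, [:binary, ...])
    {:ok, %{session_id: Keyword.fetch!(opts, :session_id), events: []}}
  end

  @impl true
  def handle_cast({:provider_event, event}, state) do
    # Normalize the provider-specific event into the canonical shape
    # before forwarding it over the bridge (forwarding elided here).
    normalized = %{session_id: state.session_id, payload: event}
    {:noreply, %{state | events: [normalized | state.events]}}
  end

  @impl true
  def handle_call(:events, _from, state) do
    {:reply, Enum.reverse(state.events), state}
  end
end

defmodule Harness.SessionManager do
  # Adding a CLI/HTTP provider is one new module plus one clause here,
  # with no changes on the Node side.
  def provider_module(:gemini), do: Harness.Providers.Gemini

  # SDK-gated providers still route to the Node adapter.
  def provider_module(:claude), do: {:node_adapter, :claude}
end
```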
If T3Code stayed a one-agent-at-a-time tool, I would not push hard for Elixir. Subagents change the math. Imagine a future session:
Now ask the annoying questions:
What happens if one subagent crashes mid-turn?
What happens if one subagent is blocked on approval while another keeps streaming?
What happens if the parent cancels and all children must stop cleanly?
What happens if one child leaks memory or gets wedged on reconnect?
A Node-only architecture can answer all of them — but it answers them behaviorally: careful
process accounting, careful cleanup chains, careful shared-state discipline. OTP answers
them structurally: each subagent is a GenServer under a
DynamicSupervisor, parent crash cascades to children via process links, one
child's memory leak is contained to its own heap, and restart policies are declarative, not
imperative.
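The cascade claim is checkable in a few lines. This sketch (module names invented) starts a parent "session" and three linked subagents under a DynamicSupervisor, kills the parent, and observes that the whole subtree is gone — the supervisor's `:temporary` restart policy cleans up rather than resurrects.

```elixir
defmodule SubagentDemo do
  defmodule Subagent do
    use GenServer

    def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

    @impl true
    def init(opts) do
      # Link to the parent session so its crash cascades to this child.
      if parent = opts[:parent], do: Process.link(parent)
      {:ok, opts}
    end
  end

  def run do
    {:ok, sup} = DynamicSupervisor.start_link(strategy: :one_for_one)

    spec = fn args ->
      %{id: Subagent, start: {Subagent, :start_link, [args]}, restart: :temporary}
    end

    {:ok, parent} = DynamicSupervisor.start_child(sup, spec.([]))

    children =
      for _ <- 1..3 do
        {:ok, child} = DynamicSupervisor.start_child(sup, spec.([parent: parent]))
        child
      end

    # Cancel the parent: the links take down the subtree with no manual
    # bookkeeping, and :temporary children are cleaned up, not restarted.
    Process.exit(parent, :kill)
    Process.sleep(100)
    Enum.map(children, &Process.alive?/1)
  end
end
```

The Node equivalent of this test is the "careful cleanup chain": an explicit registry of child handles, iterated and torn down by hand on every cancellation path.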
The more autonomous and unpredictable agents become, the stronger this argument gets. "Start child," "monitor child," "restart child," and "stop subtree" stop being architecture words. They become the product.
BEAM runtime adds ~60–80 MB to the Electron bundle. That's real. For a desktop app, it matters. The counter-argument is that with 4+ concurrent sessions running subagents, you're already in "supervise a tree of unstable runtimes" territory — and that's literally what OTP was designed for. But the bundle size cost is worth acknowledging upfront.
It's a second runtime to maintain. The Elixir ecosystem is smaller than Node's. Hiring is harder. The mitigation is that the OTP layer is intentionally thin (total ~3,900 LOC across all modules) — it does one thing well and delegates everything else to Node.
The bridge WebSocket is an additional failure surface. In the Node-only approach, events flow directly — no bridge, no extra hop. With the harness, there's a Phoenix Channel between Elixir and Node. That adds ~1ms of latency (invisible for most operations) but also a real failure mode: if the bridge drops during a streaming turn, events are lost until reconnection. The WAL ring buffer mitigates this with replay, but the failure mode does not exist in a single-runtime design.
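The replay mitigation is a small amount of code. A sketch, under assumptions not stated in the source: events carry a monotonically increasing sequence number, the buffer is bounded (a ring), and on reconnect the Node side reports the last sequence it saw so the Elixir side can replay the gap. The `WalRing` name and API are invented.

```elixir
defmodule WalRing do
  defstruct events: :queue.new(), size: 0, max: 1024

  def new(max \\ 1024), do: %__MODULE__{max: max}

  # Append a {seq, event} pair, evicting the oldest entry past capacity.
  def append(%__MODULE__{} = ring, seq, event) do
    q = :queue.in({seq, event}, ring.events)

    if ring.size < ring.max do
      %{ring | events: q, size: ring.size + 1}
    else
      {_, q} = :queue.out(q)
      %{ring | events: q}
    end
  end

  # On reconnect, the client sends its last-seen seq; replay everything newer.
  def replay_from(%__MODULE__{} = ring, last_seen_seq) do
    ring.events
    |> :queue.to_list()
    |> Enum.filter(fn {seq, _event} -> seq > last_seen_seq end)
    |> Enum.map(fn {_seq, event} -> event end)
  end
end
```

The bound is the honest part: events older than the ring's window are gone, so a long-lived disconnect still loses data. Replay narrows the failure mode; it does not erase it.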
Cross-runtime debugging is objectively harder. When something fails in Node, you get a continuous stack trace. With the harness, an error can start in Elixir (GenServer crash), manifest in Node (bridge disconnect), and surface in the browser (events stop). Two log systems, two debuggers. This is a cost that compounds over time if the boundary isn't sharp.
The Agent SDK constraint is real and permanent. As long as Anthropic gates subscription-based Claude Code access through their Node SDK, Claude and Codex must route through Node. The OTP layer manages the provider process lifecycle, but the SDK call itself stays in the Node adapter. This is a feature, not a bug — it respects the ecosystem as it exists.
We compared both approaches across ten concrete failure scenarios. Not benchmarks — failure scenarios. Because the OTP argument is not about throughput. It's about what happens when things go wrong.
| # | Scenario | Node | Elixir | Verdict |
|---|---|---|---|---|
| 1 | Provider CLI hangs | Session stale, others unaffected | GenServer stale, others unaffected | tie |
| 2 | Provider CLI crashes | child.on("exit") cleans up | Port exit handled, supervisor cleanup | tie |
| 3 | Memory leak in one session | Shared heap +110 MB, lag +169ms p99 | Leak bounded per-process (94–352KB) | elixir |
| 4 | Unhandled exception | 5/5 survivors continue | 5/5 survivors continue | tie |
| 5 | 50+ concurrent sessions | Degrades sharply at ~150 (3.3s p99 at 200) | 607ms at 200, near-zero scheduler util | elixir |
| 6 | Process spawn/teardown churn | OS handles it | Port + process links, similar | tie |
| 7 | Subagent trees | Manual parent-child tracking | DynamicSupervisor, process links | elixir |
| 8 | Bridge WebSocket disconnect | Does not exist — events flow directly | Events lost until reconnection | node |
| 9 | Development complexity | One language, one debugger | Two languages, cross-runtime debugging | node |
| 10 | Desktop packaging | Electron bundles Node, zero extra cost | +60–80 MB BEAM runtime | node |
Score: Elixir 3, Node 3, Tie 4. But the scores are not equal in weight. Elixir's wins (3, 5, 7) are the ones that scale worst if left unaddressed — memory coupling compounds over hours of use, event loop saturation hits a cliff at ~150 sessions, and subagent cleanup chains grow with tree depth. Node's wins (8, 9, 10) are present-tense shipping costs. Both are real. The question is which set of problems you'd rather have.
A note on the "careful application code" failure class.
Macroscope's review of PR #581 found concrete instances worth examining. One is a logic bug
independent of runtime choice: adapter-emitted events pass through
#publish without re-sequencing, which can regress the global snapshot sequence
— this would need fixing in Elixir too. But the other two are lifecycle-management concerns
that OTP handles structurally: shutdownSession aborts the event stream before
the adapter completes, potentially losing final session.exited events; and
attachSession never calls #startStreaming, leaving attached
sessions deaf to events. These are exactly the class of lifecycle gaps where a supervisor's
declarative restart and shutdown policies eliminate entire categories of ordering bugs — not
because the Node code is bad (it's early WIP and well-structured), but because managing
concurrent session lifecycles through application code requires getting every edge case
right manually. A supervisor doesn't have edge cases. It has a restart policy.
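What "a restart policy" looks like in practice, as a hypothetical sketch (module names invented): the `shutdown: 5_000` line in the child spec is the declarative counterpart of the shutdownSession ordering bug. The supervisor delivers a shutdown signal and waits, so the session's terminate/2 gets to flush its final event before the process dies.

```elixir
defmodule Harness.SessionServer do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(opts) do
    # Trapping exits is what makes terminate/2 run on supervisor shutdown.
    Process.flag(:trap_exit, true)
    {:ok, %{sink: Keyword.fetch!(opts, :sink)}}
  end

  @impl true
  def terminate(_reason, state) do
    # The final event is emitted before the supervisor lets the process die:
    # the ordering shutdownSession currently has to get right by hand.
    send(state.sink, {:event, :"session.exited"})
    :ok
  end
end

defmodule Harness.SessionSupervisor do
  use DynamicSupervisor

  def start_link(_), do: DynamicSupervisor.start_link(__MODULE__, [], name: __MODULE__)

  @impl true
  def init(_), do: DynamicSupervisor.init(strategy: :one_for_one)

  def start_session(sink) do
    spec = %{
      id: Harness.SessionServer,
      start: {Harness.SessionServer, :start_link, [[sink: sink]]},
      restart: :transient, # restart only on abnormal exit
      shutdown: 5_000      # give terminate/2 up to 5s to flush
    }

    DynamicSupervisor.start_child(__MODULE__, spec)
  end
end
```

Nothing here is clever; that is the point. The ordering guarantee lives in two declarative fields of the child spec, not in a cleanup chain.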
For the full analysis with crossover benchmarks, methodology notes, and adversarial review: complete writeup.