Skip to content

fix(frontend): make disconnect indicator clickable to reconnect#9410

Open
bestvibes wants to merge 6 commits into
marimo-team:mainfrom
bestvibes:fix/clickable-disconnect-indicator
Open

fix(frontend): make disconnect indicator clickable to reconnect#9410
bestvibes wants to merge 6 commits into
marimo-team:mainfrom
bestvibes:fix/clickable-disconnect-indicator

Conversation

@bestvibes
Copy link
Copy Markdown

@bestvibes bestvibes commented Apr 28, 2026

📝 Summary

Closes #9385

This pull request was primarily authored by a coding agent.

Problem

When marimo is accessed over an SSH-forwarded tunnel (or any flaky link), the kernel WebSocket drops on tunnel reset or laptop sleep. The browser shows the broken-link indicator in the top-left, and the only recovery is a full page reload — disruptive because it resets scroll position, cell focus, and any unsaved client state, despite the kernel surviving the WS drop in edit mode and the server already supporting session resume on a fresh connection (marimo/_server/api/endpoints/ws/ws_session_connector.py).

The frontend was forcing the reload because of two compounding issues:

  1. One-shot reconnect gate: useMarimoKernelConnection.shouldTryReconnecting only resets on a successful onOpen. After partysocket's 10-retry budget is exhausted, the gate stays closed and ws.reconnect() is never called again from our code.
  2. Upstream partysocket bug: in partysocket@1.1.10, _connectLock leaks in the _connect() max-retries early-return — even if the gate were reset, ws.reconnect() would silently no-op because the lock is permanently held. Fixed upstream in partykit#322, released as partysocket 1.1.13.
  3. Stuck spinner: the onClose default branch always sets state to CONNECTING, even after partysocket has exhausted its retry budget — the spinner outlives the actual retry loop with no escape.

The broken-link icon also gives the visual impression of being clickable but is a no-op <div> today.

What changed

This is two commits — a dependency bump (precondition) followed by the frontend fix.

chore(frontend): bump partysocket 1.1.10 → 1.1.13

Patch-version bump pulling in the upstream _connectLock fix. No API-surface changes — verified that retryCount, reconnect(), and maxRetries signatures are identical between 1.1.10 and 1.1.13. The lockfile diff is exactly 6 lines, all about partysocket.

fix(frontend): make disconnect indicator clickable to reconnect

  • Disconnect indicator becomes a real <button> (status.tsx). Native button gives Enter/Space activation and accessibility; previously it was a <div> with a tooltip. When the connection is in a recoverable CLOSED state, clicking calls a new reconnect() exposed by useMarimoKernelConnection. When the connection is CLOSED for any other reason, the button is disabled.
  • reconnect() probes the runtime first via the existing runtimeManager.isHealthy() (/health). If the server is genuinely unreachable (e.g. the SSH tunnel is dead), we transition straight back to CLOSED + KERNEL_DISCONNECTED in ~1s instead of burning partysocket's full retry budget on the silent click. The probe also resets the one-shot reconnect gate so subsequent automatic retries work.
  • Surface CLOSED when partysocket has given up. In the onClose default branch, when ws.retryCount >= MAX_RETRIES, transition to CLOSED + KERNEL_DISCONNECTED instead of CONNECTING. This makes the disconnect icon reappear and become clickable again, so the user can manually retry without reloading. Without this, the spinner stayed up forever even after partysocket had stopped trying.
  • Refactor: the inline switch on event.reason in onClose is extracted into a pure classifyCloseEvent helper (close-handler.ts) so the regression behavior can be unit-tested. The hook is now thin glue: classify → setConnection → optionally close transport or call tryReconnecting.
  • Tests: close-handler.test.ts (15 cases) covers each terminal reason plus the retry-budget-exhaustion path. status.test.tsx (3 cases) covers the click → callback wiring and the disabled state.

Heads-up for review

  • runtimeManager.isHealthy() has a side effect (setDOMBaseUri() on success). It was previously only triggered during initial startup; now it can fire on each manual reconnect. Almost certainly a no-op in practice (URL hasn't changed) but flagging in case you want it gated.
  • IConnectionTransport.retryCount was added to the transport interface so the close handler can detect partysocket give-up. This leaks a partysocket-specific concept into a transport-agnostic interface; an alternative would be to have the transport emit a synthetic close event with a "gave-up" reason. Open to that direction if preferred — current approach was chosen for minimal diff.
  • No automatic-behavior changes for users on a healthy connection. The only paths that exercise the new code are user-initiated clicks and the previously-stuck post-give-up case.
  • No new user-facing strings. "kernel not found" / tooltip text reuse existing copy; only new string is the tooltip suffix "— click to reconnect".

📋 Pre-Review Checklist

  • For large changes, or changes that affect the public API: this change was discussed or approved through an issue, on Discord, or the community discussions (Please provide a link if applicable).
  • Any AI generated code has been reviewed line-by-line by the human PR author, who stands by it.
  • Video or media evidence is provided for any visual changes (optional).

✅ Merge Checklist

  • I have read the contributor guidelines.
  • Documentation has been updated where applicable, including docstrings for API changes.
  • Tests have been added for the changes made.

I have read the CLA Document and I hereby sign the CLA

Pulls in upstream fix partykit/partykit#322: `_connectLock` was leaked
in the max-retries early-return of `_connect()`, so once partysocket
exhausted its retry budget every subsequent `reconnect()` call was a
silent no-op because the lock was permanently held. This made any
manual or programmatic reconnect attempt fail after a long outage,
forcing a full page reload to recover.

Patch-version bump, no API surface changes.
When marimo runs over an SSH-forwarded tunnel and the WebSocket drops
(tunnel reset, laptop sleep, NAT idle reaping), the only way to
recover today is a full page reload. The kernel survives the WS drop
in edit mode and the server already supports session resume on a
fresh connection — only the frontend was forcing the reload.

- Disconnect indicator is now a `<button>` when the connection is in
  a recoverable CLOSED state. Click invokes a new `reconnect()`
  exposed by `useMarimoKernelConnection` that resets the one-shot
  reconnect gate and kicks off `ws.reconnect()`.
- `reconnect()` probes the runtime via `runtimeManager.isHealthy()`
  before reopening the socket so a click while the tunnel is dead
  fails in ~1s instead of burning partysocket's full retry budget.
- The `onClose` default branch now transitions to CLOSED +
  KERNEL_DISCONNECTED when partysocket has exhausted `maxRetries`,
  instead of leaving the UI stuck on the spinner indefinitely.
- Close-event classification extracted into a pure helper
  (`classifyCloseEvent`) with unit tests covering each terminal
  reason and the retry-budget exhaustion path.

Closes marimo-team#9385
@vercel
Copy link
Copy Markdown

vercel Bot commented Apr 28, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
marimo-docs Ready Ready Preview, Comment May 13, 2026 3:04am

Request Review

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 28, 2026

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

@bestvibes
Copy link
Copy Markdown
Author

I have read the CLA Document and I hereby sign the CLA

@bestvibes bestvibes marked this pull request as ready for review May 6, 2026 02:23
Copy link
Copy Markdown
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues found across 13 files

Architecture diagram
sequenceDiagram
    participant User as User (click)
    participant Status as StatusOverlay (status.tsx)
    participant Hook as useMarimoKernelConnection
    participant CloseHandler as close-handler.ts
    participant Transport as ReconnectingWebSocket (partysocket)
    participant Runtime as Runtime Manager (/health)
    participant Backend as Backend Server

    Note over User,Backend: Manual Reconnect Flow (NEW)

    User->>Status: click disconnect icon (<button>)
    Status->>Status: NEW: button is rendered, onReconnect passed
    Status->>Hook: NEW: reconnect()

    Hook->>Hook: NEW: shouldTryReconnecting = true
    Hook->>Hook: NEW: setConnection(CONNECTING)
    Hook->>Runtime: NEW: isHealthy() → GET /health

    alt Health check succeeds
        Runtime-->>Hook: 200 OK
        Note over Runtime,Hook: CHANGED: setDOMBaseUri() side effect
        Hook->>Transport: NEW: ws.reconnect()
        Transport->>Transport: NEW: partysocket bugfix (1.1.13) unlocks _connectLock
        Transport->>Backend: WebSocket upgrade
        Backend-->>Transport: onOpen
        Transport-->>Hook: onOpen callback
        Hook->>Hook: shouldTryReconnecting = false
        Hook-->>Status: setConnection(OPEN)
    else Health check fails (server unreachable)
        Runtime-->>Hook: 4xx/5xx/timeout
        Hook->>Hook: NEW: setConnection(CLOSED, KERNEL_DISCONNECTED)
        Hook-->>Status: button stays disabled, icon shown
    end

    Note over Transport,Hook: Automatic Retry with Budget Tracking (CHANGED)

    Transport->>Backend: WebSocket close event
    Transport-->>Hook: onClose(event)
    Hook->>CloseHandler: classifyCloseEvent(event, {retryCount, maxRetries})

    alt Terminal reason (MARIMO_ALREADY_CONNECTED, MARIMO_SHUTDOWN, etc.)
        CloseHandler-->>Hook: { kind: "terminal", status: CLOSED, closeTransport }
        Hook->>Transport: ws.close() to prevent further reconnects
        Hook-->>Status: setConnection(CLOSED with appropriate reason)
    else Retry budget exhausted (retryCount >= MAX_RETRIES=10)
        CloseHandler-->>Hook: { kind: "gave-up", status: CLOSED, KERNEL_DISCONNECTED }
        Hook-->>Status: CHANGED: setConnection(CLOSED) instead of CONNECTING
        Note over Hook,Status: Spinner replaced by clickable disconnect icon
    else Transient close (retries remaining)
        CloseHandler-->>Hook: { kind: "retry", status: CONNECTING }
        Hook->>Transport: tryReconnecting(code, reason)
        Transport->>Backend: automatic partysocket retry
    end

    Note over Status,User: Disconnect Icon Behavior (CHANGED)

    alt Connection is CLOSED + KERNEL_DISCONNECTED + onReconnect provided
        Status-->>User: NEW: <button> with "click to reconnect" tooltip
    else Connection is CLOSED + no onReconnect
        Status-->>User: disabled <button>
    else Connection is CLOSED + ALREADY_RUNNING (canTakeover)
        Status-->>User: LockedIcon (no reconnect button)
    end
Loading

@mscolnick mscolnick requested a review from kirangadhave May 11, 2026 14:05
Copy link
Copy Markdown
Member

@kirangadhave kirangadhave left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bestvibes great job. Just a few comments to address, and then should be good to merge. Thanks a lot!

Comment thread frontend/src/core/websocket/close-handler.ts Outdated
Comment thread frontend/src/core/websocket/close-handler.ts Outdated
Comment thread frontend/src/core/websocket/useMarimoKernelConnection.tsx
Comment thread frontend/src/core/websocket/close-handler.ts Outdated
Comment thread frontend/src/core/websocket/useMarimoKernelConnection.tsx Outdated
Comment thread frontend/src/core/websocket/useMarimoKernelConnection.tsx
bestvibes and others added 2 commits May 11, 2026 20:12
- Inline `classifyCloseEvent` into `useMarimoKernelConnection.tsx` and
  delete `close-handler.ts`; co-locate the test alongside the consumer.
- Use `logNever` in the default branch of the close-reason switch so a
  new MARIMO_* reason added server-side surfaces a console warning
  instead of falling silently into the retry path. Move the retry/
  give-up logic after the switch.
- `reconnect()`: early-return on `CONNECTING` as well as `OPEN` so a
  double-click can't fire a duplicate request.
- `reconnect()`: reset `shouldTryReconnecting.current = false` on
  health-probe failure so auto-reconnect doesn't keep firing against
  an unreachable runtime.
@bestvibes
Copy link
Copy Markdown
Author

Thanks for the review @kirangadhave, let me know your thoughts on the latest changes!

@kirangadhave
Copy link
Copy Markdown
Member

@cubic-dev-ai

@cubic-dev-ai
Copy link
Copy Markdown
Contributor

cubic-dev-ai Bot commented May 12, 2026

@cubic-dev-ai

@kirangadhave I have started the AI code review. It will take a few minutes to complete.

Copy link
Copy Markdown
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 12 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="frontend/src/components/editor/header/status.tsx">

<violation number="1" location="frontend/src/components/editor/header/status.tsx:57">
P2: Disabled reconnect button suppresses tooltip when reconnect is unavailable</violation>
</file>
Architecture diagram
sequenceDiagram
    participant User
    participant Status as StatusOverlay
    participant Hook as useMarimoKernelConnection
    participant Transport as PartySocket
    participant Runtime as runtimeManager
    participant Server as Server (/health & WS)

    Note over User,Server: Manual Reconnect Flow (User clicks disconnect icon)

    User->>Status: click disconnect button
    Status->>Hook: reconnect()
    Hook->>Hook: set state to CONNECTING
    Hook->>Runtime: isHealthy() → GET /health

    alt Server unreachable (e.g., SSH tunnel dead)
        Runtime-->>Hook: false
        Hook->>Hook: set state to CLOSED + KERNEL_DISCONNECTED
        Hook-->>Status: disconnected (disabled button)
        Note over Status: Button stays disabled until next automatic close event
    else Server healthy
        Runtime-->>Hook: true (side effect: setDOMBaseUri())
        Hook->>Transport: ws.reconnect()
        Transport->>Server: WebSocket handshake
        Server-->>Transport: onOpen
        Transport-->>Hook: onOpen callback
        Hook->>Hook: set state to OPEN, reset retry gate
    end

    Note over User,Server: Automatic Retry Exhaustion (partysocket gives up)

    Transport->>Transport: retry 1..MAX_RETRIES (silent)
    Transport->>Hook: onClose (retryCount >= MAX_RETRIES)
    Hook->>Hook: classifyCloseEvent() → kind: "gave-up"
    Hook->>Hook: set state to CLOSED + KERNEL_DISCONNECTED
    Hook-->>Status: disconnected indicator appears (enabled button)
    Status-->>User: "App disconnected — click to reconnect"

    Note over User,Server: Server-Initiated Terminal Close (e.g., MARIMO_SHUTDOWN)

    Server->>Transport: close with reason "MARIMO_SHUTDOWN"
    Transport->>Hook: onClose event
    Hook->>Hook: classifyCloseEvent() → kind: "terminal"
    Hook->>Hook: set state to CLOSED + KERNEL_DISCONNECTED
    Hook->>Transport: ws.close() (prevent auto-reconnect)
    Hook-->>Status: disconnected indicator (button disabled, no onReconnect)
Loading

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread frontend/src/components/editor/header/status.tsx Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves kernel WebSocket recovery UX by upgrading partysocket and making the top-left disconnect indicator actionable, enabling manual reconnect without a full page reload after retry exhaustion.

Changes:

  • Bump partysocket from 1.1.10 to 1.1.13 (lockfile + package.json).
  • Add retry-budget awareness (MAX_RETRIES, retryCount) and refactor close handling via classifyCloseEvent, plus a new manual reconnect() path that probes /health.
  • Update the status UI to render the disconnect indicator as a <button> and add Vitest coverage for close classification and indicator click wiring.

Reviewed changes

Copilot reviewed 11 out of 12 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
pnpm-lock.yaml Updates lockfile to partysocket@1.1.13.
frontend/package.json Bumps direct dependency partysocket to 1.1.13.
frontend/src/core/websocket/useWebSocket.tsx Introduces MAX_RETRIES and configures partysocket retries accordingly.
frontend/src/core/websocket/useMarimoKernelConnection.tsx Adds reconnect() and refactors close-handling into classifyCloseEvent using retry budget state.
frontend/src/core/websocket/transports/transport.ts Extends transport interface with retryCount.
frontend/src/core/websocket/transports/basic.ts Implements retryCount for the basic transport.
frontend/src/core/websocket/tests/useMarimoKernelConnection.test.ts Adds unit tests for classifyCloseEvent.
frontend/src/core/run-app.tsx Wires reconnect callback down into the app container.
frontend/src/core/edit-app.tsx Wires reconnect callback down into the app container.
frontend/src/components/editor/app-container.tsx Plumbs onReconnect into StatusOverlay.
frontend/src/components/editor/header/status.tsx Renders disconnect indicator as a button and hooks up onReconnect.
frontend/src/components/editor/header/tests/status.test.tsx Adds tests for indicator click/disabled behavior.
Files not reviewed (1)
  • pnpm-lock.yaml: Language not supported
Comments suppressed due to low confidence (1)

frontend/src/core/websocket/useWebSocket.tsx:46

  • createConnectionTransport returns a ReconnectingWebSocket and force-casts it to IConnectionTransport, but IConnectionTransport now requires a retryCount: number. This cast bypasses any guarantee that retryCount exists at runtime (or stays compatible across partysocket versions), and if it’s missing/undefined the give-up detection will never trigger. Consider wrapping the socket in a small adapter that defines a retryCount getter (e.g., reading (socket as any).retryCount ?? 0) instead of relying on a double-cast.
function createConnectionTransport(
  options: Pick<UseConnectionTransportOptions, "url" | "static">,
): IConnectionTransport {
  if (options.static) {
    return BasicTransport.empty();
  }
  if (isWasm()) {
    return createPyodideConnection();
  }
  // Create a connection transport using the ReconnectingWebSocket from partysocket
  // This handles reconnecting when the connection is lost.
  const urlProvider = options.url; // We don't call the URL provider now since it may change (i.e. if the runtime redirects)
  // Cast needed: ReconnectingWebSocket types readyState as `number`
  // but IConnectionTransport expects `0 | 1 | 2 | 3`
  return new ReconnectingWebSocket(urlProvider, undefined, {
    maxRetries: MAX_RETRIES,
    debug: false,
    startClosed: true,
    // long timeout -- the server can become slow when many notebooks
    // are open.
    connectionTimeout: 10_000,
  }) as unknown as IConnectionTransport;

Comment thread frontend/src/components/editor/header/status.tsx Outdated
Comment thread frontend/src/core/websocket/useMarimoKernelConnection.tsx
Comment thread frontend/src/core/websocket/__tests__/useMarimoKernelConnection.test.ts Outdated
Copy link
Copy Markdown
Member

@kirangadhave kirangadhave left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

almost there. Can you please address the copilot and cubic.dev PR comments? I think we should be good to merge after that. I missed adding those reviews in last iteration.

Also it would be great if you add a video of the button in action.

bestvibes and others added 2 commits May 12, 2026 23:02
- status.tsx: gate the reconnect affordance on
  `connection.code === KERNEL_DISCONNECTED`. MALFORMED_QUERY and
  KERNEL_STARTUP_ERROR are deterministically non-recoverable, so the
  button is disabled for those reasons. ALREADY_RUNNING is unchanged
  (handled by LockedIcon).
- status.tsx: wrap the button in a <span> with toggled tabIndex so
  the tooltip remains reachable when the button is disabled — a
  disabled <button> swallows pointer events.
- test: import MAX_RETRIES from useWebSocket.tsx instead of
  redeclaring, so the test follows the production constant.
- test: add hook-level coverage for the new reconnect() — OPEN and
  CONNECTING early-returns, healthy → ws.reconnect(), unhealthy →
  CLOSED + KERNEL_DISCONNECTED + gate reset.
@bestvibes
Copy link
Copy Markdown
Author

Comments addressed! Here's a short video, it's hard to repro the scenario succinctly since it needs to wait for the retry budget to get exhausted but I think I caught it here.

marimo_demo.mov

@bestvibes bestvibes requested a review from kirangadhave May 13, 2026 03:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Disconnect indicator unclickable; reconnect requires page reload after WS retries exhausted

3 participants