Define behavior when ConnectionBroker crashes mid-run #736

@oboehmer

Summary

pyATS D2D execution runs a ConnectionBroker inside the main nac-test process, and test jobs/subprocesses connect to it over a Unix socket (NAC_TEST_BROKER_SOCKET). Issue #486 lists "broker crash mid-test" as a missing failure-mode scenario, but the expected behavior is currently undefined.

We need to decide whether a broker crash should:

  • abort the run (treat as fatal), or
  • fall back to direct testbed/device connections for remaining work, or
  • attempt to restart the broker and reconnect, etc.

This decision impacts both product behavior and what we should test.

Context

  • The broker is started in-process in the orchestrator, and its socket path is exported to subprocesses.
  • If the broker dies mid-run, subprocess BrokerClient operations can either hang (there are no I/O timeouts today) or fail with connection errors.
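
To illustrate the hang risk, here is a minimal sketch of a client call that bounds every socket operation with a timeout. It is not the actual BrokerClient: `broker_request`, `BrokerUnavailableError`, and the one-shot request/reply framing are hypothetical, and reading `NAC_TEST_BROKER_SOCKET` as an environment variable holding the socket path is an assumption.

```python
import os
import socket
from typing import Optional


class BrokerUnavailableError(RuntimeError):
    """Hypothetical error: the broker did not answer within the timeout."""


def broker_request(
    payload: bytes,
    timeout: float = 5.0,
    socket_path: Optional[str] = None,
) -> bytes:
    """Send one request over the Unix socket and return the raw reply.

    Assumption: NAC_TEST_BROKER_SOCKET is an environment variable that
    holds the socket path. The real wire protocol is not shown here.
    """
    path = socket_path or os.environ["NAC_TEST_BROKER_SOCKET"]
    try:
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
            # The timeout applies to connect, send, and recv alike.
            sock.settimeout(timeout)
            sock.connect(path)
            sock.sendall(payload)
            reply = sock.recv(65536)
    except OSError as exc:  # includes socket.timeout and refused connections
        raise BrokerUnavailableError(f"broker unreachable at {path}") from exc
    if not reply:
        # An empty read means the peer closed the connection.
        raise BrokerUnavailableError(f"broker at {path} closed the connection")
    return reply
```

With a bound like this, a dead broker surfaces as an exception the subprocess can act on, whichever policy is chosen below.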

Decision needed

Choose one:

  1. Fatal / abort
    • Treat broker crash as unrecoverable.
    • Main process exits with an error and surfaces a clear message.
  2. Fallback to direct connections

    • If broker becomes unavailable, remaining subprocesses should connect directly (no pooling/caching).
    • Must define logging and any performance/behavior caveats.
  3. Restart broker

    • Attempt to relaunch broker and reconnect clients.
    • Requires careful state management and may not be worth complexity.
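
For discussion, the three options could be modeled as an explicit policy that the orchestrator consults when it detects a dead broker. This is only a sketch under assumptions: `BrokerCrashPolicy`, `BrokerCrashedError`, `on_broker_crash`, and the `restart_broker` callback are hypothetical names, not existing nac-test APIs.

```python
import enum
from typing import Callable


class BrokerCrashPolicy(enum.Enum):
    """Hypothetical policy names mirroring the three options above."""

    ABORT = "abort"
    FALLBACK = "fallback"
    RESTART = "restart"


class BrokerCrashedError(RuntimeError):
    """Hypothetical fatal error surfaced by the main process."""


def on_broker_crash(
    policy: BrokerCrashPolicy,
    restart_broker: Callable[[], None],
    max_restarts: int = 1,
) -> str:
    """Return the connection mode for remaining work: "broker" or "direct".

    Raises BrokerCrashedError when the run should be aborted.
    """
    if policy is BrokerCrashPolicy.ABORT:
        raise BrokerCrashedError("ConnectionBroker crashed; aborting run")
    if policy is BrokerCrashPolicy.FALLBACK:
        # Remaining subprocesses connect directly (no pooling/caching).
        return "direct"
    # RESTART: bounded attempts, then give up with a clear error.
    for _ in range(max_restarts):
        try:
            restart_broker()
            return "broker"
        except OSError:
            continue
    raise BrokerCrashedError("ConnectionBroker crashed and could not be restarted")
```

Making the policy a single value keeps the tests simple: each option maps to one observable outcome (an error, direct mode, or broker mode).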

Test implications

Once the policy is chosen, add integration/chaos tests that:

  • start a broker
  • connect a BrokerClient
  • kill broker mid-run
  • assert the chosen behavior (abort/fallback/restart) and that tests do not hang
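
The steps above could be sketched roughly as follows, using a stand-in broker rather than the real ConnectionBroker. Everything here is an assumption for illustration: the fake broker script, `run_broker_crash_check`, and the raw-socket client are hypothetical, and a real test would use BrokerClient and assert the chosen policy instead of a generic "no hang" outcome.

```python
import os
import socket
import subprocess
import sys
import tempfile
import time

# Stand-in broker: accept one client, then idle until it is killed.
_FAKE_BROKER = """
import socket, sys, time
srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
srv.bind(sys.argv[1])
srv.listen(1)
conn, _ = srv.accept()
time.sleep(60)
"""


def _connect_with_retry(path: str, deadline: float) -> socket.socket:
    """Retry until the fake broker is listening or the deadline passes."""
    while True:
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        try:
            sock.connect(path)
            return sock
        except OSError:
            sock.close()
            if time.monotonic() > deadline:
                raise
            time.sleep(0.05)


def run_broker_crash_check(io_timeout: float = 2.0) -> str:
    """Kill the broker mid-run and report what the client observes.

    Returns "error" or "eof" when the crash is detected, "hang" when the
    client only escapes via its timeout, and "reply" if data came back.
    """
    path = os.path.join(tempfile.mkdtemp(), "broker.sock")
    proc = subprocess.Popen([sys.executable, "-c", _FAKE_BROKER, path])
    try:
        sock = _connect_with_retry(path, time.monotonic() + 10.0)
        with sock:
            sock.settimeout(io_timeout)
            proc.kill()  # simulate the crash
            proc.wait()
            try:
                sock.sendall(b"ping")
                data = sock.recv(1024)
            except socket.timeout:  # must precede OSError, its base class
                return "hang"
            except OSError:
                return "error"
            return "reply" if data else "eof"
    finally:
        if proc.poll() is None:
            proc.kill()
            proc.wait()


def test_client_does_not_hang_after_broker_crash() -> None:
    assert run_broker_crash_check() in {"error", "eof"}
```

Killing a separate process instead of closing a socket in-process is deliberate: it exercises the same path as a real crash, where the kernel tears down the broker's sockets without any cleanup code running.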
