Summary
PyATS D2D execution runs a ConnectionBroker inside the main nac-test process, and test jobs/subprocesses connect to it via a Unix socket (NAC_TEST_BROKER_SOCKET). Issue #486 lists "broker crash mid-test" as a missing failure-mode scenario, but the expected behavior is currently undefined.
We need to decide whether a broker crash should:
- abort the run (treat as fatal), or
- fall back to direct testbed/device connections for remaining work, or
- attempt to restart the broker and reconnect clients.
This decision impacts both product behavior and what we should test.
Context
- The broker is started in-process in the orchestrator, and its socket path is exported to subprocesses.
- If the broker dies mid-run, subprocess BrokerClient operations can hang (there are no IO timeouts today) or fail with connection errors.
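To illustrate the "no IO timeouts" point, here is a minimal sketch of a timeout-bounded request over the broker socket. It is not the actual BrokerClient code: `broker_request`, `BrokerUnavailable`, the single-`recv` framing, and the 5-second default are all assumptions made for illustration. Only the `NAC_TEST_BROKER_SOCKET` variable comes from the description above.

```python
import os
import socket
from typing import Optional


class BrokerUnavailable(Exception):
    """Hypothetical error: the broker could not be reached or did not answer in time."""


def broker_request(payload: bytes, timeout: float = 5.0,
                   socket_path: Optional[str] = None) -> bytes:
    """Send one request to the broker without ever blocking indefinitely."""
    path = socket_path or os.environ.get("NAC_TEST_BROKER_SOCKET")
    if not path:
        raise BrokerUnavailable("NAC_TEST_BROKER_SOCKET is not set")
    try:
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
            # The timeout applies to each blocking call: connect, sendall, recv.
            sock.settimeout(timeout)
            sock.connect(path)
            sock.sendall(payload)
            reply = sock.recv(65536)  # assumed framing: one reply fits in one read
    except OSError as exc:
        # socket.timeout, ConnectionRefusedError and FileNotFoundError are all OSError.
        raise BrokerUnavailable(f"broker at {path} is unavailable: {exc}") from exc
    if not reply:
        # An empty read means the peer closed the connection (e.g. the broker died).
        raise BrokerUnavailable(f"broker at {path} closed the connection")
    return reply
```

With a bound like this, a dead or wedged broker surfaces as a single exception type in the subprocess, which whichever policy is chosen below can then act on.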
Decision needed
Choose one:
- Fatal / abort
  - Treat a broker crash as unrecoverable.
  - Main process exits with an error and surfaces a clear message.
- Fallback to direct connections
  - If the broker becomes unavailable, remaining subprocesses should connect directly (no pooling/caching).
  - Must define logging and any performance/behavior caveats.
- Restart broker
  - Attempt to relaunch the broker and reconnect clients.
  - Requires careful state management and may not be worth the complexity.
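To make the three options easier to compare, here is a rough sketch of how each might look at the point where a subprocess obtains a connection. Everything in it is hypothetical (`BrokerCrashPolicy`, `BrokerCrashed`, `get_connection`, and the three callables); it does not reflect existing nac-test code, and the single-retry behavior for the restart option is just one possible choice.

```python
import enum
import logging
from typing import Callable, TypeVar

T = TypeVar("T")
log = logging.getLogger("broker_policy_sketch")


class BrokerCrashPolicy(enum.Enum):
    ABORT = "abort"
    FALLBACK = "fallback"
    RESTART = "restart"


class BrokerCrashed(RuntimeError):
    """Hypothetical fatal error surfaced to the main process."""


def get_connection(policy: BrokerCrashPolicy,
                   via_broker: Callable[[], T],
                   direct: Callable[[], T],
                   restart_broker: Callable[[], None]) -> T:
    """Try the broker first; on a connection error, apply the chosen policy."""
    try:
        return via_broker()
    except OSError as exc:
        if policy is BrokerCrashPolicy.ABORT:
            raise BrokerCrashed("broker unavailable, aborting run") from exc
        if policy is BrokerCrashPolicy.FALLBACK:
            log.warning("broker unavailable (%s); using a direct connection", exc)
            return direct()
        # RESTART: relaunch once and retry once; a second failure is fatal.
        log.warning("broker unavailable (%s); attempting restart", exc)
        restart_broker()
        try:
            return via_broker()
        except OSError as retry_exc:
            raise BrokerCrashed("broker restart did not recover") from retry_exc
```

The sketch only works if broker failures reach the caller as exceptions, so all three options depend on bounded IO in the client rather than on the hang described under Context.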
Test implications
Once the policy is chosen, add integration/chaos tests that:
- start a broker
- connect a BrokerClient
- kill broker mid-run
- assert the chosen behavior (abort/fallback/restart) and that tests do not hang
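The steps above could be sketched roughly as follows. This is a sketch under stated assumptions, not a test against the real ConnectionBroker or BrokerClient: the broker is replaced by a tiny stand-in process (`STAND_IN_BROKER`), the client is a raw Unix socket, and `connect_with_retry` and `run_broker_crash_scenario` are hypothetical helper names. It only covers the "tests do not hang" assertion; the policy-specific assertion would be added once a policy is chosen.

```python
import os
import socket
import subprocess
import sys
import tempfile
import time

# Stand-in broker (hypothetical): accepts one client, then idles until killed.
STAND_IN_BROKER = """
import socket, sys, time
srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
srv.bind(sys.argv[1])
srv.listen(1)
conn, _ = srv.accept()
time.sleep(60)
"""


def connect_with_retry(path: str, io_timeout: float,
                       startup_timeout: float = 10.0) -> socket.socket:
    """Connect once the stand-in broker is listening, retrying during startup."""
    deadline = time.monotonic() + startup_timeout
    while True:
        client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        client.settimeout(io_timeout)
        try:
            client.connect(path)
            return client
        except (FileNotFoundError, ConnectionRefusedError):
            client.close()
            if time.monotonic() > deadline:
                raise
            time.sleep(0.05)


def run_broker_crash_scenario(io_timeout: float = 2.0) -> str:
    """Start broker, connect, kill broker, and report how the client sees it."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "broker.sock")
        broker = subprocess.Popen([sys.executable, "-c", STAND_IN_BROKER, path])
        try:
            client = connect_with_retry(path, io_timeout)
            broker.kill()  # simulate the mid-run crash
            broker.wait()
            try:
                client.recv(1)  # end-of-stream after the crash
                return "detected"
            except socket.timeout:
                return "hang"  # the read only ended because of the timeout
            except OSError:
                return "detected"  # e.g. a connection reset
            finally:
                client.close()
        finally:
            if broker.poll() is None:
                broker.kill()
                broker.wait()
```

A real test would replace the stand-in with the actual broker and client, then assert the agreed outcome (abort, fallback, or restart) in addition to the no-hang check.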
Related