UCP/EP/FT: lanes auto recovery after failure #11371
evgeny-leksikov wants to merge 17 commits into openucx:master from
- recursive request restart from zcopy completion
- am header leak
- ep_config ref counter fix for ifaces reactivation in lanes discard flow
…ation
- Removed assignments of ep->am_lane in multiple functions to streamline endpoint initialization and configuration.
- Updated logging to include am_lane information when clearing failed lanes.
- assert am_lane != NULL in ucp_ep_recovery_send_request
    uint64_t dst_ep_id; /* Endpoint ID of destination, can be
                           UCS_PTR_MAP_KEY_INVALID */

    /* TODO: move these to a separate header(s) due to compatibility reason. */
Need to do that :) (also discussed f2f)
    }

    return (ucp_ep_get_am_lane(req->send.ep) == UCP_NULL_LANE) ?
           UCS_ERR_UNREACHABLE : UCS_OK;
ucp_proto_request_restart checks ucs_assertv_always(status == UCS_ERR_CANCELED..
    ucs_time_t now;
    ucs_status_t status;

    UCS_ASYNC_BLOCK(&worker->async);
Can deadlock in MT mode? (ucs_callbackq_add_oneshot..)
I'm not sure I fully understand. Each implementation of UCS_ASYNC_BLOCK supports recursive locking.
    /* Arm recovery; the first round fires after recovery_interval, which
     * also lets the async discard above finalize. */
    (void)ucp_ep_recovery_arm(ucp_ep);
Why silently ignore failures?
    if (ucs_likely(self->status == UCS_OK) ||
        (req->send.ep->flags & UCP_EP_FLAG_FAILED)) {
        ucp_am_eager_zcopy_completion(self);
Might pass a req with reg_desc == NULL to ucp_am_eager_zcopy_completion (if status != OK from before)
Not relevant in #11379, but the "error from before" flow should restart and reinitialize the request; the intermediate error status is tracked by the UCT completion.
        return status;
    }

    /**
not relevant in #11379, replaced with assert in master
        return status;
    }

    return (ucp_ep_get_am_lane(req->send.ep) == UCP_NULL_LANE) ?
not relevant in #11379, replaced with assert in master
        return UCS_ERR_NO_MEMORY;
    }

    ucp_trace_req(req, "allocating AM header descriptor 0x%p", reg_desc);

            if (*lane_idx != UCP_NULL_LANE) {
                tmp_lanes.insert(*lane_idx);
            }
        }

    }

    short_progress_loop();
    ASSERT_EQ(0, m_err_count) << "Error callback invoked " << m_err_count << " times";
Now there's no check of m_err_count after the final injection (moved to inside iter)
Replaced with #11379
What?
POC implementation of lanes auto recovery after failure
Introduce new parameters to control the recovery interval and the number of retries:
UCX_RECOVERY_INTERVAL=5000000.00us
UCX_RECOVERY_RETRIES=inf
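
The two parameters suggest a timer-driven retry policy. The following is a minimal sketch under assumptions, not the code from this PR: `demo_recovery_t`, `demo_recovery_arm`, `demo_recovery_poll`, and the `DEMO_RETRIES_INF` encoding of `inf` are hypothetical names. It only illustrates that the first round fires one interval after arming and that a finite retry budget eventually stops further rounds.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical encoding of "inf" retries; the PR may encode it differently. */
#define DEMO_RETRIES_INF UINT32_MAX

typedef struct {
    double   interval_us;  /* delay between recovery rounds, microseconds */
    uint32_t max_retries;  /* retry budget, or DEMO_RETRIES_INF */
    uint32_t attempts;     /* rounds started so far (finite budget only) */
    double   next_fire_us; /* earliest time the next round may start */
} demo_recovery_t;

/* Arm the policy: the first round is due one interval from now. */
static void demo_recovery_arm(demo_recovery_t *r, double now_us,
                              double interval_us, uint32_t max_retries)
{
    assert(interval_us > 0.0);
    r->interval_us  = interval_us;
    r->max_retries  = max_retries;
    r->attempts     = 0;
    r->next_fire_us = now_us + interval_us;
}

/* Return 1 if a recovery round should start now, 0 otherwise. */
static int demo_recovery_poll(demo_recovery_t *r, double now_us)
{
    if (now_us < r->next_fire_us) {
        return 0; /* interval has not elapsed yet */
    }

    if (r->max_retries != DEMO_RETRIES_INF) {
        if (r->attempts >= r->max_retries) {
            return 0; /* retry budget exhausted */
        }
        ++r->attempts;
    }

    r->next_fire_us = now_us + r->interval_us;
    return 1;
}
```

With an interval of 5000000 us and a budget of 2, polling at 5 s and 10 s starts a round each time, and every later poll returns 0. With `DEMO_RETRIES_INF`, a round starts every time the interval elapses.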
Why?
Fault tolerance improvements
How?
Introduce a new wireup-like handshake: re-create the failed UCT lanes and exchange their addresses over a live AM lane.
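
One way to picture the handshake is as a small per-lane state machine. This is an illustrative sketch only: the states, events, and the `demo_lane_step` function are hypothetical and do not come from the PR. The one detail taken from the reviewed code is that recovery has nothing to send on when there is no live AM lane.

```c
#include <assert.h>

/* Hypothetical per-lane recovery states. */
typedef enum {
    DEMO_LANE_FAILED,    /* lane was torn down after an error */
    DEMO_LANE_RECREATED, /* local transport endpoint re-created */
    DEMO_LANE_ADDR_SENT, /* local address sent to the peer over the AM lane */
    DEMO_LANE_CONNECTED  /* peer address received, lane usable again */
} demo_lane_state_t;

/* Hypothetical events driving the state machine. */
typedef enum {
    DEMO_EV_RECREATE_OK,
    DEMO_EV_ADDR_SENT,
    DEMO_EV_PEER_ADDR_RECEIVED
} demo_event_t;

#define DEMO_NULL_LANE (-1) /* stand-in for "no live AM lane" */

/* Apply one event. Returns 0 on success, -1 if the transition is not
 * allowed or if the address exchange has no live AM lane to run on. */
static int demo_lane_step(demo_lane_state_t *state, demo_event_t ev,
                          int am_lane)
{
    switch (ev) {
    case DEMO_EV_RECREATE_OK:
        /* Re-creating the local endpoint does not need the AM lane. */
        if (*state != DEMO_LANE_FAILED) {
            return -1;
        }
        *state = DEMO_LANE_RECREATED;
        return 0;
    case DEMO_EV_ADDR_SENT:
        if ((*state != DEMO_LANE_RECREATED) || (am_lane == DEMO_NULL_LANE)) {
            return -1;
        }
        *state = DEMO_LANE_ADDR_SENT;
        return 0;
    case DEMO_EV_PEER_ADDR_RECEIVED:
        if ((*state != DEMO_LANE_ADDR_SENT) || (am_lane == DEMO_NULL_LANE)) {
            return -1;
        }
        *state = DEMO_LANE_CONNECTED;
        return 0;
    }

    return -1; /* unknown event */
}
```

A lane walks from `DEMO_LANE_FAILED` to `DEMO_LANE_CONNECTED` through three events in order. If the AM lane itself is gone, the address exchange steps fail and the lane stays where it was.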