Skip to content

UCP/EP/FT: lanes auto recovery after failure#11371

Closed
evgeny-leksikov wants to merge 17 commits intoopenucx:masterfrom
evgeny-leksikov:ucp_ft_recovery_poc
Closed

UCP/EP/FT: lanes auto recovery after failure#11371
evgeny-leksikov wants to merge 17 commits intoopenucx:masterfrom
evgeny-leksikov:ucp_ft_recovery_poc

Conversation

@evgeny-leksikov
Copy link
Copy Markdown
Contributor

@evgeny-leksikov evgeny-leksikov commented Apr 24, 2026

What?

POC implementation of lanes auto recovery after failure
Inctoduce new parameters to control interval and retries:
UCX_RECOVERY_INTERVAL=5000000.00us
UCX_RECOVERY_RETRIES=inf

Why?

Fault tolerance improvements

How?

Introduce new wireup-like handshake: re-create failed UCT lanes and exchange their addresses by live AM lane

Comment thread src/ucp/wireup/wireup.h
uint64_t dst_ep_id; /* Endpoint ID of destination, can be
UCS_PTR_MAP_KEY_INVALID */

/* TODO: move these to a separate header(s) due to compatibility reason. */
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to do that :) (also discussed f2f)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 4816797

Comment thread src/ucp/am/eager_multi.c
}

return (ucp_ep_get_am_lane(req->send.ep) == UCP_NULL_LANE) ?
UCS_ERR_UNREACHABLE : UCS_OK;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ucp_proto_request_restart checks ucs_assertv_always(status == UCS_ERR_CANCELED..

Comment thread src/ucp/core/ucp_ep.c
ucs_time_t now;
ucs_status_t status;

UCS_ASYNC_BLOCK(&worker->async);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can deadlock in MT mode? (ucs_callbackq_add_oneshot..)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I fully understand. Each implementation of UCS_ASYNC_BLOCK isupports recursive locking

Comment thread src/ucp/core/ucp_ep.c

/* Arm recovery; the first round fires after recovery_interval, which
* also lets the async discard above finalize. */
(void)ucp_ep_recovery_arm(ucp_ep);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why silent-ignore failures?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 4816797

Comment thread src/ucp/am/eager_multi.c

if (ucs_likely(self->status == UCS_OK) ||
(req->send.ep->flags & UCP_EP_FLAG_FAILED)) {
ucp_am_eager_zcopy_completion(self);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might pass a req with reg_desc == NULL to ucp_am_eager_zcopy_completion (if status != OK from before)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not relevant in #11379 but the flow "error from before" should restart and reinitialize the request, intemediate error status tracked by UCT completion

Comment thread src/ucp/am/eager_multi.c
return status;
}

/**
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: comment style

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not relevant in #11379 , replaced with assert in master

Comment thread src/ucp/am/eager_multi.c
return status;
}

return (ucp_ep_get_am_lane(req->send.ep) == UCP_NULL_LANE) ?
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ucs_unlikely?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not relevant in #11379 , replaced with assert in master

Comment thread src/ucp/am/eager.inl
return UCS_ERR_NO_MEMORY;
}

ucp_trace_req(req, "allocating AM header descriptor 0x%p", reg_desc);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0x is redundant

if (*lane_idx != UCP_NULL_LANE) {
tmp_lanes.insert(*lane_idx);
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trailing space after

}

short_progress_loop();
ASSERT_EQ(0, m_err_count) << "Error callback invoked " << m_err_count << " times";
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now there's no check of m_err_count after the final injection (moved to inside iter)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed in #11405

@evgeny-leksikov
Copy link
Copy Markdown
Contributor Author

replaced with #11379

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants