[HLD] Add handling of role programming and bulk sync failures in HA state machine.#2336

Open
BYGX-wcr wants to merge 2 commits into sonic-net:master from BYGX-wcr:enhance-ha-workflow

@BYGX-wcr BYGX-wcr commented May 14, 2026

Previously, we didn't define how to handle bulk sync failures or role activation failures.

This PR defines the handling as follows:

  1. When a bulk sync failure occurs, we roll the Active back to Standalone.
  2. When a DPU fails to activate a target role, we enter standalone setup.
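
The two rules above can be read as a small transition function. The following is a minimal Python sketch of that reading, not the sonic-dash-ha implementation: the names `HaState`, `HaEvent`, `STANDALONE_SETUP`, and `next_state` are hypothetical, and applying rule 2 from any current state is an assumption made for illustration.

```python
from enum import Enum, auto


class HaState(Enum):
    # Simplified, hypothetical subset of HA states used only for this sketch.
    ACTIVE = auto()
    STANDBY = auto()
    STANDALONE_SETUP = auto()  # stands in for "standalone setup" in rule 2
    STANDALONE = auto()


class HaEvent(Enum):
    # Hypothetical event names for the two failure cases in the PR description.
    BULK_SYNC_FAILED = auto()
    ROLE_ACTIVATION_FAILED = auto()


def next_state(state: HaState, event: HaEvent) -> HaState:
    """Return the state after a failure event, following the two rules."""
    # Rule 1: a bulk sync failure rolls the Active back to Standalone.
    if event is HaEvent.BULK_SYNC_FAILED and state is HaState.ACTIVE:
        return HaState.STANDALONE
    # Rule 2: a role activation failure leads to standalone setup.
    # Assumption: this applies regardless of the current state.
    if event is HaEvent.ROLE_ACTIVATION_FAILED:
        return HaState.STANDALONE_SETUP
    # Any other combination leaves the state unchanged in this sketch.
    return state
```

For example, `next_state(HaState.ACTIVE, HaEvent.BULK_SYNC_FAILED)` yields `HaState.STANDALONE`, while a bulk sync failure seen in `HaState.STANDBY` leaves the state unchanged because rule 1 only mentions the Active.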

I also define a new field in DASH_HA_SCOPE_STATE to give users of this state a more direct way to learn about failures from DPUs.

Related Code PR:
sonic-net/sonic-dash-ha#165

…chine.

Signed-off-by: BYGX-wcr <wcr@live.cn>
@BYGX-wcr BYGX-wcr requested a review from r12f May 14, 2026 23:22
@mssonicbld
Collaborator

/azp run

@azure-pipelines

No pipelines are associated with this pull request.

r12f
r12f previously approved these changes May 15, 2026
Contributor

@r12f r12f left a comment

minor comment. lgtm.

| DASH_HA_SCOPE_STATE | | | State of each HA scope. |
| | \<HA_SCOPE_ID\> | | HA scope ID. It can be the HA set ID or ENI ID, depending on which HA mode is used. |
| | | last_updated_time | The last update time of this state in milliseconds. |
| | | last_update_ec | The error code of the last update. 0 means no error. Non-zero values indicate different errors (see the table below). |
Contributor

change ec to err / error / errorcode

Contributor Author

changed

The current list of error codes is defined as follows:
| Values | Error |
| --- | --- |
| 1 | Role activation failed |
Contributor

maybe we can follow sai status

Contributor Author

changed
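
To illustrate how a consumer might use the error-code field discussed in the two threads above, here is a minimal Python sketch. It reflects only the revision quoted in the review (field `last_update_ec`, where 0 means no error and 1 means role activation failed); the field was later renamed and its values revised, and those final names and values are not shown in this conversation. The helper `describe_last_update`, the `KNOWN_ERRORS` mapping, and the plain-dict representation of a DASH_HA_SCOPE_STATE entry are all hypothetical.

```python
# Only code 1 appears in the quoted revision of the error table.
KNOWN_ERRORS = {1: "Role activation failed"}


def describe_last_update(entry: dict) -> str:
    """Summarize the error code of the last update for one HA scope entry."""
    # Assumption: field values arrive as strings, and a missing field
    # is treated the same as 0 (no error).
    code = int(entry.get("last_update_ec", "0"))
    if code == 0:
        return "no error"
    # Codes outside the quoted table are reported as unknown, not guessed.
    return KNOWN_ERRORS.get(code, f"unknown error code {code}")


# Hypothetical entry with the two fields quoted in the review.
sample_entry = {"last_updated_time": "1700000000000", "last_update_ec": "1"}
summary = describe_last_update(sample_entry)
```

Keeping the mapping in one table makes it straightforward to swap in a different code set if the values end up following another convention, as suggested in the review.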

…ode for the semantics

Signed-off-by: BYGX-wcr <wcr@live.cn>
@mssonicbld
Collaborator

/azp run

@azure-pipelines

No pipelines are associated with this pull request.

@zjswhhh zjswhhh self-assigned this May 19, 2026
@zjswhhh zjswhhh changed the title from "Add handling of role activation and bulk sync failures in HA state machine." to "[HLD] Add handling of role activation and bulk sync failures in HA state machine." May 19, 2026
Contributor

@zjswhhh zjswhhh left a comment

lgtm

@BYGX-wcr BYGX-wcr changed the title from "[HLD] Add handling of role activation and bulk sync failures in HA state machine." to "[HLD] Add handling of role programming and bulk sync failures in HA state machine." May 21, 2026
5 participants