alarmd — Alarm Framework for SONiC [HLD]#2314

Draft
neelDatta17 wants to merge 3 commits into sonic-net:master from neelDatta17:alarmd_hld

Conversation

@neelDatta17

This PR adds the high-level design document for alarmd, a new daemon that provides centralized, structured hardware alarm detection for SONiC. It consumes existing STATE_DB data (with zero changes to existing daemons), evaluates fault conditions defined entirely in JSON config, and writes unified alarm state to a SYSTEM_ALARMS table with severity, category, and lifecycle tracking.

Signed-off-by: Neel Datta <neel.datta@hpe.com>
Signed-off-by: Neel Datta <neel.datta@hpe.com>
@mssonicbld
Collaborator

/azp run

@azure-pipelines

No pipelines are associated with this pull request.

@neelDatta17 marked this pull request as draft April 28, 2026 08:20
Comment thread doc/alarmd/alarmd.md
- **System resources**: CPU, memory, disk usage (similar to monit, but
integrated into SYSTEM_ALARMS)
- **Container health**: Check if critical Docker containers are running
- **Platform-specific hardware**: FPGA status, I2C bus health, GPIO state,
Contributor

Does it have to be either a bash or python script? Could it be a binary?

Author

I'll look into testing this; if it's feasible, I'll update the HLD accordingly. I may also create a separate design doc for the script runner itself.

Comment thread doc/alarmd/alarmd.md
### 5.1 Functional Requirements

- alarmd shall poll STATE_DB tables at a configurable interval (default 3 seconds) and evaluate field conditions defined in alarm definition files.
- alarmd shall execute external health-check scripts at a configurable interval (default 60 seconds) and evaluate exit codes.
Contributor

Does it have to be a script that runs on a per-interval basis or could I run a health check script just once? For instance, what if I want to check the status of a platform init using a script during boot? I don't suppose there would be a need to run that script repeatedly, would there?
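
To make the periodic versus run-once question concrete, here is a minimal sketch of a script-check scheduler. It is not taken from the HLD: the `ScriptCheck` fields, the `interval_s=None` convention for "run once", the alarm IDs, and the rule that a non-zero exit code or a timeout means "fault" are all assumptions made for illustration. Because the check is launched through `subprocess.run`, the command could just as well be a compiled binary as a bash or Python script.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScriptCheck:
    """Hypothetical description of one script-based check (not from the HLD)."""
    alarm_id: str
    argv: list                            # command to run; may be any executable
    interval_s: Optional[float] = 60.0    # None means "run once" (assumed extension)
    timeout_s: float = 10.0
    next_run: float = 0.0                 # time of the next scheduled run
    done: bool = False                    # set once a run-once check has executed


def run_check(check: ScriptCheck) -> bool:
    """Run the command and return True if the alarm should be raised.

    Assumption: exit code 0 is healthy; any other code, a timeout, or a
    failure to launch the command counts as a fault.
    """
    try:
        result = subprocess.run(check.argv, timeout=check.timeout_s,
                                stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
        return result.returncode != 0
    except (subprocess.TimeoutExpired, OSError):
        return True


def due_checks(checks, now):
    """Yield the checks that are due at time `now` and reschedule them."""
    for check in checks:
        if check.done or now < check.next_run:
            continue
        if check.interval_s is None:
            check.done = True             # run-once checks never come back
        else:
            check.next_run = now + check.interval_s
        yield check


# Usage: one boot-time check that runs once, one periodic check.
checks = [
    ScriptCheck("PLATFORM_INIT_FAILED",
                [sys.executable, "-c", "raise SystemExit(1)"], interval_s=None),
    ScriptCheck("DISK_USAGE_HIGH",
                [sys.executable, "-c", "raise SystemExit(0)"]),
]
first = {c.alarm_id: run_check(c) for c in due_checks(checks, now=0.0)}
second = [c.alarm_id for c in due_checks(checks, now=60.0)]
```

In this sketch the boot-time check fires in the first pass only, while the periodic check is due again 60 seconds later. Whether a run-once alarm should stay raised until the daemon restarts is a separate lifecycle question that the sketch does not answer.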

Comment thread doc/alarmd/alarmd.md
@bgallagher-nexthop
Contributor

A fairly similar daemon, healthd, already exists. It aggregates health information from multiple sources, reports it to Redis, and provides the `show system-health` CLI for reporting overall system health.

healthd HLD: https://github.com/sonic-net/SONiC/blob/master/doc/system_health_monitoring/system-health-HLD.md

Have you considered whether this daemon could be extended to aggregate and report hardware alarms as opposed to introducing a new daemon?

Comment thread doc/alarmd/alarmd.md
Comment thread doc/alarmd/alarmd.md
Comment thread doc/alarmd/alarmd.md Outdated
no alarm), but the version mismatch warning tells the operator why.

If `sonic_version` is omitted from the alarm_defs, the check is skipped.
If `/etc/sonic/sonic_version.yml` is unreadable, the check is skipped.
Contributor

Just thinking out loud: one could make the argument that not being able to read /etc/sonic/sonic_version.yml would break a lot of things and should probably be a default alarm.
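
The two skip rules quoted above, and the alternative of treating an unreadable file as a fault, can be sketched as a small classifier. This is only an illustration under stated assumptions: the function names and the four outcome strings are hypothetical, the comparison is assumed to be against the `build_version` key of the version file, and the file is parsed by hand purely to keep the sketch dependency-free.

```python
from typing import Optional

VERSION_FILE = "/etc/sonic/sonic_version.yml"


def read_build_version(path: str = VERSION_FILE) -> Optional[str]:
    """Return the build_version value, or None if the file cannot be read.

    Assumption: the file holds simple "key: 'value'" lines, so a full YAML
    parser is not needed for this sketch.
    """
    try:
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                key, sep, value = line.partition(":")
                if sep and key.strip() == "build_version":
                    return value.strip().strip("'\"")
    except OSError:
        return None
    return None


def version_check(alarm_defs: dict, running: Optional[str]) -> str:
    """Classify the outcome as 'skipped', 'unreadable', 'match' or 'mismatch'."""
    expected = alarm_defs.get("sonic_version")
    if expected is None:
        return "skipped"       # field omitted from alarm_defs: check is skipped
    if running is None:
        return "unreadable"    # the quoted text skips here; a default alarm is the alternative
    return "match" if running == expected else "mismatch"
```

Keeping "unreadable" as its own outcome lets the caller decide between the two policies discussed here: map it to a skip, or map it to a default alarm.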

Comment thread doc/alarmd/alarmd.md
Comment thread doc/alarmd/alarmd.md
Comment on lines +1474 to +1489
### 13.2 System Tests

Functional tests run on a live DUT over SSH, injecting faults via direct
STATE_DB writes and verifying `SYSTEM_ALARMS` response.

| Category | Scenarios |
|----------|-----------|
| Alarm raise / clear | PSU missing, fan under-speed, thermal warning — verify raise within 2 poll intervals, clear on recovery |
| Multi-FRU independence | Fault on PSU 1 does not affect PSU 2 |
| OR-logic | Voltage low / high both raise `PSU_OUTPUT_VOLTAGE_FAULT`; clears only when in range |
| Script checks | Inject fault script → alarm raised; restore → cleared |
| SIGHUP reload | Valid reload → new thresholds take effect; invalid JSON → old config preserved |
| Daemon restart | Stop/start with active fault → alarms cleared then re-raised within one poll |
| Warmboot / fastboot | `clear_on_startup()` removes stale alarms; faults re-raised; no boot delay |
| CLI | `show alarms`, `show alarms --summary`, `show alarms --json`, filter options (`-s`, `-c`, `-o`, `-g`) output verification |
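
As an illustration of the OR-logic and multi-FRU rows, the following sketch runs the inject, poll, verify sequence against plain dictionaries standing in for STATE_DB and `SYSTEM_ALARMS`. It does not use alarmd or a real Redis instance. The thresholds, the `MAJOR` severity, the alarm key layout, and the helper names are assumptions for illustration; only the `PSU_OUTPUT_VOLTAGE_FAULT` alarm ID comes from the table.

```python
def voltage_fault(fields, low=11.0, high=13.0):
    """OR-logic: fault if the voltage is below `low` OR above `high`.

    The thresholds are hypothetical; real values would come from the
    alarm definition file.
    """
    try:
        voltage = float(fields["voltage"])
    except (KeyError, ValueError):
        return False  # missing or unparseable data is not a fault in this sketch
    return voltage < low or voltage > high


def poll_once(state_db, system_alarms):
    """Evaluate each PSU_INFO entry and raise or clear its alarm independently."""
    for key, fields in state_db.items():
        if not key.startswith("PSU_INFO|"):
            continue
        alarm_key = "PSU_OUTPUT_VOLTAGE_FAULT|" + key.split("|", 1)[1]
        if voltage_fault(fields):
            system_alarms.setdefault(alarm_key, {"severity": "MAJOR"})
        else:
            system_alarms.pop(alarm_key, None)


# Stand-ins for STATE_DB and SYSTEM_ALARMS.
state_db = {
    "PSU_INFO|PSU 1": {"voltage": "12.1"},
    "PSU_INFO|PSU 2": {"voltage": "12.0"},
}
alarms = {}

state_db["PSU_INFO|PSU 1"]["voltage"] = "9.5"    # inject a low-voltage fault
poll_once(state_db, alarms)
raised_low = sorted(alarms)

state_db["PSU_INFO|PSU 1"]["voltage"] = "14.2"   # switch to a high-voltage fault
poll_once(state_db, alarms)
raised_high = sorted(alarms)

state_db["PSU_INFO|PSU 1"]["voltage"] = "12.1"   # recover into range
poll_once(state_db, alarms)
cleared = sorted(alarms)
```

Both the low and the high injection leave exactly one alarm, for PSU 1 only, and the alarm disappears once the voltage is back in range. A system test would replace the dictionaries with writes to the DUT's STATE_DB and reads from `SYSTEM_ALARMS`.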

Contributor

sonic-mgmtify these tests

Comment thread doc/alarmd/alarmd.md

### 14.2 Alarm History Table

**Motivation**: Operators and NMS/gNMI collectors may want to query historical
Contributor

If we stream the SYSTEM_ALARMS table out, wouldn't collectors have per-switch alarm history anyway?

Comment thread doc/alarmd/alarmd.md
and checks are added to the common baseline, and an optional `disable` list
suppresses any unwanted common checks.
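
A minimal sketch of how such a merge could work is shown here. Only the `disable` list is named in the text above; the `alarms` key, the alarm IDs, and the function name are hypothetical, and the ordering (apply `disable` to the common baseline first, then add the platform entries) is one possible design choice rather than the documented behaviour.

```python
def merge_alarm_defs(common: dict, platform: dict) -> dict:
    """Return the effective alarm set for a platform.

    Assumed layout: each file maps alarm IDs to definitions under "alarms",
    and the platform file may carry an optional "disable" list of alarm IDs.
    """
    disabled = set(platform.get("disable", []))
    # Start from the common baseline, minus anything the platform disabled.
    merged = {alarm_id: definition
              for alarm_id, definition in common.get("alarms", {}).items()
              if alarm_id not in disabled}
    # Platform entries are added on top of the (filtered) baseline.
    merged.update(platform.get("alarms", {}))
    return merged


common_defs = {"alarms": {"PSU_MISSING": {"severity": "CRITICAL"},
                          "FAN_UNDER_SPEED": {"severity": "MAJOR"}}}
platform_defs = {"alarms": {"FPGA_FAULT": {"severity": "MAJOR"}},
                 "disable": ["FAN_UNDER_SPEED"]}

effective = merge_alarm_defs(common_defs, platform_defs)
```

Applying `disable` before adding the platform entries means a platform can suppress a common check and still define its own check under the same ID.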

### Relationship to existing SONiC components
Contributor

Hi @bgallagher-nexthop: you're correct that healthd and alarmd overlap (PSU/FAN_INFO polling, for instance). We did consider extending healthd, but ran into the following concerns:

healthd's hardware checks are hard-coded in hardware_checker.py. Adding data-driven, per-field checks would require modifying the healthd script itself, as there is no equivalent to alarmd's alarm_defs.json.

SYSTEM_HEALTH_INFO is a flat, schemaless table with no severity, alarm_id, category, or time_created fields. Augmenting these fields onto it could disrupt existing consumers of that table.

Taken together, addressing both would mean redesigning healthd's check logic and output schema, at which point we'd effectively be writing alarmd inside healthd.

We'll add a section to the HLD clarifying the intended scope boundary between the two daemons.

Signed-off-by: Neel Datta <neel.datta@hpe.com>

@venkatmahalingam
Collaborator

venkatmahalingam left a comment

New design should be aligned to the existing alarm design.
