alarmd — Alarm Framework for SONiC [HLD] by neelDatta17 · Pull Request #2314 · sonic-net/SONiC

neelDatta17 · 2026-04-28T08:17:45Z

This PR adds the high-level design (HLD) document for alarmd, a new daemon that provides centralized, structured hardware alarm detection for SONiC. It consumes existing STATE_DB data (with zero changes to existing daemons), evaluates fault conditions defined entirely in JSON config, and writes unified alarm state to a SYSTEM_ALARMS table with severity, category, and lifecycle tracking.

Signed-off-by: Neel Datta <neel.datta@hpe.com>
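To make the "fault conditions defined entirely in JSON config" idea concrete, here is a minimal sketch of how one data-driven definition might be evaluated against a STATE_DB-style row. The definition layout (`any_of`, `field`, `op`, `value`), the operator names, the `MAJOR` severity, and the voltage thresholds are all assumptions for illustration; the HLD's actual alarm_defs schema may differ. Only the alarm name `PSU_OUTPUT_VOLTAGE_FAULT` comes from the HLD's test table.

```python
import operator

# Hypothetical operator names; not taken from the HLD.
OPS = {"eq": operator.eq, "ne": operator.ne, "lt": operator.lt, "gt": operator.gt}


def condition_met(cond, row):
    """Return True if one field condition holds for a STATE_DB-style row.

    Redis hash values arrive as strings, so numeric thresholds trigger a
    float conversion of the stored value before comparing.
    """
    raw = row.get(cond["field"])
    if raw is None:
        return False  # missing field: treated as "no fault" in this sketch
    value = cond["value"]
    if isinstance(value, (int, float)):
        try:
            raw = float(raw)
        except ValueError:
            return False  # e.g. "N/A": cannot compare, so no fault
    return OPS[cond["op"]](raw, value)


def evaluate(alarm_def, row):
    """OR-logic: the alarm is active if any listed condition is met."""
    return any(condition_met(c, row) for c in alarm_def["any_of"])


# Example definition as it might appear after json.load(); values are made up.
PSU_VOLTAGE = {
    "alarm_id": "PSU_OUTPUT_VOLTAGE_FAULT",
    "severity": "MAJOR",
    "category": "PSU",
    "any_of": [
        {"field": "voltage", "op": "lt", "value": 11.0},
        {"field": "voltage", "op": "gt", "value": 13.0},
    ],
}
```

With this shape, both a low and a high reading activate the same alarm, and it stays inactive only while the value is in range.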

mssonicbld · 2026-04-28T08:17:53Z

/azp run

azure-pipelines · 2026-04-28T08:17:59Z

No pipelines are associated with this pull request.

ashwnsri · 2026-04-28T18:08:01Z

+- **System resources**: CPU, memory, disk usage (similar to monit, but
+  integrated into SYSTEM_ALARMS)
+- **Container health**: Check if critical Docker containers are running
+- **Platform-specific hardware**: FPGA status, I2C bus health, GPIO state,


Does it have to be either a bash or python script? Could it be a binary?

neelDatta17 · 2026-05-05

I'll look into testing this. If it's feasible, I'll update the HLD accordingly; I may also create a separate design doc for the script runner itself.

ashwnsri · 2026-04-28T18:10:36Z

+### 5.1 Functional Requirements
+
+- alarmd shall poll STATE_DB tables at a configurable interval (default 3 seconds) and evaluate field conditions defined in alarm definition files.
+- alarmd shall execute external health-check scripts at a configurable interval (default 60 seconds) and evaluate exit codes.


Does it have to be a script that runs on a per-interval basis or could I run a health check script just once? For instance, what if I want to check the status of a platform init using a script during boot? I don't suppose there would be a need to run that script repeatedly, would there?
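One way the two questions above (script vs. binary, periodic vs. run-once) could be handled is sketched below. It is not the HLD's script runner: the `ScriptCheck` class, its `once` flag, and the "non-zero exit code means fault" convention are assumptions. Because `subprocess.run` executes any executable given as an argument list, the same code path would cover bash scripts, Python scripts, and compiled binaries.

```python
import subprocess
import sys


def run_check(argv, timeout=10):
    """Run one health check; return True if it reports a fault.

    Assumption: exit code 0 means healthy, anything else means fault.
    """
    try:
        result = subprocess.run(
            argv,
            timeout=timeout,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except (OSError, subprocess.TimeoutExpired):
        return True  # not runnable or hung: treated as a fault in this sketch
    return result.returncode != 0


class ScriptCheck:
    """Hypothetical scheduler entry for one external check."""

    def __init__(self, argv, interval=60, once=False):
        self.argv = argv
        self.interval = interval  # seconds between runs (HLD default: 60)
        self.once = once          # run a single time, e.g. a boot-time check
        self.last_run = None
        self.fault = False

    def due(self, now):
        if self.last_run is None:
            return True
        if self.once:
            return False
        return now - self.last_run >= self.interval

    def poll(self, now):
        """Run the check if it is due; return the latest known fault state."""
        if self.due(now):
            self.fault = run_check(self.argv)
            self.last_run = now
        return self.fault
```

A platform-init check would be registered with `once=True`, so its result is latched after the first run rather than re-evaluated every interval.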

bgallagher-nexthop · 2026-04-30T18:02:59Z

There is already a relatively similar daemon, healthd, which aggregates health information from multiple sources and reports it to Redis; its CLI command, `show system-health`, reports overall system health.

healthd HLD: https://github.com/sonic-net/SONiC/blob/master/doc/system_health_monitoring/system-health-HLD.md

Have you considered whether this daemon could be extended to aggregate and report hardware alarms as opposed to introducing a new daemon?

ashwnsri · 2026-04-29T18:14:35Z

+  no alarm), but the version mismatch warning tells the operator why.
+
+If `sonic_version` is omitted from the alarm_defs, the check is skipped.
+If `/etc/sonic/sonic_version.yml` is unreadable, the check is skipped.


Just thinking out loud: one could make the argument that not being able to read /etc/sonic/sonic_version.yml would break a lot of things and should probably be a default alarm.
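The difference between "skip silently" and "raise a default alarm" can be seen in a small sketch that keeps the two skip cases distinguishable. The function names and the returned status strings are hypothetical, and the file is parsed with a simple `key: value` scan instead of a YAML library to keep the example dependency-free; it assumes the file carries a `build_version` entry.

```python
def read_build_version(path):
    """Return build_version from a sonic_version.yml-style file, or None."""
    try:
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                key, sep, value = line.partition(":")
                if sep and key.strip() == "build_version":
                    return value.strip().strip("'\"")
    except OSError:
        return None
    return None


def check_version(expected, path="/etc/sonic/sonic_version.yml"):
    """Classify the version check instead of collapsing both skip cases.

    Returns one of: "skipped", "unreadable", "match", "mismatch".
    """
    if expected is None:
        return "skipped"      # sonic_version omitted from alarm_defs
    actual = read_build_version(path)
    if actual is None:
        return "unreadable"   # candidate for a default alarm, per the comment above
    return "match" if actual == expected else "mismatch"
```

Returning `"unreadable"` as its own status leaves the policy decision (skip, warn, or raise an alarm) to the caller.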

ashwnsri · 2026-04-30T18:33:40Z

+### 13.2 System Tests
+
+Functional tests run on a live DUT over SSH, injecting faults via direct
+STATE_DB writes and verifying `SYSTEM_ALARMS` response.
+
+| Category | Scenarios |
+|----------|-----------|
+| Alarm raise / clear | PSU missing, fan under-speed, thermal warning — verify raise within 2 poll intervals, clear on recovery |
+| Multi-FRU independence | Fault on PSU 1 does not affect PSU 2 |
+| OR-logic | Voltage low / high both raise `PSU_OUTPUT_VOLTAGE_FAULT`; clears only when in range |
+| Script checks | Inject fault script → alarm raised; restore → cleared |
+| SIGHUP reload | Valid reload → new thresholds take effect; invalid JSON → old config preserved |
+| Daemon restart | Stop/start with active fault → alarms cleared then re-raised within one poll |
+| Warmboot / fastboot | `clear_on_startup()` removes stale alarms; faults re-raised; no boot delay |
+| CLI | `show alarms`, `show alarms --summary`, `show alarms --json`, filter options (`-s`, `-c`, `-o`, `-g`) output verification |
+


Please sonic-mgmtify these tests, i.e. implement them as sonic-mgmt test cases.
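Most rows in the table reduce to "inject a fault, then wait a bounded time for SYSTEM_ALARMS to change". A minimal polling helper for that pattern might look like the sketch below. It is a standalone illustration, not the sonic-mgmt utility of the same name; the injectable `clock` and `sleep` parameters are only there so the helper can be exercised without real delays.

```python
import time


def wait_until(predicate, timeout, interval=0.5,
               clock=time.monotonic, sleep=time.sleep):
    """Poll predicate() until it returns True or timeout seconds elapse.

    The predicate is always evaluated at least once, including with
    timeout=0.
    """
    deadline = clock() + timeout
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Intended use in a raise/clear test (alarm_present is a hypothetical
# helper that queries SYSTEM_ALARMS on the DUT):
#
#   POLL_INTERVAL = 3  # seconds, the HLD's default STATE_DB poll interval
#   assert wait_until(lambda: alarm_present("PSU 1"), timeout=2 * POLL_INTERVAL)
```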

ashwnsri · 2026-04-30T18:35:08Z

+
+### 14.2 Alarm History Table
+
+**Motivation**: Operators and NMS/gNMI collectors may want to query historical


If we stream the SYSTEM_ALARMS table out, wouldn't collectors have per-switch alarm history anyway?
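As a rough illustration of this point, a collector that receives set/delete notifications for SYSTEM_ALARMS keys could rebuild raise/clear intervals on its own side. The event tuple layout `(timestamp, op, key)` and the key names are assumptions, not a gNMI or Redis keyspace format.

```python
def build_history(events):
    """Turn a time-ordered stream of (timestamp, op, key) into intervals.

    op is "SET" when an alarm key is written and "DEL" when it is removed.
    Repeated SETs for an already-active alarm do not restart the interval.
    """
    active = {}   # key -> timestamp of first SET
    history = []
    for ts, op, key in events:
        if op == "SET" and key not in active:
            active[key] = ts
        elif op == "DEL" and key in active:
            history.append({"alarm": key, "raised": active.pop(key), "cleared": ts})
    # Alarms still active at the end of the stream have no clear time yet.
    for key, ts in active.items():
        history.append({"alarm": key, "raised": ts, "cleared": None})
    return history
```

A sketch like this only sees transitions that occur while the collector is subscribed, which may be relevant when weighing it against an on-box history table.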

ashwnsri · 2026-04-30T19:23:42Z

+and checks are added to the common baseline, and an optional `disable` list
+suppresses any unwanted common checks.
+
+### Relationship to existing SONiC components


Hi @bgallagher-nexthop: you're correct that healthd and alarmd overlap (PSU/FAN_INFO polling, for instance). We did consider extending healthd, but ran into the following concerns:

- healthd's hardware checks are hard-coded in hardware_checker.py. Adding data-driven, per-field checks would require modifying the healthd script, as there is no equivalent to alarmd's alarm_defs.json.
- SYSTEM_HEALTH_INFO is a flat, schemaless table with no severity, alarm_id, category, or time_created fields. Adding these fields to it could disrupt existing consumers of that table.

Taken together, addressing both would mean redesigning healthd's check logic and output schema, at which point we'd effectively be writing alarmd inside healthd.

We'll add a section to the HLD clarifying the intended scope boundary between the two daemons.
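The schema difference described above can be pictured as a typed record that maps onto a Redis hash. This is only a sketch: the four field names come from the comment above, while the `AlarmRecord` class, the key layout, and the sample values are hypothetical and may not match the HLD's SYSTEM_ALARMS definition.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AlarmRecord:
    """Structured alarm entry; all values are strings, as in a Redis hash."""
    alarm_id: str
    severity: str
    category: str
    time_created: str

    def key(self, source):
        # Hypothetical key layout: table | source object | alarm id.
        return f"SYSTEM_ALARMS|{source}|{self.alarm_id}"

    def to_hash(self):
        """Return the field/value mapping to store under key()."""
        return asdict(self)


record = AlarmRecord(
    alarm_id="PSU_OUTPUT_VOLTAGE_FAULT",
    severity="MAJOR",                      # made-up severity level
    category="PSU",
    time_created="2026-04-30T19:23:42Z",   # made-up timestamp
)
```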

mssonicbld · 2026-05-05T21:42:00Z

/azp run

azure-pipelines · 2026-05-05T21:42:07Z

No pipelines are associated with this pull request.

mssonicbld · 2026-05-05T21:43:32Z

/azp run

azure-pipelines · 2026-05-05T21:43:38Z

No pipelines are associated with this pull request.

Signed-off-by: Neel Datta <neel.datta@hpe.com>

mssonicbld · 2026-05-05T22:58:01Z

/azp run

azure-pipelines · 2026-05-05T22:58:07Z

No pipelines are associated with this pull request.

venkatmahalingam · 2026-05-15T05:30:38Z

@neelDatta17 Can we align to the existing Alarm design in the community?
https://github.com/sonic-net/SONiC/blob/master/doc/event-alarm-framework/event-alarm-framework.md
sonic-net/sonic-buildimage#22617

venkatmahalingam

New design should be aligned to the existing alarm design.

neelDatta17 added 2 commits April 27, 2026 17:22

Initial commit for alarmd hld.

43f1ae2

Signed-off-by: Neel Datta <neel.datta@hpe.com>

Committing the full alarmd HLD

ae7b1ff

Signed-off-by: Neel Datta <neel.datta@hpe.com>

neelDatta17 marked this pull request as draft April 28, 2026 08:20

ashwnsri requested changes Apr 29, 2026

ashwnsri requested changes Apr 30, 2026

ashwnsri reviewed Apr 30, 2026

Addressed review comments and added png diagrams.

9c09a13

Signed-off-by: Neel Datta <neel.datta@hpe.com>

neelDatta17 force-pushed the alarmd_hld branch from b388a14 to 9c09a13 Compare May 5, 2026 22:57

venkatmahalingam requested changes May 15, 2026
