alarmd — Alarm Framework for SONiC [HLD]#2314
Signed-off-by: Neel Datta <neel.datta@hpe.com>
/azp run
No pipelines are associated with this pull request.
- **System resources**: CPU, memory, disk usage (similar to monit, but
  integrated into SYSTEM_ALARMS)
- **Container health**: Check if critical Docker containers are running
- **Platform-specific hardware**: FPGA status, I2C bus health, GPIO state,
Does it have to be either a bash or python script? Could it be a binary?
Will look into testing this; if feasible, I'll update the HLD accordingly. I may create a separate design doc for the script runner itself.
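For context on the script-versus-binary question, here is a minimal sketch of a runner that only looks at the exit status, so the file type of the check does not matter to it. This is not alarmd's actual runner: the function name `run_health_check`, the timeout value, and the mapping (exit code 0 means healthy, anything else means fault, launch failures and timeouts are reported as "error") are all assumptions for illustration.

```python
import subprocess
import sys


def run_health_check(argv, timeout_s=10.0):
    """Run one health-check executable and classify its exit status.

    Assumed convention (not confirmed by the HLD): 0 -> "ok",
    non-zero -> "fault", launch failure or timeout -> "error".
    """
    try:
        proc = subprocess.run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            timeout=timeout_s,
            check=False,  # a non-zero exit is a result here, not an exception
        )
    except subprocess.TimeoutExpired:
        return "error"
    except OSError:
        # Missing file, no execute permission, bad interpreter line, etc.
        return "error"
    return "ok" if proc.returncode == 0 else "fault"


if __name__ == "__main__":
    # Any executable works the same way; the Python interpreter stands in
    # for a compiled check here.
    print(run_health_check([sys.executable, "-c", "raise SystemExit(3)"]))
```

Because `subprocess.run` executes whatever `argv[0]` points at, a bash script, a Python script, and a compiled binary would all be handled identically under this approach.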
### 5.1 Functional Requirements

- alarmd shall poll STATE_DB tables at a configurable interval (default 3 seconds) and evaluate field conditions defined in alarm definition files.
- alarmd shall execute external health-check scripts at a configurable interval (default 60 seconds) and evaluate exit codes.
Does it have to be a script that runs on a per-interval basis or could I run a health check script just once? For instance, what if I want to check the status of a platform init using a script during boot? I don't suppose there would be a need to run that script repeatedly, would there?
There already exists a relatively similar daemon called
Have you considered whether this daemon could be extended to aggregate and report hardware alarms as opposed to introducing a new daemon?
no alarm), but the version mismatch warning tells the operator why.

If `sonic_version` is omitted from the alarm_defs, the check is skipped.
If `/etc/sonic/sonic_version.yml` is unreadable, the check is skipped.
Just thinking out loud: one could make the argument that not being able to read /etc/sonic/sonic_version.yml would break a lot of things and should probably be a default alarm.
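The two skip rules quoted in this thread can be sketched as follows. This is a hedged illustration rather than the HLD's code: the function names, the returned strings, the line-based parsing, and the use of a `build_version` key inside the version file are assumptions, and treating a file without that key as "skipped" is also an assumption.

```python
def read_build_version(path):
    """Return the version string from the file, or None if it cannot be read.

    Assumption: the file has a line of the form  build_version: '<value>'.
    A naive line-based parse is used to avoid a YAML dependency.
    """
    try:
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                key, sep, value = line.partition(":")
                if sep and key.strip() == "build_version":
                    return value.strip().strip("'\"")
    except OSError:
        return None
    return None


def check_version(alarm_defs, version_path="/etc/sonic/sonic_version.yml"):
    """Return "skipped", "match", or "mismatch" (hypothetical result names)."""
    expected = alarm_defs.get("sonic_version")
    if expected is None:
        return "skipped"  # sonic_version omitted from the alarm_defs
    running = read_build_version(version_path)
    if running is None:
        return "skipped"  # version file unreadable
    return "match" if running == expected else "mismatch"
```

Under the suggestion in the comment above, the second "skipped" branch is where a default alarm could be raised instead of skipping silently.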
### 13.2 System Tests

Functional tests run on a live DUT over SSH, injecting faults via direct
STATE_DB writes and verifying `SYSTEM_ALARMS` response.

| Category | Scenarios |
|----------|-----------|
| Alarm raise / clear | PSU missing, fan under-speed, thermal warning: verify raise within 2 poll intervals, clear on recovery |
| Multi-FRU independence | Fault on PSU 1 does not affect PSU 2 |
| OR-logic | Voltage low / high both raise `PSU_OUTPUT_VOLTAGE_FAULT`; clears only when in range |
| Script checks | Inject fault script → alarm raised; restore → cleared |
| SIGHUP reload | Valid reload → new thresholds take effect; invalid JSON → old config preserved |
| Daemon restart | Stop/start with active fault → alarms cleared then re-raised within one poll |
| Warmboot / fastboot | `clear_on_startup()` removes stale alarms; faults re-raised; no boot delay |
| CLI | `show alarms`, `show alarms --summary`, `show alarms --json`, filter options (`-s`, `-c`, `-o`, `-g`) output verification |
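The OR-logic row can be illustrated with a small evaluator. This is a sketch only: the definition layout (`field`, `op`, `limit`), the operator names, the `voltage` field name, and the threshold values are hypothetical and not taken from the HLD's alarm definition schema. STATE_DB values are handled as strings because they come from Redis hashes.

```python
# Comparison operators a definition may name; these names are illustrative.
OPS = {
    "lt": lambda value, limit: value < limit,
    "gt": lambda value, limit: value > limit,
}


def any_condition_met(entry, conditions):
    """OR-logic: True as soon as one condition matches the entry."""
    for cond in conditions:
        raw = entry.get(cond["field"])
        try:
            value = float(raw)
        except (TypeError, ValueError):
            continue  # missing or non-numeric field: this condition cannot match
        if OPS[cond["op"]](value, cond["limit"]):
            return True
    return False


# Hypothetical conditions behind PSU_OUTPUT_VOLTAGE_FAULT: low OR high.
VOLTAGE_FAULT = [
    {"field": "voltage", "op": "lt", "limit": 11.4},
    {"field": "voltage", "op": "gt", "limit": 12.6},
]
```

With these assumed thresholds, an entry of `{"voltage": "11.0"}` or `{"voltage": "13.0"}` evaluates as faulted, and `{"voltage": "12.0"}` does not, which matches the "clears only when in range" expectation in the table.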
sonic-mgmtify these tests
### 14.2 Alarm History Table

**Motivation**: Operators and NMS/gNMI collectors may want to query historical
If we stream the SYSTEM_ALARMS table out, wouldn't collectors have per-switch alarm history anyway?
and checks are added to the common baseline, and an optional `disable` list
suppresses any unwanted common checks.

### Relationship to existing SONiC components
Hi @bgallagher-nexthop : you're correct that healthd and alarmd overlap (PSU/FAN_INFO polling, for instance). We did consider extending healthd, but ran into the following concerns:
- healthd's hardware checks are hard-coded in hardware_checker.py. Adding data-driven, per-field checks requires modifying the healthd script, as there's no equivalent to alarmd's alarm_defs.json.
- SYSTEM_HEALTH_INFO is a flat, schemaless table with no severity, alarm_id, category, or time_created fields. Adding these fields to it could disrupt existing consumers of that table.
Taken together, addressing both would mean redesigning healthd's check logic and output schema, at which point we'd effectively be writing alarmd inside healthd.
We'll add a section to the HLD clarifying the intended scope boundary between the two daemons.
@neelDatta17 Can we align to the existing Alarm design in the community?
venkatmahalingam left a comment
New design should be aligned to the existing alarm design.
High-level design document for alarmd, a new daemon that provides centralized, structured hardware alarm detection for SONiC. alarmd consumes existing STATE_DB data (requiring zero changes to existing daemons), evaluates fault conditions defined entirely in JSON config, and writes unified alarm state to a SYSTEM_ALARMS table with severity, category, and lifecycle tracking.
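As a rough illustration of the lifecycle tracking mentioned in the description, the sketch below keeps alarm state in a plain dict standing in for the SYSTEM_ALARMS table. The field names `alarm_id`, `severity`, `category`, and `time_created` come from this conversation; the key format, the function name `sync_alarm`, the returned strings, and the choice to delete an entry on clear are assumptions.

```python
import time


def sync_alarm(table, alarm_id, obj, faulted, severity, category, now=None):
    """Raise or clear one alarm in `table` (a dict standing in for SYSTEM_ALARMS).

    Returns "raised", "active", "cleared", or "idle" (hypothetical names).
    """
    key = f"{alarm_id}|{obj}"  # assumed key layout: one entry per alarm and object
    if faulted:
        if key in table:
            return "active"  # already raised: keep the original time_created
        stamp = time.time() if now is None else now
        table[key] = {
            "alarm_id": alarm_id,
            "severity": severity,
            "category": category,
            "time_created": str(stamp),
        }
        return "raised"
    if table.pop(key, None) is not None:
        return "cleared"
    return "idle"
```

Keying entries by both alarm and object keeps faults on different FRUs independent: raising `PSU_MISSING` for "PSU 1" leaves any entry for "PSU 2" untouched.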