Skip to content

Commit 71de97f

Browse files
pandafynemesifier
andauthored
[change] Consider multiple metrics before marking as critical #566
- Added data collected check which checks whether monitoring data is being received, flagged as critical metric - Set status to critical only when all the critical metrics are unhealthy: By default both ping and data collected checks must be failing for devices to be flagged as critical, otherwise they will be flagged as problem - This change also avoids the recovery detection mechanism to be triggered unnecessarily - The recovery detection mechanism is not triggered anymore by checksum requests Closes #566 --------- Co-authored-by: Federico Capoano <f.capoano@openwisp.io>
1 parent c1f87fd commit 71de97f

30 files changed

+745
-376
lines changed

docs/user/checks.rst

Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -35,6 +35,16 @@ configuration status of a device changes, this ensures the check reacts
3535
quickly to events happening in the network and informs the user promptly
3636
if there's anything that is not working as intended.
3737

38+
.. _monitoring_data_collected_check:
39+
40+
Monitoring Data Collected
41+
-------------------------
42+
43+
This check ensures that the server is receiving metrics from network
44+
devices in a timely manner. You may choose to disable auto creation of
45+
this check by using the setting
46+
:ref:`openwisp_monitoring_auto_data_collected_check`.
47+
3848
.. _iperf3_check:
3949

4050
Iperf3

docs/user/device-health-status.rst

Lines changed: 7 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -25,17 +25,18 @@ One of the metrics has a value which is not in the expected range (the
2525
threshold value set in the alert settings has been crossed).
2626

2727
Example: CPU usage should be less than 90% but current value is at 95%.
28+
Or, the device is not reachable by ping check (ping is a critical metric),
29+
but the device is sending monitoring metrics.
2830

2931
``CRITICAL``
3032
------------
3133

32-
One of the metrics defined in
33-
``OPENWISP_MONITORING_CRITICAL_DEVICE_METRICS`` has a value which is not
34-
in the expected range (the threshold value set in the alert settings has
35-
been crossed).
34+
All of the metrics defined in
35+
:ref:`openwisp_monitoring_critical_device_metrics` have values outside the
36+
expected range.
3637

37-
Example: ping is by default a critical metric which is expected to be
38-
always 1 (reachable).
38+
Example: Both :ref:`ping_check` and :ref:`monitoring_data_collected_check`
39+
are failing for the device.
3940

4041
``DEACTIVATED``
4142
---------------

docs/user/settings.rst

Lines changed: 38 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -619,19 +619,20 @@ Automatically created charts.
619619
``OPENWISP_MONITORING_CRITICAL_DEVICE_METRICS``
620620
-----------------------------------------------
621621

622-
============ ================================================
622+
============ ==================================================================
623623
**type**: ``list`` of ``dict`` objects
624-
**default**: ``[{'key': 'ping', 'field_name': 'reachable'}]``
625-
============ ================================================
624+
**default**: .. code-block::
626625

627-
Device metrics that are considered critical:
626+
[
627+
{"key": "ping", "field_name": "reachable"},
628+
{"key": "data_collected", "field_name": "data_collected"},
629+
]
630+
============ ==================================================================
628631

629-
when a value crosses the boundary defined in the "threshold value" field
630-
of the alert settings related to one of these metric types, the health
631-
status of the device related to the metric moves into ``CRITICAL``.
632+
Device metrics that are considered critical.
632633

633-
By default, if devices are not reachable by pings they are flagged as
634-
``CRITICAL``.
634+
When all critical metrics have values outside their expected range, the
635+
device's health status changes to ``CRITICAL``.
635636

636637
.. _openwisp_monitoring_health_status_labels:
637638

@@ -754,6 +755,34 @@ This setting allows you to configure the config check interval used by
754755
:ref:`config_applied <config_applied_check>`. By default it is set to 5
755756
minutes.
756757

758+
.. _openwisp_monitoring_auto_data_collected_check:
759+
760+
``OPENWISP_MONITORING_AUTO_DATA_COLLECTED_CHECK``
761+
-------------------------------------------------
762+
763+
============ ========
764+
**type**: ``bool``
765+
**default**: ``True``
766+
============ ========
767+
768+
This setting allows you to choose whether :ref:`monitoring data collected
769+
<monitoring_data_collected_check>` checks should be created automatically
770+
for newly registered devices. It's enabled by default.
771+
772+
.. _openwisp_monitoring_data_collected_check_interval:
773+
774+
``OPENWISP_MONITORING_DATA_COLLECTED_CHECK_INTERVAL``
775+
-----------------------------------------------------
776+
777+
============ =======
778+
**type**: ``int``
779+
**default**: ``60``
780+
============ =======
781+
782+
This setting allows you to configure the data collected check interval
783+
used by :ref:`data collected <monitoring_data_collected_check>`. By
784+
default it is set to 60 minutes.
785+
757786
.. _openwisp_monitoring_auto_wifi_clients_check:
758787

759788
``OPENWISP_MONITORING_AUTO_WIFI_CLIENTS_CHECK``

openwisp_monitoring/check/apps.py

Lines changed: 6 additions & 34 deletions
Original file line numberDiff line numberDiff line change
@@ -4,7 +4,6 @@
44
from swapper import load_model
55

66
from openwisp_monitoring.check import checks # noqa
7-
from openwisp_monitoring.check import settings as app_settings
87

98

109
class CheckConfig(AppConfig):
@@ -16,37 +15,10 @@ def ready(self):
1615
self._connect_signals()
1716

1817
def _connect_signals(self):
19-
if app_settings.AUTO_PING:
20-
from .base.models import auto_ping_receiver
18+
Check = load_model('check', 'Check')
2119

22-
post_save.connect(
23-
auto_ping_receiver,
24-
sender=load_model('config', 'Device'),
25-
dispatch_uid='auto_ping',
26-
)
27-
28-
if app_settings.AUTO_CONFIG_CHECK:
29-
from .base.models import auto_config_check_receiver
30-
31-
post_save.connect(
32-
auto_config_check_receiver,
33-
sender=load_model('config', 'Device'),
34-
dispatch_uid='auto_config_check',
35-
)
36-
if app_settings.AUTO_IPERF3:
37-
from .base.models import auto_iperf3_check_receiver
38-
39-
post_save.connect(
40-
auto_iperf3_check_receiver,
41-
sender=load_model('config', 'Device'),
42-
dispatch_uid='auto_iperf3_check',
43-
)
44-
45-
if app_settings.AUTO_WIFI_CLIENTS_CHECK:
46-
from .base.models import auto_wifi_clients_check_receiver
47-
48-
post_save.connect(
49-
auto_wifi_clients_check_receiver,
50-
sender=load_model('config', 'Device'),
51-
dispatch_uid='auto_wifi_clients_check',
52-
)
20+
post_save.connect(
21+
Check.auto_create_check_receiver,
22+
sender=load_model('config', 'Device'),
23+
dispatch_uid='auto_create_check',
24+
)

openwisp_monitoring/check/base/models.py

Lines changed: 48 additions & 73 deletions
Original file line numberDiff line numberDiff line change
@@ -9,12 +9,7 @@
99
from jsonfield import JSONField
1010

1111
from openwisp_monitoring.check import settings as app_settings
12-
from openwisp_monitoring.check.tasks import (
13-
auto_create_config_check,
14-
auto_create_iperf3_check,
15-
auto_create_ping,
16-
auto_create_wifi_clients_check,
17-
)
12+
from openwisp_monitoring.check.tasks import auto_create_check
1813
from openwisp_utils.base import TimeStampedEditableModel
1914

2015
from ...utils import transaction_on_commit
@@ -108,74 +103,54 @@ def perform_check_delayed(self, duration=0):
108103

109104
perform_check.apply_async(args=[self.id], countdown=duration)
110105

111-
112-
def auto_ping_receiver(sender, instance, created, **kwargs):
113-
"""Implements OPENWISP_MONITORING_AUTO_PING.
114-
115-
The creation step is executed in the background.
116-
"""
117-
# we need to skip this otherwise this task will be executed
118-
# every time the configuration is requested via checksum
119-
if not created:
120-
return
121-
transaction_on_commit(
122-
lambda: auto_create_ping.delay(
123-
model=sender.__name__.lower(),
124-
app_label=sender._meta.app_label,
125-
object_id=str(instance.pk),
106+
@classmethod
107+
def auto_create_check_receiver(cls, created, **kwargs):
108+
if not created:
109+
return
110+
transaction_on_commit(lambda: _auto_check_receiver(created=created, **kwargs))
111+
112+
113+
def _auto_check_receiver(sender, instance, **kwargs):
114+
model = sender.__name__.lower()
115+
app_label = sender._meta.app_label
116+
object_id = str(instance.pk)
117+
if app_settings.AUTO_PING:
118+
auto_create_check.delay(
119+
model=model,
120+
app_label=app_label,
121+
object_id=object_id,
122+
check_type='openwisp_monitoring.check.classes.Ping',
123+
check_name='Ping',
126124
)
127-
)
128-
129-
130-
def auto_config_check_receiver(sender, instance, created, **kwargs):
131-
"""Implements OPENWISP_MONITORING_AUTO_DEVICE_CONFIG_CHECK.
132-
133-
The creation step is executed in the background.
134-
"""
135-
# we need to skip this otherwise this task will be executed
136-
# every time the configuration is requested via checksum
137-
if not created:
138-
return
139-
transaction_on_commit(
140-
lambda: auto_create_config_check.delay(
141-
model=sender.__name__.lower(),
142-
app_label=sender._meta.app_label,
143-
object_id=str(instance.pk),
125+
if app_settings.AUTO_CONFIG_CHECK:
126+
auto_create_check.delay(
127+
model=model,
128+
app_label=app_label,
129+
object_id=object_id,
130+
check_type='openwisp_monitoring.check.classes.ConfigApplied',
131+
check_name='Configuration Applied',
144132
)
145-
)
146-
147-
148-
def auto_iperf3_check_receiver(sender, instance, created, **kwargs):
149-
"""Implements OPENWISP_MONITORING_AUTO_IPERF3.
150-
151-
The creation step is executed in the background.
152-
"""
153-
# we need to skip this otherwise this task will be executed
154-
# every time the configuration is requested via checksum
155-
if not created:
156-
return
157-
transaction_on_commit(
158-
lambda: auto_create_iperf3_check.delay(
159-
model=sender.__name__.lower(),
160-
app_label=sender._meta.app_label,
161-
object_id=str(instance.pk),
133+
if app_settings.AUTO_IPERF3:
134+
auto_create_check.delay(
135+
model=model,
136+
app_label=app_label,
137+
object_id=object_id,
138+
check_type='openwisp_monitoring.check.classes.Iperf3',
139+
check_name='Iperf3',
162140
)
163-
)
164-
165-
166-
def auto_wifi_clients_check_receiver(sender, instance, created, **kwargs):
167-
"""Implements OPENWISP_MONITORING_AUTO_WIFI_CLIENTS_CHECK.
168-
169-
The creation step is executed in the background.
170-
"""
171-
# we need to skip this otherwise this task will be executed
172-
# every time the configuration is requested via checksum
173-
if not created:
174-
return
175-
transaction_on_commit(
176-
lambda: auto_create_wifi_clients_check.delay(
177-
model=sender.__name__.lower(),
178-
app_label=sender._meta.app_label,
179-
object_id=str(instance.pk),
141+
if app_settings.AUTO_WIFI_CLIENTS_CHECK:
142+
auto_create_check.delay(
143+
model=model,
144+
app_label=app_label,
145+
object_id=object_id,
146+
check_type='openwisp_monitoring.check.classes.WifiClients',
147+
check_name='WiFi Clients',
148+
)
149+
if app_settings.AUTO_DATA_COLLECTED_CHECK:
150+
auto_create_check.delay(
151+
model=model,
152+
app_label=app_label,
153+
object_id=object_id,
154+
check_type='openwisp_monitoring.check.classes.DataCollected',
155+
check_name='Monitoring Data Collected',
180156
)
181-
)
Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,5 @@
11
from .config_applied import ConfigApplied # noqa
2+
from .data_collected import DataCollected # noqa
23
from .iperf3 import Iperf3 # noqa
34
from .ping import Ping # noqa
45
from .wifi_clients import WifiClients # noqa

openwisp_monitoring/check/classes/base.py

Lines changed: 13 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -13,6 +13,19 @@ def __init__(self, check, params):
1313
self.related_object = check.content_object
1414
self.params = params
1515

16+
@classmethod
17+
def get_related_metrics(cls):
18+
"""
19+
Returns a tuple of metric names related to this check class.
20+
21+
The default implementation returns a tuple containing the lowercase
22+
name of the class.
23+
24+
Returns:
25+
tuple: A tuple of strings representing metric identifiers
26+
"""
27+
return (cls.__name__.lower(),)
28+
1629
def validate_instance(self):
1730
# check instance is of type device
1831
obj = self.related_object

openwisp_monitoring/check/classes/config_applied.py

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -11,6 +11,10 @@
1111

1212

1313
class ConfigApplied(BaseCheck):
14+
@classmethod
15+
def get_related_metrics(cls):
16+
return ('config_applied',)
17+
1418
def check(self, store=True):
1519
# If the device is down or does not have a config
1620
# do not run config applied check
Lines changed: 64 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,64 @@
1+
from django.utils import timezone
2+
from swapper import load_model
3+
4+
from ...db import timeseries_db
5+
from ...device.utils import SHORT_RP
6+
from .. import settings as app_settings
7+
from .base import BaseCheck
8+
9+
AlertSettings = load_model('monitoring', 'AlertSettings')
10+
Metric = load_model('monitoring', 'Metric')
11+
12+
13+
class DataCollected(BaseCheck):
14+
@classmethod
15+
def get_related_metrics(cls):
16+
return ('data_collected',)
17+
18+
def check(self, store=True):
19+
device_monitoring = self.related_object.monitoring
20+
active_metrics = device_monitoring.get_active_metrics()
21+
passive_metrics = device_monitoring.related_metrics.exclude(
22+
configuration__in=active_metrics
23+
).values_list('key', flat=True)
24+
if passive_metrics:
25+
# Perform a query to the timeseries database to get the last
26+
# written point for every passive metric. This is done to optimize
27+
# the query instead of iterating over each metric and getting
28+
# its last written point individually.
29+
points = timeseries_db.read(
30+
key=','.join(set(passive_metrics)),
31+
fields='*',
32+
tags={
33+
'content_type': self.related_object._meta.label_lower,
34+
'object_id': str(self.related_object.pk),
35+
},
36+
since=(
37+
timezone.localtime()
38+
- timezone.timedelta(
39+
minutes=app_settings.DATA_COLLECTED_CHECK_INTERVAL
40+
)
41+
),
42+
limit=1,
43+
order_by='-time',
44+
)
45+
result = int(len(points) > 0)
46+
else:
47+
result = 0
48+
send_alert = device_monitoring.status != 'critical'
49+
if store:
50+
self._get_metric().write(
51+
result, retention_policy=SHORT_RP, send_alert=send_alert
52+
)
53+
return {'data_collected': result}
54+
55+
def _get_metric(self):
56+
metric, created = self._get_or_create_metric(configuration='data_collected')
57+
if created:
58+
self._create_alert_setting(metric)
59+
return metric
60+
61+
def _create_alert_setting(self, metric):
62+
alert_s = AlertSettings(metric=metric)
63+
alert_s.full_clean()
64+
alert_s.save()

0 commit comments

Comments
 (0)