2024-02-26 WIS2 Monitoring and ET-W2AT Meeting

 Date

Feb 26, 2024 13:00-15:00 UTC

 Participants

ET-W2AT

  • Rémy GIRAUD

  • Jeremy TANDY

Experts

  • Kai Wirt

  • Chems eddine ELGARRAI

WMO Secretariat

  • Hassan Haddouch

  • Maaike Limper

  • Xiaoxia Chen

  • Anna Milan

 Discussion topics

No

Notes

1

Agree on today's agenda

(metrics, roles and responsibilities, WIS2 Node registry)

 

2

Metrics hierarchy https://github.com/wmo-im/wis2-metric-hierarchy/blob/main/metric-hierarchy/gb.csv

  • Global Broker Metrics

    • wmo_wis2_gb_messages_received_total:

(Rémy shows the currently disconnected WIS2 Nodes: Argentina/ar-smn, Cuba/cu-insmet, Indonesia/id-bmkg, Sweden/se-smhi, Trinidad and Tobago/tt-)

To decide: how long a WIS2 Node must be disconnected before the GB raises a local alert (e.g. 10m), together with the alert expression, the labels (severity) and the annotations.

To decide: the workflow of the alerting mechanism, e.g. from local alerting to global alerting.
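As an illustration of the four parts to be decided (duration, expression, labels, annotations), here is a minimal Python sketch. It is not an agreed rule: the alert name, the `warning` severity, the `centre_id` label in the expression and the 5m window are assumptions, and only the 10m duration and the metric name come from the notes.

```python
from datetime import datetime, timedelta, timezone

# Assumed threshold: the notes only suggest "e.g. 10m".
DISCONNECT_AFTER = timedelta(minutes=10)


def disconnected_centres(last_received, now, threshold=DISCONNECT_AFTER):
    """Return the centre-ids with no message received for at least `threshold`."""
    return sorted(
        centre_id
        for centre_id, seen in last_received.items()
        if now - seen >= threshold
    )


def build_alert(centre_id, for_duration="10m"):
    """Build a local alert with the parts named in the notes."""
    return {
        "alert": "WIS2NodeDisconnected",  # hypothetical alert name
        # Hypothetical expression; the label name and 5m window are assumptions.
        "expr": f'increase(wmo_wis2_gb_messages_received_total{{centre_id="{centre_id}"}}[5m]) == 0',
        "for": for_duration,  # how long the condition must hold before firing
        "labels": {"severity": "warning"},  # assumed severity value
        "annotations": {
            "summary": f"No messages received from {centre_id} for {for_duration}"
        },
    }
```

The dictionary keys mirror the fields of a Prometheus alerting rule (`alert`, `expr`, `for`, `labels`, `annotations`), so the values agreed by the group could be dropped in directly.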

Metric names: the 6 metrics are listed at https://github.com/wmo-im/wis2-metric-hierarchy/blob/main/metric-hierarchy/gb.csv

There could be sensor centres creating local and global alerts.

Rémy shares an Alertmanager cheat sheet: https://blog.ruanbekker.com/cheatsheets/alertmanager/

  • 3 levels:

    • level 1: one GB not connecting for some time

    • level 2: all GS reporting the same issue for some time

    • level 3: all GS reporting the same issue for a longer time; action: raise a ticket automatically

  • metric: timestamp

    • expression:

  • Question (Jeremy): should the metrics differentiate between the data and metadata channels?

    • (Rémy) This matters more on the GC side than on the GB side.
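The three alert levels above could be sketched as a small classification function. This is only a sketch under stated assumptions: the notes say "some time" and "a longer time" without values, so the 10 minute and 1 hour thresholds, the function name and the reporter names are hypothetical.

```python
from datetime import timedelta

# Assumed thresholds; the notes only say "some time" and "a longer time".
SOME_TIME = timedelta(minutes=10)
LONGER_TIME = timedelta(hours=1)


def alert_level(outage_by_reporter, some_time=SOME_TIME, longer_time=LONGER_TIME):
    """Map per-reporter outage durations for one WIS2 Node to a level 0-3.

    `outage_by_reporter` maps each reporting service to how long it has
    seen the issue; timedelta(0) means it currently sees no issue.
    """
    durations = list(outage_by_reporter.values())
    if not durations:
        return 0
    if all(d >= longer_time for d in durations):
        return 3  # action in the notes: raise a ticket automatically
    if all(d >= some_time for d in durations):
        return 2  # every reporter sees the same issue
    if any(d >= some_time for d in durations):
        return 1  # at least one reporter sees the issue
    return 0
```

For example, `alert_level({"gb-a": timedelta(minutes=15), "gb-b": timedelta(0)})` gives level 1, because only one reporter has seen the issue for longer than the assumed threshold.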

3

  • Global Cache Metrics https://github.com/wmo-im/wis2-metric-hierarchy/blob/main/metric-hierarchy/gc.csv

    • The DWD GC is currently the only GC providing metrics

    • centre-id and hostname (scenario: the connection comes from another Global Cache rather than from a data centre)

  • Discussion

    • (Jeremy) Both GB and GC validate the notification message.

    • (Maaike) Are there any metrics for the case where a GC fails to connect to a GB after several tries?

      • decision: no, but sgc metrics may include this. Principle: to keep the number of metrics for GB and GC at a minimum level.

    • (Anna) Should the GC record statistics on data cached versus not cached?

      • (Rémy) Such statistics are not useful for WIS2 operations, but sensor centres can record them.

      • Two metrics were added: gc_cache_override_total, gc_integrity_failed_total

      • Action: Anna to raise a ticket against the WIS2 Guide to update the WIS2 metrics

    • (Jeremy) Metrics for the GC to record who downloads the data.

    • (Rémy) To test sending fake messages to the system (stress test in May).

    • OpenMetrics endpoint: GC to run Prometheus

      • Currently, DWD doesn’t have all the metrics open.
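To make the endpoint discussion concrete, the following sketch renders the two added counters in the Prometheus text exposition format, which is what a scraper would read from a metrics endpoint. The helper name, the use of `centre_id` as the only label and the sample values are assumptions; the metric names are used as written in the notes and may carry a prefix in the final hierarchy.

```python
def render_counters(counters):
    """Render counters in the Prometheus text exposition format.

    `counters` maps a metric name to a dict of {centre_id: value}.
    """
    lines = []
    for name in sorted(counters):
        lines.append(f"# TYPE {name} counter")
        for centre_id, value in sorted(counters[name].items()):
            # One sample line per centre; centre_id as label is an assumption.
            lines.append(f'{name}{{centre_id="{centre_id}"}} {value}')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Made-up sample values, for illustration only.
    print(render_counters({
        "gc_cache_override_total": {"ar-smn": 1},
        "gc_integrity_failed_total": {"ar-smn": 2, "se-smhi": 0},
    }))
```

A real Global Cache would more likely use an existing Prometheus client library than hand-rendered text; the sketch only shows the shape of the output.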

4

  • Global Discovery Catalogue Metrics

(to be discussed next time)

5

  • GM metrics

(Kai) to create a gm.csv, including wmo_wis2_gm_metrics_server_last_download, wmo_wis2_gm_metrics_server_status

6

centre_id for all Global Services

(Jeremy) Action: come up with a Global Cache name for the joint US-UK GC.

7

  • topic hierarchy for alerts

    • monitor/a/wis2/# will only be open to WIS2 Nodes using authentication (a unique user and password)
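For reference, the `#` in `monitor/a/wis2/#` is the MQTT multi-level wildcard, so a subscription with that filter receives everything published below `monitor/a/wis2`. The sketch below shows the standard matching rule in Python; the topic levels used in the examples (such as a centre-id below `wis2`) are hypothetical, and authentication is not covered.

```python
def topic_matches(topic_filter, topic):
    """Return True if `topic` matches an MQTT filter using '+' and '#'."""
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")
    for i, level in enumerate(filter_levels):
        if level == "#":
            # Multi-level wildcard: valid only as the last level; it matches
            # the parent topic and everything below it.
            return i == len(filter_levels) - 1
        if i >= len(topic_levels):
            return False
        if level != "+" and level != topic_levels[i]:
            return False
    return len(filter_levels) == len(topic_levels)
```

With this rule, `topic_matches("monitor/a/wis2/#", "monitor/a/wis2/ar-smn")` is true, while a topic under a different root such as `origin/a/wis2/ar-smn` does not match.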

Notice board of the Global Monitor; not intended for WIS2 users.

Jeremy proposed discussing the connection between the alerting and ticketing systems at the stress test in Japan in May. However, Rémy emphasised that the objective of the stress test in Japan is to check whether Global Services are operating at a decent level and doing their jobs properly.

 

 Action items

Hassan to send a notification to GS operators asking them to provide the metrics by 1 May.

Next meeting
