2024-03-11 WIS2 Monitoring and ET-W2AT Meeting

 Date

Mar 11, 2024 13:00-15:00 UTC

 Participants

ET-W2AT

  • Rémy GIRAUD

  • Jeremy TANDY

  • Tom KRALIDIS

Experts

  • Kai Wirt

  • Chems eddine ELGARRAI

  • Max Marno (Synoptic)

WMO Secretariat

  • Hassan Haddouch

  • Maaike Limper

  • Xiaoxia Chen

  • Anna Milan

 Discussion topics

No

Notes

1

  • topic hierarchy for alerts

Jeremy: By the stress test in Japan in May, will all the GS operators have implemented the metrics?

Rémy: Currently, GB (Brazil, France) and GC (Germany) already provide the metrics.

Jeremy: Max from Synoptic is now preparing the metrics (GC-USA & UK).

(Kai) Who raises the alert? Preference: automatic alerting using Alert Manager. We need to define what the GM should alert on. Other GS should not create such alerts.

(Rémy) The GM raises the alert. GS may raise alerts, but this is not mandatory. Criteria for the GM to raise an alert are to be defined.

Key question: what is the threshold of the metrics to raise an alert for GM? Kai proposed to go through all the metrics offline to define the

(Kai) If the GM and a GS raise the same alert, that is fine: it confirms both detected the same thing.

Involving China (CMA) in the GM

(Jeremy) We may run the performance test in May. The aim is to get China on board to operate a GM.

(Action) Xiaoxia to contact CMA colleagues for their GM plan.

(Maaike) What should be communicated to the GM (China)?

(Rémy) We will first finalize 1) the potential alerts; the next step is 2) for the GM to implement the alerts. This meeting is to define the overall architecture for keeping track of the alerts we want to raise.

2

Metrics hierarchy https://github.com/wmo-im/wis2-metric-hierarchy/tree/main/metric-hierarchy

 

image-20240311-145909.png

Tom proposed starting with an extensible data schema and then refining it.

To look at the metrics, the levels of the metrics, the threshold at which alerts are raised (triggered by GM).

(Max) Reset period for the metrics, for example: total download errors.

  • (Jeremy) Total number of metadata records: a rolling 24 hours? Or reset to 0 at 00 UTC?

  • (Kai) OpenMetrics has no requirement on resets; there is no difference between a reset and continuous counting. For DWD, counters reset at 00 UTC.

  • (Rémy) metrics are published as raw data.
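To illustrate Kai's point that a reset and continuous counting are equivalent for a consumer, here is a minimal Python sketch. It assumes the consumer treats any drop in a counter value as a restart from zero, which is the convention Prometheus-style tooling uses for counters. The function name `counter_increase` and the sample values are hypothetical and not part of any WIS2 specification.

```python
def counter_increase(samples):
    """Return the total increase over a series of raw counter samples.

    Assumption: a sample lower than the previous one means the counter
    was reset to 0 (for example at 00 UTC) and then counted up again.
    """
    total = 0.0
    prev = None
    for value in samples:
        if prev is not None:
            if value >= prev:
                total += value - prev  # normal monotonic growth
            else:
                total += value  # reset detected: count from 0 up to value
        prev = value
    return total


# Hypothetical "total download errors" samples, with a reset after 12.
with_reset = [5, 8, 12, 2, 4]
# The same activity published by a counter that never resets.
continuous = [5, 8, 12, 14, 16]

print(counter_increase(with_reset), counter_increase(continuous))
```

Both series yield the same increase, so publishing raw counters (as Rémy notes) leaves the reset policy up to each operator.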

Levels of metrics/threshold for alerting

  • Info/warning/broker?

  • “wmo_wis2_gb_connected_flag”

    • severity level= info

 

image-20240311-153237.png

(Max) To share an existing standard that we can follow: https://docs.python.org/3/library/logging.html#levels

Level and duration

  • Info

  • warning (30 minutes)

  • error (30 minutes)

  • critical (an hour): a complete loss of connectivity
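The levels and durations above could be captured in a small table keyed on the Python logging levels that Max referenced. The sketch below makes two assumptions: the duration is how long a condition must persist before the alert fires, and `info` has no hold time, because the notes give none. The names `SEVERITY_HOLD` and `should_fire` are hypothetical.

```python
import logging
from datetime import timedelta

# Severity name -> (Python logging level, assumed hold time before firing).
# The zero hold time for "info" is an assumption; the notes give no duration.
SEVERITY_HOLD = {
    "info": (logging.INFO, timedelta(0)),
    "warning": (logging.WARNING, timedelta(minutes=30)),
    "error": (logging.ERROR, timedelta(minutes=30)),
    "critical": (logging.CRITICAL, timedelta(hours=1)),
}


def should_fire(severity, condition_held_for):
    """Return True once the condition has persisted for the hold time."""
    _level, hold = SEVERITY_HOLD[severity]
    return condition_held_for >= hold


# A loss of connectivity lasting 45 minutes would not yet be "critical".
print(should_fire("critical", timedelta(minutes=45)))
print(should_fire("warning", timedelta(minutes=45)))
```

Reusing the logging levels keeps the ordering unambiguous, since `logging.INFO < logging.WARNING < logging.ERROR < logging.CRITICAL`.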

Alert manager rules

(Kai) Do we agree to use Alert Manager and install the rules? For each listed metric, we create rules with severity levels, grouped by type of global service rather than by metric. JMA may have some available. (Secretariat) No metrics have been received from JMA.

(Action) GB: Rémy; GC: Kai and Max; GDC: Tom
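As a starting point for the rule owners, the grouping by type of global service could be sketched as below. This assumes a Prometheus-style setup, in which alerting rules are evaluated by Prometheus and the resulting alerts are routed by Alertmanager. The group names, the alert name `GBDisconnected`, and the expression `wmo_wis2_gb_connected_flag == 0` are hypothetical placeholders; only the metric name and its `info` severity come from the discussion.

```python
import json


def build_rule_groups(rules_by_service):
    """Build a Prometheus-style rule file with one group per service type."""
    groups = []
    for service, rules in sorted(rules_by_service.items()):
        groups.append({
            "name": f"wis2-{service}-alerts",  # hypothetical naming scheme
            "rules": [
                {
                    "alert": rule["alert"],
                    "expr": rule["expr"],
                    "labels": {
                        "severity": rule["severity"],
                        "service": service,
                    },
                }
                for rule in rules
            ],
        })
    return {"groups": groups}


# Hypothetical input: one placeholder rule for the Global Broker group.
EXAMPLE_RULES = {
    "gb": [
        {
            "alert": "GBDisconnected",
            "expr": "wmo_wis2_gb_connected_flag == 0",  # assumed threshold
            "severity": "info",
        },
    ],
}

rule_file = build_rule_groups(EXAMPLE_RULES)
# JSON is valid YAML, so this output can serve as a draft rule file.
print(json.dumps(rule_file, indent=2))
```

Keeping one group per service type matches the proposed split of work (GB, GC, GDC), so each owner can draft rules independently before the groups are merged.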

 

 Action items

Xiaoxia to contact XUE Lei (NFP on WIS matters for China) for their GM plan

Next meeting