2025-02-24-ET-W2IT Monday meeting
Date
Feb 24, 2025 13:00-15:00 UTC
Invited Participants
Jeremy TANDY (ET-W2IT Chair)
Rémy GIRAUD (SC-IMT Chair)
Hyumin EOM (KMA)
Masato FUJIMOTO (JMA)
Kari Sheets (NOAA)
Steve Olson (NOAA)
Max Marno (Synoptic)
José Mauro (INMET)
Tom Kralidis (ECCC)
Kai-Thorsten Wirt (DWD)
Elena Arenskotter (DWD)
Saad Mohammed Almajnooni
Majed Mahjoub (NCM)
Chems eddine ELGARRAI (DGM)
Lei XUE (CMA)
Wenjing GU (CMA)
Xinqiang HAN (CMA)
WMO Secretariat
Enrico Fucile
Hassan Haddouch
Xiaoxia Chen
David Inglis Berry
Anna Milan
Timo Proescholdt
Apologies
Ping GUO (CMA)
Yoritsugi YUGE (JMA)
Meeting Notes
Assessment of metrics implementation in Global Services
Agreement on consistent metrics behavior in Global Services
Validation of discovery metadata at the Global Discovery Catalogue; implementation of Global Broker “discard”
WIS2 Global Services metrics
Jeremy introduced the commentary, based on analysis performed by the WMO Secretariat (Maaike) covering a 1-hour period on 20-Feb-2025.
Key outcomes from discussion: agree what the correct reporting is for each metric
Connected: 0 = not connected, 1 = connected; report 0 if you cannot connect (null / no data if you’ve never tried to connect) - see the sketch below
Numbers: _after_ de-duplication, including errors or not?
This is needed to provide baseline information to assess whether Global Service behaviour is consistent.
Metrics can be used to understand WIS2 performance in a “stepwise” fashion - i.e., start with “connection”, then look at the numbers of messages/data-objects.
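As a sketch of the agreed connection-flag behaviour - assuming the Python prometheus_client library, with an illustrative label set (the real WIS2 metrics may label differently) - a Global Service might expose the flag like this:

```python
from prometheus_client import Gauge

# Illustrative metric; the label names here are assumptions, not the agreed schema.
connected_flag = Gauge(
    "wmo_wis2_gb_connected_flag",
    "Connection status to a WIS2 Node or Global Service",
    ["centre_id"],
)

def report_connection(centre_id: str, connected: bool) -> None:
    # Agreed semantics: 1 = connected, 0 = connection attempted but failed.
    # If a connection has never been attempted, don't touch the metric at all,
    # so the Global Monitor sees "no data" (null) rather than 0.
    connected_flag.labels(centre_id=centre_id).set(1 if connected else 0)
```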
Global Monitor:
Looking at the metrics available from ma-meteomaroc-global-monitor there are gaps in the metrics from 18h-20h UTC, 19-Feb-2025.
This was due to an outage of the Global Monitor itself - which happens from time to time.
But: we’re not currently monitoring the Global Monitors, so we cannot easily tell whether “gaps” are due to an upstream Global Service not reporting data or to the GM being offline and failing to scrape the metrics every 15 seconds.
** Recommendation: we monitor the Global Monitors
… should be able to see this from wmo_wis2_gb_connected_flag - but neither ma-meteomaroc-global-monitor nor cn-cma-global-monitor appear in the metrics for Global Brokers (below).
… do we need a dashboard indicating “Global Services available now” (red/green traffic light), based on connections to HTTP or MQTT endpoints?
(Rémy) regarding the Global Monitors: since China and Morocco have confirmed having a broker, he proposed establishing a sensor centre to monitor the Global Monitors’ connectivity. He also requested that China and Morocco share their broker endpoints with the WMO Secretariat.
Action: GM (Morocco) and GM (China) to share the GM broker endpoint with the Secretariat.
Global Broker metrics:
gb_metrics_analysis.xlsx
Nomenclature:
BR-GB = br-inmet-global-broker
CN-GB = cn-cma-global-broker
FR-GB = fr-meteofrance-global-broker
US-GB = us-noaa-global-broker
Sheet #1: wmo_wis2_gb_connected_flag
Values are the “highest” value recorded during the hour (i.e., if “1” was recorded at any point during the hour, the value is “1”)
First challenge: getting consistent metrics on _connections_
Important so that we can trigger warnings/alerts from Global Monitors - if multiple GBs cannot connect to a WIS2 Node (or Global Service) this increases the likelihood that the centre is offline rather than there being a point-to-point connection issue.
[lines 6, 8, 92] US-GB does not report a metric for WIS2 Nodes (au-bom, bf-anam, za-weathersa) that are known to be offline - all other GBs report “0” (not connected)
Action: Steve to check and follow up on this.
[lines 10, 15, etc.] CN-GB does not appear to report metrics on other Global Services
Action: Lei to follow up on this.
[lines 17, 18, 23, 48, 62] CN-GB cannot connect to WIS2 Nodes from Chile (cl-meteochile), Cameroon (cm-meteocameroon), Cuba (cu-insmet), Italy (it-meteoam), Morocco (ma-marocmeteo) - confirm?
Action: Lei to check and follow up on this issue
Action: the Secretariat to resend the IP addresses of GB (China); GB (China) to check the username and password for these WIS2 Nodes (e.g., Morocco).
[line 35] Only BR-GB appears to be subscribing to FR-GB (specifically, only BR-GB is reporting being connected) - more connections are expected (albeit CN-GB is probably just not reporting its connection)
Rémy reported that the Global Broker in Brazil is using different credentials for incoming and outgoing connections and suspects this is the source of the problem
Action: Lei to check and follow up on this connectivity issue
[line 36] US-GB is using the wrong centre-id for FR-GB (which probably also explains why it looks like there are so few connections to FR-GB - see above)
Action: Steve to check and follow up on this
[lines 41, 47] Only FR-GB is connected to WIS2 Nodes hk-hko-swic and ir-irimo - at least 2 subscriptions are recommended
Action: for hk-hko-swic, the Secretariat to verify the correct IP address of GB (Brazil) and share it with the Hong Kong colleagues; also check with the Iranian colleagues on their access control
[lines 46, 47, 48] US-GB is not reporting connection to EUMETSAT, Iran, and Italy - is this because US-GB never attempts to establish a connection?
Action: Steve to check. Secretariat to share the Global Broker (US) credentials, and Italy to verify their whitelist for the Global Broker (US)
[line 53] US-GB reports connection to ke-kmd which isn’t in the WIS2 Registry - what’s happening here?
Action: Steve to check
[line 59] Only US-GB reports connection to WIS2 Node from Kazakhstan (kz-kazhydromet) - confirm?
Rémy confirmed that GB (France) and GB (Brazil) have connectivity with Kazakhstan and that there was only a short downtime (one day)
[lines 75, 76] Confusion on centre-id for Sint Maarten (sx-met or sx-metservice)
Action: GBs to remove the subscription to sx-met (centre-id sx-metservice is the correct one).
Sheet #2: wmo_wis2_gb_msg_received_total
Values are the _increase_ in the number of messages received during the hour (see the sketch below)
Second challenge: diagnosing whether each GB is handling (roughly) the same number of messages
Important so that we can see if messages are getting lost, which would mean that subscribers to such a GB would be under-served
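For reference, a minimal sketch of how the spreadsheet values could be reproduced against a Global Monitor’s Prometheus HTTP API (the endpoint URL and timestamp are illustrative): sheet #1 takes the hourly maximum of the gauge, sheet #2 the hourly increase of the counter.

```python
import requests

# Illustrative endpoint; substitute the Global Monitor's real Prometheus URL.
PROM = "https://global-monitor.example.org/prometheus"

def query(expr: str, at: str) -> list:
    """Run an instant query against the Prometheus HTTP API."""
    r = requests.get(f"{PROM}/api/v1/query", params={"query": expr, "time": at})
    r.raise_for_status()
    return r.json()["data"]["result"]

t = "2025-02-20T13:00:00Z"  # end of the 1-hour analysis window (assumed)
# Sheet #1: highest connected_flag value recorded during the hour.
flags = query("max_over_time(wmo_wis2_gb_connected_flag[1h])", t)
# Sheet #2: increase in messages received during the hour.
received = query("increase(wmo_wis2_gb_msg_received_total[1h])", t)
```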
[summary] FR-GB and BR-GB report roughly similar numbers (they’re running the same software!); CN-GB reports roughly 20x more messages, and US-GB 3000-3500x more
Action: Steve and Marc to follow up on this
[line 8] CN-GB reports receiving 25k messages from Burkina Faso while that WIS2 Node is known to be offline (and CN-GB reports no connection)
Action: Lei to follow up on this
Decision: no need to run the performance tests. Instead, identify 3 or 4 centres as a reference to monitor whether all 4 GBs behave consistently. Each GB will test the metrics.
A discussion was raised regarding how to report on the functioning of Global Services and the required retention of metrics. For GB (France), Rémy reported that the retention period is two weeks of archived metrics. For GM (China) and GM (Morocco), the retention period is also two weeks.
Chems reported that the retention for GM (Morocco) can be extended to one month.
(Hassan) to report on the functioning of the Global Services, we could consider using the Jira system to provide statistics on GS performance.
(Rémy) we need to collectively define KPIs for Global Services performance, to be presented at INFCOM-4.
Action: Rémy, Jeremy, and Hassan to discuss metrics retention in Geneva this week on the sidelines of the Gateways meeting
Sheet #3: wmo_wis2_gb_msg_no_metadata_total
(Similar to sheet #2)
[lines 14, 17, 24, 25, etc.] CN-GB is reporting no metadata when the other GBs are not - is the metadata validation only enabled for CN-GB?
[line 82] CN-GB reports no missing metadata from the UK; BR-GB and FR-GB report errors; US-GB doesn’t report
Global Cache metrics
gc_metrics_analysis.xlsx sent by email from Jeremy
Nomenclature:
CN-GC = cn-cma-global-cache
UK/USA-GC = data-metoffice-noaa-global-cache
DE-GC = de-dwd-global-cache
JP-GC = jp-jma-global-cache
KR-GC = kr-kma-global-cache
UK/USA-GC is still reporting metrics for the 100+ test nodes used in the Global Services testing - these are excluded from the analysis for clarity. But these “old” centre-ids must be removed.
Update: Max fixed it.
Sheet #1: wmo_wis2_gc_connected_flag
[lines 3, 4] JP-GC reports no metrics for Antigua or ai-metservice (?) even though a small number of messages are sent
[line 5] DE-GC reports no metric for Argentina even though messages are being sent according to GB metrics; from looking at Grafana DE-GC appears to have gaps in provision of metrics
[lines 17, 28, etc.] DE-GC and JP-GC often don’t report a metric (no value available, e.g., Cameroon, Guinea) whereas other GCs are able to connect; GB metrics indicate that Cameroon and Guinea are connected but (excepting US-GB) not sending any messages. How are CN-GC, UK/USA-GC and KR-GC reporting “connected” when there was probably nothing to download?
[line 21] UK/USA-GC can’t connect to Cyprus - confirm?
Note: “last-download” timestamp - if nothing has ever been downloaded for a given data server, the value will be null (not reported)
… Generally, metrics are only set once something has been tried. They are (mostly?) not initialised. CN-GC appears to initialise _download_total to zero even when there’s been no connection. Confirm?
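To make the behavioural difference concrete, a sketch assuming prometheus_client (the metric and label names are illustrative): pre-creating the labelled child is what makes a zero appear in the export before anything has been tried.

```python
from prometheus_client import Counter

download_total = Counter(
    "wmo_wis2_gc_download_total",
    "Data objects downloaded by the Global Cache",
    ["centre_id"],
)

# Behaviour CN-GC appears to exhibit: create the labelled child up front,
# so the metric is exported as 0 even before any connection or download.
download_total.labels(centre_id="xx-newnode")

# Behaviour of the other GCs: the child only comes into existence (and is
# exported) on the first actual download attempt.
def on_download(centre_id: str) -> None:
    download_total.labels(centre_id=centre_id).inc()
```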
Decision: we need to agree on consistent metrics behaviour, given that the current DWD and UK/USA implementations differ.
Action: Jeremy, Rémy, Kai and Max to have a meeting this week to discuss the GC metrics.
Sheet #2: wmo_wis2_gc_download_total
[summary] Generally, CN-GC, UK/USA-GC and KR-GC tend to agree on numbers; DE-GC and JP-GC appear different.
[line 15] ca-eccc-msc-global-discovery-catalogue published ~30 cacheable data objects in the hour - what’s being published? (Metadata tarballs are once per day)
[line 20] GCs are not reporting a metric for cn-cma-global-discovery-catalogue - suggesting there has never been a download from the CN-GDC; is CN-GDC providing daily tarballs of the metadata records?
Action: pause for discussion
Sheet #3: wmo_wis2_gc_download_errors_total
[line 14] GC downloads from ca-eccc-msc: CN-GC and UK/USA-GC report ~200 errors, KR-GC reports 13k errors, DE-GC reports 172k errors, JP-GC reports 550k errors … what is happening?
[line 24] CN-GC, UK/USA-GC and KR-GC all report broadly consistent download totals from de-dwd-gts-to-wis2 (~500k messages); CN-GC and UK/USA-GC report ~300 errors, but KR-GC reports 147k errors … does _download_total include _download_errors_total?
Action: pause for discussion
Global Discovery Catalogue metrics
gdc_metrics_analysis.xlsx
Global Discovery Catalogues (GDC) also appear to suffer from inconsistent metrics implementation.
But GDCs are not part of the data-exchange operations (mostly, the metrics are about metadata quality)
… so recommend that we prioritise getting metrics for GB and GC consistent first.
Find the best time to set properties.metadata_id as required (wmo-im/wis2-notification-message/#119) https://github.com/wmo-im/wis2-notification-message/issues/119
Current situation: from 1-Sep-2025 Global Brokers will check that a valid discovery metadata record exists for data relating to any WIS2 Notification Messages published via the GB. This is done by comparing the MQTT “channel” on which the WNM is published against a list of all channels harvested from discovery metadata published in the Global Discovery Catalogue.
This isn’t a fool-proof check; the GB cannot distinguish between datasets if a WIS2 Node publishes notifications about more than one dataset on the same channel, i.e., the presence of one discovery metadata record would be sufficient for the GB to approve/republish notifications from all those datasets. That said, this is a rare edge case.
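A minimal sketch of that check - the topic layout follows the WIS2 topic hierarchy, but the function names and the harvesting of approved_channels are hypothetical:

```python
def channel_from_topic(topic: str) -> str:
    # A WNM arrives on a topic such as "origin/a/wis2/<centre-id>/data/core/...";
    # strip the leading "origin/a/wis2/" (or "cache/a/wis2/") to get the channel.
    return topic.split("/", 3)[3]

def should_republish(topic: str, approved_channels: set[str]) -> bool:
    # approved_channels is the list harvested from discovery metadata in the GDC.
    # The limitation noted above applies: if two datasets share one channel,
    # a single valid metadata record approves notifications from both.
    return channel_from_topic(topic) in approved_channels
```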
Proposal: make inclusion of properties.metadata_id MANDATORY in the WNM specification.
The proposal is a breaking change, therefore needs an appropriate level of visibility and approval. Consequently, the proposal will be submitted to INFCOM-4.
Concerns with the current situation:
1/ The GDC is being used to configure whitelists for real-time data exchange. This raises the expectation of resilience for GDCs to that of an operational component.
First, a GDC may corrupt a record through bad processing.
Second, a GDC may not be available when the GBs request the list of valid channels (FR-GB caches the list for 48 hrs, so GDC availability isn’t a big problem, except that the list would not include any new entries while an older version is used).
** Recommendation:
GBs will use a “composite” list of valid channels compiled from all three GDCs, i.e., the superset of valid channels - if a channel is reported by one or more GDCs it will appear in the list used by GBs. Implementation details to be agreed.
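A sketch of one possible implementation (the GDC endpoints, the response format, and the harvesting query are all placeholders; as noted, the details are still to be agreed):

```python
import requests

# Placeholder GDC endpoints and response format; the real query is TBD.
GDC_ENDPOINTS = [
    "https://gdc-a.example.org/channels",
    "https://gdc-b.example.org/channels",
    "https://gdc-c.example.org/channels",
]

def composite_channel_list() -> set[str]:
    """Superset of valid channels: a channel reported by one or more GDCs
    appears in the list used by the GBs."""
    channels: set[str] = set()
    for url in GDC_ENDPOINTS:
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()
            channels |= set(resp.json())  # assumes a JSON array of channels
        except requests.RequestException:
            continue  # an unavailable GDC must not block the others
    return channels
```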
A WIS2 Node may not be aware of a broken linkage between its discovery metadata (and the MQTT channel described therein) and the approved list of channels used by GBs - which would result in real-time notification messages and data exchange being blocked.
** Recommendations:
A WIS2 Node may choose to validate their discovery metadata prior to publication to ensure that it will be correctly parsed (CA-GDC provides a validator service).
A WIS2 Node should publish discovery metadata at least 24 hours prior to starting real-time data exchange _and_ subscribe to “monitor” messages published by the GDCs (e.g., “wmo_wis2_gdc_kpi_percentage_total”) to confirm that the discovery metadata has been successfully published
WIS2 Node IT Operations should include monitoring the “wmo_wis2_gb_messages_no_metadata_total” metric for their “centre-id”; GBs will increment this when the linkage is not found. Any increases in this metric should be investigated and the causes resolved.
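As a sketch of that last recommendation - the Prometheus endpoint, label name and centre-id are illustrative - a WIS2 Node operator could poll the Global Monitor for increases in the metric:

```python
import requests

PROM = "https://global-monitor.example.org/prometheus"  # illustrative
CENTRE_ID = "xx-example-nws"  # the operator's own centre-id

def no_metadata_increase_last_24h() -> float:
    """Sum, across all reporting GBs, of the 24-hour increase in
    wmo_wis2_gb_messages_no_metadata_total for this centre-id."""
    expr = (
        "sum(increase(wmo_wis2_gb_messages_no_metadata_total"
        f'{{centre_id="{CENTRE_ID}"}}[24h]))'
    )
    r = requests.get(f"{PROM}/api/v1/query", params={"query": expr})
    r.raise_for_status()
    result = r.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

# Any increase should be investigated and the cause resolved.
```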
Actions
GM (Morocco) and GM (China) to share the GM broker endpoint with the Secretariat.
Kari and Steve to follow up to ensure more consistent GB (US) metrics reporting
All to go through the list shared in the email to ensure consistent metrics
WIS2 Recommendations:
Each Global Service operator should run their own Prometheus instance so that they can monitor service performance using a time-series to compare with prior periods.
Each Global Service operator should have IT-Ops procedures in place to identify issues arising (e.g., failure to connect to upstream Node), diagnose faults and remedy the issue.
Next meeting
10 March