2022-02-14 ET-W2AT Meeting
Date
Feb 14, 2022 14:30-16:30 UTC
Participants
ET-W2AT
@Jeremy Tandy (Unlicensed)
@Rémy Giraud
@Dana Ostrenga (Unlicensed) (absent)
@thorsten.buesselberg (Unlicensed)
@Kai Wirt (Unlicensed)
@Henning Weber (Unlicensed)
@Tom Kralidis (Unlicensed)
@peter.silva (Unlicensed)
@Ken Tsunoda (Unlicensed)
@Li Xiang (Unlicensed) (absent)
@Baudouin Raoult (Unlicensed)
WMO Secretariat
@Peiliang Shi (Unlicensed)
@Enrico Fucile
@HADDOUCH Hassan
@Timo Proescholdt
@David Berry
@Xiaoxia Chen
Goals
Discuss the Connectivity & functions of Global Cache
Discuss what else GISCs should do: shared services? How should they support their Area of Responsibility (AoR)?
Discussion topics
Item 1: Data distribution via Global Cache (presenter: Jeremy)

Jeremy presented "Data distribution via Global Cache".

- Assumption #1: the Global Cache will hold "file" Data Objects, the same as how "file" Data Objects are shared on the GTS.
- Assumption #2: the Global Cache will provide "file-level" metadata, e.g. SpatioTemporal Asset Catalog (STAC), to allow Data Consumers to browse the Global Cache content. This metadata is NOT discovery metadata; it is more fine-grained and is used to describe each individual Data Object (file) that comprises the dataset.
- Tom> shared a link (an example of a global model).
- Kai> Is there too much metadata? STAC records + discovery metadata + filename + geographic info in the topic structure. It seems like we have duplication of information.
- Baudouin> shared a link describing different types of metadata.
- Tom> STAC is different to discovery metadata; we need to be clear on granularity. It is widely used in the mass market, and is particularly useful to help search engines index your dataset content.
- Jeremy> With STAC, data consumers can know what the data contains without opening files.
- Kai> Not saying that STAC isn't a good idea, only that we need to avoid duplication.
- Tom> How would the STAC metadata be used? The discovery metadata record for the dataset would include an actionable link ("association") that points to a top-level STAC record, from where a user or system can use hypermedia to browse through all the files in the dataset, discovering more information about what's included in each file.
- Remy> But the Global Cache is real-time data?
- [Yes.] The STAC records are a feature of the Global Cache itself, not part of the Global Catalogue.
- Jeremy> The STAC records are automatically generated as file Data Objects arrive, and purged as the file Data Objects are deleted.
- Remy> It would be useful for NC/DCPC to do this too, to be consistent.
- Jeremy> Perhaps we can see STAC metadata as an optional extra provided by the Global Cache or NC/DCPC, e.g. defined in the Technical Regulations as good practice, but not mandatory? The use of STAC doesn't change the overall architecture of WIS2; it just makes things easier for data consumers.
- Assumption #3: the original dataset and the cached copy are different Datasets because they have different temporal extents, e.g. an NC may provide a back-catalogue of SYNOP data for 6 months (or longer), but the Global Cache only holds 24 hours of data.
- Tom> If everything else is the same, the version at the originating center (i.e. the NC/DCPC) and the cached copy (or copies) can be treated as distributions of the same dataset; see the W3C DCAT vocabulary. We can indicate the 24-hour retention period as part of the Association link in the [single] Dataset discovery metadata record. The retention period refers to the duration that individual Data Object files are kept, not the entire dataset/distribution.
- Jeremy> It would be neater to have just one discovery metadata record in the Global Catalogue.
- Remy> Why would people care about the difference in temporal extent? The Global Cache is only there to support real-time data delivery.
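The automatically generated, purged-with-the-file STAC records discussed above might look like the following minimal sketch, one STAC Item per cached file Data Object. The identifier, coordinates, and asset URL are hypothetical; only the field layout follows the STAC Item specification.

```python
# Minimal STAC Item describing one cached file Data Object (sketch only).
# Required STAC Item fields: type, stac_version, id, geometry, bbox,
# properties (with "datetime"), links, assets. All values are illustrative.
stac_item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "synop-de-2022-02-14T1400Z",                      # hypothetical identifier
    "geometry": {"type": "Point", "coordinates": [13.4, 52.5]},
    "bbox": [13.4, 52.5, 13.4, 52.5],
    "properties": {"datetime": "2022-02-14T14:00:00Z"},     # observation time
    "links": [{"rel": "root", "href": "../catalog.json"}],  # hypermedia browsing
    "assets": {
        "data": {
            "href": "https://cache.example.org/synop/example.bufr",  # hypothetical URL
            "type": "application/x-bufr",
        }
    },
}
```

Such an Item would be created when the file arrives in the Global Cache and deleted with it, mirroring the 24-hour retention period.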
- Data Consumers [only] access the Data Objects in the Global Cache because they receive a message pointing to the URL where the data can be accessed.
- Jeremy> It is important to allow people to download directly from the Global Cache (and browse the available Data Objects), irrespective of whether a Data Consumer has received a message; data access happens by downloading a file from a URL.
- Kai> Would it be better to avoid confusion and provide download links only for the data at the originating center? The Global Cache is only intended to support real-time data exchange [and provide low-latency, resilient access for Data Consumers who receive "data availability" messages from the Global Broker]. Recommend users go to the originating center if they want to browse and download data directly.
- Jeremy> In this case, the discovery metadata record would not need to include "association" links that refer to the Global Cache instances. Data Consumers will be directed to the Global Cache via the "data availability" messages; they don't need to be able to independently discover (and browse) the Global Cache.
- Tom> How/where are the "association" links [for subscription URLs at Global Broker instances] added to the discovery metadata record?
- Jeremy> Suggest these are added by the Global Catalogue; it knows about all the Global Broker instances.
- Peter> [confirming understanding] The Global Catalogue is not in the real-time data flow, so it doesn't advertise the availability of specific files?
- Kai> [describes his understanding of the real-time data flow] The Global Broker only makes sense if it is the place where people subscribe to messages about data availability. If subscription is only offered from Global Brokers to Data Consumers, then for each Data Object published by an NC/DCPC, a Data Consumer would get a "data availability" message from the originating center and from each of the Global Cache instances.
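The "data availability" messages discussed here could carry something like the following. This is a sketch only: the WIS2 message format had not been agreed at the time of this meeting, so every field name and value below is an assumption.

```python
# Illustrative "data availability" notification: a small message that points
# to the URL of a file Data Object rather than embedding the data itself.
# All field names and values are assumptions, not an agreed WIS2 format.
availability_message = {
    "data_id": "synop/2022-02-14T1400Z/example.bufr",  # identifies the Data Object
    "pub_datetime": "2022-02-14T14:05:00Z",            # when the file was published
    "source": "cache-europe",                          # originating center or cache instance
    "url": "https://cache.example.org/synop/2022-02-14T1400Z/example.bufr",
    "integrity": {"method": "sha512", "value": "..."}, # checksum to verify the download
}
```

The Data Consumer resolves `url` to fetch the file, so the message itself stays small regardless of the size of the Data Object.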
- Data Consumers should not subscribe to "data availability" messages directly from the Global Cache instances.
- Henning> What is the value of the Global Broker if Data Consumers can subscribe elsewhere? Does it provide a service guarantee, push messages based on priority, etc.?
- The Global Broker provides an aggregation point for all messages, so that a Data Consumer can manage their subscriptions on a single server rather than having to work with many message brokers.
- Peter> The "magic" provided by any broker in the system is threefold: (a) republishing messages, (b) storing a bit of data, (c) avoiding circulation of duplicate messages (i.e. "loop avoidance").
- Jeremy> The WIS2 architecture separates the data storage and messaging functions between Global Cache and Global Broker.
- Enrico> We seem to be talking about a different kind of service from what was originally envisaged. Storing data for more than 24 hours, and allowing people to browse that content, is more like a "Global Storage" service. There's currently no obligation for an NC or DCPC to offer access to data for more than 24 hours. Given that there's much interest in accessing this kind of data, maybe this "Global Storage" service could be created to offer long-duration datasets? Noting that this would be an additional thing, not a core part of WIS.
- Remy> Such additional services need to build on the foundation of WIS2: the Global Catalogue and real-time data exchange via Global Broker and Global Cache.
- Hassan> Maybe some NCs might not be able to offer download for their data? Perhaps the Global Cache should provide this capability for them?
- Jeremy> WIS 2.0 principle #7 requires that real-time data is available to download for a minimum of 24 hours: "7. Will require all services that provide real-time distribution of messages (containing data or notifications about data availability) to cache/store the messages for a minimum of 24-hours and allow users to request cached messages for download" (https://community.wmo.int/activity-areas/wis/wis2-implementation). I interpret this as an obligation on the originating center (i.e. NC/DCPC) to be able to provide their data for download.
- Baudouin> Maybe another center could offer the download service on behalf of the NC, e.g. their affiliated GISC?
- Tom> If an NC were not able to provide download access to their dataset, we could accommodate this in the discovery metadata record. In this case, the discovery metadata record would not include a download "association" link [at the NC]; a download "association" link could instead refer to some other location, such as their GISC or the Global Cache. The discovery metadata record would include "association" links to where Data Consumers can subscribe to the dataset, e.g. at the Global Brokers.
- Jeremy> Note the agreement above that the Global Cache is not required to offer a "download" perspective (e.g. with file-level metadata for Data Consumers to browse).
- Peter> Receipt of multiple messages is what provides resilience; it allows for graceful recovery if a Global Broker goes offline, because Data Consumers still get their messages.
- Jeremy> But this puts the burden on the Data Consumer to parse multiple messages and discard duplicates.
- Peter> Yes, but this situation arises anyway if you want resilience.
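The consumer-side duplicate handling that Jeremy and Peter discuss (and the "wait briefly for the preferred Global Cache" behaviour described later in the meeting) could be sketched as follows. This is a hypothetical illustration, not an agreed WIS2 mechanism; the message fields `data_id` and `source` are assumptions.

```python
import time

class DuplicateFilter:
    """Sketch of "Data Consumer decides": discard repeat "data availability"
    messages for the same Data Object, waiting a short grace period in the
    hope that the consumer's preferred Global Cache instance announces it."""

    def __init__(self, preferred_source, grace_seconds=30):
        self.preferred_source = preferred_source
        self.grace_seconds = grace_seconds
        self.downloaded = set()  # data_ids already acted upon
        self.pending = {}        # data_id -> (first message seen, arrival time)

    def on_message(self, msg, now=None):
        """Return the message to act on (i.e. download), or None to wait."""
        now = now if now is not None else time.monotonic()
        data_id = msg["data_id"]
        if data_id in self.downloaded:
            return None                          # duplicate: already handled
        if msg["source"] == self.preferred_source:
            self.pending.pop(data_id, None)
            self.downloaded.add(data_id)
            return msg                           # preferred copy: use it now
        first = self.pending.setdefault(data_id, (msg, now))
        if now - first[1] >= self.grace_seconds:
            self.pending.pop(data_id, None)
            self.downloaded.add(data_id)
            return first[0]                      # grace period over: settle
        return None
```

The key design choice, as Peter argues, is that the waiting and discarding happens entirely at the Data Consumer, so each subscriber can apply its own geographic preference without the Global Broker suppressing messages for everyone.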
- The Data Consumer will need to subscribe to multiple brokers to make sure they're not affected by a local outage, whether those brokers are at a Global Broker or a Global Cache.
- Peter> MQP reliability is all about receiving messages multiple times. You could try to merge messages together at the Global Broker to reduce the burden on the Data Consumer, but this creates a "versioning nightmare" and introduces issues such as how long a Global Broker should wait for other "duplicate" messages before sending messages out, etc.
- Remy> Are messages from the Global Cache and the originating NC/DCPC published to the same topic? We want to avoid large numbers of Data Consumers trying to download data from the NC/DCPC, and instead push this traffic to the Global Cache. It is also useful to allow Data Consumers to prioritize receiving messages that refer to their preferred Global Cache instance, e.g. based on geographic location.
- Peter> The Global Broker could resolve duplicate suppression [of messages from Global Cache instances], but each subscriber (Data Consumer) might have a geographic preference, and duplicate suppression would affect all subscribers. Maybe we need multiple "channels": one for all traffic from the originating NC/DCPCs, and one for each of the geographic regions served by Global Cache instances (Europe, Asia, Americas, etc.).
- Peter> The simplest way to deal with duplicates is for the subscriber to decide based on their local circumstances. The easiest way to make this work is if the Data Consumer makes decisions about which messages to discard. For example, the Data Consumer may be happy to wait an extra 30 seconds after receiving a "data availability" message from a [sub-optimal] Global Cache instance for a message from their preferred Global Cache instance; if the "preferred" message doesn't arrive, they download the data from the other Global Cache instance. "Data Consumer decides" is the most flexible approach.
- Remy> "Channels": is this protocol specific?
- Remy> Worried that concepts like "channels" might confuse people if they're not supported by all protocols.
- Peter> No, the concept of "channels" maps to [all?] the main MQ protocols; for example, in AMQP it is an "exchange". We're designing WIS2 to be protocol agnostic, not sticking with MQTT only. At MSC, our message broker system uses this "channels" concept; we implemented a change from [AMQP to MQTT] and it was very simple, only requiring a change of URL and protocol.
- Peter> We see that the Data Consumer receives multiple messages, most of which are discarded. The bigger the message, the more bandwidth is wasted; embedding the data in the message bulks up the message and wastes even more bandwidth.

Follow-up discussions:
Action items
Decision
- Global Cache will hold "file" Data Objects - same as how "file" Data Objects are shared on the GTS
- Data Consumers who want to browse and download data should use the originating center (i.e. NC/DCPC) - Global Cache may be accessed directly, but its primary purpose is to host files that are identified in real-time "data availability" messages. Global Cache does not need to provide a mechanism for Data Consumers to browse the cached content (e.g. file-level STAC metadata). Effectively, this means the cached copy would not be identified in the discovery metadata as a Distribution of the dataset. The discovery metadata record would include "association" links that refer to the originating center for download, and to originating center and Global Brokers for where a Data Consumer can subscribe to updates.
- The Global Catalogue will update discovery metadata records to add "association" links for subscription URLs at Global Broker instances
- The Global Catalogue advertises the availability of datasets and how/where to access them or subscribe to updates; it does not advertise the availability of individual Data Objects that comprise a dataset
- Data Consumers should subscribe to Global Brokers to receive "data availability" messages; exceptionally, a Data Consumer may decide to subscribe directly to the local message broker at the originating NC/DCPC; Data Consumers should not subscribe to the local message broker at Global Cache instances.
- Global Cache instances and NC/DCPC use consistent topic structure in their local message brokers
- Global Brokers should use distinct "channels" to keep messages from originating centers separate from messages originating from Global Cache instances
- Data Consumers will need to implement logic to discard "duplicate" messages
- Only embed data within a message in exceptional circumstances
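The "distinct channels" and "consistent topic structure" decisions above could be sketched as a simple topic-building convention. This is purely illustrative: the actual WIS2 topic hierarchy was still under discussion at this meeting, so the "origin"/"cache" prefixes and region names are assumptions.

```python
# Sketch: keep messages from originating centers and Global Cache instances
# on distinct channels, while reusing the same dataset-specific topic suffix
# so that NC/DCPC and cache brokers stay consistent with each other.
# The "origin"/"cache" prefixes and region names are hypothetical.

def channel_topic(channel, dataset_topic, region=None):
    """Build a broker topic: 'origin/<dataset>' for NC/DCPC traffic,
    'cache/<region>/<dataset>' for a regional Global Cache instance."""
    if channel == "origin":
        return f"origin/{dataset_topic}"
    if channel == "cache":
        if region is None:
            raise ValueError("cache channel requires a region")
        return f"cache/{region}/{dataset_topic}"
    raise ValueError(f"unknown channel: {channel}")
```

With this convention, a Data Consumer with a geographic preference subscribes only to, say, the "cache/europe/..." channel, while the shared suffix keeps topics aligned across all brokers.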