2022-03-07 ET-W2AT Meeting
Date
Mar 7, 2022 14:30-16:30 UTC
Participants
ET-W2AT
@Jeremy Tandy
@Rémy Giraud
@Dana Ostrenga
@thorsten.buesselberg
@Kai Wirt
@Tom Kralidis
@peter.silva
@Kenji Tsunoda
@Baudouin Raoult
Other Experts
@Kari Sheets
WMO Secretariat
@HADDOUCH Hassan
@David Berry
@Anna Milan
@Xiaoxia Chen
Apologies
@Henning Weber
@Li Xiang
Goals
[Peter] message structure (what's in the "micro-metadata record", what's the max size of a message, when should we embed the data?), file-naming convention
[Jeremy] revisit the WIS2 (meta)data workflows
Discussion topics
Item | Presenter | Notes |
---|---|---|
1 | Peter Silva | Peter presented his slides on WIS2 TT-Protocols status: MQP payload. |
2 | Tom Kralidis | Tom presented a proposal to use a STAC Item as the base construct for our "data availability" message. |
3. Discussion | All | Baudouin > Is [STAC] 'datetime' equivalent to [WMO] 'pubTime'; are we losing semantics? Also, "assets" allows multiple file granules - this could be abused. If we're only using "half" a standard, then what's the point?
Remy > I'm in favour of standards. I agree with Baudouin that we must use a standard properly, and we should consider what the benefit is.
Tom > These records will automatically integrate with existing workflows and tools.
Peter > STAC records are about 1 kB, and even these were sparsely populated, so we would expect them to grow. WMO messages are about 400 bytes.
Tom > Metadata isn't just for catalogues, and STAC is extensible [examples provided].
Remy > Is there a benefit in putting "data" attributes in the message, and how does this deal with embedded data, e.g. a tsunami warning?
Peter > You can embed JSON in other JSON - that's legal, but the question is whether the embedded data adds value for people.
Tom > We could create a STAC extension for handling embedded data.
Baudouin > I think that we're trying to shoe-horn our requirement into an existing standard, yet we talk about setting fields to "null". Adding our own fields and creating an extension to deal with embedded data is not really using the standard!
Baudouin > I can use the same logic to say we should base our message on "cloud events"; this would enable linking with cloud workflows … https://cloudevents.io (CloudEvents is a specification for describing event data in a common way.)
Peter > There are hundreds of options that we could look at: Kafka, etc.
Jeremy > Action (Baudouin, Peter): assess whether CloudEvents is worth a closer look by TT-Protocols.
Remy > We're using attributes from Peter's structure, and embedding them in something else, e.g.
STAC, so long as the attributes fit into STAC (or something). I think we can defer this decision.
Peter > The key point is to define the _minimal_ set of records that we need, while permitting "alien" attributes. The size problem will grow as people start to put more data into messages, as STAC permits.
Remy > The key point is that we're assessing whether to re-use an existing message structure. We can identify reasons, such as message size, to discard such existing structures.
Remy > We need a foundation for WIS2, which means a message needs to define some minimal set of attributes; others could build on the "foundation" and provide a "STAC-compatible" translator to drive other communities' workflows.
Jeremy > Why are 'baseURI' and 'relPath' split?
Peter > Because of the daisy-chain republication. The hash is combined with the relPath to spot duplicates, which needs to avoid hash collisions; "size" helps too.
Jeremy > You're assuming that all data publishers have the same directory structure.
Peter > The assumption is that the relPath matches the topic hierarchy; relPath is a "portable reference" to the file.
Remy > What's the need to have the "topic hierarchy" embedded in the message? A subscriber will know that, and we won't change the topic when we republish. Why do we need to embed the topic in the message?
Jeremy > 'retPath' vs 'relPath'? What's the difference?
Baudouin > retPath is for providing the URL for API access.
Peter > relPath is a portable reference; retPath (retrieval path) would be unique to a given server. retPath _always_ overrides the relPath. relPath is like a virtual reference - imagine that you were providing the data as files in a folder structure; it's like a key for identifying duplicates. The client software will give you the topic that the message came from, for example if you're using wildcards to subscribe.
Baudouin > Extension? Can we permit the use of absolute URLs in retPath? An API endpoint might have a different baseURL.
Actually, you can always put redirects in place - so ignore my request.
Baudouin > The 'relPath' can be used to tell a Global Cache where to put a given file - even if the "data file" is downloaded from an API endpoint identified by retPath.
Remy > Canadian requirement - use the relPath to try to "rsync" different data-pump instances.
Baudouin > We could split the relPath in two - the "path" bit and the "filename" bit.
Peter > We avoid this because "/" and "+" are special characters for some MQ implementations. Also, we're "moving" metadata from the GTS filename into the relPath.
Kai > What we need is a URL to download the file.
Peter > We don't have an absolute URL because of the daisy chaining.
Jeremy > Can we get consensus that this is acceptable for us all?
Remy > We're designing a foundation for all WIS users; this works for World-Weather-Watch and GTS migration. Would it work for data shared via API? [yes - use retPath]. Would it work for data shared by other communities, e.g. ocean, hydrology? Yes - use retPath (for the real location of the file), and they need to know the relPath because they need to know the topic that they're publishing their messages to.
Jeremy > This is a foundation that will work for all.
Remy > My only concern is that relPath isn't that meaningful for non-cached content.
Peter > relPath is used for duplicate suppression; we need this portable identifier.
Remy > But not all data are in the cache.
Baudouin > Perhaps rename "relPath" as "data-key".
Jeremy > Duplicate suppression - there are two types: spotting duplicate messages, and spotting slightly different messages (from different sources) that refer to the same data object.
Peter > The first is trivial - and encompassed in the second case, so I only worry about the first.
Jeremy > But this is only relevant to data that is in the Global Cache.
Peter > The assumption is that other people will re-publish the data, irrespective of whether the data is in the Global Cache. Example: a regional hub downloads once for national use and redistributes to different systems. Duplicate message suppression makes this case easy.
Baudouin > Can we add the MIME type to the message, e.g. "application/x-grib" or "image/png"?
Jeremy > Action (Peter): take Baudouin's proposal to TT-Protocols - add the MIME type to the message, e.g. "application/x-grib" or "image/png".
Kenji > How long do we (wis2node? Global Cache? Consumer?) need to hold the hash values to check for duplicates? One hour, one day, one week?
Remy > Seems like a local decision.
Peter > Agreed. Keeping hashes for 1 hour would be a minimum.
Remy > Is duplicate suppression part of a standard broker?
Peter > Duplicate suppression isn't part of a normal broker - it's a client function. It means that the Global Broker will need additional "intelligence" to de-duplicate messages.
Remy > So the Global Broker is more than just an MQ broker - it needs additional software to: (i) publish messages, (ii) subscribe to other brokers, (iii) de-dupe messages.
Jeremy > Filename - use "data identifier" instead of "filename"; it's less GTS-centric.
Peter > Filename is part of the "relPath" - not a separate attribute.
[Peter describes the topic hierarchy … aka "topic tree"]
Dave Berry > I believe the WIGOS station ID uses the 3-digit ISO country codes as the issuer of identifier. Would it make sense to use the same for both? [Agreed]
Jeremy > Action (Peter): update the topic tree structure to use 3-letter ISO country codes.
This section is where we need to look at mapping to dataset granularity.
Remy > This is where we can engage "domain specialists" to help determine the best sub-structure for the topic tree in different domains [agreed] |
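
To make the 'relPath' / 'retPath' and duplicate-suppression points above concrete, here is a minimal Python sketch. It is illustrative only and is not the TT-Protocols specification. The attribute names 'baseURI', 'relPath', 'retPath' and "size" come from the discussion; the "hash" field name, joining retPath onto baseURI, and the `DuplicateFilter` and `download_url` helpers are assumptions made for this example. The one-hour default retention follows Peter's suggested minimum.

```python
import time
from typing import Optional


def download_url(message: dict) -> str:
    """Build a retrieval URL; retPath, when present, overrides relPath.

    Assumption: both paths are resolved against baseURI.
    """
    path = message.get("retPath") or message["relPath"]
    return message["baseURI"].rstrip("/") + "/" + path.lstrip("/")


class DuplicateFilter:
    """Client-side duplicate suppression with a fixed retention window."""

    def __init__(self, retention_seconds: float = 3600.0):
        self.retention_seconds = retention_seconds
        self._seen = {}  # key -> time the data object was first seen

    @staticmethod
    def key(message: dict) -> tuple:
        # relPath is the portable identifier; the hash and size make a
        # false match less likely. "hash" is an assumed field name.
        return (message["relPath"], message["hash"], message.get("size"))

    def is_duplicate(self, message: dict, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        # Drop entries that are older than the retention window.
        cutoff = now - self.retention_seconds
        self._seen = {k: t for k, t in self._seen.items() if t > cutoff}
        k = self.key(message)
        if k in self._seen:
            return True
        self._seen[k] = now
        return False
```

Because baseURI is not part of the key, a message republished by another hub (same relPath, hash and size, different baseURI) is recognised as referring to the same data object, which is the daisy-chain case Peter describes. How long to retain keys remains a local decision, as noted above.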