2022-03-07 ET-W2AT Meeting

🗓Date

07 Mar 2022 14:30-16:30 UTC

👥Participants

ET-W2AT

Other Experts

Kari Sheets

WMO Secretariat

🥅Goals

[Peter] message structure (what's in the "micro-metadata record", what's the max size of a message, when should we embed the data?), file-naming convention
[Jeremy] revisit the WIS2 (meta)data workflows

🗣Discussion topics

Item

Presenter

Notes

1

Peter Silva

Peter presented his slides on the status of the WIS2 TT-Protocols MQP payload:

Minimal sample message
Pubtime & base Url
pubTime - all times in UTC
relPath - matches the topic hierarchy?
Filename - haven't worked on this
Duplicate suppression - machinery to help identify where messages are duplicates
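A minimal sketch of what such a message could look like, to make the size discussion concrete. The field names "baseUrl", "hash" and "size", the timestamp format, and the example URL and path are assumptions for illustration only; the notes do not fix any of them.

```python
import json
from datetime import datetime, timezone


def build_message(base_url, rel_path, checksum, size, pub_time=None):
    """Assemble a hypothetical minimal notification message as a dict."""
    if pub_time is None:
        pub_time = datetime.now(timezone.utc)
    return {
        # All times are in UTC; the compact timestamp format is an assumption.
        "pubTime": pub_time.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ"),
        "baseUrl": base_url,   # assumed spelling of the base URL attribute
        "relPath": rel_path,   # portable reference to the file
        "hash": checksum,      # hypothetical field name
        "size": size,          # hypothetical field name
    }


msg = build_message(
    "https://example.org/data",               # placeholder server
    "surface/aviation/metar/us/example.txt",  # placeholder path
    "sha512:0123abcd",                        # placeholder checksum
    482,
    datetime(2022, 3, 7, 14, 30, tzinfo=timezone.utc),
)

# Compact JSON encoding, to see how large the payload is on the wire.
payload = json.dumps(msg, separators=(",", ":")).encode("utf-8")
print(len(payload), "bytes")
```

Measuring the encoded payload like this is one way to compare candidate message structures against a size budget.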

2

Tom Kralidis

Tom presented a proposal to use a STAC Item as the base construct for our "data availability" message

STAC is a low-level metadata representation for describing file-level objects
A broker providing STAC-Item messages would plug and play into so many existing workflows
Example:
- "geometry" can be null - but the messages could contain fine-grained geometry info
- it's a bigger payload - but it's standardised so increases broad use
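As an illustration of the proposal, the sketch below wraps a file reference in a STAC Item with a null geometry. The top-level members follow the STAC Item specification (version 1.0.0 is assumed here); the helper name, the asset key "data", and the example identifier and URL are hypothetical.

```python
import json


def to_stac_item(item_id, pub_time_iso, href, media_type=None):
    """Build a sparsely populated STAC Item describing one file."""
    item = {
        "type": "Feature",
        "stac_version": "1.0.0",   # assumed version
        "id": item_id,
        "geometry": None,          # null geometry is permitted; bbox is then omitted
        "properties": {"datetime": pub_time_iso},
        "links": [],
        "assets": {"data": {"href": href}},  # "data" is an arbitrary asset key
    }
    if media_type is not None:
        item["assets"]["data"]["type"] = media_type
    return item


item = to_stac_item(
    "example-metar-20220307T1430",                                # placeholder id
    "2022-03-07T14:30:00Z",
    "https://example.org/data/surface/aviation/metar/us/example.txt",
    media_type="text/plain",
)
encoded = json.dumps(item, separators=(",", ":"))
print(len(encoded.encode("utf-8")), "bytes")
```

Even this sparse item carries several structural members beyond the file reference itself, which is where the larger payload comes from.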

3. Discussion

All

Baudouin > is [STAC] 'datetime' equivalent to [WMO] 'pubTime', or are we losing semantics? Also, "assets" allows multiple file granules - this could be abused. If we're only using "half" a standard, then what's the point?

Remy > I'm in favour of standards. I agree with Baudouin that we must use it properly and consider what the benefit is.

Tom > these records will automatically integrate with existing workflows and tools

Peter > STAC records are about 1 kB, and even these were sparsely populated, so we would expect them to grow. WMO messages are about 400 bytes

Tom > metadata isn't just for catalogues, and STAC is extensible [examples provided]

Remy > is there benefit in putting "data" attributes in the message, and how does this deal with embedded data; e.g. a tsunami warning?

Peter > you can embed JSON in other JSON - that's legal, but question is whether that embedded data adds value to people

Tom > could create a STAC extension for handling embedded data

Baudouin > I think that we're trying to shoe-horn our requirement to fit an existing standard, but we talk about setting fields to "null". I think that adding our own fields, creating an extension to deal with embedded data is not really using the standard!

Baudouin > I can use the same logic to say base our message on "cloud events"; this would enable linking with cloud workflows … https://cloudevents.io

Peter > there are hundreds of options that we could look at: Kafka, etc., and so many more!

Jeremy > (Baudouin, Peter): assess if cloudevents is worth a closer look by TT-Protocols

Remy > we're using attributes from Peter's structure, and embedding them in something else, e.g. STAC, so long as the attributes fit into STAC (or something). I think we can defer this decision

Peter > key point is to define the _minimal_ set of records that we need, and permitting "alien" attributes. The size problem will grow as people start to put more data into messages as permitted by STAC

Remy > key point is that we're assessing whether to re-use an existing message structure. We can identify reasons to discard such existing message structures, such as message size

Remy > we need a foundation for WIS2, which means a message needs to define some minimal set of attributes; others could build on the "foundation" and provide a "STAC-compatible" translator to drive other communities' workflows

Jeremy > Why are 'baseURI' and 'relPath' split?

Peter > because of the daisy-chain republication. Combining the hash with the relPath to spot duplicates helps avoid hash collisions; "size" helps too

Jeremy > you're assuming that all data publishers have the same directory structure

Peter > assumption that the relPath matches the topic hierarchy, relPath is a "portable reference" to the file

Remy > what's the need to have the "topic hierarchy" embedded in the message? A subscriber will know that, and we won't change the topic when we republish. Why do we need to embed the topic in the message?

Jeremy > 'retPath' vs 'relPath'? What's the difference?

Baudouin > retPath is for providing the URL for API access

Peter > relPath is a portable reference: retPath (retrieval path) would be unique to a given server; retPath _always_ overrides the relPath. relPath is like a virtual reference - imagining that you were providing the data as files in a folder structure; it's like a key for identifying duplicates. The client software will give you the topic from where the message came from; for example if you're using wildcards to subscribe
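The override rule Peter describes can be sketched as a small helper: use retPath when present, otherwise fall back to relPath, and join the result to the base URL of the server the message came from. The field name "baseUrl" and the example values are assumptions; the notes do not settle how the base URL attribute is spelled.

```python
def download_url(message):
    """Return the URL to fetch the data from, for a message as received.

    retPath, when present, always overrides relPath for retrieval.
    relPath remains the portable key used to identify duplicates.
    """
    base = message["baseUrl"].rstrip("/")
    path = message.get("retPath") or message["relPath"]
    return base + "/" + path.lstrip("/")


# File served from a folder structure that mirrors relPath.
cached = {
    "baseUrl": "https://cache.example.org/data/",
    "relPath": "surface/aviation/metar/us/example.txt",
}

# Same data object served through an API on a different server.
via_api = {
    "baseUrl": "https://api.example.org",
    "relPath": "surface/aviation/metar/us/example.txt",
    "retPath": "/collections/metar/items/example",  # hypothetical API path
}

print(download_url(cached))
print(download_url(via_api))
```

Both messages keep the same relPath, so they can still be recognised as referring to the same data object even though the download locations differ.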

Baudouin > extension? Can we permit use of absolute URLs in retPath? An API endpoint might have a different baseURL. Actually, you can always put re-directs in place - so ignore my request

Baudouin > the 'relPath' can be used to tell a Global Cache where to put a given file - even if the "data file" is downloaded from an API end-point identified by retPath

Remy > Canadian requirement - use the relPath to try to "rsync" different data-pump instances

Baudouin > we could split the relPath in two - the "path" bit, and the "filename" bit

Peter > we avoid this because "/" and "+" are special characters for some MQ implementations. Also, we're "moving" metadata from the GTS filename into the relPath

Kai > what we need is a URL to download the file

Peter > we don't have an absolute URL because of the daisy chaining

Jeremy > can we get consensus that this is acceptable for us all?

Remy > we're designing a foundation for all WIS users, this works for World-Weather-Watch and GTS migration. Would it work for data shared via API? [yes - use retPath].

Would it work for data shared by other communities; ocean, hydrology? yes - use retPath (for the real location of the file), and they need to know the relPath because they need to know the topic that they're publishing their messages to

Jeremy > this is a foundation that will work for all

Remy > only concern is that relPath isn't that meaningful for non-cached content?

Peter > relPath is used for duplicate suppression, we need this portable identifier

Remy > but not all data are in the cache

Baudouin > perhaps rename "relPath" as "data-key"

Jeremy > Duplicate suppression - there are two types: spotting duplicate messages, and spotting slightly different messages (from different sources) that refer to the same data object

Peter > the first is trivial - and encompassed in the second case, so I only worry about the first

Jeremy > but this is only relevant to data that is in the Global Cache

Peter > assumption is that other people will re-publish the data, irrespective of whether the data is in the Global Cache, example: a regional hub; download once for national use - redistribution to different systems. The duplicate message suppression makes this case easy

Baudouin > Can we add the mimetype to the message, e.g. "application/x-grib" or "image/png"?

Jeremy > (Peter): take Baudouin's proposal to TT-Protocols - Can we add the mimetype to the message, e.g. "application/x-grib" or "image/png"?

Kenji > How long do we (wis2node? Global Cache? Consumer?) need to hold all the hash values to check for duplicates? One hour, one day, one week?

Remy > seems like a local decision

Peter > Agreed. Keeping hashes for 1-hour would be a minimum.
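A rough sketch of client-side duplicate suppression with a configurable retention period, defaulting to the one-hour minimum mentioned above. The key built from relPath, hash and size follows the discussion, but the field names "hash" and "size", the class name, and the in-memory store are assumptions for illustration.

```python
import time


class DuplicateFilter:
    """Remember recently seen data keys and reject repeats within a TTL."""

    def __init__(self, ttl_seconds=3600, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock   # injectable so the retention logic can be tested
        self._seen = {}      # key -> time the key was first seen

    def is_new(self, message):
        now = self.clock()
        # Forget keys older than the retention period (a local decision).
        self._seen = {k: t for k, t in self._seen.items() if now - t < self.ttl}
        # baseUrl is deliberately excluded: the same data object may be
        # announced by several servers in a daisy chain.
        key = (message["relPath"], message["hash"], message.get("size"))
        if key in self._seen:
            return False
        self._seen[key] = now
        return True


dedup = DuplicateFilter(ttl_seconds=3600)
first = {"baseUrl": "https://a.example.org", "relPath": "x/y.txt",
         "hash": "sha512:0123abcd", "size": 482}
# Republished copy of the same data object from a different server.
copy = dict(first, baseUrl="https://b.example.org")

print(dedup.is_new(first))
print(dedup.is_new(copy))
```

Because the store is pruned on every call, memory use is bounded by the message rate multiplied by the retention period.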

Remy > is duplicate suppression part of a standard broker?

Peter > duplicate suppression isn't part of a normal broker - it's a client function: It means that the Global Broker will need additional "intelligence" to de-duplicate messages

Remy > so the Global Broker is more than just a MQ broker - it needs additional software: (i) publish messages, (ii) subscribe to other brokers, (iii) de-dupe messages

Jeremy > Filename - use "data identifier" instead of "filename" - it's less GTS centric.

Peter > filename is part of the "relPath" - not a separate attribute.

[Peter describes the topic hierarchy … aka "topic tree"] :

Avoid protocol specific things - use portable concepts
Support for server side filtering via topic hierarchies
Proposed: queue sharing [e.g. being able to run multiple processors on one queue, to parallelise processing of messages on a busy queue]
"Channels" - a few root topics, per audience [?] … (equivalent to Kafka topic and AMQP exchange)
Topic-tree is hierarchical
case sensitive
no spaces in names
simple pattern matching only (anything more sophisticated is [probably] an extension or proprietary)
hierarchical control: like OID, each level "controls" lower levels in the tree; ability to allocate governance to subordinate jurisdictions
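The naming rules and "simple pattern matching" can be sketched as follows. MQTT-style wildcards are assumed here ("+" for exactly one level, "#" for all remaining levels); the notes do not commit to a particular wildcard syntax, and the function names are hypothetical.

```python
def is_valid_topic(topic):
    """Check a published topic: non-empty levels, no spaces, no wildcards."""
    return all(
        level and " " not in level and "+" not in level and "#" not in level
        for level in topic.split("/")
    )


def topic_matches(pattern, topic):
    """Case-sensitive match of a topic against a subscription pattern."""
    pattern_levels = pattern.split("/")
    topic_levels = topic.split("/")
    for i, part in enumerate(pattern_levels):
        if part == "#":
            return True    # matches every remaining level, including none
        if i >= len(topic_levels):
            return False   # pattern is longer than the topic
        if part != "+" and part != topic_levels[i]:
            return False   # literal level differs
    return len(pattern_levels) == len(topic_levels)


print(topic_matches("surface/+/metar/#", "surface/aviation/metar/us"))
print(topic_matches("Surface/#", "surface/aviation/metar/us"))
```

Matching level by level, rather than with regular expressions, keeps the behaviour within what most brokers can do server-side.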

Dave Berry > I believe the WIGOS station ID uses the 3 digit ISO country codes as the issuer of identifier. Would it make sense to use the same for both? [Agreed]

Jeremy > (Peter): update topic tree structure to use 3-letter ISO country codes

"mobile_rgnl_al" - {originating centre code} from WMO Code Tables
"surface/aviation/metar/us" - still a work in progress
File format not in topic-tree - it used to be in the GTS filenaming convention; now this will be a file extension
"Geo extensions" in the topic-tree? … pick this up next week

This section is where we need to look at mapping to dataset granularity.

Remy > this is where we can engage "domain specialists" to help determine the best sub-structure for the topic tree in different domains [agreed]

✅Action items

(Baudouin, Peter): assess if cloudevents is worth a closer look by TT-Protocols
(Peter): take Baudouin's proposal to TT-Protocols - Can we add the mimetype to the message, e.g. "application/x-grib" or "image/png"?
(Peter): update topic tree structure to use 3-letter ISO country codes
"Geo extensions" in the topic-tree? … pick this up next week