2022-03-07 ET-W2AT Meeting

Date

Mar 7, 2022 14:30-16:30 UTC

Participants

ET-W2AT

  • @Jeremy Tandy (Unlicensed)

  • @Rémy Giraud

  • @Dana Ostrenga (Unlicensed)

  • @thorsten.buesselberg (Unlicensed)

  • @Kai Wirt (Unlicensed)

  • @Tom Kralidis (Unlicensed)

  • @peter.silva (Unlicensed)

  • @Kenji Tsunoda (Unlicensed)

  • @Baudouin Raoult (Unlicensed)

Other Experts

  • @Kari Sheets (Unlicensed)

WMO Secretariat

  • @HADDOUCH Hassan

  • @David Berry

  • @Anna Milan

  • @Xiaoxia Chen

Apologies

  • @Henning Weber (Unlicensed)

  • @Li Xiang (Unlicensed)

Goals

  • [Peter] message structure (what's in the "micro-metadata record", what's the max size of a message, when should we embed the data?), file-naming convention

  • [Jeremy] revisit the WIS2 (meta)data workflows

Discussion topics

Item

Presenter

Notes

1

Peter Silva

Peter presented his slides on WIS2 TT-Protocols status (MQP payload):

  • Minimal sample message

  • Pubtime & base Url

  • pubTime - all times in UTC

  • relPath - matches the topic hierarchy?

  • Filename - haven't worked on this

  • Duplicate suppression - machinery to help identify where messages are duplicates
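
For illustration, a minimal sketch of what such a message could look like, built from the attributes listed above. The exact key spellings ("pubTime", "baseUrl", "relPath"), the ISO 8601 time format, and the example URL and path are assumptions, not agreed by the team.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def build_message(base_url: str, rel_path: str,
                  pub_time: Optional[datetime] = None) -> dict:
    """Build a hypothetical minimal notification message.

    Key names and the time format are illustrative assumptions.
    """
    pub_time = pub_time or datetime.now(timezone.utc)
    if pub_time.tzinfo is None:
        # Refuse naive timestamps: all times must be expressed in UTC.
        raise ValueError("pub_time must be timezone-aware")
    return {
        "pubTime": pub_time.astimezone(timezone.utc).isoformat(),
        "baseUrl": base_url,
        "relPath": rel_path,
    }


if __name__ == "__main__":
    msg = build_message("https://example.org/data",
                        "surface/aviation/metar/us/example.txt")
    print(json.dumps(msg, indent=2))
```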

2

Tom Kralidis

Tom presented a proposal to use a STAC Item as the base construct for our "data availability" message

  • STAC is a low-level metadata representation for describing file-level objects

  • A broker providing STAC Item messages would plug into many existing workflows

  •  Example:

    • "geometry" can be null - but the messages could contain fine-grained geometry info

    • it's a bigger payload - but it's standardised, which encourages broad use
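
For illustration, a sketch of a sparsely populated STAC Item with "geometry" set to null. The top-level keys follow the STAC Item structure; the identifier, URL, media type and the asset key "data" are hypothetical placeholders, not values from Tom's slides.

```python
import json


def build_stac_item(item_id: str, datetime_utc: str,
                    href: str, media_type: str) -> dict:
    """Build a sparse STAC Item describing one data file.

    "geometry" is left null, as discussed; a real message could
    carry fine-grained geometry instead.
    """
    return {
        "type": "Feature",
        "stac_version": "1.0.0",
        "id": item_id,
        "geometry": None,
        "properties": {"datetime": datetime_utc},
        "links": [],
        # The asset key "data" is an arbitrary, hypothetical choice.
        "assets": {"data": {"href": href, "type": media_type}},
    }


if __name__ == "__main__":
    item = build_stac_item(
        "example-obs-1",
        "2022-03-07T14:30:00Z",
        "https://example.org/data/example-obs-1.bin",
        "application/octet-stream",
    )
    payload = json.dumps(item)
    # Payload size in bytes, to compare against other message structures.
    print(len(payload.encode("utf-8")))
```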

3. Discussion

All

Baudouin > Is [STAC] 'datetime' equivalent to [WMO] 'pubTime', or are we losing semantics? Also, "assets" allows multiple file granules - this could be abused. If we're only using "half" a standard, then what's the point?

 Remy > I'm in favour of standards. I agree with Baudouin that we must use the standard properly, and consider what the benefit is.

 Tom > these records will automatically integrate with existing workflows and tools

 Peter > STAC records are about 1 kB - and even these were sparsely populated, so I would expect them to grow. The WMO message structure is about 400 bytes

 Tom > metadata isn't just for catalogues, and STAC is extensible [examples provided]

 Remy > is there benefit in putting "data" attributes in the message, and how does this deal with embedded data; e.g. a tsunami warning?

 Peter > you can embed JSON in other JSON - that's legal, but the question is whether that embedded data adds value for people

 Tom > could create a STAC extension for handling embedded data

 Baudouin > I think that we're trying to shoe-horn our requirement to fit an existing standard, but we talk about setting fields to "null". I think that adding our own fields, creating an extension to deal with embedded data is not really using the standard!

 Baudouin > I can use the same logic to say base our message on "cloud events"; this would enable linking with cloud workflows … https://cloudevents.io

Peter > there are hundreds of options that we could look at: Kafka, etc. and so many more!

 Jeremy > (Baudouin, Peter): assess if cloudevents is worth a closer look by TT-Protocols

 Remy > we're using attributes from Peter's structure, and embedding them in something else, e.g. STAC, so long as the attributes fit into STAC (or something). I think we can defer this decision

 Peter > key point is to define the _minimal_ set of records that we need, and permitting "alien" attributes. The size problem will grow as people start to put more data into messages as permitted by STAC

 Remy > key point is that we're assessing whether to re-use an existing message structure. We can identify reasons to discard such existing message structures, such as message size

 Remy > we need a foundation for WIS2, which means a message needs to define some minimal set of attributes; others could build on the "foundation" and provide a "STAC-compatible" translator to drive other communities' workflows

 Jeremy > Why are 'baseURI' and 'relPath' split?

 Peter > because of the daisy-chain republication: the hash is combined with the relPath to spot duplicates, which also guards against hash collisions; "size" helps too

 Jeremy > you're assuming that all data publishers have the same directory structure

 Peter > assumption that the relPath matches the topic hierarchy, relPath is a "portable reference" to the file

 Remy > what's the need to have the "topic hierarchy" embedded in the message? A subscriber will know that, and we won't change the topic when we republish. Why do we need to embed the topic in the message?

 Jeremy > 'retPath' vs 'relPath'? What's the difference?

 Baudouin > retPath is for providing the URL for API access

 Peter > relPath is a portable reference; retPath (retrieval path) would be unique to a given server; retPath _always_ overrides the relPath. relPath is like a virtual reference - imagine that you were providing the data as files in a folder structure; it's like a key for identifying duplicates. The client software will give you the topic that the message came from; for example, if you're using wildcards to subscribe
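
For illustration, a sketch of the resolution rule Peter describes: the download location is the base URL plus retPath when present, otherwise plus relPath. The key names ("baseUrl", "relPath", "retPath"), the plain string concatenation, and dropping retPath on republication are assumptions made for this example.

```python
def resolve_download_url(message: dict) -> str:
    """Return the URL to fetch; retPath always overrides relPath."""
    base = message["baseUrl"].rstrip("/") + "/"
    path = message.get("retPath") or message["relPath"]
    return base + path.lstrip("/")


def republish(message: dict, new_base_url: str) -> dict:
    """Re-announce the same data from another server (daisy chain).

    relPath is kept as the portable reference; retPath is dropped
    because it is specific to the server that issued it (assumption).
    """
    copy = {k: v for k, v in message.items() if k != "retPath"}
    copy["baseUrl"] = new_base_url
    return copy


if __name__ == "__main__":
    original = {
        "baseUrl": "https://origin.example/data",
        "relPath": "surface/aviation/metar/us/example.txt",
        "retPath": "api/items?id=example",
    }
    print(resolve_download_url(original))
    print(resolve_download_url(republish(original, "https://hub.example/cache")))
```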

 Baudouin > extension? Can we permit use of absolute URLs in retPath, an API endpoint might have a different baseURL. Actually, you can always put re-directs in place - so ignore my request

 Baudouin > the 'relPath' can be used to tell a Global Cache where to put a given file - even if the "data file" is downloaded from an API end-point identified by retPath

 Remy > Canadian requirement - use the relPath to try to "rsync" different data-pump instances

 Baudouin > we could split the relPath in two - the "path" bit, and the "filename" bit

 Peter > we avoid this - because "/" and "+" are special characters for some MQ implementations; also, we're "moving" metadata from the GTS filename into the relPath

 Kai > what we need is a URL to download the file

 Peter > we don't have an absolute URL because of the daisy chaining

 Jeremy > can we get consensus that this is acceptable for us all?

 Remy > we're designing a foundation for all WIS users, this works for World-Weather-Watch and GTS migration. Would it work for data shared via API? [yes - use retPath].

Would it work for data shared by other communities, e.g. ocean, hydrology? [yes - use retPath (for the real location of the file); they also need the relPath because they need to know the topic that they're publishing their messages to]

 Jeremy > this is a foundation that will work for all

 Remy > only concern is that relPath isn't that meaningful for non-cached content?

Peter > relPath is used for duplicate suppression, we need this portable identifier

 Remy > but not all data are in the cache

 Baudouin > perhaps rename "relPath" as "data-key"

 Jeremy > Duplicate suppression - there are two types: spotting duplicate messages, and spotting slightly different messages (from different sources) that refer to the same data object

 Peter > the first is trivial - and encompassed in the second case, so I only worry about the first

 Jeremy > but this is only relevant to data that is in the Global Cache

 Peter > assumption is that other people will re-publish the data, irrespective of whether the data is in the Global Cache, example: a regional hub; download once for national use - redistribution to different systems. The duplicate message suppression makes this case easy

 Baudouin > Can we add the MIME type to the message, e.g. "application/x-grib" or "image/png"?

 Jeremy > (Peter): take Baudouin's proposal to TT-Protocols - can we add the MIME type to the message, e.g. "application/x-grib" or "image/png"?

 Kenji > How long do we (wis2node? Global Cache? Consumer?) need to hold all the hash values to check for duplicates? One hour, one day, one week?

 Remy > seems like a local decision

 Peter > Agreed. Keeping hashes for 1 hour would be a minimum.
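
For illustration, a sketch of client-side duplicate suppression keyed on relPath, hash and size, with entries retained for one hour. The key names ("relPath", "hash", "size"), the class name and the in-memory dictionary are assumptions; the retention period is a local decision, as noted above.

```python
import time
from typing import Callable, Dict, Tuple

Key = Tuple[str, str, int]


class DuplicateSuppressor:
    """Remember recently seen data objects and flag repeats."""

    def __init__(self, ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self.clock = clock
        self._seen: Dict[Key, float] = {}

    def is_new(self, message: dict) -> bool:
        """Return True the first time a data object is seen within the TTL."""
        now = self.clock()
        # Forget entries older than the retention period.
        self._seen = {k: t for k, t in self._seen.items()
                      if now - t < self.ttl}
        # The base URL is deliberately not part of the key, so the same
        # object republished by another server counts as a duplicate.
        key = (message["relPath"], message["hash"], message["size"])
        if key in self._seen:
            return False
        self._seen[key] = now
        return True
```

Because the key ignores where the message came from, the same check covers both an exact repeat of a message and a republished message that refers to the same data object.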

 Remy > is duplicate suppression part of a standard broker?

 Peter > duplicate suppression isn't part of a normal broker - it's a client function: It means that the Global Broker will need additional "intelligence" to de-duplicate messages

 Remy > so the Global Broker is more than just an MQ broker - it needs additional software: (i) publish messages, (ii) subscribe to other brokers, (iii) de-dupe messages

 Jeremy > Filename - use "data identifier" instead of "filename" - it's less GTS centric.

 Peter > filename is part of the "relPath" - not a separate attribute.

 [Peter describes the topic hierarchy … aka "topic tree"] :

  1. Avoid protocol-specific things - use portable concepts

  2. Support for server side filtering via topic hierarchies

  3. Proposed: queue sharing [e.g. being able to run multiple processors on one queue, to parallelise processing of messages on a busy queue]

  4. "Channels" - a few root topics, per audience [?] … (equivalent to Kafka topic and AMQP exchange)

  5. Topic-tree is hierarchical

  6. case sensitive

  7. no spaces in names

  8. simple pattern matching only (anything more sophisticated is [probably] an extension or proprietary)

  9. hierarchical control: like OID, each level "controls" lower levels in the tree; ability to allocate governance to subordinate jurisdictions
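
For illustration, a sketch of the naming rules and simple pattern matching listed above. It assumes "/" as the level separator and borrows the MQTT wildcard convention ("+" for one level, "#" for all remaining levels); the notes do not fix a wildcard syntax, so both are assumptions.

```python
def is_valid_topic(topic: str) -> bool:
    """Check the naming rules: non-empty levels and no spaces in names."""
    return all(level and " " not in level for level in topic.split("/"))


def topic_matches(pattern: str, topic: str) -> bool:
    """Case-sensitive, level-by-level match with '+' and '#' wildcards."""
    p_levels = pattern.split("/")
    t_levels = topic.split("/")
    for i, p in enumerate(p_levels):
        if p == "#":
            # '#' matches everything below, but only as the last level.
            return i == len(p_levels) - 1
        if i >= len(t_levels):
            return False
        if p != "+" and p != t_levels[i]:
            return False
    return len(p_levels) == len(t_levels)


if __name__ == "__main__":
    topic = "surface/aviation/metar/us"
    print(is_valid_topic(topic))
    print(topic_matches("surface/+/metar/#", topic))
```

Filtering on whole levels only keeps the matching portable across brokers; anything more sophisticated would be an extension.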

 Dave Berry > I believe the WIGOS station ID uses the 3-digit ISO country codes as the issuer of identifier. Would it make sense to use the same for both? [Agreed]

 Jeremy > (Peter): update topic tree structure to use 3-letter ISO country codes

  •  "mobile_rgnl_al" - {originating centre code} from WMO Code Tables

  •  "surface/aviation/metar/us" - still a work in progress

  •  File format not in topic-tree - it used to be in the GTS filenaming convention; now this will be a file extension

  •  "Geo extensions" in the topic-tree? … pick this up next week

 This section is where we need to look at mapping to dataset granularity.

 Remy > this is where we can engage "domain specialists" to help determine the best sub-structure for the topic tree in different domains [agreed]

Action items

(Baudouin, Peter): assess if cloudevents is worth a closer look by TT-Protocols
(Peter): take Baudouin's proposal to TT-Protocols - can we add the MIME type to the message, e.g. "application/x-grib" or "image/png"?
(Peter): update topic tree structure to use 3-letter ISO country codes
"Geo extensions" in the topic-tree? … pick this up next week

Decision