Pub/sub messaging for WIS2
WIS has several requirements for information to be distributed in real time. WIS2 Principle #6 states that WIS2 "Will add open standard messaging protocols that use the publish-subscribe (pub/sub) message pattern to the list of data exchange mechanisms approved for use within WIS and GTS."
Enabling subscription via "open standard messaging protocols", or asynchronous APIs, replaces two WIS Tech Specs:
WIS Tech Spec #10 downloading files via dedicated networks (real-time push)
WIS Tech Spec #12 downloading files via other methods (delayed-mode push)
We will use pub/sub messaging to notify data consumers about the availability of new data or updates to existing data.
The publish and subscribe pattern describes how information is distributed in an Event-Driven Architecture (EDA):
A message broker (or "broker") is a piece of infrastructure in charge of receiving messages and delivering them to those who have shown interest. Brokers often store messages until they are delivered, which makes EDAs very resilient to failures. Examples of brokers are RabbitMQ, Apache Kafka, Solace, etc.
A publisher (a.k.a. producer) is an application that sends messages to the broker.
A subscriber (a.k.a. consumer) is an application that connects to the broker, registers an interest in a certain type of message, and leaves the connection open so the broker can push messages to it.
A message is a piece of information that's sent by the publishers to the broker, and received by all the interested subscribers.
All brokers support communication through multiple channels. The industry has no common term for these, so you may find them called topics, routing keys, event types, etc. Channels are usually assigned a name or identifier, and it's often good practice to send a single type of message through a channel.
The term “channel” is used throughout this document. Think of it as a place-holder term for protocol-specific terms such as topics, routing keys, event types, etc.
Note that message queues are often used to deliver information point-to-point. While this sounds similar to pub/sub it is a different model that requires the message destinations to be pre-configured. Pub/sub is a consumer-centric model, where subscribers register interest in receipt of messages on specified channels. WIS2 does not have a requirement to route messages to specific locations. It is dependent on applications subscribing to channels in order to receive messages.
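The roles described above can be illustrated with a minimal in-memory sketch. It is not a real broker and uses no WIS2 protocol; the class name `Broker`, the channel names and the message content are all hypothetical, chosen only to show that subscribers register interest in a channel and the broker pushes each message to every interested subscriber.

```python
from typing import Callable, Dict, List

# A handler receives the channel name and the message (hypothetical signature).
Handler = Callable[[str, dict], None]


class Broker:
    """Toy in-memory broker: pushes each message to every subscriber of its channel."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = {}

    def subscribe(self, channel: str, handler: Handler) -> None:
        # Subscribers register interest themselves; the publisher never
        # names a destination.
        self._subscribers.setdefault(channel, []).append(handler)

    def publish(self, channel: str, message: dict) -> int:
        # Deliver to all current subscribers and return how many received it.
        handlers = list(self._subscribers.get(channel, []))
        for handler in handlers:
            handler(channel, message)
        return len(handlers)


broker = Broker()
inbox_a: List[dict] = []
inbox_b: List[dict] = []

# Two independent subscribers on the same (hypothetical) channel.
broker.subscribe("surface-observations", lambda ch, msg: inbox_a.append(msg))
broker.subscribe("surface-observations", lambda ch, msg: inbox_b.append(msg))

delivered = broker.publish("surface-observations", {"id": "msg-1"})
print(delivered, inbox_a, inbox_b)
```

A real broker would add persistence, authentication and network transport; the sketch only shows the direction of the relationships.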
Message format
We must also standardise the message structure. Producer applications must create messages in a consistent way so that subscriber applications understand how to process them. WMO TT-Protocols has already begun the work to develop a message format standard. See here for their draft. The proposed message format is modelled on a notification. A message asserts that new data is available from either an HTTPS or SFTP service, and optionally via other means such as AWS S3 or Azure Storage. The recipient of the message decides whether or not to retrieve the data. TT-Protocols are considering allowing data to be embedded within the message itself under some circumstances. However, a message must ALWAYS refer to the Web location where the data is published.
Provenance is an important aspect to consider: a message should indicate its origin. This may be different to the origin of the data it refers to.
While there is no requirement for "integrity voting" (as per distributed ledger), we do need to ensure that the message has neither been corrupted nor tampered with. End-to-end digital signing and verification should be sufficient for this purpose.
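As a rough illustration of a notification that carries its origin, a pointer to the data, and a means of detecting tampering, consider the sketch below. The field names (`origin`, `url`, `sha256`, `signature`) are hypothetical and are not taken from the TT-Protocols draft. For simplicity the sketch uses an HMAC with a shared key from the Python standard library as a stand-in; true end-to-end digital signing would use an asymmetric scheme so that subscribers hold only a public key.

```python
import hashlib
import hmac
import json


def _canonical(body: dict) -> bytes:
    # Serialise deterministically so signer and verifier hash identical bytes.
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")


def build_notification(origin: str, data_url: str, payload: bytes, key: bytes) -> dict:
    body = {
        "origin": origin,  # who issued the message (may differ from the data's origin)
        "url": data_url,   # where the data is published
        "sha256": hashlib.sha256(payload).hexdigest(),  # integrity of the data itself
    }
    signature = hmac.new(key, _canonical(body), hashlib.sha256).hexdigest()
    return {"body": body, "signature": signature}


def verify_notification(message: dict, key: bytes) -> bool:
    # Recompute the MAC over the body and compare in constant time.
    expected = hmac.new(key, _canonical(message["body"]), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, message["signature"])


def payload_matches(message: dict, payload: bytes) -> bool:
    # After download, check the retrieved data against the advertised checksum.
    return hashlib.sha256(payload).hexdigest() == message["body"]["sha256"]


key = b"demo-shared-key"  # assumption: placeholder key for illustration only
data = b"example observation bytes"
note = build_notification("centre-a", "https://example.org/data/obs-1", data, key)
print(verify_notification(note, key), payload_matches(note, data))
```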
Message broker network topology
Unfortunately, there is no equivalent of a Content Distribution Network (CDN) for event-driven architectures. Therefore, to ensure low-latency, resilient delivery of messages all around the globe we will need to republish messages. To achieve this, simply run an application that subscribes to a channel on a remote broker. As the application consumes messages from that channel, it immediately republishes them via a channel on the local broker.
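The republishing application described above can be sketched as follows. The `Broker` class here is a toy in-memory stand-in rather than a real message broker, and the names `bridge`, `remote` and `local` are hypothetical; a real deployment would use a client library for whichever protocol is chosen.

```python
from typing import Callable, Dict, List

Handler = Callable[[str, dict], None]


class Broker:
    """Toy in-memory broker used only to illustrate republishing."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._subscribers: Dict[str, List[Handler]] = {}

    def subscribe(self, channel: str, handler: Handler) -> None:
        self._subscribers.setdefault(channel, []).append(handler)

    def publish(self, channel: str, message: dict) -> None:
        for handler in list(self._subscribers.get(channel, [])):
            handler(channel, message)


def bridge(remote: Broker, local: Broker, channel: str) -> None:
    """Subscribe to `channel` on the remote broker and republish locally."""

    def republish(ch: str, message: dict) -> None:
        # Republish immediately on the same-named channel of the local broker.
        local.publish(ch, message)

    remote.subscribe(channel, republish)


remote = Broker("remote")
local = Broker("local")
local_inbox: List[dict] = []

# A user near the local broker subscribes there, not at the remote broker.
local.subscribe("surface-observations", lambda ch, msg: local_inbox.append(msg))
bridge(remote, local, "surface-observations")

remote.publish("surface-observations", {"id": "msg-1"})
print(local_inbox)
```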
But how are all these brokers connected? What connects to what? We need to decide who is going to run a broker in the WIS2 architecture. GISCs, DCPCs, NCs?
The topology of the message broker network is a crucial part of the WIS2 architecture.
The more hops, the greater the latency. So, we need to find a workable compromise that balances the number of times a message is republished en route to its final subscriber against resilience and sharing the subscription load from the global community.
TT-Protocols propose a "mesh network" approach, where participating centres will operate their own message brokers and interconnect with other peer-brokers.
[more detail to be added here]
However, it is likely that some Members will lack the capability and/or motivation to operate a message broker. In such cases, NCs or DCPCs may publish messages onto a broker operated by their affiliated GISC. Over time, NCs and DCPCs may choose to operate their own message brokers.
For discussion:
Within its Area of Responsibility, a GISC will:
Publish messages on behalf of an affiliated centre. In this case, once authenticated the affiliated centre would publish messages onto a channel configured for them on the GISC's broker. [process 1: setting up a new channel on the GISC][process 2: authentication/authorization for the centre to publish to the channel].
Re-publish messages from an affiliated centre's message broker. In this case, (I) the affiliated centre publishes messages to a channel on their broker, (II) the GISC subscribes to that channel, and (III) the GISC republishes those messages on a corresponding channel on its broker. [process 1: registering a channel with the GISC so that it knows to subscribe][process 2: harvesting the AsyncAPI description from the affiliated centre so that the GISC can create a corresponding channel][process 3: keeping the GISC's channel description up to date with respect to changes by the affiliated centre]
Furthermore, we also need to ensure resilient, fast publication of messages across the globe. GISCs should re-publish all messages from affiliated centres operating their own brokers. In addition to providing a layer of resilience, this approach consolidates all messages from its Area of Responsibility into a single place, making it simple for users to subscribe to messages from a large number of sources.
For global distribution of messages, GISCs should re-publish messages from peer GISCs. Global message republication is an example of a GISC Service that could be shared.
GISCs will subscribe to the set of channels from all other GISCs and republish messages locally. A fully-meshed architecture is not necessary – only that each GISC ensures that it is publishing messages originating from throughout the WIS. A GISC may subscribe to messages from an intermediate GISC – albeit that this extra hop will add latency to message distribution. [process 1: how does each GISC know which channels to subscribe to?][process 2: how does a GISC gather information for the AsyncAPI to describe a channel it's republishing – and keep this description up to date?]
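To get a feel for how a partial mesh behaves, the sketch below counts the hops a message takes to reach each broker. It rests on two assumptions that are not stated above: every message carries a unique identifier so that a broker can discard copies it has already republished (otherwise mutual subscriptions would loop), and hop count is a reasonable proxy for latency. The broker names and link layout are hypothetical.

```python
from collections import deque
from typing import Dict, Set


def propagate(links: Dict[str, Set[str]], origin: str) -> Dict[str, int]:
    """Return the hop count at which each broker first receives a message.

    `links[x]` is the set of brokers that subscribe to broker `x`,
    i.e. the brokers that republish what `x` publishes.
    """
    hops = {origin: 0}
    queue = deque([origin])
    while queue:
        current = queue.popleft()
        for subscriber in sorted(links.get(current, ())):
            if subscriber in hops:
                # Already seen this message: drop the duplicate.
                continue
            hops[subscriber] = hops[current] + 1
            queue.append(subscriber)
    return hops


# Hypothetical partial mesh: gisc-d is reachable only via gisc-c.
links = {
    "gisc-a": {"gisc-b", "gisc-c"},
    "gisc-b": {"gisc-a", "gisc-c"},
    "gisc-c": {"gisc-a", "gisc-b", "gisc-d"},
    "gisc-d": {"gisc-c"},
}

print(propagate(links, "gisc-a"))
print(propagate(links, "gisc-d"))
```

In this layout a message from gisc-a reaches gisc-d in two hops via the intermediate gisc-c, which is the extra latency the text above refers to.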
Discussion:
Peter Silva:
The proposals provided so far [by TT-Protocols] enable a "mesh" topology, making arrangement of individual nodes just a matter of taste. Each node should connect to more than one other one, a rule that arises naturally if centres want redundancy. That is enough to get a full mesh. Nodes that do not want redundancy can connect to a single other node. I think that will work fine without any further refinements. Members that want info faster will want >>2 connections, and naturally become bridges for others but the propagation delay across the entire mesh will not be such as to pose a problem. This needs to be demonstrated, but it is what I would expect.
GISCs are essentially nodes with >>2 links. More GISCs just mean fewer hops between member nodes. But I fully expect other nodes will have >2 links, and that will just speed up propagation.
I think the WIS currently says that an NC should only connect to one GISC. I think it would be better to allow/recommend two.
Topic hierarchy:
The topic hierarchy, or channel structure, is important. It will enable a user to logically browse through the available channels to find what they need, much like NASA's Global Change Master Directory (GCMD) is often used as a de facto way to organise environmental data. Previously, the GTS Routing Tables provided a similar function and may offer a useful starting point.
Work to be done; currently being led by TT-Protocols. See here.
We should test proposals with the community to ensure they meet the community's needs, making sure that we don’t limit ourselves to the World Weather Watch.
Note: Given that both the metadata record and the pub/sub channels relate to the dataset as an atomic unit, we need to align thinking on this.
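Many brokers let subscribers select parts of a topic hierarchy with wildcards, which affects how the hierarchy should be ordered. The sketch below assumes MQTT-style wildcards (`+` for exactly one level, `#` for the remainder of the tree); this page does not fix a protocol, so treat that as an assumption. The topic strings are hypothetical and do not represent a proposed WIS2 hierarchy.

```python
def topic_matches(pattern: str, topic: str) -> bool:
    """Return True if `topic` matches `pattern` using MQTT-style wildcards.

    Assumes '#' appears only as the final level of the pattern.
    """
    pattern_levels = pattern.split("/")
    topic_levels = topic.split("/")
    for index, level in enumerate(pattern_levels):
        if level == "#":
            # Matches the parent level and everything beneath it.
            return True
        if index >= len(topic_levels):
            return False
        if level != "+" and level != topic_levels[index]:
            return False
    # No multi-level wildcard: both must have the same depth.
    return len(pattern_levels) == len(topic_levels)


# Hypothetical hierarchy with the responsible country at the top.
topics = [
    "ca/observations/surface",
    "fr/observations/surface",
    "fr/observations/upper-air",
]

# One subscription covering surface observations from every country.
print([t for t in topics if topic_matches("+/observations/surface", t)])
# One subscription covering everything published under a single country.
print([t for t in topics if topic_matches("fr/#", t)])
```

With wildcards of this kind, placing the country at the top of the tree does not by itself force one subscription per country.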
Discussion:
Peter Silva:
Looks like more conceptual confusion... a Topic != channel... I don't know what a dataset is, and less why a dataset requires its own channel. For example, there are around 200 countries in the world, and so far we have been told to segregate data by country, so this means we need around 200 channels just to subscribe to a single type of surface obs.
There is a concept in the topic trees (also in the OID) that means that responsibility for data follows the hierarchy, so there is a good reason to have responsible WMO member at the top of the tree. Those hierarchies mean something...
Guessing... if you want a "dataset" then presumably you don't want the country higher in the topic tree?
so you then want data from all sources of a single type. Perhaps you don't want country in the hierarchy at all, and just have channel for "surface observations" which means that anyone connecting to that channel gets the entire world and has to filter out the uninteresting countries client-side...
This is a deep discussion about what the topic hierarchy is for... what it means... it's a good discussion to have.
I think many people have a deep RPC (remote procedure call) view of the world, where the server is answering a question from the client, and so the person asking the question can provide a lot of detail about that question. They do not fully appreciate that publishing is inherently about giving a general answer, that different consumers will want different subsets, and one is publishing for a wider audience that might not have as precisely formulated a question as is required for an RPC. The channel view you are describing sounds like "let's make an RPC call for every possibility" of which there are, in my view, too many to be practical. But I could easily be mistaken... I'm just not confident that approach works for now.
Finding the pub/sub end-point:
Users need to know where they can subscribe to real-time messages in WIS2. They need to be able to find the asynchronous API endpoint.
In WIS we think of data as the first class citizen. Operations on the data (e.g. services such as access/download, view, or subscribe) come second. So a user first finds the data they’re interested in, then the services through which they can interact with that data, including how and where they might subscribe to notifications about [updates to] the dataset. The WIS2 metadata record for the dataset would point to the asynchronous API end-point where a user could subscribe.
Similarly, metadata provided by the asynchronous API endpoint should point to the dataset description (in AsyncAPI, this is done using the External Documentation Object). It's good practice to always refer to the canonical version of the dataset description. In addition, a GISC may also refer to a local copy of the metadata if available.
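The link from the asynchronous API description back to the dataset description might look like the sketch below, which builds a minimal AsyncAPI-style document as a Python dictionary. The layout follows our understanding of AsyncAPI 2.x and should be checked against the specification; the broker URL, protocol, channel name and metadata URL are all hypothetical placeholders.

```python
import json


def describe_endpoint(title: str, broker_url: str, channel: str,
                      dataset_metadata_url: str) -> dict:
    """Build a minimal AsyncAPI-style description pointing at the dataset metadata."""
    return {
        "asyncapi": "2.6.0",
        "info": {"title": title, "version": "0.1.0"},
        "servers": {
            # Assumption: protocol shown for illustration only.
            "default": {"url": broker_url, "protocol": "mqtt"},
        },
        "channels": {
            channel: {
                "subscribe": {"summary": "Notifications about new or updated data"},
            },
        },
        # External Documentation Object: refers to the canonical dataset description.
        "externalDocs": {
            "description": "Canonical metadata record for the dataset",
            "url": dataset_metadata_url,
        },
    }


document = describe_endpoint(
    title="Example notifications",
    broker_url="broker.example.org:8883",
    channel="example/observations/surface",
    dataset_metadata_url="https://example.org/metadata/dataset-1",
)
print(json.dumps(document, indent=2))
```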
Issues to consider:
WIS2 will need to define operational governance, including SLAs. These may differ for different types of message. For example, channels carrying warnings may need to be split out enabling warning messages to be expedited. It may be pertinent for warning messages to convey the data within the message itself to ensure rapid delivery.
Authorization to publish? Or even to subscribe?
The message notification system could also be used to publish updates about catalogs.
When GISCs republish messages from elsewhere, the original metadata provided by the data publisher won't include details of the GISC's additional asynchronous API endpoint, because the data publisher won't know about these additional endpoints when providing their metadata. So should the GISC insert an additional association pointing to the subscription end-point(s) it provides, or maybe even to the end-points provided by all GISCs?
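One possible answer to the last question, offered only as a sketch for discussion, is for the GISC to add its own subscription links to its local copy of the metadata record while leaving the publisher's original untouched. The record structure (`links`, `rel`, `href`) and the endpoint URLs are hypothetical and do not reflect an agreed WIS2 metadata format.

```python
import copy
from typing import List


def add_subscription_links(record: dict, endpoints: List[str]) -> dict:
    """Return a copy of `record` with a subscription link for each endpoint."""
    updated = copy.deepcopy(record)  # never modify the publisher's original
    links = updated.setdefault("links", [])
    known = {link["href"] for link in links}
    for href in endpoints:
        if href not in known:
            # Assumption: 'subscribe' is a placeholder relation name.
            links.append({"rel": "subscribe", "href": href})
            known.add(href)
    return updated


original = {
    "id": "dataset-1",
    "links": [{"rel": "subscribe", "href": "mqtts://broker.centre.example.org"}],
}
local_copy = add_subscription_links(
    original,
    ["mqtts://broker.gisc-a.example.org", "mqtts://broker.centre.example.org"],
)
print(len(original["links"]), len(local_copy["links"]))
```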