2022-04-04 ET-W2AT Meeting

 Date

Apr 4, 2022, 13:30-15:30 UTC

 Participants

  • @Rémy Giraud

  • @Jeremy Tandy (Unlicensed)

  • @peter.silva (Unlicensed)

  • @Baudouin Raoult (Unlicensed)

  • @Kai Wirt (Unlicensed)

  • @thorsten.buesselberg (Unlicensed)

  • @Henning Weber (Unlicensed)

  • @Kenji Tsunoda (Unlicensed)

Other Experts

  • @Kari Sheets (Unlicensed)

  • Andre Fatton (VerneMQ)

WMO Secretariat

  • @Enrico Fucile

  • @HADDOUCH Hassan

  • @Xiaoxia Chen

  • @Timo Proescholdt

  • @David Berry

  • @Anna Milan

Apologies

  • @Tom Kralidis (Unlicensed)

  • @Pablo Loyber (Unlicensed)

  • @Wang Peng (Unlicensed)

  • @sabai.fatima (Unlicensed)

  • @Dana Ostrenga (Unlicensed)

 Goals

  • (1-hour) Discussion with VerneMQ about Global Broker implementation

  • (1-hour) Supporting browsable WAFs for originating centers, further definition of the topic tree (common part)

 Discussion topics

Item

Presenter

Notes

1- Discussion with VerneMQ

ALL

Remy > Previous discussions with VerneMQ covered: comparison with other brokers; WIS2 architecture; and the commercial solution offered by VerneMQ.

 Remy > What would be your implementation architecture for an operational [Global Broker]?

 Andre > It's a natural question, but very difficult to answer. Are there maximums - number of topics, number of clients, message throughput? No - there's no inherent limit; it depends on the compute [and network] resources. Typically, a broker is built to support hundreds of thousands of concurrent connections (250k) and can scale to millions. But 250k connections on a single machine is a point of failure, so a cluster provides resilience. A "VerneMQ Cluster" typically runs co-located in one datacenter. To optimize for latency, start with a physical server that has 4 or 8 (or maybe 16) CPUs to handle the load from concurrent connections. You also need lots of RAM to cope with TCP buffers - start with 16GB RAM for 8 cores. A medium-sized server.

 Remy > in WIS2, we don't expect hundreds of thousands of connections. This isn't IoT. Optimistically, we need to support 1000 Centers.

 Andre > Server size isn't your biggest concern now - this can be changed later. I wouldn't go lower than 2 cores and 4 GB.

 Jeremy > So what should we be thinking about for our Global Brokers?

 Andre > Think about where your likely bottlenecks are. e.g. distribution patterns, such as small input but massive distribution; or the opposite, large collection with simple distribution. Optimize for throughput - not concurrent connections.

Is there a requirement to broadcast messages, e.g. fanning out messages to 10K subscribers? We also need to think about how we bridge between the 5 [Global Brokers].

An MQTT bridge is a re-publisher of messages: subscribe and republish. In VerneMQ it's a plug-in, configured in a config file; updates are applied at reboot of the broker instance.
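
 [note: a minimal sketch of the bridge pattern described above - an MQTT client that subscribes on a source broker and republishes to a destination broker. Written against the paho-mqtt (v1 API) Python library; the hosts and topic filter are placeholder assumptions, and a production bridge (e.g. the VerneMQ plug-in) would add reconnection, session-state and QoS handling.]

```python
import paho.mqtt.client as mqtt

SOURCE_HOST = "source-broker.example.org"            # placeholder
DESTINATION_HOST = "destination-broker.example.org"  # placeholder
TOPIC_FILTER = "#"                                   # bridge everything (illustrative)

# Client connected to the broker we republish into.
destination = mqtt.Client(client_id="bridge-out")
destination.connect(DESTINATION_HOST, 1883)
destination.loop_start()  # background network loop for publishing

def on_connect(client, userdata, flags, rc):
    # (Re-)subscribe whenever the source connection comes up.
    client.subscribe(TOPIC_FILTER, qos=1)

def on_message(client, userdata, msg):
    # Republish each message verbatim on the same topic.
    destination.publish(msg.topic, msg.payload, qos=1)

source = mqtt.Client(client_id="bridge-in")
source.on_connect = on_connect
source.on_message = on_message
source.connect(SOURCE_HOST, 1883)
source.loop_forever()  # blocks, processing incoming messages
```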

 [Peter > a bridge is called a "shovel" in AMQP]

 Remy > We have the need for Global Brokers to "synchronise". Global Brokers are unlikely to be the same implementation, need to make sure the "bridge" isn't part of a proprietary implementation.

 Remy > We might want to implement a bridge as a "black-box" that can be deployed in front of any broker. VerneMQ is open source; you have [consultancy services?] to amend?

 Andre > We sell a subscription to binary packages of the open source software (pre-configured and packaged with monitoring/metrics etc.). There are options for you to manage the packaging yourself if you have experts. CHF2000K / server / year for the subscription - without support. Support agreements are tailored according to user needs, e.g. depending on support SLAs. Support is "expert support" - e.g. help me fix this problem. It's not a managed service - you still need your own dev-ops / IT operations team.

 Remy > So for a potential installation in Europe (a VerneMQ cluster), and maybe another in Asia, how would we deploy that?

 Andre > The subscription is for operational / production nodes. It doesn't matter if it's in Kubernetes etc., and the price isn't based on usage either - only the number of nodes. You should consider clusters of three, allowing one node to be down while still having redundancy; a two-node cluster introduces a single point of failure if one node is down.

 Peter > How does this compare with RabbitMQ?

 Andre > RabbitMQ is older, with a much bigger install base; it also has additional components for operation, e.g. a GUI. VerneMQ has less of that. RabbitMQ supports MQTT via a plugin, and it isn't a complete MQTT implementation. RabbitMQ has 18 services; VerneMQ has 2. RabbitMQ is multi-protocol; VerneMQ is single-protocol, so VerneMQ is more performant.

 Peter > Messages … JSON payload, 500 bytes each, 100 messages/sec incoming

 Andre > That's fine; we also need to think about outgoing message load. A Bridge would be one MQTT client consuming 100 messages/sec - which is fine.

 Peter > Erlang is used in RabbitMQ for configuring bridges; is this the same for VerneMQ?

 Andre > Yes - although VerneMQ was first: simple config, similar approach.

 Jeremy > Do you need to restart instances for config changes to take effect?

 Andre > Some things can be changed dynamically, some plugins can be started dynamically, but this doesn't touch the config file. Dynamic changes also need to be persisted in the config file

 Peter > Is there a volume discount - e.g. for 100 servers!

 Andre > Yes! We can do site-licenses etc. for unlimited.

 Peter > Considering WMO bulk purchase - or country-by-country?

 Baudouin > What features are standard, and which are specific to VerneMQ? If we use different implementations throughout WIS2, then we need to make sure they're all 100% standard and not using proprietary features.

 Remy > We're here talking about a Cluster being used to implement one global broker; e.g. the European global broker.

 Andre > Decisions on standard or not should be based on the MQTT specification. So a Bridge isn't a standard component, but a bridge is really just an MQTT subscriber. It can be considered as a black box that talks standards. A VerneMQ Bridge can open a bridge to a Mosquitto node and vice-versa.

 Peter > Bridges use MQTT to communicate - but configuring a Bridge is different on every broker

Jeremy > Is there a limit on how many topics an MQP broker might support?

 Andre > Example: 100K MQTT clients publishing telemetry, but also each subscribing to an exclusive topic delivering configuration info specific to them. 1000 clients each subscribing to 1000 topics may need some optimization. Try to limit the number of subscriptions; wildcards help here (see the sketch below). Topic structure is an important aspect.
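
 [note: a small illustration of the wildcard point above, using the paho-mqtt Python library; the broker host and topic names are invented for the example.]

```python
import paho.mqtt.client as mqtt

client = mqtt.Client()
client.connect("broker.example.org", 1883)  # placeholder host

# Instead of 1000 exact subscriptions, one per centre:
#   client.subscribe("origin/centre-0001/data", qos=1)
#   client.subscribe("origin/centre-0002/data", qos=1)
#   ...
# use MQTT wildcards: "+" matches exactly one topic level, "#" the remainder.
client.subscribe("origin/+/data/#", qos=1)
client.loop_forever()
```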

 Jeremy > How does VerneMQ deal with durable queues? (e.g. that retain messages for a period of time before expiry)

 Andre > MQTT doesn't have the concept of queues, but VerneMQ does: every consumer [subscriber] has a queue unique to them. Whether the queue is durable or not is configured by the client - [driven around client sessions?] In MQTT5, a message publisher can issue a time-to-live directive - but this isn't much used, and in MQTT 3.1.1 it's not there at all. At the global level, we can force-delete persistent sessions to protect against denial of service; some clients might request persistent sessions, then never reconnect.
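
 [note: a sketch of the MQTT5 time-to-live directive mentioned above (the message expiry interval property), using the paho-mqtt Python library; the host, topic and payload are placeholders. As Andre notes, MQTT 3.1.1 has no equivalent.]

```python
import paho.mqtt.client as mqtt
from paho.mqtt.properties import Properties
from paho.mqtt.packettypes import PacketTypes

client = mqtt.Client(protocol=mqtt.MQTTv5)
client.connect("broker.example.org", 1883)  # placeholder host

# Ask the broker to discard this message if it is still queued
# (e.g. for a disconnected persistent session) after one hour.
props = Properties(PacketTypes.PUBLISH)
props.MessageExpiryInterval = 3600  # seconds

client.publish("notifications/example", b'{"id": "..."}', qos=1,
               properties=props)
```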

 Baudouin > [describes the client recovery case]

 Andre > Yes. We can do this, but there is a necessary global config to limit the number of messages in the queue. Important: a topic is NOT a queue.

 Remy > I see we can download one instance for experimentation? Use this to evaluate and compare with alternatives. We're thinking of having a pre-operational phase of WIS2 later this year.

 Hassan > We have two types of messages to be delivered: data availability messages and, for a few cases, messages with data embedded for priority elements, like tsunami warnings. Is there QoS or prioritization for these priority aspects?

 Andre > There's no concept of prioritization in MQTT. Can help to have priority messages on a different topic. But each subscriber has their own queue. So there's no way to prioritize a message within a queue. The client application needs to read them in order. But this is real-time delivery, so the client application needs to be able to deal with the load!

 Peter > Brokers often perform better with smaller messages. What's small?

 Andre > Agreed. 1KB is a good size; MBs is not. You can also consider batching messages to reduce load on servers. You will see surprisingly low throughput if your messages are MB-sized, because such large messages need to be written to disk. 5KB or 10KB is OK.

 Baudouin > What about messages that are very small? Would that erode performance?

 Peter > The smaller the better? Is that right?

 Andre > Try it and see, but 1KB would be my starting preference. Example: field devices optimize message distribution by "report by exception" - sending updates only when the state has changed, not every second.

 Baudouin > So this may mean that we want to think about batching messages. [But let's see from tests.]
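
 [note: a minimal sketch of the batching idea - accumulating small notification messages and publishing them as one JSON array. The threshold, topic and publish callable are arbitrary assumptions, pending the tests mentioned above.]

```python
import json

BATCH_SIZE = 50  # arbitrary threshold for the example

class Batcher:
    """Accumulates small messages and flushes them as one JSON array."""

    def __init__(self, publish):
        self.publish = publish  # any callable(topic, payload)
        self.pending = []

    def add(self, message: dict):
        self.pending.append(message)
        if len(self.pending) >= BATCH_SIZE:
            self.flush()

    def flush(self):
        if self.pending:
            self.publish("notifications/batched", json.dumps(self.pending))
            self.pending = []
```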

 Remy > We're not talking about using MQTT to transport weather information directly. We're talking about sending messages that refer to files of observation data published routinely; every hour or sometimes more frequently. 500B to 2KB is our range.

 Andre > That sounds good. Would fit well.

2- Message structure

Jeremy

Jeremy > [presents decisions / recommendations from Remy, Tom and Jeremy from previous week, starting with proposal for message structure]

 Jeremy > All date-time values in UTC [agreed]

 Jeremy > Add geographic information for the data to the message for client-side filtering

 Peter > We discussed how to include this information in the message - but not the encoding. It could be embedded in the filename; the filename is in the message.

 Jeremy > Previously we've talked about embedding metadata in filenames as poor practice [it requires people to understand a "micro-format" and to parse filenames to extract this info; custom-parsers need to be provided to validate embedded metadata etc.]

 Jeremy > Add unique message identifier to enable de-duplication of messages.

 Peter > ID wasn't discussed; we don't need a separate ID value because we can use the relPath

 Jeremy > For a message published at different Global Caches, the relPath would be the same but the baseURL would be different. These messages shouldn't be de-duplicated because they refer to different "distributions" of the data object.

 We do need a way to determine that multiple messages (e.g. from originating center and several Global Cache instances) refer to the same Data Object so that Data Consumers don't accidentally download the data multiple times from different locations.
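
 [note: a sketch of the de-duplication idea above - Data Consumers keep a bounded set of recently seen message identifiers and skip repeats. The "id" field follows the GeoJSON proposal under discussion; the cache size is arbitrary.]

```python
from collections import OrderedDict

class Deduplicator:
    """Remembers the last N message IDs; a repeated ID means the message
    refers to a Data Object that has already been handled."""

    def __init__(self, max_ids: int = 100_000):
        self.seen = OrderedDict()
        self.max_ids = max_ids

    def is_new(self, message: dict) -> bool:
        msg_id = message["id"]  # assumed unique per Data Object
        if msg_id in self.seen:
            return False  # e.g. the same data announced by another Global Cache
        self.seen[msg_id] = True
        if len(self.seen) > self.max_ids:
            self.seen.popitem(last=False)  # evict the oldest entry
        return True
```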

 Remy > If people re-publish a file then relPath would be the same, but the data object would be different; how would the de-duplication strategy work here?

 Peter > The original proposal (see Other Fields) uses the "mtime" field (from the Canadian Sarracenia stack) to manage versioning of [data objects]

 Remy > WAF (Web Accessible Folders) is not a requirement for the WIS2 foundation.

 Peter > Agree that WAF isn't mandatory, but it does provide a trivial implementation for this kind of data distribution. Throwing out relPath for message uniqueness is crippling; relPath is a logical name. The physical path (retPath) might be implementation-specific [e.g. where S3 is used]

 Baudouin > You can't browse directories in Amazon S3 - so you couldn't implement a WAF that way.

 Baudouin > Proposal to use a GUID for the ID is good - avoid putting metadata in a filename.

 Peter > retPath (or the fully-qualified URL in the GeoJSON proposal) can't be used for duplicate suppression - because they may vary for different distributions.

 Peter > retPath is optional; for most implementations people use only relPath and don't specify retPath

 Peter > With GeoJSON you could embed multiple links, each to a different data object resource, in one message. How would that work?

 [note: the "links" section appears out of place - should it be within the properties object?]

 Jeremy > We can constrain the GeoJSON to allow only one link, using JSON-Schema. We are still assuming that there will be one message for each data object. So publishing a data object in two formats (e.g. NetCDF and GRIB2) would result in two messages.
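
 [note: a purely illustrative sketch of a GeoJSON notification message constrained to a single link, as Jeremy describes; every field name and value here is an assumption drawn from the conversation, not an agreed structure.]

```python
import json

# Hypothetical notification for one data object in one format (e.g. GRIB2).
message = {
    "type": "Feature",
    "id": "31e9d66a-cd83-4fb9-a9df-d7ad2f1c54a5",  # unique message identifier
    "geometry": None,                              # geometry may be set to null
    "properties": {
        "pubTime": "2022-04-04T13:30:00Z",         # all date-times in UTC
        "integrity": {"method": "sha512", "value": "..."},
    },
    "links": [                                     # JSON-Schema limits this to one
        {"rel": "data", "href": "https://example.org/data/file.grib2"},
    ],
}
print(json.dumps(message, indent=2))
```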

 Baudouin > The version number in the message is also important.

 Peter > The version number is in the topic-hierarchy.

 Baudouin > Experience from working with the NCAR/Unidata community is that (meta)data needs to be self-describing / self-documenting. The version number tells client applications how they should parse the message. Once you take the message off the message broker, you'd lose the version number.

 Remy > We need to be able to version the message structure and the topic hierarchy separately. They need different versions so we can iterate them at different speeds.

 Remy > Let's discuss the merits of relPath and retPath …

 Peter > An issue with the GeoJSON proposal is that geometry, even a polygon, is pretty unbounded. I see examples of Canadian CAP alerts that have 20KB geometry definitions.

 Jeremy > We can use JSON-Schema to limit GeoJSON Polygon to 5 coordinates - all that's needed to describe a bounding box. Key point is that the message would still be GeoJSON and would work with existing toolsets.
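
 [note: a sketch of the JSON-Schema constraint Jeremy describes - a Polygon with one ring of exactly 5 positions, i.e. a closed bounding box - validated with the Python jsonschema package. The schema fragment is illustrative, not an agreed definition.]

```python
from jsonschema import validate

bbox_polygon_schema = {
    "type": "object",
    "properties": {
        "type": {"const": "Polygon"},
        "coordinates": {
            "type": "array",
            "minItems": 1, "maxItems": 1,      # exterior ring only, no holes
            "items": {
                "type": "array",
                "minItems": 5, "maxItems": 5,  # 4 corners + repeated first point
                "items": {
                    "type": "array",           # one [longitude, latitude] position
                    "minItems": 2, "maxItems": 2,
                    "items": {"type": "number"},
                },
            },
        },
    },
    "required": ["type", "coordinates"],
}

# A bounding box expressed as a closed 5-point ring; raises if invalid.
geometry = {"type": "Polygon",
            "coordinates": [[[0, 40], [10, 40], [10, 50], [0, 50], [0, 40]]]}
validate(instance=geometry, schema=bbox_polygon_schema)
```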

 Kai > I like the GeoJSON proposal. I like the provision of a fully-qualified URL. Perhaps we don't need the ID field - we can use the integrity-value field to uniquely identify the message.

 Jeremy > integrity-value provides the hash (checksum) of the data - it doesn't uniquely identify the message.

 Peter > Yes - it's the hash of the data object. Hash-collisions do occur; that's why we include the file size too. Hash + file size is extremely unlikely to provide an incorrect match.
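
 [note: a sketch of the hash-plus-size check described above, using Python's hashlib; reading in chunks keeps memory use flat for large files.]

```python
import hashlib
import os

def fingerprint(path: str) -> tuple[str, int]:
    """Return (sha512 hex digest, size in bytes) for a file; together
    these make an accidental mismatch extremely unlikely."""
    digest = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MB chunks
            digest.update(chunk)
    return digest.hexdigest(), os.path.getsize(path)

def is_same_object(local_path: str, expected: tuple[str, int]) -> bool:
    # A downloaded copy matches only if both hash and size agree.
    return fingerprint(local_path) == expected
```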

 Baudouin > It's poor practice to use a property (or field) for multiple purposes - it's difficult to evolve the semantics of the field later in such a case because any change may affect the secondary usage.

 Baudouin > Sometimes we can't create a hash for the data. For example, if the data is generated dynamically via an API. It's not a predefined file object. Second, ECMWF have examples of very big files that would be computationally expensive to derive hashes for. The separate ID field is very useful.

 Peter > The original proposal aims at replicating functions of the GTS: distributing files. Removing the integrity and relPath fields makes it impossible to verify that data is successfully copied. We know that the original proposal works - we've moved terabytes of data for 5+ years [using Sarracenia].

 Kai > Need to distinguish between files and web services. It's difficult to understand how duplicate suppression [for data] would work for Web services. The data provided through a Web service endpoint may change over time; a hash or checksum doesn't make sense for this.

 Baudouin > For things like the filename in relPath: you shouldn't need to "crack a string" to extract the information people need to understand. Key point is that you should not embed semantics in identifiers; you should treat them as opaque.

 Remy > The question is not that - it's whether the relPath or a GUID is better to support message de-duplication. Which is a better identifier - the "meaningful" relPath or the "opaque" GUID?

 Peter > I don't understand why "topic" is included in the GeoJSON proposal. It's provided in the message envelope, so it doesn't need to be in the message body. Also, the topic string is implementation-dependent: "." separator for AMQP, "/" for others. Topic is protocol-specific.

 [note: once you've taken the message off the broker, you lose information in the message envelope such as the topic]

 Remy > Important to remember we have a one-to-one mapping between topic and dataset. So perhaps we shouldn't have called it topic? We want to identify the categorization of the dataset - the term used in the discovery metadata. We use this string to search for the dataset in the Global Discovery Catalogue. Perhaps "taxonomy" would be better?

 Peter > So you're rebuilding relPath with "topic" and "identifier"?

 Baudouin > You've explained relPath and retPath to us many times, but we still find it difficult to understand. Making things easy to understand is really important for WIS2. We have to make it simple for our users.

 Peter > I don't have a problem re-naming things to help with better communication.

 Remy > The top concern here is whether or not we structure the message based on GeoJSON; then we can talk about naming things and drilling into the details. Of course, if we choose GeoJSON there are some consequences, such as the need for the geometry field.

 Peter > I'm using this solution as a general-purpose file distributor - genomics, astronomy, it doesn't matter. It's weird to have geometry in messages for those domains.

 Remy > First: geometry is required but can be set to null. Second, this is WMO; our domains are weather, water and climate. All this data is spatio-temporal.

 Jeremy > Continue next week. Remember: consensus is concerned with defining something you can live with. Can you make this work, even if it doesn't seem perfect?

 Remy > Also, we have time to iterate. The message structure won't be defined in the Manual. We can try things out and evolve the solution through the pre-operational phase. So we have lots of opportunity to get this right. But we need to start with our best proposal.

 Peter > Split URLs (baseURL plus relPath / retPath) are needed for the Canadian implementation of file redistribution. The separate baseURL is important.

 Remy > WIS2 doesn't need to replicate files.

 Peter > Then this will never work.

 Remy > The aim is to enable people to download the data they need [from URL] - not for everyone to use the same directory structure.

 Peter > We don't have to use the same directory structure everywhere; relPath provides a logical folder structure. The relPath is a portable "logical path" that is the same in every location and can be used to identify and remove duplicate messages talking about the same data object.
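
 [note: a sketch contrasting this with the ID-based de-duplication sketched earlier - the download URL varies per distribution (baseUrl + relPath), while relPath alone keys the duplicate check. Field names follow the Sarracenia-style proposal; the values are invented.]

```python
seen_relpaths = set()

def handle(message: dict) -> str | None:
    """Return a download URL for new data objects, None for duplicates."""
    rel_path = message["relPath"]  # portable logical path, same at every site
    if rel_path in seen_relpaths:
        return None  # same data object announced from another location
    seen_relpaths.add(rel_path)
    return message["baseUrl"].rstrip("/") + "/" + rel_path.lstrip("/")

# Two messages for the same data object from different distributions:
a = {"baseUrl": "https://cache-eu.example.org",
     "relPath": "obs/2022/04/04/file.bufr"}
b = {"baseUrl": "https://cache-asia.example.org",
     "relPath": "obs/2022/04/04/file.bufr"}
assert handle(a) is not None and handle(b) is None
```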

Jeremy > Discussion will be continued next week.

 Action items

 Decisions