CDN for WIS2

Comments:

How emergency backup connections/comms could work in the case of a significant outage of the Internet, DNS, or edge connections, or a DoS attack.

Risk is in two parts:

A) a given cloud/comms network service could be compromised through malware, technical mistakes etc.

B) a connection to the edge of a cloud/comms network service could be damaged by spades, storms, trawlers. This would also include the telephone exchanges through which the connections are made.

Mitigation for (A) would be a backup service with the same or a different provider, or even an in-house facility.

Mitigation for (B) is usually a diversely and separately routed connection to a different access point of the service provider. The cost of (B) and its mitigation varies widely across the world, depending on the level of development and the public/private regime.

 

Risk analysis based on quantifiable information needed.


 Global, low-latency, resilient data distribution in WIS 2.0

(Jeremy Tandy, John Nolan) June 2021

 

WIS 2.0 is predicated on the use of Web technologies. A crucial implication of this is a change in the way that data files (or content) are distributed from Data Collection & Production Centres (DCPCs) and National Centres (NCs).

Instead of distributing data files using point-to-point transfer between Message Switching Systems according to the file header (TTAAii) and Routing Tables, data will be published by the originator (or their delegated agent) via the Web. This may use interactive Web services, such as the Open Geospatial Consortium's OGC-API Features or OGC-API EDR, but at its simplest, publishing files via an HTTP server is sufficient. The global meteorological community, including NMHS from 193 WMO Members, will need to pull data from these HTTP or FTP servers - it will no longer be pushed to them.
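
As an illustration of how little is needed for the simplest case, the following sketch serves a directory of data files over HTTP using only the Python standard library; the directory path and port are illustrative assumptions, and a production deployment would sit behind a hardened web server:

  # Minimal sketch: publish a directory of data files over HTTP for consumers to pull.
  # The directory path and port are illustrative assumptions.
  from functools import partial
  from http.server import HTTPServer, SimpleHTTPRequestHandler

  handler = partial(SimpleHTTPRequestHandler, directory="/data/wis2/public")
  HTTPServer(("0.0.0.0", 8080), handler).serve_forever()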

Providing data access via Web services responds to two WIS Tech Specs: 

  • WIS Tech Spec #10: downloading files via dedicated networks (real-time push)

  • WIS Tech Spec #11: downloading data via non-dedicated networks (pull)

For time-critical data, consumers will subscribe to messages from a Message Broker Network [defined elsewhere]. When new data is published, or existing datasets updated, subscribers will receive a message describing what data has become available, and where it can be accessed - either an HTTPS or SFTP end-point. If the consumer decides it needs this new data, they will request the data.
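
A sketch of this subscribe-then-pull pattern is shown below. It assumes an MQTT broker (via the paho-mqtt client) and a JSON notification carrying the data URL; the broker host, topic, and message layout are illustrative assumptions, not the WIS 2.0 message specification:

  # Sketch of the subscribe-then-pull pattern. Broker host, topic and message
  # layout are illustrative assumptions, not the WIS 2.0 specification.
  import json

  import paho.mqtt.client as mqtt
  import requests

  def on_message(client, userdata, msg):
      notification = json.loads(msg.payload)
      url = notification["url"]              # where the new data can be accessed
      if is_needed(notification):            # the consumer decides whether to fetch
          data = requests.get(url, timeout=30).content
          store(url, data)

  def is_needed(notification):
      return True                            # placeholder decision logic

  def store(url, data):
      pass                                   # placeholder persistence

  client = mqtt.Client()
  client.on_message = on_message
  client.connect("broker.example.int", 1883)
  client.subscribe("wis2/notifications/#")
  client.loop_forever()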

However, we need to ensure that data is delivered to the consumers in a timely manner - particularly where it is an input to safety-critical operations. Data needs to be globally available, and delivery needs to be quick (low-latency) and resilient.

WIS 1 solved this problem using the GISC 24-hour cache, ensuring that all data for global exchange was available within a GISC's Area of Responsibility. This was a reasonable approach at the time, given that the data files were being propagated through the WIS network via the GTS. However, content was only addressable in WIS 1 using the GTS header / GTS filename. Consumers on the GTS could get data pushed to them by amending the Routing Tables, or they could acquire content by searching the WIS Catalogue and then downloading the datasets they found or setting up a subscription to have them delivered via email or FTP - what is referred to in the WIS Tech Specs as "delayed mode push". It's fair to say that the 24-hour cache and delayed mode push added significant cost and complexity to GISC implementation. Furthermore, WIS 1 did not allow consumers to download data simply by resolving a URL for the dataset - as is required by WIS 2.0 Principle #2: "Use Uniform Resource Locators (URL) to identify resources (i.e., Web pages, data, metadata, APIs)". In WIS 1, datasets or data files for global exchange existed in 15 different locations, once in each instance of a GISC cache, and each with a different URL (based on the GISC's hostname) - if one was provided at all. There was no single, canonical URL that could be resolved to download the data.

With WIS 2 intending to provide a Message Broker Network that uses open standard Message Queue Protocols (MQP), consumers no longer have to operate a Message Switch to subscribe to data on the GTS. The use of modern MQPs will democratise real-time data exchange in the meteorological community, and in doing so remove the need for "delayed mode push" subscriptions.

But we do still need to provide global, low-latency, resilient delivery of content that is addressable via a URL. Fortunately, such requirements have long been met using Content Delivery Networks / Content Distribution Networks (CDN).

A CDN consists of Edge Servers distributed throughout the region where content needs to be served. CDNs improve consumer experience by (i) reducing the time taken to load content because it is served from a topologically close location (i.e. low latency), and (ii) routing content from an alternative Edge Server if the preferred one is unavailable (i.e. resilience).

Edge Servers are populated using content from the Origin Server. This is where the content provider publishes everything that they need the CDN to distribute for them.

Edge Servers are accessed by content consumers, while the Origin Server is only used to feed the Edge Servers. The Origin Server is secured and only accessible to the CDN provider. Necessarily, the Origin Server has a different URL from the Edge Server(s), enabling the Internet to route consumer requests to the (preferred) Edge Server, and from the Edge Servers to the Origin Server. For example,

  • the Origin Server may be: https://secure-content-server.metoffice.gov.uk

  • the CDN provider provisions an IP address and/or URL for the content provider to use, e.g. metoffice-content.someCDN.com

  • Consumer Clients use the consumer-oriented URL to retrieve content, e.g. content.metoffice.gov.uk

  • the Content Provider registers the provisioned URL/IP address with a CNAME in their public DNS service to map it to the consumer-oriented URL: e.g. content.metoffice.gov.uk CNAME metoffice-content.someCDN.com
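
A sketch of what a client-side lookup then looks like, following the CNAME from the consumer-oriented name to the CDN-provisioned name and on to an Edge Server address (using the dnspython library; the hostnames are the illustrative ones above and may not resolve in reality):

  # Sketch: follow the CNAME chain for the consumer-oriented name.
  # Hostnames are the illustrative ones from the example above.
  import dns.resolver

  name = "content.metoffice.gov.uk"
  try:
      target = dns.resolver.resolve(name, "CNAME")[0].target
      print(f"{name} CNAME {target}")        # e.g. metoffice-content.someCDN.com
  except dns.resolver.NoAnswer:
      target = name                          # no CNAME; resolve the name directly
  for record in dns.resolver.resolve(target, "A"):
      print(f"{target} A {record.address}")  # address of the (preferred) Edge Server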

Edge Servers do not necessarily provide only the content retrieved from the origin. They may provide value-added services such as format conversion. Some may even provide "edge computation", enabling providers to execute some of their application logic at the edge - very useful for authentication and authorisation!

The content update process may be:

  • Pull/Purge: lazy caching of content pulled from the Origin Server in response to a consumer request. The cached content will expire (i.e. be erased) once the "Time To Live" (TTL) is reached. A content provider may also purge (delete) all cached content from Edge Servers, for example if they want to publish updated content.

  • Push: where the content provider is responsible for pushing content to the Edge Servers. This method is not often used, as most CDN providers consider it sub-optimal: it forces all content to the edge irrespective of whether it is needed.
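
In the Pull/Purge model, the TTL is typically communicated by the Origin Server through standard HTTP caching headers, which the Edge Servers honour; purging, by contrast, is usually done through the CDN provider's own API or console. A minimal sketch of an origin advertising a one-hour TTL (the value and port are illustrative assumptions):

  # Sketch: an Origin Server advertising a TTL to the CDN via the
  # Cache-Control header. The one-hour max-age is an illustrative assumption.
  from http.server import HTTPServer, SimpleHTTPRequestHandler

  class OriginHandler(SimpleHTTPRequestHandler):
      def end_headers(self):
          self.send_header("Cache-Control", "public, max-age=3600")
          super().end_headers()

  HTTPServer(("0.0.0.0", 8080), OriginHandler).serve_forever()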

CDNs also provide value-added services, such as consolidated logging across all Edge Servers, and "traffic scrubbing" to deny bad client behaviour and isolate services from attack.

The Internet itself is a network of networks operated by independent organisations known as Autonomous Systems (AS). They route traffic between nodes in their own networks and other networks to maintain connections between computers. Traffic routing is based on IP Addresses. But we humans prefer (domain) names to numbers, so the Internet also provides DNS to convert Domain Names to the associated IP Addresses so that requests and responses between client applications and servers can be routed (see below for a primer on DNS). Of course, the aim of a CDN is to route client requests to the closest Edge Server, thereby providing faster data access. This is one place where the clever stuff happens.

There are two mechanisms that can be used to route requests from the Consumer Client to the Edge Server that is closest to them:

  • DNS Redirection: the CDN's Authoritative Nameserver provides the IP Address of the server topologically closest to the consumer. This is sometimes called "geo routing" or "smart routing". The geographic location may be determined from the IP Address of the DNS Resolver or Consumer Client. GeoIP data is available from organisations such as MaxMind (e.g. GeoIP2 Country Database for USD 24/month). For more information, see: Limitations of DNS-based geographic routing - Edge Cloud (edge-cloud.net).

  • Anycast Routing (IP Anycast): Autonomous Systems (AS) share information about the topology of their networks using the Border Gateway Protocol (BGP). This information is used to configure the routers to direct Anycast packets to the topologically closest servers. Multiple servers in the network share the same IP Address, and the Internet dynamically routes from the Consumer Client to the Edge Server determined to be topologically closest using the information shared over BGP. This means that if an Edge Server becomes unavailable, the network seamlessly re-routes to the next closest Edge Server with no change to the information sent by the client.
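
The effect of DNS Redirection can be illustrated by asking different public recursive resolvers for the same CDN hostname; a geo-routing Authoritative Nameserver may return different Edge Server addresses depending on where the query appears to originate. A sketch using dnspython (the hostname is the fictitious one from the earlier example):

  # Sketch: compare the answers that different public resolvers receive for the
  # same CDN hostname. The hostname is fictitious (from the example above).
  import dns.resolver

  hostname = "metoffice-content.someCDN.com"
  for resolver_ip in ("8.8.8.8", "1.1.1.1", "9.9.9.9"):
      resolver = dns.resolver.Resolver(configure=False)
      resolver.nameservers = [resolver_ip]
      answer = resolver.resolve(hostname, "A")
      print(resolver_ip, [record.address for record in answer])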

CDNs are available from a variety of vendors, ranging from pure-play (e.g. Akamai) through to cloud vendors (e.g. AWS, Azure, Google) and network providers (e.g. Verizon, BT). There are even options if you wish to operate your own private CDN (e.g. like Netflix) - see below.

Choosing a CDN depends on several factors:

  • Distribution of Points of Presence (are they in the locations you anticipate traffic?)

  • Cost structure

  • Connection approach (DNS redirect and/or Anycast?)

  • Content Update Process (Pull/Purge vs. Push)

To some extent, participation in peer-to-peer networks (e.g. BitTorrent) provides an alternative content delivery function. However, these peer-to-peer distribution mechanisms will prioritise "popular" traffic. It is unlikely that this would meet the performance requirements needed for the global meteorological community.

In WIS there are 15 GISCs and several hundred content providers (DCPCs and NCs), each of which is affiliated with a GISC. Each DCPC and NC will have their own Origin Servers that will need to be registered with a CDN.

Technically, the WIS can be considered a single entity, so we may consider using a single CDN to distribute content from every participating centre. Such a solution may be achieved either by:

  1. Procuring a single service covering all content distribution in WIS. However, the Cache-in-the-Cloud project failed because we were unable to gain the support necessary to tender for a single global service.

  2. All 15 GISCs collaborating to deliver a "federated CDN" - defined by Cisco as a "multi-footprint, open CDN capabilities built from resources owned and operated by autonomous members". Each GISC would be responsible for low-latency, resilient distribution in their region of data from every content provider participating in WIS. Effectively, a GISC would be providing a Point-of-Presence (PoP) through which data from every participating centre could be accessed. This would require each GISC to manage content distribution from hundreds of Origin Servers, and for each DCPC or NC to register with the CDN service of each GISC.

A simpler alternative (option 3) would be for each GISC to provide a CDN service with global reach to its affiliated centres - either via a commercial provider, or by building their own. Other than performance, there should be no difference in the behaviour of the CDN service provided by each GISC - content delivery, at least at the basic file distribution level, is very commoditised.

We can discount option (1) as not achievable under the current governance regime in WMO. The outcomes for options (2) and (3) are identical: both should provide global, low-latency resilient content distribution of data from WIS Centres. However, option (3) benefits from significantly reduced configuration for all parties, with interaction only needed between the GISC and centres in its Area of Responsibility.

GISCs could set up their own Content Delivery Network using Anycast routing or an Authoritative Nameserver providing (topologically close) IP addresses for Edge Servers they maintain. Open source components exist such as:

Note that DNS is also provided as a service; see Best free and public DNS servers in 2021 | TechRadar. Amazon Route 53 (pricing) is an example of a DNS service that provides geo-routing out of the box.

For a GISC building a private CDN, Edge Cache instances could be deployed around the world using various hosting providers.

It is conceivable that GISCs may be asked to host Edge Servers for their peers, given the geographic distribution of the 15 GISCs. However, such an arrangement would require a fit-for-purpose security / trust model to be implemented between GISCs - either to allow the remote GISC to deploy and maintain an Edge Server on the local GISC's estate, or for the local GISC to maintain an Edge Server configured to reach back to the Origin Servers of the centres affiliated with the remote GISC. Either way, this is a complexity that could be avoided by outsourcing CDN to a commercial provider.

While building your own CDN solution is possible, it seems uneconomical for GISCs to invest in this infrastructure themselves given the commercial availability of CDN services (see costings below).

[Q/ How to express the performance requirements for content delivery to ensure that each GISC is meeting the needs of the WMO community?]

So - how would option (3) work? Take a fictitious example where Met Eireann operate a National Centre that publishes content via the following subdomain:

wmo-data.met.ie

This is the "consumer oriented" DNS name that people will use to access the data.

Met Eireann is affiliated with GISC Exeter, operated by the Met Office. Let's assume that the Met Office have procured a commercial CDN service: someCDN.

The Met Office team work with their CDN provider and Met Eireann to configure the content distribution:

  1. Register Met Eireann's Origin Server (https://secure-content-server.met.ie) with the CDN provider and establish a secure, private connection between Met Eireann and someCDN.

  2. Provision a subdomain for Met Eireann with the CDN provider for their Edge Server (meteireann-content.someCDN.com)

  3. Register the provisioned URL of the Edge Server with a CNAME in Met Eireann's public DNS service to map it to their consumer oriented DNS name (wmo-data.met.ie CNAME meteireann-content.someCDN.com)
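
A sketch of how GISC Exeter or Met Eireann might check that step 3 is in place and that content is reachable through the CDN; the hostnames are the fictitious ones from this example, and the test path and cache header are assumptions (response headers vary by CDN provider):

  # Sketch: verify the CNAME mapping and fetch a test object through the CDN.
  # Hostnames are fictitious; the path and cache header are assumptions.
  import dns.resolver
  import requests

  consumer_name = "wmo-data.met.ie"
  expected_cname = "meteireann-content.someCDN.com."

  cname = str(dns.resolver.resolve(consumer_name, "CNAME")[0].target)
  assert cname == expected_cname, f"unexpected CNAME: {cname}"

  response = requests.get(f"https://{consumer_name}/healthcheck", timeout=10)
  print(response.status_code, response.headers.get("x-cache", "no cache header"))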

The Met Office manages the relationship with someCDN on behalf of all the centres affiliated with GISC Exeter, monitoring traffic volumes to ensure that the costs of distributing data from affiliated centres remain within the anticipated cost envelope.


CDN Costs / Pricing

(enabling a cost assessment of commercial CDN vs. "home build")

(GISC cache size = approx. 100 GB/day, i.e. about 3 TB/month?? But what are the access metrics, e.g. total download volume and number of requests?)

Could we get AWS, Azure, Google, or Alibaba to provide CDN for safety-critical data for free (or at very low cost, similar to the sat-comms arrangement for pushing data from Argo floats)? As this is a commodity service, there is no risk of vendor lock-in.

 

AWS CloudFront pricing:

CDN Pricing | Free Tier Eligible, Pay-as-you-go | Amazon CloudFront

 

AWS Free Usage Tier

As part of the AWS Free Usage Tier, you can get started with Amazon CloudFront for free. Upon sign-up, new AWS customers receive 50 GB of Data Transfer Out, 2,000,000 HTTP and HTTPS Requests, and 2,000,000 CloudFront Function invocations each month for one year.

 

AWS On-demand pricing (approx.):

Regional data transfer out to the Internet - USD 0.1 / GB

HTTPS requests - USD 0.015 per 10,000 requests
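
A back-of-envelope sketch combining these rates with the cache-size estimate above (~100 GB/day); the monthly request count is an assumption, since access metrics are not yet known:

  # Back-of-envelope sketch using the approximate AWS CloudFront rates above.
  # The request count is an assumption; no access metrics are available yet.
  monthly_transfer_gb = 100 * 30                 # ~100 GB/day from the GISC cache estimate
  transfer_cost = monthly_transfer_gb * 0.10     # USD 0.1 per GB out to the Internet

  monthly_requests = 10_000_000                  # assumed order of magnitude
  request_cost = (monthly_requests / 10_000) * 0.015

  print(f"transfer: USD {transfer_cost:,.0f}/month, requests: USD {request_cost:,.0f}/month")
  # -> transfer: USD 300/month, requests: USD 15/month (per GISC, order of magnitude only)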

 

CDN Pricing & Features Comparison 2021 | CDN77.com


DNS Primer:

 

Resolving hostnames via DNS is a complex business involving four types of DNS server: the DNS Recursor (operated by the consumer's ISP or another designated party), the Root Nameserver, the TLD Nameserver, and finally the Authoritative Nameserver (a public DNS server, operated by the CDN provider or Content Provider, that responds to DNS requests for specific hostnames).

 

  • DNS recursor - The recursor can be thought of as a librarian who is asked to go find a particular book somewhere in a library. The DNS recursor is a server designed to receive queries from client machines through applications such as web browsers. Typically the recursor is then responsible for making additional requests in order to satisfy the client’s DNS query.

  • Root nameserver - The root server is the first step in translating (resolving) human readable host names into IP addresses. It can be thought of like an index in a library that points to different racks of books - typically it serves as a reference to other more specific locations.

  • TLD nameserver - The top level domain (TLD) nameserver can be thought of as a specific rack of books in a library. This nameserver is the next step in the search for a specific IP address, and it hosts the last portion of a hostname (in example.com, the TLD is "com").

  • Authoritative nameserver - This final nameserver can be thought of as a dictionary on a rack of books, in which a specific name can be translated into its definition. The authoritative nameserver is the last stop in the nameserver query. If the authoritative name server has access to the requested record, it will return the IP address for the requested hostname back to the DNS Recursor (the librarian) that made the initial request.

As complex as this is, DNS resolution is fast:

  • Recursive DNS resolvers do not always need to make multiple requests in order to track down the records needed to respond to a client; caching helps short-circuit the necessary requests by serving the requested resource record from earlier in the DNS lookup (see the timing sketch at the end of this primer).

  • Operators of recursive resolvers, such as Google DNS, OpenDNS, and ISPs, maintain data-centre installations of DNS recursive resolvers. These resolvers allow for quick and easy queries through optimised clusters of DNS-optimised computer systems.

  • The DNS nameservers of some CDN providers may also be infrastructure-level nameservers that are integral to the functioning of the Internet, for example, the root level DNS nameserver infrastructure components responsible for the billions of Internet requests per day. The Anycast network used by CDN providers such as Cloudflare puts them in a unique position to handle large volumes of DNS traffic without service interruption.
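
To illustrate the caching point above, the following sketch times the same lookup twice against a single public resolver; the second answer is typically served from the resolver's cache and returns noticeably faster (the hostname and resolver are illustrative):

  # Sketch: time the same lookup twice; the second is usually answered from
  # the recursive resolver's cache. Hostname and resolver are illustrative.
  import time

  import dns.resolver

  resolver = dns.resolver.Resolver(configure=False)
  resolver.nameservers = ["8.8.8.8"]

  for attempt in ("cold", "warm"):
      start = time.perf_counter()
      resolver.resolve("example.com", "A")
      elapsed_ms = (time.perf_counter() - start) * 1000
      print(f"{attempt} lookup: {elapsed_ms:.1f} ms")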