File-based data distribution

A CDN (see CDN for WIS2) is just one of many mechanisms for achieving global, resilient, low-latency file distribution. A more incremental approach would be for GISCs (or a subset of GISCs) to host a copy of files shared for global distribution; e.g. as a cache. One drawback is that each copy of a file could have its own URL. This is not perfect, but it could be handled by notification messages that include either multiple URLs (one for each copy) or a URL template [RFC6570 URI Template], allowing the client to choose the option that suits it.
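To make the multiple-URL and URL-template options concrete, here is a minimal Python sketch of how a client might derive one URL per copy and pick the one it prefers. The notification field names (`url_template`, `hosts`, `path`) and the `example.org` hosts are assumptions made for illustration; no such message format is defined here. Only the two simplest RFC 6570 forms, `{var}` and `{+var}`, are handled.

```python
import re
from urllib.parse import quote, urlsplit

# Characters left unencoded by RFC 6570 reserved expansion ({+var}).
RESERVED = ":/?#[]@!$&'()*+,;="

# Hypothetical notification payload; the field names are illustrative only.
notification = {
    "url_template": "https://{host}/data/{+path}",
    "hosts": ["cache-a.example.org", "cache-b.example.org"],
    "path": "met.example/obs/file.bufr",
}


def expand(template, variables):
    """Expand {var} (simple) and {+var} (reserved) expressions only.

    This is a small subset of RFC 6570, not a full implementation.
    """
    def substitute(match):
        operator, name = match.group(1), match.group(2)
        safe = RESERVED if operator == "+" else ""
        return quote(str(variables[name]), safe=safe)

    return re.sub(r"\{(\+?)([A-Za-z0-9_]+)\}", substitute, template)


def candidate_urls(message):
    """Return one URL per host that holds a copy of the file."""
    return [
        expand(message["url_template"], {"host": host, "path": message["path"]})
        for host in message["hosts"]
    ]


def choose_url(urls, preferred_hosts):
    """Pick the first URL served by a preferred host, else the first URL."""
    for host in preferred_hosts:
        for url in urls:
            if urlsplit(url).hostname == host:
                return url
    return urls[0]
```

A client close to the second cache could call `choose_url(candidate_urls(notification), ["cache-b.example.org"])`. How a client ranks hosts (proximity, measured latency, policy) is left open.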

GISCs may assist their affiliated NCs or DCPCs by publishing data files for them, providing resilient and fast access on their behalf[1]; e.g. using Web Accessible Folders (WAF) to allow upload of files. In this case, data files would be served from the GISC, meaning that URLs would use the GISC's domain[2]. The GISC would organise the data so that contributions from affiliated centres are partitioned; e.g. gisc.metoffice.gov.uk/data/met.ie/{…}. Publishing under their own domain instead makes the origin of the data more obvious (attribution!), which might motivate NCs and DCPCs to self-publish.
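The partitioning described above can be sketched as a small helper that maps an uploaded file to a URL under the contributing centre's own prefix. This is only an illustration: the `/data/<centre>/` layout follows the example in the text, while the function name, the use of `https`, and the relative path passed in are assumptions.

```python
from posixpath import normpath
from urllib.parse import quote


def partitioned_url(gisc_domain, centre_domain, relative_path):
    """Build a URL for a file hosted by a GISC on behalf of a centre.

    The path is normalised first so that ".." segments cannot climb out
    of the centre's partition.
    """
    cleaned = normpath("/" + relative_path).lstrip("/")
    if not cleaned:
        raise ValueError("relative_path must name a file")
    return f"https://{gisc_domain}/data/{centre_domain}/{quote(cleaned)}"
```

For example, `partitioned_url("gisc.metoffice.gov.uk", "met.ie", "obs/file.bufr")` yields a URL under `gisc.metoffice.gov.uk/data/met.ie/`, where `obs/file.bufr` is a hypothetical file name.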

Advanced NCs or DCPCs, e.g. ECMWF, may themselves choose to publish their data files via a CDN because it improves the distribution of their data. The mechanism used to distribute files is up to each centre.

Issues to consider:

  1. Need to define an SLA for file distribution; e.g. service up-time, response time, etc. This will help NCs decide on their preferred implementation.

    1. [Remy] Yes but when to work on this? Probably not for step 1.

  2. Dealing with big files - splitting them into multiple parts for easy download? What are the best practices here?

    1. [Remy] This is a topic for the “visionary” paper! Up to now, 1 file = 1 download works.

    2. [Jeremy] See W3C Data on the Web Best Practices - #17 Provide Bulk Download, and #18 Provide Subsets for Large Datasets
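As one possible illustration of splitting a big file into parts, the sketch below computes the byte ranges a client could request with standard HTTP `Range` headers. This is not a recommendation: it assumes the server supports range requests, and the function names and the part size chosen by the caller are hypothetical.

```python
def byte_ranges(total_size, part_size):
    """Split a file of total_size bytes into inclusive (start, end) ranges."""
    if total_size < 0 or part_size <= 0:
        raise ValueError("total_size must be >= 0 and part_size must be > 0")
    return [
        (start, min(start + part_size, total_size) - 1)
        for start in range(0, total_size, part_size)
    ]


def range_header(start, end):
    """Format an inclusive byte range as an HTTP Range header value."""
    return f"bytes={start}-{end}"
```

Each range can be fetched independently and the parts concatenated in order. Whether this is preferable to "1 file = 1 download" remains an open question.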

 


 [1] This could also be done by publishing via a cloud vendor; e.g. AWS, Azure, Google Cloud Platform.

 [2] It is possible for an NC or DCPC to allocate a subdomain to their GISC - but this would add significant complexity.