WIS 2.0 Demonstration Project: The GCW Data Portal
For a short, creative teaser of the GCW Data Portal, please see the following presentation.
Introduction
GCW fosters international coordination and partnerships with the goal of providing authoritative, clear and usable data, information and analyses on the past, current and future state of the cryosphere. It builds sustained and mutually beneficial partnerships between research and operational institutions by linking research with operations, and scientists with practitioners. This matters because most available cryospheric data come from the scientific community. These data are generally managed by research institutes, which often have neither the infrastructure, the resources, nor the mandate to enable FAIR data management, a prerequisite for discovery and interoperability at the data level. As a result, the data do not fit into standardized systems or dataflows for broader data access and exchange (such as those of the WMO) and have therefore been unavailable for operational meteorological and climate applications. This lack of standardization also impairs the reuse of data within the scientific community. GCW bridges this gap through the GCW Data Portal and its software stack, which transform sparsely documented and highly variable data into standardized, well-documented data suitable for downstream applications with data-level interoperability.
The GCW Data Portal is the entry point to datasets describing the cryosphere and forms the information basis for the assessment activities of GCW. Its web interface gives access to the GCW Data Portal catalogue, which holds information about datasets in the form of discovery metadata supplied by the data providers (or the host data centre). These discovery metadata are harvested on a regular basis from the data centres managing the data on behalf of the owners/providers of the data.
The GCW Portal is also the interface for GCW metadata to the WMO Information System (WIS) and the WMO Integrated Global Observing System (WIGOS). GCW data management follows a metadata-driven approach in which datasets are described by discovery metadata exchanged between the contributing data centres and the GCW catalogue. The GCW Portal will facilitate real-time access to data through the Internet and the WMO GTS, as requested by the user community. This requires a certain level of interoperability at the data level in addition to the metadata level. On the GTS, WMO formats (BUFR and GRIB) are required, and the GCW Portal can transform data into these formats during dissemination, provided contributing data centres follow the required standards for documentation and interfaces to data.
GCW Data Portal Specifications
To satisfy user requirements, the NetCDF file format has been chosen, with the Climate and Forecast (CF) convention for the metadata. NetCDF provides a standard file format that can be read by many different applications while remaining compact and efficient for handling large amounts of data. The CF-1.6 convention provides standard names for the different meteorological parameters, as well as the units and other metadata fields, allowing an application to read and interpret the data without any manual action.
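As a rough sketch of what CF-1.6 metadata look like in practice, the Python dictionaries below mimic the attributes a NetCDF variable would carry. The values are illustrative only; `standard_name` entries come from the CF standard name table and `units` strings follow UDUNITS.

```python
# Illustrative CF-1.6 attributes for a measured variable. With these in
# place, an application can identify the quantity and its units without
# any manual action.
air_temperature = {
    "standard_name": "air_temperature",  # from the CF standard name table
    "units": "K",                        # UDUNITS-compatible unit string
    "long_name": "2 m air temperature",  # free-text label for humans
}

# The time coordinate is likewise self-describing:
time = {
    "standard_name": "time",
    "units": "seconds since 1970-01-01 00:00:00 +00:00",
    "calendar": "standard",
}
```

Because the `standard_name` and `units` vocabulary is controlled, a reader can, for example, convert or plot the variable purely from these attributes.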
A processing engine converts the raw data provided by the data producers into NetCDF-CF standard files with NetCDF Attribute Convention for Dataset Discovery (ACDD) metadata. The ACDD standard provides standard search metadata describing the data origin and the spatial and temporal coverage. The data portal web front end harvests the metadata necessary for its search engine through an OPeNDAP server, so no manual editing of the metadata is necessary. Furthermore, no data are stored on the data portal web front end; data are only requested on demand from a backend. When a user downloads data from the web portal, the portal retrieves the requested data through the OPeNDAP server. The OPeNDAP client/server architecture allows datasets to be subset temporally, spatially, or by variable. The search for scientific parameters is currently based on the GCMD Science Keywords.
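To illustrate the kind of ACDD discovery metadata the search engine harvests, the sketch below lists a few typical ACDD global attributes. The dataset title, creator, and coordinates are invented for the example and do not describe a real station.

```python
# Illustrative ACDD global attributes describing a dataset's origin and
# its spatial and temporal coverage; a search engine can be populated
# from fields like these without manual editing.
acdd_global_attrs = {
    "title": "Snow depth at a CryoNet station",        # hypothetical dataset
    "summary": "Hourly snow depth observations from an automatic station.",
    "creator_name": "Example Data Centre",             # assumption, not a real centre
    "time_coverage_start": "2015-01-01T00:00:00Z",
    "time_coverage_end": "2017-12-31T23:00:00Z",
    "geospatial_lat_min": 46.8, "geospatial_lat_max": 46.8,
    "geospatial_lon_min": 9.8,  "geospatial_lon_max": 9.8,
    "keywords": "EARTH SCIENCE > CRYOSPHERE > SNOW/ICE > SNOW DEPTH",  # GCMD
}
```

The `keywords` field shows how the GCMD Science Keywords mentioned above would be carried in the file itself.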
The GCW/SLF Open Source Software Package
GCW depends on a number of observing stations (CryoNet stations) to feed the GCW value chain with observations, and it requires both real-time and archived data. In the period 2015-2017, GCW worked with the WSL Institute for Snow and Avalanche Research (SLF) to set up interoperability with the WSL/SLF data centre, which is responsible for one of the CryoNet stations. WSL/SLF has kindly agreed to make the software stack they have developed available to the wider community, and all projects are now available under open source licenses. The provided software tools process and manage data at various stages of the "data cycle", from sensor to published dataset.
MeteoIO
The core element in the software package is the data preprocessor MeteoIO, which takes data from the sensor through a quality-control procedure and into standardised NetCDF/CF files that can be published. MeteoIO was originally developed to provide robust meteorological forcing data to an operational model that forms part of the avalanche forecast at the SLF. However, it also happens to be very good at reading diverse data sources and producing a standardised output. Its modular architecture makes it flexible and fast to develop new use cases. It can handle both gridded and time series data, offers various functions for cleaning/processing data to various quality standards, and produces QA reports. MeteoIO is a C++ library.
MeteoIO goes through several steps to prepare the data, aiming to offer within a single package all the tools required to bring raw data to an end data consumer. First, the data are read by one of the more than twenty available plugins, each supporting a different format or protocol (such as CSV files, NetCDF files, databases, or web services). Then some basic data editing can be performed (such as merging stations that are next to each other or renaming sensors). The data can then be filtered by applying a stack of user-selected generic filters; these either remove invalid data (e.g. despiking, low- and high-pass filters) or correct the data (e.g. precipitation undercatch correction, debiasing, Kalman filtering). Once this is done, the data are resampled to the requested time steps by various temporal interpolation methods. Throughout this whole process, MeteoIO works with any sampling rate, including variable sampling rates, and can resample to any point in time. If data points are still missing at the requested time steps, data generators can produce values from either parametrizations (such as converting a specific humidity into a relative humidity) or very basic strategies (such as generating null precipitation to fill gaps). Finally, the data are either forwarded to the data-consuming application or written back by a user-selected plugin.
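MeteoIO itself is a C++ library, but the filtering and resampling stages described above can be sketched in a few lines of Python. This is an illustration of the pipeline logic only; the function names, threshold, and sample values are invented for the example and do not correspond to MeteoIO's API.

```python
def despike(series, max_jump):
    """Filter stage: drop values that jump more than max_jump from the
    previous accepted value (a crude despiking check)."""
    out, last = [], None
    for t, v in series:
        if last is None or abs(v - last) <= max_jump:
            out.append((t, v))
            last = v
    return out

def resample_linear(series, t):
    """Resampling stage: linear interpolation to an arbitrary time t,
    independent of the input sampling rate."""
    for (t0, v0), (t1, v1) in zip(series, series[1:]):
        if t0 <= t <= t1:
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    return None  # still a gap: left for a data generator to fill

# Raw sensor data with an obvious spike at t=2 (values are made up):
raw = [(0, 1.0), (1, 1.2), (2, 9.9), (3, 1.4)]
clean = despike(raw, max_jump=2.0)    # spike removed -> variable sampling rate
value = resample_linear(clean, 1.5)   # value at a requested time step -> 1.25
```

Note how the resampler does not care that despiking left an irregular time step behind, mirroring MeteoIO's ability to resample variably sampled data to any point in time.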
For the MeteoIO git, please click here.
EnviDat
In order to publish discovery metadata for the data prepared through MeteoIO, software developed through the EnviDat project is used. EnviDat is WSL/SLF's main CKAN-based data portal and metadata repository. Core CKAN has been extended to cover specific requirements of research data management, including an OAI-PMH server, DOI publishing, and support for metadata standards. The advantage of CKAN is that it provides a robust and intuitive UI for structured metadata submission. This enables large parts of the data management process to be decentralised to the submitter.
For EnviDat extensions please click here.
For further information on the CKAN project, please click here.
Project Charter
Introduction
The World Meteorological Organization's Global Cryosphere Watch (GCW) is a mechanism for supporting all key cryospheric in-situ and remote sensing observations, and it facilitates the provision of authoritative data, information, and analyses on the state of the cryosphere.
To achieve this, real-time data and long time series of data and products will have to be made available to all consumers. Data and products are produced by NMHSs and by other operational and scientific communities; the latter often have limited resources and rely on a variety of data management approaches quite different from those of the WMO community. GCW is establishing a link between these communities through WIS and WIGOS. In order to successfully implement GCW, barriers between communities need to be lowered.
GCW data management follows a metadata-driven, service-oriented approach. It is based on the FAIR guiding principles and aligns well with the WIS principles.
Datasets are documented by standardized discovery metadata that are exchanged through standardized web services. The GCW Data Portal can connect scientific and other data providers with WMO-specific interfaces such as real-time exchange through the WMO GTS; for all other purposes, the Internet is used as the communication network. A critical component of the discovery metadata exchanged is standardized semantic annotation of data and interfaces, for example using ontologies, as well as linkages between datasets and the additional information needed to fully understand a dataset (e.g. WIGOS information).
At the data level, standardised use metadata are required along with containers for the data and services carrying the data. Currently GCW is promoting NetCDF following the Climate and Forecast (CF) convention as the preferred format for data and would welcome a number of WMO CF profiles accompanied by tools to simplify exchange. GCW is already serving free and open data extracted from WMO GTS, converted from WMO BUFR to NetCDF-CF. It is an ambition to fully support the opposite workflow for the data made available through GCW and requested by the WMO community to be available in WMO GTS.
GCW aims to provide access to both real-time and archived data (in the form of climate-consistent time series), which requires cost-efficient mechanisms that serve both purposes. GCW currently relies on OGC WMS and OPeNDAP for the exchange of information. The combination of NetCDF-CF and OPeNDAP allows data streaming and on-the-fly services to be built on top of data in a distributed data management system. Currently GCW supports on-the-fly visualisation and transformation of selected gridded products as well as time series; these services need to be extended to new areas. Transformation services include reformatting (e.g. NetCDF/CF to CSV, or NetCDF/CF to WMO GRIB), reprojection, subsetting, etc.
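To make the subsetting concrete, an OPeNDAP constraint expression appended to a dataset URL selects variables and index ranges on the server side, so the client streams only what it needs. The server URL and variable name below are hypothetical, used purely to show the URL shape.

```python
# Sketch of OPeNDAP server-side subsetting via a constraint expression.
# The base URL and variable name are assumptions for this example.
base_url = "https://example.org/opendap/cryonet/snow_depth.nc"

# Request only the snow_depth variable, index range 0..23 with stride 1,
# as an ASCII response (".dods" would give the binary form):
subset_url = base_url + ".ascii?snow_depth[0:1:23]"
```

The same mechanism underlies the temporal, spatial, and by-variable subset queries offered through the portal: the constraint expression narrows the request before any data leave the host data centre.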
In order to support other providers of relevant data who have limited resources for data management, GCW has developed a software stack relying on MeteoIO for the transformation of data from unstructured form to structured, FAIR-compliant NetCDF/CF, publishing these data using a lightweight OPeNDAP server based on pyDAP. This setup is still under development; the goal, as resources allow, is to establish web services based on MeteoIO. These data can then be accessed by the GCW Data Portal. In essence, GCW data management is a metadata-driven, service-oriented approach.
The GCW outline of data centres currently involved is provided in the illustration below.
Project objectives
To facilitate access to available datasets from different institutions and projects, by bridging between scientific communities and WMO systems in support of WMO activities (e.g. the WMO operating plan).
To improve the interoperability of WMO GCW-relevant datasets.
To increase the amount of data available to support the cryosphere-related goals of WMO, as delivered by GCW.
To link WIGOS and WIS metadata efficiently wherever possible.
WIS 2.0 Principles Demonstrated
GCW data management is aligned with the principles of WIS 2.0, as outlined below (using WIS 2.0 principles numbering).
Principle 1
GCW data management is based on harvesting discovery metadata through standardised web services for such exchange (primarily OAI-PMH). The information exchanged is currently standardised according to ISO 19115 or GCMD DIF, and contributors are encouraged to serve data as NetCDF following the Climate and Forecast convention. This links directly to a service-oriented approach relying on the Semantic Web and Linked Data.
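A harvesting round can be sketched with the standard OAI-PMH request parameters (`verb`, `metadataPrefix`). The endpoint URL below is hypothetical, not a real GCW or data-centre address, and the response is a truncated, illustrative fragment.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Build a standard OAI-PMH ListRecords request; the endpoint is an
# assumption for this example, "dif" requests GCMD DIF records.
endpoint = "https://example.org/oai"
params = {"verb": "ListRecords", "metadataPrefix": "dif"}
request_url = endpoint + "?" + urlencode(params)

# Parse a (heavily truncated) ListRecords response for record identifiers:
sample = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:ds-001</identifier></header></record>
  </ListRecords>
</OAI-PMH>"""
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
ids = [e.text for e in ET.fromstring(sample).iter(OAI_NS + "identifier")]
```

In a real harvest, the identifiers (and resumption tokens, omitted here) drive incremental updates of the GCW catalogue.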
Principle 2
Discovery metadata are harvested from contributing data centres using URLs. The harvested discovery metadata contain URLs for data access, licence information, and semantic annotations (of scientific parameters or of the purpose of a URL).
Principle 3
The backbone for all communication within GCW data management is the Internet. For specific purposes GCW will connect with private networks (e.g. WMO GTS).
Principle 4
GCW data management relies on web services for exchanging information on datasets as well as the data themselves and higher-order services on top of data. GCW does not currently expose a service catalogue as a web service; for now this is an internal catalogue.
Principle 5
GCW offers transformation services on top of data that are served according to the CF convention through OPeNDAP. These transformation services allow users (or applications) to subset data in time, space, or parameter space.
Principle 6
GCW does not currently have a messaging protocol and would benefit from WIS efforts in this context.
Principle 7
GCW is currently not caching data; this will be implemented as part of an integration with the GTS. These data will be treated as transient datasets in the GCW Data Portal.
Principle 8
GCW currently considers the data provider and the host data centre the authoritative sources for the data. Direct access to a dataset is provided by forwarding the data consumer to the web services offered by the host data centre. The only exception in the current implementation is when higher-order services offered in the GCW Data Portal are used to modify or combine data prior to delivery.
Principle 9
GCW data management does not currently use the WMO GTS for the transmission of data and relies on WMO efforts in this context. The critical question for GCW is how to connect efficiently to the relevant WMO services.
Principle 10
GCW maintains its own catalogue with discovery metadata but currently holds no catalogue for web services. Integrating the existing GCW services with the WIS 2.0 catalogue would be preferable. Currently the main effort of GCW is to ensure sufficient quality of the discovery and use metadata supplied by contributors and to transform this into WIS-compliant information.
Principle 11
GCW is reimplementing the web services offering discovery metadata and will in this context support OAI-PMH, OGC CSW, and OpenSearch. Details are still under discussion, as is how to ensure integrity in the value chain between the originating data centre and higher-order catalogues like WIS (to avoid duplication of records). GCW is also working with the ESIP community on extensions that will make Schema.org useful for dataset discovery. The current definition of Schema.org is insufficient for proper dataset discovery and filtering of information, but promising extensions are being discussed and the community working on this has good momentum.
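For context, Schema.org dataset markup is typically embedded in a landing page as JSON-LD, along the lines of the sketch below. The field values are invented for the example, and GCW's actual mapping is, as noted above, still under discussion.

```python
import json

# Illustrative schema.org Dataset record in JSON-LD; the name, license
# and coverage values are assumptions, not a real GCW dataset.
dataset_jsonld = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Snow depth at a CryoNet station",   # hypothetical
    "description": "Hourly snow depth observations from an automatic station.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "temporalCoverage": "2015-01-01/2017-12-31",
}

# Serialised form, as it would appear in a <script type="application/ld+json"> tag:
doc = json.dumps(dataset_jsonld, indent=2)
```

The limitation discussed in the text is that core properties like these are too coarse for rich dataset filtering, which is what the ESIP extension work aims to improve.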
Plan and milestones
Deliverables
No. | Deliverable name | Lead | Del. date | Status |
D1 | Updated information model enabling linkages on dataset to WIGOS metadata | MET | 2020Q2 | Complete |
D2 | Dynamic visualisation of time series from NetCDF-CF and OPeNDAP | MET | 2020Q2 | Complete |
D3 | Updated harvesting of discovery metadata supporting OAI-PMH, OGC CSW and OpenSearch | MET | 2021Q2 | In progress |
D4 | NetCDF-CF guidelines for timeseries and profiles (e.g. permafrost) | MET | 2021Q3 | In progress |
D5 | Mapping harvested discovery metadata to WMO Core Profile | MET | 2021Q3 | In progress |
D6 | Extension of metadata harvesting to support Schema.org provided current ESIP activities are approved by Schema.org | MET | 2022Q3 | Pending funding |
D7 | Conversion of NetCDF-CF to WMO BUFR for permafrost profiles | MET | 2023Q2 | Pending funding |
D8 | Web service converting non standardised data to NetCDF-CF using MeteoIO | WSL/SLF | 2023Q4 | Pending funding |
Milestones
No. | Milestone name | Lead | Due | Status |
M1 | New information model implemented | MET | 2021Q1 | In progress |
M2 | Selected permafrost datasets available online and in real time | MET | 2021Q4 | In progress |
M3 | Harvested discovery metadata exposed through WIS | MET | 2022Q1 | Not started |
M4 | Transformation of NetCDF-CF to WMO BUFR for selected datasets | MET | 2023 | Not started |
Further information and links
To access the new release of the GCW Data Portal, please click here.
For a general description of MeteoIO, please click here.
For specific information and software download of MeteoIO, please click here.
For technical specifications, please regard the following document:
For further information please consider the following publications:
Project team
Øystein Godøy (Norwegian Meteorological Institute, Oslo, NO) – project lead
Joel Fiddes (Norwegian Meteorological Institute, Oslo, NO, World Meteorological Organization, Geneva, CH)
Mathias Bavay (Institute for Snow and Avalanche Research SLF, Davos, CH)
Rodica Nitu (World Meteorological Organization, Geneva, CH)