#Sebbari# Representation and Metadata   

Introduction.

For data and information[1] to be useful, they need to be organised and written (or “serialized”) in a way that both humans and machines (e.g. software systems) can understand.

Data (information) representation may be written as: data representation = format + vocabulary + data model

Where

Format has to:

  • Preserve, as far as possible, the information content in data values;

  • Enable vocabularies and data models to be applied to the data, so that users can associate values with meaning and reconstruct the information the producer intends to convey;

  • Be open, so that users can develop their own implementations.

Vocabulary has to:

  • Be pertinent to the specific domain;

  • Be accessible to, and understandable by, individuals who are not subject matter experts.

A data model has to:

  • Provide a framework to express complex concepts and connect them to common elements such as position, space, time and duration;

  • Assist with implementing interoperability between computer systems and services.
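As a minimal sketch of this decomposition, assume a single air-temperature observation. The data model is a small Python dataclass, the vocabulary is a dictionary of term definitions, and the format is JSON. The names `Observation`, `VOCABULARY`, `serialise` and `deserialise`, and all values, are illustrative assumptions and are not taken from any standard cited in this section.

```python
import json
from dataclasses import dataclass, asdict

# Vocabulary: hypothetical controlled terms with plain-language definitions.
VOCABULARY = {
    "air_temperature": "Temperature of the air, in kelvin",
}

# Data model: ties a value to a vocabulary term, a position and a time.
@dataclass
class Observation:
    term: str         # must be a key of VOCABULARY
    value: float
    latitude: float
    longitude: float
    time: str         # ISO 8601 timestamp

def serialise(obs: Observation) -> str:
    """Format: write the observation as JSON text."""
    if obs.term not in VOCABULARY:
        raise ValueError(f"unknown vocabulary term: {obs.term}")
    return json.dumps(asdict(obs), sort_keys=True)

def deserialise(text: str) -> Observation:
    """Rebuild the observation from JSON text."""
    return Observation(**json.loads(text))

obs = Observation("air_temperature", 287.15, 46.2, 6.1, "2024-01-01T12:00:00Z")
text = serialise(obs)
restored = deserialise(text)
```

In this sketch the format preserves the value, the vocabulary gives the term its meaning, and the data model connects the value to position and time, so the consumer can reconstruct what the producer intended.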

 

1. Vocabularies

  • Use a consistent, controlled, domain-relevant vocabulary to define and characterise the key concepts, and the relationships between them, that are used to represent and convey information. This means using predefined, authorised terms, developed through open, consensus-based processes, to represent data, information and metadata.

  • Select vocabularies and formats that enable users to readily "read" and use the information. If there is no specific user community, select vocabularies and formats that can be used by the widest possible audience.
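One way to enforce a controlled vocabulary is to check every field name in a record against the list of authorised terms before publication. The sketch below assumes a hypothetical two-term vocabulary; `CONTROLLED_TERMS` and `unauthorised_terms` are illustrative names, and the terms are not drawn from any particular published vocabulary.

```python
# Hypothetical controlled vocabulary: authorised term -> definition.
CONTROLLED_TERMS = {
    "precipitation_amount": "Depth of liquid water equivalent, in millimetres",
    "river_discharge": "Volume of water passing a point per unit time",
}

def unauthorised_terms(record: dict) -> list:
    """Return the keys of `record` that are not in the controlled vocabulary."""
    return sorted(key for key in record if key not in CONTROLLED_TERMS)

# "rain" is a colloquial label, not an authorised term, so it is flagged.
record = {"precipitation_amount": 12.4, "rain": 3.0}
problems = unauthorised_terms(record)
```

Keeping a plain-language definition next to each term also supports the requirement that the vocabulary be understandable by people who are not subject matter experts.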

Comment: We need to provide some examples of what is meant here, including an example of a widely-understandable vocabulary. Does WIS2.0 address this?

 

2. Formats

Use open, community-based standards

  • Provide information and metadata in open standard formats that users can easily read, interpret and process. In particular, the data/information provider should publish their data in a way that is consistent with the regulations, policies, standards and/or conventions in use by the primary audience (for example, WaterML2.0 is the accepted community standard for the Hydrology community, and CF-NetCDF for climate predictions/projections [TB1]).

  • Provide information and metadata in machine-readable formats.

  • It is recommended that data and information be provided in multiple representations, to increase usability for a wider audience and to limit the introduction of errors and extra processing costs during transformation.

  • Wherever possible, use standard representations for, e.g., spatial projections, dates and symbols; if non-standard forms are used, define them clearly. For Web applications it is recommended to follow community standard guides such as the W3C Data on the Web Best Practices. [WW2]
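The recommendations on multiple representations and standard date forms can be sketched as follows, using only the Python standard library. The station identifiers and values are invented for illustration, and JSON and CSV stand in here for whatever open formats the primary audience actually uses.

```python
import csv
import io
import json
from datetime import datetime, timezone

# Hypothetical observations; station names and values are placeholders.
records = [
    {"station": "A1", "time": datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc), "value": 3.5},
    {"station": "B2", "time": datetime(2024, 1, 1, 13, 0, tzinfo=timezone.utc), "value": 4.0},
]

# Use one standard date form (ISO 8601) in every representation.
rows = [{**r, "time": r["time"].isoformat()} for r in records]

def to_json(rows):
    """Machine-readable representation for web services."""
    return json.dumps(rows, indent=2)

def to_csv(rows):
    """Tabular representation for spreadsheet users."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["station", "time", "value"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

json_text = to_json(rows)
csv_text = to_csv(rows)
```

Because both outputs are generated by the provider from the same rows, users do not need to convert between them and risk introducing errors of their own.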

 

3. Metadata

  • Ensure that the metadata describing data/information is sufficiently rich to meet the needs of the target audience. This applies to discovery, contextual and provenance metadata.

  • Ensure that users are always able to locate the metadata associated with a dataset, either by using formats that allow embedding of metadata within the data file, specifying the location of the metadata file (via, e.g., a hyperlink), or using the dataset identifier such as a DOI to search a metadata catalogue.

  • Ensure that a versioning system (e.g. version number, date of change) is in place to distinguish between different versions of a dataset, and publish the version and the reasons for publishing a new version in the metadata.
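A minimal sketch of these metadata recommendations, assuming an in-memory catalogue keyed by dataset identifier. The classes `DatasetMetadata` and `DatasetVersion`, the `catalogue` dictionary, and the identifier `doi:10.0000/example` are hypothetical placeholders, not a real metadata schema or a real DOI.

```python
from dataclasses import dataclass, field

@dataclass
class DatasetVersion:
    version: str
    date_of_change: str   # ISO 8601 date
    reason: str           # why this version was published

@dataclass
class DatasetMetadata:
    identifier: str       # e.g. a DOI; the value used below is a placeholder
    title: str
    versions: list = field(default_factory=list)

    def publish_version(self, version, date_of_change, reason):
        """Record a new version together with the reason for publishing it."""
        self.versions.append(DatasetVersion(version, date_of_change, reason))

    def latest(self):
        """Return the most recently published version."""
        return self.versions[-1]

# Hypothetical metadata catalogue, searchable by dataset identifier.
catalogue = {}

meta = DatasetMetadata("doi:10.0000/example", "Example rainfall series")
meta.publish_version("1.0", "2023-06-01", "Initial release")
meta.publish_version("1.1", "2024-02-15", "Corrected station coordinates")
catalogue[meta.identifier] = meta

# A user holding only the identifier can still locate the metadata.
found = catalogue["doi:10.0000/example"]
```

The full version history is kept, rather than only the latest entry, so that users can tell which version of the dataset they hold and why it changed.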




[1] Hereafter, we use the term “data” as shorthand, inclusive of data, information, and products.


 [TB1]How will these translate to the cloud?  Won't we have a different paradigm of information retrieval in the cloud?

 [WW2]For Enrico: Should we be this specific? Also, does this perhaps fit better in the section under Publishing?