#Sebbari# Representation and Metadata
Introduction.
For data and information[1] to be useful, they need to be organised and written (or “serialized”) in a way that both humans and machines (e.g. software systems) can understand.
Data (information) representation may be written as: data representation = format + vocabulary + data model
where:
A format has to:
Preserve, as far as possible, the information content of the data values;
Enable vocabularies and data models to be applied to the data, so that users can associate values with meaning and reconstruct the information the producer intends to convey;
Be open, so that users can develop their own implementations.
A vocabulary has to:
Be pertinent to the specific domain;
Be accessible to, and understandable by, individuals who are not subject matter experts.
A data model has to:
Provide a framework to express complex concepts and connect them to common elements such as position, space, time and duration;
Assist with implementing interoperability between computer systems and services.
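The decomposition above can be illustrated with a minimal Python sketch. Here JSON stands in for the format, a small dictionary of term definitions stands in for the vocabulary, and a dataclass stands in for the data model. All names and values (`Observation`, `VOCABULARY`, `air_temperature`, the sample coordinates) are illustrative assumptions and are not taken from any standard cited in this document.

```python
import json
from dataclasses import dataclass, asdict

# Vocabulary (hypothetical): agreed terms with human-readable definitions.
VOCABULARY = {
    "air_temperature": "Temperature of the air near the surface",
}

# Data model (hypothetical): ties a measured quantity to common elements
# such as position and time.
@dataclass
class Observation:
    quantity: str     # must be a term from VOCABULARY
    value: float
    unit: str
    latitude: float
    longitude: float
    time: str         # ISO 8601 timestamp


def serialise(obs: Observation) -> str:
    """Format: write the observation as JSON text."""
    if obs.quantity not in VOCABULARY:
        raise ValueError(f"unknown term: {obs.quantity}")
    return json.dumps(asdict(obs), sort_keys=True)


obs = Observation("air_temperature", 21.5, "degC", 33.57, -7.59,
                  "2020-01-01T12:00:00Z")
text = serialise(obs)

# A consumer that knows the same data model can rebuild the observation.
restored = Observation(**json.loads(text))
```

The round trip at the end is the point of the exercise: a reader who shares the format, vocabulary and data model can reconstruct what the producer intended to convey.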
1. Vocabularies
Ensure that a consistent, controlled, domain-relevant vocabulary is used to define and characterise the key concepts, and the relationships between them, that are used to represent and convey information. This means using predefined, authorised terms, developed through open, consensus-based processes, to represent data, information and metadata.
Select vocabularies and formats that enable users to readily "read" and use the information. If there is no specific user community, select vocabularies and formats that can be used by the widest possible audience.
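As a hedged illustration of what employing a controlled vocabulary can look like in practice, the sketch below translates a provider's local labels into authorised terms and rejects anything that has no authorised equivalent. The term list, the local labels and the function name `to_controlled` are assumptions made for this example only.

```python
# Hypothetical controlled vocabulary: authorised term -> definition.
CONTROLLED_TERMS = {
    "air_temperature": "Temperature of the air",
    "precipitation_amount": "Amount of precipitation over a period",
}

# Hypothetical mapping from a provider's local labels to authorised terms.
LOCAL_TO_CONTROLLED = {
    "temp": "air_temperature",
    "rain": "precipitation_amount",
}


def to_controlled(record: dict) -> dict:
    """Rename the keys of a record to authorised vocabulary terms."""
    translated = {}
    for label, value in record.items():
        # Labels that are already authorised terms pass through unchanged.
        term = LOCAL_TO_CONTROLLED.get(label, label)
        if term not in CONTROLLED_TERMS:
            raise KeyError(f"'{label}' has no authorised term")
        translated[term] = value
    return translated


published = to_controlled({"temp": 3.0, "rain": 1.2})
```

Failing loudly on an unknown label is a design choice for the sketch: it stops uncontrolled terms from leaking into published data.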
Comment: We need to provide some examples of what is meant here, including an example of a widely-understandable vocabulary. Does WIS2.0 address this?
2. Formats
Use open, community-based standards
Provide information and metadata in open standard formats that users can easily read, interpret and process. In particular, the data/information provider should publish their data in a way that is consistent with the regulations, policies, standards and/or conventions in use by the primary audience (for example, WaterML2.0 is the accepted community standard for the hydrology community, and CF-NetCDF for climate predictions/projections[TB1]).
Provide information and metadata in machine-readable formats.
It is recommended that data and information be provided in multiple representations. This increases usability for a wider audience and limits the errors and extra processing costs that arise when users have to transform the data themselves.
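A minimal sketch of offering the same records in more than one machine-readable representation is given below, using only the Python standard library. The record fields (`station`, `time`, `value`) and the function names are assumptions chosen for illustration; a real service would use the community formats named above.

```python
import csv
import io
import json

# Hypothetical records; the field names are illustrative only.
records = [
    {"station": "A1", "time": "2020-01-01T00:00:00Z", "value": 2.5},
    {"station": "A1", "time": "2020-01-01T01:00:00Z", "value": 2.7},
]


def to_json(rows: list) -> str:
    """Serialise the rows as a JSON array of objects."""
    return json.dumps(rows)


def to_csv(rows: list) -> str:
    """Serialise the rows as CSV with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()),
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


json_text = to_json(records)
csv_text = to_csv(records)
```

Because both outputs are generated from the same source records, the provider carries out the transformation once, rather than each user repeating it.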
Wherever possible, use standard representations for, e.g., spatial projections, dates and symbols; if non-standard forms are used, clearly define them. For Web applications it is recommended to follow community standard guides such as the W3C Data on the Web Best Practices. [WW2]
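Dates are a common case where a standard representation avoids ambiguity. The sketch below, which assumes ISO 8601 in UTC as the chosen standard, uses Python's standard `datetime` module to write and read timestamps. The helper name `to_iso8601` is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def to_iso8601(moment: datetime) -> str:
    """Return an ISO 8601 timestamp in UTC, rejecting naive datetimes."""
    if moment.tzinfo is None:
        raise ValueError("timestamp must carry a time zone")
    return moment.astimezone(timezone.utc).isoformat()


# The same instant expressed in two different time zones.
utc_time = datetime(2020, 3, 1, 14, 30, tzinfo=timezone.utc)
local_time = datetime(2020, 3, 1, 15, 30, tzinfo=timezone(timedelta(hours=1)))

stamp = to_iso8601(utc_time)
same_stamp = to_iso8601(local_time)

# Any ISO 8601 aware consumer can parse the text back into an instant.
parsed = datetime.fromisoformat(stamp)
```

Normalising to UTC before writing means that two providers in different time zones publish identical text for the same instant.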
3. Metadata
Ensure that the metadata describing data/information is sufficiently rich to meet the needs of the target audience. This applies to discovery, contextual and provenance metadata.
Ensure that users are always able to locate the metadata associated with a dataset, either by using formats that allow embedding of metadata within the data file, specifying the location of the metadata file (via, e.g., a hyperlink), or using the dataset identifier such as a DOI to search a metadata catalogue.
Ensure that a versioning system (e.g. version number, date of change) that distinguishes between different versions of a dataset is in place, and publish both the version and the reasons for releasing a new version in the metadata.
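The two recommendations above, keeping metadata locatable and recording versions, can be sketched together as follows. The field names, the placeholder identifier `doi:10.0000/example` and the `example.org` URLs are assumptions for illustration and do not refer to a real dataset or to any particular metadata standard.

```python
import json


def describe_version(identifier: str, version: str, changed_on: str,
                     reason: str, metadata_url: str) -> dict:
    """Build a small metadata record for one version of a dataset."""
    return {
        "identifier": identifier,          # e.g. a DOI (placeholder here)
        "version": version,
        "date_of_change": changed_on,      # ISO 8601 date
        "reason_for_change": reason,
        "metadata_location": metadata_url, # where the full metadata lives
    }


def latest(versions: list) -> dict:
    """Pick the most recent version; ISO 8601 dates sort as plain text."""
    return max(versions, key=lambda v: v["date_of_change"])


history = [
    describe_version("doi:10.0000/example", "1.0", "2020-01-15",
                     "Initial release",
                     "https://example.org/metadata/example-v1.0.json"),
    describe_version("doi:10.0000/example", "1.1", "2020-06-02",
                     "Corrected station coordinates",
                     "https://example.org/metadata/example-v1.1.json"),
]

# Embed the version metadata in the data file itself so that users who
# hold only the file can still find the version, reason and full metadata.
dataset_file = json.dumps({"metadata": latest(history), "data": [2.5, 2.7]})
```

Embedding a short record and linking to the full metadata combines two of the options listed above; a catalogue lookup by identifier would be a third route to the same information.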
[1] Hereafter, we use the term “data” as shorthand, inclusive of data, information, and products.
[TB1]How will these translate to the cloud? Won't we have a different paradigm of information retrieval in the cloud?
[WW2]For Enrico: Should we be this specific? Also, does this perhaps fit better in the section under Publishing?