Library Guides: Data Resources in the Health Sciences: Describe Data

Why describe data?

Datasets are usually accompanied by some documentation that functions like an owner's manual. This documentation will allow a user to understand the raw data, its context, how to read it, and appropriate use. Structured data in the documentation will also allow the dataset to be found more easily.

Good data documentation will include information about the context, methods, protocols, scale, software, instruments, etc. used in the data collection. It will include a detailed outline of the data structure and file types used. Data validation and cleaning methods should be included as well as data sources used. Modifications to the data over time should be noted as well as information on data confidentiality, access and use conditions.

Documentation is most useful when it is standardized to a general norm or to a discipline-specific convention. Two levels of terminology are used in data documentation: metadata and ontologies.

Metadata

Metadata is information about data. It is often information describing a unit of publication such as a book, a website, or a dataset.

NISO (National Information Standards Organization) describes 3 types of metadata:

Descriptive metadata describes a resource for purposes such as discovery and identification. It includes elements such as title, abstract, author, and keywords. An example of a general standard is Dublin Core, maintained by the Dublin Core Metadata Initiative (DCMI), a simple standard widely used to describe web resources.
Structural metadata indicates how compound objects are put together, for example, how pages are ordered to form chapters.
Administrative metadata provides information to help manage a resource, such as when and how it was created, file type and other technical information, and who can access it. There are several subsets of administrative metadata; two that are sometimes listed as separate metadata types are:
- Rights management metadata, which deals with intellectual property rights, and
- Preservation metadata, which contains information needed to archive and preserve a resource.
Markup languages, which enclose the resource, allow metadata to be embedded and interwoven more deeply with the content. An example of a discipline-specific standard is the Data Documentation Initiative (DDI), an XML-based standard for the documentation of datasets in the social and behavioral sciences.
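
To make the idea of descriptive metadata concrete, the following is a minimal sketch of a Dublin Core style record built with Python's standard library. The element names (title, creator, subject, description, date, format, rights) come from the Dublin Core element set; the wrapper element name `metadata`, the helper `build_dc_record`, and every field value are hypothetical and invented for illustration only.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core element set, version 1.1
DC_NS = "http://purl.org/dc/elements/1.1/"


def build_dc_record(fields):
    """Return a <metadata> element with one dc:* child per (name, value) pair."""
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")  # wrapper name is an assumption, not a standard
    for name, value in fields:
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return root


# Hypothetical dataset description; all values are placeholders.
example_fields = [
    ("title", "Example Clinic Visit Dataset"),
    ("creator", "Example Research Group"),
    ("subject", "health services"),
    ("description", "Illustrative record only; not a real dataset."),
    ("date", "2020-01-01"),
    ("format", "text/csv"),
    ("rights", "Example access and use conditions"),
]

record = build_dc_record(example_fields)
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

A record like this would typically sit alongside the data files or be submitted to a repository, where the structured fields are what make the dataset discoverable.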

Another type of metadata to consider is Provenance Metadata. This kind of metadata provides information on the individuals, groups, and activities involved in producing a dataset.

NIH Common Data Elements (CDE) - http://www.nlm.nih.gov/cde/

An NIH portal of data elements that are common to multiple datasets across different studies, intended to improve data quality and promote data sharing. Collections of CDEs can be viewed by specific project or subject area.

Ontologies

Ontologies are standardized terminologies for the concepts within a given domain and the relationships among them. They are applied at the content level of a dataset.

Examples include

Gene Ontology
MeSH (Medical Subject Headings)
ICD (International Classification of Diseases)
Logical Observation Identifiers Names and Codes (LOINC)

An ontology, on its own, allows researchers to be specific about describing the phenomena they work with, such as in annotations of datasets and indexing of journal articles, in order to facilitate identification and retrieval of relevant information.

Used in concert with each other, ontologies allow concepts to be mapped between domains. This interoperability supports:

exchange of information
data/information aggregation for analysis or computation
knowledge discovery
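
As a rough illustration of how mapping concepts supports aggregation, the sketch below translates codes from two hypothetical local vocabularies into shared concept identifiers and then counts records per concept across both datasets. All codes, concept identifiers, and names (`SOURCE_A_TO_CONCEPT`, `SOURCE_B_TO_CONCEPT`, `to_shared_concepts`) are invented placeholders; they are not taken from any real ontology or crosswalk.

```python
from collections import Counter

# Hypothetical crosswalks from local codes to shared concept identifiers.
# Every identifier here is a placeholder, not a real ontology term.
SOURCE_A_TO_CONCEPT = {"A:001": "concept-diabetes", "A:002": "concept-hypertension"}
SOURCE_B_TO_CONCEPT = {"B:17": "concept-diabetes", "B:42": "concept-asthma"}


def to_shared_concepts(codes, crosswalk):
    """Translate local codes to shared concepts; return (mapped, unmapped)."""
    mapped, unmapped = [], []
    for code in codes:
        if code in crosswalk:
            mapped.append(crosswalk[code])
        else:
            unmapped.append(code)  # keep unmapped codes visible for review
    return mapped, unmapped


# Two invented datasets, each coded in its own local vocabulary.
dataset_a = ["A:001", "A:002", "A:001"]
dataset_b = ["B:17", "B:42", "B:99"]

mapped_a, unmapped_a = to_shared_concepts(dataset_a, SOURCE_A_TO_CONCEPT)
mapped_b, unmapped_b = to_shared_concepts(dataset_b, SOURCE_B_TO_CONCEPT)

# Once both datasets use the same concepts, they can be aggregated.
combined_counts = Counter(mapped_a + mapped_b)
print(combined_counts)
print(unmapped_a + unmapped_b)
```

Codes with no mapping are returned separately rather than dropped, since gaps in a crosswalk usually need human review before the data are combined.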

The following are links to directories of ontologies serving the health sciences.