Research Data Management (RDM)

Resources to help researchers manage their research data, with an emphasis on Canadian tools.

Data Documentation

It is critical to provide documentation describing your data so that others may understand what it is, what it means, and what it can be used for. Documentation involves recording important metadata about the dataset structure and contents.

Metadata 

Metadata is data that describes other data. The descriptive elements in the metadata make it possible for you and others (collaborators, peer reviewers, potential data re-users) to search for, evaluate, and understand the data.

Due to the diverse needs of researchers who work with data, there are many different metadata standards to choose from. Your chosen data repository's support services should be able to assist with determining what metadata to record.

The general-purpose metadata standards most frequently used for research data are Dublin Core and DDI (Data Documentation Initiative). Other common standards include:

  • Darwin Core: biological diversity data
  • NAP (North American Profile of ISO 19115): geospatial data in North America
  • EML: ecological data
  • ISO 19115: comprehensive standard that can be used to describe any data; designed for data that includes geospatial information
  • PREMIS: preservation metadata
  • Schema.org: structured metadata to describe resources on the Internet (e.g. webpages, published datasets, etc.)
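
To make the idea of a metadata record more concrete, here is a minimal sketch of a Dublin Core description serialized as XML with Python's standard library. The element names (title, creator, date, type, rights) and the namespace come from the Dublin Core element set; the dataset, the names, the wrapping "metadata" element, and the helper function build_dublin_core are hypothetical and only for illustration. Your repository may expect a different wrapper or additional fields.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"


def build_dublin_core(record: dict) -> str:
    """Serialize a flat dict of Dublin Core elements to an XML string."""
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")  # wrapper element name is an assumption
    for element, value in record.items():
        # A list means the element repeats (e.g. several creators).
        values = value if isinstance(value, list) else [value]
        for item in values:
            child = ET.SubElement(root, f"{{{DC_NS}}}{element}")
            child.text = str(item)
    return ET.tostring(root, encoding="unicode")


# Hypothetical dataset description, for illustration only.
example_record = {
    "title": "Stream temperature measurements, 2021-2023",
    "creator": ["Example, Alex", "Sample, Sam"],
    "date": "2024-01-15",
    "type": "Dataset",
    "rights": "CC BY 4.0",
}

dc_xml = build_dublin_core(example_record)
print(dc_xml)
```

In practice most repositories collect these fields through a deposit form and generate the record for you, so a script like this is mainly useful for checking what you plan to record.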

Should your data require more comprehensive metadata specific to your discipline, the UK Digital Curation Centre (DCC) maintains a list of disciplinary metadata standards to help you find the best one for your work.

The global Research Data Alliance (RDA) also maintains the Metadata Standards Catalog, a directory of metadata standards, schemas, profiles, and related tools.

"Standards" xkcd, CC BY-NC 2.5

Types of data documentation

  • README files
  • Data Dictionaries
  • Codebooks

Any secondary file included with a dataset that clarifies what the data is, does, or means can be data documentation.

What is a README?

A 'README' file is a plain text file, usually titled README.txt, that contains important metadata about your dataset, such as administrative information, contents, and structure of the data. It should provide enough information for a potential user to determine whether the data is of interest to them. The terms 'data dictionary' and 'codebook' are sometimes used interchangeably with 'README', although each has a more specific meaning, described below.

The README file should be located in the root folder of your dataset. As suggested by the title, users should consult the README before attempting to use or interpret any other part of the dataset. 

A project's README file should be created as soon as you start collecting your data, or even during the planning phase, to fill out as your project progresses.

A codebook describes the variables in a dataset in specific, comprehensive detail. It defines the coding used to represent the meaning and values of a variable. Simple codebooks can be included directly within the README, while larger or more complex datasets may warrant a separate codebook file.

A data dictionary is a structured repository of metadata that comprehensively describes the elements of the data used in a dataset, database, or project. It defines the names, labels, definitions, and attributes of the data elements in order to provide users with a common language and understanding of the data: its meaning, purpose, and relationships between data elements. The metadata included in a Data Dictionary can assist in interpretation of the elements, as well as help in defining their scope and the shared rules for their usage and application.
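
As a rough illustration of how a codebook or data dictionary can be put to work, the sketch below stores one entry per variable (name, label, units, and any codes) and uses it to flag undocumented columns or undefined codes in a data file. The variables, codes, sample data, and the function check_against_dictionary are all hypothetical; a real data dictionary would usually live in its own file and carry more attributes.

```python
import csv
import io

# Hypothetical data dictionary: one entry per variable in the dataset.
DATA_DICTIONARY = [
    {"name": "site_id", "label": "Site identifier", "units": "", "codes": {}},
    {"name": "temp_c", "label": "Water temperature",
     "units": "degrees Celsius", "codes": {}},
    {"name": "flow", "label": "Flow condition", "units": "",
     "codes": {"1": "low", "2": "normal", "3": "high", "-99": "missing"}},
]


def check_against_dictionary(rows, dictionary):
    """Return a list of problems: undocumented columns or undefined codes."""
    documented = {entry["name"]: entry for entry in dictionary}
    problems = []
    for line_no, row in enumerate(rows, start=2):  # line 1 is the header
        for column, value in row.items():
            entry = documented.get(column)
            if entry is None:
                problems.append(
                    f"line {line_no}: column '{column}' is not documented")
            elif entry["codes"] and value not in entry["codes"]:
                problems.append(
                    f"line {line_no}: '{value}' is not a defined code for '{column}'")
    return problems


# Hypothetical data file; the second record uses an undefined flow code.
sample_csv = "site_id,temp_c,flow\nA1,12.5,2\nA2,13.1,7\n"
rows = list(csv.DictReader(io.StringIO(sample_csv)))
problems = check_against_dictionary(rows, DATA_DICTIONARY)
print(problems)
```

Keeping the codes in one structured place means the same definitions can be reused both for checking the data and for writing the human-readable codebook.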

How do you write a README?

A good README guide is available from Cornell University.

 

NEW! Downloadable README templates:

What goes in a README?

Below is a list of elements commonly included in a README file. Some may instead be provided in other documentation (such as a separate codebook or data dictionary), or even embedded in the data itself. Many of these elements only make sense with certain types of data. In the interests of keeping README files concise, you should only include elements that are useful and/or necessary to correctly interpret, evaluate and reuse your dataset.

Terms in bold are strongly recommended for all datasets.

  • General Dataset Information
    • Dataset title
    • Description
    • Contact Information
      • Names, roles, institutions and email and/or phone (include ORCID and ROR if available)
    • Contributors
      • Names, roles, institutions and email and/or phone (include ORCID and ROR if available)
    • Dataset publication date
    • Data publisher (e.g. repository)
    • Persistent Identifier (usually a DOI assigned by a data repository upon deposit)
    • Title of project the data was generated for
    • Funding information (funder, grant ID, grant name)
    • Suggested citation
  • Access & Sharing Information
    • License
    • Any restrictions on use of the dataset or parts thereof
    • Relationship with other datasets, if any
    • Links to publications based on the dataset
  • Data Collection
    • Collection date(s)
    • Geographic location of collection
    • Methods used for data collection (including protocols, references, documentation, links)
    • Experimental & environmental conditions
    • Standards and calibration information
    • Uncertainty, precision and accuracy of measurements
    • Known problems & caveats (sampling, blanks, etc.)
  • Data Overview
    • Folder structure
    • File naming convention (template and examples)
    • Description of file versioning system (if applicable)
    • Relationships and dependencies between files
    • Other documentation files of interest within dataset (data dictionary, codebook, notes...)
    • File list:
      • For each major file or filetype, provide:
        • filename, with extension
        • short description of its contents and structure
        • date of last revision
    • Codebook
      • For each file/filetype, provide:
        • List of variables with:
          • variable name
          • variable label
          • short description (including units)
          • List of codes and their values, with definitions
      • Definition of column headings and row labels for tabular data
      • Treatment of missing data (code, etc.)
      • Example of records for each file type
    • Processing & QA
      • Methods used for data processing
      • Software used in data collection and processing, including version numbers
      • File formats used in the dataset & recommended software
      • Quality assurance procedures applied
      • Dataset changelog
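
One way to start a README early and fill it out as the project progresses is to generate a plain-text skeleton from the elements listed above. The sketch below is only an illustration: the section and field names are taken from the list, while the dataset title, the placeholder text, and the function render_readme are hypothetical. It is not a replacement for an established template.

```python
def render_readme(sections: dict) -> str:
    """Render README sections as plain text, marking empty fields."""
    lines = []
    for heading, fields in sections.items():
        lines.append(heading.upper())
        lines.append("-" * len(heading))
        for name, value in fields.items():
            # Empty values become a visible placeholder to fill in later.
            lines.append(f"{name}: {value or '[to be completed]'}")
        lines.append("")  # blank line between sections
    return "\n".join(lines)


# Hypothetical, partially completed README content.
readme_sections = {
    "General Dataset Information": {
        "Dataset title": "Stream temperature measurements, 2021-2023",
        "Description": "",
        "Contact information": "",
        "Suggested citation": "",
    },
    "Access & Sharing Information": {
        "License": "CC BY 4.0",
        "Restrictions on use": "",
    },
    "Data Overview": {
        "Folder structure": "",
        "File naming convention": "",
    },
}

readme_text = render_readme(readme_sections)
print(readme_text)
```

The resulting text can be saved as README.txt in the root folder of the dataset and edited by hand from there.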

Further resources:

Best practices for creating reusable data publications (Dryad) - good overview of what to consider when preparing data to be understood by someone else. Includes tips on what to include in the dataset, necessary metadata, accessibility, file formats and naming, and README creation.

READMEs

Create a README file (UBC) - self-paced learning mini-module

Quick Guide: Creating a README for your dataset (UBC) - the short short version!

Codebooks and Data Dictionaries

What is a Codebook? (SAMHSA)

What is a Codebook? (ICPSR)

Codebook Cookbook: How to Enter and Document Your Data (McGill)

How to Make a Data Dictionary (OSF)

Markdown Syntax

GitHub Basic Writing and Formatting Syntax

What is a controlled vocabulary?

Controlled vocabularies are a kind of metadata standard that features a set of expert-curated preferred terms used for indexing or searching within a particular subject domain. Some forms of controlled vocabularies are term lists, authority files, taxonomies, and thesauri.

Controlled vocabulary terms improve search results in two ways:

  1. by connecting synonyms (different words with the same or similar meanings) with the preferred term for a concept, and
  2. by distinguishing homographs (words that are spelled the same but have different meanings), reducing the ambiguity of natural language.

Using controlled vocabularies when creating data or metadata supports accuracy, consistency, and interoperability. There are well-established vocabularies for a variety of subjects, including personal and corporate names, geographic names, topics, concepts, resource types and genres, and languages.
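
The synonym-to-preferred-term mapping described above can be sketched in a few lines of code. The vocabulary below is invented for illustration and does not come from any published thesaurus; the function normalize_keyword is likewise hypothetical. A real project would draw its terms from an established vocabulary for its discipline.

```python
# Hypothetical mini-vocabulary: preferred term -> accepted synonyms.
PREFERRED_TERMS = {
    "automobiles": ["cars", "motor cars", "autos"],
    "adolescents": ["teenagers", "teens", "youth"],
}

# Invert the vocabulary so every synonym points at its preferred term.
LOOKUP = {}
for preferred, synonyms in PREFERRED_TERMS.items():
    LOOKUP[preferred] = preferred
    for synonym in synonyms:
        LOOKUP[synonym] = preferred


def normalize_keyword(keyword: str):
    """Return the preferred term, or None if the keyword is not in the vocabulary."""
    return LOOKUP.get(keyword.strip().lower())


keywords = ["Cars", "teens ", "bicycles"]
normalized = [normalize_keyword(k) for k in keywords]
print(normalized)  # ['automobiles', 'adolescents', None]
```

Keywords that come back as None are candidates for review: they may be misspellings, or concepts the vocabulary does not yet cover.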

Examples:

Under development!
