It is critical to provide documentation describing your data so that others may understand what it is, what it means, and what it can be used for. Documentation involves recording important metadata about the dataset structure and contents.
Metadata is data that describes other data. The descriptive elements in the metadata make it possible for you and others (collaborators, peer reviewers, potential data re-users) to search for, evaluate, and understand the data.
Due to the diverse needs of researchers who work with data, there are many different metadata standards to choose from. Your chosen data repository's support services should be able to assist with determining what metadata to record.
The general-purpose metadata standards most frequently used for research data are Dublin Core and DDI (Data Documentation Initiative). Other common standards include:
Should your data require more comprehensive metadata specific to your discipline, the UK Digital Curation Centre (DCC) maintains a list of disciplinary metadata standards to help you find the best one for your work.
The global Research Data Alliance (RDA) also maintains the Metadata Standards Catalog, a directory of metadata standards, schemas, profiles, and related tools.
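As a rough illustration of what a general-purpose metadata record can look like, the sketch below stores a few Dublin Core elements for a dataset in a Python dictionary and flags any field name that is not one of the 15 elements in the Dublin Core Metadata Element Set. The record values and the function name unknown_elements are hypothetical, invented for this example; your repository may require different or additional fields.

```python
# The 15 elements of the Dublin Core Metadata Element Set.
DUBLIN_CORE_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


def unknown_elements(record):
    """Return the keys of `record` that are not Dublin Core elements, sorted."""
    return sorted(set(record) - DUBLIN_CORE_ELEMENTS)


# Hypothetical record: all values are made up for illustration.
example_record = {
    "title": "Stream temperature measurements, 2021-2022",
    "creator": "Example, A.",
    "date": "2023-01-15",
    "format": "text/csv",
    "language": "en",
    "rights": "CC BY 4.0",
}

print(unknown_elements(example_record))                   # []
print(unknown_elements({"title": "x", "author": "y"}))    # ['author']
```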
"Standards" xkcd, CC BY-NC 2.5
Any secondary file included with a dataset that clarifies what the data is, does, or means can be data documentation.
A 'README' file is a plain text file, usually titled README.txt, that contains important metadata about your dataset, such as administrative information, contents, and structure of the data. It should provide enough information for a potential user to determine whether the data is of interest to them or not. Other terms sometimes used interchangeably with 'README' are 'data dictionary' or 'codebook'.
The README file should be located in the root folder of your dataset. As suggested by the title, users should consult the README before attempting to use or interpret any other part of the dataset.
Create your project's README file as soon as you start collecting your data, or even during the planning phase, and fill it out as your project progresses.
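One way to start a README early is to generate a skeleton and fill it in over time. The following is a minimal sketch, assuming a very small set of sections; the headings, field names, and the functions build_readme and write_readme are hypothetical choices for this example, not a required layout.

```python
from pathlib import Path


def build_readme(title, creator, contact, files):
    """Return README text. `files` maps each file name to a one-line description."""
    lines = [
        "GENERAL INFORMATION",
        f"Title: {title}",
        f"Creator: {creator}",
        f"Contact: {contact}",
        "",
        "FILE OVERVIEW",
    ]
    for name, description in files.items():
        lines.append(f"{name}: {description}")
    return "\n".join(lines) + "\n"


def write_readme(root, text):
    """Write README.txt into the dataset's root folder and return its path."""
    path = Path(root) / "README.txt"
    path.write_text(text, encoding="utf-8")
    return path


# Hypothetical dataset details, for illustration only.
readme_text = build_readme(
    title="Stream temperature measurements",
    creator="Example, A.",
    contact="a.example@example.org",
    files={"temps.csv": "Hourly water temperature readings"},
)
print(readme_text)
```

Calling write_readme(".", readme_text) would place the file in the current folder; point it at the root folder of your dataset instead.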
A codebook describes the variables in a dataset in specific, comprehensive detail. It defines the coding used to represent the meaning and values of a variable. Simple codebooks can be included directly within the README, while larger or more complex datasets may warrant a separate codebook file.
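To make the idea of "coding" concrete, here is a small sketch of a codebook held as a Python dictionary, with a helper that translates a stored code back into its meaning. The variable names, labels, and codes are invented for illustration, and decode is a hypothetical helper rather than part of any standard tool.

```python
# Hypothetical codebook: each variable has a label and a mapping from the
# codes stored in the data file to their meanings.
CODEBOOK = {
    "smoker": {
        "label": "Current smoking status",
        "values": {1: "Yes", 2: "No", 9: "Missing"},
    },
    "site": {
        "label": "Data collection site",
        "values": {"A": "North campus", "B": "South campus"},
    },
}


def decode(variable, code):
    """Look up the meaning of `code` for `variable`; raises KeyError if undefined."""
    return CODEBOOK[variable]["values"][code]


print(decode("smoker", 1))  # Yes
print(decode("site", "B"))  # South campus
```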
A data dictionary is a structured repository of metadata that comprehensively describes the data elements used in a dataset, database, or project. It defines the names, labels, definitions, and attributes of the data elements so that users share a common language and understanding of the data: its meaning, its purpose, and the relationships between data elements. The metadata in a data dictionary can assist in interpreting the elements and help define their scope and the shared rules for their usage and application.
A good README guide is available from Cornell University.
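A data dictionary can be started from the data file itself. The sketch below, which assumes the data is a CSV file with a header row, reads the column names and produces one blank entry per column to be completed by hand. The entry fields (name, label, definition, units) and the function name data_dictionary_skeleton are assumptions made for this example.

```python
import csv
import io


def data_dictionary_skeleton(csv_text):
    """Return one blank data dictionary entry per column in the CSV header row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    return [
        {"name": name, "label": "", "definition": "", "units": ""}
        for name in header
    ]


# Hypothetical two-column data file.
sample = "site_id,temp_c\nA,12.5\nB,13.1\n"
for entry in data_dictionary_skeleton(sample):
    print(entry["name"])
# site_id
# temp_c
```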
Below is a list of elements commonly included in a README file. Some may instead be provided in other documentation (such as a separate codebook or data dictionary), or even embedded in the data itself. Many of these elements only make sense with certain types of data. In the interests of keeping README files concise, you should only include elements that are useful and/or necessary to correctly interpret, evaluate and reuse your dataset.
Terms in bold are strongly recommended for all datasets.
Best practices for creating reusable data publications (Dryad) - good overview of what to consider when preparing data to be understood by someone else. Includes tips on what to include in the dataset, necessary metadata, accessibility, file formats and naming, and README creation.
Create a README file (UBC) - self-paced learning mini-module
Quick Guide: Creating a README for your dataset (UBC) - the short short version!
What is a Codebook? (SAMHSA)
What is a Codebook? (ICPSR)
Codebook Cookbook: How to Enter and Document Your Data (McGill)
How to Make a Data Dictionary (OSF)
Controlled vocabularies are a kind of metadata standard that features a set of expert-curated preferred terms used for indexing or searching within a particular subject domain. Some forms of controlled vocabularies are term lists, authority files, taxonomies, and thesauri.
Controlled vocabulary terms improve search results in two ways:
Using controlled vocabularies when creating data or metadata supports accuracy, consistency, and interoperability. There are well-established vocabularies for a variety of subjects, including personal and corporate names, geographic names, topics, concepts, resource types and genres, and languages.
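A simple way to apply a controlled vocabulary when entering keywords is to map each variant a person might type to its preferred term. This is a minimal sketch with a made-up three-entry term list; PREFERRED_TERMS and to_preferred are hypothetical names, and a real project would draw its terms from an established vocabulary.

```python
# Hypothetical term list: variant spellings (lowercase) mapped to the
# preferred term. Not taken from any real vocabulary.
PREFERRED_TERMS = {
    "car": "Automobiles",
    "cars": "Automobiles",
    "automobile": "Automobiles",
}


def to_preferred(term):
    """Return the preferred term for `term`, or None if it is not in the list."""
    return PREFERRED_TERMS.get(term.strip().lower())


print(to_preferred(" Car "))    # Automobiles
print(to_preferred("bicycle"))  # None
```

Returning None for unrecognized terms, rather than passing them through, makes it easy to spot keywords that still need to be matched to the vocabulary.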