LibGuides: Research Data Management (RDM): Organize Your Data

Organize Your Data

File Management

Large research projects can generate massive amounts of data, both in terms of size and number of files. Short, descriptive file names and a simple file hierarchy make these files easier to navigate and locate.

Once you create, collect, or start manipulating data and files, they can quickly become disorganized. To save time and prevent errors later on, you and your colleagues should decide how you will name and structure files and folders. Including a data dictionary or README file containing descriptive information about the data ('metadata') along with the data itself preserves context so that you and others can understand the data in both the short and long term. This documentation helps research teams and collaborators work more effectively throughout the research life cycle, and it greatly improves the data's future reusability.
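As a starting point for a data dictionary, the column names of a data file can be turned into a blank template to fill in by hand. The following is a minimal sketch only: the function name, the template fields, and the sample columns are hypothetical and should be adapted to your project.

```python
import csv
import io


def data_dictionary_skeleton(csv_text):
    """Return one blank data dictionary row per column in the CSV header."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # The description, units, and allowed values are left empty on purpose:
    # they must be written by someone who knows the data.
    return [
        {"variable": name.strip(), "description": "", "units": "", "allowed_values": ""}
        for name in header
    ]


# Hypothetical sample data, for illustration only.
sample = "participant_id,visit_date,grip_kg\nP001,2024-04-30,31.5\n"

for row in data_dictionary_skeleton(sample):
    print(row["variable"])
```

The completed template can then be saved next to the data, alongside a README that spells out any abbreviations.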

"Documents" xkcd, CC-BY-NC 2.5

Use a naming convention

Consistent and thoughtful file naming will help you and your colleagues avoid frustration and work more efficiently. Establishing a naming convention provides consistency, which makes it easier to find and correctly identify your files and helps prevent version control problems when working on files collaboratively. It is wise to develop a logical structure in cooperation with your collaborators at the start of a project.

File Naming Principles

e.g. [element 1]_[element 2 WordPart-WordPart-WordPart]_[element 3].txt

Keep it short

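Recommended: around 30 characters. Definitely no more than 255 characters, the maximum length of a single file or folder name on most file systems. Note that Windows also limits the full file path to 260 characters by default.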

If you use abbreviations, they must be explained in the data documentation (README).

Why? Shorter filenames are easier to read, don't cause problems with file systems, and reduce side-scrolling and column adjustment.

DO: CHHM*, FG1*, interviews, MC* (initials of the data collector)
- * with acronym spelled out in README
DON’T: Centre for Hip Health and Mobility, Focus Group 1

Include 3-5 elements that identify important aspects of the file

e.g. dates, file types, locations, people, version, procedures performed

Why? Helps users find the right file more easily.

DO: FileName_Guidelines_20140409_v01.docx
DON’T: FileName.docx, Guide.pdf
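A naming convention is easiest to follow when file names are assembled by a small helper rather than typed by hand. The sketch below assumes a convention of descriptive elements, then a YYYYMMDD date, then a two-digit version; the function name and argument order are illustrative, not part of any standard.

```python
from datetime import date


def build_filename(elements, when, version, extension):
    """Join descriptive elements, a YYYYMMDD date, and a zero-padded version."""
    parts = list(elements) + [when.strftime("%Y%m%d"), f"v{version:02d}"]
    return "_".join(parts) + "." + extension.lstrip(".")


print(build_filename(["FileName", "Guidelines"], date(2014, 4, 9), 1, "docx"))
# prints FileName_Guidelines_20140409_v01.docx
```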

Avoid special characters and spaces except - and _

Use _underscores or -hyphens as delimiters in filenames, or use CamelCase (words capitalized, no spaces)

Don't use any other special characters, e.g.: & , * % # ( ) ! @ $ ^ ~ ‘ { } [ ] ? < >

Why? Different operating systems and programs interpret special characters differently, which can change the filing order, break scripts, or prevent a file from opening.

DO: FileNameGuidelines_20140409_v01.docx, File-Name-Guidelines_20140409_Cuthill-M_v01.1.docx
DON’T: File Names&Guidelines 2014 04 09 v1*.docx
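The character rule above can be checked, and existing names cleaned up, with a regular expression. This is a minimal sketch assuming that letters, digits, underscores, and hyphens are allowed, with full stops permitted only as separators (as in v01.1.docx); the function names are hypothetical, and a sanitized name should still be reviewed by a person.

```python
import re

# Letters, digits, "_" and "-", optionally separated by single full stops.
SAFE_NAME = re.compile(r"[A-Za-z0-9_-]+(\.[A-Za-z0-9_-]+)*")


def is_safe_name(name):
    """True if the name uses only the characters the convention allows."""
    return SAFE_NAME.fullmatch(name) is not None


def sanitize(name):
    """Replace each run of disallowed characters with one underscore."""
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", name)
    cleaned = re.sub(r"_+\.", ".", cleaned)  # no underscore directly before a dot
    return cleaned.strip("_")


bad = "File Names&Guidelines 2014 04 09 v1*.docx"
print(is_safe_name(bad))
print(sanitize(bad))
```

Sanitizing only removes problem characters; it does not turn a poorly chosen name into a descriptive one.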

Use YYYYMMDD or YYYY-MM-DD format for dates

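Why? Both forms follow the international standard ISO 8601 (YYYYMMDD is its basic format, YYYY-MM-DD its extended format), ensuring interoperability. Because the year comes first, dates written this way sort in chronological order when file names are sorted alphabetically.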

DO: 20240430 or 2024-04-30
DON’T: 30-04-2024, 04302024
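The sorting benefit is easy to demonstrate. The sketch below converts day-first dates to YYYYMMDD and compares how each form sorts as plain text; the helper name and the sample dates are illustrative assumptions.

```python
from datetime import datetime


def to_iso_basic(text, source_format):
    """Convert a date string in a known format to YYYYMMDD."""
    return datetime.strptime(text, source_format).strftime("%Y%m%d")


# Hypothetical dates, listed in true chronological order.
day_first = ["30-04-2024", "01-05-2024", "15-01-2025"]
iso = [to_iso_basic(d, "%d-%m-%Y") for d in day_first]

print(sorted(day_first) == day_first)  # False: text sorting scrambles day-first dates
print(sorted(iso) == iso)              # True: YYYYMMDD keeps chronological order
```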

Keep track of document versions

Either sequentially (e.g. v01, v02, ...) or with a unique date and time (e.g. 20140403_182206).

Why? Next year, will you remember what changed from one file to the next, and in what order?

DO: FileName_Guidelines_20140409_v01.docx
DON’T: FileName_Guidelines_20140409_Review.docx AND FileName_Guidelines_20140409_Investigation.docx
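Sequential versions can be incremented automatically so that the zero padding stays consistent. This sketch assumes the version appears as _vNN immediately before the file extension, as in the DO example; the function name is hypothetical.

```python
import re

# Matches "_v" plus digits when they sit directly before the final extension.
VERSION = re.compile(r"_v(\d+)(?=\.[^.]+$)")


def next_version(filename):
    """Return the file name with its sequential version number increased by one."""
    match = VERSION.search(filename)
    if match is None:
        raise ValueError(f"no _vNN version found in {filename!r}")
    digits = match.group(1)
    bumped = str(int(digits) + 1).zfill(len(digits))  # keep the zero padding
    return filename[:match.start(1)] + bumped + filename[match.end(1):]


print(next_version("FileName_Guidelines_20140409_v01.docx"))
# prints FileName_Guidelines_20140409_v02.docx
```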

Make folder hierarchies as simple as possible

Recommended: at most 3 - 4 levels deep

Why? Complex folder hierarchies are harder to navigate and offer more opportunities for filing errors. System back-ups may take longer.

DO: F:/Env/LIBR/DataMgmt_FileFormats_20140409_v01.docx
DON’T: F:/Environment/Library/Woodward/Data//Mat/Draft6/2014/-DataMgmt_FileFormats_20140409_v01.docx
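Folder depth can be counted from the path itself, which makes it possible to flag files that are filed too deeply. The following is a rough sketch assuming Windows-style paths like the examples above and a limit of 4 levels; the names MAX_LEVELS and folder_levels are illustrative.

```python
from pathlib import PureWindowsPath

MAX_LEVELS = 4  # upper end of the guide's 3 - 4 level recommendation


def folder_levels(path):
    """Count the folders between the drive (or starting point) and the file."""
    parsed = PureWindowsPath(path)
    folders = parsed.parent.parts
    # The drive root (for example "F:\\") is not counted as a folder level.
    return len(folders) - (1 if parsed.anchor else 0)


for candidate in ["F:/Env/LIBR/report_v01.docx", "F:/a/b/c/d/e/report_v01.docx"]:
    levels = folder_levels(candidate)
    print(candidate, levels, "too deep" if levels > MAX_LEVELS else "ok")
```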

(Adapted from: UBC data management planning documentation)

Version control is a way to track revisions of a data set or a process. If your research involves more than one person, it is essential. Record every change to a file, no matter how small, either through your file naming convention and a log file or with version control software.

You can track versions manually by including a version indicator in the file name, such as v01, v02, or v1.4. A common convention is to use whole numbers for major revisions and decimals for minor ones.
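The major and minor convention can be sketched as a small helper. This assumes version labels of the form v1 or v1.4 and that a major revision resets the minor number to 0; both are assumptions for illustration, and the function name is hypothetical.

```python
import re


def bump(version, level):
    """Increase the major or minor part of a label such as 'v1.4'."""
    match = re.fullmatch(r"v(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version label: {version!r}")
    major = int(match.group(1))
    minor = int(match.group(2) or 0)
    if level == "major":
        return f"v{major + 1}.0"  # assumption: a major revision resets the minor part
    if level == "minor":
        return f"v{major}.{minor + 1}"
    raise ValueError("level must be 'major' or 'minor'")


print(bump("v1.4", "minor"))  # prints v1.5
print(bump("v1.4", "major"))  # prints v2.0
```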

Several software programs are designed for tracking versions, including Git, Mercurial, and Apache Subversion (with clients such as TortoiseSVN and SmartSVN).

File sharing software can also be used to track versions. Google Docs records version changes as well.

As you think through how to manage this step, keep the following issues in mind:

record every change to a file, no matter how small
keep track of changes to files
use file naming conventions
consider how headers are used inside the file
understand how log files are used
use, or investigate the use of, version control software (Git, Subversion, Mercurial)
use, or investigate the use of, file sharing software (Google Docs)

Source: The University of Virginia Library
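Where dedicated version control software is not in use, a simple log file can record each change. The sketch below builds one log entry per saved version and uses a SHA-256 checksum to detect whether the content changed at all; the field names and the function log_entry are hypothetical, not a standard log format.

```python
import hashlib
from datetime import datetime, timezone


def log_entry(filename, data, note, previous_checksum=None):
    """Describe one saved version of a file for a change log."""
    digest = hashlib.sha256(data).hexdigest()
    return {
        # Same date and time pattern as the guide's 20140403_182206 example (UTC).
        "timestamp": datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S"),
        "file": filename,
        "sha256": digest,
        "changed": digest != previous_checksum,
        "note": note,
    }


first = log_entry("survey_v01.csv", b"id,score\n1,4\n", "initial export")
second = log_entry("survey_v02.csv", b"id,score\n1,5\n", "corrected score", first["sha256"])
print(second["changed"], second["note"])
```

Each entry could be appended as one line to a plain-text or CSV log kept in the project folder.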

File Formats

A computer file format is a particular way of encoding information within a computer file so that it can be recognized by an application. File formats are usually indicated by the file name extension, typically a full stop followed by three or four letters. Examples: .csv, .pdf, .txt

Open File Formats (.TIFF, .PDF, .XML, .MP3)

An open file format is one where the format specification is available to anyone, free of charge, so that the specification can be used in a variety of software without any intellectual property right limitations. Because the file specifications are publicly available, the open-source software community can ensure that data stored in these file formats remain accessible over the long term.

Open formats are recommended for file preservation purposes because they do not require specific software to access. Choose open file formats in order to:

increase your ability to open and read your files in the future
make your data accessible to more researchers immediately

Proprietary File Formats (.DOCX, .RAW, .DWG, .PSD)

Proprietary file formats are designed to work with software from a particular vendor. Their specifications are often not freely available, so when the software is no longer supported, files in that format may become difficult or impossible to read.
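A file's extension can be read programmatically to flag files that may need converting before long-term storage. This is a minimal sketch: the two sets below contain only the example extensions named in this guide, not an authoritative registry, and the function name is hypothetical.

```python
from pathlib import PurePath

# Example extensions taken from this guide's headings only.
OPEN_EXAMPLES = {".tiff", ".pdf", ".xml", ".mp3"}
PROPRIETARY_EXAMPLES = {".docx", ".raw", ".dwg", ".psd"}


def classify(filename):
    """Label a file by its extension as 'open', 'proprietary', or 'unlisted'."""
    suffix = PurePath(filename).suffix.lower()  # ".TIFF" and ".tiff" are treated alike
    if suffix in OPEN_EXAMPLES:
        return "open"
    if suffix in PROPRIETARY_EXAMPLES:
        return "proprietary"
    return "unlisted"


for name in ["scan_20240430.TIFF", "figure_v02.psd", "notes_20240430.txt"]:
    print(name, classify(name))
```

An extension is only a hint about the format; renaming a file does not convert its contents.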

Recommended File Formats

E-Books: EPUB
Images: JPG, PNG, PDF, TIFF, BMP
Sound: MP3, FLAC
Text: TXT (ASCII or UTF-8 encoded), CSV, PDF/A
Video: MPG, MOV, AVI
Spreadsheets: CSV
Medical Images: DICOM
Markup: XML, HTML
Data interchange: JSON

Note: Some research disciplines and industries treat a specific proprietary file format as a de facto standard which you may wish to follow.

Source: UBC Library.
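Tabular data held in memory can be written to the recommended CSV and JSON formats using only the Python standard library. The sketch below is illustrative: the function names and the sample records are hypothetical, and real exports should also be accompanied by a README or data dictionary.

```python
import csv
import io
import json


def rows_to_csv(rows):
    """Serialize a list of dictionaries as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def rows_to_json(rows):
    """Serialize the same records as human-readable JSON text."""
    return json.dumps(rows, indent=2, ensure_ascii=False)


# Hypothetical records, for illustration only.
records = [{"participant_id": "P001", "grip_kg": 31.5}]
print(rows_to_csv(records))
print(rows_to_json(records))
```

When saving these strings to disk, opening the file with encoding="utf-8" keeps the text encoding explicit.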

Managing file formats for data curation

Data Curation Primers
Source: Data Curation Network
These guides explain how curators can best preserve data files in a variety of formats, including: Excel, SPSS, R, Atlas.ti, and many more.

KPU Library
