Codebook

A codebook is a document (typically a table) describing the variables found in a data set. Its purpose is to record detailed information on each variable. The following information can commonly be found in a codebook:

  • ID variable(s): which variable(s) contain the unique observation identifier (number or alphanumeric combination)?
  • Data collection variables: which variables contain data collection information (date of collection, place, researcher, etc.)?
  • Variable name and description: what is the variable name in the data set? What is its full description? Variable names are typically short to facilitate analysis and must follow software-specific rules (for instance, they may not include special characters or spaces). Full descriptions identify the variable in more detail and may include definitions or explanations of acronyms. If the variable is a survey question, the exact wording of the question and its instructions may also be given here.
  • Variable type: is the variable categorical, ordinal, continuous or text? Recording the type makes it possible to check that the software used for storage or analysis identifies the variable as such.
  • Variable values: what are the possible values of the variable (categories or numeric range)? If the variable is categorical, what are the labels corresponding to each category? For instance, gender may be encoded as 1/2, with 1 corresponding to "Women" and 2 to "Men".
  • Variable unit: what is the variable unit (percentage, kilograms, number of people, etc.)?
  • Missing values: how are missing values indicated? Recording this makes it possible to check that the software used for storage or analysis identifies these values as missing. Different types of missing values can be indicated in different ways, for instance to distinguish observations for which a specific variable should be empty (for consistency reasons or due to a filter) from observations for which a value was expected but none was encoded (data input mistake, non-response, etc.).
  • Variable processing: is the variable the result of a data processing step? Is it a score, an index or the result of a computation? Was it recoded based on other variables? Was it standardised or otherwise transformed?
  • Variable base: which population is the variable based on? Is the data filtered or limited to a sub-group of observations? What is the base size?
  • Variable links: is the variable standalone or should it be analysed together with other variables? For instance, a multiple-choice question in a survey that allows several answers needs to be encoded in several related variables, and a follow-up question needs to be analysed taking the previous answer into account.
  • Weights: are there any weight variables? How were they created? When should they be used?
  • Typologies or classifications: is the variable based on an existing classification? What is it and what are the sources or references?
  • Technical information: what is the variable width and specific variable type in the software used for storage/analysis? What are the decimal and thousands separators? What is the number of decimals?
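
As a rough illustration of how several of the items above (name, description, type, values, unit, missing values) could be recorded in machine-readable form, here is a minimal Python sketch. The names `CodebookEntry`, `classify`, `check_record` and `CODEBOOK`, the variable `weight_kg` with its range, and the missing-value codes -9 and -8 are hypothetical choices made for this example. Only the 1/2 gender coding comes from the text above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CodebookEntry:
    """One codebook row describing a single variable (hypothetical layout)."""
    name: str
    description: str
    var_type: str                                   # "categorical", "ordinal", "continuous" or "text"
    labels: dict = field(default_factory=dict)      # category code -> label
    value_range: Optional[tuple] = None             # (minimum, maximum) for numeric variables
    unit: Optional[str] = None
    missing_codes: dict = field(default_factory=dict)  # missing-value code -> reason


def classify(entry, value):
    """Return "missing", "valid" or "invalid" for one value of one variable."""
    if value in entry.missing_codes:
        return "missing"
    if entry.var_type in ("categorical", "ordinal"):
        return "valid" if value in entry.labels else "invalid"
    if entry.var_type == "continuous":
        if not isinstance(value, (int, float)):
            return "invalid"
        if entry.value_range is not None:
            low, high = entry.value_range
            return "valid" if low <= value <= high else "invalid"
        return "valid"
    # Remaining case: free-text variables only need to be strings.
    return "valid" if isinstance(value, str) else "invalid"


# Example codebook; the missing-value codes are assumptions for illustration.
CODEBOOK = {
    "gender": CodebookEntry(
        name="gender",
        description="Gender of the respondent",
        var_type="categorical",
        labels={1: "Women", 2: "Men"},
        missing_codes={-9: "non-response", -8: "not applicable (filtered)"},
    ),
    "weight_kg": CodebookEntry(
        name="weight_kg",
        description="Body weight of the respondent",
        var_type="continuous",
        value_range=(0, 500),
        unit="kilograms",
        missing_codes={-9: "non-response"},
    ),
}


def check_record(record, codebook):
    """Check one observation (a dict) against every variable in the codebook."""
    return {name: classify(entry, record.get(name))
            for name, entry in codebook.items()}
```

Calling `check_record({"gender": -9, "weight_kg": 900}, CODEBOOK)` would flag `gender` as missing and `weight_kg` as invalid. Keeping separate missing-value codes per reason is one way to preserve the distinction between "not applicable" and "non-response" described above.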

Useful resources:

https://ukdataservice.ac.uk/media/622417/managingsharing.pdf

https://libguides.library.kent.edu/SPSS/Codebooks

http://www.medicine.mcgill.ca/epidemiology/joseph/pbelisle/CodebookCookbook/CodebookCookbook.pdf