Clean your data

Some guidelines for cleaning your data: 

  • Open the file and compare it with the data description to make sure they match. Do the data comply with the GDPR?
  • How are missing values encoded? Be careful: there can be several types of missing values ("I don’t want to answer" is not the same as "I don’t know the answer").
  • Individuals/observations should be in rows, not in columns.
  • Column names should be written on a single line, not two (this makes importing the data easier).
  • Remove unnecessary rows and columns (and avoid leaving empty columns in the middle of the data).
  • Check the data types: there should be only one per column.
  • If you import the data, check that they are the same before and after the import. For example, you should have the same number of rows and columns.
  • For numerical variables, compute summary statistics (minimum, mean, main quantiles, maximum) and draw plots (boxplots, histograms, ...), then check that the observed values are plausible (e.g. a number of heartbeats cannot be negative).
  • Check the levels of the categorical variables, especially for differences in spelling (e.g. "Belgium" is different from "belgium").
  • Look for duplicated observations.
  • Check that your variables are consistent with each other (e.g. marital status = single, but in the "other marital status" column the person wrote "married").
  • Always document the changes you make to your data. Keep this record in a file stored with your data.
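
Several of the checks above can be automated. The following is a minimal sketch in Python using only the standard library. The sample data, the column names, and the missing-value codes ("NA" for an unknown value, "refused" for a declined answer) are assumptions made for illustration; adapt them to your own data description.

```python
import csv
import io
import statistics
from collections import Counter

# Hypothetical sample data: column names and values are assumptions.
RAW = """id,country,heart_rate,marital_status
1,Belgium,72,single
2,belgium,-5,married
3,France,NA,single
3,France,NA,single
4,Belgium,80,refused
"""

# Assumed encoding: distinct codes for distinct kinds of missing values.
MISSING_CODES = {"NA": "unknown", "refused": "declined to answer"}


def load_rows(text):
    """Read CSV text into a list of dicts, one per observation (row)."""
    return list(csv.DictReader(io.StringIO(text)))


def count_missing(rows, column):
    """Count each kind of missing value found in a column."""
    return Counter(
        MISSING_CODES[r[column]] for r in rows if r[column] in MISSING_CODES
    )


def numeric_summary(rows, column):
    """Minimum, mean and maximum of the non-missing values of a column."""
    values = [float(r[column]) for r in rows if r[column] not in MISSING_CODES]
    return {
        "min": min(values),
        "mean": statistics.mean(values),
        "max": max(values),
    }


def inconsistent_levels(rows, column):
    """Group spellings that differ only by case or surrounding spaces."""
    groups = {}
    for r in rows:
        groups.setdefault(r[column].strip().lower(), set()).add(r[column])
    return {
        key: sorted(spellings)
        for key, spellings in groups.items()
        if len(spellings) > 1
    }


def duplicated_rows(rows):
    """Return each observation that appears more than once."""
    counts = Counter(tuple(r.items()) for r in rows)
    return [dict(items) for items, n in counts.items() if n > 1]


rows = load_rows(RAW)
print(len(rows), "rows,", len(rows[0]), "columns")  # compare with the source file
print(count_missing(rows, "heart_rate"))
print(numeric_summary(rows, "heart_rate"))
print(inconsistent_levels(rows, "country"))
print(duplicated_rows(rows))
```

On this sample, the summary reports a minimum heart rate of -5, which is impossible and should be investigated. The level check groups "Belgium" with "belgium", and the duplicate check returns the repeated observation with id 3. Any correction made afterwards should be recorded in the change log kept with the data.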