Data cleaning


Some guidelines for cleaning your data: 

  • Open the file and compare it with the data description to make sure they match. Do the data comply with the GDPR?
  • How are missing values encoded? Be careful: there can be several types of missing values ("I don't want to answer" is not the same as "I don't know the answer").
  • Individuals/observations should be in rows, not in columns.
  • The column names should be written on a single line, not two (this makes a later import easier).
  • Remove unneeded rows and columns (and avoid leaving empty columns in the middle of the data).
  • Look at the data types: there should be only one per column.
  • If you import the data, check that they are the same before and after the import. For example, you should have the same number of rows and columns.
  • For numerical variables, compute summary statistics (minimum, mean, main quantiles, maximum) and draw plots (boxplots, histograms, ...), then check that the observed values are possible (e.g. a negative number of heartbeats is not).
  • Check the levels of the categorical variables, especially for differences in the way they are written (e.g. Belgium is different from belgium).
  • Look for duplicated observations.
  • Check for consistency between your variables (e.g. marital status = single, but in the column "other marital status" the person wrote married).
  • Always document the changes you make to your data. Keep this record in a file stored with your data.
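
Several of the checks above can be automated. The following is a minimal sketch in Python using only the standard library. The sample data, the column names and the set of missing-value codes are all invented for illustration; take the real ones from your own data description. Treating a heart rate of zero or below as impossible is likewise an assumption made for the example.

```python
import csv
import io
import statistics
from collections import Counter

# Hypothetical raw file content, invented for illustration only.
RAW = """id,country,heart_rate,marital_status
1,Belgium,72,single
2,belgium,-5,married
3,France,NA,single
3,France,NA,single
4,Belgium,80,refused
"""

# Assumed missing-value codes; the real ones come from the data description.
MISSING_CODES = {"", "NA", "refused", "unknown"}

rows = list(csv.DictReader(io.StringIO(RAW)))

# Import check: same number of rows before and after reading the file.
n_data_lines = len(RAW.strip().splitlines()) - 1  # minus the header line
same_row_count = len(rows) == n_data_lines

# Missing values: count each missing-value code separately, per column.
missing_counts = {
    col: Counter(r[col] for r in rows if r[col] in MISSING_CODES)
    for col in rows[0]
}

# Numerical variable: summary statistics and impossible values.
observed = [r for r in rows if r["heart_rate"] not in MISSING_CODES]
heart_rates = [float(r["heart_rate"]) for r in observed]
summary = {
    "min": min(heart_rates),
    "mean": statistics.mean(heart_rates),
    "quartiles": statistics.quantiles(heart_rates, n=4),
    "max": max(heart_rates),
}
impossible = [r["id"] for r in observed if float(r["heart_rate"]) <= 0]

# Categorical variable: levels that differ only by case or surrounding spaces.
by_normalised = {}
for level in Counter(r["country"] for r in rows):
    by_normalised.setdefault(level.strip().lower(), set()).add(level)
inconsistent_levels = {k: v for k, v in by_normalised.items() if len(v) > 1}

# Duplicated observations: rows whose values are all identical.
row_counts = Counter(tuple(r.values()) for r in rows)
duplicates = [row for row, n in row_counts.items() if n > 1]

print("same row count:", same_row_count)
print("missing codes per column:", missing_counts)
print("heart rate summary:", summary)
print("ids with impossible heart rate:", impossible)
print("inconsistent country levels:", inconsistent_levels)
print("duplicated rows:", duplicates)
```

The script only reports problems and does not modify the data, so each correction can be decided by hand and written down in the change log kept with the data.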