Data cleaning: summarising variables

When checking and cleaning data, every variable – every column of data – needs to be looked at.  For a water quality dataset, which in my case comes as an excel file, there could be tens of thousands of sampling occasions (rows) and measurements of perhaps 100 water quality constituents (columns). Clearly, an automated approach to checking is going to reduce the tedium and speed things up.

My current method is to initially treat all data as text and then summarise each column to provide:

  • Number of observations (number of rows)
  • Number of missing values (blank rows)
  • Percentage missing
  • Number of blanks (rows that only contain whitespace).  These will need to be recoded as missing.
  • Number of zeros (exactly zero).  Often zeros are disguised missing values and should be recoded as missing.
  • Number near zero (to take account of floating point issues)
  • Number of numeric values (those that can be converted from text to a number)
  • Summary of non-numeric values
  • Number of left censored values (e.g. < 0.01)
  • Number of right censored values (e.g. > 24,000)
  • A selection of quantiles (1%, 5%, 10%, 90%, 95%, 99%)
  • Maximum, minimum, mean, median.

I’ve selected this list based on data issues that have tripped me up in the past.

The summary of non-numeric values is often interesting.  A contrived, although realistic example for a water quality variable is shown below.

No Non-numeric Count
1 < 2 304
2 <1 65
3 <2 303
4 <5 1
5 ‘no result’ 1
6 ‘no result ‘(with a trailing space) 1
7 NR 21
8 ~30 1
9 7..6 1
10 `90 1

There were 673 left censored values; this is just a count of the number of rows that contain ‘<‘.  The table shows that most of these are ‘<2’ but sometimes this is coded with a space between the less-than sign and the 2 and sometimes not.  Clearly code for further analysis needs to be robust to whether there is a space or not.  Missing data are coded in 4 different ways: the strings ‘no result’ and ‘no result ‘ (i.e. with and without a trailing space), ‘NR’ and blank rows.  All these need to be recoded to missing.  There are typos, ‘7..6’, ‘`90’ and an indication of an approximate value ~30.  All these things can be fixed or addressed in some way, now that we know about them.

A function to summarise the columns in a data set is available here.


My approach was motivated by a talk given by Nicholas Tierney at a recent conference “Be a Hawk not a Turkey: How a bird’s eye view of your data can streamline data analysis” (Nick’s slides).  The upshot is, it is very helpful to have an overview of your data rather be scratching around like a bush turkey.







Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s