Data cleaning: dates and times

Dealing with dates can be very tricky.  Here are a few issues I’ve found with water quality and flow data, and how to fix them.  This builds on an earlier post.

Excel dates

Excel codes dates as the number of days since a certain date.  Sounds straightforward, but there are known issues.  Excel thinks 1900 was a leap year.  It wasn’t (see Wikipedia).  Also the origin of the Excel dates may vary between the Windows and Mac versions of Excel.  See  The microsoft support article explains how to check and change the default.

As numeric values

The dates may look ok in Excel but when importing data from a spreadsheet they are likely to end up as numeric values.  For example,  a date shown on an Excel spreadsheet as 17-Aug-16, corresponds to the value 42599. This is the number of days since 30-Dec-1899.  Excel uses day one as 1 Jan 1900 so day zero is 31 Dec 1899 but we need to use 30 Dec 1899 in calculations to correct for the fact that Excel erroneously adds a day for 29 Feb 1900.


In R we can convert the numeric values to dates.

as.Date(42599, origin = "1899-12-30")
# [1] "2016-08-17"

Take particular care with any dates that occur prior to 29 Feb 1900 as the origin will need to be changed to ‘1899-12-31’.

For example, 1 Feb 1900 is day number 32 in Excel.

as.Date(32, origin = "1899-12-31")
# [1] "1900-02-01"

Excel doesn’t seem to like dates before 1 Jan 1900 but they can be stored as text and then converted.

Where dates have associated times, Excel codes times as fractions of a day.  For example,  17 Aug 2016 08:00:00 has a numeric value of 42599.3333.  R treats time as seconds since 1 Jan 1970.  To do the conversation:

as.POSIXct(42599.3333333333333*24*60*60, tz = 'UTC', origin = as.POSIXct('1899-12-30', tz = 'UTC') )

#[1] "2016-08-17 08:00:00 UTC"

Here I’ve set the timezone of the origin and the number to be converted to UTC.  This is avoids time changes associated with daylight saving.

As strings

Dates and times from Excel may also be stored as strings e.g. ’20-12-2003 12:34:10′.  In Australia we generally use day-month-year for date but take care that there hasn’t been an inadvertent change to month-day-year which is common in the US.

The string format may vary.  For example, a file I’m working on has dates prior to 1900 stored in formats like: ‘1899-11-07 03:00:00′; while those after 1900 are formatted ‘28/08/1902 14:00′.  This may be fixable using the ‘custom’ option in the ‘Format cells’ menu in excel.  First, format everything as a date, then use the custom option which will allow specification of a format like yyyy-mm-ddd hh:mm:ss.

Converting strings to dates is straightforward using the lubridate package

lubridate::ymd_hms('1898-01-23 09:00:00')
[1] "1898-01-23 09:00:00 UTC"

Daylight saving time

You can get confusing behaviour with the change to and from daylight saving time.
For example, in Melbourne, Australia, the time 2:30 am on 7 Oct, 2012 doesn’t exist because clocks where moved forward one hour from 2 am to 3 am.  When the clocks are moved back in March the period between 2 am and 3 am occurs twice so times in this interval are ambiguous. See this discussion on stackoverflow.

Most data loggers will use standard time i.e. they are not affected by daylight saving, but people visiting the site may use daylight saving time when they make notes or comments.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s