# Data cleaning: times

I’m checking and cleaning some water quality data that includes a column of times when samples were collected.  Although this sounds straightforward, there are lots of surprises and complications.  Examples are shown the figures below.  Figure 1 shows the data as viewed in an excel spreadsheet.

Figure 1: Sample time as viewed in Excel.  Issues are highlighted

Although times appear to be in the format HH:MM:SS or HH:MM this can be misleading.  Excel stores times as fractions of 24 hours so 12:00 (noon) would be stored as 0.5.  However, just because a cell looks like it contains a time value doesn’t mean the underlying representation is consistent or represents what the sampler intended when they entered the data.

Figure 1 shows several issues with the time records.  These are highlighted in different colours:

1. Blanks need to be coded as missing (orange rectangle) – this is generally automatic.
2. Cells that contain whitespace need to be coded as missing – just looking at the data in Excel, it can be difficult to distinguish between blank cells and those that contain spaces.
3. Zeros may need to be coded as missing (blue) or they may refer to noon (see below).
4. Times shown early in the morning (red) are probably in the early afternoon so we need to add 12 hours.
5. Some times are stored as characters not fractions of a day (green) – they need to be converted to numeric values.
6. Some cells have contents that look like a time but the underlying representation may include the date (or not) see Figures 2 and 3.

Figure 2: Time appears to be 11:20 but the underlying representation includes the date i.e. day number 39101 which is 19 Jan 2007

Figure 3: This cell contains also contains 11:20 but the underlying representation is just the fraction of a day

Note there is ambiguity between conditions 3 and 4 i.e. are times that have been entered as zero, really referring to midnight, noon or are they meant to be missing.  What about times in the early morning, should they be moved to the early afternoon?

This is illustrated in the histogram of times based on a real data set (Figure 4).  Most of the sampling times are between 6 am and 4 pm which seems reasonable.  There are 121 samples where the time has been specified as zero and a few that have times between midnight and 4 am.  If I had to guess, I would recode the zero times as missing and move the early morning times to early afternoon.

Figure 4: Histogram of recorded sample times

Below is time data from another site that has some  samples collected in the evening (Figure 5).  Evening times could be plausibly associated with sampling in response to an incident.

Figure 5: Histogram of recorded sample times, showing nexpected times in the early morning and late evening

I’ve written a function that attempts to deal with these issues. See this gist.

The snippet of code to draw figures 4 and 5 is:

```my_labels = c('12 midnight', str_c(seq(2,10,2), ' am'), '12 noon', str_c(seq(2,4,2), ' pm'))

mutate(Sample_time_hrs = Time_convert_excel(Sample_time) * 24) %>%
ggplot(aes(x = Sample_time_hrs)) + geom_histogram(fill = 'blue', color = 'black') +
scale_x_continuous(name = 'Sampling time', breaks = c(seq(0, 20, 2)), labels = my_labels)annotate(geom = 'segment', x = 0, y = 120, xend = 1, yend = 300, color = 'red' ) +
annotate(geom = 'text', x = 1, y = 350, label = 'Missing? \n or move to 12 noon?') +
annotate(geom = 'segment', x = 1.5, y = 0, xend = 3, yend = 200, color = 'red' ) +
annotate(geom = 'text', x = 4, y = 200, label = 'Add 12 hours')
```

Another check to consider that as.numeric will produce NAs if a text value cannot be correctly converted. A warning will be issued as follows.

```as.numeric('08:00:00')

Warning message:
NAs introduced by coercion
```

We can check any records that are NA after conversion to numeric that were not NA before the conversion.  This shows up problems that may need fixing.

```WQ %>%
mutate(time.test = as.numeric(time)) %>%
filter(is.na(time.test) & !is.na(time)) %>% # missing after, not missing before conversion
select(time)```

If time is missing but we know the date, then adding date + time to produce a time-stamp will produce a missing value.  This may need to be addressed in some way.  One approach is to try and impute the times, perhaps just set them to noon or to the average sampling time, or randomly select a time from that that have been recorded. Then at least there would be a time-stamp that could be used for plotting.