# Data cleaning: DO

Dissolved oxygen (DO) is a commonly measured water quality parameter and, like all data, DO values need to be checked and cleaned before use.

Two scatter plots show some issues in the data I’m currently looking at.  Most of the DO values are below 30 mg/L, but a few sit around 90 mg/L.

Figure 1: Scatter plot of DO against date of sample

Figure 2: Scatter plot of DO against temperature

There is a physical limit to DO in natural waterways.  The amount of oxygen that can dissolve in water is a function of temperature.  A function to calculate saturated DO is provided below (and available as a gist), and the resulting curve is shown in Figure 3.  There are also tables and web calculators available.

This function gives the 100% saturated concentration.  It is possible for DO to be higher. This tech note from YSI Environmental shows DO at about 220% in a small farm pond with very high algal content.  The temperature at the time was about 6 °C, so this would correspond to an oxygen concentration of roughly 25 mg/L.  I’ve added % saturation curves to the DO-temperature scatter plot and highlighted DO values that are clearly physically impossible (Figure 4).  There are also a number of observations between 200% and 300% saturation which need to be checked.

```r
# Function to calculate 100% saturated dissolved oxygen in water (mg/L)
# as a function of temperature in degrees Celsius

Calc_DOsat100 <- function(temp) {

  # temp = temperature in degrees C
  # Relationship between temperature and dissolved oxygen as defined by the APHA:
  # American Public Health Association (1992)
  # Standard methods for the examination of water and wastewater. 18th ed. Washington DC.

  # Required constants
  C1 <- 139.34411
  C2 <- 1.575701e+5
  C3 <- 6.642308e+7
  C4 <- 1.243800e+10
  C5 <- 8.621949e+11

  # Convert to absolute temperature (kelvin)
  Ta <- temp + 273.15

  # Saturated DO concentration in mg/L (vectorised over temp)
  exp(-C1 + C2/Ta - C3/(Ta^2) + C4/(Ta^3) - C5/(Ta^4))
}

# Example: Calc_DOsat100(20) returns about 9.1 mg/L
```

Figure 3: Saturated dissolved oxygen concentration as a function of temperature.

Figure 4: Scatter plot of DO against temperature.  Suspect points are highlighted.
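The flagging behind Figure 4 can be sketched roughly as follows. This is a minimal illustration, not the code used for the figure: the data frame `wq`, its column names `temp` and `DO`, the example values, and the 200% and 300% cut-offs are all assumptions made for the example. The saturation function is repeated in compact form so the snippet runs on its own.

```r
# Compact copy of the APHA saturation function so this snippet is self-contained
Calc_DOsat100 <- function(temp) {
  Ta <- temp + 273.15
  exp(-139.34411 + 1.575701e+5 / Ta - 6.642308e+7 / Ta^2 +
        1.243800e+10 / Ta^3 - 8.621949e+11 / Ta^4)
}

# Made-up example data: `temp` in degrees C, `DO` in mg/L (hypothetical names)
wq <- data.frame(
  temp = c(6, 12, 20, 25, 18),
  DO   = c(11.8, 25.0, 8.5, 91.0, 24.0)
)

# Percentage saturation implied by each observation
wq$pct_sat <- 100 * wq$DO / Calc_DOsat100(wq$temp)

# Illustrative cut-offs (assumptions): above 300% treated as impossible,
# 200% to 300% flagged for checking, everything else accepted
wq$flag <- cut(wq$pct_sat,
               breaks = c(-Inf, 200, 300, Inf),
               labels = c("ok", "check", "impossible"))

print(wq)
```

Working in % saturation rather than raw concentration means one pair of thresholds applies across all temperatures; the thresholds themselves should be chosen to suit the waterway being studied.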

It seems likely that the points that plot around 90 mg/L are actually percentage saturation values rather than concentrations.  A bit more investigation showed this was true in this case; the dataset has a column for % saturation and, for these points, the DO concentration and % saturation values are the same.
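A check like this can be sketched in a few lines. The column names `DO` (mg/L) and `DO_pct` (% saturation) and the example values are hypothetical, and setting the affected concentrations to `NA` is only one possible way to handle them.

```r
# Made-up example data; `DO` and `DO_pct` are hypothetical column names
wq <- data.frame(
  DO     = c(8.5, 91.0, 10.2, 88.5),
  DO_pct = c(93.0, 91.0, 98.0, 88.5)
)

# Rows where the concentration column holds the same number as % saturation
wq$pct_in_DO <- !is.na(wq$DO) & !is.na(wq$DO_pct) &
  abs(wq$DO - wq$DO_pct) < 1e-6

# Set those concentrations to NA rather than guessing a corrected value
wq$DO_clean <- ifelse(wq$pct_in_DO, NA, wq$DO)

print(wq)
```

Keeping the original `DO` column alongside `DO_clean` leaves a record of which values were changed.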

R code to produce the figures is shown in this gist.  There is also a function to plot a graph similar to Figure 4 here.