Load estimation, creating wet and dry events in a flow series

Pollutant loads carried by streams can be estimated from measurements of flow and pollutant concentrations. Often, continuous flow measurements are available, and there are discrete samples of pollutants from grab samples or autosamplers.

There are many different approaches to estimating loads from this type of data (see the references below). Here I consider a method based on stratifying a flow record, and water quality samples, into wet and dry periods. Average concentrations are estimated for these wet and dry periods and the total load is calculated as; the wet period flow volume x wet period average concentration, plus dry period flow volume x dry period average concentration. The motivation for this approach is that pollutant concentrations often differ between wet and dry conditions so stratifying flows, and water quality samples, produces more accurate load estimates.

This method is also referred to as the EMC/DWC approach (EMC – Event mean concentration, DWC – dry weather concentration) and has been used in a number of studies e.g. Chiew and Scanlon (2002); Chiew et al. (2002); Fletcher and Deletic (2007). It is also a commonly used approach in water quality models such as EMSS (Environmental Management Support System) and Source.

This blog only looks at one aspect of EMC/DWC approach, the method used to stratify flows to identify wet and dry periods.

Consider the example flow series shown in Figure 1. We need to determine a flow threshold to separate wet from dry periods. There are a range of approaches that have been used as documented in studies. For example, Chiew and Scanlon (2002) used the 5% exceedance flow as a cutoff for wet periods and the 20% exceedance percentile as a cut off for dry periods (Figure 2). Wet periods were when the flow was greater than the 5th percentile flow; dry periods  when the flow was below the 20th percentile flow. This means there were also times that were not categorised as either wet or dry so it is not clear how water quality samples taken during this intermediate period were classified.

events-1

Figure 1: Example flow series

events-2

Figure 2: Thresholds to determine wet and dry periods used by Chiew and Scanlon (2002)

The approach adopted by Fletcher and Deletic (2002) was to classify the whole flow record as either wet or dry by selecting a threshold that maximised the total number of wet and dry events. Figure 3 shows wet and dry periods for a flow threshold of 20. There are 11 events in total, 6 dry and 5 wet. Wet periods, blue on the figure, occur when the flow is above the threshold.

The number of events is a function of the flow threshold as shown on Figure 4. If the threshold is above the highest flow, the whole record would be classified as a single dry event. As the threshold is reduced, the number of events increases until a maximum is reached. This behaviour is shown on Figure 4 which is based on calculating the number of events at 500 thresholds, equally spaced between the maximum and minimum flow.

events-3

Figure 3: Wet and dry periods for a flow threshold of 20

events-4

Figure 4: Number of events as a function of flow threshold

Based on this analysis, the maximum number of events is 27 at a corresponding threshold of 9.58. The designation of wet and dry events is shown in Figure 5.

A key issue in using the Fletcher-Deletic approach to load estimation is to find a robust way of determining the threshold that maximises the number of events. Figure 4 shows that using an optimisation approach to find the threshold is not straightforward, there are many local maximum and large plateaus where there is no change in the number of events for range of thresholds. A grid search, which produced the results shown in Figure 4, works but may not be practical over a large range of flows and may miss the global maximum.

events-5

Figure 5: A threshold flow of 9.58 gives the maximum number of wet and dry events; 27

The following section considers 3 approaches to finding the optimal flow threshold:

  1. Optimisation using Brent’s method
  2. Optimisation using Differential Evolution, a sophisticated method that can search for a global optimum.
  3. A combination of grid search and optimisation.

1. Optimisation using Brent’s method

Brent’s method is a fast and reliable approach to one dimensional optimisation. The method is available via the optimise function in R. (As an aside: Richard Brent is an Australian mathematician and computer scientist).

The following snippet of R code first defines a function to calculate the number of events given a threshold and flow series, and then calls optimise to find the threshold corresponding to the maximum number of events.

Count_events <- function(thresh, Q){ #function to count the number of events, given flow data Q and a threshold, thresh
  Q_let <- Q < thresh
  -sum(ifelse(Q_let != lag(Q_let) | is.na(lag(Q_let)), 1, 0)) # negative so this becomes a minimisation problem
}

out <- optimise(Count_events, Q = Q$flow, lower = min(Q$flow), upper = max(Q$flow))

out$minimum # threshold
out$objective # number of events
## [1] 159.1659
-out$objective # number of events
## [1] 3

optimise is returning a maximum of 3 events at a flow threshold of 159.2. This is clearly not the right answer as the grid search found the maximum number of events as 27 at a threshold of 9.58.

The documentation ?optimise explains the problem:

The first evaluation of f [the function to be optimised] is always at x_1 = a + (1-φ)(b-a) where (a,b) = (lower, upper) and phi = (sqrt(5) – 1)/2 = 0.61803.. is the golden section ratio. Almost always, the second evaluation is at x_2 = a + phi(b-a). Note that a local minimum inside [x_1,x_2] will be found as solution, even when f is constant in there…

In this case, I’ve set lower = minimum(flow) = 1.35 and upper = maximum(flow) = 159 (the flow series is shown in Figure 1). Therefore the first threshold the optimiser tries is 61.6 and the second is 98.9. For both these thresholds the number of events, as returned by the Count_events function, is the same, 3 (see Figure 4). The optimiser does not see the larger number of events that occur for a smaller flow. The problem is caused by the highly skewed nature of the flow series, in particular, the single large peak.

2. Optimisation using Differential Evolution

The Differential Evolution approach to global optimisation is available via the R function DEoptim (Ardia, 2015). The following function takes a vector of flows as input and returns the threshold that maximumises the number of events.

Calc_threshold_DEoptim <- function(Q) { # Q is a vector of flows

  .Count_events_min <- function(thresh, Q){
    Q_let <- Q < thresh
    -sum(ifelse(Q_let != lag(Q_let) | is.na(lag(Q_let)), 1, 0)) # Negative so we search for a minimum
  }

  DEoptim(.Count_events_min, lower = min(Q), upper = max(Q),
                        control = DEoptim.control(trace = FALSE), Q)$optim

}

out <- Calc_threshold_DEoptim(Q$flow)

out$bestmem # optimal threshold

-out$ bestval # maximum number of events
##     par1 
## 9.586742
-out$ bestval # maximum number of events
## [1] 27

Using this approach, the maximum number of events is 27 at a threshold of 9.59, which is similar to the results from the grid search.

It is also likely to be possible to tune DEoptim.  After playing around with this some more and setting trace = TRUE I think it is best to increase the increase the population of search candidates (set to 10 × the number of parameters by default) and, to save time, decrease the maximum number of iterations (set to 200 by default).  It seems that optima are easy to find, we just need to look in a lot of places to get the global optimum.  Therefore in DEoptim use control = DEoptim.control(trace = FALSE, itermax = 20, NP = 50). For more information see ?DEoptim.

3. Grid search followed by optimisation with Brent’s method

An alternative would be to start with a grid search and polish the answer with optimise. I played with the following function, and it works in this case, but it is not robust. It is very easy to return a local rather than global maximum.

Calc_threshold_optimise <- function(Q){

  # Q is a time series of flows

  if(sum(is.na(Q)) !=0) stop('Missing values in flow series')


  .Count_events_min <- function(thresh, Q){
    Q_let <- Q < thresh
    -sum(ifelse(Q_let != lag(Q_let) | is.na(lag(Q_let)), 1, 0))
  }

  flow_min <- min(Q)
  flow_max <- max(Q)

  if(isTRUE(all.equal(flow_min, flow_max))) return(NA)

  # Search integers between min and max flow

  thresh.seq <- floor(flow_min):ceiling(flow_max)
  num_events <- map_dbl(.x = thresh.seq, .f = .Count_events_min, Q = Q)


  # Then optimise in the region identified as the best

  max_index <- which.min(num_events) # location of maximum found by grid search

  my_lower <- thresh.seq[ max(max_index - 2, 1) ] # search below, but ensure index >= 1
  my_upper <- thresh.seq[ min(max_index + 2, length(thresh.seq))] # search above, restrict index

  my.interval = c(my_lower, my_upper)
  optimise(.Count_events_min, interval = my.interval,  Q)

}


out <- Calc_threshold_optimise(Q$flow)

out$minimum # threshold

-out$objective # number of events
## [1] 9.58367
-out$objective # number of events
## [1] 27

My current thinking is that to find the threshold reliably, we need to use a global search method such as Differential Evolution or a fine grid search.

Code associated with this blog is available at this gist.

Flows as a mixture of two distributions

An alternative method of determining the flow threshold is to treat the flows as a mixture of two statistical distributions.  There is a discussion of this approach on p32 of a report by Fox et al. (2005).  I haven’t had success with this approach so far because the fitted distributions do not correspond to high and flow flows for the data I’m using.

References

Ardia, D., Katharine M. Mullen, Brian G. Peterson, Joshua Ulrich (2015). ‘DEoptim’: Differential Evolution in ‘R’. version 2.2-3.

Ardia, D., Boudt, K., Carl, P., Mullen, K.M., Peterson, B.G. (2010). Differential Evolution with ‘DEoptim’: An Application to Non-Convex Portfolio Optimization. The R Journal, 3(1), 27-34. URL http://journal.r-project.org/archive/2011-1/2011-1_index.html.

Ardia, D., Ospina Arango, N., Giraldo Gomez, N. (2010). Jump-Diffusion Calibration using Differential Evolution. Wilmott Magazine, Issue 55 (September), 76-79. URL http://www.wilmott.com/.

Chiew, F., Scanlon P., Vertessy, R. and Watson, F. (2002) Catchment scale modelling of runoff, sediment and nutrient loads for the south-east Queensland EMSS. Cooperative Research Centre for Catchment Hydrology. Report 02/1 link

Chiew, F. and Scanlon, P. (2002) Estimation of pollutant concentrations for EMSS modelling of the south-east Queensland region. Cooperative Research Centre for Catchment Hydrology. Report 02/2 link

Etchells, T., Tan, K. S. and Fox, D. (2005) Quantifying the uncertainty of nutrient load estimates in the Shepparton Irrigation Region. In: Zerger, A. and Argent, R.M. (eds) MODSIM 2005 International Congress on Modelling and Simulation. Modelling and Simulation Society of Australia and New Zealand, December 2005, pp. 170-176. ISBN: 0-9758400-2-9. link

Fox, D. R., Etchells, T. and Tan, K. S. (2005) Protocols for the optimal measurement of nutrient loads.  A report to the West Gippsland Catchment Management Authority.  Australian Centre for Environmetrics. link.

Fletcher, T. D. and A. Deletic (2007) Statistical evaluation and optimisation of stormwater quality monitoring programmes. Water Science and Technology 56(12): 1-9. link

Letcher, R. A., Jakeman, A. J., Calfas, M., Linforth, S., Baginska, B., and Lawrence, I. (2002) A comparison of catchment water quality models and direct estimation techniques. Environmental Modelling and Software, 17, 77-85. link

Littlewood, I. G. (1992). Estimating contaminant loads in rivers: a review. Wallingford, Institute of Hydrology. link

Marsh, N. and Waters, D. (2009) Comparison of load estimation methods and their associated error. In: Anderssen, R.S., R.D. Braddock and L.T.H. Newham (eds) 18th World IMACS Congress and MODSIM09 International Congress on Modelling and Simulation. Modelling and Simulation Society of Australia and New Zealand and International Association for Mathematics and Computers in Simulation, July 2009, pp. 2377-2383. ISBN: 978-0-9758400-7-8 link

Mukhopadhyay, B., and Smith, E. H. (2000) Comparison of statistical methods for estimation of nutrient load to surface reservoirs for sparse data set. Water Resources, 34(12), 3258-3268. link

Mullen, K., David Ardia, David Gil, Donald Windover, James Cline (2011). ‘DEoptim’: An R Package for Global Optimization by Differential Evolution. Journal of Statistical Software, 40(6), 1-26. URL http://www.jstatsoft.org/v40/i06/.

Preston, S. D., Bierman, V. J., and Silliman, S. E. (1989) An evaluation of methods for the estimation of tributary mass loads. Water Resources Research, 25(6), 1379-1389. link

Water Quality Analyser: user guide (2011) see Appendix 3, p132 user guide

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s