# Log-normal flood frequency analysis

Although the log-normal distribution is not usually used for flood frequency estimation in Australia, it can be useful for comparing results with more common approaches.

The general equation for flood quantile estimation (from Chow, 1951) is:

$x_t = \mu+K_t\sigma$      (1)

Where $K_t$ is the frequency factor, which depends on the distribution of the flood data and the probability of interest, and $\mu$ and $\sigma$ are the mean and standard deviation of the flood data.

## Frequency factors

The classical frequency factor for the log-normal distribution is:

$K_t = \left( \frac{1}{C_v} \right) \left[ \exp \left\{ Z\left( \log \left(1 + C_v^2 \right) \right)^\frac{1}{2} - 0.5\left( \log \left( 1 + C_v^2 \right) \right) \right\} -1\right]$      (2)

Where $Z$ is a standard normal deviate and $C_v$ is the coefficient of variation $= \frac{\sigma}{\mu}$.  The standard normal deviate can be calculated, for example, using the qnorm function in R.  So for the 100-year flood, exceedance probability = 1/100, $Z = 2.326$.
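Outside R, the same deviate can be obtained from any standard normal quantile function. For example, a Python sketch using only the standard library:

```python
from statistics import NormalDist

# Standard normal deviate for the 100-year flood (exceedance probability 0.01)
z = NormalDist().inv_cdf(1 - 0.01)
print(round(z, 3))  # → 2.326
```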

An R function to calculate this log-normal frequency factor is:

```r
# Frequency factor for the log-normal distribution
# m - mean
# s - standard deviation
# p - exceedance probability

FF_LogNormal <- function(m, s, p) {
  cv <- s/m
  z <- qnorm(1 - p)
  (1/cv) * (exp(sqrt(log(1 + cv^2)) * z - 0.5 * log(1 + cv^2)) - 1)
}
```

### Example 1

Kite (2004) provides an example we can use for testing.

Annual maximum discharges of the Saint John River at Fort Kent, New Brunswick, for the 37 years from 1927 to 1963 have a mean of 81,000 cfs and a standard deviation of 22,800 cfs. Estimate the 100-year flood.

As calculated below, Q100 = 148,000 cfs.

```r
m <- 81000
s <- 22800
p <- 0.01

Kt <- FF_LogNormal(m, s, p) # 2.943

(Q100 <- m + Kt * s)
# [1] 148221.3
```

If flood data are log-normally distributed, that means the logarithms of the flood data are normally distributed.  This suggests there are two ways of calculating the flood quantiles.  We can use the data as-is along with the log-normal frequency factor (as in the above example), or take the logs of the data and use the frequency factor from the normal distribution.
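When the population mean and standard deviation are known exactly, the two routes give identical quantiles; differences, such as between Examples 1 and 2, arise only because the moments are estimated from data. A Python sketch (illustrative only, with arbitrary log-space parameters) checks the identity numerically:

```python
from math import exp, log, sqrt
from statistics import NormalDist

z = NormalDist().inv_cdf(0.99)        # 100-year frequency factor, normal distribution

mu_log, sigma_log = 11.3, 0.28        # arbitrary log-space parameters
m = exp(mu_log + sigma_log**2 / 2)    # mean of the log-normal
cv = sqrt(exp(sigma_log**2) - 1)      # coefficient of variation
s = m * cv                            # standard deviation

# Route 1: raw moments with the log-normal frequency factor (equation 2)
kt = (1 / cv) * (exp(z * sqrt(log(1 + cv**2)) - 0.5 * log(1 + cv**2)) - 1)
q1 = m + kt * s

# Route 2: normal frequency factor applied to the log-space parameters
q2 = exp(mu_log + z * sigma_log)

print(abs(q1 - q2) < 1e-6 * q2)  # → True
```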

### Example 2

Continuing with the example from Kite.  The mean and standard deviation of the logarithms of the 37 events from the Saint John River are 11.263 and 0.284 (in cfs units).

The frequency factor for the 100-year flood, based on the normal distribution, is 2.326, so the 100-year flood estimate is 150,800 cfs.  This is similar to the previous example but not exactly the same.

```r
m.log <- 11.263
s.log <- 0.284

Kt <- qnorm(1 - 0.01) # 2.326348

(Q100 <- exp(m.log + Kt*s.log))
# [1] 150795.9
```

### Example 3

Let's repeat this analysis using some Australian data.  An annual flood series for the Hunter River at Singleton is available here.

The two estimates for the 100-year flood are 10,500 cumec using data as-is and 13,800 cumec when calculating in the log domain.

```r
# Hunter River at Singleton
library(repmis)
my.url <- 'https://dl.dropboxusercontent.com/u/10963448/Singleton.csv'

singleton <- source_data(my.url)
str(singleton)

m <- mean(singleton$`Peak (cumec)`) # 1401.7
s <- sd(singleton$`Peak (cumec)`)   # 2312.9
Kt <- FF_LogNormal(m, s, 0.01) # 3.917
(Q100 <- m + Kt * s)
# 10460.5

m.log <- mean(log(singleton$`Peak (cumec)`)) # 6.425
s.log <- sd(log(singleton$`Peak (cumec)`))   # 1.336
Kt <- qnorm(1 - 0.01) # 2.326
(Q100 <- exp(m.log + Kt*s.log))
# 13818.9
```

## Bayesian frequency factor

Kuczera (1999) derives a frequency factor for the log-normal distribution which takes account of the uncertainty in the parameters – the mean and standard deviation.  These parameters must be estimated from the flood data.  Assuming nothing is known about these parameters other than what we learn from the flood measurements (i.e. the Bayesian prior is noninformative), the frequency factor is:

$K_t = t_{n-1,\, 1-p} \times \sqrt{1 + \frac{1}{n}}$      (3)

Where $t_{n-1,\, 1-p}$ is the $1-p$ quantile of Student's t distribution with $n-1$ degrees of freedom and $n$ is the number of data points.

Using this frequency factor will result in a more conservative estimate of flood quantiles. As the sample size increases, the Bayesian frequency factor approaches the frequency factor based on the normal distribution.

An R function to calculate this Bayesian log-normal frequency factor is:

```r
# Bayesian frequency factor for the log-normal distribution
# n - number of data points
# p - exceedance probability

FF_LogNormal_Bayes <- function(n, p) {
  qt(1 - p, n - 1) * sqrt(1 + 1/n)
}
```

### Example 4

Calculate the Bayesian estimate for the 100-year flood quantile using the data for the Hunter River at Singleton.

Calculations are below.  The estimate is 17,350 cumecs.

```r
library(repmis)
my.url <- 'https://dl.dropboxusercontent.com/u/10963448/Singleton.csv'

singleton <- source_data(my.url)

m.log <- mean(log(singleton$`Peak (cumec)`)) # 6.425
s.log <- sd(log(singleton$`Peak (cumec)`))   # 1.336
Kt <- FF_LogNormal_Bayes(nrow(singleton), 0.01) # 2.4966

(Q100 <- exp(m.log + Kt*s.log))
# 17348
```

### Testing

So we have three flood quantile estimators for log-normally distributed data:

1. Take logs and use the frequency factor from the normal distribution.
2. Use the data as-is and calculate the frequency factor using equation 2.
3. Take logs and use the Bayesian frequency factor.

If the data set is large, all three methods give practically the same result.

```r
set.seed(2016)
my.flood <- rlnorm(1e6, 6, 1)  # generate 1 million flood values from a log-normal distribution

# True value
exp(6 + 1*qnorm(1 - 0.01)) # 4131.302

# Direct calculation of quantile
quantile(my.flood, probs = 0.99, type = 8) # 4114.213

# Taking logs and using normal distribution frequency factor
Q100.log <- mean(log(my.flood)) + sd(log(my.flood))*qnorm(1 - 0.01)
exp(Q100.log) # 4130.316

# Use frequency factor from equation 2
Q100 <- mean(my.flood) + sd(my.flood) * FF_LogNormal(mean(my.flood), sd(my.flood), 0.01)
Q100 # 4127.146

# Bayes frequency factor
Q100.Bayes <- mean(log(my.flood)) + sd(log(my.flood))*FF_LogNormal_Bayes(length(my.flood), 0.01)
exp(Q100.Bayes) # 4130.336
```

For small datasets, the estimates vary substantially.

First, let's make a function that simulates 30 years of log-normal flood peaks and calculates the 100-year quantiles.

```r
Test_f <- function(i, p) {
  # i = dummy variable so the function can be included in a loop
  # p = exceedance probability

  my.flood <- rlnorm(30, 6, 1) # Generate 30 years of flood data
  Q100.normal <- mean(log(my.flood)) + sd(log(my.flood))*qnorm(1 - p)
  Q100.eq2 <- mean(my.flood) + sd(my.flood) * FF_LogNormal(mean(my.flood), sd(my.flood), p)
  Q100.Bayes <- mean(log(my.flood)) + sd(log(my.flood))*FF_LogNormal_Bayes(length(my.flood), p)
  data.frame(Q100.normal = exp(Q100.normal), Q100.eq2 = Q100.eq2, Q100.Bayes = exp(Q100.Bayes))
}
```

Now, we’ll call this multiple times and compare the average quantiles for the three methods with the true quantile.

```r
set.seed(5)
out <- lapply(1:10000, Test_f, p = 0.01)
out.df <- do.call(rbind, out)
colMeans(out.df)
# Q100.normal    Q100.eq2  Q100.Bayes
#    4334.727    3678.353    5204.641

# True quantile 4131
```

So, for this data set, on average, the quantile based on taking logs and using the frequency factor from the normal distribution is about right.  The quantile based on equation 2 is too low, and the quantile using the Bayesian frequency factor is too high.

In another post, we’ll look at quantifying the uncertainty in quantile estimates.

## References

Chow, V. T. (1951) A general formula for hydrologic frequency analysis.  Transactions, American Geophysical Union 32:231-237.

Kite, G. W. (2004) Frequency and risk analysis in hydrology.  Water Resources Publications, LLC.

Kuczera, G. (1999) Comprehensive at-site flood frequency analysis using Monte Carlo Bayesian inference.  Water Resources Research 35(5): 1551-1557.

# Lead in rain tanks – Wellington, NZ

I’ve written about lead levels in rain tanks before here and in this book chapter.  This is just a quick post about some recent work in New Zealand (link).  Researchers looked at lead levels in six water tanks in the Wellington area and found that 69% of samples exceeded WHO’s health-based guideline of 0.01 mg/L for lead.  This is consistent with the 32 studies that we summarised in Magyar and Ladson (2015).

### References

Stewart, C., Kim, N. D., Johnson, D. M. and Nayyerloo, M. (2016) Health hazards associated with consumption of roof-collected rainwater in urban areas in emergency situations.  Int. J. Environ. Res. Public Health 13(10):1012 doi:10.3390/ijerph13101012

Magyar, M. I. and Ladson, A. R. (2015) Chemical quality of rainwater in rain tanks. In: Sharma, A., Begbie, D. and Gardner, T.  Rainwater tank systems for urban water supply: design, yield, energy, health risks, economics and social perceptions.  IWA publishing. pp 207-224. Link to Chapter 9

# RORB: comparing modelled and measured hydrographs

[Update  15 Nov 2016: Calculations available as a spreadsheet]

RORB provides 7 statistics to summarise the match between  calculated and actual hydrographs.  A comparison of:

1. Peak discharge (m3/s)
2. Time to peak (hour)
3. Volume (m3)
4. Average absolute coordinate error (m3/s)
5. Time to centroid (hour)
6. Lag (centre of mass to centre of mass) (hour)
7. Lag to peak (hour)

An example is shown in Figure 1 below which compares a calculated and actual hydrograph for the Werribee River.  This is taken from section 10.2.1 of the RORB manual.  The difference between the hydrographs is summarised as an absolute error and a percentage error for each statistic.

During calibration, the losses and kc can be varied until the modeller is satisfied there is a reasonable match between the calculated and actual hydrographs across the range of statistics.  Generally modellers focus on peak discharge and volume but the other error measures are also useful.  For example, problems in the speed of the hydrograph (time to centroid and lag) can be related to the number of sub-catchments used in the RORB model (see the discussion here).

There is a brief summary of these statistics in Section 9.2.1.1 of the RORB manual.  In this blog I’ll discuss them in more detail.

Figure 1: Error statistics provided by RORB based on the differences between calculated and actual hydrographs

### Example data set

Example calculations are made using data from the Werribee River worked example (Section 10.2 of the RORB manual).  The actual and calculated hydrographs are listed on page 111 and 112.  See Figure 2.

Figure 2: Calculated and actual hydrographs for the Werribee worked example (p111 – 112 of the RORB manual)

### Peak discharge

The error in peak discharge is the difference between the maximum of the calculated RORB hydrograph and the maximum of the actual hydrograph provided as input to RORB.

For the Werribee example:

max(actual) = 356 m3/s

max(calculated) = 337.999 m3/s

Absolute difference = max(calculated) – max(actual) = 337.999 – 356 ≈ –18 m3/s

Percentage difference = 100×(max(calculated) – max(actual))/max(actual)

$100 \times \dfrac{-18}{356} \approx -5.1\%$

Note that these error measures are expressed in the opposite way to that used by statisticians.  In statistics, residuals are usually expressed as actual – calculated, e.g. $y - \hat{y}$.  Here we are reporting calculated – actual.  So if the difference is negative, the calculated value is too small.
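As a minimal sketch of this convention, using the peak values from the Werribee example above:

```python
actual_peak = 356.0        # m3/s
calculated_peak = 337.999  # m3/s

abs_error = calculated_peak - actual_peak      # calculated - actual, RORB convention
pct_error = 100 * abs_error / actual_peak

print(round(abs_error), round(pct_error, 1))  # → -18 -5.1
```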

### Time to peak

This is where it starts to get interesting.  Looking at the data in Figure 2, the peak flow for both the actual and calculated hydrographs occurs at the 9th time increment, which Figure 2 suggests, is at 18 hours.  However, look at the statistics in Figure 1; the time to peak is listed as occurring at 16 hours.  What’s going on?

We can get a clue by looking at the RORB control vector for the Werribee case study (see Figure 3).  The first line highlighted in green shows that the time increment is 2 hours and calculations will be undertaken for 28 increments. The second line in green shows that the two actual hydrographs (Melton Reservoir outflow and Werribee weir) both start at the zeroth time increment and end at the 28th time increment.  The trick is that increments start at zero, not 1.  The increments shown in Figure 2 start at 1, not zero.  This is an error in reporting.  The first flows listed in Figure 2 actually occur at time zero, so the maximum flows occur after 16 hours, not 18 hours.  It’s a bit hard to tell, but if you look closely at the graph in Figure 1, the maximum flow occurs at 16 hours, not 18 hours.

Figure 3: RORB control vector from the Werribee case study (p106 of the RORB manual)

There seems to be an ‘out by one’ error in the RORB output file.  Other case studies in the RORB manual have the same issue.  This creates an inconsistency between the on-screen statistics, such as those shown in Figure 1, and the RORB text output.
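The zero-based timing can be made explicit in code. A short Python sketch with made-up flows (not the actual Werribee data) shows how the time of each ordinate, and hence the time to peak, follows from increments that start at zero:

```python
# Made-up hydrograph ordinates at a 2-hour time step; the first value is at time zero
flows = [0, 0, 8, 34, 72, 140, 240, 320, 356, 330, 270, 200, 150]
dt = 2  # hours

times = [dt * i for i in range(len(flows))]  # 0, 2, 4, ... not 2, 4, 6, ...
time_to_peak = times[flows.index(max(flows))]
print(time_to_peak)  # → 16 (the 9th value, at increment 8)
```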

### Volume

RORB estimates hydrograph volumes as the mean flow times the length of the hydrograph.

For the data shown in Figure 2, the means are:

• Calculated 119.0093 m3/s
• Actual 122.4138 m3/s

To convert these mean flows to volumes (m3) multiply by the length of the hydrograph.  In this case, there are 29 flow values (increments numbered zero to 28).  Each increment is 2 hours long so we need to multiply by 2 × 60 × 60 to convert hours to seconds.

• Volume of calculated hydrograph 119.0093 × 29 × 2 × 60 × 60 = 24848021 ≈ 0.25 × 10^8 m3
• Volume of actual hydrograph 122.4138 × 29 × 2 × 60 × 60 = 25560000 ≈ 0.26 × 10^8 m3

The absolute difference is 24848021 – 25560000 = -711979 ≈ -0.71 × 10^6 m3

Percentage difference:

100 × (24848021 – 25560000)/25560000 ≈ -2.8%
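The same arithmetic in a Python sketch (mean flows from the Werribee example above). Note that the percentage error in volume equals the percentage error in mean flow, because the duration cancels:

```python
mean_calc, mean_actual = 119.0093, 122.4138  # m3/s
n_ordinates = 29                             # increments numbered 0 to 28
dt_seconds = 2 * 60 * 60                     # 2-hour time step

vol_calc = mean_calc * n_ordinates * dt_seconds
vol_actual = mean_actual * n_ordinates * dt_seconds

pct_error = 100 * (vol_calc - vol_actual) / vol_actual
print(round(pct_error, 1))  # → -2.8
```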

### Average absolute coordinate error

The average absolute coordinate error is the mean of absolute values of the differences between the actual and calculated hydrograph at each time step.

$\mathrm{AACE} = \dfrac{\sum\limits_{i=0}^{N} \lvert a_i - c_i \rvert}{N+1}$

Where:

• AACE is the average absolute coordinate error
• $a_i$ is the flow at the $i$th time increment for the actual hydrograph
• $c_i$ is the flow at the $i$th time increment for the calculated hydrograph
• There are $N + 1$ time increments.

For the Werribee case study, the first time step is at zero and the final is at 28, so there are 29 values in total.

The average absolute coordinate error =

$\dfrac{\lvert 0 - 0 \rvert + \lvert 0 - 0\rvert + \lvert 8 - 8.007 \rvert + \lvert 34 - 37.806 \rvert + \ldots}{28+1} \approx 7.6 \; \mathrm{m^3 s^{-1}}$

The percentage error is the average absolute coordinate error divided by the average flow of the actual hydrograph.

For the Werribee case study, the average flow during the actual hydrograph is 122.4138 m3/s so the percentage average absolute coordinate error is:

$100 \times \dfrac{7.6}{122.4} \approx 6.2 \%$
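A Python sketch of the AACE calculation, using only the first four ordinates from the example above rather than the full Werribee hydrographs:

```python
def aace(actual, calculated):
    """Average absolute coordinate error: mean absolute difference over all ordinates."""
    assert len(actual) == len(calculated)
    return sum(abs(a - c) for a, c in zip(actual, calculated)) / len(actual)

actual = [0, 0, 8, 34]              # first four ordinates only
calculated = [0, 0, 8.007, 37.806]
print(round(aace(actual, calculated), 2))  # → 0.95
```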

### Time to centroid

The time to the centroid of the hydrograph can be calculated as the sum of flow × time divided by the sum of the flows:

$\dfrac{\sum_i t_iQ_i}{\sum_i Q_i}$

Looking at the data in Figure 2, and correcting for the fact that the first time step is at zero, not 2 hours,  the centroid for the actual hydrograph will be:

$\frac{0 \times 0 + 2 \times 0 + 4 \times 8 + 6\times 34 + \ldots}{0+0+8+34+\ldots}$

Time to centroid (actual) = 24.11662 hour

Time to centroid (calculated) = 23.44448 hour

Absolute error = 23.444 – 24.117 = -0.672 ≈ -0.7 hour

Percentage error in the time to centroid is the absolute error divided by the time to centroid of the actual hydrograph.

$100 \times \dfrac{-0.672}{24.117} \approx -2.8\%$
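A Python sketch of the centroid calculation (made-up flows; ordinates start at time zero):

```python
def time_to_centroid(flows, dt_hours):
    """Flow-weighted mean time of the hydrograph; first ordinate is at time zero."""
    times = [dt_hours * i for i in range(len(flows))]
    return sum(t * q for t, q in zip(times, flows)) / sum(flows)

flows = [0, 0, 8, 34, 20]  # made-up values
print(round(time_to_centroid(flows, 2), 2))  # → 6.39
```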

### Lag (centre of mass to centre of mass)

The RORB manual (p101) defines the lag (c.m to c.m) as:

the time from the centroid of concentrated inputs to the channels upstream of the point concerned, including both sub-area inflows and concentrated channel inflow hydrographs, to the time of centroid of the hydrograph.

For the Werribee worked example, the concentrated input is the hydrograph at Melton Reservoir.  This is listed in Figure 3 (the two lines under the green box):

0, 0, 66, 150, 253, …

The time to the centroid of this hydrograph can be calculated using the same procedure as before.

Time to centroid of the input hydrograph at Melton Reservoir is 19.90657  hour.

The difference in time between the centre of mass of the input hydrograph and that of the actual hydrograph is:

24.11662 – 19.90657 = 4.21005 ≈ 4.2 hour

The difference in time between the centre of mass of the input hydrograph and that of the calculated hydrograph is:

23.44448 – 19.90657 = 3.53791 ≈ 3.5 hour

The absolute difference between these values is:

3.53791 – 4.21005 = -0.67214 ≈ -0.7 hour

The percentage difference is the absolute difference divided by the difference in time between the centre of mass of the input hydrograph and that of the actual hydrograph.

100 × -0.67214/4.21005 = -15.965% ≈ -16%

So the calculated value is 0.7 hour (16%) smaller than the actual value.  This means the calculated hydrograph is moving through the modelled reach of the Werribee River more quickly than the real hydrograph is moving through the real reach.  A modeller could adjust kc to slow the modelled hydrograph (but this would also affect other aspects of the calculated hydrograph, such as the peak).

### Lag to peak

The lag to peak is the time from the centroid of the concentrated input to channels upstream of the point concerned, including both sub-area inflows and concentrated channel inflow hydrographs, to the time of peak of the hydrograph.

For the Werribee worked example, the time to the centroid of the input hydrograph (at Melton Reservoir) is 19.90657 hours and the peak of the actual output hydrograph is at 16 hours, so the absolute difference is:

16 – 19.90657 ≈ -3.9 hours

The peak of the calculated hydrograph is also at 16 hours, so the lag to peak of the calculated and actual hydrographs is the same, i.e. the error is zero.

### Conclusion

The range of error statistics in RORB provides a summary of model performance which can be used as a guide during calibration.  Most modellers concentrate on matching peaks and volumes, but the other statistics also provide insight into aspects where the model can be improved.

Example calculations are available as an Excel spreadsheet and a gist.

# Detecting outliers in water quality data

This post proposes a simple method to detect outliers in water quality data.  This is based on a proposal by John Tukey as discussed in the Wikipedia article: Outliers.

Tukey proposed a method that could be used to flag outliers based on the interquartile range.  An outlier is any observation that falls outside:

$\left[ Q_1 - k(Q_3 - Q_1),\; Q_3 + k(Q_3 - Q_1) \right]$

Where $Q_1$ and $Q_3$ are the lower and upper quartiles.

Tukey proposed that k = 1.5 could be used to flag outliers, while k = 3 suggests observations that are “far out”.

A guide to whether the maximum or minimum value in a dataset is an outlier is to calculate its k value.

$k_{max} = \frac{max(O) - Q_3}{Q_3 - Q_1}$

$k_{min} = \frac{ Q_1 - min(O)}{Q_3 - Q_1}$

The appealing feature of using the interquartile range is that it is resistant to outliers, so there is a reduced chance of the statistic used to flag outliers being influenced by the outliers themselves, an issue known as masking.

A function to calculate the k value for the maximum and minimum of a dataset is as follows:

```r
Tukey_k <- function(x){
  my.quantile <- quantile(x, na.rm = TRUE)
  Q_25 <- my.quantile[2]
  Q_75 <- my.quantile[4]

  k_max <- as.vector((max(x, na.rm = TRUE) - Q_75)/(Q_75 - Q_25))
  k_min <- as.vector((Q_25 - min(x, na.rm = TRUE))/(Q_75 - Q_25))

  data.frame(k_max = k_max, k_min = k_min)
}
```

Let’s run some tests.

If we calculate the k values for a large normally distributed dataset (say 100,000 observations) they are less than 3.  For these data, getting a value larger than 3 is obviously rare.

```r
set.seed(2016)
Tukey_k(rnorm(100000))
#      k_max    k_min
# 1 2.530309 2.531518
```

For water quality observations, a review by Duncan (1999) suggested the log-normal distribution was a good approximation for all the data he looked at, so we need to take logs of the observations before calculating the k values.
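The effect of the log transform can be seen on simulated log-normal data. A Python sketch (illustrative only; `statistics.quantiles` is used in place of R's `quantile`, so the cut points differ slightly):

```python
import math
import random
import statistics

def tukey_k(values):
    """k values for the maximum and minimum, based on the interquartile range."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return (max(values) - q3) / iqr, (q1 - min(values)) / iqr

random.seed(2016)
sample = [random.lognormvariate(0, 1) for _ in range(100000)]

k_max_raw, _ = tukey_k(sample)                         # 'far out' on the raw scale
k_max_log, _ = tukey_k([math.log(x) for x in sample])  # modest after taking logs
print(k_max_raw > 3, k_max_raw > k_max_log)  # → True True
```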

For some data on Nickel concentration I’m working on, the k for the maximum value is 4.3.  Clearly a ‘far out’ value and one that deserves checking, via a histogram or QQ plot (see below).  Of the 12,000 values in this dataset there are clearly 15 or so that are very large (relative to the bulk of the data).

Code is available as a gist.


# Rating curve resources

Hydrologists are often interested in the highest flows and that means we are using the upper limits of rating tables where uncertainty is large.

Below are some links that explain how rating curves are developed and potential issues with current practice.

Fenton, J. D. and Keller, R. J. (2001) The calculation of streamflow from measurements of stage.  Cooperative Research Centre for Catchment Hydrology.  Technical report 01/6.  (link)

USGS

• How streamflow is measured
• Buchanan, T. J. and Somers, W. P. (1969) Discharge measurements at gaging stations.  U. S. Geological Survey Techniques of Water-Resources Investigations Book 3, Chapter A8, pp. 65. (link)

John Fenton’s papers

The NZ Hydrological Society Ratings Workshop 2016

Stu Hamilton’s blog at Aquatic Informatics

Stu Hamilton’s Whitepaper: 5 Best Practices for Building Better Stage-Discharge Rating Curves.

### Journal articles

Van Eerdenbrugh, K., Van Hoey, S. and Verhoest, N. E. C. (2016) Identification of temporal consistency in rating curve data: Bidirectional Reach (BReach). Water Resources Research 52(8):6277-6296. (link)

Steinbakk, G. H., Thorarinsdottir, T. L., Reitan, T., Schlichting, L., Holleland, S. and Engeland, K. (2016) Propagation of rating curve uncertainty in design flood estimation.  Water Resources Research. DOI: 10.1002/2015WR018516 (link).

Kuczera, G. (1996) Correlated rating curve error in flood frequency inference.  Water Resources Research.  32(7):2119-2127 (link)

Mansanarez, V., Le Coz, J., Renard, B., Lang, M., Pierrefeu, G. and Vauchel, P. (2016) Bayesian analysis of stage-fall-discharge rating curves and their uncertainties.  Water Resources Research. DOI: 10.1002/2016WR018916 (link)

Tomkins, K. M. (2012) Uncertainty in streamflow rating curves: methods, controls and consequences.  Hydrological Processes 28(3):464-481. (link)

### Other articles

Ward, B and Thomas, L. (2008) Does your rating curve hold water? The consequences of rating curve errors. ANCOLD Conference (link)

# Victoria’s open data directory

The Victorian government has set up an open data directory that has some great information for hydrologists.

A few of the data sets:

There is a nice analysis of the snow data here (courtesy of Truii).

# The State of Hydrologic Practice in Victoria, Australia

Edit: 30 Nov 2016.  Slides from conference presentation and handout.

Edit: 11 Nov 2016.  The final version of the paper is here.

I’ve put together a conference paper that provides a summary of current hydrologic practice in Victoria, based on a review of 20 recent flood studies.

Some highlights:

• The RORB hydrologic model was used in 17 of the 20 studies.  Clearly RORB is popular and useful even though much of the core capability dates from the 1970s.
• Climate change impacts on flooding were only considered in half the studies.  There has been other work that suggests Australian hydrologists and their clients lag behind their international counterparts in considering climate change (e.g. this study by Caroline Wenger and others).  This is surprising given that Engineers Australia issued climate change guidelines in 2003.  The latest guidelines are available here.
• The number of sub-catchments used in RORB models has increased greatly with the use of tools such as ArcHydro and CatchmentSim. There was an average of 92 sub-catchments in the studies I looked at.  RORB models from 1970 to the mid 1990s generally used between 10 and 20 sub-catchments.
• No one used Monte Carlo or ensemble approaches to hydrologic flood modelling despite these features being available in RORB (For a summary of RORB’s capability see this conference paper from 2006).
• Rainfall temporal patterns were taken from the 1987 version of Australian Rainfall and Runoff.  RORB provides a tool to extract temporal patterns from local pluviographs but this was not used.
• Most studies used uniform spatial patterns for design rainfalls.  The new recommendations from Australian Rainfall and Runoff are that spatial patterns should reflect variations in rainfall across catchments (ARR Book 2, Chapter 6).  Clearly this will require a change in practice.
• The baseflow contribution to design flood hydrographs was generally not considered.  Again there is increased emphasis on this in the new ARR.

Ladson, A. R. (2016) The State of Hydrologic Practice in Victoria, Australia.  Hydrology and Water Resources Symposium.  28 Nov – 2 Dec 2016, Queenstown, NZ.