Australian Rainfall and Runoff 2016 has an example of fitting a probability model to POT (peak over threshold) data (Section 2.8.11 in Book 3). It took me a long time to understand this example. To these similarly challenged, I hope this blog will help make it easier.
Calculations are available as an excel spreadsheet; R code is available as a gist. The R code reproduces all the results and figures from this blog.
The case study is the Styx River at Jeogla with the POT series providing 47 peaks above the threshold of 74 m3/s for a period of record of 47 years. The data are as follows:
878 541 521 513 436 411 405 315 309 301 300 294 283 258 255 255
238 235 221 220 206 196 194 190186 164 150 149 134 129 129 126
119 118 117 117 111 108 105 98 92.2 92.2 91.7 88.6 85.2 79.9 74
Our goal is to fit a probability model to these data and estimate flood quantiles.
You can download the data as a csv file here. This is the same data set that was used in the 1987 edition of ARR (Book IV, Section 2.9.5). I’ve sourced the date of each flood from ARR1987 and included it in the file. Interestingly the earliest flood discharge is from 1943 and the last from 1985, a record of 43 years not 47. This is because the floods between 1939 and 1943 were all less than the 74 m3/s threshold (from 1939 to 1942 they were: 58.0, 13.0, 22.1, and 26.1 m3/s).
Start by calculating the the plotting positions for these data points. ARR2016 recommends using the inverse of the Cunnane plotting position formula to calculate recurrence intervals (see ARR 2016 equation 3.2.79, in Book 3, Chapter 2.7.1).
- average interval between exceedances of the given discharge (years)
- is the rank of flows in the POT series (largest flow is rank 1)
- is the number of years of data.
Note that is the number of years of record, not the number of peaks.
Also, its important to recognise that equation 1 works differently for the POT series compared to the annual series. Let’s look at an example that shows this difference. Consider the case where we were analysing 94 peaks from a POT series from 47 years of data. In that case, the smallest peak would have a rank of 94. The recurrence interval would be:
Which is about half a year, or 6 months, which seems reasonable (there were 94 peaks in 47 years so there are 2 peaks per year on average and they will be about 6 months apart on average).
Interpreting the Cunnane plotting position as annual exceedance probability would give a result of:
That is, the probability is greater than 1 which doesn’t make sense. So we can’t use a plotting position formula to calculate the AEP of data from a peak-over-threshold series. Instead, the inverse of equation 1 can be interpreted as EY, the expected number of exceedances per year.
The ARR example fits the exponential distribution as the probability model. This has two parameters, and , which can be determined by calculating the first and second L moments of the POT series data.
L1 is just the mean of the pot series (266.36 m3/s). To calculate L2 we use the formulas developed by QJ Wang (Wang, 1996)
- are the flows (the POT series) in order from smallest to largest
- is the number of flows
L2 = 79.12
I’ve calculated L2 using this formula as shown in the companion spreadsheet but it does require a fudge. Excel doesn’t automatically recognise that 0Cn is equal to zero. This is a standard mathematical convention that Excel doesn’t seem to implement.
The parameters of the exponential distribution can be expressed in terms of the L moments (See ARR2016, Table 3.2.3.) We are using the formulas for the Generalized Pareto (GP) but setting to zero which results in the exponential distribution. The GP distribution may be useful in real world applications.
For the exponential distribution:
The example in ARR doesn’t include confidence intervals for these parameter estimates, but these can be calculated using bootstrapping.
The 95% confidence intervals are, for beta (37.4, 89.4) and (120.6, 244.7). The correlation between the parameters is about -0.5. Uncertainty in parameters is shown graphically in Figure 1.
Figure 1: Uncertainty in parameters. Ellipse shows 95% confidence limit
The exponential distribution relates the exceedance probability to the size of a flow. In this case, the probability that a peak flow, , exceeds some value in any POT event is:
The ARR example cross-references these formulas to Table 3.2.3, but the reference should be to Table 3.2.1. The final formula in Table 3.2.1 shows the cumulative distribution function for the exponential distribution. This is for the non-exceedance probability. In hydrology we usually use exceedance probabilities (the probability that a flood is greater than a certain value), therefore the required equation is one minus the equation shown.
We can convert the POT probabilities to expected number of exceedances per year, by multiplying by the average number of flood peaks above the threshold per year. The example in ARR states that the threshold is i.e. 68.12 m3/s but the threshold used to create the POT data was 74 m3/s. I’m not sure, but I think we should use the 74 m3/s threshold. There is no information available on how many peaks there are above 68.12 m3/s. It is likely the ambiguity has arise because of a typo in the example i.e. the threshold should not have been related to .
Sticking with the 74 m3/s threshold, there are there are 47 values in POT series and 47 years of data, so the average number of peaks per year is 1. Therefore, the EY can be expressed as follow:
Where latex is the average number of POT floods per year, which is 1 in this case. In real applications, is likely to be greater than 1. The discussion in ARR2016 mentions the flow threshold is typically selected so that there are 2 to 3 times more peaks in the POT series compared to the annual series (Book 3, Chapter 184.108.40.206). Although in Chapter 2.3.4 it is mentioned that there are circumstances where the POT and annual series should have the same number of peaks.
can also be related to as follows:
We can also relate to annual exceedance probability by the formula:
Therefore, a flood magnitude, can be related to an AEP, and an AEP to a flood magnitude, , through the following pair of equations (ARR2016 Equation 3.2.11).
These equations can be used to estimate quantiles at standard EY values. Confidence intervals (95%) have been estimated using bootstrapping (Table 1).
Table 1: Flood quantiles and 95% confidence limits at standard EY values
||AEP (1 in X)
Now plot the data and the fit (Figure 2). The x values are the EY calculated from the inverse of equation 1; the y values are the flows. The fit is based on equation 6. Confidence intervals for quantiles were calculated using bootstrapping.
Figure 2: Partial series for the Styx River at Jeogla showing fit and confidence intervals
Personally I prefer the graph to be increasing to the right which is the way flood frequency curves are usually plotted, its also the way a similar example was presented in the 1987 edition of Australia Rainfall and Runoff. This just involves using the ARI as the ordinate (Figure 3).
Figure 3: Partial series for the Styx River at Jeogla showing fit and confidence intervals (same as Figure 2 except increasing to the right)
To see more, check out the excel spreadsheet and R code gist.
I’d also like to thank Matt Scorah for his help in clarifying the process of fitting POT data.
Wang, Q. J. (1996). “Direct sample estimates of L moments.” Water Resources Research 32(12): 3617-3619.