# 1% flood: binomial distribution, conditional probabilities

I previously wrote about treating the occurrence of 1% floods as a binomial distribution; this post extends that analysis to look at conditional probabilities.  Some of the results are counterintuitive, at least to me, in that the risk of multiple 1% floods is larger than I would have guessed.

The probability of a 1% (1 in 100) annual exceedance probability (AEP) flood occurring in any year is 1%.  This can be treated as the probability of a “success”  in the binomial distribution, with the number of trials being the number of years. So the probability of having exactly one 1% flood in 100 years is ${100\choose 1}0.01^{1}\left( 1-0.01\right) ^{99} = 0.37$

In R this can be calculated as `dbinom(x = 1, size = 100, prob = 0.01)` or in Excel as `=BINOM.DIST(1, 100, 0.01, FALSE)`.
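As a cross-check outside R and Excel, the binomial probability can be computed directly from the formula; a minimal Python sketch using only the standard library:

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Probability of exactly one 1% flood in 100 years
print(round(binom_pmf(1, 100, 0.01), 2))  # 0.37
```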

The cumulative distribution function of the binomial distribution is also useful for flood calculations.

What is the probability of 2 or more 1% floods in 100 years?

R: `pbinom(q = 1, size = 100, prob = 0.01, lower.tail = FALSE)` = 0.264

Excel: `=1 - BINOM.DIST(1, 100, 0.01, TRUE)` = 0.264

We can check this by calculating the probability of zero or one flood in 100 years and subtracting that value from 1.

`1 - (dbinom(x = 1, size = 100, prob = 0.01) + dbinom(x = 0, size = 100, prob = 0.01))` = 0.264

We can also do conditional probability calculations which could be useful for risk assessment scenarios.

What is the probability that exactly two 1% floods occur in 100 years given that at least one occurs? $\Pr{(X = 2\mid X \ge 1)}$ =
`dbinom(x = 2, size = 100, prob = 0.01)/pbinom(q = 0, size = 100, prob = 0.01, lower.tail = FALSE)` = 0.291

What is the probability that at least two 1% floods occur in 100 years given that at least one occurs? $\Pr{(X \ge 2\mid X \ge 1)}$ =
`pbinom(q = 1, size = 100, prob = 0.01, lower.tail = FALSE)/pbinom(q = 0, size = 100, prob = 0.01, lower.tail = FALSE)` = 0.416
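These conditional probabilities can be reproduced from first principles; a Python sketch using only the standard library:

```python
from math import comb

def pmf(k, n=100, p=0.01):
    # Binomial probability of exactly k floods in n years
    return comb(n, k) * p**k * (1 - p)**(n - k)

p_at_least_one = 1 - pmf(0)

# Pr(X = 2 | X >= 1)
p_exactly_two = pmf(2) / p_at_least_one  # ~0.29
# Pr(X >= 2 | X >= 1)
p_two_or_more = (1 - pmf(0) - pmf(1)) / p_at_least_one  # ~0.42
```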

We can also check this by simulation.  This code generates the number of 1% floods in each of 100,000 100-year sequences.  We can then count the number of interest.

set.seed(1969) # use a random number seed so the analysis can be repeated if necessary
floods = rbinom(100000,100, 0.01) # generate the number of 1% floods in each of 100,000, 100-year sequences

floods_subset = floods[floods >= 1] # Subset of sequences that have 1 or more floods
# Number of times there are two or more floods in the subset of 1 or more floods

sum(floods_subset >= 2) / length(floods_subset)
# 0.4167966

# or
sum(floods >= 2)/sum(floods >= 1)

# 0.4167966



A slightly trickier situation is a question like: what is the probability of three or fewer floods in 100 years given there is more than one? $\Pr{(X \le 3\mid X > 1)} = \Pr(X \le 3 \cap X > 1 )/\Pr( X > 1)$

```r
floods_subset = floods[floods > 1] # subset of sequences that have more than one flood

# Proportion of the subset with three or fewer floods
sum(floods_subset <= 3) / length(floods_subset)
# 0.9310957

# Or, for the exact value
# (Pr(X = 2) + Pr(X = 3)) / Pr(X > 1)
(dbinom(x = 3, size = 100, prob = 0.01) + dbinom(x = 2, size = 100, prob = 0.01)) / pbinom(q = 1, size = 100, prob = 0.01, lower.tail = FALSE)
# 0.9304641
```



The probability of experiencing at least one 1% flood in 100 years is $1 - (1-0.01)^{100} = 0.634$.  How many years would we need to wait to have a 99% chance of experiencing a 1% flood? Solving $0.99 = 1-(1-0.01)^n$ gives $n=\frac{\log(0.01)}{\log(0.99)} = 458.2$.  The next largest integer is 459.

We can also solve this numerically.  In R, the task is to solve 0.99 = `pbinom(q = 0, size = n, prob = 0.01, lower.tail = FALSE)` for n. Using the `uniroot` function gives n = 459 years (see below).

So all these areas subject to a 1% flood risk will flood eventually, but it may take a while.

```r
f = function(n) {
  n = as.integer(n) # n must be an integer
  0.99 - pbinom(q = 0, size = n, prob = 0.01, lower.tail = FALSE)
}

uniroot(f, lower = 100, upper = 1000)
# $root
# 458.4999

pbinom(q = 0, size = 459, prob = 0.01, lower.tail = FALSE)
# 0.990079
```



How many years before there is a >99% chance of experiencing more than one flood? This is one minus (the probability of zero floods + the probability of one flood).

Let the number of years equal n. Then $1-\left((1-0.01)^n + n(0.01)(1-0.01)^{n-1}\right) = 0.99$. Solving for n gives 662 years.
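The 662-year answer can be checked numerically; a minimal Python sketch (the function name is my own) that searches for the smallest n satisfying the equation above:

```python
def years_for_two_floods(p=0.01, target=0.99):
    """Smallest n with Pr(more than one flood in n years) >= target,
    using Pr(X >= 2) = 1 - (Pr(X = 0) + Pr(X = 1))."""
    n = 2
    while 1 - ((1 - p)**n + n * p * (1 - p)**(n - 1)) < target:
        n += 1
    return n

print(years_for_two_floods())  # 662
```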

# Using a Treemap to show gauged areas

According to Wikipedia, a Treemap is a method for displaying hierarchical data using nested rectangles.  It's easier to explain with an example, which also shows how Treemaps are useful in hydrology.

In this post we’ll use a Treemap to provide information on the catchment of Western Port, a bay to the east of Melbourne; in particular, the proportion of the catchment that is gauged.  By the way, according to the Vicnames database, the official name is Western Port, not Western Port Bay.

There are 4 main gauged tributaries draining to Western Port:

• Cardinia Creek
• Bunyip River
• Lang Lang River
• Bass River

Figure 1 shows the gauged portions of these catchments overlaid on a shaded relief base map.

Figure 1: Gauged catchments draining to Western Port

We can also plot the gauged and ungauged portions of these catchments on a treemap along with the remainder of the Western Port catchment (Figure 2), or just the gauged portions compared to the rest (Figure 3).

Figure 2: Gauged and ungauged portions of the 4 catchments

Figure 3: Gauged portions of the four catchments compared with the rest of the Western Port catchment

If we are going to estimate flow or pollution loads to Western Port then we need to factor up the results from the gauges.  A treemap shows how much extrapolation is necessary.  Code to produce these treemaps is available as a gist.

# How do you spell Eumemmerring?

Edit 10 Sep 2019.  The WMIS has been updated so the spelling of Eumemmerring is now correct.

There is a waterway in Melbourne's south east called Eumemmerring Ck (note the spelling: the first 'm' is single, the second 'm' is doubled and the 'r' is doubled).  This is the official spelling from the Vicnames website.  On the Victorian water monitoring site (WMIS) it's "Eummemering" (first 'm' doubled, single 'r'), as in the stream gauges:

On the Bureau of Meteorology's Water Data Online it's Eumemmering (single first 'm', double second 'm', single 'r'):

• 228203C Eumemmering Creek at Dandenong South
• 228235A Eumemmering Creek at Narre Warren North

Looking at the number of hits from a Google search:

• Eumemmerring, the correct spelling (299,000 results)
• Eummemering, the spelling on the WMIS website (663 results)
• Eumemmering, the spelling on the Bureau of Meteorology's Water Data Online (9,610 results).

So the correct spelling is winning on the internet but there is significant support for the BOM spelling.

A map from the Argus of 30 Oct 1888 (p. 3) supports the current official spelling, as does the Parish Plan from 1859 (Figures 1 and 2).  It looks like the Bureau and the Victorian WMIS need to correct their records.

Figure 1: Map published in the Argus on 30 Oct 1888 (page 3)

Figure 2: Parish Plan (Victoria, Public Lands Office, 1859) (http://handle.slv.vic.gov.au/10381/150137)

There is more history of Eumemmerring provided by the City of Casey.

# Model performance based on coefficient of efficiency

The Nash-Sutcliffe coefficient of efficiency, E, is commonly used to assess the performance of rainfall runoff models:

$E = \frac{\sum \left(O_i - \bar{O} \right)^2 - \sum \left( M_i - O_i \right)^2 }{\sum \left(O_i - \bar{O} \right)^2 }$

where $O_i$ are the observed values, $M_i$ are the modelled values and $\bar{O}$ is the mean of the observations.

The maximum value of E is 1.  A value of zero indicates that the model is only as good as using the mean of the observations.  E can be less than zero.
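The calculation of E is straightforward; a short Python sketch (the sample flows are made up for illustration):

```python
def nash_sutcliffe(observed, modelled):
    """Nash-Sutcliffe coefficient of efficiency E."""
    o_bar = sum(observed) / len(observed)
    ss_obs = sum((o - o_bar)**2 for o in observed)  # variability of observations
    ss_err = sum((m - o)**2 for m, o in zip(modelled, observed))  # model error
    return (ss_obs - ss_err) / ss_obs

obs = [12.0, 25.0, 18.0, 40.0, 9.0]
print(nash_sutcliffe(obs, obs))                        # 1.0: perfect model
print(nash_sutcliffe(obs, [sum(obs) / len(obs)] * 5))  # 0.0: mean-only model
```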

So what value of E indicates that a model is reasonable? Chiew and McMahon (1993) surveyed 93 hydrologists (63 responded) to find out which diagnostic plots and goodness-of-fit statistics they used, which were the most important, and how they were used to classify the quality of a model fit. The most important diagnostic graphs were time series plots and scatter plots of simulated and recorded flows from the data used for calibration. R-squared and the Nash-Sutcliffe coefficient of efficiency (E) were the favoured goodness-of-fit statistics. Results were considered acceptable if E ≥ 0.8.

I adapted the values provided by Chiew and McMahon (1993) to create the following table (Ladson, 2008).

Table 1: Model performance based on coefficient of efficiency

| Classification | Coefficient of efficiency (calibration) | Coefficient of efficiency (validation) |
|---|---|---|
| Excellent | E ≥ 0.93 | E ≥ 0.93 |
| Good | 0.8 ≤ E < 0.93 | 0.8 ≤ E < 0.93 |
| Satisfactory | 0.7 ≤ E < 0.8 | 0.6 ≤ E < 0.8 |
| Passable | 0.6 ≤ E < 0.7 | 0.3 ≤ E < 0.6 |
| Poor | E < 0.6 | E < 0.3 |
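For convenience, the calibration bands in Table 1 can be encoded as a small lookup; a Python sketch (the function name is my own):

```python
def classify_calibration(E):
    """Classify model performance using the calibration column of Table 1."""
    if E >= 0.93:
        return "Excellent"
    elif E >= 0.8:
        return "Good"
    elif E >= 0.7:
        return "Satisfactory"
    elif E >= 0.6:
        return "Passable"
    return "Poor"

print(classify_calibration(0.85))  # Good
```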

Others have suggested different ranges.  For example, when assessing an ecosystem model in the North Sea, the following categorisation was used: E > 0.65 excellent, 0.5 to 0.65 very good, 0.2 to 0.5 good, and < 0.2 poor (Allen et al., 2007).  Moriasi et al. (2015) provide the performance evaluation criteria shown in Table 2.

Table 2: Criteria for Nash-Sutcliffe coefficient of efficiency (Moriasi et al., 2015)

| Component | Temporal scale | Very good | Good | Satisfactory | Not satisfactory |
|---|---|---|---|---|---|
| Flow | Daily, monthly, annual | E > 0.80 | 0.70 < E ≤ 0.80 | 0.50 < E ≤ 0.70 | E ≤ 0.50 |
| Sediment | Monthly | E > 0.80 | 0.70 < E ≤ 0.80 | 0.45 < E ≤ 0.70 | E ≤ 0.45 |
| Nitrogen, phosphorus | Monthly | E > 0.65 | 0.50 < E ≤ 0.65 | 0.35 < E ≤ 0.50 | E ≤ 0.35 |

In modelling of flows to the Great Barrier Reef (GBR) the following criteria were adopted (Waters, 2014):

• Daily Nash Sutcliffe Coefficient of Efficiency, E > 0.5
• Monthly E > 0.8

For GBR constituent modelling, the following ranges were used for the coefficient of efficiency: Very good E > 0.75; Good 0.65 < E ≤ 0.75; Satisfactory 0.50 < E ≤ 0.65; Unsatisfactory E ≤ 0.50.

Also note that the Nash-Sutcliffe coefficient has been criticised and there are alternative proposals but it remains frequently used in hydrologic modelling.  See the references below for more information.

There is also an interesting discussion on Stack Exchange: https://stats.stackexchange.com/questions/414349/is-my-model-any-good-based-on-the-diagnostic-metric-r2-auc-accuracy-e/

### References

Allen, J., P. Somerfield, and F. Gilbert (2007), Quantifying uncertainty in high‐resolution coupled hydrodynamic‐ecosystem models, J. Mar. Syst.,64(1–4), 3–14, doi:10.1016/j.jmarsys.2006.02.010. (link to research gate)

Bardsley, W.E. (2013) A goodness of fit measure related to r2 for model performance assessment.  Hydrological Processes. 27(19):2851-2856. DOI: 10.1002/hyp.9914

Criss RE, Winston WE. 2008. Do Nash values have value? Discussion and alternate proposals. Hydrological Process 22: 2723–2725.

Gupta HV, Kling H. 2011. On typical range, sensitivity, and normalization of mean squared error and Nash-Sutcliffe efficiency type metrics. Water Resources Research 47: W10601, doi:10.1029/2011WR010962.

Gupta HV, Kling H, Yilmaz KK, Martinez GF. 2009. Decomposition of the mean squared error and NSE performance criteria: Implications for improving hydrological modelling. Journal of Hydrology 377:80–91.

Ladson, A. R. (2008) Hydrology: an Australian Introduction.  Oxford University Press. (link)

McCuen RH, Knight Z, Cutter, AG. 2006. Evaluation of the Nash–Sutcliffe efficiency index. Journal of Hydrologic Engineering 11:597–602.

Moriasi, D., Gitau, M. Pai, N. and Daggupati, P. (2015) Hydrologic and Water Quality Models: Performance Measures and Evaluation Criteria Transactions of the ASABE (American Society of Agricultural and Biological Engineers) 58(6):1763-1785 (Link to article at research gate)

Murphy, A. H. (1988) Skill scores based on the mean square error and their relationship to the correlation coefficient.  Monthly Weather Review 116: 2417-2424.

Waters, D. (2014) Modelling reductions of pollutant loads due to improved management practices in the Great Barrier Reef catchments. Whole of GBR, Technical Report. Volume 1. Department of Natural Resources and Mines, Queensland Government. (Link to article at research gate)

Willmott, C. J. (1981) On the validation of models.  Physical Geography 2: 184-194.

# The distribution of burst initial loss

Recent posts have looked at the distribution of storm initial loss and pre-burst rainfall.  Here I extend the analysis to burst initial loss.  The definitions of storm and burst initial loss are described here.

According to the Australian Rainfall and Runoff (ARR) data hub:

Burst initial loss = storm initial loss – pre-burst rainfall.  (1)

This seems clear, but the text in ARR is confusing.  For example, Book 5, Chapter 3.3.2:

> However, if design bursts, rather than complete storms, are used in design then the *burst* initial loss needs to be reduced to account for the pre-burst rainfall. (emphasis added)

Shouldn't the emphasised word be 'storm'?  That would make the statement consistent with equation 1.  There is a similar problem in Chapter 3.5.1.  I've sent these corrections to the editors of ARR, so perhaps they will be addressed in the next edition.

In applying equation 1, it is assumed that median values of storm initial loss and pre-burst rainfall should be used to determine the median burst initial loss.  This assumption was tested by WMA Water for NSW. They found that this approach resulted in an over-estimation of burst initial loss (See Section 7 of WMA Water 2019).

We can test this approach using the data for Toomuc Creek in Victoria (-38.064520N, 145.463277E). The distributions of storm initial loss and pre-burst rainfall were discussed here and here.  From the data hub, the median storm initial loss is 25 mm and the median pre-burst rainfall is 1.5 mm for the 2 hour 1% AEP storm.  Therefore, using equation 1, the median burst initial loss for the 2 hour 1% AEP storm is estimated as 25 – 1.5 = 23.5 mm.

A histogram of burst initial loss, based on 10,000 randomly generated values of pre-burst rainfall and storm initial loss, is shown in Figure 1.  Note that many of the values are less than zero, which occurs when the pre-burst rainfall is larger than the storm initial loss.  The median burst initial loss is 17.6 mm.  If we set all the negative burst initial loss values to zero, the histogram becomes as shown in Figure 2.  The median is unchanged at 17.6 mm.

Figure 1: Histogram of 10,000 burst initial loss values

Figure 2: Histogram of 10,000 burst initial loss values restricted to be zero or greater

The upshot is that equation 1, when based on median storm initial loss and median pre-burst rainfall, does not provide an accurate estimate of the median burst initial loss.  The loss is overestimated, which means modelled flood peaks will be biased low.  Although only one example has been used in this analysis, the result confirms that reported by WMA Water; their analysis was based on a large number of catchments, durations and AEPs.  WMA Water address this problem by calculating 'probability neutral' burst initial loss values based on the distributions of storm initial loss and pre-burst rainfall.  These probability neutral burst initial loss values are included in the data hub for sites in NSW; there has been no similar work in Victoria or other states.  The probability neutral loss for Toomuc Creek would be 17.6 mm for the 2 hour 1% AEP event.  These losses could be calculated for other durations and AEPs by extending the methods shown here.  It would be reasonably straightforward to create a web app to read in the file from the data hub and output all the required losses.

Code to reproduce the figures and calculations is available as a gist.

I’d like to acknowledge the assistance of Scott Podger but any mistakes are mine.

### References

WMA Water (2019) Review of ARR design inputs for NSW.  Report for the NSW, Office of Environment and Heritage.  Authors: Podger, S., Babister, M., Trim, A., Retallick, M. and Adam, M. (link)

# The distribution of pre-burst rainfall

Pre-burst rainfall is storm rainfall that occurs before the main rainfall burst.  In design, we get information on bursts from IFD tables and sometimes need to take account of pre-burst rainfall to construct complete storms or determine the correct initial loss values to use in modelling.

Pre-burst rainfall is characterised, regionalised and mapped in Australian Rainfall and Runoff (ARR) (Book 2, Chapter 5.2.1).  Information on the analysis that supports this work is available as a conference paper (Loveridge et al., 2015a) and there is also a detailed report (Loveridge et al., 2015b).

Pre-burst depth throughout Australia is provided in Figure 2.5.10 of ARR, reproduced as Figure 1 below.  Note that the figure caption in ARR is incorrect.

Looking at the values for Victoria, the pre-burst depths are reasonably small (0 – 15 mm) but are of similar magnitude to initial loss, so they need to be considered in design and modelling, particularly in urban areas where initial losses are low.

Figure 1: Pre-burst depth (Source: Figure 2.5.10 of Australian Rainfall and Runoff)

Specific information on pre-burst rainfall for locations throughout Australia is available from the ARR data hub:

• pre-burst depths
• pre-burst to burst ratios.

There is some information on the variability of these values with estimates available for a range of percentiles (10, 25, 50 (median), 75, 90).

Consider Toomuc Creek (-38.064520N, 145.463277E) as an example.  The pre-burst depths and pre-burst to burst ratios for a 1% AEP, 2 hour event, available from the data hub are shown in Table 1.

We can also estimate the pre-burst depths independently by multiplying the burst depth by the pre-burst to burst ratios.  The 1% AEP, 2 hour burst depth is 52 mm (available from the IFD page at the Bureau).  So the  90th percentile pre-burst depth will be 52 x 0.739 = 38.4 mm.  The value of  0.739 is from Table 1, bottom value in column 3.

Table 1: Pre-burst information for Toomuc Ck, sourced from the data hub

| Percentile | Depth (mm) | Ratio |
|---|---|---|
| 10 | 0 | 0 |
| 25 | 0 | 0 |
| 50 | 1.5 | 0.029 |
| 75 | 13.3 | 0.256 |
| 90 | 38.4 | 0.739 |

For Monte Carlo analysis, the data in Table 1 can be treated as an empirical distribution so that pre-burst depths and ratios can be randomly generated using the methods in the ARR supporting document, Monte Carlo Simulation Techniques (see Section 7.2).  Here, I've linearly interpolated between the given percentiles, and linearly extrapolated beyond 90% and between 0 and 10% (Figure 2).

Figure 2: Empirical distribution of pre-burst rainfall depth for Toomuc Creek (2 hour 1% AEP); data are from Table 1
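A sketch of that interpolation in Python follows. How the extrapolation is anchored beyond the tabulated percentiles is my assumption; ARR's Monte Carlo Simulation Techniques document should be consulted for the exact method.

```python
import random

# Percentile / pre-burst depth (mm) pairs from Table 1
PTS = [(0.10, 0.0), (0.25, 0.0), (0.50, 1.5), (0.75, 13.3), (0.90, 38.4)]

def preburst_quantile(p):
    """Depth at probability p: linear interpolation between tabulated points,
    linear extrapolation (reusing the end segments) outside them."""
    if p <= PTS[0][0]:
        (p0, d0), (p1, d1) = PTS[0], PTS[1]
    elif p >= PTS[-1][0]:
        (p0, d0), (p1, d1) = PTS[-2], PTS[-1]
    else:
        (p0, d0), (p1, d1) = next(
            (a, b) for a, b in zip(PTS, PTS[1:]) if a[0] <= p <= b[0])
    depth = d0 + (d1 - d0) * (p - p0) / (p1 - p0)
    return max(depth, 0.0)  # depths cannot be negative

# Randomly generate 10,000 pre-burst depths from the empirical distribution
random.seed(1)
depths = [preburst_quantile(random.random()) for _ in range(10000)]
```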

Histograms of 10,000 randomly generated pre-burst depths and ratios are shown in Figures 3 and 4.  The data are highly skewed with many values near zero. The median pre-burst depth is 1.5 mm and the mean is 10.66 mm.  For the ratios, the median is 0.029 and the mean is 0.21.  As expected, the median values are equal to those shown in Table 1.

Also note that some of the time, pre-burst is quite large.  The depth is greater than 20 mm for 21% of values.  This is comparable with the median initial loss in rural catchments (there is a summary of initial loss values here).  This suggests that in a wet catchment, the pre-burst could be larger than the initial loss and runoff could have started before the main burst arrives.

Figure 3: Histogram of randomly generated pre-burst depths

Figure 4: Histogram of randomly generated pre-burst ratios

Code to reproduce the figures and calculations is available as a gist.

### References

Loveridge, M., Babister, M., Stensmyr, P., and Adam, M. (2015a) Estimation of pre-burst rainfall for design flood estimation in Australia.  36th Hydrology and Water Resources Symposium.  Engineers, Australia. (link to paper)

Loveridge, M., Stensmyr, P. and Babister, M. (2015b) Project 3: Temporal patterns of rainfall, Part 2 – pre-burst rainfall. Australian Rainfall and Runoff.  Engineers, Australia. (link to report)

# The distribution of losses – II

The last post looked at the distribution of storm initial and continuing loss associated with the conversion of rainfall to runoff.  Histograms of randomly generated values were used to show the variation in losses.

An alternative way of generating similar graphs is to numerically differentiate the empirical cumulative distribution of losses provided in Table 5.3.13 of Australian Rainfall and Runoff (shown here as Table 1).  The following figures show estimated probability density functions based on differentiating the empirical cumulative distribution functions in Table 1.

Table 1: Standardised loss factors

| Percentage of time loss is exceeded | Initial loss | Continuing loss |
|---|---|---|
| 0 | 3.19 | 3.85 |
| 10 | 2.26 | 2.48 |
| 20 | 1.71 | 1.88 |
| 30 | 1.40 | 1.50 |
| 40 | 1.20 | 1.24 |
| 50 | 1.00 | 1.00 |
| 60 | 0.85 | 0.79 |
| 70 | 0.68 | 0.61 |
| 80 | 0.53 | 0.48 |
| 90 | 0.39 | 0.35 |
| 100 | 0.14 | 0.15 |

Figure 1: Empirical probability density of the standardised initial loss

Figure 2: Empirical probability density of the standardised continuing loss
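The numerical differentiation is simple finite differencing of the table; a Python sketch for the initial loss column (variable and function names are my own):

```python
# Exceedance percentages and standardised initial loss factors from Table 1
exceedance = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
init_loss = [3.19, 2.26, 1.71, 1.40, 1.20, 1.00, 0.85, 0.68, 0.53, 0.39, 0.14]

def empirical_pdf(exceed_pct, factors):
    """Approximate density over each interval: the CDF is
    F(x) = 1 - exceedance/100, so density = (change in F) / (change in x),
    reported at the interval midpoints."""
    mids, dens = [], []
    for i in range(len(factors) - 1):
        dF = (exceed_pct[i + 1] - exceed_pct[i]) / 100.0  # probability in interval
        dx = factors[i] - factors[i + 1]  # factors decrease down the table
        mids.append((factors[i] + factors[i + 1]) / 2.0)
        dens.append(dF / dx)
    return mids, dens

mids, dens = empirical_pdf(exceedance, init_loss)
```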

Calculations are available as a gist.