Experiential learning: peak flow estimation for the Po River at Pontelagoscuro

[This lecture was partially created with the support of Artificial Intelligence tools (ChatGPT)]

Introduction

The Po River Basin, located in northern Italy, is the country’s largest and most important drainage area.  It covers roughly 70,000 square kilometers and extends from the western Alps to the Adriatic Sea.  The Po River itself is about 652 kilometers long, making it Italy’s longest river.  Major tributaries include the Dora Baltea, Ticino, Adda, Oglio, and Mincio rivers.  The basin encompasses several major cities, such as Turin, Milan, and Ferrara.

It is a highly fertile region, forming the core of Italy’s agricultural production—especially rice, maize, and wheat.  Intensive irrigation and industrial development have made it an economic powerhouse.  However, these activities have also caused pollution and significant water management challenges.  The area is prone to both flooding and drought due to climate variability and human pressures. Today, sustainable management of the Po River Basin is crucial for Italy’s environmental and economic future.

Climate change is significantly altering rainfall patterns and water availability across the basin. Rising temperatures increase evaporation and reduce snow accumulation in the Alps, which traditionally feeds the river during warmer months. As a result, summer river flows are decreasing, leading to more frequent and severe droughts. One example is the extreme low water levels recorded during the 2022 European drought. Reduced water flow threatens irrigation systems that support intensive farming in the Po Valley. Climate change also increases the risk of floods due to more intense rainfall events. Saltwater intrusion from the Adriatic Sea is becoming more common when river discharge is low. These environmental pressures affect biodiversity, agriculture, and hydroelectric power generation in the basin. Adapting water management strategies is therefore essential to protect the long-term sustainability of the Po River Basin.

In particular, floods over the Po River Basin are widely studied. The Po River is intensively monitored; in particular, river flow is monitored at several river cross sections. At Pontelagoscuro, which is considered the closure section of the catchment, daily river levels have been monitored since the 19th century. Since 1920, mean daily levels have been converted into mean daily flows. The resulting time series is available here

Note that the mean daily flow is not equivalent to the peak flow that occurred on that day, but is lower. For a large basin like the Po, an empirical relationship used in Italy gives the ratio between the mean daily flow and the peak flow on the day of a large discharge:

\(\frac{Q_{m}}{Q_{c}} \approx 0.75 - 0.85\) ,

which is called the "Formula di Cotecchia". For the Po we usually adopt a value around 0.75.
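As a quick numerical sketch (the flow value below is made up for illustration, not taken from the Po record), the daily peak flow can be back-calculated from the mean daily flow by dividing by the adopted ratio:

```python
# Back-calculate the daily peak flow Qc from the mean daily flow Qm
# using the Formula di Cotecchia ratio Qm/Qc (0.75, as adopted for the Po).
ratio = 0.75          # Qm/Qc, empirical, in the range 0.75-0.85
Qm = 9000.0           # hypothetical mean daily flow in m^3/s
Qc = Qm / ratio       # estimated peak flow on that day
print(f"Estimated peak flow: {Qc:.0f} m^3/s")
```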

Estimation of the peak value of the mean daily flow for the Po River at Pontelagoscuro with the annual maxima method

The peak river flow is usually estimated by fitting a proper extreme value probability distribution. Let us estimate it by fitting a Gumbel, a Fréchet and a GEV distribution through the method of annual maxima, and evaluate the goodness of fit. Let us start with R.

poflow=read.table("Po-pontelagoscuro1920-2026.txt", header = FALSE, sep = " ", col.names = c("Date", "Flow"), stringsAsFactors = FALSE, skip=1)
# Convert Date column to Date format
poflow$Date=as.Date(poflow$Date, format = "%d/%m/%Y")
#Extract annual maxima
poflow$Year=format(poflow$Date, "%Y")
annual_max=aggregate(Flow ~ Year, data = poflow, max)

Now let's install the required packages to fit the distributions.

#Install and load required packages
install.packages(c("evd","fitdistrplus","ismev","goftest"))
library(evd)
library(fitdistrplus)
library(ismev)
library(goftest)

Let's estimate the Gumbel and GEV distributions.

#Fit a Gumbel distribution - note how parameters are estimated
gumbel_fit <- fgev(annual_max$Flow, shape = 0)
summary(gumbel_fit)
#Fit a GEV distribution
gev_fit <- fgev(annual_max$Flow)
summary(gev_fit)

For the Fréchet we need to apply maximum likelihood with an optimisation routine that constrains the shape parameter. Note: given that the GEV estimation ended up with a negative estimate for the shape, it is likely that the Fréchet fit converges to a shape of zero.

#Fit a Fréchet distribution
#We need to ensure that the shape parameter is greater than zero, so a custom maximum likelihood estimation with a lower bound on the shape parameter is required
#We first fit a GEV because we need an output object in the format expected by the diagnostic functions of the loaded packages
frechet_fit <- fgev(annual_max$Flow)
#Note that if the estimated shape parameter is already greater than zero we can stop here: the estimated GEV is already a Fréchet

#Define parameter vector
frechet_fit1=list()
frechet_fit1$param=c(mean(annual_max$Flow),sd(annual_max$Flow),0.1)

# Define the function giving the likelihood
negloglik <- function(par, x)
{
loc=par[1]
scale=par[2]
shape=par[3]
-sum(dgev(x, loc=loc, scale=scale, shape=shape, log=TRUE))
}

# Impose lower bounds: scale > 0, shape > 0 (no bound on location)
# Note that the starting values are those defined above in frechet_fit1$param,
# since the GEV estimates violate the lower bound on the shape
frechet_fit1$param=optim(par=frechet_fit1$param,fn=negloglik,x=annual_max$Flow,method="L-BFGS-B",lower=c(-Inf, 1e-6, 1e-6))$par

#Copy the constrained parameter estimates into the dummy object obtained with fgev (keeping the parameter names)
frechet_fit$estimate[]=frechet_fit1$param
frechet_fit$param[]=frechet_fit1$param

Let's check the goodness of fit with the Kolmogorov-Smirnov test and probability plots.

#Diagnostic plots
par(mfrow=c(2,2))
plot(gumbel_fit)
plot(gev_fit)
plot(frechet_fit)
 
#Kolmogorov-Smirnov goodness of fit testing
#Gumbel
ks.test(annual_max$Flow,"pgumbel",loc=gumbel_fit$estimate["loc"],scale = gumbel_fit$estimate["scale"])
#If you get a warning about ties, it is due to the presence of duplicate values in the series of annual maxima
#Frechet
ks.test(annual_max$Flow,"pgumbel",loc=frechet_fit$estimate[1],scale=frechet_fit$estimate[2])
#GEV
ks.test(annual_max$Flow,"pgev",loc=gev_fit$estimate["loc"],scale = gev_fit$estimate["scale"],shape = gev_fit$estimate["shape"])

You may have noticed that the KS test for the Fréchet distribution is computed with the R function "pgumbel" instead of "pfrechet", which is also available. The reason for using "pgumbel" is that the fitted Fréchet distribution has a shape parameter very close to zero, so it degenerates: an appropriate rescaling is needed for the Fréchet distribution to converge to the Gumbel distribution as \(\xi \to 0\). If you are interested in learning more about the convergence of the Fréchet/GEV distribution to the Gumbel distribution as \(\xi \to 0\), see here.
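The limit can be checked numerically. A minimal sketch with scipy (note that scipy's genextreme uses the sign convention c = \(-\xi\), so a small positive \(\xi\) corresponds to a small negative c; the location and scale values below are arbitrary illustrative numbers):

```python
import numpy as np
from scipy import stats

loc, scale = 3000.0, 800.0       # arbitrary illustrative parameters
x = np.linspace(1000, 8000, 50)  # evaluation points

# GEV with a shape parameter very close to zero (scipy: c = -xi)
cdf_gev = stats.genextreme.cdf(x, c=-1e-8, loc=loc, scale=scale)
# Gumbel, the xi -> 0 limit of the GEV
cdf_gumbel = stats.gumbel_r.cdf(x, loc=loc, scale=scale)

print("max |GEV - Gumbel| CDF difference:", np.max(np.abs(cdf_gev - cdf_gumbel)))
```

The two curves are numerically indistinguishable, which is why "pgumbel" is an adequate stand-in when the estimated shape is at the boundary.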

For the sake of comparison, let's compute the peak flow for a given return period.

#Compute and print peak flow for given return period
T <- c(30,50,100,200)

#Design floods (GEV)
Qgev <- qgev(1-1/T,loc=gev_fit$estimate["loc"],scale=gev_fit$estimate["scale"],shape=gev_fit$estimate["shape"])

#Design floods (Gumbel)
Qgumbel <- qgumbel(1-1/T,loc=gumbel_fit$estimate["loc"],scale=gumbel_fit$estimate["scale"])

# Results table
results <- data.frame(ReturnPeriod=T,Q_GEV=Qgev,Q_Gumbel=Qgumbel)
print(results)

To conclude, fit the Gumbel with the method of moments and compare the resulting parameter values and goodness-of-fit tests. You may also compare with the "homemade" estimation of distribution parameters that you can find here.
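As a hint, the method-of-moments estimates for the Gumbel parameters follow from its mean and standard deviation: \(\beta = s\sqrt{6}/\pi\) and \(\mu = \bar{x} - \gamma\beta\), where \(\gamma \approx 0.5772\) is the Euler-Mascheroni constant. A minimal sketch (the sample moments below are made-up illustrative values, to be replaced with those of the annual maxima):

```python
import math

def gumbel_method_of_moments(sample_mean, sample_sd):
    """Return (loc, scale) of the Gumbel distribution matching the sample moments."""
    euler_gamma = 0.57721566490153286
    scale = sample_sd * math.sqrt(6) / math.pi   # beta = s*sqrt(6)/pi
    loc = sample_mean - euler_gamma * scale      # mu = mean - gamma*beta
    return loc, scale

# Hypothetical sample moments (use mean(annual_max$Flow), sd(annual_max$Flow) in practice)
loc, scale = gumbel_method_of_moments(5000.0, 2000.0)
print(f"loc = {loc:.1f}, scale = {scale:.1f}")
```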

Estimation of the peak value of the mean daily flow for the Po River at Pontelagoscuro with the peak over threshold method

To apply the peak over threshold method we first need to fix the value \(u\) of the threshold. A rule of thumb is to select the value of the daily river flow that is exceeded 5% of the time, that is, the 95th percentile of the daily flows. In terms of coding:

u=quantile(poflow$Flow, 0.95) # example threshold
POTFlow=poflow$Flow[poflow$Flow > u]-u
#Now POTFlow contains the POT excesses, which are modeled with a GPD
#Fit the GPD
fit_gpd=fpot(poflow$Flow, u)
summary(fit_gpd)
#Diagnostic plots
plot(fit_gpd) #Alternative plot with built-in diagnostics
qqplot(qgpd(ppoints(length(POTFlow)), scale=fit_gpd$estimate[1], shape=fit_gpd$estimate[2]), POTFlow)
abline(0,1)

Now let's apply goodness-of-fit testing with the Anderson-Darling test:

#Goodness-of-fit test using goftest (Anderson–Darling test)
#Extract estimated parameters
scale_hat=fit_gpd$estimate["scale"]
shape_hat=fit_gpd$estimate["shape"]
ad.test(POTFlow, null = "pgpd", scale = scale_hat, shape = shape_hat)

The statistic of the Anderson-Darling test (which is explained here) can also be computed with standard R functions, by evaluating the probability of the Generalised Pareto Distribution for each of the POT excesses contained in the vector POTFlow. Here below we provide an example of the R instruction for computing the probability of the Generalised Pareto Distribution for a given quantile, here assumed equal to 5:

pgpd(5, scale=fit_gpd$estimate[1], shape=fit_gpd$estimate[2])

For the critical value of the Anderson-Darling statistic see the above output from the function "ad.test".
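As a sketch of this hand computation (shown here in Python, with scipy's genpareto playing the role of pgpd; the excesses below are synthetic data generated for illustration, not the Po series), the Anderson-Darling statistic for a fully specified null CDF \(F\) is \(A^2 = -n - \frac{1}{n}\sum_{i=1}^{n}(2i-1)\left[\ln F(x_{(i)}) + \ln\left(1-F(x_{(n+1-i)})\right)\right]\):

```python
import numpy as np
from scipy import stats

def anderson_darling_stat(x, cdf):
    """Anderson-Darling statistic for a fully specified null CDF."""
    x = np.sort(x)
    n = len(x)
    F = cdf(x)                       # F(x_(i)) in ascending order
    i = np.arange(1, n + 1)
    # F[::-1] gives F(x_(n+1-i)); np.mean implements the (1/n) * sum
    return -n - np.mean((2 * i - 1) * (np.log(F) + np.log(1 - F[::-1])))

# Synthetic POT excesses drawn from a GPD (illustrative parameters)
rng = np.random.default_rng(42)
shape, scale = 0.1, 500.0
excesses = stats.genpareto.rvs(shape, scale=scale, size=300, random_state=rng)

A2 = anderson_darling_stat(excesses,
                           lambda v: stats.genpareto.cdf(v, shape, scale=scale))
print(f"A^2 = {A2:.3f}")
```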

The above set of codes can also be run in Python. A suggested comprehensive code is copied here below (credit: Sepehr Almasi):


# ==============================================================================
# EXTREME VALUE THEORY PIPELINE: PO RIVER ANALYSIS
# Complete translation from R to Python (scipy & pandas) 
# Programmed by Sepehr Almasi
# ==============================================================================

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
from scipy.optimize import minimize
import sys

# ---------------------------------------------------------
# PART 1: DATA INGESTION & BLOCK MAXIMA EXTRACTION
# ---------------------------------------------------------
print("--- PART 1: DATA PROCESSING ---")
file_name = "Po-pontelagoscuro1920-2026.txt"

poflow = pd.read_csv(file_name, sep=r"\s+", header=None, names=["Date", "Flow"], skiprows=1)
poflow['Date'] = pd.to_datetime(poflow['Date'], format="%d/%m/%Y")
poflow['Year'] = poflow['Date'].dt.year

# Extract Annual Maxima (Block Maxima)
annual_max = poflow.groupby('Year')['Flow'].max().reset_index()
maxima_data = annual_max['Flow'].values
print(f"Success: Extracted {len(maxima_data)} annual maxima records.")

# ---------------------------------------------------------
# PART 2: STABLE MLE FITTING (Gumbel & Guided GEV)
# ---------------------------------------------------------
print("\n--- PART 2: DISTRIBUTION FITTING ---")

# 1. Fit Gumbel
gum_params = stats.gumbel_r.fit(maxima_data)
ll_gumbel = np.sum(stats.gumbel_r.logpdf(maxima_data, *gum_params))
print(f"GUMBEL -> Loc: {gum_params[0]:.2f}, Scale: {gum_params[1]:.2f} | LogLik: {ll_gumbel:.2f}")

# 2. Fit GEV (THE FIX)
# We force Scipy to start searching with a positive 'c' (which means negative shape, xi < 0)
# This prevents the optimizer from getting trapped in a heavy-tailed local minimum.
initial_c = 0.1 
gev_params = stats.genextreme.fit(maxima_data, initial_c, loc=np.mean(maxima_data), scale=np.std(maxima_data))

ll_gev = np.sum(stats.genextreme.logpdf(maxima_data, *gev_params))
print(f"GEV -> Shape (xi): {-gev_params[0]:.4f}, Loc: {gev_params[1]:.2f}, Scale: {gev_params[2]:.2f} | LogLik: {ll_gev:.2f}")

if -gev_params[0] < 0:
    print("[SUCCESS] Python optimizer successfully found the bounded Weibull tail, matching R!")

# ---------------------------------------------------------
# PART 3: FORCED FRÉCHET FITTING (Bounded Optimization)
# ---------------------------------------------------------
print("\n--- PART 3: THE FRÉCHET CONSTRAINT ---")

def neg_log_lik_frechet(params, data):
    c, loc, scale = params
    if scale <= 0: return np.inf
    ll = stats.genextreme.logpdf(data, c, loc=loc, scale=scale)
    return -np.sum(ll)

# Enforce positive shape (Scipy c < 0 requires bounding between -Inf and slightly below 0)
bnds = ((-np.inf, -1e-6), (None, None), (1e-6, np.inf))
initial_guess = [-0.1, np.mean(maxima_data), np.std(maxima_data)]

frechet_res = minimize(neg_log_lik_frechet, initial_guess, args=(maxima_data,),
    method='L-BFGS-B', bounds=bnds)

frechet_shape = -frechet_res.x[0]
print(f"Forced Fréchet Shape (xi): {frechet_shape:.6f}")
if np.isclose(frechet_shape, 1e-6, atol=1e-5):
    print("WARNING: Optimizer hit the boundary. The data mathematically rejects a Fréchet tail.")

# ---------------------------------------------------------
# PART 4: GOODNESS-OF-FIT TESTING (Kolmogorov-Smirnov)
# ---------------------------------------------------------
print("\n--- PART 4: KS TESTING ---")
ks_gum, p_gum = stats.kstest(maxima_data, 'gumbel_r', args=gum_params)
ks_gev, p_gev = stats.kstest(maxima_data, 'genextreme', args=gev_params)

print(f"Gumbel KS Test -> Stat: {ks_gum:.4f}, p-value: {p_gum:.4f}")
print(f"GEV KS Test -> Stat: {ks_gev:.4f}, p-value: {p_gev:.4f}")

# ---------------------------------------------------------
# PART 5: DESIGN FLOODS (Return Levels)
# ---------------------------------------------------------
print("\n--- PART 5: DESIGN FLOODS ---")
T_values = np.array([30, 50, 100, 200])
probabilities = 1 - (1 / T_values)

Q_gev = stats.genextreme.ppf(probabilities, *gev_params)
Q_gumbel = stats.gumbel_r.ppf(probabilities, *gum_params)

results_df = pd.DataFrame({
    'Return Period (T)': T_values,
    'Non-Exceedance Prob': probabilities.round(4),
    'Q_GEV (m³/s)': Q_gev.round(2),
    'Q_Gumbel (m³/s)': Q_gumbel.round(2)
})
print(results_df.to_string(index=False))

# ---------------------------------------------------------
# PART 6: DIAGNOSTIC DASHBOARD VISUALIZATION
# ---------------------------------------------------------
def plot_evt_diagnostics(data, dist_name, params, title_prefix):
    sorted_data = np.sort(data)
    n = len(data)
    empirical_cdf = np.arange(1, n + 1) / (n + 1)
    dist = getattr(stats, dist_name)
    #
    theoretical_cdf = dist.cdf(sorted_data, *params)
    theoretical_quantiles = dist.ppf(empirical_cdf, *params)
    return_periods = 1 / (1 - empirical_cdf)
    #
    fig, axs = plt.subplots(2, 2, figsize=(12, 10))
    fig.suptitle(f'Diagnostic Plots: {title_prefix}', fontsize=16, fontweight='bold')
    #
    # P-P Plot
    axs[0, 0].scatter(empirical_cdf, theoretical_cdf, facecolors='none', edgecolors='b')
    axs[0, 0].plot([0, 1], [0, 1], 'k--')
    axs[0, 0].set_title('Probability Plot (P-P)')
    axs[0, 0].grid(True, alpha=0.3)
    #
    # Q-Q Plot
    axs[0, 1].scatter(theoretical_quantiles, sorted_data, facecolors='none', edgecolors='b')
    min_val = min(min(theoretical_quantiles), min(sorted_data))
    max_val = max(max(theoretical_quantiles), max(sorted_data))
    axs[0, 1].plot([min_val, max_val], [min_val, max_val], 'k--')
    axs[0, 1].set_title('Quantile Plot (Q-Q)')
    axs[0, 1].grid(True, alpha=0.3)
    #
    # Return Level Plot
    axs[1, 0].scatter(return_periods, sorted_data, facecolors='none', edgecolors='b')
    smooth_T = np.logspace(0.1, 3, 100)
    smooth_levels = dist.ppf(1 - (1/smooth_T), *params)
    axs[1, 0].plot(smooth_T, smooth_levels, 'r-')
    axs[1, 0].set_xscale('log')
    axs[1, 0].set_title('Return Level Plot')
    axs[1, 0].set_xlabel('Return Period (Years)')
    axs[1, 0].grid(True, alpha=0.3, which="both", ls="--")
    #
    # Density Plot
    axs[1, 1].hist(data, bins=15, density=True, alpha=0.5, color='gray', edgecolor='black')
    x_smooth = np.linspace(min(data)*0.8, max(data)*1.2, 200)
    axs[1, 1].plot(x_smooth, dist.pdf(x_smooth, *params), 'r-', lw=2)
    axs[1, 1].set_title('Density Plot')
    #
    plt.tight_layout()
    plt.subplots_adjust(top=0.9)
    plt.show()

# Trigger the plots
plot_evt_diagnostics(maxima_data, 'gumbel_r', gum_params, 'Gumbel Fit')
plot_evt_diagnostics(maxima_data, 'genextreme', gev_params, 'GEV Fit')
 

Last modified on March 16, 2026