These data were collected at Copenhagen Reinsurance and comprise 2167 fire losses over the period 1980 to 1990, They have been adjusted for inflation to reflect 1985 values and are expressed in millions of Danish Kron. Note that it is possible to work with the same data as above but the total claim has been divided into a building loss, a loss of contents and a loss of profits.

> base1=read.table( + "http://freakonometrics.free.fr/danish-univariate.txt", + header=TRUE) > base2=read.table( + "http://freakonometrics.free.fr/danish-multivariate.txt", + header=TRUE)

Consider here the first dataset (we deal – so far – with univariate extremes),

> X=base1$Loss.in.DKM > D=as.Date(as.character(base1$Date),"%m/%d/%Y") > plot(D,X,type="h")

The graph is the following,

A natural idea is then to plot

i.e.

> Xs=sort(X) > logXs=rev(log(Xs)) > n=length(X) > plot(log(Xs),log((n:1)/(n+1)))

Points are on a straight line here. The slope can be obtained using a linear regression,

> B=data.frame(X=log(Xs),Y=log((n:1)/(n+1))) > reg=lm(Y~X,data=B) > summary(reg) Call: lm(formula = Y ~ X, data = B) Residuals: Min 1Q Median 3Q Max -0.59999 -0.00777 0.00878 0.02461 0.20309 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.089442 0.001572 56.88 <2e-16 *** X -1.382181 0.001477 -935.55 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.04928 on 2165 degrees of freedom Multiple R-squared: 0.9975, Adjusted R-squared: 0.9975 F-statistic: 8.753e+05 on 1 and 2165 DF, p-value: < 2.2e-16 > reg=lm(Y~X,data=B[(n-500):n,]) > summary(reg) Call: lm(formula = Y ~ X, data = B[(n - 500):n, ]) Residuals: Min 1Q Median 3Q Max -0.48502 -0.02148 -0.00900 0.01626 0.35798 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.186188 0.010033 18.56 <2e-16 *** X -1.432767 0.005105 -280.68 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.07751 on 499 degrees of freedom Multiple R-squared: 0.9937, Adjusted R-squared: 0.9937 F-statistic: 7.878e+04 on 1 and 499 DF, p-value: < 2.2e-16 > reg=lm(Y~X,data=B[(n-100):n,]) > summary(reg) Call: lm(formula = Y ~ X, data = B[(n - 100):n, ]) Residuals: Min 1Q Median 3Q Max -0.33396 -0.03743 0.02279 0.04754 0.62946 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.67377 0.06777 9.942 <2e-16 *** X -1.58536 0.02240 -70.772 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.1299 on 99 degrees of freedom Multiple R-squared: 0.9806, Adjusted R-squared: 0.9804 F-statistic: 5009 on 1 and 99 DF, p-value: < 2.2e-16

The slope here is somehow related to the tail index of the distribution. Consider some heavy tailed distribution, i.e. , so that , where is some slowly varying function. Equivalently, the exists a slowly varying function such that . Then

i.e. since a natural estimator for is the order statistic , the slope of the straight line is the opposite of tail index . The estimator of the slope is (considering only the largest observations)

Hill‘s estimator is based on the assumption that the denominator above is almost 1 (which means that , as ), i.e.

Note that, if , but not two fast, i.e. as , then (one can even get with stronger convergence assumptions). Further

Based on that (asymptotic) distribution, it is possible to get a (asymptotic) confidence interval for

> xi=1/(1:n)*cumsum(logXs)-logXs > xise=1.96/sqrt(1:n)*xi > plot(1:n,xi,type="l",ylim=range(c(xi+xise,xi-xise)), + xlab="",ylab="",) > polygon(c(1:n,n:1),c(xi+xise,rev(xi-xise)), + border=NA,col="lightblue") > lines(1:n,xi+xise,col="red",lwd=1.5) > lines(1:n,xi-xise,col="red",lwd=1.5) > lines(1:n,xi,lwd=1.5) > abline(h=0,col="grey")

It is also possible to work with , then . And similarly as (and again with additional assumptions on the rate of convergence), and

(obtained using the delta-method). Again, we can use that result to derive (asymptotic) confidence intervals

> alpha=1/xi > alphase=1.96/sqrt(1:n)/xi > YL=c(0,3) > plot(1:n,alpha,type="l",ylim=YL,xlab="",ylab="",) > polygon(c(1:n,n:1),c(alpha+alphase,rev(alpha-alphase)), + border=NA,col="lightblue") > lines(1:n,alpha+alphase,col="red",lwd=1.5) > lines(1:n,alpha-alphase,col="red",lwd=1.5) > lines(1:n,alpha,lwd=1.5) > abline(h=0,col="grey")

The Deckers-Einmahl-de Haan estimator is

where for

Then (given again conditions on the speed of convergence i.e. , with as ),

Finally, Pickands‘ estimator

it is possible to prove that, as ,

Here the code is

> Xs=rev(sort(X)) > xi=1/log(2)*log( (Xs[seq(1,length=trunc(n/4),by=1)]- + Xs[seq(2,length=trunc(n/4),by=2)])/ + (Xs[seq(2,length=trunc(n/4),by=2)]-Xs[seq(4, + length=trunc(n/4),by=4)]) ) > xise=1.96/sqrt(seq(1,length=trunc(n/4),by=1))* +sqrt( xi^2*(2^(xi+1)+1)/((2*(2^xi-1)*log(2))^2)) > plot(seq(1,length=trunc(n/4),by=1),xi,type="l", + ylim=c(0,3),xlab="",ylab="",) > polygon(c(seq(1,length=trunc(n/4),by=1),rev(seq(1, + length=trunc(n/4),by=1))),c(xi+xise,rev(xi-xise)), + border=NA,col="lightblue") > lines(seq(1,length=trunc(n/4),by=1), + xi+xise,col="red",lwd=1.5) > lines(seq(1,length=trunc(n/4),by=1), + xi-xise,col="red",lwd=1.5) > lines(seq(1,length=trunc(n/4),by=1),xi,lwd=1.5) > abline(h=0,col="grey")

It is also possible to use maximum likelihood techniques to fit a GPD distribution over a high threshold.

> library(evd) > library(evir) > gpd(X,5) $n [1] 2167 $threshold [1] 5 $p.less.thresh [1] 0.8827873 $n.exceed [1] 254 $method [1] "ml" $par.ests xi beta 0.6320499 3.8074817 $par.ses xi beta 0.1117143 0.4637270 $varcov [,1] [,2] [1,] 0.01248007 -0.03203283 [2,] -0.03203283 0.21504269 $information [1] "observed" $converged [1] 0 $nllh.final [1] 754.1115 attr(,"class") [1] "gpd"

or equivalently (or almost)

> gpd.fit(X,5) $threshold [1] 5 $nexc [1] 254 $conv [1] 0 $nllh [1] 754.1115 $mle [1] 3.8078632 0.6315749 $rate [1] 0.1172127 $se [1] 0.4636270 0.1116136

The interest of the latest function is that it is possible to visualize the profile likelihood of the tail index,

> gpd.profxi(gpd.fit(X,5),xlow=0,xup=3)

or

> gpd.profxi(gpd.fit(X,20),xlow=0,xup=3)

Hence, it is possible to plot the maximum likelihood estimator of the tail index, as a function of the threshold (including a confidence interval),

> GPDE=Vectorize(function(u){gpd(X,u)$par.ests[1]}) > GPDS=Vectorize(function(u){ + gpd(X,u)$par.ses[1]}) > u=c(seq(2,10,by=.5),seq(11,25)) > XI=GPDE(u) > XIS=GPDS(u) > plot(u,XI,ylim=c(0,2)) > segments(u,XI-1.96*XIS,u,XI+ + 1.96*XIS,lwd=2,col="red")

Finally, it is possible to use block-maxima techniques.

> gev.fit(X) $conv [1] 0 $nllh [1] 3392.418 $mle [1] 1.4833484 0.5930190 0.9168128 $se [1] 0.01507776 0.01866719 0.03035380

The estimator of the tail index is here the last coefficient, on the right.

Since it is rather difficult to install a package in class rooms, here is the source of rcodes used here (to fit a GPD for exceedances)

> source("http://freakonometrics.blog.free.fr/public/code/gpd.R")

Next time, we will discuss how to use those estimators.

Hi Arthur, this is an excellent post. What happened to all the figures? I have tried with different browsers, so that is not it.

pb with my blog transfert… I will try to fix it asap

Np, I simply think it is a shame – because of the work, and because it is of excellent quality.

Thanks . Quite informative.

Hi there, just became aware of your blog through Google, and found that it’s truly informative. I will appreciate if you continue this in future. Lots of people will be benefited from your writing. Cheers!

It’s a pity original sum insured’s are missing in original data, we/you could have worked on damage ratio instead of loss amount ! Maybe one day we’ll get there?