极值分析介绍（3）——置信区间估计

发表于 2019-01-29 分类于他山攻错阅读次数： Disqus：

在前面的工作中，我们采用不同的极值分布对时间序列数据记进行了极值分析，并通过假设检验的方法选择了最优的分布。在这一部分，我们将采用bootstrap方法确定选择的最优分布的置信区间。

前序工作铺垫

import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from matplotlib.legend_handler import HandlerLine2D
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 15, 9

import seaborn as sns

import lmoments3
from lmoments3 import distr

from plotnine.ggplot import *
from plotnine.qplot import *
from plotnine.geoms import *
from plotnine.coords import *
from plotnine.labels import *
from plotnine.facets import *
from plotnine.scales import *
from plotnine.themes.theme import *


%matplotlib inline

data = pd.read_csv('CHM00054511.csv')
df = data.groupby('YYYY', as_index=False).TMAX.max()
df = df[df.TMAX > 36]
gevfit = distr.gev.lmom_fit(df.TMAX)
gevfit

OrderedDict([('c', 0.019491703778070434),
             ('loc', 37.54717142024424),
             ('scale', 1.1375234247019073)])

Boostrap方法估计置信区间

Bootstrap方法介绍

在统计学中，自助法（Bootstrap Method，Bootstrapping，或自助抽样法）是一种从给定训练集中有放回的均匀抽样，也就是说，每当选中一个样本，它等可能地被再次选中并被再次添加到训练集中。自助法由Bradley Efron于1979年在《Annals of Statistics》上发表。当样本来自总体，能以正态分布来描述，其抽样分布(Sampling Distribution)为正态分布(The Normal Distribution)；但当样本来自的总体无法以正态分布来描述，则以渐进分析法、自助法等来分析。采用随机可置换抽样(random sampling with replacement)。对于小数据集，自助法效果很好

估计步骤

采用Bootstrap估计极值置信区间的总体思路为：

利用bootstrap方法对原始数据重复采样N多次；
对每次重采样的样本做极值分析
对极值分析的结果按照bootstrap的方法进行百分位、均值等统计分析。

Bootstrap估计置信区间的方法也有多种，查阅了一些资料包括：

基于百分位（Percentile）的方法（最基本，很直观，有偏差，不建议使用）
基于扩展百分位（Expanded Percentile）的方法(用总体标准差替代样本标准差）
基于经验bootstrap（empirical/pivotal bootstrap）的方法（根据boostrap的样本均值与原始样本均值的偏差构造枢轴变量）
基于bootstrap-t的方法（在经验方法的基础上，进一步转化为t分布）
基于Bias Corrected and Accelerated(BCa）的方法（state-of-the-art，通过acceleration parameter和bias-correction factor估计置信区间的分位点）

与别的算法相比，Bootstrap估计置信区间虽然并不复杂，但是自己写也怕搞错了。好在已经有现成的Python包，找了两个：scikits-bootstrap和 bootstrapped。然后惊喜地发现：scikits-bootstrap实现了最基本的percentile、可能是最好的BCa以及我不知道的ABC方法；bootstrapped实现了最基本的percentile和简单易用的pivotal方法。目前来看，两者各有所长，可以根据需要使用。

import bootstrap
from collections import OrderedDict

def extreme_bsci(df, gevfit):
    # Calculate confidence intervals using parametric bootstrap and the
    # percentil interval method
    # This is used to obtain confidence intervals for the estimators and
    # the return values for several return values.    
    # More info about bootstrapping can be found on:
    #     - https://github.com/cgevans/scikits-bootstrap
    #     - Efron: "An Introduction to the Bootstrap", Chapman & Hall (1993)
    #     - https://en.wikipedia.org/wiki/Bootstrapping_%28statistics%29

    # parametric bootstrap for return levels and parameters   

    # The function to bootstrap
    def func(data):
        fitted_gev = distr.gev(**gevfit)
        sample = fitted_gev.rvs(len(df.index))
        samgevfit = distr.gev.lmom_fit(sample)

        T = np.arange(0.1, 999.1, 0.1) + 1
        fitted_sam_gev = distr.gev(**samgevfit)
        sT = fitted_sam_gev.ppf(1.0 - 1./T)

        res = list(samgevfit.values())
        res.extend(sT.tolist())
        return tuple(res)

    # the calculations itself

    out = bootstrap.ci(df, statfunction=func, n_samples=10000)

    ci_Td = out[0, 3:]
    ci_Tu = out[1, 3:]
    params_ci = OrderedDict()
    params_ci['c']    = (out[0,0], out[1,0])
    params_ci['loc'] = (out[0,1], out[1,1])
    params_ci['scale']    = (out[0,2], out[1,3])

    return{'ci_Td':ci_Td, 'ci_Tu':ci_Tu, 'params_ci':params_ci}

#return years （1.1 to 1000)
T  = np.arange(0.1, 999.1, 0.1) + 1
fitted_gev = distr.gev(**gevfit)
sT = fitted_gev.ppf(1.0-1./T)


# get confidence intervals
bootout = extreme_bsci(df.TMAX, gevfit)
ci_Td = bootout["ci_Td"]
ci_Tu = bootout["ci_Tu"]
params_ci = bootout["params_ci"]

C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:32: InstabilityWarning: Some values used top 10 low/high samples; results may be unstable.

# prepare index for obs
N = np.r_[1:len(df.index)+1]*1.0 #must *1.0 to convert int to float
Nmax = max(N)
y=sorted(df.TMAX, reverse=True)
empirical = pd.DataFrame({
    'T':Nmax/N,
    'empirical':y
})
empirical = pd.melt(empirical,id_vars='T')

# begin plotting

fig, ax = plt.subplots(figsize=(15,9))

plt.setp(ax.lines, linewidth = 2, color = 'magenta')

ax.set_title("GEV Distribution")
ax.set_xlabel("Return Period (Year)")
ax.set_ylabel("Precipitation")
ax.semilogx(T, sT)
ax.scatter(Nmax/N, sorted(df.TMAX)[::-1], color = 'orangered')

ax.semilogx(T, ci_Td, '--')
ax.semilogx(T, ci_Tu, '--')
ax.fill_between(T, ci_Td, ci_Tu, color = '0.75', alpha = 0.5)

output_9_1

plot_bsci = pd.DataFrame({
    'T': T,
    'sT': sT,
    'ci_Td': ci_Td,
    'ci_Tu': ci_Tu
})

p = (ggplot(plot_bsci) +
 aes(x='T') +
 geom_line(aes(y='sT'), color='blue') +
 geom_ribbon(aes(ymin='ci_Td', ymax='ci_Tu'), color=None, alpha=0.3) +
 geom_line(aes(y='ci_Td'), color='red', linetype='dashed') +
 geom_line(aes(y='ci_Tu'), color='green', linetype='dashed') +
 geom_point(aes(x='T', y='value'), data=empirical, color='orange') +
 scale_x_log10() +
 theme(figure_size=(15, 10))
)
p

output_10_0

小结

至此为止，我们已经做了3部分工作：

用ghcnpy下载NOAA的公开数据；
基于拟合优度度量（KS检验）选择最优的极值分布；
基于所选分布，利用Bootstrap计算置信区间。

在接下来的部分，我们将进一步探讨如何构建极端降水量的IDF和DDF曲线。