Show confidence limits and prediction limits in a scatter plot

Question

I have two arrays of data, heights and weights.

    import numpy as np
    import matplotlib.pyplot as plt

    heights = np.array([50,52,53,54,58,60,62,64,66,67,68,70,72,74,76,55,50,45,65])
    weights = np.array([25,50,55,75,80,85,50,65,85,55,45,45,50,75,95,65,50,40,45])

    plt.plot(heights, weights, 'bo')
    plt.show()

I would like to produce a plot similar to this one:

http://www.sas.com/en_us/software/analytics/stat.html#m=screenshot6

Any ideas are appreciated.

pylang · Accepted Answer

Here is what I put together. I tried to closely emulate your screenshot.

Given

Some detailed helper functions for plotting confidence intervals.

    import numpy as np
    import scipy.stats as stats
    import matplotlib.pyplot as plt

    # In a Jupyter notebook, also run: %matplotlib inline


    def plot_ci_manual(t, s_err, n, x, x2, y2, ax=None):
        r"""Return an axes of confidence bands using a simple approach.

        Notes
        -----
        .. math:: \left| \: \hat{\mu}_{y|x0} - \mu_{y|x0} \: \right| \; \leq \; T_{n-2}^{.975} \; \hat{\sigma} \; \sqrt{\frac{1}{n}+\frac{(x_0-\bar{x})^2}{\sum_{i=1}^n{(x_i-\bar{x})^2}}}
        .. math:: \hat{\sigma} = \sqrt{\sum_{i=1}^n{\frac{(y_i-\hat{y})^2}{n-2}}}

        References
        ----------
        .. [1] M. Duarte. "Curve fitting," Jupyter Notebook.
           http://nbviewer.ipython.org/github/demotu/BMC/blob/master/notebooks/CurveFitting.ipynb

        """
        if ax is None:
            ax = plt.gca()

        # Half-width of the confidence band at each point of x2
        ci = t * s_err * np.sqrt(1/n + (x2 - np.mean(x))**2 / np.sum((x - np.mean(x))**2))
        # edgecolor="none" hides the outline (an empty string is rejected by recent matplotlib)
        ax.fill_between(x2, y2 + ci, y2 - ci, color="#b9cfe7", edgecolor="none")

        return ax


    def plot_ci_bootstrap(xs, ys, resid, nboot=500, ax=None):
        """Return an axes of confidence bands using a bootstrap approach.

        Notes
        -----
        The bootstrap approach iteratively resamples residuals.
        It plots `nboot` number of straight lines and outlines the shape of a band.
        The density of overlapping lines indicates improved confidence.

        Returns
        -------
        ax : axes
            - Cluster of lines
            - Upper and Lower bounds (high and low) (optional)  Note: sensitive to outliers

        References
        ----------
        .. [1] J. Stults. "Visualizing Confidence Intervals", Various Consequences.
           http://www.variousconsequences.com/2010/02/visualizing-confidence-intervals.html

        """
        if ax is None:
            ax = plt.gca()

        for _ in range(nboot):
            # Resample residuals with replacement; randint's upper bound is exclusive,
            # so len(resid) is needed for the last residual to be eligible
            resamp_resid = resid[np.random.randint(0, len(resid), len(resid))]
            # Make coeffs for polys (np.polyfit replaces the removed sp.polyfit alias)
            pc = np.polyfit(xs, ys + resamp_resid, 1)
            # Plot bootstrap cluster
            ax.plot(xs, np.polyval(pc, xs), "b-", linewidth=2, alpha=3.0 / float(nboot))

        return ax

Code

    # Computations ----------------------------------------------------------------
    # Raw Data
    heights = np.array([50,52,53,54,58,60,62,64,66,67,68,70,72,74,76,55,50,45,65])
    weights = np.array([25,50,55,75,80,85,50,65,85,55,45,45,50,75,95,65,50,40,45])

    x = heights
    y = weights

    # Modeling with Numpy
    def equation(a, b):
        """Return a 1D polynomial."""
        return np.polyval(a, b)

    p, cov = np.polyfit(x, y, 1, cov=True)   # parameters and covariance from the fit of a 1-D polynomial
    y_model = equation(p, x)                 # model using the fit parameters; NOTE: parameters here are coefficients

    # Statistics
    n = weights.size                         # number of observations
    m = p.size                               # number of parameters
    dof = n - m                              # degrees of freedom
    t = stats.t.ppf(0.975, n - m)            # used for CI and PI bands

    # Estimates of Error in Data/Model
    resid = y - y_model
    chi2 = np.sum((resid / y_model)**2)      # chi-squared; estimates error in data
    chi2_red = chi2 / dof                    # reduced chi-squared; measures goodness of fit
    s_err = np.sqrt(np.sum(resid**2) / dof)  # standard deviation of the error

    # Plotting --------------------------------------------------------------------
    fig, ax = plt.subplots(figsize=(8, 6))

    # Data
    ax.plot(
        x, y, "o", color="#b9cfe7", markersize=8,
        markeredgewidth=1, markeredgecolor="b", markerfacecolor="None"
    )

    # Fit
    ax.plot(x, y_model, "-", color="0.1", linewidth=1.5, alpha=0.5, label="Fit")

    x2 = np.linspace(np.min(x), np.max(x), 100)
    y2 = equation(p, x2)

    # Confidence Interval (select one)
    plot_ci_manual(t, s_err, n, x, x2, y2, ax=ax)
    #plot_ci_bootstrap(x, y, resid, ax=ax)

    # Prediction Interval
    pi = t * s_err * np.sqrt(1 + 1/n + (x2 - np.mean(x))**2 / np.sum((x - np.mean(x))**2))
    ax.fill_between(x2, y2 + pi, y2 - pi, color="None", linestyle="--")
    ax.plot(x2, y2 - pi, "--", color="0.5", label="95% Prediction Limits")
    ax.plot(x2, y2 + pi, "--", color="0.5")

    # Figure Modifications --------------------------------------------------------
    # Borders
    ax.spines["top"].set_color("0.5")
    ax.spines["bottom"].set_color("0.5")
    ax.spines["left"].set_color("0.5")
    ax.spines["right"].set_color("0.5")
    ax.get_xaxis().set_tick_params(direction="out")
    ax.get_yaxis().set_tick_params(direction="out")
    ax.xaxis.tick_bottom()
    ax.yaxis.tick_left()

    # Labels
    plt.title("Fit Plot for Weight", fontsize=14, fontweight="bold")
    plt.xlabel("Height")
    plt.ylabel("Weight")
    plt.xlim(np.min(x) - 1, np.max(x) + 1)

    # Custom legend
    handles, labels = ax.get_legend_handles_labels()
    display = (0, 1)
    anyArtist = plt.Line2D((0, 1), (0, 0), color="#b9cfe7")    # create custom artists
    legend = plt.legend(
        [handle for i, handle in enumerate(handles) if i in display] + [anyArtist],
        [label for i, label in enumerate(labels) if i in display] + ["95% Confidence Limits"],
        loc=9, bbox_to_anchor=(0, -0.21, 1., 0.102), ncol=3, mode="expand"
    )
    legend.get_frame().set_edgecolor("0.5")

    # Save Figure
    plt.tight_layout()
    plt.savefig("filename.png", bbox_extra_artists=(legend,), bbox_inches="tight")

    plt.show()
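To make the difference between the two bands concrete, here is a minimal sketch that evaluates both half-widths at a single height, using the same formulas as the code above. The helper name `band_half_widths` and the query point `x0 = 60` are illustrative assumptions, not part of the original answer.

```python
import numpy as np
import scipy.stats as stats

heights = np.array([50,52,53,54,58,60,62,64,66,67,68,70,72,74,76,55,50,45,65], dtype=float)
weights = np.array([25,50,55,75,80,85,50,65,85,55,45,45,50,75,95,65,50,40,45], dtype=float)


def band_half_widths(x, y, x0, level=0.95):
    """Return (ci, pi) half-widths of a straight-line fit at x0 (hypothetical helper)."""
    n = x.size
    p = np.polyfit(x, y, 1)                      # slope and intercept
    resid = y - np.polyval(p, x)
    dof = n - p.size                             # degrees of freedom
    s_err = np.sqrt(np.sum(resid**2) / dof)      # standard deviation of the error
    t = stats.t.ppf(1 - (1 - level) / 2, dof)    # two-sided t-statistic
    # Shared term under the square root in both formulas
    h = 1 / n + (x0 - np.mean(x))**2 / np.sum((x - np.mean(x))**2)
    ci = t * s_err * np.sqrt(h)                  # uncertainty of the mean response
    pi = t * s_err * np.sqrt(1 + h)              # uncertainty of a single new observation
    return ci, pi


ci, pi = band_half_widths(heights, weights, x0=60.0)
print(f"CI half-width: {ci:.2f}, PI half-width: {pi:.2f}")
```

The only difference is the extra `1 +` under the square root, which is why the prediction limits are always wider than the confidence limits and why both widen as `x0` moves away from the mean height.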

Output

Using plot_ci_manual():

Using plot_ci_bootstrap():

Hope this helps. Cheers.

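Details

I believe the legend does not show up in matplotlib's popup window because it sits outside the figure. It works fine in Jupyter using %matplotlib inline.
The primary confidence interval code (plot_ci_manual()) is adapted from another source and produces a plot similar to the OP's. You can select a more advanced technique called residual bootstrapping by uncommenting the second option, plot_ci_bootstrap().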
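The bootstrap helper only draws a cluster of lines; its docstring mentions upper and lower bounds as optional. One possible way to get explicit bounds is to take percentiles across the refitted lines, as sketched below. This is an assumption about how such bounds could be computed, not the answer's own method: the function name `bootstrap_band` is hypothetical, and the resampled residuals are added to the fitted values (the textbook form of the residual bootstrap) rather than to the raw `ys` as in the helper above.

```python
import numpy as np


def bootstrap_band(x, y, x_eval, nboot=500, level=0.95, seed=0):
    """Percentile band from a residual bootstrap of a straight-line fit (illustrative)."""
    rng = np.random.default_rng(seed)
    p = np.polyfit(x, y, 1)
    y_fit = np.polyval(p, x)
    resid = y - y_fit
    curves = np.empty((nboot, x_eval.size))
    for i in range(nboot):
        # Resample residuals with replacement and refit on the synthetic data
        resampled = rng.choice(resid, size=resid.size, replace=True)
        pc = np.polyfit(x, y_fit + resampled, 1)
        curves[i] = np.polyval(pc, x_eval)
    alpha = 1 - level
    lower = np.percentile(curves, 100 * alpha / 2, axis=0)
    upper = np.percentile(curves, 100 * (1 - alpha / 2), axis=0)
    return lower, upper


heights = np.array([50,52,53,54,58,60,62,64,66,67,68,70,72,74,76,55,50,45,65], dtype=float)
weights = np.array([25,50,55,75,80,85,50,65,85,55,45,45,50,75,95,65,50,40,45], dtype=float)
x2 = np.linspace(heights.min(), heights.max(), 100)
lower, upper = bootstrap_band(heights, weights, x2)
```

The resulting arrays could be drawn with `ax.fill_between(x2, lower, upper)` in place of the line cluster.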

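Updates

This post has been updated with revised code compatible with Python 3.
stats.t.ppf() accepts the lower tail probability. According to the following resources, t = sp.stats.t.ppf(0.95, n - m) was corrected to t = sp.stats.t.ppf(0.975, n - m) to reflect a two-sided 95% t-statistic (or a one-sided 97.5% t-statistic).
- original notebook and equation
- statistics reference (thanks @Bonlenfum and @tryptofan)
- verified t-value given dof=17
y2 was updated to respond more flexibly to a given model (@regeneration).
An abstracted equation function was added to wrap the model function. Non-linear regressions are possible, although not demonstrated. Amend the appropriate variables as needed (thanks @PJW).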
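The t-statistic correction can be checked directly. The short sketch below compares the two calls for dof = 17 (19 observations minus 2 fitted parameters); the variable names are illustrative.

```python
import scipy.stats as stats

dof = 17  # n - m for this data set

t_one_sided = stats.t.ppf(0.95, dof)    # leaves 5% in the upper tail only
t_two_sided = stats.t.ppf(0.975, dof)   # leaves 2.5% in each tail

# Probability mass enclosed by +/- t_two_sided
coverage = stats.t.cdf(t_two_sided, dof) - stats.t.cdf(-t_two_sided, dof)

print(f"ppf(0.95,  {dof}) = {t_one_sided:.3f}")
print(f"ppf(0.975, {dof}) = {t_two_sided:.3f}")
print(f"two-sided coverage = {coverage:.3f}")
```

The second value should agree with the tabulated two-sided 95% critical value of about 2.110 for 17 degrees of freedom, whereas the first (about 1.740) would produce bands that are too narrow.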

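See Also

This post on plotting bands with the statsmodels library.
This tutorial on plotting bands and computing confidence intervals with the uncertainties library (install with caution in a separate environment).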

user1319128 · Answer

You can use the seaborn plotting library to create the plot you want.

    In [18]: import seaborn as sns

    In [19]: heights = np.array([50,52,53,54,58,60,62,64,66,67,68,70,72,74,76,55,50,45,65])
        ...: weights = np.array([25,50,55,75,80,85,50,65,85,55,45,45,50,75,95,65,50,40,45])
        ...:

    In [20]: sns.regplot(x=heights, y=weights, color='blue')
    Out[20]: <matplotlib.axes.AxesSubplot at 0x13644f60>


regeneration · Answer

An update to pylang's excellent answer, in response to PJW: if you are fitting a polynomial of order higher than one, the calculation of y2 should be changed from

y2 = np.linspace(np.min(y_model), np.max(y_model), 100)

to

y2 = np.polyval(p,x2)

The original code only works for a first-order polynomial (i.e., a simple line).
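A small check shows why the two versions diverge for higher orders. The sketch below uses synthetic quadratic data (an assumption chosen purely for illustration) and compares both ways of computing y2.

```python
import numpy as np

# Synthetic data lying exactly on a parabola
x = np.linspace(0.0, 10.0, 21)
y = (x - 5.0)**2

p = np.polyfit(x, y, 2)
y_model = np.polyval(p, x)
x2 = np.linspace(np.min(x), np.max(x), 100)

# Original version: a straight ramp from the smallest to the largest fitted value
y2_linspace = np.linspace(np.min(y_model), np.max(y_model), 100)
# Updated version: the fitted polynomial evaluated on the new grid
y2_polyval = np.polyval(p, x2)

max_gap = np.max(np.abs(y2_linspace - y2_polyval))
print(f"largest gap between the two versions: {max_gap:.2f}")
```

The linspace version ignores x2 entirely, so any bands drawn around it would not follow the curve.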

Also, in response to tryptofan's comment: yes, to get the 95% two-sided t-statistic, the code should be changed from

t = stats.t.ppf(0.95, n - m)

to

t = stats.t.ppf(1-0.025, n - m)

mf13 · Answer

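Thanks to pylang for the answer. I had a problem with the calculation of y2: when the regression line was decreasing, the confidence interval did not decrease with it. With the current calculation of y2, the predicted values always run from the minimum of y_model to the maximum. I therefore changed the calculation of y2 to: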

y2 = np.linspace(y_model[np.argmin(x)], y_model[np.argmax(x)], 100)
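The effect can be reproduced with a small decreasing data set. The sketch below uses synthetic values (an assumption for illustration) and writes the index lookup with `np.argmin` and `np.argmax`, since NumPy arrays have no `.index` method. For a straight line, the endpoint-based version should coincide with evaluating the fit on x2 via `np.polyval`.

```python
import numpy as np

# Synthetic data with a negative slope
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([9.8, 8.1, 6.2, 3.9, 2.0])

p = np.polyfit(x, y, 1)
y_model = np.polyval(p, x)
x2 = np.linspace(np.min(x), np.max(x), 100)

# Min-to-max ramp: always increasing, regardless of the slope
y2_minmax = np.linspace(np.min(y_model), np.max(y_model), 100)
# Ramp between the fitted values at the smallest and largest x
y2_endpoints = np.linspace(y_model[np.argmin(x)], y_model[np.argmax(x)], 100)
# Fitted line evaluated directly on the new grid
y2_polyval = np.polyval(p, x2)

print("slope:", p[0])
print("min-to-max version increasing:", y2_minmax[0] < y2_minmax[-1])
print("endpoint version matches polyval:", np.allclose(y2_endpoints, y2_polyval))
```

Evaluating the model on x2 has the advantage of also working for higher-order fits, where no linspace between two endpoints can follow the curve.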