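How to normalize a seaborn distplot?

Question

For reproducibility reasons, I am sharing the dataset [here] [1].

Here is what I am doing: I read the current row of the second column and compare it with the value in the previous row. If it is greater, I keep comparing. If the current value is smaller than the previous row's value, I divide the current value (the smaller one) by the previous value (the larger one). Hence the following code: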
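A minimal sketch of the step described above, assuming the second column is held in a NumPy array. The sample values and the names `previous`, `current` and `dropped` are made up for illustration, and the question's own snippet may differ:

```python
import numpy as np

# Hypothetical sample values standing in for the second column of the dataset.
col_window = np.array([10.0, 12.0, 6.0, 7.0, 9.0, 4.5, 5.0])

previous = col_window[:-1]   # value in the previous row
current = col_window[1:]     # value in the current row

# Keep only the positions where the value dropped relative to the previous row.
dropped = current < previous

# Smaller (current) value divided by the larger (previous) value.
quotient = current[dropped] / previous[dropped]
print(quotient)
```

Every quotient computed this way lies strictly between 0 and 1 for positive data, because the numerator is always the smaller of the two values.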

This gives the following plot.

sns.distplot(quotient, hist=False, label=protname)

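As you can see from the plot, Data-V has one quotient value when quotient_times is less than 3 and another when quotient_times is greater than 3.

I would like to normalize the values so that the y-axis values of the second plot lie between 0 and 1. How can I do that in Python?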

LoneWanderer · Accepted Answer

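Foreword

From what I understand, seaborn's distplot performs a KDE estimation by default. If you want a normalized distplot graph, it may be because you assume that the graph's Y values should lie within [0; 1]. If so, a Stack Overflow question has already raised the issue of KDE estimators showing values above 1.

Quoting one answer:

A continuous pdf (pdf = probability density function) is never required to take values below 1. For a continuous random variable, the density function p(x) is not a probability. You can refer to material on continuous random variables and their distributions.

Quoting the first comment by importanceofbeingernest:

The integral over a pdf is 1. There is no contradiction to be seen here.

To my knowledge, it is the CDF (cumulative distribution function) whose values are supposed to lie in [0; 1].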
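To make the distinction concrete, here is a small sketch using only NumPy. The normal density and the value `sigma = 0.1` are illustrative choices, not something taken from the question's data. It shows a density whose peak is well above 1 while its integral stays at 1 and its running integral (a numerical CDF) stays within [0; 1]:

```python
import numpy as np

# Normal density with a small standard deviation (sigma is an illustrative choice).
mu, sigma = 0.0, 0.1
x = np.linspace(mu - 1.0, mu + 1.0, 20001)
pdf = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Trapezoidal rule written out by hand so it does not depend on the NumPy version.
steps = np.diff(x)
areas = 0.5 * (pdf[1:] + pdf[:-1]) * steps
integral = areas.sum()

# The running integral is a numerical CDF.
cdf = np.concatenate(([0.0], np.cumsum(areas)))

print("max of pdf:", pdf.max())
print("integral of pdf:", integral)
print("CDF range:", cdf.min(), cdf.max())
```

The narrower the distribution, the higher the density has to rise for the area under the curve to remain 1, which is why a KDE of tightly clustered data can show y values far above 1.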

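Note: all the continuous distributions that can be fitted are listed on the SciPy site and are available in the scipy.stats package.

Maybe also have a look at probability mass functions.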
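If the data are treated as discrete, one way to get bar heights between 0 and 1 is to plot relative frequencies rather than a density. The sketch below assumes a made-up sample (not the question's data) and uses `np.histogram` with per-observation weights of 1/N:

```python
import numpy as np

# Hypothetical discrete-looking sample (not the question's data).
values = np.array([0.5, 0.5, 0.5, 0.8, 0.8, 0.5, 0.8, 0.5])

# Weighting each observation by 1/N turns counts into relative frequencies.
weights = np.full(values.shape, 1.0 / values.size)
heights, edges = np.histogram(values, bins=10, range=(0.0, 1.0), weights=weights)

print(heights.sum())   # total mass of the bars
print(heights.max())   # tallest bar
```

The same `weights` array can be passed to matplotlib's `plt.hist(values, weights=weights)` to draw the bars. The heights then sum to 1 and each one is a proportion, unlike density values, which only integrate to 1.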

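If you really want the same graph normalized, you need to gather the actual data points of the plotted function (option 1) or of the function definition (option 2), normalize them yourself, and plot them again.

Option 1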

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import sys

print('System versions : {}'.format(sys.version))
print('System versions : {}'.format(sys.version_info))
print('Numpy versqion : {}'.format(np.__version__))
print('matplotlib.pyplot version: {}'.format(matplotlib.__version__))
print('seaborn version : {}'.format(sns.__version__))

protocols = {}

types = {"data_v": "data_v.csv"}

for protname, fname in types.items():
    col_time, col_window = np.loadtxt(fname, delimiter=',').T
    trailing_window = col_window[:-1]  # "past" values at a given index
    leading_window = col_window[1:]    # "current" values at a given index
    decreasing_inds = np.where(leading_window < trailing_window)[0]
    quotient = leading_window[decreasing_inds] / trailing_window[decreasing_inds]
    quotient_times = col_time[decreasing_inds]

    protocols[protname] = {
        "col_time": col_time,
        "col_window": col_window,
        "quotient_times": quotient_times,
        "quotient": quotient,
    }

    fig, (ax1, ax2) = plt.subplots(1, 2, sharey=False, sharex=False)
    g = sns.distplot(quotient, hist=True, label=protname, ax=ax1, rug=True)
    ax1.set_title('basic distplot (kde=True)')

    # get distplot line points
    line = g.get_lines()[0]
    xd = line.get_xdata()
    yd = line.get_ydata()

    # https://stackoverflow.com/questions/29661574/normalize-numpy-array-columns-in-python
    def normalize(x):
        # np.ptp(x, 0) instead of x.ptp(0): the ndarray method is gone in NumPy 2.0
        return (x - x.min(0)) / np.ptp(x, 0)

    # normalize points
    yd2 = normalize(yd)

    # plot them in another graph
    ax2.plot(xd, yd2)
    ax2.set_title('basic distplot (kde=True)\nwith normalized y plot values')
    plt.show()

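Option 2

Below, I tried to perform a KDE myself and normalize the resulting estimate. I am not a statistics expert, so my use of the KDE may be wrong in some way (as the screenshot shows, the result differs from seaborn's, because seaborn does the job much better than I do). Still, the KDE fit obtained with scipy is not that bad.

Screenshot:

Code: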

import numpy as np
from scipy import stats
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import sys

print('System versions : {}'.format(sys.version))
print('System versions : {}'.format(sys.version_info))
print('Numpy versqion : {}'.format(np.__version__))
print('matplotlib.pyplot version: {}'.format(matplotlib.__version__))
print('seaborn version : {}'.format(sns.__version__))

protocols = {}

types = {"data_v": "data_v.csv"}

for protname, fname in types.items():
    col_time, col_window = np.loadtxt(fname, delimiter=',').T
    trailing_window = col_window[:-1]  # "past" values at a given index
    leading_window = col_window[1:]    # "current" values at a given index
    decreasing_inds = np.where(leading_window < trailing_window)[0]
    quotient = leading_window[decreasing_inds] / trailing_window[decreasing_inds]
    quotient_times = col_time[decreasing_inds]

    protocols[protname] = {
        "col_time": col_time,
        "col_window": col_window,
        "quotient_times": quotient_times,
        "quotient": quotient,
    }

    fig, (ax1, ax2, ax3, ax4) = plt.subplots(1, 4, sharey=False, sharex=False)
    diff = quotient_times
    ax1.plot(diff, quotient, ".", label=protname, color="blue")
    ax1.set_ylim(0, 1.0001)
    ax1.set_title(protname)
    ax1.set_xlabel("quotient_times")
    ax1.set_ylabel("quotient")
    ax1.legend()

    sns.distplot(quotient, hist=True, label=protname, ax=ax2, rug=True)
    ax2.set_title('basic distplot (kde=True)')

    # taken from seaborn's source code (utils.py and distributions.py)
    def seaborn_kde_support(data, bw, gridsize, cut, clip):
        if clip is None:
            clip = (-np.inf, np.inf)
        support_min = max(data.min() - bw * cut, clip[0])
        support_max = min(data.max() + bw * cut, clip[1])
        return np.linspace(support_min, support_max, gridsize)

    kde_estim = stats.gaussian_kde(quotient, bw_method='scott')

    # manual linearization of data
    #linearized = np.linspace(quotient.min(), quotient.max(), num=500)

    # or better: mimic seaborn's internal stuff
    bw = kde_estim.scotts_factor() * np.std(quotient)
    linearized = seaborn_kde_support(quotient, bw, 100, 3, None)

    # computes values of the estimated function on the estimated linearized inputs
    Z = kde_estim.evaluate(linearized)

    # https://stackoverflow.com/questions/29661574/normalize-numpy-array-columns-in-python
    def normalize(x):
        # np.ptp(x, 0) instead of x.ptp(0): the ndarray method is gone in NumPy 2.0
        return (x - x.min(0)) / np.ptp(x, 0)

    # normalize so it is between 0;1
    Z2 = normalize(Z)
    for name, func in {'min': np.min, 'max': np.max}.items():
        print('{}: source={}, normalized={}'.format(name, func(Z), func(Z2)))

    # plot is different from seaborn's because not the exact same method is applied
    ax3.plot(linearized, Z, ".", label=protname, color="orange")
    ax3.set_title('Non normalized gaussian kde values')

    # manual kde result with Y axis values normalized (between 0;1)
    ax4.plot(linearized, Z2, ".", label=protname, color="green")
    ax4.set_title('Normalized gaussian kde values')

    plt.show()

Output:

System versions : 3.7.2 (default, Feb 21 2019, 17:35:59) [MSC v.1915 64 bit (AMD64)]
System versions : sys.version_info(major=3, minor=7, micro=2, releaselevel='final', serial=0)
Numpy versqion : 1.16.2
matplotlib.pyplot version: 3.0.2
seaborn version : 0.9.0
min: source=0.0021601491646143518, normalized=0.0
max: source=9.67319154426489, normalized=1.0

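Contrary to what a comment suggested, plotting:

[(x-min(quotient))/(max(quotient)-min(quotient)) for x in quotient]

does not change the behavior! It only changes the source data of the kernel density estimation. The shape of the curve stays the same.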
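This can be checked numerically. The sketch below is an illustration under stated assumptions: `gaussian_kde_1d` is a hypothetical NumPy-only helper with a Scott-style bandwidth (it is neither seaborn's nor SciPy's implementation), and the bimodal sample is made up rather than read from the question's file:

```python
import numpy as np

def gaussian_kde_1d(data, grid):
    """Hypothetical helper: 1-D Gaussian KDE with a Scott-style bandwidth."""
    data = np.asarray(data, dtype=float)
    bandwidth = data.std(ddof=1) * data.size ** (-1.0 / 5.0)
    z = (grid[:, None] - data[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2.0 * np.pi))
    return kernels.mean(axis=1)

rng = np.random.default_rng(0)
# Made-up bimodal sample standing in for `quotient`.
quotient = np.concatenate([rng.normal(0.5, 0.02, 200), rng.normal(0.8, 0.02, 100)])

# Min-max scale the source data, as in the list comprehension above.
span = quotient.max() - quotient.min()
scaled = (quotient - quotient.min()) / span

# Evaluate both estimates on matching grids.
grid = np.linspace(quotient.min(), quotient.max(), 200)
scaled_grid = (grid - quotient.min()) / span

density = gaussian_kde_1d(quotient, grid)
scaled_density = gaussian_kde_1d(scaled, scaled_grid)

# Same curve up to a constant factor: the y values are multiplied by `span`.
print(np.allclose(scaled_density, density * span))
```

Because the bandwidth shrinks together with the data, the scaled estimate is the original curve with its x axis mapped to [0; 1] and its y values multiplied by `span`. Nothing in that rescaling bounds the y values by 1.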

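Quoting seaborn's distplot documentation:

This function combines the matplotlib hist function (with automatic calculation of a good default bin size) with the seaborn kdeplot() and rugplot() functions. It can also fit scipy.stats distributions and plot the estimated PDF over the data.

By default:

kde : bool, optional, set to True by default. Whether to plot a gaussian kernel density estimate.

So it uses a KDE by default. Quoting seaborn's KDE documentation:

Fit and plot a univariate or bivariate kernel density estimate.

Quoting the SciPy gaussian_kde documentation:

Representation of a kernel-density estimate using Gaussian kernels.

Kernel density estimation is a way to estimate the probability density function (PDF) of a random variable in a non-parametric way. gaussian_kde works for both uni-variate and multi-variate data. It includes automatic bandwidth determination. The estimation works best for a unimodal distribution; bimodal or multi-modal distributions tend to be oversmoothed.

Note that I believe your data are bimodal, as you mentioned yourself. They also look discrete. As far as I know, discrete distributions may not be analyzed in the same way as continuous ones, and fitting them may turn out to be tricky.

Here is an example with various distributions: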

import numpy as np
from scipy.stats import uniform, powerlaw, logistic
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import sys

print('System versions : {}'.format(sys.version))
print('System versions : {}'.format(sys.version_info))
print('Numpy versqion : {}'.format(np.__version__))
print('matplotlib.pyplot version: {}'.format(matplotlib.__version__))
print('seaborn version : {}'.format(sns.__version__))

protocols = {}

types = {"data_v": "data_v.csv"}

for protname, fname in types.items():
    col_time, col_window = np.loadtxt(fname, delimiter=',').T
    trailing_window = col_window[:-1]  # "past" values at a given index
    leading_window = col_window[1:]    # "current" values at a given index
    decreasing_inds = np.where(leading_window < trailing_window)[0]
    quotient = leading_window[decreasing_inds] / trailing_window[decreasing_inds]
    quotient_times = col_time[decreasing_inds]

    protocols[protname] = {
        "col_time": col_time,
        "col_window": col_window,
        "quotient_times": quotient_times,
        "quotient": quotient,
    }

    fig, [(ax1, ax2, ax3), (ax4, ax5, ax6)] = plt.subplots(2, 3, sharey=False, sharex=False)
    diff = quotient_times
    ax1.plot(diff, quotient, ".", label=protname, color="blue")
    ax1.set_ylim(0, 1.0001)
    ax1.set_title(protname)
    ax1.set_xlabel("quotient_times")
    ax1.set_ylabel("quotient")
    ax1.legend()

    # min-max scaled copy of the source data
    quotient2 = [(x-min(quotient))/(max(quotient)-min(quotient)) for x in quotient]
    print(quotient2)

    sns.distplot(quotient, hist=True, label=protname, ax=ax2, rug=True)
    ax2.set_title('basic distplot (kde=True)')

    # same plot on the scaled data (the title previously read 'logistic distplot')
    sns.distplot(quotient2, hist=True, label=protname, ax=ax3, rug=True)
    ax3.set_title('normalized quotient distplot')

    sns.distplot(quotient, hist=True, label=protname, ax=ax4, rug=True, kde=False, fit=uniform)
    ax4.set_title('uniform distplot')

    sns.distplot(quotient, hist=True, label=protname, ax=ax5, rug=True, kde=False, fit=powerlaw)
    ax5.set_title('powerlaw distplot')

    sns.distplot(quotient, hist=True, label=protname, ax=ax6, rug=True, kde=False, fit=logistic)
    ax6.set_title('logistic distplot')

    plt.show()

Output:

System versions : 3.7.2 (default, Feb 21 2019, 17:35:59) [MSC v.1915 64 bit (AMD64)]
System versions : sys.version_info(major=3, minor=7, micro=2, releaselevel='final', serial=0)
Numpy versqion : 1.16.2
matplotlib.pyplot version: 3.0.2
seaborn version : 0.9.0
[1.0, 0.05230125523012544, 0.0433775382360589, 0.024590765616971128, 0.05230125523012544, 0.05230125523012544, 0.05230125523012544, 0.02836946874603772, 0.05230125523012544, 0.05230125523012544, 0.05230125523012544, 0.05230125523012544, 0.03393500048652319, 0.05230125523012544, 0.05230125523012544, 0.05230125523012544, 0.0037013196009011043, 0.0, 0.05230125523012544]

Screenshot: