Weighted percentile using numpy

Question

Is there a way to use the numpy.percentile function to compute a weighted percentile? Or is anyone aware of an alternative Python function to compute a weighted percentile?

Thanks!

Joan Smith · Accepted Answer

Unfortunately, numpy doesn't have built-in weighted functions for everything, but you can always put something together.

import numpy as np

def weight_array(ar, weights):
    # Repeat each value weight-many times; this only works for integer weights.
    zipped = zip(ar, weights)
    weighted = []
    for i in zipped:
        for j in range(i[1]):
            weighted.append(i[0])
    return weighted

np.percentile(weight_array(ar, weights), 25)

Alleo · Answer

A completely vectorized numpy solution

Here is the code I use. It's not an optimal one (which I'm unable to write in numpy), but it is still much faster and more reliable than the accepted solution.

import numpy as np

def weighted_quantile(values, quantiles, sample_weight=None,
                      values_sorted=False, old_style=False):
    """ Very close to numpy.percentile, but supports weights.
    NOTE: quantiles should be in [0, 1]!
    :param values: numpy.array with data
    :param quantiles: array-like with many quantiles needed
    :param sample_weight: array-like of the same length as `array`
    :param values_sorted: bool, if True, then will avoid sorting of
        initial array
    :param old_style: if True, will correct output to be consistent
        with numpy.percentile.
    :return: numpy.array with computed quantiles.
    """
    values = np.array(values)
    quantiles = np.array(quantiles)
    if sample_weight is None:
        sample_weight = np.ones(len(values))
    sample_weight = np.array(sample_weight)
    assert np.all(quantiles >= 0) and np.all(quantiles <= 1), \
        'quantiles should be in [0, 1]'

    if not values_sorted:
        sorter = np.argsort(values)
        values = values[sorter]
        sample_weight = sample_weight[sorter]

    # Place each sample at the midpoint of its own weight interval
    weighted_quantiles = np.cumsum(sample_weight) - 0.5 * sample_weight
    if old_style:
        # To be consistent with numpy.percentile
        weighted_quantiles -= weighted_quantiles[0]
        weighted_quantiles /= weighted_quantiles[-1]
    else:
        weighted_quantiles /= np.sum(sample_weight)
    return np.interp(quantiles, weighted_quantiles, values)

Examples:

weighted_quantile([1, 2, 9, 3.2, 4], [0.0, 0.5, 1.])

array([1., 3.2, 9.])

weighted_quantile([1, 2, 9, 3.2, 4], [0.0, 0.5, 1.], sample_weight=[2, 1, 2, 4, 1])

array([1., 3.2, 9.])
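To make the effect of old_style concrete, here is a minimal sketch that compares the rescaled positions against np.percentile on unweighted data. The function `weighted_quantile_sketch` is a condensed, hypothetical re-statement of the answer's function (not a library API), and the name `positions` is mine.

```python
import numpy as np

def weighted_quantile_sketch(values, quantiles, sample_weight=None, old_style=False):
    """Condensed restatement of weighted_quantile, for illustration only."""
    values = np.asarray(values, dtype=float)
    quantiles = np.asarray(quantiles, dtype=float)
    if sample_weight is None:
        sample_weight = np.ones(len(values))
    sample_weight = np.asarray(sample_weight, dtype=float)

    sorter = np.argsort(values)
    values = values[sorter]
    sample_weight = sample_weight[sorter]

    # Midpoint of each sample's weight interval on the cumulative scale
    positions = np.cumsum(sample_weight) - 0.5 * sample_weight
    if old_style:
        # Rescale so the first sample sits at 0 and the last at 1
        positions -= positions[0]
        positions /= positions[-1]
    else:
        positions /= np.sum(sample_weight)
    return np.interp(quantiles, positions, values)

data = [1, 2, 9, 3.2, 4]
q = np.linspace(0.0, 1.0, 11)

rescaled = weighted_quantile_sketch(data, q, old_style=True)
reference = np.percentile(data, 100 * q)   # np.percentile expects 0..100
matches_numpy = np.allclose(rescaled, reference)
```

With equal weights and old_style=True the sample positions become i / (n - 1), which is the same grid that np.percentile interpolates on by default, so `matches_numpy` should come out True here.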

Kambrian · Answer

A quick solution: sort first, then interpolate:

import numpy as np

def weighted_percentile(data, percents, weights=None):
    ''' percents in units of 1%
        weights specifies the frequency (count) of data.
        data and weights are expected to be numpy arrays.
    '''
    if weights is None:
        return np.percentile(data, percents)
    ind = np.argsort(data)
    d = data[ind]
    w = weights[ind]
    # Cumulative weight share, expressed in percent
    p = 1. * w.cumsum() / w.sum() * 100
    y = np.interp(percents, p, d)
    return y

grovduck · Answer

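Apologies for the additional (unoriginal) answer (I don't have enough rep to comment on @nayyarv's). His solution worked for me (i.e. it replicates the default behavior of np.percentile), but I think you can eliminate the for loop by taking clues from how the original np.percentile is written.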

import numpy as np

def weighted_percentile(a, q=np.array([75, 25]), w=None):
    """
    Calculates percentiles associated with a (possibly weighted) array

    Parameters
    ----------
    a : array-like
        The input array from which to calculate percents
    q : array-like
        The percentiles to calculate (0.0 - 100.0)
    w : array-like, optional
        The weights to assign to values of a.  Equal weighting if None
        is specified

    Returns
    -------
    values : np.array
        The values associated with the specified percentiles.
    """
    # Standardize and sort based on values in a
    q = np.array(q) / 100.0
    if w is None:
        w = np.ones(a.size)
    idx = np.argsort(a)
    a_sort = a[idx]
    w_sort = w[idx]

    # Get the cumulative sum of weights
    ecdf = np.cumsum(w_sort)

    # Find the percentile index positions associated with the percentiles
    p = q * (w.sum() - 1)

    # Find the bounding indices (both low and high)
    idx_low = np.searchsorted(ecdf, p, side='right')
    idx_high = np.searchsorted(ecdf, p + 1, side='right')
    idx_high[idx_high > ecdf.size - 1] = ecdf.size - 1

    # Calculate the weights
    weights_high = p - np.floor(p)
    weights_low = 1.0 - weights_high

    # Extract the low/high indexes and multiply by the corresponding weights
    x1 = np.take(a_sort, idx_low) * weights_low
    x2 = np.take(a_sort, idx_high) * weights_high

    # Return the average
    return np.add(x1, x2)

# Sample data (np.float and np.int were removed in NumPy 1.24;
# the builtin float and int are used instead)
a = np.array([1.0, 2.0, 9.0, 3.2, 4.0], dtype=float)
w = np.array([2.0, 1.0, 3.0, 4.0, 1.0], dtype=float)

# Make an unweighted "copy" of a for testing
a2 = np.repeat(a, w.astype(int))

# Tests with different percentiles chosen
q1 = np.linspace(0.0, 100.0, 11)
q2 = np.linspace(5.0, 95.0, 10)
q3 = np.linspace(4.0, 94.0, 10)
for q in (q1, q2, q3):
    assert np.all(weighted_percentile(a, q, w) == np.percentile(a2, q))

HYRY · Answer

I don't know what a weighted percentile means, but from @Joan Smith's answer it seems that you just need to repeat every element in ar. You can use numpy.repeat():

import numpy as np
np.repeat([1,2,3], [4,5,6])

The result is:

array([1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3])
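To connect this back to the question, here is a small sketch that feeds the repeated array to np.percentile. It assumes the weights are non-negative integers (repeat counts); the variable names are only illustrative.

```python
import numpy as np

values = np.array([1, 2, 3])
counts = np.array([4, 5, 6])   # integer weights, used as repeat counts

# Expand to 15 elements, then take an ordinary percentile
expanded = np.repeat(values, counts)
p25 = np.percentile(expanded, 25)
```

The expanded array has as many elements as the sum of the weights, so memory use grows with the weights rather than with the number of distinct values.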

PiHalbe · Answer

I use this function for my needs:

import numpy

def quantile_at_values(values, population, weights=None):
    values = numpy.atleast_1d(values).astype(float)
    population = numpy.atleast_1d(population).astype(float)
    # if no weights are given, use equal weights
    if weights is None:
        weights = numpy.ones(population.shape).astype(float)
        normal = float(len(weights))
    # else, check weights
    else:
        weights = numpy.atleast_1d(weights).astype(float)
        assert len(weights) == len(population)
        assert (weights >= 0).all()
        normal = numpy.sum(weights)
        assert normal > 0.
    # weighted share of the population that is <= each requested value
    quantiles = numpy.array([numpy.sum(weights[population <= value]) for value in values]) / normal
    assert (quantiles >= 0).all() and (quantiles <= 1).all()
    return quantiles

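It is vectorized as far as I could go.
It has a lot of sanity checks.
It uses floats as weights.
It works without weights (→ equal weights).
It can compute multiple quantiles at once.

Multiply the results by 100 if you want percentiles instead of quantiles.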
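If the remaining per-value loop becomes a bottleneck, one possible way to remove it is to sort once and look the values up in the cumulative weights. This is only a sketch of that idea under the same "share of weight with population <= value" definition; `weighted_ecdf_at` is a hypothetical name and it leaves out the sanity checks.

```python
import numpy as np

def weighted_ecdf_at(values, population, weights=None):
    """Weighted share of `population` that is <= each entry of `values`."""
    values = np.atleast_1d(values).astype(float)
    population = np.atleast_1d(population).astype(float)
    if weights is None:
        weights = np.ones(population.shape)
    weights = np.atleast_1d(weights).astype(float)

    order = np.argsort(population)
    sorted_pop = population[order]
    # cum_weights[k] is the total weight of the k smallest population entries
    cum_weights = np.concatenate(([0.0], np.cumsum(weights[order])))
    # Number of population entries <= each requested value
    counts = np.searchsorted(sorted_pop, values, side="right")
    return cum_weights[counts] / cum_weights[-1]

quantiles = weighted_ecdf_at([2, 0, 4], [1, 2, 3, 4], weights=[1, 2, 3, 4])
percentiles = 100 * quantiles
```

The sort costs O(n log n) once, after which each lookup is a binary search instead of a full pass over the population.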

nayyarv · Answer

import numpy as np

def weighted_percentile(a, percentile=np.array([75, 25]), weights=None):
    """
    O(nlgn) implementation for weighted_percentile.
    """
    percentile = np.array(percentile) / 100.0
    if weights is None:
        weights = np.ones(len(a))
    a_indsort = np.argsort(a)
    a_sort = a[a_indsort]
    weights_sort = weights[a_indsort]
    ecdf = np.cumsum(weights_sort)

    percentile_index_positions = percentile * (weights.sum() - 1) + 1
    # need the 1 offset at the end due to ecdf not starting at 0
    locations = np.searchsorted(ecdf, percentile_index_positions)

    out_percentiles = np.zeros(len(percentile_index_positions))

    for i, empiricalLocation in enumerate(locations):
        # iterate across the requested percentiles
        if ecdf[empiricalLocation - 1] == np.floor(percentile_index_positions[i]):
            # i.e. is the percentile in between 2 separate values
            uppWeight = percentile_index_positions[i] - ecdf[empiricalLocation - 1]
            lowWeight = 1 - uppWeight

            out_percentiles[i] = a_sort[empiricalLocation - 1] * lowWeight + \
                                 a_sort[empiricalLocation] * uppWeight
        else:
            # i.e. the percentile is entirely in one bin
            out_percentiles[i] = a_sort[empiricalLocation]

    return out_percentiles

This is my function; it behaves identically to

np.percentile(np.repeat(a, weights), percentile)

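with less memory overhead. np.percentile is an O(n) implementation, so it is potentially faster for small weights. All the edge cases are sorted out, and it is an exact solution. The interpolation answers above assume the distribution is linear between points, when in most cases it is a step, except where the weight is 1.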

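Say we have data [1, 2, 3] with weights [3, 7, 11] and I want the 25% percentile. My ecdf is going to be [3, 10, 21] and I'm looking for the 5th value. The interpolation will see [3, 1] and [10, 2] as the matches and yield 1.28, despite the position being entirely in the 2nd bin, which has a value of 2.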
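The numbers in this example can be checked with a few lines. This sketch assumes weights whose cumulative sum is the quoted ecdf [3, 10, 21], and uses np.repeat as the exact reference; it does not call any of the functions from the answers.

```python
import numpy as np

data = np.array([1, 2, 3])
weights = np.array([3, 7, 11])   # assumed: cumsum gives the quoted ecdf
ecdf = np.cumsum(weights)        # [3, 10, 21]

# Exact reference: expand the data and take an ordinary percentile
exact = np.percentile(np.repeat(data, weights), 25)

# Linear interpolation between the (ecdf, value) points at position 5
position = 0.25 * (weights.sum() - 1)
interpolated = np.interp(position, ecdf, data)
```

Here `exact` lands on 2, because position 5 falls inside the run of 2s, while `interpolated` is 1 + 2/7, about 1.286, which is the 1.28 mentioned above.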

Qwerty · Answer

As mentioned in the comments, simply repeating values is impossible for float weights and impractical for very large datasets. There is a library that does weighted percentiles here: http://kochanski.org/gpk/code/speechresearch/gmisclib/gmisclib.weighted_percentile-module.html It worked for me.
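For readers who only need something small, here is a rough sketch of a percentile that accepts float weights without repeating anything. It is not the linked library's code; it assumes a step-function definition (the smallest value whose cumulative weight share reaches the requested percent), which is one convention among several and will not always agree with np.percentile. The name `weighted_percentile_step` is hypothetical.

```python
import numpy as np

def weighted_percentile_step(values, weights, percent):
    """Smallest value whose cumulative weight share reaches `percent` (0..100)."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)

    order = np.argsort(values)
    values, weights = values[order], weights[order]

    cumulative = np.cumsum(weights)
    target = percent / 100.0 * cumulative[-1]
    # First index whose cumulative weight is >= target
    idx = np.searchsorted(cumulative, target, side="left")
    # Clamp in case rounding pushes the target just past the total weight
    idx = min(idx, len(values) - 1)
    return values[idx]

median = weighted_percentile_step([10.0, 20.0, 30.0], [0.5, 0.25, 2.0], 50)
```

Memory use stays proportional to the number of samples, whatever the size of the weights.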

Max Ghenis · Answer

The weightedcalcs package supports quantiles:

import weightedcalcs as wc
import pandas as pd

df = pd.DataFrame({'v': [1, 2, 3], 'w': [3, 2, 1]})
calc = wc.Calculator('w')  # w designates weight

calc.quantile(df, 'v', 0.5)
# 1.5

Luca Jokull · Answer

Here is my solution:

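import numpy as np

def my_weighted_perc(data, perc, weights=None):
    if weights is None:
        return np.nanpercentile(data, perc)
    else:
        # Drop entries where either the value or its weight is NaN,
        # and apply the same mask to both arrays so they stay aligned
        mask = (~np.isnan(data)) & (~np.isnan(weights))
        d = data[mask]
        w = weights[mask]
        ix = np.argsort(d)
        d = d[ix]
        wei = w[ix]
        wei_cum = 100. * np.cumsum(wei * 1. / np.sum(wei))
        return np.interp(perc, wei_cum, d)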

It simply computes the weighted CDF of the data and then uses it to estimate the weighted percentiles.