How to split data into balanced training and test sets in sklearn

Question

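I am using sklearn for a multi-class classification task. I need to split alldata into train_set and test_set. I want to randomly take the same number of samples from each class. Currently I am using this function: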

X_train, X_test, y_train, y_test = cross_validation.train_test_split(Data, Target, test_size=0.3, random_state=0)

but it gives an unbalanced dataset! Any suggestions?

Christian Hirsch · Accepted Answer

You can use StratifiedShuffleSplit to create datasets that have the same class proportions as the original data:

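import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.array([[1, 3], [3, 7], [2, 4], [4, 8]])
y = np.array([0, 1, 0, 1])

# In sklearn.model_selection, n_splits replaces the old n_iter argument
# and the data and labels are passed to split() instead of the constructor.
stratSplit = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=42)
for train_idx, test_idx in stratSplit.split(X, y):
    X_train = X[train_idx]
    y_train = y[train_idx]

print(X_train)  # two rows, one from each class
print(y_train)  # one 0 and one 1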

Guiem Bosch · Answer

Christian's suggestion is correct, but technically train_test_split should give you stratified results if you use the stratify parameter.

So you could do:

X_train, X_test, y_train, y_test = cross_validation.train_test_split(Data, Target, test_size=0.3, random_state=0, stratify=Target)

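The catch is that this is only available from version 0.17 of sklearn onward.

From the documentation about the stratify parameter:

stratify : array-like or None (default is None). If not None, data is split in a stratified fashion, using this as the labels array. New in version 0.17: stratify splitting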
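
To illustrate what the stratify parameter does, here is a minimal sketch on made-up data (the arrays X and y are invented for this example). It assumes a recent scikit-learn, where train_test_split is imported from sklearn.model_selection rather than the older cross_validation module.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 80 samples of class 0, 20 of class 1.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Count how many samples of each class ended up in each split.
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print(train_counts, test_counts)
```

Both splits keep the original 80/20 class ratio. In other words, stratification preserves the imbalance of the input; it does not equalize the number of samples per class.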

antike · Answer

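Stratification does not help if the classes are imbalanced but you want the split itself to be balanced. There doesn't seem to be a method in sklearn for balanced sampling, but it is fairly easy with basic numpy. For example, a function like this might help: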

import numpy as np

def split_balanced(data, target, test_size=0.2):
    classes = np.unique(target)
    # test_size can be given as a fraction of the input data size or as a number of samples
    if test_size < 1:
        n_test = np.round(len(target) * test_size)
    else:
        n_test = test_size
    n_train = max(0, len(target) - n_test)
    n_train_per_class = max(1, int(np.floor(n_train / len(classes))))
    n_test_per_class = max(1, int(np.floor(n_test / len(classes))))

    ixs = []
    for cl in classes:
        if (n_train_per_class + n_test_per_class) > np.sum(target == cl):
            # if data has too few samples for this class, do upsampling
            # split the data to training and testing before sampling so data points won't be
            # shared among training and test data
            splitix = int(np.ceil(n_train_per_class / (n_train_per_class + n_test_per_class) * np.sum(target == cl)))
            ixs.append(np.r_[np.random.choice(np.nonzero(target == cl)[0][:splitix], n_train_per_class),
                             np.random.choice(np.nonzero(target == cl)[0][splitix:], n_test_per_class)])
        else:
            ixs.append(np.random.choice(np.nonzero(target == cl)[0], n_train_per_class + n_test_per_class,
                                        replace=False))

    # take same num of samples from all classes
    ix_train = np.concatenate([x[:n_train_per_class] for x in ixs])
    ix_test = np.concatenate([x[n_train_per_class:(n_train_per_class + n_test_per_class)] for x in ixs])

    X_train = data[ix_train, :]
    X_test = data[ix_test, :]
    y_train = target[ix_train]
    y_test = target[ix_test]

    return X_train, X_test, y_train, y_test

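Note that if you use this and ask for more points per class than the input data contains, those classes will be upsampled (sampled with replacement). As a result, some data points will appear multiple times, which may affect accuracy measures and similar metrics. Also, if a class has only one data point, there will be an error. You can easily check the number of points per class with, for example, np.unique(target, return_counts=True).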
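
The check described above can be sketched as follows. The label array and the helper name max_balanced_per_class are made up for illustration. The idea is to look at the size of the rarest class before requesting a balanced split, since asking for more samples per class than that forces sampling with replacement.

```python
import numpy as np

def max_balanced_per_class(target):
    """Return (classes, counts, size of the smallest class) for a label array."""
    classes, counts = np.unique(target, return_counts=True)
    return classes, counts, int(counts.min())

# Hypothetical labels: class 2 is rare.
target = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2])
classes, counts, n_min = max_balanced_per_class(target)
print(dict(zip(classes.tolist(), counts.tolist())))  # {0: 4, 1: 3, 2: 2}
print(n_min)  # 2

# Drawing more than n_min points from class 2 requires replace=True,
# which necessarily produces repeated indices.
rng = np.random.default_rng(0)
idx_class2 = np.nonzero(target == 2)[0]
drawn = rng.choice(idx_class2, size=5, replace=True)
n_duplicates = len(drawn) - len(np.unique(drawn))
print(n_duplicates >= 3)  # True: only 2 distinct indices exist
```

If the smallest class size is below the number of samples you need per class, either lower the request or accept that duplicated points will appear in the split.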

Cobry · Answer

This is the implementation I use to get train/test data indexes:

import random
import numpy as np

def get_safe_balanced_split(target, trainSize=0.8, getTestIndexes=True, shuffle=False, seed=None):
    classes, counts = np.unique(target, return_counts=True)
    nPerClass = float(len(target)) * float(trainSize) / float(len(classes))
    if nPerClass > np.min(counts):
        print("Insufficient data to produce a balanced training data split.")
        print("Classes found %s" % classes)
        print("Classes count %s" % counts)
        ts = float(trainSize * np.min(counts) * len(classes)) / float(len(target))
        print("trainSize is reset from %s to %s" % (trainSize, ts))
        trainSize = ts
        nPerClass = float(len(target)) * float(trainSize) / float(len(classes))
    # number of samples to draw per class
    nPerClass = int(nPerClass)
    print("Data splitting on %i classes and returning %i per class" % (len(classes), nPerClass))
    # get indexes
    trainIndexes = []
    for c in classes:
        if seed is not None:
            np.random.seed(seed)
        cIdxs = np.where(target == c)[0]
        cIdxs = np.random.choice(cIdxs, nPerClass, replace=False)
        trainIndexes.extend(cIdxs)
    # get test indexes
    testIndexes = None
    if getTestIndexes:
        testIndexes = list(set(range(len(target))) - set(trainIndexes))
    # shuffle (random.shuffle works in place and returns None, so do not reassign)
    if shuffle:
        random.shuffle(trainIndexes)
        if testIndexes is not None:
            random.shuffle(testIndexes)
    # return indexes
    return trainIndexes, testIndexes
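
For comparison, here is a compact sketch of the same index-based idea. It is a simplified variant and not the code above: the function name balanced_train_indices and the n_per_class argument are invented for this example, and it assumes a single seeded numpy Generator rather than re-seeding the global state for each class.

```python
import numpy as np

def balanced_train_indices(target, n_per_class, seed=None):
    """Pick n_per_class indices from every class, without replacement."""
    rng = np.random.default_rng(seed)
    target = np.asarray(target)
    train = []
    for c in np.unique(target):
        class_idx = np.flatnonzero(target == c)
        # Raises ValueError if the class has fewer than n_per_class samples.
        train.append(rng.choice(class_idx, size=n_per_class, replace=False))
    train_idx = np.concatenate(train)
    # Everything not selected for training becomes the test set.
    test_idx = np.setdiff1d(np.arange(len(target)), train_idx)
    return train_idx, test_idx

# Hypothetical labels: 10 samples of class 0 and 4 of class 1.
target = np.array([0] * 10 + [1] * 4)
train_idx, test_idx = balanced_train_indices(target, n_per_class=3, seed=0)
print(np.bincount(target[train_idx]))  # [3 3]
print(np.bincount(target[test_idx]))   # [7 1]
```

As with the implementation above, only the training indexes are balanced here; the test set is whatever remains, so it keeps the leftover imbalance.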