Stratified random sampling from a data frame

Question

I have a data frame in the following format:

    head(subset)
    # ants  0 1 1 0 1
    # age   1 2 2 1 3
    # lc    1 1 0 1 0

I need to create a new data frame made of random samples drawn according to age and lc. For example, I need 30 samples with age 1 and lc 1, 30 samples with age 1 and lc 0, and so on.

I have looked at random sampling methods such as:

    newdata <- function(subset, age, 30)

But that is not the code I want.

Thomas · Answer

Here is some example data:

    set.seed(1)
    n <- 1e4
    d <- data.frame(age = sample(1:5, n, TRUE),
                    lc = rbinom(n, 1, .5),
                    ants = rbinom(n, 1, .7))

You can split your data.frame (d in this example), sample rows/observations from each subsample, and then rbind the pieces back together. Here is how it works:

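    # one data frame per age/lc combination
    sp <- split(d, list(d$age, d$lc))
    # draw 30 rows without replacement from each stratum
    samples <- lapply(sp, function(x) x[sample(1:nrow(x), 30, FALSE), ])
    # stack the sampled strata into a single data frame
    out <- do.call(rbind, samples)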

The result:

    > str(out)
    'data.frame':   300 obs. of  3 variables:
     $ age : int  1 1 1 1 1 1 1 1 1 1 ...
     $ lc  : int  0 0 0 0 0 0 0 0 0 0 ...
     $ ants: int  1 1 0 1 1 1 1 1 1 1 ...
    > head(out)
             age lc ants
    1.0.2242   1  0    1
    1.0.4417   1  0    1
    1.0.389    1  0    0
    1.0.4578   1  0    1
    1.0.8170   1  0    1
    1.0.5606   1  0    1
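
The split / sample / rbind idea can also be wrapped in a small reusable function. The following is only a sketch under stated assumptions: `stratified_sample` and its arguments `df`, `by` and `size` are hypothetical names that do not appear in the answer above, and capping the draw at the stratum size is my own choice, made so that a stratum with fewer than `size` rows does not raise an error.

```r
# Hypothetical helper (not part of the original answer): draw up to `size`
# rows without replacement from every combination of the grouping columns.
stratified_sample <- function(df, by, size) {
  # drop = TRUE skips combinations that do not occur in the data
  groups <- split(df, df[by], drop = TRUE)
  picked <- lapply(groups, function(g) {
    k <- min(size, nrow(g))                  # cap at the stratum size
    g[sample.int(nrow(g), k), , drop = FALSE]
  })
  do.call(rbind, picked)
}

# Same simulated data as in the answer above
set.seed(1)
n <- 1e4
d <- data.frame(age  = sample(1:5, n, TRUE),
                lc   = rbinom(n, 1, .5),
                ants = rbinom(n, 1, .7))

out <- stratified_sample(d, c("age", "lc"), 30)
table(out$age, out$lc)   # rows drawn per age/lc combination
```

With this example data every stratum holds far more than 30 rows, so each of the 10 combinations contributes exactly 30 rows.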

djhurio · Answer

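See the strata function in the sampling package. It selects a stratified simple random sample and returns the sample as its result. Two extra columns are added: the inclusion probability (Prob) and the stratum indicator (Stratum). See the example.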

    require(data.table)
    require(sampling)

    set.seed(1)
    n <- 1e4
    d <- data.table(age = sample(1:5, n, TRUE),
                    lc = rbinom(n, 1, .5),
                    ants = rbinom(n, 1, .7))

    # Sort
    setkey(d, age, lc)

    # Population size by strata
    d[, .N, keyby = list(age, lc)]
    #     age lc    N
    #  1:   1  0 1010
    #  2:   1  1 1002
    #  3:   2  0  993
    #  4:   2  1 1026
    #  5:   3  0 1021
    #  6:   3  1  982
    #  7:   4  0  958
    #  8:   4  1  940
    #  9:   5  0 1012
    # 10:   5  1 1056

    # Select sample
    set.seed(2)
    s <- data.table(strata(d, c("age", "lc"), rep(30, 10), "srswor"))

    # Sample size by strata
    s[, .N, keyby = list(age, lc)]
    #     age lc  N
    #  1:   1  0 30
    #  2:   1  1 30
    #  3:   2  0 30
    #  4:   2  1 30
    #  5:   3  0 30
    #  6:   3  1 30
    #  7:   4  0 30
    #  8:   4  1 30
    #  9:   5  0 30
    # 10:   5  1 30

AdamO · Answer

Unless I am misunderstanding the question, this is ridiculously easy to do with basic built-in functions.

Step 1: Create a stratum indicator using the interaction function.

Step 2: Use tapply on a sequence of row indicators to identify the indices of the random sample.

Step 3: Subset the data with those indices.

Using the example data from @Thomas:

    set.seed(1)
    n <- 1e4
    d <- data.frame(age = sample(1:5, n, TRUE),
                    lc = rbinom(n, 1, .5),
                    ants = rbinom(n, 1, .7))

    ## stratum indicator
    d$group <- interaction(d[, c('age', 'lc')])

    ## sample selection
    indices <- tapply(1:nrow(d), d$group, sample, 30)

    ## obtain subsample
    subsampd <- d[unlist(indices, use.names = FALSE), ]

Verifying that the stratification is correct:

    > table(subsampd$group)

    1.0 2.0 3.0 4.0 5.0 1.1 2.1 3.1 4.1 5.1
     30  30  30  30  30  30  30  30  30  30
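
One caveat with passing `sample` straight to `tapply`: when a stratum contains a single row, `sample()` receives one number `x` and treats it as the range `1:x`, and a stratum with fewer rows than requested raises an error. The example data never hits either case, but a guarded variant might look like the sketch below. `safe_sample` and the `toy` data frame are hypothetical names introduced here for illustration only.

```r
# Hypothetical helper: sample up to `size` elements of an index vector.
# Indexing through sample.int() avoids the sample() convention that treats
# a single number x as the range 1:x.
safe_sample <- function(idx, size) {
  idx[sample.int(length(idx), min(size, length(idx)))]
}

# Toy data: stratum "a" has 5 rows, stratum "b" has exactly one row (row 6)
toy <- data.frame(group = factor(c("a", "a", "a", "a", "a", "b")),
                  value = 1:6)

set.seed(1)
indices <- tapply(seq_len(nrow(toy)), toy$group, safe_sample, 3)
picked  <- toy[unlist(indices, use.names = FALSE), ]
table(picked$group)   # 3 rows from "a", 1 row from "b"
```

The same three steps apply unchanged; only the function handed to `tapply` differs.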

mrbrich · Answer

Here is a one-liner using data.table:

    library(data.table)

    set.seed(1)
    n <- 1e4
    d <- data.table(age = sample(1:5, n, TRUE),
                    lc = rbinom(n, 1, .5),
                    ants = rbinom(n, 1, .7))

    # 30 random rows from each age/lc group
    out <- d[, .SD[sample(1:.N, 30)], by = .(age, lc)]

    # Check
    out[, table(age, lc)]
    ##    lc
    ## age  0  1
    ##   1 30 30
    ##   2 30 30
    ##   3 30 30
    ##   4 30 30
    ##   5 30 30