How to implement coalesce efficiently in R

Question

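Background

Several SQL dialects (I mostly use PostgreSQL) have a function called coalesce, which returns the first non-NULL column element for each row. It is very convenient when tables contain many NULL elements.

The same need comes up in many scenarios in R, when dealing with loosely structured data that contains a lot of NAs.

I wrote a naive implementation myself, but it is ridiculously slow.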

    coalesce <- function(...) {
      apply(cbind(...), 1, function(x) {
        x[which(!is.na(x))[1]]
      })
    }

Example

    a <- c(1,  2,  NA, 4, NA)
    b <- c(NA, NA, NA, 5, 6)
    c <- c(7,  8,  NA, 9, 10)
    coalesce(a,b,c)
    # [1]  1  2 NA  4  6

Question

Is there an efficient way to implement coalesce in R?

mrip · Accepted Answer

On my machine, using Reduce gives roughly a 5x performance improvement.

    coalesce2 <- function(...) {
      Reduce(function(x, y) {
        i <- which(is.na(x))
        x[i] <- y[i]
        x},
      list(...))
    }

    > microbenchmark(coalesce(a,b,c),coalesce2(a,b,c))
    Unit: microseconds
                   expr    min       lq   median       uq     max neval
      coalesce(a, b, c) 97.669 100.7950 102.0120 103.0505 243.438   100
     coalesce2(a, b, c) 19.601  21.4055  22.8835  23.8315  45.419   100
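To see what Reduce does here, the following sketch runs the same fill-in step with accumulate = TRUE, which keeps each intermediate result. The helper name fill_na is only an illustrative label for the anonymous function above.

```r
a <- c(1,  2,  NA, 4, NA)
b <- c(NA, NA, NA, 5, 6)
c <- c(7,  8,  NA, 9, 10)

# One pairwise step: fill the NA positions of x with the values of y
fill_na <- function(x, y) {
  i <- which(is.na(x))
  x[i] <- y[i]
  x
}

# accumulate = TRUE returns every intermediate vector, not just the last one
steps <- Reduce(fill_na, list(a, b, c), accumulate = TRUE)

steps[[1]]  # a unchanged:          1  2 NA  4 NA
steps[[2]]  # after filling from b: 1  2 NA  4  6
steps[[3]]  # after filling from c: 1  2 NA  4  6
```

Each step works on whole vectors, so the number of R-level function calls grows with the number of arguments rather than with the number of rows.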

Martin Morgan · Answer

It looks like coalesce1 is still available:

    coalesce1 <- function(...) {
      ans <- ..1
      for (elt in list(...)[-1]) {
        i <- is.na(ans)
        ans[i] <- elt[i]
      }
      ans
    }

It is faster still (but it is more or less a hand rewrite of Reduce, so it is less general):

    > identical(coalesce(a, b, c), coalesce1(a, b, c))
    [1] TRUE
    > microbenchmark(coalesce(a,b,c), coalesce1(a, b, c), coalesce2(a,b,c))
    Unit: microseconds
                   expr     min       lq   median       uq     max neval
      coalesce(a, b, c) 336.266 341.6385 344.7320 355.4935 538.348   100
     coalesce1(a, b, c)   8.287   9.4110  10.9515  12.1295  20.940   100
     coalesce2(a, b, c)  37.711  40.1615  42.0885  45.1705  67.258   100

Or, for larger data, compare

    coalesce1a <- function(...) {
      ans <- ..1
      for (elt in list(...)[-1]) {
        i <- which(is.na(ans))
        ans[i] <- elt[i]
      }
      ans
    }

This shows that which() can sometimes be effective, even though it implies a second pass through the index.

    > aa <- sample(a, 100000, TRUE)
    > bb <- sample(b, 100000, TRUE)
    > cc <- sample(c, 100000, TRUE)
    > microbenchmark(coalesce1(aa, bb, cc),
    +                coalesce1a(aa, bb, cc),
    +                coalesce2(aa,bb,cc), times=10)
    Unit: milliseconds
                       expr       min        lq    median        uq       max neval
      coalesce1(aa, bb, cc) 11.110024 11.137963 11.145723 11.212907 11.270533    10
     coalesce1a(aa, bb, cc)  2.906067  2.953266  2.962729  2.971761  3.452251    10
      coalesce2(aa, bb, cc)  3.080842  3.115607  3.139484  3.166642  3.198977    10
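The only difference between coalesce1 and coalesce1a is the kind of index used for the assignment. The sketch below, with made-up vectors x and y, shows that a logical mask and the integer positions from which() select the same elements, so the two variants should differ in speed only.

```r
set.seed(1)
x <- sample(c(1, NA), 10, replace = TRUE)  # illustrative data with some NAs
y <- seq_len(10) * 10                      # replacement values, no NAs

# Logical index: one TRUE/FALSE per element of x
by_logical <- x
i <- is.na(x)
by_logical[i] <- y[i]

# Integer index: only the positions that are NA
by_which <- x
j <- which(is.na(x))
by_which[j] <- y[j]

identical(by_logical, by_which)
```

The integer index is shorter than the logical mask whenever only some of the elements are NA.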

zx8754 · Answer

Using the dplyr package:

    library(dplyr)
    coalesce(a, b, c)
    # [1]  1  2 NA  4  6

Benchmark: not as fast as the accepted solution.

    coalesce2 <- function(...) {
      Reduce(function(x, y) {
        i <- which(is.na(x))
        x[i] <- y[i]
        x},
      list(...))
    }

    microbenchmark::microbenchmark(
      coalesce(a, b, c),
      coalesce2(a, b, c)
    )
    # Unit: microseconds
    #                expr    min     lq     mean median      uq     max neval cld
    #   coalesce(a, b, c) 21.951 24.518 27.28264 25.515 26.9405 126.293   100   b
    #  coalesce2(a, b, c)  7.127  8.553  9.68731  9.123  9.6930  27.368   100  a

But on a larger dataset, the two are comparable:

    aa <- sample(a, 100000, TRUE)
    bb <- sample(b, 100000, TRUE)
    cc <- sample(c, 100000, TRUE)

    microbenchmark::microbenchmark(
      coalesce(aa, bb, cc),
      coalesce2(aa, bb, cc))
    # Unit: milliseconds
    #                   expr      min       lq     mean   median       uq      max neval cld
    #   coalesce(aa, bb, cc) 1.708511 1.837368 5.468123 3.268492 3.511241 96.99766   100   a
    #  coalesce2(aa, bb, cc) 1.474171 1.516506 3.312153 1.957104 3.253240 91.05223   100   a

krlmlr · Answer

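A ready-to-use implementation called coalesce.na is available in my misc package. It seems to be competitive, but it is not the fastest. It also works for vectors of different lengths and has special handling for vectors of length one.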

                        expr        min          lq      median          uq         max neval
        coalesce(aa, bb, cc) 990.060402 1030.708466 1067.000698 1083.301986 1280.734389    10
       coalesce1(aa, bb, cc)  11.356584   11.448455   11.804239   12.507659   14.922052    10
      coalesce1a(aa, bb, cc)   2.739395    2.786594    2.852942    3.312728    5.529927    10
       coalesce2(aa, bb, cc)   2.929364    3.041345    3.593424    3.868032    7.838552    10
     coalesce.na(aa, bb, cc)   4.640552    4.691107    4.858385    4.973895    5.676463    10

Here is the code:

    coalesce.na <- function(x, ...) {
      x.len <- length(x)
      ly <- list(...)
      for (y in ly) {
        y.len <- length(y)
        if (y.len == 1) {
          x[is.na(x)] <- y
        } else {
          if (x.len %% y.len != 0)
            warning('object length is not a multiple of first object length')
          pos <- which(is.na(x))
          x[pos] <- y[(pos - 1) %% y.len + 1]
        }
      }
      x
    }
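As a usage sketch of the recycling behaviour described above, the block below repeats the definition so that it runs on its own and then calls it with a length-one fallback and with a shorter fallback vector. The input vector x and the fallback values are made up for illustration.

```r
coalesce.na <- function(x, ...) {
  x.len <- length(x)
  ly <- list(...)
  for (y in ly) {
    y.len <- length(y)
    if (y.len == 1) {
      # Length-one fallback: every remaining NA gets the same value
      x[is.na(x)] <- y
    } else {
      if (x.len %% y.len != 0)
        warning('object length is not a multiple of first object length')
      # Shorter fallback: recycle y by mapping each NA position into 1..y.len
      pos <- which(is.na(x))
      x[pos] <- y[(pos - 1) %% y.len + 1]
    }
  }
  x
}

x <- c(NA, 2, NA, 4)

filled_scalar <- coalesce.na(x, 0)          # 0 2 0 4
filled_short  <- coalesce.na(x, c(10, 20))  # 10 2 10 4
```

In the second call, positions 1 and 3 both map to the first element of the fallback, which matches how R would recycle c(10, 20) to length four.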

Of course, as Kevin pointed out, an Rcpp solution might be faster by orders of magnitude.

sdgfsdh · Answer

A very simple solution is to use the ifelse function from the base package:

    coalesce3 <- function(x, y) {
      ifelse(is.na(x), y, x)
    }

However, it appears to be slower than coalesce2 above:

    test <- function(a, b, func) {
      for (i in 1:10000) {
        func(a, b)
      }
    }

    system.time(test(a, b, coalesce2))
       user  system elapsed
       0.11    0.00    0.10

    system.time(test(a, b, coalesce3))
       user  system elapsed
       0.16    0.00    0.15

You can use Reduce to make it work for an arbitrary number of vectors:

    coalesce4 <- function(...) {
      Reduce(coalesce3, list(...))
    }
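A short usage sketch for coalesce4, together with one practical difference from the subassignment approach of coalesce2: ifelse does not carry over the class of its yes and no arguments, so classed vectors such as Dates come back as plain numbers. The Date vectors d1 and d2 are made up for illustration.

```r
coalesce2 <- function(...) {
  Reduce(function(x, y) {
    i <- which(is.na(x))
    x[i] <- y[i]
    x
  }, list(...))
}
coalesce3 <- function(x, y) ifelse(is.na(x), y, x)
coalesce4 <- function(...) Reduce(coalesce3, list(...))

a <- c(1,  2,  NA, 4, NA)
b <- c(NA, NA, NA, 5, 6)
c <- c(7,  8,  NA, 9, 10)
res4 <- coalesce4(a, b, c)   # 1 2 NA 4 6

# Illustrative Date input: the class survives subassignment but not ifelse
d1 <- as.Date(c("2020-01-01", NA))
d2 <- as.Date(c(NA, "2021-06-30"))
class(coalesce4(d1, d2))     # "numeric"
class(coalesce2(d1, d2))     # "Date"
```

For plain numeric or character vectors the two approaches return the same values.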

Henrik · Answer

From data.table >= 1.12.3 you can use coalesce.

    library(data.table)
    coalesce(a, b, c)
    # [1]  1  2 NA  4  6

For more information, including a benchmark, see NEWS item #18 for development version 1.12.3. For installing the development version, see here.

Taz · Answer

Here is my solution:

    coalesce <- function(x){
      y <- head(x[!is.na(x)], 1)
      return(y)
    }

It returns the first value that is not NA, and it works with data.table. For example, if you want to coalesce a few columns and their names are stored in a character vector:

    column_names <- c("col1", "col2", "col3")

Usage:

    ranking[, coalesce_column := coalesce( mget(column_names) ), by = 1:nrow(ranking)]

lmo · Answer

Another apply-style method, using mapply.

    mapply(function(...) {temp <- c(...); temp[!is.na(temp)][1]}, a, b, c)
    [1]  1  2 NA  4  6

This selects the first non-NA value when more than one exists. The last non-missing element could be selected using tail.
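A minimal sketch of the tail variant follows. The helper name last_non_na is hypothetical, and the explicit NA fallback is my addition: tail returns a zero-length vector when every value in a row is NA, which would otherwise stop mapply from simplifying the result to a vector.

```r
a <- c(1,  2,  NA, 4, NA)
b <- c(NA, NA, NA, 5, 6)
c <- c(7,  8,  NA, 9, 10)

# Hypothetical helper: last non-NA value among the arguments for one row
last_non_na <- function(...) {
  temp <- c(...)
  keep <- temp[!is.na(temp)]
  # Fall back to NA when the whole row is missing
  if (length(keep) == 0) NA_real_ else tail(keep, 1)
}

res_last <- mapply(last_non_na, a, b, c)
res_last
# [1]  7  8 NA  9 10
```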

Perhaps a bit more speed can be squeezed out of this alternative by using the bare-bones .mapply function, which looks a little different.

    unlist(.mapply(function(...) {temp <- c(...); temp[!is.na(temp)][1]},
                   dots=list(a, b, c), MoreArgs=NULL))
    [1]  1  2 NA  4  6

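.mapply differs from its non-dotted cousin in important ways.

It returns a list (like Map), so it must be wrapped in a function such as unlist or c to return a vector.
The set of arguments to be fed in parallel to the function in FUN must be given as a list to the dots argument.
Finally, unlike mapply, the MoreArgs argument has no default, so it must be explicitly set to NULL.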
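The three differences above can be checked side by side. This sketch reuses the function from the answer under the illustrative name first_non_na.

```r
a <- c(1,  2,  NA, 4, NA)
b <- c(NA, NA, NA, 5, 6)
c <- c(7,  8,  NA, 9, 10)

first_non_na <- function(...) {
  temp <- c(...)
  temp[!is.na(temp)][1]
}

# mapply: vectors passed through ..., result simplified to a vector
via_mapply <- mapply(first_non_na, a, b, c)

# .mapply: vectors passed as a list in dots, MoreArgs given explicitly
via_dot <- .mapply(first_non_na, dots = list(a, b, c), MoreArgs = NULL)

is.list(via_dot)                        # TRUE: no simplification is done
identical(unlist(via_dot), via_mapply)  # TRUE once the list is flattened
```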