How to select the rows with the maximum value in each group with dplyr?

Question

I would like to select the row with the maximum value in each group with dplyr.

First, I generate some random data to illustrate my question:

set.seed(1)
df <- expand.grid(list(A = 1:5, B = 1:5, C = 1:5))
df$value <- runif(nrow(df))

In plyr, I could use a custom function to select this row:

library(plyr)
ddply(df, .(A, B), function(x) x[which.max(x$value), ])

In dplyr, I am using the code below. It returns the maximum value, but not the rest of the row that holds it (column C in this case):

library(dplyr)
df %>%
  group_by(A, B) %>%
  summarise(max = max(value))

How can I achieve this? Thanks for any suggestions.

sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252
[3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C
[5] LC_TIME=English_Australia.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] dplyr_0.2  plyr_1.8.1

loaded via a namespace (and not attached):
[1] assertthat_0.1.0.99 parallel_3.1.0      Rcpp_0.11.1
[4] tools_3.1.0

thelatemail · Accepted Answer

Try this:

result <- df %>%
  group_by(A, B) %>%
  filter(value == max(value)) %>%
  arrange(A, B, C)

It seems to work:

identical(
  as.data.frame(result),
  ddply(df, .(A, B), function(x) x[which.max(x$value), ])
)
#[1] TRUE

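As @docendo pointed out in the comments, slice may be preferable here, as in @RoyalTS's answer below, if you want strictly one row per group. This answer returns multiple rows when several rows in a group share the same maximum value.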
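The difference between the two approaches only shows up when a group has a tied maximum. A minimal sketch, using a small hypothetical data frame ties_df (not part of the original question) in which group "a" has two rows sharing the maximum:

```r
library(dplyr)

# Hypothetical toy data: group "a" has a tied maximum (5 appears twice).
ties_df <- data.frame(
  g     = c("a", "a", "a", "b", "b"),
  id    = 1:5,
  value = c(5, 5, 1, 2, 3)
)

# filter() keeps every row equal to the group maximum, so all ties are returned.
all_max <- ties_df %>%
  group_by(g) %>%
  filter(value == max(value)) %>%
  ungroup()

# slice(which.max()) keeps exactly one row per group: the first maximum found.
one_max <- ties_df %>%
  group_by(g) %>%
  slice(which.max(value)) %>%
  ungroup()

nrow(all_max)  # 3 rows: two for "a", one for "b"
nrow(one_max)  # 2 rows: one per group
```

Which behaviour is right depends on whether tied rows are meaningful in your data or just duplicates to be collapsed.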

mnel · Answer

You can use top_n:

df %>% group_by(A, B) %>% top_n(n=1)

This ranks by the last column (value) and returns the top n = 1 rows.

Currently, you cannot change this default without causing an error (see https://github.com/hadley/dplyr/issues/426).
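In later dplyr releases (1.0.0 and newer, which is an assumption about the installed version and not something this thread states), top_n is superseded by slice_max, which takes the ordering column explicitly instead of defaulting to the last column. A sketch that reuses the question's df:

```r
library(dplyr)

# Same data as in the question.
set.seed(1)
df <- expand.grid(list(A = 1:5, B = 1:5, C = 1:5))
df$value <- runif(nrow(df))

# Name the ordering column explicitly; with_ties = FALSE guarantees
# at most one row per group even if the maximum is duplicated.
top_rows <- df %>%
  group_by(A, B) %>%
  slice_max(value, n = 1, with_ties = FALSE) %>%
  ungroup()

nrow(top_rows)  # 25: one row for each of the 5 x 5 combinations of A and B
```

Leaving with_ties at its default of TRUE would instead keep every tied row.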

RoyalTS · Answer

df %>% group_by(A,B) %>% slice(which.max(value))

nassimhddd · Answer

This more verbose solution gives finer control over what happens when the maximum value is duplicated (in this example, it takes one of the tied rows at random):

library(dplyr)
df %>%
  group_by(A, B) %>%
  mutate(the_rank = rank(-value, ties.method = "random")) %>%
  filter(the_rank == 1) %>%
  select(-the_rank)
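The same pattern handles ties differently depending on the ties.method passed to rank. A small sketch with hypothetical toy data (ties_df) and a hypothetical helper (top_by_rank), neither of which comes from the answer above:

```r
library(dplyr)

# Hypothetical toy data with a tied maximum in group "a".
ties_df <- data.frame(
  g     = c("a", "a", "b"),
  id    = 1:3,
  value = c(5, 5, 2)
)

# Hypothetical helper: keep the rank-1 rows of each group under a given ties.method.
top_by_rank <- function(data, ties) {
  data %>%
    group_by(g) %>%
    mutate(the_rank = rank(-value, ties.method = ties)) %>%
    filter(the_rank == 1) %>%
    select(-the_rank) %>%
    ungroup()
}

first_only <- top_by_rank(ties_df, "first")  # deterministic: the earliest tied row wins
all_ties   <- top_by_rank(ties_df, "min")    # every tied row gets rank 1 and is kept
```

Here "first" gives a reproducible single row per group, while "min" behaves like the filter(value == max(value)) approach and returns all tied rows.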

ksvrd · Answer

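For me, it helped to count the number of values per group and store the count table in a new object, then filter for the maximum count within each group defined by the first grouping variable. For example: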

count_table <- df %>%
  group_by(A, B) %>%
  count() %>%
  arrange(A, desc(n))

count_table %>%
  group_by(A) %>%
  filter(n == max(n))

or

count_table %>% group_by(A) %>% top_n(1, n)

Kalin · Answer

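More generally, you may want to get the "top" of the rows as sorted within a given group.

When a single value is being maximized, you are essentially sorting by one column only. However, it is often useful to sort hierarchically by multiple columns (for example, a date column and a time-of-day column).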

# Answering the question of getting row with max "value".
df %>%
  # Within each grouping of A and B values.
  group_by(A, B) %>%
  # Sort rows in descending order by "value" column.
  arrange(desc(value)) %>%
  # Pick the top 1 value
  slice(1) %>%
  # Remember to ungroup in case you want to do further work without grouping.
  ungroup()

# Answering an extension of the question of
# getting row with the max value of the lowest "C".
df %>%
  # Within each grouping of A and B values.
  group_by(A, B) %>%
  # Sort rows in ascending order by C, and then within that by
  # descending order by "value" column.
  arrange(C, desc(value)) %>%
  # Pick the one top row based on the sort
  slice(1) %>%
  # Remember to ungroup in case you want to do further work without grouping.
  ungroup()
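The date and time-of-day case mentioned above can be sketched in the same way. The readings data frame and its column names are hypothetical, made up purely for illustration, and the time column is assumed to hold zero-padded "HH:MM" strings so that sorting it as text matches chronological order:

```r
library(dplyr)

# Hypothetical log data: several readings per sensor, with separate date and time columns.
readings <- data.frame(
  sensor  = c("s1", "s1", "s1", "s2", "s2"),
  date    = as.Date(c("2020-01-01", "2020-01-02", "2020-01-02",
                      "2020-01-03", "2020-01-01")),
  time    = c("09:00", "08:30", "17:45", "12:00", "23:59"),
  reading = c(1, 2, 3, 4, 5)
)

latest <- readings %>%
  group_by(sensor) %>%
  # Newest date first, then the latest time within that date.
  arrange(desc(date), desc(time)) %>%
  # Keep the first row of each sensor in that order.
  slice(1) %>%
  ungroup()

latest$reading  # 3 for s1 (2020-01-02 17:45) and 4 for s2 (2020-01-03 12:00)
```

Sorting on a single combined date-time column would work just as well; the two-column form is shown only because it mirrors the hierarchical sort described above.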