How to compute a (co-)occurrence matrix from a data frame with multiple columns using R

Question

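I am new to R and currently working with collaboration data in the form of an edge list with 32 columns and around 200,000 rows. I want to create a (co-)occurrence matrix based on the interactions between countries. However, I want to count the number of interactions by the total number of occurrences of an object.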

Basic example of the desired outcome

If "England" occurs three times in one row and "China" only once, the result should be the following matrix:

            England China
    England       3     3
    China         3     1
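One way to read this single-row example is sketched below. It assumes that the off-diagonal cell is the larger of the two per-row counts, which is the rule the dplyr answer further down applies with `pmax`; the variable names are illustrative only.

```r
# One row of the edge list: England three times, China once
row <- c("England", "England", "England", "China")

# Named vector of per-row counts (c() drops the table class, keeps the names)
n <- c(table(row))

# Assumed rule: cell (i, j) is the larger of the two counts.
# On the diagonal this is simply the country's own count.
m <- outer(n, n, pmax)
m
```

Under this reading the England-China cell is 3 and the China-China cell is 1, as in the matrix above (with rows and columns in alphabetical order).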

Reproducible example

    df <- data.frame(ID = c(1,2,3,4),
                     V1 = c("England", "England", "China", "England"),
                     V2 = c("Greece", "England", "Greece", "England"),
                     V32 = c("USA", "China", "Greece", "England"))

An example data frame therefore currently looks like this:

    ID  V1       V2       ...  V32
    1   England  Greece        USA
    2   England  England       China
    3   China    Greece        Greece
    4   England  England       England
    .
    .
    .

Desired outcome

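I want to count the (co-)occurrences row by row, independent of order, and obtain a (co-)occurrence matrix that accounts for the low frequency of edge loops (for example, England-England). This should lead to the following result: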

            China England Greece USA
    China       2       2      2   0
    England     2       6      1   1
    Greece      2       1      3   1
    USA         0       1      1   1

What I have tried so far

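I used igraph to get an adjacency matrix with co-occurrences. However, it counts (as it is supposed to) no more than two interactions of the same two objects, which in some cases leaves me with values far below the actual frequency of an object per row/publication.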

    df <- data.frame(ID = c(1,2,3,4),
                     V1 = c("England", "England", "China", "England"),
                     V2 = c("Greece", "England", "Greece", "England"),
                     V32 = c("USA", "China", "Greece", "England"))

    # remove ID column
    df[1] <- list(NULL)

    # calculate co-occurrences and return as dataframe
    library(igraph)
    library(Matrix)
    countrydf <- graph.data.frame(df)
    countrydf2 <- as_adjacency_matrix(countrydf, type = "both", edges = FALSE)
    countrydf3 <- as.data.frame(as.matrix(forceSymmetric(countrydf2)))

            China England Greece USA
    China       0       0      1   0
    England     0       2      1   0
    Greece      1       1      0   0
    USA         0       0      0   0

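I have looked at approaches using base and/or dplyr and/or table and/or reshape2, similar to [1], [2], [3], [4] or [5], but nothing has done the trick so far and I was unable to adapt the code to my needs. I also tried to use [6] as a basis, but the same problem applies there.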

    library(tidyr)
    library(dplyr)
    library(stringr)

    # collapse observations into one column
    df2 <- df %>% unite(concat, V1:V32, sep = ",")

    # calculate weights
    df3 <- df2$concat %>%
      str_split(",") %>%
      lapply(function(x) {
        expand.grid(x, x, x, x, w = length(x), stringsAsFactors = FALSE)
      }) %>%
      bind_rows

    df4 <- apply(df3[, -5], 1, sort) %>%
      t %>%
      data.frame(stringsAsFactors = FALSE) %>%
      mutate(w = df3$w)

I would be glad if someone could point me in the right direction.

chinsoon12 · Answer

An option using base::table:

    df <- data.frame(ID = c(1,2,3,4),
                     V1 = c("England", "England", "China", "England"),
                     V2 = c("Greece", "England", "Greece", "England"),
                     V3 = c("USA", "China", "Greece", "England"))

    #get paired combi and remove those from same country
    pairs <- as.data.frame(do.call(rbind,
        by(df, df$ID, function(x) t(combn(as.character(x[-1L]), 2L)))))
    pairs <- pairs[pairs$V1!=pairs$V2, ]

    #repeat data frame with columns swap so that
    #upper and lower tri have same numbers and all countries are shown
    pairs <- rbind(pairs, data.frame(V1=pairs$V2, V2=pairs$V1))

    #tabulate pairs
    tab <- table(pairs)

    #set diagonals to be the count of countries
    cnt <- c(table(unlist(df[-1L])))
    diag(tab) <- cnt[names(diag(tab))]

    tab

Output:

             V2
    V1        China England Greece USA
      China       2       2      2   0
      England     2       6      1   1
      Greece      2       1      3   1
      USA         0       1      1   1
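A caveat worth checking before applying pair-based counting to the full data: when two different countries each appear more than once in the same row, counting `combn` pairs gives the product of their counts, while a max-of-counts rule gives a smaller number. The row below is a hypothetical one that does not occur in the example data, where both readings happen to agree.

```r
# Hypothetical row in which both countries are repeated
row <- c("England", "England", "China", "China")

# Pair-based counting: every England-China combination is one pair
p <- t(combn(row, 2))
cross_pairs <- sum(p[, 1] != p[, 2])   # 2 * 2 = 4 cross-country pairs

# Max-of-counts rule: the larger of the two per-row counts
n <- c(table(row))
max_rule <- max(n)                     # 2

c(pairs = cross_pairs, max = max_rule)
```

Which of the two numbers is wanted depends on how an interaction is defined for the data, so this is only meant to make the difference visible.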

Nareman Darwish · Answer

Here is an approach using the dplyr and tidyr packages. The whole idea is to build a data frame with the per-row occurrence of each country and then join it to itself.

    library(dplyr)

    # Create data frame sample
    df <- data.frame(ID = c(1,2,3,4),
                     V1 = c("England", "England", "China", "England"),
                     V2 = c("Greece", "England", "Greece", "England"),
                     V32 = c("USA", "China", "Greece", "England"),
                     stringsAsFactors = FALSE)

    # Get the occurrence of each country in every row.
    row_occurance <- df %>%
      tidyr::gather(key = "identifier", value = "country", -ID) %>%
      group_by(ID, country) %>%
      count()

    row_occurance %>%
      # Join row_occurance on itself to simulate the matrix
      left_join(row_occurance, by = "ID") %>%
      # Get the highest occurrence row-wise, to handle the case where a
      # country name is repeated within the same row
      mutate(Occurance = pmax(n.x, n.y)) %>%
      # Group by the two countries
      group_by(country.x, country.y) %>%
      # Sum the occurrences of the two countries together
      summarise(Occurance = sum(Occurance)) %>%
      # Spread the data to put it in matrix format
      tidyr::spread(key = "country.y", value = "Occurance", fill = 0)

    # # A tibble: 4 x 5
    # # Groups:   country.x [4]
    #   country.x China England Greece   USA
    #   <chr>     <dbl>   <dbl>  <dbl> <dbl>
    #   China         2       2      2     0
    #   England       2       6      1     1
    #   Greece        2       1      3     1
    #   USA           0       1      1     1
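As a cross-check of the join-and-`pmax` idea, the same rule can be sketched in base R without any packages. This is only a sketch under the assumption that a cell is the sum over rows of the larger of the two per-row counts, counted only when both countries are present in that row; `counts`, `result` and `present` are names made up for this illustration.

```r
df <- data.frame(ID = c(1,2,3,4),
                 V1 = c("England", "England", "China", "England"),
                 V2 = c("Greece", "England", "Greece", "England"),
                 V32 = c("USA", "China", "Greece", "England"),
                 stringsAsFactors = FALSE)

# Row-by-country count table: counts[i, j] is how often country j
# appears in row i. unlist() stacks the columns one after another,
# so the IDs are repeated once per country column.
counts <- table(rep(df$ID, times = ncol(df) - 1),
                unlist(df[-1], use.names = FALSE))

countries <- colnames(counts)
result <- matrix(0, length(countries), length(countries),
                 dimnames = list(countries, countries))

for (i in seq_len(nrow(counts))) {
  n <- counts[i, ]               # named counts for this row
  present <- n > 0
  m <- outer(n, n, pmax)         # larger of the two counts per pair
  m[!present, ] <- 0             # ignore countries absent from this row
  m[, !present] <- 0
  result <- result + m
}

result
```

For the example data this gives the same matrix as the two answers. The explicit loop is easy to follow but may be slow on roughly 200,000 rows, so it is better suited to validating a faster approach on a subset.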