Extending runs of a specific length

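I have a 640 x 2500 data frame containing numeric values and some NA values. My goal is to find, in each row, every run of at least 75 consecutive NA values. For each such run, I also want to replace the 50 cells before it and the 50 cells after it with NA.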

Below is a scaled-down example for a single row.

x <- c(1, 3, 4, 5, 4, 3, NA, NA, NA, NA, 6, 9, 3, 2, 4, 3)
#       run of four NA:  ^   ^   ^   ^

I want to detect the run of four consecutive NAs and replace the three values before the run and the three values after it with NA:

c(1, 3, 4, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2, 4, 3)
#          ^   ^   ^                   ^   ^   ^

I first tried to identify the consecutive NAs with rle, but running rle(is.na(df)) gives the error 'x' must be a vector of an atomic type. This happens even when I select a single row.
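For what it's worth, the error seems to come from the fact that is.na() on a data frame returns a logical matrix, while rle() wants a plain vector. Below is a minimal sketch, assuming a one-row data frame built from the example above; the names df, row1 and runs are only illustrative.

```r
# One-row data frame built from the scaled-down example (illustrative data)
df <- as.data.frame(matrix(
  c(1, 3, 4, 5, 4, 3, NA, NA, NA, NA, 6, 9, 3, 2, 4, 3),
  nrow = 1
))

# is.na() on a data frame (even a single row of it) returns a logical matrix,
# which rle() rejects because it is not a plain vector
is.matrix(is.na(df[1, ]))

# Capture the error message instead of stopping the script
msg <- tryCatch(rle(is.na(df)), error = function(e) conditionMessage(e))

# Flatten the row to an atomic vector first, then run-length encode the NA mask
row1 <- unlist(df[1, ], use.names = FALSE)
runs <- rle(is.na(row1))
runs$lengths  # 6 4 6
runs$values   # FALSE TRUE FALSE
```

The second entry of runs (length 4, value TRUE) is the NA run; cumsum(runs$lengths) gives the end position of each run, which is what the padding step needs.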

Unfortunately, I have no idea how to approach the next step, which is converting the 50 cells before and after each run to NA.

Thanks in advance.

NickB

Here is my solution, though I wonder whether there is a neater one than mine.

library(data.table)
library(dplyr) # provides %>%, as_tibble() and slice()
df <- matrix(nrow = 1,ncol = 16)
df[1,] <- c(1, 3, 4, 5, 4, 3, NA, NA, NA, NA, 6, 9, 3, 2, 4, 3)
df <- df %>%
  as.data.table() # dataset created

# A function to do what you need
NA_replacer <- function(x){
  Vector <- unlist(x) # pull the values into a vector

  NAs <- which(is.na(Vector)) # locate the positions of the NAs
  NAs_Position_1 <- cumsum(c(1, diff(NAs) - 1)) # Run ID: stays constant within a stretch of consecutive NAs
  NAs_Position_2 <- rle(NAs_Position_1) # Length of each stretch

  Run_IDs <- with(NAs_Position_2, values[lengths == 4]) # IDs of the stretches that contain exactly 4 NAs

  for(ID in Run_IDs){ # Handle each stretch of 4 NAs separately (does nothing if there is none)
    Run <- NAs[NAs_Position_1 == ID] # positions of the NAs in this stretch
    Before <- seq(Run[1] - 3, Run[1] - 1, 1) # the 3 positions occurring before the first NA
    After <- seq(Run[length(Run)] + 1, Run[length(Run)] + 3, 1) # the 3 positions occurring after the last NA
    Replace <- c(Before, After)
    Vector[Replace[Replace >= 1 & Replace <= length(Vector)]] <- NA # stay inside the vector
  }
  Vector
}

> df # the original dataset
   V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16
1:  1  3  4  5  4  3 NA NA NA  NA   6   9   3   2   4   3

# the transformed dataset
apply(df, 1, function(x) NA_replacer(x)) %>%
  as.data.table() %>%
  data.table::transpose()

   V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16
1:  1  3  4 NA NA NA NA NA NA  NA  NA  NA  NA   2   4   3
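
As for a neater version, one possibility is to run rle() directly on the NA mask and widen each qualifying run by a fixed pad. This is only a sketch under my own assumptions: extend_na_runs, min_run and pad are hypothetical names, runs qualify when they have at least min_run NAs (as the question asks), and the padding is clamped at the ends of the vector.

```r
# Hypothetical helper: widen every NA run of at least `min_run` cells
# by `pad` cells on each side, without running past the vector ends
extend_na_runs <- function(x, min_run, pad) {
  r <- rle(is.na(x))
  ends <- cumsum(r$lengths)        # last position of each run
  starts <- ends - r$lengths + 1   # first position of each run
  long <- which(r$values & r$lengths >= min_run)  # NA runs that are long enough
  for (i in long) {
    lo <- max(1, starts[i] - pad)
    hi <- min(length(x), ends[i] + pad)
    x[lo:hi] <- NA
  }
  x
}

x <- c(1, 3, 4, 5, 4, 3, NA, NA, NA, NA, 6, 9, 3, 2, 4, 3)
extend_na_runs(x, min_run = 4, pad = 3)
# 1 3 4 NA NA NA NA NA NA NA NA NA NA 2 4 3
```

Because the run boundaries are computed once from the original vector, a short NA run that happens to sit next to a padded region does not trigger further padding.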

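As an aside, the speed is very good on a made-up data frame of size 640 * 2500, where stretches of 75 or more NAs have to be located and the 50 values before and after each one replaced with NA.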

df <- matrix(nrow = 640,ncol = 2500)

for(i in 1:nrow(df)){
  df[i,] <- c(1:100,rep(NA,75),rep(1,2325))
}

NA_replacer <- function(x){
  Vector <- unlist(x) # pull the values into a vector

  NAs <- which(is.na(Vector)) # locate the positions of the NAs
  NAs_Position_1 <- cumsum(c(1, diff(NAs) - 1)) # Run ID: stays constant within a stretch of consecutive NAs
  NAs_Position_2 <- rle(NAs_Position_1) # Length of each stretch

  Run_IDs <- with(NAs_Position_2, values[lengths >= 75]) # IDs of the stretches that contain 75 or more NAs

  for(ID in Run_IDs){ # Handle each qualifying stretch separately (does nothing if there is none)
    Run <- NAs[NAs_Position_1 == ID] # positions of the NAs in this stretch
    Before <- seq(Run[1] - 50, Run[1] - 1, 1) # the 50 positions occurring before the first NA
    After <- seq(Run[length(Run)] + 1, Run[length(Run)] + 50, 1) # the 50 positions occurring after the last NA
    Replace <- c(Before, After)
    Vector[Replace[Replace >= 1 & Replace <= length(Vector)]] <- NA # stay inside the row
  }
  Vector
}

# Check how many NAs are present in the first row of the dataset prior to applying the function
which(is.na(df %>%
              as_tibble() %>%
              slice(1) %>%
              unlist())) %>% # run the code till here to get the indices of the NAs
  length()

[1] 75

df <- apply(df, 1, function(x) NA_replacer(x)) %>%
  as.data.table() %>%
  data.table::transpose()

# Check how many NAs are present in the first row post applying the function
which(is.na(df %>%
              slice(1) %>%
              unlist())) %>% # run the code till here to get the indices of the NAs
  length()

[1] 175

system.time(df <- apply(df, 1, function(x) NA_replacer(x)) %>%
              as.data.table() %>%
              data.table::transpose())
   user  system elapsed
  0.216   0.002   0.220
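
If the data can stay a plain matrix, the row-wise step can probably be done in base R alone, since t(apply(m, 1, f)) returns a matrix with the original orientation. The following is a rough sketch rather than a tested replacement for the code above: pad_na_runs, m and out are names I made up, and the helper marks cells by position offsets instead of looping over runs.

```r
# Hypothetical helper: set to NA every cell within `pad` positions of an
# NA run that is at least `min_run` cells long
pad_na_runs <- function(x, min_run = 75, pad = 50) {
  r <- rle(is.na(x))
  # TRUE for each cell that sits inside a sufficiently long NA run
  in_long_run <- rep(r$values & r$lengths >= min_run, r$lengths)
  idx <- which(in_long_run)
  if (length(idx) > 0) {
    # All positions within `pad` of a long-run cell, clamped to the row
    near <- as.vector(outer(idx, -pad:pad, "+"))
    near <- near[near >= 1 & near <= length(x)]
    x[near] <- NA
  }
  x
}

# Same made-up 640 x 2500 data as above, built without a loop
m <- matrix(rep(c(1:100, rep(NA, 75), rep(1, 2325)), times = 640),
            nrow = 640, byrow = TRUE)

# apply() returns one column per input row, so transpose back
out <- t(apply(m, 1, pad_na_runs))

dim(out)                      # 640 2500
unique(rowSums(is.na(out)))   # 175
```

rowSums(is.na(out)) checks every row at once, giving 75 original NAs plus 50 on each side.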