How to one-hot encode several categorical variables in R

Question

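I'm working on a prediction problem and building a decision tree in R. I have several categorical variables and would like to one-hot encode them consistently in my training and test sets. I managed to do it somehow on my training data:

    temps <- X_train
    tt <- subset(temps, select = -output)
    oh <- data.frame(model.matrix(~ . -1, tt), CLASS = temps$output)

But I can't find a way to apply the same encoding to my test set. How can I do that?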

Esteban PS · Answer

I recommend using the dummyVars function in the caret package.

    library(caret)

    # stringsAsFactors = TRUE keeps gender and mood as factors on R >= 4.0
    customers <- data.frame(
      id = c(10, 20, 30, 40, 50),
      gender = c('male', 'female', 'female', 'male', 'female'),
      mood = c('happy', 'sad', 'happy', 'sad', 'happy'),
      outcome = c(1, 1, 0, 0, 0),
      stringsAsFactors = TRUE)
    customers
      id gender  mood outcome
    1 10   male happy       1
    2 20 female   sad       1
    3 30 female happy       0
    4 40   male   sad       0
    5 50 female happy       0

    # dummify the data
    dmy <- dummyVars(" ~ .", data = customers)
    trsf <- data.frame(predict(dmy, newdata = customers))
    trsf
      id gender.female gender.male mood.happy mood.sad outcome
    1 10             0           1          1        0       1
    2 20             1           0          0        1       1
    3 30             1           0          1        0       0
    4 40             0           1          0        1       0
    5 50             1           0          1        0       0

Example source

Apply the same procedure to both the training set and the validation set.
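One way to read "the same procedure" is to build the dummyVars object from the training set only and then call predict() with that same object on each set, so both end up with identical columns. The following is a minimal sketch under that assumption; `train_df` and `test_df` are hypothetical data frames made up for illustration, not the asker's data.

```r
library(caret)

# Hypothetical training and test sets; the test set has no "sad" rows.
train_df <- data.frame(
  gender = factor(c("male", "female", "female", "male")),
  mood   = factor(c("happy", "sad", "happy", "sad"))
)
test_df <- data.frame(
  gender = factor(c("female", "male")),
  mood   = factor(c("happy", "happy"))
)

# Learn the encoding (variables and factor levels) from the training set only.
dmy <- dummyVars(~ ., data = train_df)

# Reuse the same fitted object for both sets.
oh_train <- data.frame(predict(dmy, newdata = train_df))
oh_test  <- data.frame(predict(dmy, newdata = test_df))

# Both results should have the same columns in the same order.
identical(names(oh_train), names(oh_test))
```

Fitting on the training set alone also means the encoding does not depend on which categories happen to appear in the test set.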

Roman · Answer

Code

    library(data.table)
    library(mltools)

    customers_1h <- one_hot(as.data.table(customers))

Result

    > customers_1h
       id gender_female gender_male mood_happy mood_sad outcome
    1: 10             0           1          1        0       1
    2: 20             1           0          0        1       1
    3: 30             1           0          1        0       0
    4: 40             0           1          0        1       0
    5: 50             1           0          1        0       0

Data

    # stringsAsFactors = TRUE makes gender and mood factors on R >= 4.0
    customers <- data.frame(
      id = c(10, 20, 30, 40, 50),
      gender = c('male', 'female', 'female', 'male', 'female'),
      mood = c('happy', 'sad', 'happy', 'sad', 'happy'),
      outcome = c(1, 1, 0, 0, 0),
      stringsAsFactors = TRUE)

D A Wells · Answer

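Here is a simple solution for one-hot encoding a category without using any packages.

Solution

    model.matrix(~0+category)

The categorical variable needs to be a factor. The factor levels must be the same in the training and test data; check with levels(train$category) and levels(test$category). It does not matter if some levels do not occur in the test set.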
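If the levels differ, one possible way to align them is to rebuild the test factor with the training levels. This is a minimal base R sketch; the `train` and `test` data frames and their `category` values are hypothetical.

```r
# Hypothetical training and test data with one categorical column.
train <- data.frame(category = c("a", "b", "c", "a"))
test  <- data.frame(category = c("b", "a"))

# Make the column a factor in the training data,
# then reuse exactly those levels for the test data.
train$category <- factor(train$category)
test$category  <- factor(test$category, levels = levels(train$category))

# The level sets now match.
identical(levels(train$category), levels(test$category))

# Both matrices get one column per training level,
# including "c", which never occurs in the test data.
oh_train <- model.matrix(~ 0 + category, data = train)
oh_test  <- model.matrix(~ 0 + category, data = test)
colnames(oh_test)
```

Note that a test value that never occurred in the training data would become NA under this approach, so it is worth checking for NAs after the conversion.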

Example

Here is an example using the iris dataset.

    data(iris)
    # Split into train and test sets.
    train <- sample(1:nrow(iris), 100)
    test <- -1 * train

    iris[test, ]
        Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
    34           5.5         4.2          1.4         0.2    setosa
    106          7.6         3.0          6.6         2.1 virginica
    112          6.4         2.7          5.3         1.9 virginica
    127          6.2         2.8          4.8         1.8 virginica
    132          7.9         3.8          6.4         2.0 virginica

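model.matrix() creates a column for each level of the factor, even if that level is not present in the data. A zero indicates that the row does not belong to that level; a one indicates that it does. Adding the zero to the formula specifies that you do not want an intercept or reference level, and is equivalent to -1.

    oh_train <- model.matrix(~0 + iris[train, 'Species'])
    oh_test <- model.matrix(~0 + iris[test, 'Species'])

    # Renaming the columns to be more concise.
    attr(oh_test, "dimnames")[[2]] <- levels(iris$Species)
    oh_test
      setosa versicolor virginica
    1      1          0         0
    2      0          0         1
    3      0          0         1
    4      0          0         1
    5      0          0         1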
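The example above encodes a single factor. With several factors in one formula, model.matrix() gives a full set of indicator columns only to the first factor and drops a reference level from the others. One possible way around this, sketched here with a hypothetical data frame `df`, is to pass full (identity) contrasts for every factor through the contrasts.arg argument.

```r
# Hypothetical data frame with two factor columns.
df <- data.frame(
  gender = factor(c("male", "female", "female", "male", "female")),
  mood   = factor(c("happy", "sad", "happy", "sad", "happy"))
)

# contrasts(x, contrasts = FALSE) returns an identity matrix,
# i.e. one indicator column per level with no reference level dropped.
full_contrasts <- lapply(df[sapply(df, is.factor)], contrasts, contrasts = FALSE)

# One-hot encode every factor in df without an intercept column.
mm <- model.matrix(~ . - 1, data = df, contrasts.arg = full_contrasts)
colnames(mm)
```

Each row of `mm` then contains exactly one 1 per original factor.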

P.S. It is generally preferable to include all of the categories in both the training and the test data. But that's none of my business.

Shubham Joshi · Answer

Hi, here is my version of the same. This function encodes all categorical variables that are factors, removes one of the dummy variables for each to avoid the dummy variable trap, and returns a new data frame with the encoding:

    onehotencoder <- function(df_orig) {
      df <- df_orig
      # Names of all factor columns in the input.
      factor_cols <- names(df)[sapply(df, is.factor)]
      for (col in factor_cols) {
        # Indicator matrix with one column per level of this factor.
        dummy_matx <- data.frame(model.matrix(~ . - 1, data = df[col]))
        # Drop one dummy column to avoid the dummy variable trap;
        # drop = FALSE keeps a data frame when the factor has only two levels.
        dummy_matx <- dummy_matx[, -2, drop = FALSE]
        # Replace the original factor column with its dummy columns.
        df[col] <- NULL
        df <- cbind(df, dummy_matx)
      }
      return(df)
    }