How can I match fuzzy match strings from two datasets?

Question

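I've been working on a way to join two datasets based on an imperfect string, such as a company name. In the past I had to match two very dirty lists: one had names and financial information, the other had names and addresses. Neither had a unique ID to match on. Assume that cleaning has already been applied and that typos and insertions may remain.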

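So far, agrep is the most effective tool I have found. agrep (a base R function rather than a package) uses the Levenshtein edit distance, which counts the deletions, insertions, and substitutions needed to turn one string into another, and returns the strings that fall within a chosen maximum distance of the pattern.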
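To make the edit-distance idea concrete, here is a minimal sketch using adist() from base R's utils package, which computes the generalized Levenshtein distance. The vectors x and y are illustrative names, not part of the question's code.

```r
# Illustrative inputs: three names and their "dirty" counterparts.
x <- c("Ace Co", "Bayes", "asd")
y <- c("Ace Co.", "Bayes Inc.", "asdf")

# adist() returns a matrix of edit distances: rows follow x, columns follow y.
dist_mat <- adist(x, y)
dimnames(dist_mat) <- list(x, y)
print(dist_mat)

# The diagonal holds the distance for each same-position pair.
pair_dist <- diag(dist_mat)
print(pair_dist)
```

Each entry counts single-character insertions, deletions, and substitutions, so a long suffix such as " Inc." costs one edit per character.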

However, I've had trouble extending this command from a single value to an entire data frame. I've crudely used a for loop to repeat the agrep call, but there has to be an easier way.

See the following code:

    a<-data.frame(name=c('Ace Co','Bayes', 'asd', 'Bcy', 'Baes', 'Bays'),price=c(10,13,2,1,15,1))
    b<-data.frame(name=c('Ace Co.','Bayes Inc.','asdf'),qty=c(9,99,10))

    for (i in 1:6){
        a$x[i] = agrep(a$name[i], b$name, value = TRUE, max = list(del = 0.2, ins = 0.3, sub = 0.4))
        a$Y[i] = agrep(a$name[i], b$name, value = FALSE, max = list(del = 0.2, ins = 0.3, sub = 0.4))
    }

C8H10N4O2 · Accepted Answer

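The solution depends on the desired cardinality of your matching of a to b. If it's one-to-one, you get three matches, one for each name in b. If it's many-to-one, you get six, one for each name in a.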

One-to-one case (requires an assignment algorithm):

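When I've had to do this before, I treated it as an assignment problem with a distance matrix and an assignment heuristic (the greedy assignment used below). If you want an "optimal" solution, you'd be better off with optim.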

I'm not familiar with agrep, but here's an example using stringdist for your distance matrix.

    library(stringdist)
    d <- expand.grid(a$name,b$name) # Distance matrix in long form
    names(d) <- c("a_name","b_name")
    d$dist <- stringdist(d$a_name,d$b_name, method="jw") # String edit distance (use your favorite function here)

    # Greedy assignment heuristic (Your favorite heuristic here)
    greedyAssign <- function(a,b,d){
      x <- numeric(length(a)) # assignment variable: 0 for unassigned but assignable,
      # 1 for already assigned, -1 for unassigned and unassignable
      while(any(x==0)){
        min_d <- min(d[x==0]) # identify closest pair, arbitrarily selecting 1st if multiple pairs
        a_sel <- a[d==min_d & x==0][1]
        b_sel <- b[d==min_d & a == a_sel & x==0][1]
        x[a==a_sel & b == b_sel] <- 1
        x[x==0 & (a==a_sel|b==b_sel)] <- -1
      }
      cbind(a=a[x==1],b=b[x==1],d=d[x==1])
    }
    data.frame(greedyAssign(as.character(d$a_name),as.character(d$b_name),d$dist))

Produces the assignment:

           a          b       d
    1 Ace Co    Ace Co. 0.04762
    2  Bayes Bayes Inc. 0.16667
    3    asd       asdf 0.08333

I'm sure there's a much more elegant way to do the greedy assignment heuristic, but the above works for me.
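As one possible tidier take on the greedy heuristic, the sketch below works on a wide distance matrix instead of the long form. It is not the answer's code: greedy_match is a hypothetical helper, and base R's adist() (plain edit distance) stands in for stringdist() so that the example needs no extra packages.

```r
# Sample names from the question.
a_names <- c("Ace Co", "Bayes", "asd", "Bcy", "Baes", "Bays")
b_names <- c("Ace Co.", "Bayes Inc.", "asdf")

# Assumption: adist() replaces stringdist(); rows are a_names, columns b_names.
dist_mat <- adist(a_names, b_names)

# Hypothetical helper: repeatedly take the closest remaining pair, then
# retire its row and column so neither name can be matched again.
greedy_match <- function(dist_mat) {
  n_pairs <- min(dim(dist_mat))
  picks <- matrix(NA_integer_, nrow = n_pairs, ncol = 2)
  work <- dist_mat
  for (k in seq_len(n_pairs)) {
    # which() scans column by column, so ties go to the first pair found.
    picks[k, ] <- which(work == min(work, na.rm = TRUE), arr.ind = TRUE)[1, ]
    work[picks[k, 1], ] <- NA
    work[, picks[k, 2]] <- NA
  }
  picks
}

picks <- greedy_match(dist_mat)
assignment <- data.frame(a_name = a_names[picks[, 1]],
                         b_name = b_names[picks[, 2]],
                         dist = dist_mat[picks])
print(assignment)
```

Because the smaller dimension bounds the number of pairs, the loop runs exactly three times here. Like any greedy scheme, it does not guarantee the assignment with the smallest total distance.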

Many-to-one case (not an assignment problem):

    do.call(rbind, unname(by(d, d$a_name, function(x) x[x$dist == min(x$dist),])))

Produces the result:

       a_name     b_name    dist
    1  Ace Co    Ace Co. 0.04762
    11   Baes Bayes Inc. 0.20000
    8   Bayes Bayes Inc. 0.16667
    12   Bays Bayes Inc. 0.20000
    10    Bcy Bayes Inc. 0.37778
    15    asd       asdf 0.08333
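For readers without stringdist installed, the many-to-one lookup can be approximated in base R. This is a sketch under a swapped-in metric: adist() edit counts divided by the length of the longer string, which is my own normalization and not the Jaro-Winkler distance used above, so the distance values will differ from that table.

```r
# Sample names from the question.
a_names <- c("Ace Co", "Bayes", "asd", "Bcy", "Baes", "Bays")
b_names <- c("Ace Co.", "Bayes Inc.", "asdf")

# Raw edit counts, rows = a_names, columns = b_names.
edits <- adist(a_names, b_names)

# Assumption: divide by the longer string's length so that a long suffix
# such as " Inc." does not dominate the comparison.
longest <- outer(nchar(a_names), nchar(b_names), pmax)
norm_dist <- edits / longest

# For every name in a, pick the column (name in b) with the smallest distance.
best <- apply(norm_dist, 1, which.min)
nearest <- data.frame(a_name = a_names,
                      b_name = b_names[best],
                      dist = round(norm_dist[cbind(seq_along(a_names), best)], 3))
print(nearest)
```

Several names in a may map to the same name in b, which is what makes this many-to-one rather than an assignment problem.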

Edit: use method="jw" to produce the desired results. See help("stringdist-package").

Arthur Yip · Answer

Here is a solution using the fuzzyjoin package. It uses dplyr-like syntax and offers stringdist as one of its possible types of fuzzy matching.

As suggested by C8H10N4O2, the stringdist method = "jw" produces the best matches for your example.

As suggested by dgrtwo, the developer of fuzzyjoin, I used a large max_dist and then dplyr::group_by and dplyr::top_n to keep only the best match, the one with the smallest distance.

    a <- data.frame(name = c('Ace Co', 'Bayes', 'asd', 'Bcy', 'Baes', 'Bays'),
                    price = c(10, 13, 2, 1, 15, 1))
    b <- data.frame(name = c('Ace Co.', 'Bayes Inc.', 'asdf'),
                    qty = c(9, 99, 10))

    library(fuzzyjoin); library(dplyr);

    stringdist_join(a, b,
                    by = "name",
                    mode = "left",
                    ignore_case = FALSE,
                    method = "jw",
                    max_dist = 99,
                    distance_col = "dist") %>%
      group_by(name.x) %>%
      top_n(1, -dist)

    #> # A tibble: 6 x 5
    #> # Groups:   name.x [6]
    #>   name.x price     name.y   qty       dist
    #>   <fctr> <dbl>     <fctr> <dbl>      <dbl>
    #> 1 Ace Co    10    Ace Co.     9 0.04761905
    #> 2  Bayes    13 Bayes Inc.    99 0.16666667
    #> 3    asd     2       asdf    10 0.08333333
    #> 4    Bcy     1 Bayes Inc.    99 0.37777778
    #> 5   Baes    15 Bayes Inc.    99 0.20000000
    #> 6   Bays     1 Bayes Inc.    99 0.20000000

lawyeR · Answer

I am not sure whether this is a useful direction for you, John Andrews, but it gives you another tool (from the RecordLinkage package) and might help.

    install.packages("ipred")
    install.packages("evd")
    install.packages("RSQLite")
    install.packages("ff")
    install.packages("ffbase")
    install.packages("ada")
    install.packages("~/RecordLinkage_0.4-1.tar.gz", repos = NULL, type = "source")

    require(RecordLinkage) # it is not on CRAN so you must load source from Github, and there are 7 dependent packages, as per above

    compareJW <- function(string, vec, cutoff) {
      require(RecordLinkage)
      jarowinkler(string, vec) > cutoff
    }

    a<-data.frame(name=c('Ace Co','Bayes', 'asd', 'Bcy', 'Baes', 'Bays'),price=c(10,13,2,1,15,1))
    b<-data.frame(name=c('Ace Co.','Bayes Inc.','asdf'),qty=c(9,99,10))
    a$name <- as.character(a$name)
    b$name <- as.character(b$name)

    test <- compareJW(string = a$name, vec = b$name, cutoff = 0.8) # pick your level of cutoff, of course
    data.frame(name = a$name, price = a$price, test = test)

    > data.frame(name = a$name, price = a$price, test = test)
        name price  test
    1 Ace Co    10  TRUE
    2  Bayes    13  TRUE
    3    asd     2  TRUE
    4    Bcy     1 FALSE
    5   Baes    15  TRUE
    6   Bays     1 FALSE

YummyLin Yang · Answer

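I agree with the answer above ("Not familiar with AGREP but here's example using stringdist for your distance matrix"). However, adding the signature function below, taken from "Merging Data Sets Based on Partially Matched Data Elements", should make the matching more accurate, because the Levenshtein (LV) distance is computed from character positions, additions, and deletions.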

    ##Here's where the algorithm starts...
    ##I'm going to generate a signature from country names to reduce some of the minor differences between strings
    ##In this case, convert all characters to lower case, sort the words alphabetically, and then concatenate them with no spaces.
    ##So for example, United Kingdom would become kingdomunited
    ##We might also remove stopwords such as 'the' and 'of'.
    signature=function(x){
      sig=paste(sort(unlist(strsplit(tolower(x)," "))),collapse='')
      return(sig)
    }
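A short usage sketch shows what the signature buys you. The function is repeated here under the hypothetical name name_signature so that the block runs on its own and does not mask methods::signature. Comparing the results with base R's adist() is my own illustration, not part of the quoted source.

```r
# Same logic as the signature function: lower-case the string, sort its
# words alphabetically, and concatenate them without spaces.
name_signature <- function(x) {
  paste(sort(unlist(strsplit(tolower(x), " "))), collapse = "")
}

# The function handles one string at a time, so map it over a vector.
raw_names <- c("United Kingdom", "Kingdom United", "united  kingdom")
sigs <- vapply(raw_names, name_signature, character(1), USE.NAMES = FALSE)
print(sigs)

# Word order and case changes cost several edits on the raw strings but none
# on the signatures, so the edit distance only reflects real differences.
raw_dist <- adist("United Kingdom", "Kingdom United")[1, 1]
sig_dist <- adist(sigs[1], sigs[2])[1, 1]
print(c(raw = raw_dist, signature = sig_dist))
```

The doubled space in the third name yields an empty token that disappears on concatenation, so all three inputs collapse to the same signature.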

user3909910 · Answer

I use lapply in these situations:

    yournewvector <- lapply(yourvector$yourvariable, agrep, yourothervector$yourothervariable, max.distance=0.01)

Writing the result out as a csv is then not so straightforward:

    write.csv(matrix(yournewvector, ncol=1), file="yournewvector.csv", row.names=FALSE)
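The CSV step is awkward because lapply returns a list whose elements can hold zero, one, or several matches. One way around this, sketched below with made-up inputs (queries, candidates, and the "; " separator are all assumptions), is to collapse each element to a single string before writing.

```r
# Illustrative inputs; "zzz" is chosen so that it has no approximate match.
queries <- c("Ace Co", "Bayes", "zzz")
candidates <- c("Ace Co.", "Bayes Inc.", "asdf")

# One agrep() call per query; value = TRUE returns the matched strings.
matches <- lapply(queries, agrep, x = candidates,
                  max.distance = 0.1, value = TRUE)

# Collapse each list element to one string, using NA when nothing matched,
# so the result fits in an ordinary data frame column.
collapsed <- vapply(matches, function(m) {
  if (length(m) > 0) paste(m, collapse = "; ") else NA_character_
}, character(1))

result <- data.frame(query = queries, matches = collapsed)

# Write to a temporary file here; substitute a real path as needed.
out_file <- tempfile(fileext = ".csv")
write.csv(result, out_file, row.names = FALSE)
print(read.csv(out_file))
```

Keeping one row per query preserves the link between each input name and its matches, which a bare one-column matrix would lose.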