可変長文字列のより良い類似度ランキングアルゴリズム

Question

私は、通常提案されているもの（レベンシュタイン距離、soundexなど）よりも可変長文字列でより良い結果をもたらす文字列類似性アルゴリズムを探しています。

例えば、

与えられた文字列A：「ロバート」、

次に、ストリングB： "Amy Robertson"

より良いマッチになるだろう

ストリングC：「リチャード」

また、このアルゴリズムは言語に依存しないことが望ましいです（英語以外の言語でも機能します）。

marzagao · Accepted Answer

CatalysoftのSimon Whiteは、隣接する文字ペアを比較する非常に賢いアルゴリズムについての記事を書きました。

http://www.catalysoft.com/articles/StrikeAMatch.html

SimonにはJavaバージョンのアルゴリズムがあり、その下にPL/Rubyバージョンを書きました（関連するフォーラムエントリで行われたプレーンRubyバージョンMark Wong-VanHarenによるコメント）、PostgreSQLクエリで使用できるようにします：

CREATE FUNCTION string_similarity(str1 varchar, str2 varchar) RETURNS float8 AS ' str1.downcase! pairs1 = (0..str1.length-2).collect {|i| str1[i,2]}.reject { |pair| pair.include? " "} str2.downcase! pairs2 = (0..str2.length-2).collect {|i| str2[i,2]}.reject { |pair| pair.include? " "} union = pairs1.size + pairs2.size intersection = 0 pairs1.each do |p1| 0.upto(pairs2.size-1) do |i| if p1 == pairs2[i] intersection += 1 pairs2.slice!(i) break end end end (2.0 * intersection) / union ' LANGUAGE 'plruby';

チャームのように機能します！

Michael La Voie · Answer

marzagaoの答えは素晴らしい。私はそれをC＃に変換したので、ここに投稿したいと思いました：

ペーストビンリンク

/// <summary> /// This class implements string comparison algorithm /// based on character pair similarity /// Source: http://www.catalysoft.com/articles/StrikeAMatch.html /// </summary> public class SimilarityTool { /// <summary> /// Compares the two strings based on letter pair matches /// </summary> /// <param name="str1"></param> /// <param name="str2"></param> /// <returns>The percentage match from 0.0 to 1.0 where 1.0 is 100%</returns> public double CompareStrings(string str1, string str2) { List<string> pairs1 = WordLetterPairs(str1.ToUpper()); List<string> pairs2 = WordLetterPairs(str2.ToUpper()); int intersection = 0; int union = pairs1.Count + pairs2.Count; for (int i = 0; i < pairs1.Count; i++) { for (int j = 0; j < pairs2.Count; j++) { if (pairs1[i] == pairs2[j]) { intersection++; pairs2.RemoveAt(j);//Must remove the match to prevent "GGGG" from appearing to match "GG" with 100% success break; } } } return (2.0 * intersection) / union; } /// <summary> /// Gets all letter pairs for each /// individual Word in the string /// </summary> /// <param name="str"></param> /// <returns></returns> private List<string> WordLetterPairs(string str) { List<string> AllPairs = new List<string>(); // Tokenize the string and put the tokens/words into an array string[] Words = Regex.Split(str, @"\s"); // For each Word for (int w = 0; w < Words.Length; w++) { if (!string.IsNullOrEmpty(Words[w])) { // Find the pairs of characters String[] PairsInWord = LetterPairs(Words[w]); for (int p = 0; p < PairsInWord.Length; p++) { AllPairs.Add(PairsInWord[p]); } } } return AllPairs; } /// <summary> /// Generates an array containing every /// two consecutive letters in the input string /// </summary> /// <param name="str"></param> /// <returns></returns> private string[] LetterPairs(string str) { int numPairs = str.Length - 1; string[] pairs = new string[numPairs]; for (int i = 0; i < numPairs; i++) { pairs[i] = str.Substring(i, 2); } return pairs; } }

John Rutledge · Answer

marzagao's answerの別のバージョンがあります。これはPythonで書かれています：

def get_bigrams(string): """ Take a string and return a list of bigrams. """ s = string.lower() return [s[i:i+2] for i in list(range(len(s) - 1))] def string_similarity(str1, str2): """ Perform bigram comparison between two strings and return a percentage match in decimal form. """ pairs1 = get_bigrams(str1) pairs2 = get_bigrams(str2) union = len(pairs1) + len(pairs2) hit_count = 0 for x in pairs1: for y in pairs2: if x == y: hit_count += 1 break return (2.0 * hit_count) / union if __== "__main__": """ Run a test using the example taken from: http://www.catalysoft.com/articles/StrikeAMatch.html """ w1 = 'Healed' words = ['Heard', 'Healthy', 'Help', 'Herded', 'Sealed', 'Sold'] for w2 in words: print('Healed --- ' + w2) print(string_similarity(w1, w2)) print()

quantum · Answer

John Rutledge's answerの短縮版：

def get_bigrams(string): ''' Takes a string and returns a list of bigrams ''' s = string.lower() return {s[i:i+2] for i in xrange(len(s) - 1)} def string_similarity(str1, str2): ''' Perform bigram comparison between two strings and return a percentage match in decimal form ''' pairs1 = get_bigrams(str1) pairs2 = get_bigrams(str2) return (2.0 * len(pairs1 & pairs2)) / (len(pairs1) + len(pairs2))

Igal Alkon · Answer

ここに私のPHP推奨されるStrikeAMatchアルゴリズムの実装、Simon Whiteによる。利点は（リンクで言うように）：

語彙の類似性の真の反映-わずかな違いのある文字列は類似していると認識されるべきです。特に、重要な部分文字列の重複は、文字列間の類似性が高いことを示す必要があります。
単語の順序の変更に対する堅牢性-同じ単語を含むが順序が異なる2つの文字列は、類似していると認識される必要があります。一方、1つの文字列が他の文字列に含まれる文字のランダムなアナグラムである場合、（通常）異なる文字列として認識される必要があります。
Language Independence-アルゴリズムは英語だけでなく、多くの異なる言語で動作するはずです。

<?php /** * LetterPairSimilarity algorithm implementation in PHP * @author Igal Alkon * @link http://www.catalysoft.com/articles/StrikeAMatch.html */ class LetterPairSimilarity { /** * @param $str * @return mixed */ private function wordLetterPairs($str) { $allPairs = array(); // Tokenize the string and put the tokens/words into an array $words = explode(' ', $str); // For each Word for ($w = 0; $w < count($words); $w++) { // Find the pairs of characters $pairsInWord = $this->letterPairs($words[$w]); for ($p = 0; $p < count($pairsInWord); $p++) { $allPairs[] = $pairsInWord[$p]; } } return $allPairs; } /** * @param $str * @return array */ private function letterPairs($str) { $numPairs = mb_strlen($str)-1; $pairs = array(); for ($i = 0; $i < $numPairs; $i++) { $pairs[$i] = mb_substr($str,$i,2); } return $pairs; } /** * @param $str1 * @param $str2 * @return float */ public function compareStrings($str1, $str2) { $pairs1 = $this->wordLetterPairs(strtoupper($str1)); $pairs2 = $this->wordLetterPairs(strtoupper($str2)); $intersection = 0; $union = count($pairs1) + count($pairs2); for ($i=0; $i < count($pairs1); $i++) { $pair1 = $pairs1[$i]; $pairs2 = array_values($pairs2); for($j = 0; $j < count($pairs2); $j++) { $pair2 = $pairs2[$j]; if ($pair1 === $pair2) { $intersection++; unset($pairs2[$j]); break; } } } return (2.0*$intersection)/$union; } }

bchatham · Answer

この議論は本当に役に立ちました、ありがとう。 Excelで使用するためにアルゴリズムをVBAに変換し、いくつかのバージョンのワークシート関数を作成しました。 strSimLookupバージョンは、文字列、配列インデックス、または類似性メトリックとして最後に一致したものを返します。

この実装は、Simon WhiteのWebサイトのAmazonの例にリストされているものと同じ結果を生成しますが、低スコアの一致に関するいくつかの小さな例外があります。違いがどこに忍び込んでいるかはわかりませんが、VBAのSplit関数かもしれませんが、目的のためにうまく機能しているので調査していません。

'Implements functions to rate how similar two strings are on 'a scale of 0.0 (completely dissimilar) to 1.0 (exactly similar) 'Source: http://www.catalysoft.com/articles/StrikeAMatch.html 'Author: Bob Chatham, bob.chatham at gmail.com '9/12/2010 Option Explicit Public Function stringSimilarity(str1 As String, str2 As String) As Variant 'Simple version of the algorithm that computes the similiarity metric 'between two strings. 'NOTE: This verision is not efficient to use if you're comparing one string 'with a range of other values as it will needlessly calculate the pairs for the 'first string over an over again; use the array-optimized version for this case. Dim sPairs1 As Collection Dim sPairs2 As Collection Set sPairs1 = New Collection Set sPairs2 = New Collection WordLetterPairs str1, sPairs1 WordLetterPairs str2, sPairs2 stringSimilarity = SimilarityMetric(sPairs1, sPairs2) Set sPairs1 = Nothing Set sPairs2 = Nothing End Function Public Function strSimA(str1 As Variant, rRng As Range) As Variant 'Return an array of string similarity indexes for str1 vs every string in input range rRng Dim sPairs1 As Collection Dim sPairs2 As Collection Dim arrOut As Variant Dim l As Long, j As Long Set sPairs1 = New Collection WordLetterPairs CStr(str1), sPairs1 l = rRng.Count ReDim arrOut(1 To l) For j = 1 To l Set sPairs2 = New Collection WordLetterPairs CStr(rRng(j)), sPairs2 arrOut(j) = SimilarityMetric(sPairs1, sPairs2) Set sPairs2 = Nothing Next j strSimA = Application.Transpose(arrOut) End Function Public Function strSimLookup(str1 As Variant, rRng As Range, Optional returnType) As Variant 'Return either the best match or the index of the best match 'depending on returnTYype parameter) between str1 and strings in rRng) ' returnType = 0 or omitted: returns the best matching string ' returnType = 1 : returns the index of the best matching string ' returnType = 2 : returns the similarity metric Dim sPairs1 As Collection Dim sPairs2 As Collection Dim metric, bestMetric As Double Dim i, iBest As Long Const RETURN_STRING As Integer = 0 Const RETURN_INDEX As Integer = 1 Const RETURN_METRIC As Integer = 2 If IsMissing(returnType) Then returnType = RETURN_STRING Set sPairs1 = New Collection WordLetterPairs CStr(str1), sPairs1 bestMetric = -1 iBest = -1 For i = 1 To rRng.Count Set sPairs2 = New Collection WordLetterPairs CStr(rRng(i)), sPairs2 metric = SimilarityMetric(sPairs1, sPairs2) If metric > bestMetric Then bestMetric = metric iBest = i End If Set sPairs2 = Nothing Next i If iBest = -1 Then strSimLookup = CVErr(xlErrValue) Exit Function End If Select Case returnType Case RETURN_STRING strSimLookup = CStr(rRng(iBest)) Case RETURN_INDEX strSimLookup = iBest Case Else strSimLookup = bestMetric End Select End Function Public Function strSim(str1 As String, str2 As String) As Variant Dim ilen, iLen1, ilen2 As Integer iLen1 = Len(str1) ilen2 = Len(str2) If iLen1 >= ilen2 Then ilen = ilen2 Else ilen = iLen1 strSim = stringSimilarity(Left(str1, ilen), Left(str2, ilen)) End Function Sub WordLetterPairs(str As String, pairColl As Collection) 'Tokenize str into words, then add all letter pairs to pairColl Dim Words() As String Dim Word, nPairs, pair As Integer Words = Split(str) If UBound(Words) < 0 Then Set pairColl = Nothing Exit Sub End If For Word = 0 To UBound(Words) nPairs = Len(Words(Word)) - 1 If nPairs > 0 Then For pair = 1 To nPairs pairColl.Add Mid(Words(Word), pair, 2) Next pair End If Next Word End Sub Private Function SimilarityMetric(sPairs1 As Collection, sPairs2 As Collection) As Variant 'Helper function to calculate similarity metric given two collections of letter pairs. 'This function is designed to allow the pair collections to be set up separately as needed. 'NOTE: sPairs2 collection will be altered as pairs are removed; copy the collection 'if this is not the desired behavior. 'Also assumes that collections will be deallocated somewhere else Dim Intersect As Double Dim Union As Double Dim i, j As Long If sPairs1.Count = 0 Or sPairs2.Count = 0 Then SimilarityMetric = CVErr(xlErrNA) Exit Function End If Union = sPairs1.Count + sPairs2.Count Intersect = 0 For i = 1 To sPairs1.Count For j = 1 To sPairs2.Count If StrComp(sPairs1(i), sPairs2(j)) = 0 Then Intersect = Intersect + 1 sPairs2.Remove j Exit For End If Next j Next i SimilarityMetric = (2 * Intersect) / Union End Function

user4256874 · Answer

申し訳ありませんが、答えは著者によって考案されたものではありません。これは、Digital Equipment Corporationが最初に発表したよく知られたアルゴリズムで、多くの場合、シングリングと呼ばれます。

http://www.hpl.hp.com/techreports/Compaq-DEC/SRC-TN-1997-015.pdf

Andy · Answer

より高速なPHPアルゴリズムのバージョン：

/** * * @param $str * @return mixed */ private static function wordLetterPairs ($str) { $allPairs = array(); // Tokenize the string and put the tokens/words into an array $words = explode(' ', $str); // For each Word for ($w = 0; $w < count($words); $w ++) { // Find the pairs of characters $pairsInWord = self::letterPairs($words[$w]); for ($p = 0; $p < count($pairsInWord); $p ++) { $allPairs[$pairsInWord[$p]] = $pairsInWord[$p]; } } return array_values($allPairs); } /** * * @param $str * @return array */ private static function letterPairs ($str) { $numPairs = mb_strlen($str) - 1; $pairs = array(); for ($i = 0; $i < $numPairs; $i ++) { $pairs[$i] = mb_substr($str, $i, 2); } return $pairs; } /** * * @param $str1 * @param $str2 * @return float */ public static function compareStrings ($str1, $str2) { $pairs1 = self::wordLetterPairs(mb_strtolower($str1)); $pairs2 = self::wordLetterPairs(mb_strtolower($str2)); $union = count($pairs1) + count($pairs2); $intersection = count(array_intersect($pairs1, $pairs2)); return (2.0 * $intersection) / $union; }

私が持っていたデータ（約2300の比較）については、 Igal Alkon ソリューションで0.58秒、私のソリューションでは0.35秒でした。

fabiolimace · Answer

Simon WhiteのアルゴリズムをPL/pgSQLに翻訳しました。これが私の貢献です。

<!-- language: lang-sql --> create or replace function spt1.letterpairs(in p_str varchar) returns varchar as $$ declare v_numpairs integer := length(p_str)-1; v_pairs varchar[]; begin for i in 1 .. v_numpairs loop v_pairs[i] := substr(p_str, i, 2); end loop; return v_pairs; end; $$ language 'plpgsql'; --=================================================================== create or replace function spt1.wordletterpairs(in p_str varchar) returns varchar as $$ declare v_allpairs varchar[]; v_words varchar[]; v_pairsinword varchar[]; begin v_words := regexp_split_to_array(p_str, '[[:space:]]'); for i in 1 .. array_length(v_words, 1) loop v_pairsinword := spt1.letterpairs(v_words[i]); if v_pairsinword is not null then for j in 1 .. array_length(v_pairsinword, 1) loop v_allpairs := v_allpairs || v_pairsinword[j]; end loop; end if; end loop; return v_allpairs; end; $$ language 'plpgsql'; --=================================================================== create or replace function spt1.arrayintersect(ANYARRAY, ANYARRAY) returns anyarray as $$ select array(select unnest($1) intersect select unnest($2)) $$ language 'sql'; --=================================================================== create or replace function spt1.comparestrings(in p_str1 varchar, in p_str2 varchar) returns float as $$ declare v_pairs1 varchar[]; v_pairs2 varchar[]; v_intersection integer; v_union integer; begin v_pairs1 := wordletterpairs(upper(p_str1)); v_pairs2 := wordletterpairs(upper(p_str2)); v_union := array_length(v_pairs1, 1) + array_length(v_pairs2, 1); v_intersection := array_length(arrayintersect(v_pairs1, v_pairs2), 1); return (2.0 * v_intersection / v_union); end; $$ language 'plpgsql';

Ignobilis · Answer

美しいScalaのバージョン：

 def pairDistance(s1: String, s2: String): Double = { def strToPairs(s: String, acc: List[String]): List[String] = { if (s.size < 2) acc else strToPairs(s.drop(1), if (s.take(2).contains(" ")) acc else acc ::: List(s.take(2))) } val lst1 = strToPairs(s1.toUpperCase, List()) val lst2 = strToPairs(s2.toUpperCase, List()) (2.0 * lst2.intersect(lst1).size) / (lst1.size + lst2.size) }

Anonymous · Answer

String Similarity Metrics には、文字列比較で使用されるさまざまなメトリックの概要が含まれています（ Wikipedia にも概要があります）。これらのメトリックの多くは、ライブラリに実装されています simmetrics 。

特定の概要に含まれていないメトリックのさらに別の例は、たとえば圧縮距離（コルモゴロフの複雑さに近づけようとする）です。これは、あなたが提示したもの。

Natural Language Processing のはるかに幅広い主題を検討することもできます。これら Rパッケージは、すぐに使い始めることができます（または、少なくともいくつかのアイデアを提供します）。

最後の編集-SOでこのテーマに関する他の質問を検索してください。関連する質問がかなりあります。

Rainer · Answer

Rバージョンは次のとおりです。

get_bigrams <- function(str) { lstr = tolower(str) bigramlst = list() for(i in 1:(nchar(str)-1)) { bigramlst[[i]] = substr(str, i, i+1) } return(bigramlst) } str_similarity <- function(str1, str2) { pairs1 = get_bigrams(str1) pairs2 = get_bigrams(str2) unionlen = length(pairs1) + length(pairs2) hit_count = 0 for(x in 1:length(pairs1)){ for(y in 1:length(pairs2)){ if (pairs1[[x]] == pairs2[[y]]) hit_count = hit_count + 1 } } return ((2.0 * hit_count) / unionlen) }

scripts · Answer

marzagao's answer C99での投稿これらアルゴリズムに触発

double dice_match(const char *string1, const char *string2) { //check fast cases if (((string1 != NULL) && (string1[0] == '\0')) || ((string2 != NULL) && (string2[0] == '\0'))) { return 0; } if (string1 == string2) { return 1; } size_t strlen1 = strlen(string1); size_t strlen2 = strlen(string2); if (strlen1 < 2 || strlen2 < 2) { return 0; } size_t length1 = strlen1 - 1; size_t length2 = strlen2 - 1; double matches = 0; int i = 0, j = 0; //get bigrams and compare while (i < length1 && j < length2) { char a[3] = {string1[i], string1[i + 1], '\0'}; char b[3] = {string2[j], string2[j + 1], '\0'}; int cmp = strcmpi(a, b); if (cmp == 0) { matches += 2; } i++; j++; } return matches / (length1 + length2); }

元の記事に基づくいくつかのテスト：

#include <stdio.h> void article_test1() { char *string1 = "FRANCE"; char *string2 = "FRENCH"; printf("====%s====
", __func__); printf("%2.f%% == 40%%
", dice_match(string1, string2) * 100); } void article_test2() { printf("====%s====
", __func__); char *string = "Healed"; char *ss[] = {"Heard", "Healthy", "Help", "Herded", "Sealed", "Sold"}; int correct[] = {44, 55, 25, 40, 80, 0}; for (int i = 0; i < 6; ++i) { printf("%2.f%% == %d%%
", dice_match(string, ss[i]) * 100, correct[i]); } } void multicase_test() { char *string1 = "FRaNcE"; char *string2 = "fREnCh"; printf("====%s====
", __func__); printf("%2.f%% == 40%%
", dice_match(string1, string2) * 100); } void gg_test() { char *string1 = "GG"; char *string2 = "GGGGG"; printf("====%s====
", __func__); printf("%2.f%% != 100%%
", dice_match(string1, string2) * 100); } int main() { article_test1(); article_test2(); multicase_test(); gg_test(); return 0; }

gruppler · Answer

私のJavaScript実装は、文字列または文字列の配列と、オプションのフロア（デフォルトのフロアは0.5）を取ります。文字列を渡すと、文字列の類似性スコアがフロア以上かどうかに応じて、trueまたはfalseを返します。文字列の配列を渡すと、類似度スコアがフロア以上である文字列の配列をスコアでソートして返します。

例：

'Healed'.fuzzy('Sealed'); // returns true 'Healed'.fuzzy('Help'); // returns false 'Healed'.fuzzy('Help', 0.25); // returns true 'Healed'.fuzzy(['Sold', 'Herded', 'Heard', 'Help', 'Sealed', 'Healthy']); // returns ["Sealed", "Healthy"] 'Healed'.fuzzy(['Sold', 'Herded', 'Heard', 'Help', 'Sealed', 'Healthy'], 0); // returns ["Sealed", "Healthy", "Heard", "Herded", "Help", "Sold"]

ここにあります：

(function(){ var default_floor = 0.5; function pairs(str){ var pairs = [] , length = str.length - 1 , pair; str = str.toLowerCase(); for(var i = 0; i < length; i++){ pair = str.substr(i, 2); if(!/\s/.test(pair)){ pairs.Push(pair); } } return pairs; } function similarity(pairs1, pairs2){ var union = pairs1.length + pairs2.length , hits = 0; for(var i = 0; i < pairs1.length; i++){ for(var j = 0; j < pairs1.length; j++){ if(pairs1[i] == pairs2[j]){ pairs2.splice(j--, 1); hits++; break; } } } return 2*hits/union || 0; } String.prototype.fuzzy = function(strings, floor){ var str1 = this , pairs1 = pairs(this); floor = typeof floor == 'number' ? floor : default_floor; if(typeof(strings) == 'string'){ return str1.length > 1 && strings.length > 1 && similarity(pairs1, pairs(strings)) >= floor || str1.toLowerCase() == strings.toLowerCase(); }else if(strings instanceof Array){ var scores = {}; strings.map(function(str2){ scores[str2] = str1.length > 1 ? similarity(pairs1, pairs(str2)) : 1*(str1.toLowerCase() == str2.toLowerCase()); }); return strings.filter(function(str){ return scores[str] >= floor; }).sort(function(a, b){ return scores[b] - scores[a]; }); } }; })();

そして、ここにあなたの便宜のために縮小版があります：

(function(){function g(a){var b=[],e=a.length-1,d;a=a.toLowerCase();for(var c=0;c<e;c++)d=a.substr(c,2),/\s/.test(d)||b.Push(d);return b}function h(a,b){for(var e=a.length+b.length,d=0,c=0;c<a.length;c++)for(var f=0;f<a.length;f++)if(a[c]==b[f]){b.splice(f--,1);d++;break}return 2*d/e||0}String.prototype.fuzzy=function(a,b){var e=this,d=g(this);b="number"==typeof b?b:0.5;if("string"==typeof a)return 1<e.length&&1<a.length&&h(d,g(a))>=b||e.toLowerCase()==a.toLowerCase();if(a instanceof Array){var c={};a.map(function(a){c[a]=1<e.length?h(d,g(a)):1*(e.toLowerCase()==a.toLowerCase())});return a.filter(function(a){return c[a]>=b}).sort(function(a,b){return c[b]-c[a]})}}})();

Frank Rundatz · Answer

Michael La Voieの素晴らしいC＃バージョンに基づいて、それを拡張メソッドにするというリクエストに応じて、私が思いついたものを以下に示します。この方法で行う主な利点は、一致率で汎用リストをソートできることです。たとえば、オブジェクトに「City」という名前の文字列フィールドがあるとします。ユーザーが「チェスター」を検索し、一致の降順で結果を返します。たとえば、チェスターの文字通りの一致をロチェスターの前に表示したいとします。これを行うには、オブジェクトに2つの新しいプロパティを追加します。

 public string SearchText { get; set; } public double PercentMatch { get { return City.ToUpper().PercentMatchTo(this.SearchText.ToUpper()); } }

次に、各オブジェクトで、ユーザーが検索したものにSearchTextを設定します。次に、次のような方法で簡単にソートできます。

 zipcodes = zipcodes.OrderByDescending(x => x.PercentMatch);

これを拡張メソッドにするためのわずかな変更を次に示します。

 /// <summary> /// This class implements string comparison algorithm /// based on character pair similarity /// Source: http://www.catalysoft.com/articles/StrikeAMatch.html /// </summary> public static double PercentMatchTo(this string str1, string str2) { List<string> pairs1 = WordLetterPairs(str1.ToUpper()); List<string> pairs2 = WordLetterPairs(str2.ToUpper()); int intersection = 0; int union = pairs1.Count + pairs2.Count; for (int i = 0; i < pairs1.Count; i++) { for (int j = 0; j < pairs2.Count; j++) { if (pairs1[i] == pairs2[j]) { intersection++; pairs2.RemoveAt(j);//Must remove the match to prevent "GGGG" from appearing to match "GG" with 100% success break; } } } return (2.0 * intersection) / union; } /// <summary> /// Gets all letter pairs for each /// individual Word in the string /// </summary> /// <param name="str"></param> /// <returns></returns> private static List<string> WordLetterPairs(string str) { List<string> AllPairs = new List<string>(); // Tokenize the string and put the tokens/words into an array string[] Words = Regex.Split(str, @"\s"); // For each Word for (int w = 0; w < Words.Length; w++) { if (!string.IsNullOrEmpty(Words[w])) { // Find the pairs of characters String[] PairsInWord = LetterPairs(Words[w]); for (int p = 0; p < PairsInWord.Length; p++) { AllPairs.Add(PairsInWord[p]); } } } return AllPairs; } /// <summary> /// Generates an array containing every /// two consecutive letters in the input string /// </summary> /// <param name="str"></param> /// <returns></returns> private static string[] LetterPairs(string str) { int numPairs = str.Length - 1; string[] pairs = new string[numPairs]; for (int i = 0; i < numPairs; i++) { pairs[i] = str.Substring(i, 2); } return pairs; }

matthewpalmer · Answer

Haskellバージョン— Haskellをあまり行っていないので、気軽に編集を提案してください。

import Data.Char import Data.List -- Convert a string into words, then get the pairs of words from that phrase wordLetterPairs :: String -> [String] wordLetterPairs s1 = concat $ map pairs $ words s1 -- Converts a String into a list of letter pairs. pairs :: String -> [String] pairs [] = [] pairs (x:[]) = [] pairs (x:ys) = [x, head ys]:(pairs ys) -- Calculates the match rating for two strings matchRating :: String -> String -> Double matchRating s1 s2 = (numberOfMatches * 2) / totalLength where pairsS1 = wordLetterPairs $ map toLower s1 pairsS2 = wordLetterPairs $ map toLower s2 numberOfMatches = fromIntegral $ length $ pairsS1 `intersect` pairsS2 totalLength = fromIntegral $ length pairsS1 + length pairsS2

Shaun Lebron · Answer

Clojure：

(require '[clojure.set :refer [intersection]]) (defn bigrams [s] (->> (split s #"\s+") (mapcat #(partition 2 1 %)) (set))) (defn string-similarity [a b] (let [a-pairs (bigrams a) b-pairs (bigrams b) total-count (+ (count a-pairs) (count b-pairs)) match-count (count (intersection a-pairs b-pairs)) similarity (/ (* 2 match-count) total-count)] similarity))

s01ipsist · Answer

ダイス係数アルゴリズム（Simon White/marzagaoの答え）は、amatch gemのpair_distance_similarメソッドのRubyで実装されています

https://github.com/flori/amatch

このgemには、いくつかの近似マッチングおよび文字列比較アルゴリズムの実装も含まれています：レーベンシュタイン編集距離、売り手編集距離、ハミング距離、最長共通部分列長、最長共通部分文字列長、ペア距離メトリック、Jaro-Winklerメトリック。

Engr. Hasanuzzaman Sumon · Answer

@marzagaoの答えが示すアルゴリズムの純粋なRuby実装を探していました。残念ながら、@ marzagaoで示されるリンクは壊れています。 @ s01ipsistの回答で、彼はRuby gem amatch を示しましたが、実装は純粋なRubyではありません。そこで、少し検索して、純粋なRuby実装を持つgem fuzzy_match を見つけました（このgemはamatchを使用しています） at here これが私のような人に役立つことを願っています。

tehvan · Answer

レーベンシュタインの距離を最初の弦の長さで割ったもの（または両方の弦の最小/最大/平均長さで割ったもの）はどうですか？これは今のところうまくいきました。

user779420 · Answer

ちょっとみんな私はこれをjavascriptで試してみましたが、私はそれが初めてです、誰もそれを行うより速い方法を知っていますか？

function get_bigrams(string) { // Takes a string and returns a list of bigrams var s = string.toLowerCase(); var v = new Array(s.length-1); for (i = 0; i< v.length; i++){ v[i] =s.slice(i,i+2); } return v; } function string_similarity(str1, str2){ /* Perform bigram comparison between two strings and return a percentage match in decimal form */ var pairs1 = get_bigrams(str1); var pairs2 = get_bigrams(str2); var union = pairs1.length + pairs2.length; var hit_count = 0; for (x in pairs1){ for (y in pairs2){ if (pairs1[x] == pairs2[y]){ hit_count++; } } } return ((2.0 * hit_count) / union); } var w1 = 'Healed'; var Word =['Heard','Healthy','Help','Herded','Sealed','Sold'] for (w2 in Word){ console.log('Healed --- ' + Word[w2]) console.log(string_similarity(w1,Word[w2])); }

chema989 · Answer

Sørensen–Diceインデックス（marzagaoの答え）に基づく類似性の別のバージョンを次に示します。これはC++ 11で記述されています。

/* * Similarity based in Sørensen–Dice index. * * Returns the Similarity between _str1 and _str2. */ double similarity_sorensen_dice(const std::string& _str1, const std::string& _str2) { // Base case: if some string is empty. if (_str1.empty() || _str2.empty()) { return 1.0; } auto str1 = upper_string(_str1); auto str2 = upper_string(_str2); // Base case: if the strings are equals. if (str1 == str2) { return 0.0; } // Base case: if some string does not have bigrams. if (str1.size() < 2 || str2.size() < 2) { return 1.0; } // Extract bigrams from str1 auto num_pairs1 = str1.size() - 1; std::unordered_set<std::string> str1_bigrams; str1_bigrams.reserve(num_pairs1); for (unsigned i = 0; i < num_pairs1; ++i) { str1_bigrams.insert(str1.substr(i, 2)); } // Extract bigrams from str2 auto num_pairs2 = str2.size() - 1; std::unordered_set<std::string> str2_bigrams; str2_bigrams.reserve(num_pairs2); for (unsigned int i = 0; i < num_pairs2; ++i) { str2_bigrams.insert(str2.substr(i, 2)); } // Find the intersection between the two sets. int intersection = 0; if (str1_bigrams.size() < str2_bigrams.size()) { const auto it_e = str2_bigrams.end(); for (const auto& bigram : str1_bigrams) { intersection += str2_bigrams.find(bigram) != it_e; } } else { const auto it_e = str1_bigrams.end(); for (const auto& bigram : str2_bigrams) { intersection += str1_bigrams.find(bigram) != it_e; } } // Returns similarity coefficient. return (2.0 * intersection) / (num_pairs1 + num_pairs2); }