Similarity String Comparison in Java

Question

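I want to compare several strings to each other and find the ones that are the most similar. I was wondering if there is any library, method, or best practice that would tell me which strings are more similar to other strings. For example:

"The quick fox jumped" -> "The fox jumped"
"The quick fox jumped" -> "The fox"

This comparison would report that the first pair is more similar than the second.

I guess I need some method such as: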

double similarityIndex(String s1, String s2)

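Is there such a thing somewhere?

EDIT: Why am I doing this? I am writing a script that compares the output of an MS Project file to the output of a legacy system that handles tasks. Because the legacy system has a very limited field width, descriptions are abbreviated when values are added. I want a semi-automated way to find which entries from MS Project are similar to the entries in the legacy system, so I can get the generated keys. It has drawbacks, since the results still have to be checked manually, but it would save a lot of work.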

dfa · Accepted Answer

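Yes, there are many well-documented algorithms, such as:

Cosine similarity
Jaccard similarity
Dice's coefficient
Matching similarity
Overlap similarity
and so on

Or you can check this.

Also check these projects.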
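To make one of the measures listed above concrete, here is a minimal sketch of Jaccard similarity computed over character bigrams. The class name `JaccardSketch`, the helper `bigrams`, the choice of bigrams as the unit of comparison, and the convention of returning 1.0 when neither string has a bigram are all assumptions made for illustration, not something taken from the answer or its links.

```java
import java.util.HashSet;
import java.util.Set;

public class JaccardSketch {

    // Collects the distinct two-character substrings of s (hypothetical helper).
    public static Set<String> bigrams(String s) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 2 <= s.length(); i++) {
            grams.add(s.substring(i, i + 2));
        }
        return grams;
    }

    // Jaccard similarity: size of the intersection divided by size of the union.
    public static double jaccard(String s1, String s2) {
        Set<String> a = bigrams(s1);
        Set<String> b = bigrams(s2);
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0; // assumption: two strings without any bigram count as identical
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        System.out.println(jaccard("The quick fox jumped", "The fox jumped"));
        System.out.println(jaccard("The quick fox jumped", "The fox"));
    }
}
```

With the strings from the question, the first pair scores higher than the second, which is the ordering the question asks for. Unlike an edit distance, this measure ignores where in the string the shared bigrams occur.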

acdcjunior · Answer

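The common way of calculating the similarity between two strings on a 0%-100% scale, as used in many libraries, is to measure how much (in %) you would have to change the longer string to turn it into the shorter one: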

    /**
     * Calculates the similarity (a number within 0 and 1) between two strings.
     */
    public static double similarity(String s1, String s2) {
      String longer = s1, shorter = s2;
      if (s1.length() < s2.length()) { // longer should always have greater length
        longer = s2; shorter = s1;
      }
      int longerLength = longer.length();
      if (longerLength == 0) { return 1.0; /* both strings are zero length */ }
      return (longerLength - editDistance(longer, shorter)) / (double) longerLength;
    }
    // you can use StringUtils.getLevenshteinDistance() as the editDistance() function
    // full copy-paste working code is below

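Computing the `editDistance()`:

The editDistance() function above is expected to calculate the edit distance between the two strings. There are several implementations of this step, and each may suit a specific scenario better. The most common is the Levenshtein distance algorithm, which is the one used in the example below (for very large strings, other algorithms are likely to perform better).

Here are two options for calculating the edit distance:

You can use Apache Commons Text's implementation of Levenshtein distance: apply(CharSequence left, CharSequence right)
Implement it yourself. An example implementation is shown below.

Working example:

See the online demo here.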

    public class StringSimilarity {

      /**
       * Calculates the similarity (a number within 0 and 1) between two strings.
       */
      public static double similarity(String s1, String s2) {
        String longer = s1, shorter = s2;
        if (s1.length() < s2.length()) { // longer should always have greater length
          longer = s2; shorter = s1;
        }
        int longerLength = longer.length();
        if (longerLength == 0) { return 1.0; /* both strings are zero length */ }
        /* // If you have Apache Commons Text, you can use it to calculate the edit distance:
        LevenshteinDistance levenshteinDistance = new LevenshteinDistance();
        return (longerLength - levenshteinDistance.apply(longer, shorter)) / (double) longerLength; */
        return (longerLength - editDistance(longer, shorter)) / (double) longerLength;
      }

      // Example implementation of the Levenshtein Edit Distance
      // See http://rosettacode.org/wiki/Levenshtein_distance#Java
      public static int editDistance(String s1, String s2) {
        s1 = s1.toLowerCase();
        s2 = s2.toLowerCase();

        int[] costs = new int[s2.length() + 1];
        for (int i = 0; i <= s1.length(); i++) {
          int lastValue = i;
          for (int j = 0; j <= s2.length(); j++) {
            if (i == 0)
              costs[j] = j;
            else {
              if (j > 0) {
                int newValue = costs[j - 1];
                if (s1.charAt(i - 1) != s2.charAt(j - 1))
                  newValue = Math.min(Math.min(newValue, lastValue), costs[j]) + 1;
                costs[j - 1] = lastValue;
                lastValue = newValue;
              }
            }
          }
          if (i > 0)
            costs[s2.length()] = lastValue;
        }
        return costs[s2.length()];
      }

      public static void printSimilarity(String s, String t) {
        System.out.println(String.format(
          "%.3f is the similarity between \"%s\" and \"%s\"", similarity(s, t), s, t));
      }

      public static void main(String[] args) {
        printSimilarity("", "");
        printSimilarity("1234567890", "1");
        printSimilarity("1234567890", "123");
        printSimilarity("1234567890", "1234567");
        printSimilarity("1234567890", "1234567890");
        printSimilarity("1234567890", "1234567980");
        printSimilarity("47/2010", "472010");
        printSimilarity("47/2010", "472011");
        printSimilarity("47/2010", "AB.CDEF");
        printSimilarity("47/2010", "4B.CDEFG");
        printSimilarity("47/2010", "AB.CDEFG");
        printSimilarity("The quick fox jumped", "The fox jumped");
        printSimilarity("The quick fox jumped", "The fox");
        printSimilarity("kitten", "sitting");
      }

    }

Output:

    1.000 is the similarity between "" and ""
    0.100 is the similarity between "1234567890" and "1"
    0.300 is the similarity between "1234567890" and "123"
    0.700 is the similarity between "1234567890" and "1234567"
    1.000 is the similarity between "1234567890" and "1234567890"
    0.800 is the similarity between "1234567890" and "1234567980"
    0.857 is the similarity between "47/2010" and "472010"
    0.714 is the similarity between "47/2010" and "472011"
    0.000 is the similarity between "47/2010" and "AB.CDEF"
    0.125 is the similarity between "47/2010" and "4B.CDEFG"
    0.000 is the similarity between "47/2010" and "AB.CDEFG"
    0.700 is the similarity between "The quick fox jumped" and "The fox jumped"
    0.350 is the similarity between "The quick fox jumped" and "The fox"
    0.571 is the similarity between "kitten" and "sitting"
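For the use case in the question (finding which of several candidates is closest to a given string), the same ratio can be applied across a list. The following is a rough sketch under stated assumptions: the class `BestMatchSketch` and the method `bestMatch` are hypothetical names, and the edit distance here is a plain two-row Levenshtein that, unlike the implementation in this answer, does not lower-case its inputs.

```java
import java.util.Arrays;
import java.util.List;

public class BestMatchSketch {

    // Two-row Levenshtein distance (case-sensitive; an assumption of this sketch).
    public static int editDistance(String s1, String s2) {
        int[] prev = new int[s2.length() + 1];
        int[] curr = new int[s2.length() + 1];
        for (int j = 0; j <= s2.length(); j++) {
            prev[j] = j; // turning an empty prefix into j characters costs j insertions
        }
        for (int i = 1; i <= s1.length(); i++) {
            curr[0] = i; // deleting i characters to reach the empty prefix
            for (int j = 1; j <= s2.length(); j++) {
                int cost = s1.charAt(i - 1) == s2.charAt(j - 1) ? 0 : 1;
                int substitution = prev[j - 1] + cost;
                int deletion = prev[j] + 1;
                int insertion = curr[j - 1] + 1;
                curr[j] = Math.min(substitution, Math.min(deletion, insertion));
            }
            int[] tmp = prev; // the row just filled becomes the previous row
            prev = curr;
            curr = tmp;
        }
        return prev[s2.length()];
    }

    // Same ratio as in the answer: 1.0 means identical, 0.0 means nothing in common.
    public static double similarity(String s1, String s2) {
        int longerLength = Math.max(s1.length(), s2.length());
        if (longerLength == 0) {
            return 1.0; // both strings are empty
        }
        return (longerLength - editDistance(s1, s2)) / (double) longerLength;
    }

    // Returns the candidate with the highest similarity, or null for an empty list.
    public static String bestMatch(String target, List<String> candidates) {
        String best = null;
        double bestScore = -1.0;
        for (String candidate : candidates) {
            double score = similarity(target, candidate);
            if (score > bestScore) {
                bestScore = score;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> candidates = Arrays.asList("The fox jumped", "The fox", "A lazy dog");
        System.out.println(bestMatch("The quick fox jumped", candidates));
    }
}
```

Since the matches still need a manual check, it may be more useful in practice to return the top few candidates together with their scores rather than a single winner.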

user493744 · Answer

I translated the Levenshtein distance algorithm into JavaScript:

    String.prototype.LevenshteinDistance = function (s2) {
        var array = new Array(this.length + 1);
        for (var i = 0; i < this.length + 1; i++)
            array[i] = new Array(s2.length + 1);

        for (var i = 0; i < this.length + 1; i++)
            array[i][0] = i;
        for (var j = 0; j < s2.length + 1; j++)
            array[0][j] = j;

        for (var i = 1; i < this.length + 1; i++) {
            for (var j = 1; j < s2.length + 1; j++) {
                if (this[i - 1] == s2[j - 1]) array[i][j] = array[i - 1][j - 1];
                else {
                    array[i][j] = Math.min(array[i][j - 1] + 1, array[i - 1][j] + 1);
                    array[i][j] = Math.min(array[i][j], array[i - 1][j - 1] + 1);
                }
            }
        }
        return array[this.length][s2.length];
    };

Florian Fankhauser · Answer

You could use Levenshtein distance to calculate the difference between two strings. http://en.wikipedia.org/wiki/Levenshtein_distance

Thibault Debatty · Answer

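There are indeed a lot of string similarity measures out there:

Levenshtein edit distance;
Damerau-Levenshtein distance;
Jaro-Winkler similarity;
Longest Common Subsequence edit distance;
Q-Gram (Ukkonen);
n-Gram distance (Kondrak);
Jaccard index;
Sorensen-Dice coefficient;
Cosine similarity;
...

Explanations and Java implementations of these can be found here: https://github.com/tdebatty/Java-string-similarity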
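As a small illustration of one entry in the list above, the Longest Common Subsequence edit distance can be sketched as follows. It counts only insertions and deletions, so a substitution costs 2 instead of 1 as in Levenshtein. The class and method names are hypothetical and this is not the code from the linked repository.

```java
public class LcsDistanceSketch {

    // Length of the longest common subsequence, using the classic DP table.
    public static int lcsLength(String s1, String s2) {
        int[][] table = new int[s1.length() + 1][s2.length() + 1];
        for (int i = 1; i <= s1.length(); i++) {
            for (int j = 1; j <= s2.length(); j++) {
                if (s1.charAt(i - 1) == s2.charAt(j - 1)) {
                    // matching characters extend the best subsequence of both prefixes
                    table[i][j] = table[i - 1][j - 1] + 1;
                } else {
                    // otherwise drop one character from either string
                    table[i][j] = Math.max(table[i - 1][j], table[i][j - 1]);
                }
            }
        }
        return table[s1.length()][s2.length()];
    }

    // Edit distance when only insertions and deletions are allowed:
    // every character outside the common subsequence is deleted or inserted.
    public static int lcsDistance(String s1, String s2) {
        return s1.length() + s2.length() - 2 * lcsLength(s1, s2);
    }

    public static void main(String[] args) {
        System.out.println(lcsDistance("kitten", "sitting"));
    }
}
```

For "kitten" and "sitting" the common subsequence is "ittn", so this distance is 5, compared with a Levenshtein distance of 3 for the same pair.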

noelicus · Answer

You can achieve this using the Apache Commons Java library. Take a look at these two functions in it:
- getLevenshteinDistance
- getFuzzyDistance

Mohsen Abasi · Answer

Thanks to the first answerer. I think computeEditDistance(s1, s2) is calculated twice there. Because that call is expensive, I decided to improve the performance of the code. So:

    public class LevenshteinDistance {

        public static int computeEditDistance(String s1, String s2) {
            s1 = s1.toLowerCase();
            s2 = s2.toLowerCase();

            int[] costs = new int[s2.length() + 1];
            for (int i = 0; i <= s1.length(); i++) {
                int lastValue = i;
                for (int j = 0; j <= s2.length(); j++) {
                    if (i == 0) {
                        costs[j] = j;
                    } else {
                        if (j > 0) {
                            int newValue = costs[j - 1];
                            if (s1.charAt(i - 1) != s2.charAt(j - 1)) {
                                newValue = Math.min(Math.min(newValue, lastValue), costs[j]) + 1;
                            }
                            costs[j - 1] = lastValue;
                            lastValue = newValue;
                        }
                    }
                }
                if (i > 0) {
                    costs[s2.length()] = lastValue;
                }
            }
            return costs[s2.length()];
        }

        public static void printDistance(String s1, String s2) {
            double similarityOfStrings = 0.0;
            int editDistance = 0;
            if (s1.length() < s2.length()) { // s1 should always be bigger
                String swap = s1;
                s1 = s2;
                s2 = swap;
            }
            int bigLen = s1.length();
            editDistance = computeEditDistance(s1, s2);
            if (bigLen == 0) {
                similarityOfStrings = 1.0; /* both strings are zero length */
            } else {
                similarityOfStrings = (bigLen - editDistance) / (double) bigLen;
            }
            //////////////////////////
            //System.out.println(s1 + "-->" + s2 + ": " +
            //        editDistance + " (" + similarityOfStrings + ")");
            System.out.println(editDistance + " (" + similarityOfStrings + ")");
        }

        public static void main(String[] args) {
            printDistance("", "");
            printDistance("1234567890", "1");
            printDistance("1234567890", "12");
            printDistance("1234567890", "123");
            printDistance("1234567890", "1234");
            printDistance("1234567890", "12345");
            printDistance("1234567890", "123456");
            printDistance("1234567890", "1234567");
            printDistance("1234567890", "12345678");
            printDistance("1234567890", "123456789");
            printDistance("1234567890", "1234567890");
            printDistance("1234567890", "1234567980");

            printDistance("47/2010", "472010");
            printDistance("47/2010", "472011");

            printDistance("47/2010", "AB.CDEF");
            printDistance("47/2010", "4B.CDEFG");
            printDistance("47/2010", "AB.CDEFG");

            printDistance("The quick fox jumped", "The fox jumped");
            printDistance("The quick fox jumped", "The fox");
            printDistance("The quick fox jumped",
                    "The quick fox jumped off the balcany");
            printDistance("kitten", "sitting");
            printDistance("rosettacode", "raisethysword");
            printDistance(new StringBuilder("rosettacode").reverse().toString(),
                    new StringBuilder("raisethysword").reverse().toString());
            for (int i = 1; i < args.length; i += 2) {
                printDistance(args[i - 1], args[i]);
            }
        }
    }

Anton Gogolev · Answer

Theoretically, you can compare edit distances.

Laurence Gonsalves · Answer

This is typically done using an edit distance measure. Searching for "edit distance java" turns up a number of libraries, like this one.

duffymo · Answer

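This sounds like a plagiarism finder to me, if your string turns into a document. Searching with that term may turn up something good.

"Programming Collective Intelligence" has a chapter on determining whether two documents are similar. The code is in Python, but it is clean and easy to port.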
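For document-sized inputs, one common approach is to compare word-frequency vectors with cosine similarity (a measure also named in the accepted answer). The sketch below is only an illustration of that idea and is not a port of the book's code; the class name `CosineSketch`, the tokenization on non-word characters, and the choice to return 0.0 for an empty document are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

public class CosineSketch {

    // Counts lower-cased words, splitting on runs of non-word characters.
    public static Map<String, Integer> wordCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Cosine of the angle between the two word-count vectors.
    public static double cosine(String doc1, String doc2) {
        Map<String, Integer> a = wordCounts(doc1);
        Map<String, Integer> b = wordCounts(doc2);

        double dot = 0.0;
        for (Map.Entry<String, Integer> entry : a.entrySet()) {
            dot += entry.getValue() * b.getOrDefault(entry.getKey(), 0);
        }
        double normA = 0.0;
        for (int count : a.values()) {
            normA += count * count;
        }
        double normB = 0.0;
        for (int count : b.values()) {
            normB += count * count;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // assumption: a document without words matches nothing
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        System.out.println(cosine("The quick fox jumped", "The fox jumped"));
        System.out.println(cosine("The quick fox jumped", "A lazy dog slept"));
    }
}
```

Because this works on whole words, it tolerates reordering but not truncated or misspelled words, so for the abbreviated descriptions in the question a character-level measure is likely the better fit.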
