Find common substring between two strings

Question

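I'd like to compare two strings and keep the part that matches, splitting off where the comparison fails.

So, if I have two strings:

    string1 = apples
    string2 = appleses
    answer = apples

Another example, since the strings can contain more than one word:

    string1 = Apple pie available
    string2 = Apple pies
    answer = Apple pie

I'm sure there is a simple Python way of doing this, but I can't work it out. Any help and explanation is appreciated.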

thefourtheye · Accepted Answer

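This is called the longest common substring problem. Here is a simple, easy-to-understand but inefficient solution. The complexity of this algorithm is O(N^2), so it takes a long time to produce output for large strings.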

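    def longestSubstringFinder(string1, string2):
        # Compares string1[i + j] with string2[j], so string2 is only ever
        # slid to the right along string1 (other alignments are not tried).
        answer = ""
        len1, len2 = len(string1), len(string2)
        for i in range(len1):
            match = ""
            for j in range(len2):
                if (i + j < len1 and string1[i + j] == string2[j]):
                    match += string2[j]
                else:
                    if (len(match) > len(answer)):
                        answer = match
                    match = ""
            # also record a match that runs to the very end of string2
            if (len(match) > len(answer)):
                answer = match
        return answer

    print(longestSubstringFinder("Apple pie available", "Apple pies"))
    print(longestSubstringFinder("apples", "appleses"))
    print(longestSubstringFinder("bapples", "cappleses"))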

Output:

    Apple pie
    apples
    apples
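
For comparison, the longest common substring problem is usually solved with dynamic programming over common-suffix lengths, which takes O(N·M) time and handles any alignment of the two strings. The following is a minimal sketch of that textbook approach, not the answerer's code; the function name `longest_common_substring` is made up for illustration.

```python
def longest_common_substring(s1, s2):
    """Return one longest common substring of s1 and s2 (first one found in s1)."""
    # prev[j] holds the length of the common suffix of s1[:i-1] and s2[:j]
    prev = [0] * (len(s2) + 1)
    best_len = 0
    best_end = 0  # exclusive end index of the best match inside s1
    for i in range(1, len(s1) + 1):
        curr = [0] * (len(s2) + 1)
        for j in range(1, len(s2) + 1):
            if s1[i - 1] == s2[j - 1]:
                # extend the match that ended one character earlier in both strings
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_end = i
        prev = curr
    return s1[best_end - best_len:best_end]


print(longest_common_substring("Apple pie available", "Apple pies"))
print(longest_common_substring("Apple pies", "dd Apple pie available"))
```

Only two rows of the table are kept at a time, so memory use stays proportional to the length of the second string.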

RickardSjogren · Answer

For completeness, difflib in the standard library provides loads of sequence-comparison utilities. For instance, find_longest_match finds the longest common substring when used on strings. Example use:

    from difflib import SequenceMatcher

    string1 = "Apple pie available"
    string2 = "come have some Apple pies"

    match = SequenceMatcher(None, string1, string2).find_longest_match(0, len(string1), 0, len(string2))

    print(match)  # -> Match(a=0, b=15, size=9)
    print(string1[match.a: match.a + match.size])  # -> Apple pie
    print(string2[match.b: match.b + match.size])  # -> Apple pie
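
One thing to watch with longer inputs: SequenceMatcher has an `autojunk` heuristic (on by default) that can treat very frequent items as junk once the second sequence has 200 or more items, which may shorten the reported match. The sketch below wraps the call above with `autojunk=False`; the helper name `longest_match` is hypothetical and only there for illustration.

```python
from difflib import SequenceMatcher


def longest_match(a, b):
    """Return the longest block common to a and b, as a substring of a."""
    # autojunk=False disables the "popular items are junk" heuristic
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]


print(longest_match("Apple pie available", "come have some Apple pies"))
```

For short strings such as the ones in the question, the heuristic does not kick in and both forms should behave the same.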

Eric · Answer

    def common_start(sa, sb):
        """ returns the longest common substring from the beginning of sa and sb """
        def _iter():
            for a, b in zip(sa, sb):
                if a == b:
                    yield a
                else:
                    return

        return ''.join(_iter())

    >>> common_start("Apple pie available", "Apple pies")
    'Apple pie'

Or a slightly stranger way:

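    def stop_iter():
        """An easy way to break out of a generator"""
        raise StopIteration

    def common_start(sa, sb):
        # Python 3.7+ (PEP 479): a StopIteration raised inside the generator
        # expression surfaces as RuntimeError, so this only works on older versions
        return ''.join(a if a == b else stop_iter() for a, b in zip(sa, sb))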

Which might be more readable as:

    def terminating(cond):
        """An easy way to break out of a generator"""
        if cond:
            return True
        raise StopIteration

    def common_start(sa, sb):
        # same caveat: on Python 3.7+ the StopIteration becomes a RuntimeError
        return ''.join(a for a, b in zip(sa, sb) if terminating(a == b))
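
Since Python 3.7 (PEP 479), a StopIteration raised inside a generator or generator expression is converted into a RuntimeError, so the two generator-expression variants above may fail on current interpreters. A minimal sketch that keeps the same early-exit idea with a plain loop could look like this; the name `common_start_safe` is hypothetical.

```python
def common_start_safe(sa, sb):
    """Return the longest common prefix of sa and sb."""
    n = 0
    for a, b in zip(sa, sb):
        if a != b:
            break  # stop at the first mismatch instead of raising StopIteration
        n += 1
    return sa[:n]


print(common_start_safe("Apple pie available", "Apple pies"))
```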

jonas · Answer

One might also consider os.path.commonprefix, which works character by character and can therefore be used on any strings.

    import os

    common = os.path.commonprefix(['Apple pie available', 'Apple pies'])
    assert common == 'Apple pie'
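
Because os.path.commonprefix takes a list, the same call also covers more than two strings, and an empty list yields an empty string. A small sketch follows, with a hypothetical wrapper name `common_prefix`. Note that it only finds a shared prefix, not a common substring in the middle of the strings.

```python
import os.path


def common_prefix(*strings):
    """Return the longest prefix shared by all the given strings."""
    # commonprefix compares character by character, not path component by component
    return os.path.commonprefix(list(strings))


print(common_prefix("Apple pie available", "Apple pies", "Apple pie chart"))
print(repr(common_prefix()))
```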

SergeyR · Answer

The same as Evo's, but with an arbitrary number of strings to compare:

    def common_start(*strings):
        """ Returns the longest common substring
            from the beginning of the `strings`
        """
        def _iter():
            for z in zip(*strings):
                if z.count(z[0]) == len(z):  # check all elements in `z` are the same
                    yield z[0]
                else:
                    return

        return ''.join(_iter())

user7733798 · Answer

This fixes the bugs in the first answer:

    def longestSubstringFinder(string1, string2):
        answer = ""
        len1, len2 = len(string1), len(string2)
        for i in range(len1):
            for j in range(len2):
                lcs_temp = 0
                match = ''
                # extend the match starting at string1[i] / string2[j] as far as it goes
                while ((i+lcs_temp < len1) and (j+lcs_temp < len2) and string1[i+lcs_temp] == string2[j+lcs_temp]):
                    match += string2[j+lcs_temp]
                    lcs_temp += 1
                if (len(match) > len(answer)):
                    answer = match
        return answer

    print(longestSubstringFinder("dd Apple pie available", "Apple pies"))
    print(longestSubstringFinder("cov_basic_as_cov_x_gt_y_rna_genes_w1000000", "cov_rna15pcs_as_cov_x_gt_y_rna_genes_w1000000"))
    print(longestSubstringFinder("bapples", "cappleses"))
    print(longestSubstringFinder("apples", "apples"))

Birei · Answer

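Try:

    import itertools as it

    string1 = "Apple pie available"  # example inputs taken from the question
    string2 = "Apple pies"
    ''.join(el[0] for el in it.takewhile(lambda t: t[0] == t[1], zip(string1, string2)))

It does the comparison from the beginning of both strings.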

Rali Tsanova · Answer

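This isn't the most efficient way to do it, but it's what I could come up with and it works. If anyone can improve it, please do. It builds a matrix and puts a 1 wherever the characters match. It then scans the matrix to find the longest diagonal of 1s, keeping track of where it starts and ends. Finally it returns the substring of the input string between those start and end positions.

Note: this only finds one longest common substring. If there is more than one, you could create an array to store the results and return that. Also, it is case sensitive, so (Apple pie, apple pie) will return pple pie.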

    import numpy

    def longestSubstringFinder(str1, str2):
        answer = ""

        if len(str1) == len(str2):
            if str1 == str2:
                return str1
            else:
                longer = str1
                shorter = str2
        elif (len(str1) == 0 or len(str2) == 0):
            return ""
        elif len(str1) > len(str2):
            longer = str1
            shorter = str2
        else:
            longer = str2
            shorter = str1

        # matrix[i][j] is 1 where shorter[i] matches longer[j]
        matrix = numpy.zeros((len(shorter), len(longer)))

        for i in range(len(shorter)):
            for j in range(len(longer)):
                if shorter[i] == longer[j]:
                    matrix[i][j] = 1

        longest = 0
        start = [-1, -1]
        end = [-1, -1]
        for i in range(len(shorter)-1, -1, -1):
            for j in range(len(longer)):
                count = 0
                begin = [i, j]
                # walk down the diagonal from (i, j) while the characters match,
                # using an offset so the loop variables i and j are not clobbered
                while (i+count < len(shorter) and j+count < len(longer)
                       and matrix[i+count][j+count] == 1):
                    finish = [i+count, j+count]
                    count = count+1
                if count > longest:
                    longest = count
                    start = begin
                    end = finish

        answer = shorter[int(start[0]): int(end[0])+1]
        return answer
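
The note above suggests collecting the results in an array when several substrings tie for the longest length. One possible sketch of that idea is shown below; it uses a rolling common-suffix table instead of the NumPy matrix, and the function name `all_longest_common_substrings` is an assumption made for illustration, not part of the answer.

```python
def all_longest_common_substrings(s1, s2):
    """Return every distinct longest common substring of s1 and s2 as a list."""
    prev = [0] * (len(s2) + 1)  # common-suffix lengths for the previous row
    best_len = 0
    found = []
    for i in range(1, len(s1) + 1):
        curr = [0] * (len(s2) + 1)
        for j in range(1, len(s2) + 1):
            if s1[i - 1] == s2[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    # a strictly longer match invalidates everything collected so far
                    best_len = curr[j]
                    found = []
                if curr[j] == best_len:
                    sub = s1[i - best_len:i]
                    if sub not in found:
                        found.append(sub)
        prev = curr
    return found


print(all_longest_common_substrings("abcxyz", "xyzabc"))
print(all_longest_common_substrings("Apple pie", "apple pie"))
```

Like the answer's version, this comparison is case sensitive; lowercasing both inputs first would make it case insensitive.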

radhikesh93 · Answer

    def matchingString(x, y):
        match = ''
        for i in range(0, len(x)):
            for j in range(0, len(y)):
                k = 1
                # keep extending while the length-k substrings match and stay within both x and y
                while (i+k <= len(x) and j+k <= len(y) and x[i:i+k] == y[j:j+k]):
                    if len(match) <= len(x[i:i+k]):
                        match = x[i:i+k]
                    k = k+1
        return match

    print(matchingString('Apple', 'ale'))  # le
    print(matchingString('Apple pie available', 'Apple pies'))  # Apple pie

user3838498 · Answer

    def LongestSubString(s1, s2):
        left = 0
        right = len(s2)
        while(left < right):
            if(s2[left] not in s1):
                left = left+1
            else:
                if(s2[left:right] not in s1):
                    right = right - 1
                else:
                    return(s2[left:right])
        return ""  # no character of s2 occurs in s1

    s1 = "pineapple"
    s2 = "applc"
    print(LongestSubString(s1, s2))  # appl

Bantu Manjunath · Answer

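This is the classroom problem called "Longest sequence finder". Here is some simple code that worked for me. My inputs are lists holding a sequence, which can also be the characters of a string. It might help you: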

    def longest_substring(list1, list2):
        both = []
        if len(list1) > len(list2):
            small = list2
            big = list1
        else:
            small = list1
            big = list2
        removes = 0
        stop = 0
        for i in small:
            for j in big:
                if i != j:
                    removes += 1
                    if stop == 1:
                        break
                elif i == j:
                    both.append(i)
                    # note: pop() needs a list (use list(s) for a string s) and mutates the input
                    for q in range(removes+1):
                        big.pop(0)
                    stop = 1
                    break
            removes = 0
        return both

xXDaveXx · Answer

Returns the first longest common substring:

    def compareTwoStrings(string1, string2):
        list1 = list(string1)
        list2 = list(string2)

        match = []
        output = ""
        length = 0

        for i in range(0, len(list1)):
            if list1[i] in list2:
                match.append(list1[i])
                # j runs up to len(list1) inclusive so substrings ending at the last character are tried too
                for j in range(i + 1, len(list1) + 1):
                    if ''.join(list1[i:j]) in string2:
                        match.append(''.join(list1[i:j]))
                    else:
                        continue
            else:
                continue

        # keep the first of the longest candidates
        for string in match:
            if length < len(list(string)):
                length = len(list(string))
                output = string
            else:
                continue
        return output

wwii · Answer

First, a helper function adapted from the itertools pairwise recipe to produce substrings.

    import itertools

    def n_wise(iterable, n = 2):
        '''n = 2 -> (s0,s1), (s1,s2), (s2, s3), ...

        n = 3 -> (s0,s1, s2), (s1,s2, s3), (s2, s3, s4), ...'''
        a = itertools.tee(iterable, n)
        # advance the k-th copy by k items so that zip yields sliding windows
        for x, thing in enumerate(a[1:]):
            for _ in range(x+1):
                next(thing, None)
        return zip(*a)

Then a function that iterates over the substrings, longest first, and tests for membership (efficiency not considered).

    def foo(s1, s2):
        '''Finds the longest matching substring
        '''
        # the longest matching substring can only be as long as the shortest string
        # which string is shortest?
        shortest, longest = sorted([s1, s2], key = len)
        # iterate over substrings, longest substrings first
        # (down to length 1, so one- and two-character matches are found as well)
        for n in range(len(shortest), 0, -1):
            for sub in n_wise(shortest, n):
                sub = ''.join(sub)
                if sub in longest:
                    # return the first one found, it should be the longest
                    return sub

    s = "fdomainster"
    t = "exdomainid"
    print(foo(s,t))

    domain