Extract phone numbers from noisy strings

Question

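I have a table column that contains random data mixed with phone numbers in various formats. The column may contain:

Names
Phone numbers
Email addresses
HTML tags
Addresses (with numbers)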

Example:

_1) Call back from +79005346546, Conversation started<br>Phone: +79005346546<br>Called twice Came from google.com<br>IP: 77.106.46.202 the web page address is xxx.com utm_medium: cpc<br>utm_campaign: 32587871<br>utm_content: 5283041 79005346546 2) John Smith 3) xxx@yyy.com 4) John Smith 8 999 888 77 77 _

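The phone numbers are also written in different ways, for example 8 927 410 00 22, 8(927)410-00-22, +7(927)410-00-22, +7 (927) 410-00-22, (927)410 00 22, 927 410 00 22, 9(2741) 0 0 0-22, and so on.

The general rule is that a phone number contains 10 to 11 digits.

My guess is to use a regex to first strip email addresses from the string (they can contain phone numbers, as in 79990001122@gmail.com), and then use a regex to extract the phones, relying on the fact that each has 10 or 11 digits in a row separated by characters such as a space, (, ), + or -. I don't think anyone uses _._ as a digit separator in a phone number, so I don't want IP addresses such as _77.106.46.202_ from the first sample to be considered.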
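The 10-to-11-digit rule can be checked without a regex. What follows is only a minimal sketch, assuming a candidate substring has already been isolated; the table variable `@candidates` and the column names are hypothetical and used only for illustration. It removes the separators listed above with nested `REPLACE` calls (which also work on SQL Server 2014) and then tests that only digits remain and that there are 10 or 11 of them.

```sql
-- Hypothetical sample candidates, taken from the formats listed in the question.
DECLARE @candidates TABLE (Candidate VARCHAR(50));
INSERT @candidates (Candidate)
VALUES ('8 927 410 00 22'), ('8(927)410-00-22'), ('+7 (927) 410-00-22'),
       ('(927)410 00 22'),  ('9(2741) 0 0 0-22'), ('77.106.46.202');

SELECT c.Candidate,
       d.Digits,
       LooksLikePhone = CASE
                          WHEN d.Digits NOT LIKE '%[^0-9]%'       -- nothing but digits is left
                           AND LEN(d.Digits) BETWEEN 10 AND 11    -- the 10-to-11-digit rule
                          THEN 1 ELSE 0
                        END
FROM @candidates AS c
-- Strip the accepted separators: space, (, ), + and -. A dot is deliberately
-- not stripped, so an IP address still contains non-digits and is rejected.
CROSS APPLY (VALUES (REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
               c.Candidate, ' ', ''), '(', ''), ')', ''), '+', ''), '-', ''))) AS d(Digits);
```

This only classifies a candidate that is already isolated; finding the candidates inside a longer string is a separate step.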

So the question is how to get the phone numbers out of these values.

The final values I want to get from the samples above are:

_1) 79005346546 79005346546 79005346546 2) 3) 4) 89998887777 _

The server is Microsoft SQL Server 2014 - 12.0.2000.8 (X64) Standard Edition (64-bit).

Alan Burstein · Answer

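Updated (20200226)

There were a couple of comments suggesting that a CLR/regex solution might be faster than the ngrams8k solution I posted. I have been hearing this for six years and, without exception, the test harness tells a different story every time. In an earlier comment I already posted instructions for getting the Microsoft© MDQ family of CLR Regex functions up and running in just a few minutes. They were developed, tested, and tuned by Microsoft and ship with Master Data Services/Data Quality Services. I have used them for years and they are very good.

RegexReplace/RegexSplit vs. PatExtract8k/DigitsOnlyEE: 1,000,000 rows

Obviously you don't want to use functions in a WHERE clause, but my Regex is rusty AF, so I had to. To be fair, I did the same thing with DigitsOnlyEE in the WHERE clause of the N-Gram solution.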

    SET NOCOUNT ON;
    DBCC FREEPROCCACHE    WITH NO_INFOMSGS;
    DBCC DROPCLEANBUFFERS WITH NO_INFOMSGS;
    SET STATISTICS TIME ON;

    DECLARE
      @newData BIT           = 0,
      @string  VARCHAR(8000) = '1) Call back from +79005346546, Conversation started<br>Phone: +79005346546<br>Called twice Came from google.com<br>IP: 77.106.46.202 the web page address is xxx.com utm_medium: cpc<br>utm_campaign: 32587871<br>utm_content: 5283041 79005346546 ',
      @pattern VARCHAR(50)   = '[^0-9()+.-]',
      @srchLen INT           = 11;

    -- Set @newData = 1 to (re)build the 1,000,000-row test table.
    IF @newData = 1
    BEGIN
      IF OBJECT_ID('tempdb..#strings','U') IS NOT NULL DROP TABLE #strings;

      SELECT
        StringId = IDENTITY(INT,1,1),
        String   = REPLICATE(@string,ABS(CHECKSUM(NEWID())%3)+1)
      INTO #strings
      FROM dbo.rangeAB(1,1000000,1,1) AS r;
    END

    PRINT CHAR(10)+'Regex/CLR version Serial'+CHAR(10)+REPLICATE('-',90);
    SELECT regex.NewString
    FROM #strings AS s
    CROSS APPLY
    (
      SELECT STRING_AGG(clr.RegexReplace(f.Token,'[^0-9]','',0),' ')
      FROM   clr.RegexSplit(s.string,@pattern,N'[0-9()+.-]',0) AS f
      WHERE  f.IsValid = 1
      AND    LEN(clr.RegexReplace(f.Token,'[^0-9]','',0)) = @srchLen
    ) AS regex(NewString);

    PRINT CHAR(10)+'NGrams version Serial'+CHAR(10)+REPLICATE('-',90);
    SELECT ngramsStuff.NewString
    FROM #strings AS s
    CROSS APPLY
    (
      SELECT STRING_AGG(ee.digitsOnly,' ')
      -- Note: this call is passed the @string variable, not the s.String column.
      FROM   samd.patExtract8K(@string,@pattern) AS pe
      CROSS APPLY samd.digitsOnlyEE(pe.item)     AS ee
      WHERE  LEN(ee.digitsOnly) = @srchLen
    ) AS ngramsStuff(NewString)
    OPTION (MAXDOP 1);

    SET STATISTICS TIME OFF;
    GO

Test results

    Regex/CLR version Serial
    ------------------------------------------------------------------------------------------
     SQL Server Execution Times:
       CPU time = 19918 ms, elapsed time = 12355 ms.

    NGrams version Serial
    ------------------------------------------------------------------------------------------
     SQL Server Execution Times:
       CPU time = 844 ms, elapsed time = 971 ms.

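NGrams8k is very fast and doesn't require compiling a new assembly, learning a new programming language, enabling CLR functions, and so on. There are no garbage-collection issues either. Even the CLR N-Gram function that ships with MDS/DQS can't touch NGrams8k for performance (see the comments under my article).

End of update

First, grab a copy of ngrams8k and use it to build PatExtract8k (the DDL is at the bottom of this post). Then a quick warm-up: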

    DECLARE
      @string  VARCHAR(8000) = 'Call me later at 222-3333 or tomorrow at 312.555.2222, (313)555-6789, or at 1+800-555-4444 before noon. Thanks!',
      @pattern VARCHAR(50)   = '%[^0-9()+.-]%';

    SELECT pe.itemNumber, pe.itemIndex, pe.itemLength, pe.item
    FROM   samd.patExtract8K(@string,@pattern) AS pe
    WHERE  pe.itemLength > 1;

Returns:

    ItemNumber  ItemIndex   ItemLength  Item
    ----------- ----------- ----------- ----------------
    1           18          8           222-3333
    2           42          12          312.555.2222
    3           91          13          (313)555-6789
    4           112         14          1+800-555-4444
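To make the mechanics less of a black box, here is a rough, self-contained sketch of the same idea: walk the string one character at a time, keep the positions where a run of allowed characters begins, then measure each run. It is not the author's function. The inline tally CTEs (`E1`, `E4`, `tally`) are an assumed stand-in for NGrams8k, and the CTE and column names are chosen only for illustration.

```sql
DECLARE @string  VARCHAR(8000) = 'Call me later at 222-3333 or at 312.555.2222',
        @pattern VARCHAR(50)   = '%[^0-9.-]%';  -- matches any character that is NOT part of an item

WITH E1(n) AS (SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) AS v(n)), -- 10 rows
     E4(n) AS (SELECT 1 FROM E1 AS a CROSS JOIN E1 AS b CROSS JOIN E1 AS c CROSS JOIN E1 AS d), -- 10,000 rows
     tally(position) AS
     ( -- one row per character position in @string
       SELECT TOP (DATALENGTH(@string)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
       FROM E4
     ),
     starts(position, rest) AS
     ( -- positions where an item begins, plus the remainder of the string from there
       SELECT t.position, SUBSTRING(@string, t.position, DATALENGTH(@string))
       FROM   tally AS t
       WHERE  PATINDEX(@pattern, SUBSTRING(@string, t.position, 1)) = 0          -- this character is allowed
       AND   (t.position = 1                                                     -- and it is the first character
              OR PATINDEX(@pattern, SUBSTRING(@string, t.position - 1, 1)) > 0)  -- or the previous one is not allowed
     )
SELECT itemIndex  = s.position,
       itemLength = l.itemLength,
       item       = SUBSTRING(s.rest, 1, l.itemLength)
FROM   starts AS s
-- The item ends just before the next disallowed character, or at the end of the string.
CROSS APPLY (VALUES (ISNULL(NULLIF(PATINDEX(@pattern, s.rest), 0) - 1,
                            DATALENGTH(@string) - s.position + 1))) AS l(itemLength)
ORDER BY s.position;
```

Unlike this sketch, patExtract8K wraps the pattern in `%` wildcards itself when it measures each item, so it also accepts the pattern without them.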

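Note that the function returns the ordinal number of each match, its position in the string, the item length, and the item itself. The first three attributes can be leveraged for further processing. Note the comments in the code: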

    -- First for some easily consumable sample data.
    DECLARE @things TABLE (StringId INT IDENTITY, String VARCHAR(8000));
    INSERT @things (String)
    VALUES
      ('Call back from +79005346546, Conversation started<br>Phone: +79005346546<br>Called twice Came from google.com<br>IP: 77.106.46.202 the web page address is xxx.com utm_medium: cpc<br>utm_campaign: 32587871<br>utm_content: 5283041 79005346546 '),
      ('John Smith'),
      ('xxx@yyy.com'),
      ('John Smith 8 999 888 77 77');

    DECLARE @SrchLen INT = 11;

    SELECT
      StringId   = t.StringId,
      ItemIndex  = pe.itemIndex,
      ItemLength = @SrchLen,
      Item       = i2.Item
    FROM @things AS t
    -- Split each string into runs made up of digits and spaces only.
    CROSS APPLY samd.patExtract8K(t.String,'[^0-9 ]') AS pe
    -- Start of a run of @SrchLen consecutive digits ending the item; 0 when there is none.
    CROSS APPLY (VALUES(PATINDEX('%'+REPLICATE('[0-9]',@SrchLen), pe.item))) AS i(Idx)
    -- Take that run; NULL when no such run was found.
    CROSS APPLY (VALUES(SUBSTRING(pe.Item,NULLIF(i.Idx,0),@SrchLen))) AS ns(NewString)
    -- Otherwise fall back to the item with its spaces removed (e.g. ' 8 999 888 77 77').
    CROSS APPLY (VALUES(ISNULL(ns.NewString, REPLACE(pe.item,' ','')))) AS i2(Item)
    WHERE pe.itemLength >= @SrchLen;

Returns:

    StringId    ItemIndex            ItemLength  Item
    ----------- -------------------- ----------- -----------
    1           17                   11          79005346546
    1           62                   11          79005346546
    1           221                  11          79005346546
    4           11                   11          89998887777

Next, you can handle the outer rows (the strings that contain no phone number) and the row-to-column concatenation like this:

    WITH t AS
    (
      SELECT i2.Item, t.StringId
      FROM @things AS t
      CROSS APPLY samd.patExtract8K(t.String,'[^0-9 ]') AS pe
      CROSS APPLY (VALUES(PATINDEX('%'+REPLICATE('[0-9]',@SrchLen), pe.item))) AS i(Idx)
      CROSS APPLY (VALUES(SUBSTRING(pe.Item,NULLIF(i.Idx,0),11)))              AS ns(NewString)
      CROSS APPLY (VALUES(ISNULL(ns.NewString, REPLACE(pe.item,' ',''))))      AS i2(Item)
      WHERE pe.itemLength >= @SrchLen
    )
    SELECT
      StringId  = t2.StringId,
      NewString = ISNULL((
        SELECT t.item+' '
        FROM   t
        WHERE  t.StringId = t2.StringId
        FOR XML PATH('')),'')
    FROM      @things AS t2
    LEFT JOIN t       AS t1 ON t2.StringId = t1.StringId
    GROUP BY  t2.StringId;

Returns:

    StringId  NewString
    --------- --------------------------------------
    1         79005346546 79005346546 79005346546
    2
    3
    4         89998887777
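The row-to-column concatenation can also be tried in isolation. The sketch below is a variant rather than the query above: it assumes the extracted numbers already sit in a hypothetical table variable `@phones`, and it uses the common `STUFF` + `FOR XML PATH(''), TYPE` idiom, which avoids the trailing space and needs no `GROUP BY`.

```sql
-- Hypothetical inputs: one row per source string, and one row per extracted phone number.
DECLARE @ids    TABLE (StringId INT PRIMARY KEY);
DECLARE @phones TABLE (StringId INT, Item VARCHAR(20));

INSERT @ids    (StringId)       VALUES (1), (2), (3), (4);
INSERT @phones (StringId, Item) VALUES (1, '79005346546'), (1, '79005346546'),
                                       (1, '79005346546'), (4, '89998887777');

SELECT i.StringId,
       NewString = ISNULL(STUFF((
                     SELECT ' ' + p.Item                 -- prefix every item with a space
                     FROM   @phones AS p
                     WHERE  p.StringId = i.StringId
                     FOR XML PATH(''), TYPE              -- concatenate the rows into one XML value
                   ).value('.', 'VARCHAR(8000)'), 1, 1, ''), '')  -- STUFF drops the leading space
FROM   @ids AS i;  -- driving from @ids keeps strings with no phone number (empty NewString)
```

On SQL Server 2017 and later, `STRING_AGG` does the same job more directly, but it is not available on the 2014 server in the question.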

I wish I had a little more time for additional details, but this took a bit longer than planned. Any questions are welcome.

patExtract8K:

    CREATE FUNCTION samd.patExtract8K
    (
      @string  VARCHAR(8000),
      @pattern VARCHAR(50)
    )
    /*****************************************************************************************
    [Description]:
     This can be considered a T-SQL inline table valued function (iTVF) equivalent of
     Microsoft's mdq.RegexExtract except that:

     1. It includes each matching substring's position in the string
     2. It accepts varchar(8000) instead of nvarchar(4000) for the input string, varchar(50)
        instead of nvarchar(4000) for the pattern
     3. The mask parameter is not required and therefore does not exist.
     4. You have to specify what text we're searching for as an exclusion; e.g. for numeric
        characters you should search for '[^0-9]' instead of '[0-9]'.
     5. There is no parameter for naming a "capture group". Using the variable below, both
        the following queries will return the same result:

         DECLARE @string nvarchar(4000) = N'123 Main Street';

         SELECT item FROM samd.patExtract8K(@string, '[^0-9]');
         SELECT clr.RegexExtract(@string, N'(?<number>(\d+))(?<street>(.*))', N'number', 1);

     Alternatively, you can think of patExtract8K as Chris Morris' PatternSplitCM (found here:
     http://www.sqlservercentral.com/articles/String+Manipulation/94365/) but only returns the
     rows where [matched]=0. The key benefit is that it performs substantially better because
     you are only returning the number of rows required instead of returning twice as many
     rows then filtering out half of them. Furthermore, because we're

     The following two sets of queries return the same result:

     DECLARE @string varchar(100) = 'xx123xx555xx999';
     BEGIN
     -- QUERY #1
       -- patExtract8K
       SELECT ps.itemNumber, ps.item
       FROM samd.patExtract8K(@string, '[^0-9]') ps;

       -- patternSplitCM
       SELECT itemNumber = row_number() over (order by ps.itemNumber), ps.item
       FROM dbo.patternSplitCM(@string, '[^0-9]') ps
       WHERE [matched] = 0;

     -- QUERY #2
       SELECT ps.itemNumber, ps.item
       FROM samd.patExtract8K(@string, '[0-9]') ps;

       SELECT itemNumber = row_number() over (order by itemNumber), item
       FROM dbo.patternSplitCM(@string, '[0-9]')
       WHERE [matched] = 0;
     END;

    [Compatibility]:
     SQL Server 2008+

    [Syntax]:
    --===== Autonomous
     SELECT pe.ItemNumber, pe.ItemIndex, pe.ItemLength, pe.Item
     FROM samd.patExtract8K(@string,@pattern) pe;

    --===== Against a table using APPLY
     SELECT t.someString, pe.ItemIndex, pe.ItemLength, pe.Item
     FROM samd.SomeTable t
     CROSS APPLY samd.patExtract8K(t.someString, @pattern) pe;

    [Parameters]:
     @string  = varchar(8000); the input string
     @pattern = varchar(50); pattern to search for

    [Returns]:
     itemNumber = bigint; the instance or ordinal position of the matched substring
     itemIndex  = bigint; the location of the matched substring inside the input string
     itemLength = int; the length of the matched substring
     item       = varchar(8000); the returned text

    [Developer Notes]:
     1. Requires NGrams8k

     2. patExtract8K does not return any rows on NULL or empty strings. Consider using
        OUTER APPLY or append the function with the code below to force the function to return
        a row on empty or NULL inputs:

        UNION ALL SELECT 1, 0, NULL, @string WHERE nullif(@string,'') IS NULL;

     3. patExtract8K is not case sensitive; use a case sensitive collation for
        case-sensitive comparisons

     4. patExtract8K is deterministic. For more about deterministic functions see:
        https://msdn.Microsoft.com/en-us/library/ms178091.aspx

     5. patExtract8K performs substantially better with a parallel execution plan, often
        2-3 times faster. For queries that leverage patExtract8K that are not getting a
        parallel execution plan you should consider performance testing using Traceflag 8649
        in Development environments and Adam Machanic's make_parallel in production.

    [Examples]:
    --===== (1) Basic extract all groups of numbers:
      WITH temp(id, txt) as
     (
       SELECT * FROM (values
       (1, 'hello 123 fff 1234567 and today;""o999999999 tester 44444444444444 done'),
       (2, 'syat 123 ff tyui( 1234567 and today 999999999 tester 777777 done'),
       (3, '&**OOOOO=+ + + // ==?76543// and today !!222222\tester{}))22222444 done'))t(x,xx)
     )
     SELECT
       [temp.id] = t.id,
       pe.itemNumber,
       pe.itemIndex,
       pe.itemLength,
       pe.item
     FROM        temp AS t
     CROSS APPLY samd.patExtract8K(t.txt, '[^0-9]') AS pe;
    -----------------------------------------------------------------------------------------
    Revision History:
     Rev 00 - 20170801 - Initial Development - Alan Burstein
     Rev 01 - 20180619 - Complete re-write   - Alan Burstein
    *****************************************************************************************/
    RETURNS TABLE WITH SCHEMABINDING AS RETURN
    SELECT itemNumber = ROW_NUMBER() OVER (ORDER BY f.position),
           itemIndex  = f.position,
           itemLength = itemLen.l,
           item       = SUBSTRING(f.token, 1, itemLen.l)
    FROM
    (
      SELECT ng.position, SUBSTRING(@string,ng.position,DATALENGTH(@string))
      FROM   samd.NGrams8k(@string, 1) AS ng
      WHERE  PATINDEX(@pattern, ng.token) <  --<< this token does NOT match the pattern
               ABS(SIGN(ng.position-1)-1) +  --<< are you the first row?  OR
               PATINDEX(@pattern,SUBSTRING(@string,ng.position-1,1)) --<< always 0 for 1st row
    ) AS f(position, token)
    CROSS APPLY (VALUES(ISNULL(NULLIF(PATINDEX('%'+@pattern+'%',f.token),0),
      DATALENGTH(@string)+2-f.position)-1)) AS itemLen(l);
    GO