Regular expression to extract text from HTML

Question

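I would like to extract all the text (whether displayed or not) from a general HTML page.

I want to remove:

HTML tags
any JavaScript
CSS styles

Is there a regular expression (one or more) that will achieve that?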

S.Lott · Accepted Answer

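You can't really parse HTML with regular expressions. It's too complex. REs won't handle <![CDATA[ sections correctly at all. Further, some kinds of common HTML, such as &lt;text>, work in a browser as proper text but can baffle a naive RE.

You'll be happier and more successful with a proper HTML parser. Python folks often use something like Beautiful Soup to parse HTML and strip out tags and scripts.

Also, browsers tolerate malformed HTML by design. So you will often find yourself trying to parse HTML that is clearly improper but happens to work fine in a browser.

You might be able to parse bad HTML with REs. All it requires is patience and hard work. But it's often simpler to use someone else's parser.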
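To make the parser route concrete, here is a minimal sketch. It swaps in Python's standard-library html.parser instead of Beautiful Soup so that it runs without extra packages; the names TextExtractor and extract_text are made up for this illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &lt; into characters.
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-blank text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the text of an HTML string, joined with single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Comments are dropped because the parser reports them through a separate callback that this class does not override. Note that this joins text chunks with a space, which may not match how a browser would lay the text out.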

nickf · Answer

Remove JavaScript and CSS:

<(script|style).*?</\1>

Remove tags:

<.*?>
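For illustration, the two patterns above could be applied in Python roughly like this. The flags are an assumption on my part: the answer does not say which ones it expects, but "." has to match newlines for multi-line scripts to be removed, and tag names may be upper case. The function name strip_markup is invented for this sketch.

```python
import re

# DOTALL lets "." span newlines; IGNORECASE also covers <SCRIPT> and <STYLE>.
SCRIPT_OR_STYLE = re.compile(r"<(script|style).*?</\1>", re.DOTALL | re.IGNORECASE)
ANY_TAG = re.compile(r"<.*?>", re.DOTALL)


def strip_markup(html):
    """Remove script/style elements first, then every remaining tag."""
    without_code = SCRIPT_OR_STYLE.sub("", html)
    return ANY_TAG.sub("", without_code)
```

The order matters: if the tags were stripped first, the script and style bodies would be left behind as text. This also shows the weakness the accepted answer describes: a bare "<" in ordinary text, as in "a < b and c > d", is taken for the start of a tag, and everything up to the next ">" is dropped.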

Joe Bergevin · Answer

I needed a regex solution (in PHP) that would return plain text as well as PHPSimpleDOM does (or better), only much faster. Here is the solution I came up with:

function plaintext($html)
{
    // Remove comments and any content found in the comment area
    // (strip_tags only removes the actual tags).
    $plaintext = preg_replace('#<!--.*?-->#s', '', $html);

    // Put a space between list items (strip_tags just removes the tags).
    $plaintext = preg_replace('#</li>#', ' </li>', $plaintext);

    // Remove all script and style tags.
    $plaintext = preg_replace('#<(script|style)\b[^>]*>(.*?)</(script|style)>#is', "", $plaintext);

    // Remove br tags (missed by strip_tags).
    $plaintext = preg_replace("#<br[^>]*?>#", " ", $plaintext);

    // Remove all remaining html.
    $plaintext = strip_tags($plaintext);

    return $plaintext;
}

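I tested this on some complicated sites (forums seem to contain some of the hardest HTML to parse), and this method returned the same result as PHPSimpleDOM plaintext, only much faster. It also handled list items (li tags) properly, which PHPSimpleDOM did not.

As for the speed:

SimpleDom: 0.03248 sec.
RegEx: 0.00087 sec.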

About 37 times faster!

Chris Noe · Answer

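Contemplating doing this with regular expressions is daunting. Have you considered XSLT? The XPath expression to extract all of the text nodes in an XHTML document, minus script and style content, would be: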

//body//text()[not(ancestor::script)][not(ancestor::style)]
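As a rough illustration of what that expression selects, here is a sketch in Python. The standard-library xml.etree.ElementTree supports neither the text() node test nor the ancestor axis, so instead of evaluating the XPath this walks the tree by hand and picks the same nodes. It assumes well-formed XHTML, and the helper names (local_name, text_nodes, body_text) are made up for this example.

```python
import xml.etree.ElementTree as ET

SKIPPED = {"script", "style"}


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def text_nodes(element):
    """Yield text under element in document order, skipping script/style subtrees."""
    if local_name(element.tag) in SKIPPED:
        return
    if element.text:
        yield element.text
    for child in element:
        yield from text_nodes(child)
        # Text after a child's closing tag belongs to the parent element.
        if child.tail:
            yield child.tail


def body_text(xhtml):
    """Return the non-blank text nodes inside <body>, stripped of whitespace."""
    root = ET.fromstring(xhtml)
    for element in root.iter():
        if local_name(element.tag) == "body":
            return [t.strip() for t in text_nodes(element) if t.strip()]
    return []
```

With an XPath 1.0 engine available, the expression itself could be used directly. This version only shows which nodes end up in the result.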

Matthew Scharley · Answer

Using Perl syntax for defining the regexes, a start might be:

!<body.*?>(.*)</body>!smi

Then apply the following replacements to the result of that group:

!<script.*?</script>!!smi
!<[^>]+/[ \t]*>!!smi
!</?([a-z]+).*?>!!smi
/<!--.*?-->//smi

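Of course, this won't give you nicely formatted text, but it strips out all the HTML (mostly; there are a few cases where it might not work quite right). A better idea, though, is to use an XML parser in whatever language you are using to parse the HTML properly and extract the text from that.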
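For readers who don't use Perl, the same pipeline might look like this in Python. This is a direct port under the assumption that Perl's s, m and i modifiers map to re.DOTALL, re.MULTILINE and re.IGNORECASE; the function name body_text is invented here.

```python
import re

FLAGS = re.DOTALL | re.MULTILINE | re.IGNORECASE

BODY = re.compile(r"<body.*?>(.*)</body>", FLAGS)

# Applied in this order to the captured body content.
SUBSTITUTIONS = [
    re.compile(r"<script.*?</script>", FLAGS),  # script elements and their content
    re.compile(r"<[^>]+/[ \t]*>", FLAGS),       # self-closing tags such as <br />
    re.compile(r"</?([a-z]+).*?>", FLAGS),      # remaining opening and closing tags
    re.compile(r"<!--.*?-->", FLAGS),           # comments
]


def body_text(html):
    """Return the body content with tags, scripts and comments removed."""
    match = BODY.search(html)
    if match is None:
        return ""
    text = match.group(1)
    for pattern in SUBSTITUTIONS:
        text = pattern.sub("", text)
    return text
```

Two things to notice: removed tags are replaced with nothing, so "Hello<br />world" becomes "Helloworld", and there is no rule for style elements, so the CSS inside a style element in the body is kept as text once its tags are stripped.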

David Avsajanishvili · Answer

The simplest way for simple HTML (example in Python):

import re

text = "<p>This is my> <strong>example</strong>HTML,<br /> containing tags</p>"
" ".join([t.strip() for t in re.findall(r"<[^>]+>|[^<]+", text) if '<' not in t])

This returns:

'This is my> example HTML, containing tags'

Ayush · Answer

Here is a function that removes even the most complex HTML tags.

function strip_html_tags( $text )
{
    $text = preg_replace(
        array(
            // Remove invisible content
            '@<head[^>]*?>.*?</head>@siu',
            '@<style[^>]*?>.*?</style>@siu',
            '@<script[^>]*?.*?</script>@siu',
            '@<object[^>]*?.*?</object>@siu',
            '@<embed[^>]*?.*?</embed>@siu',
            '@<applet[^>]*?.*?</applet>@siu',
            '@<noframes[^>]*?.*?</noframes>@siu',
            '@<noscript[^>]*?.*?</noscript>@siu',
            '@<noembed[^>]*?.*?</noembed>@siu',
            // Add line breaks before & after blocks
            '@<((br)|(hr))@iu',
            '@</?((address)|(blockquote)|(center)|(del))@iu',
            '@</?((div)|(h[1-9])|(ins)|(isindex)|(p)|(pre))@iu',
            '@</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))@iu',
            '@</?((table)|(th)|(td)|(caption))@iu',
            '@</?((form)|(button)|(fieldset)|(legend)|(input))@iu',
            '@</?((label)|(select)|(optgroup)|(option)|(textarea))@iu',
            '@</?((frameset)|(frame)|(iframe))@iu',
        ),
        array(
            ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
            "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0",
        ),
        $text
    );

    // Remove all remaining tags and comments and return.
    return strip_tags( $text );
}

unigogo · Answer

I don't know, but this page might help.

Shiroy · Answer

Can't you just use the WebBrowser control that is available in C#?

System.Windows.Forms.WebBrowser wc = new System.Windows.Forms.WebBrowser();
wc.DocumentText = "<html><body>blah blah<b>foo</b></body></html>";
System.Windows.Forms.HtmlDocument h = wc.Document;
Console.WriteLine(h.Body.InnerText);

mahesh · Answer

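string decode = System.Web.HttpUtility.HtmlDecode(your_htmlfile.html);
Regex objRegExp = new Regex("<(.|\n)+?>");
// The original passes an undefined variable `g` here; `decode` is assumed.
string replace = objRegExp.Replace(decode, "");
// `k` is not defined in the original answer; presumably another substring to strip.
replace = replace.Replace(k, string.Empty);
// Trim returns a new string, so the result has to be assigned.
replace = replace.Trim("\t\n ".ToCharArray());

Then take a label and set "label.Text = replace;" to see the output on the label.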


olliej · Answer

I believe you can just do

document.body.innerText

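This should return the content of every text node in the document, whether displayed or not.

[Edit (olliej): sigh, never mind. This only works in Safari and IE, and I can't be bothered to download a Firefox nightly to see whether it exists in trunk :-/]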