Truncate text containing HTML, ignoring tags

Question

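I want to truncate some text (loaded from a database or a text file), but it contains HTML, so the tags are counted and less text is returned. This can also leave tags unclosed or only partially closed (so Tidy may not work properly, and there is still less content). How can I truncate based on the text alone (and probably stop when reaching a table, since that could cause more complex problems)?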

substr("Hello, my <strong>name</strong> is <em>Sam</em>. I&acute;m a web developer.",0,26)."..."

Results in:

Hello, my <strong>name</st...

What I want is:

Hello, my <strong>name</strong> is <em>Sam</em>. I&acute;m...

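How can I do this?

My question is about how to do it in PHP, but it would be good to know how to do it in C# as well. Either is fine, since I think I could port the method (unless it is a built-in method).

Also note that I have included the HTML entity &acute;, which should be counted as a single character (rather than 7 characters, as in this example).

strip_tags is a fallback, but I would lose formatting and links, and it would still have the HTML entity problem.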
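For reference, the plain-text fallback could look roughly like the sketch below. The helper name truncatePlain is made up for illustration, and it assumes UTF-8 input and the mbstring extension. Decoding entities before counting takes care of the &acute; problem, but all formatting and links are still lost:

```php
<?php
// Hypothetical helper (not from any library): plain-text fallback that drops all markup.
function truncatePlain(string $html, int $maxChars, string $ellipsis = '...'): string
{
    // Remove tags first, then decode entities so "&acute;" becomes one character.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES | ENT_HTML5, 'UTF-8');

    if (mb_strlen($text, 'UTF-8') <= $maxChars) {
        // Short enough: re-escape so the result is safe to embed in HTML again.
        return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
    }

    // Cut by characters (not bytes), then re-escape and append the ellipsis.
    $cut = mb_substr($text, 0, $maxChars, 'UTF-8');
    return htmlspecialchars($cut, ENT_QUOTES, 'UTF-8') . $ellipsis;
}
```

With the example string and a limit of 26, this keeps the text up to "I´m" and appends the ellipsis, with the <strong> and <em> tags gone.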

Søren Løvborg · Accepted Answer

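If you are using valid XHTML, it is simple to parse the HTML and make sure tags are handled properly. You just need to track which tags have been opened so far, and make sure to close them again "on your way out".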

    <?php
    header('Content-type: text/plain; charset=utf-8');

    function printTruncated($maxLength, $html, $isUtf8=true)
    {
        $printedLength = 0;
        $position = 0;
        $tags = array();

        // For UTF-8, we need to count multibyte sequences as one character.
        $re = $isUtf8
            ? '{</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;|[\x80-\xFF][\x80-\xBF]*}'
            : '{</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;}';

        while ($printedLength < $maxLength && preg_match($re, $html, $match, PREG_OFFSET_CAPTURE, $position))
        {
            list($tag, $tagPosition) = $match[0];

            // Print text leading up to the tag.
            $str = substr($html, $position, $tagPosition - $position);
            if ($printedLength + strlen($str) > $maxLength)
            {
                print(substr($str, 0, $maxLength - $printedLength));
                $printedLength = $maxLength;
                break;
            }

            print($str);
            $printedLength += strlen($str);
            if ($printedLength >= $maxLength) break;

            if ($tag[0] == '&' || ord($tag) >= 0x80)
            {
                // Pass the entity or UTF-8 multibyte sequence through unchanged.
                print($tag);
                $printedLength++;
            }
            else
            {
                // Handle the tag.
                $tagName = $match[1][0];
                if ($tag[1] == '/')
                {
                    // This is a closing tag.
                    $openingTag = array_pop($tags);
                    assert($openingTag == $tagName); // check that tags are properly nested.
                    print($tag);
                }
                else if ($tag[strlen($tag) - 2] == '/')
                {
                    // Self-closing tag.
                    print($tag);
                }
                else
                {
                    // Opening tag.
                    print($tag);
                    $tags[] = $tagName;
                }
            }

            // Continue after the tag.
            $position = $tagPosition + strlen($tag);
        }

        // Print any remaining text.
        if ($printedLength < $maxLength && $position < strlen($html))
            print(substr($html, $position, $maxLength - $printedLength));

        // Close any open tags.
        while (!empty($tags))
            printf('</%s>', array_pop($tags));
    }

    printTruncated(10, '<b>&lt;Hello&gt;</b> <img src="world.png" alt="" /> world!'); print("\n");

    printTruncated(10, '<table><tr><td>Heck, </td><td>throw</td></tr><tr><td>in a</td><td>table</td></tr></table>'); print("\n");

    printTruncated(10, "<em><b>Hello</b>&#20;w\xC3\xB8rld!</em>"); print("\n");

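Encoding note: The above code assumes the XHTML is UTF-8 encoded. ASCII-compatible single-byte encodings (such as Latin-1) are also supported; just pass false as the third argument. Other multibyte encodings are not supported, though you may hack in support by using mb_convert_encoding to convert to UTF-8 before calling the function, then converting back again in every print statement.

(You should always be using UTF-8, though.)

Edit: Updated to handle character entities and UTF-8. Fixed a bug where the function would print one character too many if that character was a character entity.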
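As a rough sketch of that hack: instead of touching every print statement, the output can be captured with an output buffer and converted back once. The wrapper name truncateInEncoding and its parameter order are made up for illustration. It assumes the mbstring extension and a callable that prints truncated UTF-8 HTML and takes ($maxLength, $html), as printTruncated does:

```php
<?php
// Hypothetical wrapper (not part of the answer's code): lets a UTF-8-only,
// printing truncator work on input in another multibyte encoding.
function truncateInEncoding(callable $truncate, int $maxLength, string $html, string $encoding): string
{
    // 1. Convert the input to UTF-8 before calling the truncator.
    $utf8 = mb_convert_encoding($html, 'UTF-8', $encoding);

    // 2. Capture everything the truncator prints instead of converting
    //    inside each individual print statement.
    ob_start();
    $truncate($maxLength, $utf8);
    $result = ob_get_clean();

    // 3. Convert the captured output back to the original encoding.
    return mb_convert_encoding($result, $encoding, 'UTF-8');
}
```

It would be called as truncateInEncoding('printTruncated', 10, $html, 'SJIS'), where the encoding name is any name mbstring recognises.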

alockwood05 · Answer

I've written a function that truncates HTML just as you suggest, but instead of printing it out, it keeps everything in a string variable. It handles HTML entities as well.

    /**
     * function to truncate and then clean up end of the HTML,
     * truncates by counting characters outside of HTML tags
     *
     * @author alex lockwood, alex dot lockwood at websightdesign
     *
     * @param string $str the string to truncate
     * @param int $len the number of characters
     * @param string $end the end string for truncation
     * @return string $truncated_html
     */
    // Class method: place it inside a class, or drop "public static" to use it as a plain function.
    public static function truncateHTML($str, $len, $end = '&hellip;'){
        //find all tags
        $tagPattern = '/(<\/?)([\w]*)(\s*[^>]*)>?|&[\w#]+;/i';  //match html tags and entities
        preg_match_all($tagPattern, $str, $matches, PREG_OFFSET_CAPTURE | PREG_SET_ORDER );
        //WSDDebug::dump($matches); exit;
        $i = 0;
        $openTags = array();
        $warnings = array();
        //loop through each found tag that is within the $len, add those characters to the len,
        //also track open and closed tags
        // $matches[$i][0] = the whole tag string  --the only applicable field for html entities
        // IF its not matching an &htmlentity; the following apply
        // $matches[$i][1] = the start of the tag either '<' or '</'
        // $matches[$i][2] = the tag name
        // $matches[$i][3] = the end of the tag
        // $matches[$i][$j][0] = the string
        // $matches[$i][$j][1] = the string offset

        // Check that the match exists before reading its offset.
        while(isset($matches[$i]) && $matches[$i][0][1] < $len){

            $len = $len + strlen($matches[$i][0][0]);
            if(substr($matches[$i][0][0],0,1) == '&' )
                $len = $len-1;

            //if $matches[$i][2] is undefined then its an html entity, want to ignore those for tag counting
            //ignore empty/singleton tags for tag counting
            if(!empty($matches[$i][2][0]) && !in_array($matches[$i][2][0],array('br','img','hr', 'input', 'param', 'link'))){
                //double check
                if(substr($matches[$i][3][0],-1) !='/' && substr($matches[$i][1][0],-1) !='/')
                    $openTags[] = $matches[$i][2][0];
                elseif(end($openTags) == $matches[$i][2][0]){
                    array_pop($openTags);
                }else{
                    $warnings[] = "html has some tags mismatched in it:  $str";
                }
            }

            $i++;
        }

        $closeTagString = '';

        if (!empty($openTags)){
            $openTags = array_reverse($openTags);
            foreach ($openTags as $t){
                $closeTagString .= "</" . $t . ">";
            }
        }

        // Default: return the string unchanged (also used when no space follows the cut point).
        $truncated_html = $str;

        if(strlen($str) > $len){
            // Finds the first space at or after the new length
            $lastWord = strpos($str, ' ', $len);
            if ($lastWord) {
                //truncate with new len last Word
                $str = substr($str, 0, $lastWord);
                //finds last character
                $last_character = (substr($str, -1, 1));
                //add the end text
                $truncated_html = ($last_character == '.' ? $str : ($last_character == ',' ? substr($str, 0, -1) : $str) . $end);
                //restore any open tags
                $truncated_html .= $closeTagString;
            }
        }

        return $truncated_html;
    }

Kornel · Answer

A 100% accurate, but pretty difficult approach:

Iterate over the characters using the DOM
Use DOM methods to remove the remaining elements
Serialize the DOM
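The three DOM steps above could be sketched roughly as follows. This is a minimal illustration, not a hardened implementation: the function names truncateHtmlDom and trimNode are made up, the input is assumed to be a UTF-8 HTML fragment, the mbstring extension is assumed, and the `<?xml encoding="UTF-8">` prefix is a commonly used trick to make DOMDocument::loadHTML read the input as UTF-8.

```php
<?php
// Hypothetical sketch: count characters in text nodes, cut at the limit,
// remove everything after it, then serialize the remaining tree.
function truncateHtmlDom(string $html, int $maxChars, string $ellipsis = '...'): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);   // keep sloppy markup from emitting warnings
    // Wrap the fragment in a <div> so it has a single, known container.
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $root = $doc->getElementsByTagName('div')->item(0);
    $remaining = $maxChars;
    trimNode($root, $remaining, $ellipsis);

    // Serialize the children of the wrapper, not the wrapper itself.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

function trimNode(DOMNode $node, int &$remaining, string $ellipsis): void
{
    // Copy the child list first, because nodes are removed while iterating.
    foreach (iterator_to_array($node->childNodes) as $child) {
        if ($remaining <= 0) {
            $node->removeChild($child);             // budget used up: drop what follows
        } elseif ($child instanceof DOMText) {
            $length = mb_strlen($child->data, 'UTF-8');
            if ($length > $remaining) {
                // Cut inside this text node and mark the truncation.
                $child->data = mb_substr($child->data, 0, $remaining, 'UTF-8') . $ellipsis;
                $remaining = 0;
            } else {
                $remaining -= $length;
            }
        } else {
            trimNode($child, $remaining, $ellipsis); // elements: recurse into children
        }
    }
}
```

Because the parser has already decoded entities into single characters, entities are counted correctly without special handling, and the serializer closes every element that is still open. How non-ASCII characters are written back (literally or as entities) depends on the libxml serializer.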

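A brute-force approach:

Split the string into tags (not elements) and text fragments using preg_split('/(<tag>)/') with PREG_SPLIT_DELIM_CAPTURE.
Measure the text length you want (it will be every second element from the split; you can use html_entity_decode() to measure accurately)
Cut the string (trim &[^\s;]+$ at the end to get rid of a possibly chopped entity)
Fix it with HTML Tidy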
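The first three brute-force steps could be sketched like this. The function name truncateHtmlSplit is made up, UTF-8 input and the mbstring extension are assumed, and the sketch deviates from the list in one place: instead of trimming a chopped entity with a regex, it cuts the decoded text and re-escapes it. The final Tidy step is left out, so the result can still contain unclosed tags.

```php
<?php
// Hypothetical helper illustrating the split-and-count idea; not a full solution.
function truncateHtmlSplit(string $html, int $maxChars): string
{
    // With PREG_SPLIT_DELIM_CAPTURE, even indexes are text and odd indexes are tags.
    $parts = preg_split('/(<[^>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    $out = '';
    $remaining = $maxChars;

    foreach ($parts as $i => $part) {
        if ($i % 2 === 1) {
            $out .= $part;              // tags are copied through and cost nothing
            continue;
        }

        // Decode entities so that "&acute;" is measured as one character.
        $text = html_entity_decode($part, ENT_QUOTES | ENT_HTML5, 'UTF-8');
        $length = mb_strlen($text, 'UTF-8');

        if ($length <= $remaining) {
            $out .= $part;              // whole fragment fits: keep it verbatim
            $remaining -= $length;
            continue;
        }

        // Cut the decoded text and re-escape it. This avoids chopped entities,
        // at the price of emitting the literal character instead of the entity.
        $out .= htmlspecialchars(mb_substr($text, 0, $remaining, 'UTF-8'), ENT_QUOTES, 'UTF-8');
        break;                          // later tags are dropped, so some may stay unclosed
    }

    return $out;
}
```

For example, truncateHtmlSplit('<b>Hello world</b>', 5) returns '<b>Hello' with the <b> left open, which is exactly the kind of output the last step hands to HTML Tidy.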

periklis · Answer

I've used a nice function found at http://alanwhipple.com/2011/05/25/php-truncate-string-preserving-html-tags-words.

Stefan Gehrig · Answer

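The following is a simple state-machine parser which handles your test case successfully. It fails on nested tags, though, as it doesn't track the tags themselves. It also chokes on entities within HTML tags (e.g. in an href attribute of an <a> tag). So it cannot be considered a 100% solution to this problem, but because it's easy to understand it could be the basis for a more advanced function.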

    function substr_html($string, $length)
    {
        $count = 0;
        /*
         * $state = 0 - normal text
         * $state = 1 - in HTML tag
         * $state = 2 - in HTML entity
         */
        $state = 0;
        for ($i = 0; $i < strlen($string); $i++) {
            $char = $string[$i];
            if ($char == '<') {
                $state = 1;
            } else if ($char == '&') {
                $state = 2;
                $count++;
            } else if ($char == ';') {
                $state = 0;
            } else if ($char == '>') {
                $state = 0;
            } else if ($state === 0) {
                $count++;
            }

            if ($count === $length) {
                return substr($string, 0, $i + 1);
            }
        }
        return $string;
    }

david bowies labyrinth crotch · Answer

You can use DomDocument in this case with a nasty regex hack; the worst that would happen is a warning if there's a broken tag:

    $dom = new DOMDocument();
    $dom->loadHTML(substr("Hello, my <strong>name</strong> is <em>Sam</em>. I&acute;m a web developer.",0,26));
    $html = preg_replace("/\<\/?(body|html|p)>/", "", $dom->saveHTML());
    echo $html;

This should give the output: Hello, my <strong>name</strong>.

hawkip · Answer

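Bounce added multibyte character support to Søren Løvborg's solution. I've added:

support for unpaired HTML tags (e.g. <hr>, <br>, <col> etc. don't get closed; in HTML a '/' is not required at the end of these, though it is for XHTML),
a customizable truncation indicator (defaults to &hellip;, i.e. …),
return as a string without using the output buffer, and
unit tests with 100% coverage.

All of this is at Pastie.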

Andrey Nagikh · Answer

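Another small change to Søren Løvborg's printTruncated function, making it UTF-8 compatible (requires mbstring) and making it return a string rather than print one. I think it's more useful. And my code doesn't use buffering like Bounce's variant does.

UPD: to make it work properly with UTF-8 characters in tag attributes, you need the mb_preg_match function listed below.

Many thanks to Søren Løvborg for that function. It's very good.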

    /* Truncate HTML, close opened tags
     *
     * @param int, maxlength of the string
     * @param string, html
     * @return $html
     */
    function htmlTruncate($maxLength, $html)
    {
        mb_internal_encoding("UTF-8");

        $printedLength = 0;
        $position = 0;
        $tags = array();
        $out = "";

        while ($printedLength < $maxLength && mb_preg_match('{</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;}', $html, $match, PREG_OFFSET_CAPTURE, $position))
        {
            list($tag, $tagPosition) = $match[0];

            // Print text leading up to the tag.
            $str = mb_substr($html, $position, $tagPosition - $position);
            if ($printedLength + mb_strlen($str) > $maxLength)
            {
                $out .= mb_substr($str, 0, $maxLength - $printedLength);
                $printedLength = $maxLength;
                break;
            }

            $out .= $str;
            $printedLength += mb_strlen($str);

            if ($tag[0] == '&')
            {
                // Handle the entity.
                $out .= $tag;
                $printedLength++;
            }
            else
            {
                // Handle the tag.
                $tagName = $match[1][0];
                if ($tag[1] == '/')
                {
                    // This is a closing tag.
                    $openingTag = array_pop($tags);
                    assert($openingTag == $tagName); // check that tags are properly nested.
                    $out .= $tag;
                }
                else if ($tag[mb_strlen($tag) - 2] == '/')
                {
                    // Self-closing tag.
                    $out .= $tag;
                }
                else
                {
                    // Opening tag.
                    $out .= $tag;
                    $tags[] = $tagName;
                }
            }

            // Continue after the tag.
            $position = $tagPosition + mb_strlen($tag);
        }

        // Print any remaining text.
        if ($printedLength < $maxLength && $position < mb_strlen($html))
            $out .= mb_substr($html, $position, $maxLength - $printedLength);

        // Close any open tags.
        while (!empty($tags))
            $out .= sprintf('</%s>', array_pop($tags));

        return $out;
    }

    function mb_preg_match(
        $ps_pattern,
        $ps_subject,
        &$pa_matches,
        $pn_flags = 0,
        $pn_offset = 0,
        $ps_encoding = NULL
    ) {
        // WARNING! - All this function does is to correct offsets, nothing else:
        // (code is independent of PREG_PATTERN_ORDER / PREG_SET_ORDER)

        if (is_null($ps_encoding)) $ps_encoding = mb_internal_encoding();

        // Convert the character offset into a byte offset for preg_match.
        $pn_offset = strlen(mb_substr($ps_subject, 0, $pn_offset, $ps_encoding));
        $ret = preg_match($ps_pattern, $ps_subject, $pa_matches, $pn_flags, $pn_offset);

        // Convert the returned byte offsets back into character offsets.
        if ($ret && ($pn_flags & PREG_OFFSET_CAPTURE))
            foreach($pa_matches as &$ha_match) {
                $ha_match[1] = mb_strlen(substr($ps_subject, 0, $ha_match[1]), $ps_encoding);
            }

        return $ret;
    }

DavidJ · Answer

The CakePHP framework has an HTML-aware truncate() function in its TextHelper that works for me. See Core-Helpers/Text. MIT license.

gpilotino · Answer

You can also use tidy:

    function truncate_html($html, $max_length) {
        return tidy_repair_string(substr($html, 0, $max_length),
            array('wrap' => 0, 'show-body-only' => TRUE), 'utf8');
    }

Bounce · Answer

I made a small change to Søren Løvborg's printTruncated function, making it UTF-8 compatible:

    /* Truncate HTML, close opened tags
     *
     * @param int, maxlength of the string
     * @param string, html
     * @return $html
     */
    function html_truncate($maxLength, $html){

        mb_internal_encoding("UTF-8");

        $printedLength = 0;
        $position = 0;
        $tags = array();

        ob_start();

        while ($printedLength < $maxLength && preg_match('{</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;}', $html, $match, PREG_OFFSET_CAPTURE, $position)){

            list($tag, $tagPosition) = $match[0];

            // Print text leading up to the tag.
            $str = mb_strcut($html, $position, $tagPosition - $position);

            if ($printedLength + mb_strlen($str) > $maxLength){
                print(mb_strcut($str, 0, $maxLength - $printedLength));
                $printedLength = $maxLength;
                break;
            }

            print($str);
            $printedLength += mb_strlen($str);

            if ($tag[0] == '&'){
                // Handle the entity.
                print($tag);
                $printedLength++;
            }
            else{
                // Handle the tag.
                $tagName = $match[1][0];
                if ($tag[1] == '/'){
                    // This is a closing tag.
                    $openingTag = array_pop($tags);
                    assert($openingTag == $tagName); // check that tags are properly nested.
                    print($tag);
                }
                else if ($tag[mb_strlen($tag) - 2] == '/'){
                    // Self-closing tag.
                    print($tag);
                }
                else{
                    // Opening tag.
                    print($tag);
                    $tags[] = $tagName;
                }
            }

            // Continue after the tag.
            $position = $tagPosition + mb_strlen($tag);
        }

        // Print any remaining text.
        if ($printedLength < $maxLength && $position < mb_strlen($html))
            print(mb_strcut($html, $position, $maxLength - $printedLength));

        // Close any open tags.
        while (!empty($tags))
            printf('</%s>', array_pop($tags));

        $bufferOuput = ob_get_contents();

        ob_end_clean();

        $html = $bufferOuput;

        return $html;
    }

Antony Carthy · Answer

This is very difficult to do without using a validator and a parser. The reason is that you might have something like this:

<div id='x'> <div id='y'> <h1>Heading</h1> 500 lines of html ... etc ... </div> </div>

How do you plan to truncate that and end up with valid HTML?

After a quick search, I found this link.

jlgrall · Answer

Use the function truncateHTML() from: https://github.com/jlgrall/truncateHTML

Example: truncate after 9 characters, including the ellipsis:

truncateHTML(9, "<p><b>A</b> red ball.</p>", ['wholeWord' => false]); // => "<p><b>A</b> red ba…</p>"

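Features: UTF-8, configurable ellipsis, include/exclude the length of the ellipsis, self-closing tags, collapsing spaces, invisible elements (<head>, <script>, <noscript>, <style>), HTML entities, truncating at the last whole word (with an option to still truncate very long words), PHP 5.6 and 7.0+, 240+ unit tests, returns a string (doesn't use the output buffer), and well-commented code.

I wrote this function because I really liked Søren Løvborg's function above (especially how he managed encodings), but I needed a bit more functionality and flexibility.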