Remove non-UTF8 characters from a string

Question

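I'm having trouble removing non-UTF8 characters from a string; they are not displayed properly. The characters look like this: 0x97 0x61 0x6C 0x6F (hex representation).

What is the best way to remove them? A regular expression or something else?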

Markus Jarderot · Accepted Answer

Using a regex approach:

$regex = <<<'END'
/
  (
    (?: [\x00-\x7F]                 # single-byte sequences   0xxxxxxx
    |   [\xC0-\xDF][\x80-\xBF]      # double-byte sequences   110xxxxx 10xxxxxx
    |   [\xE0-\xEF][\x80-\xBF]{2}   # triple-byte sequences   1110xxxx 10xxxxxx * 2
    |   [\xF0-\xF7][\x80-\xBF]{3}   # quadruple-byte sequence 11110xxx 10xxxxxx * 3
    ){1,100}                        # ...one or more times
  )
| .                                 # anything else
/x
END;
$text = preg_replace($regex, '$1', $text);

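It searches for UTF-8 sequences and captures them into group 1. It also matches single bytes that could not be identified as part of a UTF-8 sequence, but does not capture those. The replacement is whatever was captured into group 1, which effectively removes all invalid bytes.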
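As a quick check, the same idea can be run against the bytes from the question. This is a minimal sketch that uses a condensed, comment-free copy of the pattern; the helper name `strip_non_utf8` is made up for illustration. Like the pattern itself, it only checks the lead/continuation byte structure and does not reject overlong encodings.

```php
<?php
// Hypothetical helper wrapping a condensed copy of the pattern above.
function strip_non_utf8(string $text): ?string
{
    $regex = '/((?:[\x00-\x7F]'             // 1-byte (ASCII)
           . '|[\xC0-\xDF][\x80-\xBF]'      // 2-byte sequence
           . '|[\xE0-\xEF][\x80-\xBF]{2}'   // 3-byte sequence
           . '|[\xF0-\xF7][\x80-\xBF]{3}'   // 4-byte sequence
           . '){1,100})|./';
    // Group 1 holds runs of well-formed sequences; a lone byte matched
    // by "." is not captured, so replacing with $1 drops it.
    return preg_replace($regex, '$1', $text);
}

echo strip_non_utf8("\x97alo"), "\n";                    // alo
echo bin2hex(strip_non_utf8("caf\xC3\xA9\xFF")), "\n";   // 636166c3a9
```

The first call uses the exact bytes 0x97 0x61 0x6C 0x6F from the question; the stray 0x97 is dropped and the ASCII bytes survive.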

It is also possible to repair the string by encoding the invalid bytes as UTF-8 characters. But if the errors are random, this could leave some strange symbols.

$regex = <<<'END'
/
  (
    (?: [\x00-\x7F]               # single-byte sequences   0xxxxxxx
    |   [\xC0-\xDF][\x80-\xBF]    # double-byte sequences   110xxxxx 10xxxxxx
    |   [\xE0-\xEF][\x80-\xBF]{2} # triple-byte sequences   1110xxxx 10xxxxxx * 2
    |   [\xF0-\xF7][\x80-\xBF]{3} # quadruple-byte sequence 11110xxx 10xxxxxx * 3
    ){1,100}                      # ...one or more times
  )
| ( [\x80-\xBF] )                 # invalid byte in range 10000000 - 10111111
| ( [\xC0-\xFF] )                 # invalid byte in range 11000000 - 11111111
/x
END;
function utf8replacer($captures) {
    if ($captures[1] != "") {
        // Valid byte sequence. Return unmodified.
        return $captures[1];
    } elseif ($captures[2] != "") {
        // Invalid byte of the form 10xxxxxx.
        // Encode as 11000010 10xxxxxx.
        return "\xC2" . $captures[2];
    } else {
        // Invalid byte of the form 11xxxxxx.
        // Encode as 11000011 10xxxxxx.
        return "\xC3" . chr(ord($captures[3]) - 64);
    }
}
$text = preg_replace_callback($regex, "utf8replacer", $text);

Edit:

!empty(x) will match non-empty values ("0" is considered empty).
x != "" will match non-empty values, including "0".
x !== "" will match anything except "".

x != "" seems the best one to use in this case.
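The difference matters because a capture group can legitimately hold the single character "0". A small sketch of the three checks side by side (the function name `capture_checks` is invented for this illustration):

```php
<?php
// Hypothetical helper: evaluate the three checks on one captured value.
function capture_checks(string $x): array
{
    return [
        'not_empty'        => !empty($x),   // false for "" and for "0"
        'loose_not_blank'  => $x != "",     // false only for ""
        'strict_not_blank' => $x !== "",    // false only for ""
    ];
}

foreach (["", "0", "a"] as $x) {
    echo var_export($x, true), ' => ', json_encode(capture_checks($x)), "\n";
}
```

With `!empty()`, a valid match consisting of just "0" would fall through to the branches that handle invalid bytes, which is why the callback compares against "" instead.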

I have also sped up the matching a little. Instead of matching each character separately, it matches sequences of valid UTF-8 characters.

Sebastián Grignoli · Answer

If you apply utf8_encode() to a string that is already UTF8, it will return garbled UTF8 output.
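The effect is easy to reproduce at the byte level. The sketch below uses `mb_convert_encoding()` from ISO-8859-1 to UTF-8, which is the same conversion `utf8_encode()` performs (`utf8_encode()` itself is deprecated as of PHP 8.2); it assumes the mbstring extension is available.

```php
<?php
// "Fédération" as UTF-8 bytes (é = C3 A9).
$utf8 = "F\xC3\xA9d\xC3\xA9ration";

// Treat the UTF-8 bytes as ISO-8859-1 and encode them again:
// C3 becomes C3 83 ("Ã") and A9 becomes C2 A9 ("©").
$twice = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), "\n";   // 46c3a964c3a9726174696f6e
echo bin2hex($twice), "\n";  // 46c383c2a964c383c2a9726174696f6e

// Converting back once peels off one layer of encoding.
$restored = mb_convert_encoding($twice, 'ISO-8859-1', 'UTF-8');
echo $restored === $utf8 ? "restored\n" : "still garbled\n";
```

The doubly encoded bytes display as "FÃ©dÃ©ration" in a UTF-8 terminal.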

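I made a function that addresses all these issues. It's called Encoding::toUTF8().

You don't need to know what the encoding of your strings is. It can be Latin1 (ISO8859-1), Windows-1252 or UTF8, or the string can contain a mix of them. Encoding::toUTF8() will convert everything to UTF8.

I wrote it because a service was giving me a feed of data that was all messed up, mixing those encodings in the same string.

Usage: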

require_once('Encoding.php');
use \ForceUTF8\Encoding;  // It's namespaced now.
$utf8_string = Encoding::toUTF8($mixed_string);
$latin1_string = Encoding::toLatin1($mixed_string);

I've included another function, Encoding::fixUTF8(), which fixes every UTF8 string that looks garbled as a result of having been encoded into UTF8 multiple times.

Usage:

require_once('Encoding.php');
use \ForceUTF8\Encoding;  // It's namespaced now.
$utf8_string = Encoding::fixUTF8($garbled_utf8_string);

Examples:

echo Encoding::fixUTF8("FÃ©dÃ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂ©dÃÂ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂÃÂ©dÃÂÃÂ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂ©dération Camerounaise de Football");

will output:

Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football

Download:

https://github.com/neitanod/forceutf8

Frosty Z · Answer

You can use mbstring:

$text = mb_convert_encoding($text, 'UTF-8', 'UTF-8');

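...will get rid of invalid characters. Depending on the mbstring substitute character setting, they are either dropped or replaced (by default with a question mark).

See: Replacing invalid UTF-8 characters by question marks, mbstring.substitute_character seems ignored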
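To make the outcome explicit rather than dependent on configuration, the substitute character can be set at runtime with `mb_substitute_character()`. A minimal sketch, assuming the mbstring extension is loaded; the wrapper name `mb_strip_invalid_utf8` is invented for this example:

```php
<?php
// Hypothetical wrapper: drop invalid bytes instead of substituting them.
function mb_strip_invalid_utf8(string $text): string
{
    $previous = mb_substitute_character();   // remember the current setting
    mb_substitute_character('none');         // 'none' = emit nothing for invalid input
    $clean = mb_convert_encoding($text, 'UTF-8', 'UTF-8');
    mb_substitute_character($previous);      // restore the caller's setting
    return $clean;
}

echo mb_strip_invalid_utf8("\x97alo"), "\n";   // alo
```

Restoring the previous value keeps the change from leaking into unrelated mbstring calls elsewhere in the request.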

David D · Answer

This function removes all NON-ASCII characters. It's useful, but it does not solve the question:
This is my function that always works, regardless of encoding:

function remove_bs($Str) {
    $StrArr = str_split($Str);
    $NewStr = '';
    foreach ($StrArr as $Char) {
        $CharNo = ord($Char);
        if ($CharNo == 163) { $NewStr .= $Char; continue; } // keep £
        if ($CharNo > 31 && $CharNo < 127) {
            $NewStr .= $Char;
        }
    }
    return $NewStr;
}

How it works:

echo remove_bs('Hello õhowå åare youÆ?'); // Hello how are you?

Znarkus · Answer

$text = iconv("UTF-8", "UTF-8//IGNORE", $text);

This is what I am using. It seems to work pretty well. Taken from http://planetozh.com/blog/2005/01/remove-invalid-characters-in-utf-8/

technoarya · Answer

Try this:

$string = iconv("UTF-8","UTF-8//IGNORE",$string);

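According to the iconv manual, the function takes the input charset as the first parameter, the output charset as the second, and the actual input string as the third.

If you set both the input and output charset to UTF-8 and append the //IGNORE flag to the output charset, the function will drop (strip) all characters of the input string that can't be represented in the output charset, thus in effect filtering the input string.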

HTML5 developer · Answer

Your text may contain non-UTF8 characters. Try this first:

$nonutf8 = mb_convert_encoding($nonutf8 , 'UTF-8', 'UTF-8');

You can read more about it here: http://php.net/manual/en/function.mb-convert-encoding.php

mumin · Answer

I have made a function that deletes invalid UTF-8 characters from a string. I'm using it to clean the descriptions of 27000 products before generating the XML export file.

public function stripInvalidXml($value)
{
    $ret = "";
    if ($value === null || $value === "") {
        return $ret;
    }
    $length = strlen($value);
    for ($i = 0; $i < $length; $i++) {
        // ord() returns a single byte (0-255), so the ranges above 0xFF
        // below can never match; in effect this strips control bytes.
        $current = ord($value[$i]);  // the $value{$i} syntax was removed in PHP 8
        if (($current == 0x9) ||
            ($current == 0xA) ||
            ($current == 0xD) ||
            (($current >= 0x20) && ($current <= 0xD7FF)) ||
            (($current >= 0xE000) && ($current <= 0xFFFD)) ||
            (($current >= 0x10000) && ($current <= 0x10FFFF))) {
            $ret .= chr($current);
        }
    }
    return $ret;
}
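Because the loop above works on bytes, the ranges are never compared against real code points. A code-point-aware variant of the same character ranges can be sketched with a `/u` pattern. The function name `strip_invalid_xml_chars` is an assumption of this example, and it expects input that is already valid UTF-8: `preg_replace()` returns null for malformed input, and the sketch passes that null through.

```php
<?php
// Hypothetical variant: same allowed ranges, matched per code point.
function strip_invalid_xml_chars(string $value): ?string
{
    // With /u, PCRE reads the subject as UTF-8, so \x{...} are code points.
    $pattern = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';
    // Returns null if $value is not valid UTF-8; clean that up first.
    return preg_replace($pattern, '', $value);
}

var_dump(strip_invalid_xml_chars("a\x01b\tc"));   // control byte 0x01 is removed
var_dump(strip_invalid_xml_chars("\x97alo"));     // NULL: input is not valid UTF-8
```

In practice this would run after one of the invalid-byte cleanups from the other answers, so that the null case does not occur.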

masakielastic · Answer

UConverter can be used since PHP 5.5. UConverter is the better choice if you use the intl extension and don't use mbstring.

function replace_invalid_byte_sequence($str)
{
    return UConverter::transcode($str, 'UTF-8', 'UTF-8');
}

function replace_invalid_byte_sequence2($str)
{
    return (new UConverter('UTF-8', 'UTF-8'))->convert($str);
}

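Since PHP 5.4, htmlspecialchars can be used to get rid of invalid byte sequences (with ENT_SUBSTITUTE, each one is replaced by U+FFFD). htmlspecialchars is better than preg_match for handling large inputs and for accuracy. Many wrong implementations using regular expressions can be seen.

function replace_invalid_byte_sequence3($str)
{
    return htmlspecialchars_decode(htmlspecialchars($str, ENT_SUBSTITUTE, 'UTF-8'));
}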
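If the goal is removal rather than substitution, the U+FFFD markers can be stripped afterwards. A minimal sketch under that assumption; the function name `scrub_invalid_utf8` and its `$dropMarkers` parameter are made up here, and the `"\u{FFFD}"` escape needs PHP 7 or later.

```php
<?php
// Hypothetical wrapper: substitute first, then optionally drop the markers.
function scrub_invalid_utf8(string $str, bool $dropMarkers = false): string
{
    // ENT_SUBSTITUTE replaces each invalid sequence with U+FFFD instead of
    // making htmlspecialchars() return an empty string.
    $fixed = htmlspecialchars_decode(htmlspecialchars($str, ENT_SUBSTITUTE, 'UTF-8'));
    // Caveat: this also removes any U+FFFD that was already in the input.
    return $dropMarkers ? str_replace("\u{FFFD}", '', $fixed) : $fixed;
}

echo bin2hex(scrub_invalid_utf8("\x97alo")), "\n";   // efbfbd616c6f
echo scrub_invalid_utf8("\x97alo", true), "\n";      // alo
```

Keeping the marker is often the safer default, since it leaves a visible trace of where data was damaged.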

Alix Axel · Answer

$string = preg_replace('~&([a-z]{1,2})(acute|cedil|circ|Grave|lig|orn|ring|slash|th|tilde|uml);~i', '$1', htmlentities($string, ENT_COMPAT, 'UTF-8'));

Oleksii Chekulaiev · Answer

From a recent patch to Drupal's Feeds JSON parser module:

//remove everything except valid letters (from any language)
$raw = preg_replace('/(?:\\u[\pL\p{Zs}])+/', '', $raw);

In case you are wondering: yes, it keeps spaces as valid characters.

It did what I needed. It removes the now-widespread emoji characters that do not fit into MySQL's "utf8" character set and that gave me errors like "SQLSTATE[HY000]: General error: 1366 Incorrect string value".

For details, see https://www.drupal.org/node/1824506#comment-6881382

Will · Answer

So the rules are that the first UTF-8 octet has the high bit set as a marker, followed by 1 to 4 bits that indicate how many additional octets there are; each of the additional octets must then have its high two bits set to 10.

The pseudo-Python would be:

newstring = ''
cont = 0
for each ch in string:
    if cont:
        if (ch >> 6) != 2:  # high 2 bits must be 10
            # do whatever, e.g. skip it, or skip whole point, or?
        else:
            # acceptable continuation of multi-octet char
            newstring += ch
            cont -= 1
    else:
        if (ch >> 7):  # high bit set?
            c = (ch << 1)  # strip the high bit marker
            while (c & 0x80):  # while the high bit indicates another octet
                c <<= 1
                cont += 1
                if cont > 4:
                    # more than 4 octets not allowed; cope with error
            if !cont:
                # illegal, do something sensible
        newstring += ch  # or whatever
if cont:
    # last utf-8 was not terminated, cope

This same logic should be translatable to PHP. However, it's not clear what kind of stripping should be done once you get a malformed character.
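One possible PHP rendering of that octet walk is sketched below. It settles the open question by dropping only the offending byte and rescanning from the next one, which is just one of several reasonable policies. The function name `strip_malformed_utf8` is hypothetical, and the check is structural only: it does not reject overlong encodings or surrogates.

```php
<?php
// Hypothetical helper: keep only structurally well-formed UTF-8 sequences.
function strip_malformed_utf8(string $input): string
{
    $out = '';
    $len = strlen($input);
    $i = 0;
    while ($i < $len) {
        $byte = ord($input[$i]);
        if ($byte < 0x80) {
            $need = 0;                          // 0xxxxxxx: ASCII
        } elseif (($byte & 0xE0) === 0xC0) {
            $need = 1;                          // 110xxxxx: one continuation
        } elseif (($byte & 0xF0) === 0xE0) {
            $need = 2;                          // 1110xxxx: two continuations
        } elseif (($byte & 0xF8) === 0xF0) {
            $need = 3;                          // 11110xxx: three continuations
        } else {
            $i++;                               // stray continuation or bad lead: drop
            continue;
        }
        // Every expected continuation octet must exist and look like 10xxxxxx.
        $ok = ($i + $need < $len);
        for ($k = 1; $ok && $k <= $need; $k++) {
            $ok = (ord($input[$i + $k]) & 0xC0) === 0x80;
        }
        if ($ok) {
            $out .= substr($input, $i, $need + 1);
            $i += $need + 1;
        } else {
            $i++;                               // drop the lead byte, rescan the rest
        }
    }
    return $out;
}

echo strip_malformed_utf8("\x97alo"), "\n";   // alo
```

Rescanning after a bad lead byte means that valid characters following a truncated sequence are still kept.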

Daniel Powers · Answer

To remove all Unicode characters outside of the Unicode Basic Multilingual Plane:

$str = preg_replace('/[^\x{0000}-\x{FFFF}]/u', '', $str);

misaxi · Answer

Slightly different from the question, but what I am doing is to use HtmlEncode(string).

Pseudo code here:

var encoded = HtmlEncode(string);
encoded = Regex.Replace(encoded, "&#\d+?;", "");
var result = HtmlDecode(encoded);

Input and output:

"Headlight\x007E Bracket, &#123; Cafe Racer<> Style,Â Stainless Steel 中文呢？"
"Headlight~ Bracket, &#123; Cafe Racer<> Style, Stainless Steel 中文呢？"

I know it's not perfect, but it does the job for me.