How do I convert a UTF-8 string to Unicode?

Question

I have a string that displays UTF-8 encoded characters, and I want to convert it back to Unicode.

So far, my implementation is the following:

    public static string DecodeFromUtf8(this string utf8String)
    {
        // read the string as UTF-8 bytes.
        byte[] encodedBytes = Encoding.UTF8.GetBytes(utf8String);

        // convert them into unicode bytes.
        byte[] unicodeBytes = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, encodedBytes);

        // builds the converted string.
        return Encoding.Unicode.GetString(encodedBytes);
    }

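I am playing with the word "déjà". I converted it to UTF-8 with this online tool, so I started testing my method with the string "dÃ©jÃ".

Unfortunately, with this implementation the string just stays the same.

Where am I going wrong?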

bames53 · Accepted Answer

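So the issue is that UTF-8 code unit values have been stored as a sequence of 16-bit code units in a C# string. You simply need to verify that each code unit is within the range of a byte, copy those values into bytes, and then convert the new UTF-8 byte sequence into UTF-16.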

    public static string DecodeFromUtf8(this string utf8String)
    {
        // copy the string as UTF-8 bytes.
        byte[] utf8Bytes = new byte[utf8String.Length];
        for (int i = 0; i < utf8String.Length; ++i)
        {
            //Debug.Assert( 0 <= utf8String[i] && utf8String[i] <= 255, "the char must be in byte's range");
            utf8Bytes[i] = (byte)utf8String[i];
        }

        return Encoding.UTF8.GetString(utf8Bytes, 0, utf8Bytes.Length);
    }

    DecodeFromUtf8("d\u00C3\u00A9j\u00C3\u00A0"); // déjà

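This is easy, but it would be best to find the root cause: the place where someone is copying UTF-8 code units into 16-bit code units. The likely culprit is somebody converting bytes into a C# string using the wrong encoding, e.g. Encoding.Default.GetString(utf8Bytes, 0, utf8Bytes.Length).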
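To see how such a string can arise, here is a minimal sketch that encodes text as UTF-8 and then decodes those bytes with a single-byte encoding. As an assumption it uses ISO-8859-1 (Latin-1) instead of Encoding.Default, because Latin-1 maps every byte 0 to 255 to the code point with the same value and so yields exactly the kind of string the method above expects. The class and method names are hypothetical.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // Latin-1 maps each byte 0x00-0xFF to the code point with the same value.
    private static readonly Encoding Latin1 = Encoding.GetEncoding("iso-8859-1");

    // Simulates the mistake: encode correctly as UTF-8, then decode
    // the resulting bytes with the wrong (single-byte) encoding.
    public static string Mangle(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return Latin1.GetString(utf8Bytes);
    }
}
```

Calling MojibakeDemo.Mangle("d\u00E9j\u00E0") should produce the six-character string "d\u00C3\u00A9j\u00C3\u00A0", which is the input used in the test call above.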

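Alternatively, if you are sure you know the incorrect encoding that was used to produce the string, and that incorrect conversion was lossless (usually the case if the incorrect encoding is a single-byte encoding), then you can simply apply the inverse encoding step to recover the original UTF-8 data, and then do the correct conversion from the UTF-8 bytes.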

    public static string UndoEncodingMistake(string mangledString, Encoding mistake, Encoding correction)
    {
        // the inverse of `mistake.GetString(originalBytes);`
        byte[] originalBytes = mistake.GetBytes(mangledString);
        return correction.GetString(originalBytes);
    }

    UndoEncodingMistake("d\u00C3\u00A9j\u00C3\u00A0", Encoding.GetEncoding(1252), Encoding.UTF8);
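The "lossless" condition can be partly checked at run time. The following is a sketch only, with a hypothetical helper name: it re-encodes the mangled string with the mistaken encoding and refuses to continue if decoding those bytes again does not give back the same string. It cannot detect information that was already lost when the wrong decode first happened.

```csharp
using System;
using System.Text;

public static class EncodingRepair
{
    // Hypothetical helper: undoes a wrong decode only when re-encoding
    // with the mistaken encoding round-trips without changes.
    public static bool TryUndoEncodingMistake(
        string mangled, Encoding mistake, Encoding correction, out string repaired)
    {
        // Inverse of mistake.GetString(originalBytes).
        byte[] candidateBytes = mistake.GetBytes(mangled);

        // If a character has no exact representation in the mistaken
        // encoding, it is substituted and the round trip differs.
        if (mistake.GetString(candidateBytes) != mangled)
        {
            repaired = mangled; // leave the input untouched on failure
            return false;
        }

        repaired = correction.GetString(candidateBytes);
        return true;
    }
}
```

With Encoding.GetEncoding("iso-8859-1") as the mistake and Encoding.UTF8 as the correction, "d\u00C3\u00A9j\u00C3\u00A0" should be repaired to "déjà", while a string containing a character outside Latin-1 should be rejected.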

MEN · Answer

If you have a UTF-8 string where every byte is correct ('Ö' -> [195, 0], [150, 0]), you can use the following:

    public static string Utf8ToUtf16(string utf8String)
    {
        /*
         * Every .NET string stores text in the UTF-16 encoding, known as
         * Encoding.Unicode. Other encodings may exist as a byte array or
         * incorrectly stored inside a UTF-16 string.
         *
         * UTF-8 = 1 to 4 bytes per char
         *     [100 for the ANSI 'd']
         *     [206 and 186 for the Greek 'κ']
         *
         * UTF-16 = 2 or 4 bytes per char
         *     [100, 0 for the ANSI 'd']
         *     [186, 3 for the Greek 'κ']
         *
         * UTF-8 inside UTF-16
         *     [100, 0 for the ANSI 'd']
         *     [206, 0 and 186, 0 for the Greek 'κ']
         *
         * First we need to get the UTF-8 byte array and remove all
         * 0 bytes (binary 0) while doing so.
         *
         * Binary 0 means end of string in UTF-8 encoding, while in
         * UTF-16 a single binary 0 does not end the string. Only two
         * binary 0 bytes end a UTF-16 string. Because of .NET we don't
         * have to handle this.
         *
         * After removing binary 0 and receiving the byte array, we can
         * use the UTF-8 encoding's GetString method to get a UTF-16 string.
         */

        // Get UTF-8 bytes and remove binary 0 bytes (filler)
        List<byte> utf8Bytes = new List<byte>(utf8String.Length);
        foreach (byte utf8Byte in utf8String)
        {
            // Remove binary 0 bytes (filler)
            if (utf8Byte > 0)
            {
                utf8Bytes.Add(utf8Byte);
            }
        }

        // Convert UTF-8 bytes to UTF-16 string
        return Encoding.UTF8.GetString(utf8Bytes.ToArray());
    }

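In my case the DLL result is a UTF-8 string too, but unfortunately the UTF-8 bytes were decoded as ANSI and stored as UTF-16 ('Ö' -> [195, 0], [19, 32]). So the ANSI '–', which is 150, was converted to the UTF-16 '–', which is 8211. If you have this case too, you can use the following instead: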

    public static string Utf8ToUtf16(string utf8String)
    {
        // Get UTF-8 bytes by reading each byte with ANSI encoding.
        // Note: Encoding.Default is the system ANSI code page only on .NET Framework;
        // on .NET Core and .NET 5+ it is UTF-8.
        byte[] utf8Bytes = Encoding.Default.GetBytes(utf8String);

        // Convert UTF-8 bytes to UTF-16 bytes
        byte[] utf16Bytes = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, utf8Bytes);

        // Return UTF-16 bytes as UTF-16 string
        return Encoding.Unicode.GetString(utf16Bytes);
    }

Or the native method:

    [DllImport("kernel32.dll")]
    private static extern Int32 MultiByteToWideChar(
        UInt32 CodePage,
        UInt32 dwFlags,
        [MarshalAs(UnmanagedType.LPStr)] String lpMultiByteStr,
        Int32 cbMultiByte,
        [Out, MarshalAs(UnmanagedType.LPWStr)] StringBuilder lpWideCharStr,
        Int32 cchWideChar);

    public static string Utf8ToUtf16(string utf8String)
    {
        Int32 iNewDataLen = MultiByteToWideChar(Convert.ToUInt32(Encoding.UTF8.CodePage), 0, utf8String, -1, null, 0);
        if (iNewDataLen > 1)
        {
            StringBuilder utf16String = new StringBuilder(iNewDataLen);
            MultiByteToWideChar(Convert.ToUInt32(Encoding.UTF8.CodePage), 0, utf8String, -1, utf16String, utf16String.Capacity);

            return utf16String.ToString();
        }
        else
        {
            return String.Empty;
        }
    }

If you need it the other way around, see Utf16ToUtf8. Hope I could be of help.
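To illustrate the second case described in this answer, here is a small sketch that decodes the two UTF-8 bytes of 'Ö' (195, 150) as Windows-1252 and returns the resulting UTF-16 code units. It assumes .NET Core 3.0 or later (or .NET 5+), where code page 1252 is only available after registering CodePagesEncodingProvider; the class and method names are hypothetical.

```csharp
using System;
using System.Text;

public static class Cp1252Demo
{
    // Returns the UTF-16 code units obtained by decoding the bytes as Windows-1252.
    public static int[] DecodeAsCp1252(byte[] bytes)
    {
        // Assumption: .NET Core 3.0+ / .NET 5+, where legacy code pages
        // must be registered before Encoding.GetEncoding(1252) works.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        string decoded = Encoding.GetEncoding(1252).GetString(bytes);

        int[] codeUnits = new int[decoded.Length];
        for (int i = 0; i < decoded.Length; i++)
        {
            codeUnits[i] = decoded[i];
        }
        return codeUnits;
    }
}
```

Cp1252Demo.DecodeAsCp1252(new byte[] { 195, 150 }) should return 195 and 8211: byte 150 no longer fits in a single byte after decoding, which is why simply casting each char back to a byte fails in this case, while re-encoding with the same ANSI code page recovers it.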

Hans Passant · Answer

"I have a string that displays UTF-8 encoded characters"

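There is no such thing in .NET. The string class can only store strings in the UTF-16 encoding. A UTF-8 encoded string can only exist as a byte[]. Trying to store bytes in a string will not come to a good end; UTF-8 uses byte values that don't have a valid Unicode code point. The content will be destroyed when the string is normalized. So it is already too late to recover the string by the time your DecodeFromUtf8() starts running.

Only handle UTF-8 encoded text with byte[]. And use UTF8Encoding.GetString() to convert it.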
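A minimal sketch of that advice follows, with hypothetical class and method names. It keeps the encoded text as a byte[] and decodes it with a UTF8Encoding instance; as an added assumption, the instance is constructed with throwOnInvalidBytes set to true so that malformed input raises an exception instead of being silently replaced with U+FFFD.

```csharp
using System;
using System.Text;

public static class Utf8Bytes
{
    // Strict decoder: invalid byte sequences raise DecoderFallbackException
    // instead of being replaced with U+FFFD.
    private static readonly UTF8Encoding StrictUtf8 =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // Decodes UTF-8 bytes into a regular (UTF-16) .NET string.
    public static string Decode(byte[] utf8Bytes)
    {
        return StrictUtf8.GetString(utf8Bytes);
    }
}
```

For example, Utf8Bytes.Decode(new byte[] { 0x64, 0xC3, 0xA9, 0x6A, 0xC3, 0xA0 }) should return "déjà".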

Mark Tolonen · Answer

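What you have seems to be a string incorrectly decoded from another encoding, likely code page 1252, which is the US Windows default. Here is how to reverse it, assuming no other loss. One loss that is not immediately apparent is the non-breaking space (U+00A0) at the end of your string, which is not displayed. Of course it would be better to read the data source correctly in the first place, but perhaps the data source was stored incorrectly to begin with.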

    using System;
    using System.Text;

    class Program
    {
        static void Main(string[] args)
        {
            string junk = "dÃ©jÃ\xa0"; // Bad Unicode string

            // Turn string back to bytes using the original, incorrect encoding.
            byte[] bytes = Encoding.GetEncoding(1252).GetBytes(junk);

            // Use the correct encoding this time to convert back to a string.
            string good = Encoding.UTF8.GetString(bytes);
            Console.WriteLine(good);
        }
    }

Result:

déjà