How do I perform HTML decoding/encoding using Python/Django?

Question

I have a string that is HTML encoded:

    '''&lt;img class=&quot;size-medium wp-image-113&quot;\
     style=&quot;margin-left: 15px;&quot; title=&quot;su1&quot;\
     src=&quot;http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg&quot;\
     alt=&quot;&quot; width=&quot;300&quot; height=&quot;194&quot; /&gt;'''

I want to change that to:

    <img class="size-medium wp-image-113" style="margin-left: 15px;" title="su1" src="http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg" alt="" width="300" height="194" />

I want this to register as HTML so that the browser renders it as an image instead of displaying it as text.

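The string is stored like that because I am using a web-scraping tool called BeautifulSoup. It "scans" a web page, gets certain content from it, and returns the string in that format.

I have found how to do this in C# but not in Python. Can someone help me out?

Related

Convert XML/HTML Entities into Unicode String in Python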

Daniel Naab · Accepted Answer

Given the Django use case, there are two answers to this. Here is its django.utils.html.escape function, for reference:

    def escape(html):
        """Returns the given HTML with ampersands, quotes and carets encoded."""
        return mark_safe(
            force_unicode(html)
            .replace('&', '&amp;')
            .replace('<', '&lt;')
            .replace('>', '&gt;')
            .replace('"', '&quot;')
            .replace("'", '&#39;')
        )

To reverse this, the Cheetah function described in Jake's answer should work, but it is missing the single quote. This version includes an updated tuple, with the order of replacement reversed to avoid symmetry problems:

    def html_decode(s):
        """
        Returns the ASCII decoded version of the given HTML string. This does
        NOT remove normal HTML tags like <p>.
        """
        htmlCodes = (
            ("'", '&#39;'),
            ('"', '&quot;'),
            ('>', '&gt;'),
            ('<', '&lt;'),
            ('&', '&amp;'),
        )
        for code in htmlCodes:
            s = s.replace(code[1], code[0])
        return s

    unescaped = html_decode(my_string)
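A minimal sketch of why the order of replacement matters, using only str.replace. The names PAIRS and decode are illustrative and do not come from Django or Cheetah. The text "&amp;lt;" is the escaped form of the literal characters "&lt;", so decoding it once should give back "&lt;" and not "<":

```python
# Replacement pairs in the same order as html_decode: '&amp;' comes last.
PAIRS = (
    ("'", "&#39;"),
    ('"', "&quot;"),
    (">", "&gt;"),
    ("<", "&lt;"),
    ("&", "&amp;"),
)


def decode(s, pairs):
    """Replace each entity with its character, in the order given."""
    for char, entity in pairs:
        s = s.replace(entity, char)
    return s


text = "&amp;lt;"  # escaped form of the literal text "&lt;"

amp_last = decode(text, PAIRS)         # '&amp;' handled last: one level decoded
amp_first = decode(text, PAIRS[::-1])  # '&amp;' handled first: decoded twice

print(amp_last)   # &lt;
print(amp_first)  # <
```

Handling "&amp;" first turns "&amp;lt;" into "&lt;", which the later "&lt;" replacement then decodes a second time.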

This, however, is not a general solution; it is only appropriate for strings encoded with django.utils.html.escape. More generally, it is a good idea to stick with the standard library:

    # Python 2.x:
    import HTMLParser
    html_parser = HTMLParser.HTMLParser()
    unescaped = html_parser.unescape(my_string)

    # Python 3.x (HTMLParser.unescape was removed in Python 3.9):
    import html.parser
    html_parser = html.parser.HTMLParser()
    unescaped = html_parser.unescape(my_string)

    # >= Python 3.5 (html.unescape itself was added in 3.4):
    from html import unescape
    unescaped = unescape(my_string)
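To illustrate the difference between the hand-written decoder and the standard library, here is a small comparison. It assumes Python 3.4 or later; the sample string is made up, and html_decode is re-declared so that the block runs on its own:

```python
from html import unescape  # available since Python 3.4


def html_decode(s):
    """Reverse only the five replacements made by Django's escape()."""
    pairs = (
        ("'", "&#39;"),
        ('"', "&quot;"),
        (">", "&gt;"),
        ("<", "&lt;"),
        ("&", "&amp;"),
    )
    for char, entity in pairs:
        s = s.replace(entity, char)
    return s


sample = "caf&eacute; &#x27;quoted&#x27; &amp; more"  # made-up input

by_hand = html_decode(sample)  # only '&amp;' is recognised here
by_stdlib = unescape(sample)   # named, decimal and hex references are decoded

print(by_hand)    # caf&eacute; &#x27;quoted&#x27; & more
print(by_stdlib)  # café 'quoted' & more
```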

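As a suggestion: it may make more sense to store the HTML unescaped in your database. If possible, it would be worth getting unescaped results back from BeautifulSoup and avoiding this process altogether.

With Django, escaping only occurs during template rendering, so to prevent escaping you just tell the templating engine not to escape your string. To do that, use one of these options in your template:

    {{ context_var|safe }}

    {% autoescape off %}
        {{ context_var }}
    {% endautoescape %}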
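The idea behind marking a value as safe can be pictured with a toy model in plain Python. This is only a sketch of the concept and not Django's implementation: SafeText and render are hypothetical names, and html.escape stands in for the template engine's escaping step.

```python
from html import escape


class SafeText(str):
    """Hypothetical marker type: text that is already trusted HTML."""


def render(value):
    """Escape the value on output unless it was explicitly marked safe."""
    if isinstance(value, SafeText):
        return str(value)
    return escape(str(value))


img = '<img src="su1.jpg" alt="" />'

escaped_output = render(img)        # shown as text by a browser
raw_output = render(SafeText(img))  # rendered as an image by a browser

print(escaped_output)  # &lt;img src=&quot;su1.jpg&quot; alt=&quot;&quot; /&gt;
print(raw_output)      # <img src="su1.jpg" alt="" />
```

In this model nothing is ever unescaped; the stored string stays as plain HTML and the only decision is whether to escape it on output.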

Jiangge Zhang · Answer

With the standard library:

HTML Escape

    try:
        from html import escape  # python 3.x
    except ImportError:
        from cgi import escape  # python 2.x

    print(escape("<"))

HTML Unescape

    try:
        from html import unescape  # python 3.4+
    except ImportError:
        try:
            from html.parser import HTMLParser  # python 3.x (<3.4)
        except ImportError:
            from HTMLParser import HTMLParser  # python 2.x
        unescape = HTMLParser().unescape

    print(unescape("&gt;"))

user26294 · Answer

For HTML encoding, there's cgi.escape from the standard library:

    >> help(cgi.escape)
    cgi.escape = escape(s, quote=None)
        Replace special characters "&", "<" and ">" to HTML-safe sequences.
        If the optional flag quote is true, the quotation mark character (")
        is also translated.
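On current Python 3 the same job is done by html.escape, since cgi.escape was removed in Python 3.8. A short sketch of the quote flag with a made-up sample string; note that html.escape escapes quotes by default, whereas cgi.escape did not:

```python
from html import escape

text = 'Tom & "Jerry" <3'  # made-up sample

with_quotes = escape(text)                 # quote=True is the default
without_quotes = escape(text, quote=False) # only &, < and > are replaced

print(with_quotes)     # Tom &amp; &quot;Jerry&quot; &lt;3
print(without_quotes)  # Tom &amp; "Jerry" &lt;3
```

Escaping quotes matters when the result is placed inside an HTML attribute value.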

For HTML decoding, I use the following:

    import re
    from htmlentitydefs import name2codepoint
    # for some reason, python 2.5.2 doesn't have this one (apostrophe)
    name2codepoint['#39'] = 39

    def unescape(s):
        "unescape HTML code refs; c.f. http://wiki.python.org/moin/EscapingHtml"
        return re.sub('&(%s);' % '|'.join(name2codepoint),
                      lambda m: unichr(name2codepoint[m.group(1)]), s)

For anything more complicated, I use BeautifulSoup.

vincent · Answer

If the set of encoded characters is relatively restricted, use Daniel's solution. Otherwise, use one of the numerous HTML-parsing libraries.

I like BeautifulSoup because it can handle malformed XML/HTML:

http://www.crummy.com/software/BeautifulSoup/

For your question, there's an example in their documentation:

    from BeautifulSoup import BeautifulStoneSoup
    BeautifulStoneSoup("Sacr&eacute; bl&#101;u!",
                       convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0]
    # u'Sacr\xe9 bleu!'

Collin Anderson · Answer

In Python 3.4+:

    import html

    html.unescape(your_string)

zgoda · Answer

See the bottom of this page on the Python wiki; there are at least two options for "unescaping" HTML.

dfrankow · Answer

Daniel's comment as an answer:

"Escaping only occurs in Django during template rendering. Therefore, there is no need to unescape; you just tell the templating engine not to escape. Either {{ context_var|safe }} or {% autoescape off %}{{ context_var }}{% endautoescape %}"

slowkvant · Answer

I found a fine function at: http://snippets.dzone.com/posts/show/4569

    def decodeHtmlentities(string):
        import re
        entity_re = re.compile(r"&(#?)(\d{1,5}|\w{1,8});")

        def substitute_entity(match):
            from htmlentitydefs import name2codepoint as n2cp
            ent = match.group(2)
            if match.group(1) == "#":
                return unichr(int(ent))
            else:
                cp = n2cp.get(ent)

                if cp:
                    return unichr(cp)
                else:
                    return match.group()

        return entity_re.subn(substitute_entity, string)[0]
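The function above targets Python 2 (htmlentitydefs, unichr). A possible Python 3 port is sketched below, assuming html.entities.name2codepoint as the replacement table. The name decode_html_entities is illustrative, and two behaviours are additions of this sketch rather than part of the original snippet: hexadecimal references such as "&#x3e;" are decoded, and malformed numeric references are left untouched instead of raising an error.

```python
import re
from html.entities import name2codepoint

ENTITY_RE = re.compile(r"&(#?)(\d{1,5}|\w{1,8});")


def decode_html_entities(text):
    """Decode named and numeric character references in text."""

    def substitute_entity(match):
        ent = match.group(2)
        if match.group(1) == "#":
            try:
                if ent[:1] in ("x", "X"):
                    return chr(int(ent[1:], 16))  # hexadecimal reference
                return chr(int(ent))              # decimal reference
            except (ValueError, OverflowError):
                return match.group()              # malformed: leave as is
        cp = name2codepoint.get(ent)
        if cp is None:
            return match.group()                  # unknown name: leave as is
        return chr(cp)

    return ENTITY_RE.sub(substitute_entity, text)


result = decode_html_entities("&lt;b&gt; &#62; &#x3e; &nosuch;")
print(result)  # <b> > > &nosuch;
```

On Python 3.4 or later, html.unescape covers the same ground and more, so a port like this is mainly useful for understanding how the snippet works.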

James · Answer

This is a really old question, but this may work.

Django 1.5.5

    In [1]: from django.utils.text import unescape_entities
    In [2]: unescape_entities('&lt;img class=&quot;size-medium wp-image-113&quot; style=&quot;margin-left: 15px;&quot; title=&quot;su1&quot; src=&quot;http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg&quot; alt=&quot;&quot; width=&quot;300&quot; height=&quot;194&quot; /&gt;')
    Out[2]: u'<img class="size-medium wp-image-113" style="margin-left: 15px;" title="su1" src="http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg" alt="" width="300" height="194" />'

Chris Harty · Answer

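If anyone is looking for a simple way to do this via Django templates, you can always use a filter like this:

    <html>
    {{ node.description|safe }}
    </html>

I had some data coming from a vendor, and everything I posted showed the HTML tags written out on the rendered page, as if you were looking at the source. The above code helped me greatly. I hope this helps others.

Cheers!!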

Seth Gottlieb · Answer

You can also use django.utils.html.escape:

    from django.utils.html import escape

    something_nice = escape(request.POST['something_naughty'])

Jake · Answer

I found this in the Cheetah source code (here):

    htmlCodes = [
        ['&', '&amp;'],
        ['<', '&lt;'],
        ['>', '&gt;'],
        ['"', '&quot;'],
    ]
    htmlCodesReversed = htmlCodes[:]
    htmlCodesReversed.reverse()

    def htmlDecode(s, codes=htmlCodesReversed):
        """ Returns the ASCII decoded version of the given HTML string. This does
            NOT remove normal HTML tags like <p>. It is the inverse of htmlEncode()."""
        for code in codes:
            s = s.replace(code[1], code[0])
        return s

I am not sure why they reverse the list. I think it has to do with the way they encode, so in your case it may not need to be reversed. Also, if I were you, I would change htmlCodes to be a list of tuples rather than a list of lists... this is going in my library though :)

I noticed your title asked for encoding too, so here is Cheetah's encode function:

    def htmlEncode(s, codes=htmlCodes):
        """ Returns the HTML encoded version of the given string. This is useful to
            display a plain ASCII text string on a web page."""
        for code in codes:
            s = s.replace(code[0], code[1])
        return s
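One likely reason for the list order, sketched with a stand-alone copy of the encoder (HTML_CODES and html_encode are illustrative names, not Cheetah's): when encoding, "&" has to be replaced first, otherwise the ampersands introduced by the other replacements get encoded again. Decoding then has to run in the opposite order, which would explain the reversed list.

```python
# Same pairs as Cheetah's htmlCodes, with '&' first.
HTML_CODES = [("&", "&amp;"), ("<", "&lt;"), (">", "&gt;"), ('"', "&quot;")]


def html_encode(s, codes=HTML_CODES):
    """Replace each special character with its entity, in the order given."""
    for char, entity in codes:
        s = s.replace(char, entity)
    return s


amp_first = html_encode("<p>")                        # '&' replaced first
amp_last = html_encode("<p>", codes=HTML_CODES[::-1]) # '&' replaced last

print(amp_first)  # &lt;p&gt;
print(amp_last)   # &amp;lt;p&amp;gt;
```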

Paolo Melchiorre · Answer

Searching for the simplest solution to this question in Django and Python, I found that you can use their built-in functions to escape/unescape HTML code.

Example

I saved your HTML code in scraped_html and clean_html:

    scraped_html = (
        '&lt;img class=&quot;size-medium wp-image-113&quot; '
        'style=&quot;margin-left: 15px;&quot; title=&quot;su1&quot; '
        'src=&quot;http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg&quot; '
        'alt=&quot;&quot; width=&quot;300&quot; height=&quot;194&quot; /&gt;'
    )
    clean_html = (
        '<img class="size-medium wp-image-113" style="margin-left: 15px;" '
        'title="su1" src="http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg" '
        'alt="" width="300" height="194" />'
    )

Django

You need Django >= 1.0

unescape

To unescape your scraped HTML code you can use django.utils.text.unescape_entities, which:

Converts all named and numeric character references to the corresponding Unicode characters.

    >>> from django.utils.text import unescape_entities
    >>> clean_html == unescape_entities(scraped_html)
    True

escape

To escape your clean HTML code you can use django.utils.html.escape, which:

Returns the given text with ampersands, quotes and angle brackets encoded for use in HTML.

    >>> from django.utils.html import escape
    >>> scraped_html == escape(clean_html)
    True

Python

You need Python >= 3.4

unescape

To unescape your scraped HTML code you can use html.unescape, which:

Converts all named and numeric character references (e.g. &gt;, &#62;, &#x3e;) in the string s to the corresponding Unicode characters.

    >>> from html import unescape
    >>> clean_html == unescape(scraped_html)
    True

escape

To escape your clean HTML code you can use html.escape, which:

Converts the characters &, < and > in string s to HTML-safe sequences.

    >>> from html import escape
    >>> scraped_html == escape(clean_html)
    True

Mike Samuel · Answer

The following is a Python function that uses the module htmlentitydefs. It is not perfect. The version of htmlentitydefs that I have is incomplete, and the function assumes that every entity decodes to a single code point, which is wrong for entities like &NotEqualTilde;:

http://www.w3.org/TR/html5/named-character-references.html

NotEqualTilde; U+02242 U+00338 ≂̸

With those caveats though, here's the code.

    import re
    import htmlentitydefs

    def decodeHtmlText(html):
        """
        Given a string of HTML that would parse to a single text node,
        return the text value of that node.
        """
        # Fast path for common case.
        if html.find("&") < 0:
            return html
        return re.sub(
            '&(?:#(?:x([0-9A-Fa-f]+)|([0-9]+))|([a-zA-Z0-9]+));',
            _decode_html_entity,
            html)

    def _decode_html_entity(match):
        """
        Regex replacer that expects hex digits in group 1, or
        decimal digits in group 2, or a named entity in group 3.
        """
        hex_digits = match.group(1)  # '&#x10;' -> unichr(0x10)
        if hex_digits:
            return unichr(int(hex_digits, 16))
        decimal_digits = match.group(2)  # '&#10;' -> unichr(10)
        if decimal_digits:
            return unichr(int(decimal_digits, 10))
        name = match.group(3)  # name is 'lt' when '&lt;' was matched.
        if name:
            decoding = (htmlentitydefs.name2codepoint.get(name)
                # Treat &GT; like &gt;.
                # This is wrong for &Gt; and &Lt; which HTML5 adopted from MathML.
                # If htmlentitydefs included mappings for those entities,
                # then this code will magically work.
                or htmlentitydefs.name2codepoint.get(name.lower()))
            if decoding is not None:
                return unichr(decoding)
        return match.group(0)  # Treat "&noSuchEntity;" as "&noSuchEntity;"
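For comparison, html.unescape in Python 3.4 or later works from the HTML5 entity table, so it should not share the one-code-point limitation. A quick check, assuming a Python 3 interpreter:

```python
from html import unescape

decoded = unescape("&NotEqualTilde;")

# The reference expands to two code points, as listed in the HTML5 table.
code_points = ["U+%05X" % ord(ch) for ch in decoded]
print(code_points)  # ['U+02242', 'U+00338']
```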

smilitude · Answer

Here is the simplest solution to this problem:

    {% autoescape off %}
        {{ body }}
    {% endautoescape %}

From this page.