What is the best way to convert a Unicode URL to ASCII (UTF-8 percent-escaped) in Python?

Question

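I want to convert a URL with Unicode characters in the domain name and path to the equivalent ASCII URL, with the domain encoded as IDNA and the path %-encoded, as per RFC 3986.

I get a UTF-8 URL from the user. So if they type in http://➡.ws/♥, I get 'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5' in Python. What I want out is the ASCII version: 'http://xn--hgi.ws/%E2%99%A5'.

What I'm doing at the moment is splitting the URL into parts with a regex, manually IDNA-encoding the domain, and then encoding the path and query string separately with different urllib.quote() calls.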

    # url is UTF-8 here, eg: url = u'http://➡.ws/㉌'.encode('utf-8')
    match = re.match(r'([a-z]{3,5})://(.+\.[a-z0-9]{1,6})'
                     r'(:\d{1,5})?(/.*?)(\?.*)?$', url, flags=re.I)
    if not match:
        raise BadURLException(url)
    protocol, domain, port, path, query = match.groups()

    try:
        domain = unicode(domain, 'utf-8')
    except UnicodeDecodeError:
        return ''  # bad UTF-8 chars in domain
    domain = domain.encode('idna')

    if port is None:
        port = ''

    path = urllib.quote(path)

    if query is None:
        query = ''
    else:
        query = urllib.quote(query, safe='=&?/')

    url = protocol + '://' + domain + port + path + query
    # url is ASCII here, eg: url = 'http://xn--hgi.ws/%E3%89%8C'

Is this correct? Are there any better suggestions? Is there a simple standard-library function that does this?

Markus Jarderot · Accepted Answer

Code:

    import urlparse, urllib

    def fixurl(url):
        # turn string into unicode
        if not isinstance(url, unicode):
            url = url.decode('utf8')

        # parse it
        parsed = urlparse.urlsplit(url)

        # divide the netloc further
        userpass, at, hostport = parsed.netloc.rpartition('@')
        user, colon1, pass_ = userpass.partition(':')
        host, colon2, port = hostport.partition(':')

        # encode each component
        scheme = parsed.scheme.encode('utf8')
        user = urllib.quote(user.encode('utf8'))
        colon1 = colon1.encode('utf8')
        pass_ = urllib.quote(pass_.encode('utf8'))
        at = at.encode('utf8')
        host = host.encode('idna')
        colon2 = colon2.encode('utf8')
        port = port.encode('utf8')
        path = '/'.join(  # could be encoded slashes!
            urllib.quote(urllib.unquote(pce).encode('utf8'), '')
            for pce in parsed.path.split('/')
        )
        query = urllib.quote(urllib.unquote(parsed.query).encode('utf8'), '=&?/')
        fragment = urllib.quote(urllib.unquote(parsed.fragment).encode('utf8'))

        # put it back together
        netloc = ''.join((user, colon1, pass_, at, host, colon2, port))
        return urlparse.urlunsplit((scheme, netloc, path, query, fragment))

    print fixurl('http://\xe2\x9e\xa1.ws/\xe2\x99\xa5')
    print fixurl('http://\xe2\x9e\xa1.ws/\xe2\x99\xa5/%2F')
    print fixurl(u'http://Åsa:abc123@➡.ws:81/admin')
    print fixurl(u'http://➡.ws/admin')

Output:

http://xn--hgi.ws/%E2%99%A5
http://xn--hgi.ws/%E2%99%A5/%2F
http://%C3%85sa:abc123@xn--hgi.ws:81/admin
http://xn--hgi.ws/admin

Edit:

Fixed the case of already-quoted characters in the string.
Changed urlparse/urlunparse to urlsplit/urlunsplit.
Don't encode user and port information together with the host name. (Thanks Jehiah)
When "@" is missing, don't treat the host/port as user/pass. (Thanks hupf)
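
For readers on Python 3, the same idea can be sketched with urllib.parse. This is not part of the original answer: the name fixurl_py3 and the requote helper are hypothetical, the input is assumed to be a str rather than UTF-8 bytes, and the built-in 'idna' codec is assumed to be good enough for the host names involved.

```python
from urllib.parse import quote, unquote_to_bytes, urlsplit, urlunsplit


def fixurl_py3(url):
    """Hypothetical Python 3 take on fixurl(); expects a str."""
    parsed = urlsplit(url)

    # Split the netloc into optional user info and host[:port].
    # rpartition leaves userpass empty when there is no '@'.
    userpass, at, hostport = parsed.netloc.rpartition('@')
    user, colon1, password = userpass.partition(':')
    host, colon2, port = hostport.partition(':')

    # IDNA-encode only the host name; the codec returns ASCII bytes.
    host = host.encode('idna').decode('ascii')

    def requote(part, safe):
        # Undo existing %-escapes, then escape again, so escapes come out
        # upper-case and non-ASCII text becomes UTF-8 percent-escapes.
        return quote(unquote_to_bytes(part), safe=safe)

    # Quote path segments one at a time so an encoded slash stays %2F.
    path = '/'.join(requote(seg, '') for seg in parsed.path.split('/'))
    query = requote(parsed.query, '=&?/')
    fragment = requote(parsed.fragment, '/')

    netloc = ''.join((quote(user), colon1, quote(password),
                      at, host, colon2, port))
    return urlunsplit((parsed.scheme, netloc, path, query, fragment))


print(fixurl_py3('http://➡.ws/♥'))
print(fixurl_py3('http://➡.ws/♥/%2F'))
print(fixurl_py3('http://Åsa:abc123@➡.ws:81/admin'))
```

Under those assumptions the three calls should print the same first three URLs as the output listed above. Going through unquote_to_bytes rather than unquote is a deliberate choice in this sketch: it lets escapes that are not valid UTF-8 pass through unchanged.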

ray keung · Answer

The code given by MizardX isn't 100% correct. This example won't work:

example.com/folder/?page=2

Check out django.utils.encoding.iri_to_uri() to convert Unicode URLs to ASCII URLs.

http://docs.djangoproject.com/en/dev/ref/unicode/
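
The answer doesn't say exactly how that example fails. One thing that can be checked with the standard library is how a URL without a scheme is split: the host ends up in the path rather than in netloc, so any host-specific step such as IDNA encoding would not see it. The sketch below uses Python 3's urllib.parse; ensure_scheme is a hypothetical helper, and defaulting to 'http' is an assumption about the desired behaviour.

```python
from urllib.parse import urlsplit


def ensure_scheme(url, default='http'):
    """Hypothetical helper: prepend a default scheme when none is present."""
    if not urlsplit(url).scheme:
        return default + '://' + url
    return url


# Without a scheme the host is parsed as part of the path.
bare = urlsplit('example.com/folder/?page=2')
print(repr(bare.netloc), repr(bare.path), repr(bare.query))

# With a scheme prepended the host lands in netloc as expected.
fixed = urlsplit(ensure_scheme('example.com/folder/?page=2'))
print(repr(fixed.netloc), repr(fixed.path), repr(fixed.query))
```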

Ben Hoyt · Answer

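Okay, with these comments and some bug fixes in my own code (it didn't handle fragments at all), I've come up with the following canonurl() function, which returns a canonical, ASCII form of the URL: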

    import re
    import urllib
    import urlparse

    def canonurl(url):
        r"""Return the canonical, ASCII-encoded form of a UTF-8 encoded URL, or ''
        if the URL looks invalid.

        >>> canonurl(' ')
        ''
        >>> canonurl('www.google.com')
        'http://www.google.com/'
        >>> canonurl('bad-utf8.com/path\xff/file')
        ''
        >>> canonurl('svn://blah.com/path/file')
        'svn://blah.com/path/file'
        >>> canonurl('1234://badscheme.com')
        ''
        >>> canonurl('bad$scheme://google.com')
        ''
        >>> canonurl('site.badtopleveldomain')
        ''
        >>> canonurl('site.com:badport')
        ''
        >>> canonurl('http://123.24.8.240/blah')
        'http://123.24.8.240/blah'
        >>> canonurl('http://123.24.8.240:1234/blah?q#f')
        'http://123.24.8.240:1234/blah?q#f'
        >>> canonurl('\xe2\x9e\xa1.ws')  # tinyarro.ws
        'http://xn--hgi.ws/'
        >>> canonurl(' http://www.google.com:80/path/file;params?query#fragment ')
        'http://www.google.com:80/path/file;params?query#fragment'
        >>> canonurl('http://\xe2\x9e\xa1.ws/\xe2\x99\xa5')
        'http://xn--hgi.ws/%E2%99%A5'
        >>> canonurl('http://\xe2\x9e\xa1.ws/\xe2\x99\xa5/pa%2Fth')
        'http://xn--hgi.ws/%E2%99%A5/pa/th'
        >>> canonurl('http://\xe2\x9e\xa1.ws/\xe2\x99\xa5/pa%2Fth;par%2Fams?que%2Fry=a&b=c')
        'http://xn--hgi.ws/%E2%99%A5/pa/th;par/ams?que/ry=a&b=c'
        >>> canonurl('http://\xe2\x9e\xa1.ws/\xe2\x99\xa5?\xe2\x99\xa5#\xe2\x99\xa5')
        'http://xn--hgi.ws/%E2%99%A5?%E2%99%A5#%E2%99%A5'
        >>> canonurl('http://\xe2\x9e\xa1.ws/%e2%99%a5?%E2%99%A5#%E2%99%A5')
        'http://xn--hgi.ws/%E2%99%A5?%E2%99%A5#%E2%99%A5'
        >>> canonurl('http://badutf8pcokay.com/%FF?%FE#%FF')
        'http://badutf8pcokay.com/%FF?%FE#%FF'
        >>> len(canonurl('google.com/' + 'a' * 16384))
        4096
        """
        # strip spaces at the ends and ensure it's prefixed with 'scheme://'
        url = url.strip()
        if not url:
            return ''
        if not urlparse.urlsplit(url).scheme:
            url = 'http://' + url

        # turn it into Unicode
        try:
            url = unicode(url, 'utf-8')
        except UnicodeDecodeError:
            return ''  # bad UTF-8 chars in URL

        # parse the URL into its components
        parsed = urlparse.urlsplit(url)
        scheme, netloc, path, query, fragment = parsed

        # ensure scheme is a letter followed by letters, digits, and '+-.' chars
        if not re.match(r'[a-z][-+.a-z0-9]*$', scheme, flags=re.I):
            return ''
        scheme = str(scheme)

        # ensure domain and port are valid, eg: sub.domain.<1-to-6-TLD-chars>[:port]
        match = re.match(r'(.+\.[a-z0-9]{1,6})(:\d{1,5})?$', netloc, flags=re.I)
        if not match:
            return ''
        domain, port = match.groups()
        netloc = domain + (port if port else '')
        netloc = netloc.encode('idna')

        # ensure path is valid and convert Unicode chars to %-encoded
        if not path:
            path = '/'  # eg: 'http://google.com' -> 'http://google.com/'
        path = urllib.quote(urllib.unquote(path.encode('utf-8')), safe='/;')

        # ensure query is valid
        query = urllib.quote(urllib.unquote(query.encode('utf-8')), safe='=&?/')

        # ensure fragment is valid
        fragment = urllib.quote(urllib.unquote(fragment.encode('utf-8')))

        # piece it all back together, truncating it to a maximum of 4KB
        url = urlparse.urlunsplit((scheme, netloc, path, query, fragment))
        return url[:4096]

    if __name__ == '__main__':
        import doctest
        doctest.testmod()
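
The unquote-then-quote step in canonurl() has two visible effects in the doctests: lower-case escapes such as %e2 come back upper-case, and an escaped slash (%2F) turns into a literal '/' because '/' is in the safe set. A minimal Python 3 sketch of just that step follows; requote is a hypothetical helper name, and going through unquote_to_bytes is this sketch's assumption for keeping escapes that are not valid UTF-8 intact.

```python
from urllib.parse import quote, unquote_to_bytes


def requote(part, safe='/'):
    """Hypothetical helper: decode existing %-escapes, then escape again."""
    return quote(unquote_to_bytes(part), safe=safe)


# Lower-case escapes are normalised to upper-case.
print(requote('/%e2%99%a5'))

# With '/' in the safe set, %2F is decoded and stays a literal slash.
print(requote('/♥/pa%2Fth', safe='/;'))

# Bytes that are not valid UTF-8 survive the round trip.
print(requote('/%FF'))
```

If an encoded slash has to be preserved, the path can be split on '/' first and each segment quoted with an empty safe set, which is what fixurl() in the accepted answer does.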

Alex Martelli · Answer

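Work on RFC 3986 URL parsing is in progress (e.g., as part of the Summer of Code), but AFAIK there is nothing in the standard library yet, and nothing much on the URI encoding side of things either, again AFAIK. So you might as well go with MizardX's elegant approach.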

Ben Blank · Answer

You might use urlparse.urlsplit instead, but otherwise you seem to have a very straightforward solution there.

protocol, domain, path, query, fragment = urlparse.urlsplit(url)

(You can access the domain and port separately through the named properties of the returned value, but since port syntax is always ASCII, it is unaffected by the IDNA encoding process.)
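
As a small illustration of those named properties, here is a sketch using Python 3's urllib.parse (the module that replaced urlparse). The sample URL is made up for the example, and only the host name is passed through the 'idna' codec.

```python
from urllib.parse import urlsplit

parts = urlsplit('http://➡.ws:81/admin?x=1#top')

# hostname and port are exposed separately from netloc.
host = parts.hostname
port = parts.port

# Only the host needs IDNA encoding; the port is already ASCII digits.
ascii_host = host.encode('idna').decode('ascii')

print(ascii_host, port)
```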