TypeError with Python regular expressions

Question

I have this code:

url = 'http://google.com'
linkregex = re.compile('<a\s*href=[\'|"](.*?)[\'"].*?>')
m = urllib.request.urlopen(url)
msg = m.read()
links = linkregex.findall(msg)

However, Python returns this error:

links = linkregex.findall(msg)
TypeError: can't use a string pattern on a bytes-like object

What did I do wrong?

Lennart Regebro · Accepted Answer

TypeError: can't use a string pattern on a bytes-like object

What did I do wrong?

You used a string pattern on a bytes object. Use a bytes pattern instead:

linkregex = re.compile(b'<a\s*href=[\'|"](.*?)[\'"].*?>')
                       ^
Add the b there; it makes the pattern a bytes object.
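To see the difference without touching the network, here is a minimal sketch that reproduces the error and the fix on an in-memory bytes value. The sample HTML and the names `str_pattern` and `bytes_pattern` are made up for illustration, and the pattern is a slightly simplified version of the one in the question.

```python
import re

# Assumed sample standing in for the bytes that m.read() returns.
msg = b'<a href="http://example.com/a">A</a> <a href=\'/b\'>B</a>'

# Same expression twice: once as a str pattern, once as a bytes pattern.
str_pattern = re.compile(r"""<a\s*href=['"](.*?)['"].*?>""")
bytes_pattern = re.compile(rb"""<a\s*href=['"](.*?)['"].*?>""")

try:
    str_pattern.findall(msg)  # str pattern on bytes input
    error = None
except TypeError as exc:
    error = str(exc)

links = bytes_pattern.findall(msg)  # bytes pattern on bytes input

print(error)
print(links)
```

Note that the matches from a bytes pattern are themselves bytes objects, so they still need decoding if you want text.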

(P.S.:

 >>> from disclaimer include dont_use_regexp_on_html
 "Use BeautifulSoup or lxml instead."

)
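For readers who want to follow that advice without installing anything, here is a rough sketch using the standard library's `html.parser` instead of BeautifulSoup or lxml. That swap is my assumption, not what the answer recommends, and `LinkCollector`, `extract_links`, and the sample markup are hypothetical names and data.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lowercase.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(raw, encoding="utf-8"):
    """Decode the raw bytes, parse them, and return the collected hrefs."""
    parser = LinkCollector()
    parser.feed(raw.decode(encoding, errors="replace"))
    parser.close()
    return parser.links


# Assumed sample input; a real page would come from m.read().
sample = b'<p><a class="x" href="/one">1</a><A HREF=\'two.html\'>2</A></p>'
links = extract_links(sample)
print(links)
```

A parser copes with attribute order, letter case, and quoting style, which is where hand-written patterns tend to break.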

Morten Kristensen · Answer

If you are running Python 2.6, "urllib" has no "request". So the third line becomes:

m = urllib.urlopen(url)

And in version 3 you need to use this:

links = linkregex.findall(str(msg))

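because "msg" is a bytes object, not the string that findall() expects. Be aware that str(msg) only produces the printable representation of the bytes (text that starts with b and shows non-ASCII bytes as \x escapes), so it is usually better to decode using the correct encoding. For example, if the encoding is "latin1":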

links = linkregex.findall(msg.decode("latin1"))
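The two options are not equivalent, which a small sketch can show. The sample bytes below are assumed (Latin-1 encoded, with one non-ASCII character), and the pattern is a simplified version of the one in the question.

```python
import re

# Assumed sample: Latin-1 encoded bytes, standing in for what m.read() returns.
msg = "<a href='/caf\u00e9'>menu</a>".encode("latin1")

linkregex = re.compile(r"""<a\s*href=['"](.*?)['"].*?>""")

as_repr = str(msg)              # printable repr of the bytes, not the decoded text
as_text = msg.decode("latin1")  # the actual text

links_from_repr = linkregex.findall(as_repr)
links_from_text = linkregex.findall(as_text)

print(as_repr)
print(links_from_repr)
print(links_from_text)
```

With str(), the non-ASCII byte survives only as a literal backslash escape inside the match, while decode() gives back the real character.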

Jeremy Whitlock · Answer

My version of Python doesn't have a urllib with a request attribute, but when I use "urllib.urlopen(url)" I don't get a string back; I get an object. That is the cause of the type error.
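In Python 3 the same thing can be seen step by step: urlopen() hands back a response object, its read() returns bytes, and only a decode gives you a str. The sketch below uses a `data:` URL purely so it runs without network access; that URL, the sample markup, and the fallback to UTF-8 are all my assumptions.

```python
import base64
import re
import urllib.request

# Assumed sample page, packed into a data: URL so no network access is needed.
html = '<a href="/caf\u00e9">menu</a>'.encode("utf-8")
url = "data:text/html;charset=utf-8;base64," + base64.b64encode(html).decode("ascii")

with urllib.request.urlopen(url) as response:
    raw = response.read()  # bytes, not str
    # Use the charset declared in the Content-Type header, else fall back to UTF-8.
    charset = response.headers.get_content_charset() or "utf-8"

text = raw.decode(charset)  # now a str, so a str pattern is fine
links = re.findall(r"""<a\s*href=['"](.*?)['"].*?>""", text)

print(type(raw).__name__, charset)
print(links)
```

For a real HTTP response the header may not declare a charset at all, so the fallback is a guess rather than a guarantee.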

Seppo Enarvi · Answer

The regex pattern and the string must be of the same type. If you are matching a regular string, you need a string pattern. If you are matching a byte string, you need a bytes pattern.

In this case, m.read() returns a byte string, so you need a bytes pattern. In Python 3, regular strings are Unicode strings, and you need the b modifier to specify a byte string literal:

linkregex = re.compile(b'<a\s*href=[\'|"](.*?)[\'"].*?>')
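If a function may receive either type, one way to apply this rule is to keep both versions of the pattern and pick by input type. This is only a sketch; `find_links`, `STR_PATTERN`, and `BYTES_PATTERN` are hypothetical names, and the expression is a simplified version of the one in the question.

```python
import re

LINK_RE_TEXT = r"""<a\s*href=['"](.*?)['"].*?>"""

STR_PATTERN = re.compile(LINK_RE_TEXT)
# The expression is pure ASCII, so encoding it yields an equivalent bytes pattern.
BYTES_PATTERN = re.compile(LINK_RE_TEXT.encode("ascii"))


def find_links(data):
    """Match with the pattern whose type agrees with the type of data."""
    if isinstance(data, bytes):
        return BYTES_PATTERN.findall(data)
    return STR_PATTERN.findall(data)


print(find_links(b'<a href="/x">x</a>'))
print(find_links('<a href="/x">x</a>'))
```

The result type follows the input type: bytes in, bytes out; str in, str out.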

John · Answer

The Google URL didn't work for me, so I substituted http://www.google.com/ig?hl=en, which does work for me.

Try this:

import re
import urllib.request

url = "http://www.google.com/ig?hl=en"
linkregex = re.compile('<a\s*href=[\'|"](.*?)[\'"].*?>')
m = urllib.request.urlopen(url)
msg = m.read()  # bytes
# str(msg) gives the printable repr of the bytes, which a str pattern can search
links = linkregex.findall(str(msg))
print(links)

Hope this helps.

user3022012 · Answer

This worked for me in Python 3. Hope it helps.

import urllib.request
import re

urls = ["https://google.com", "https://nytimes.com", "http://CNN.com"]
regex = '<title>(.+?)</title>'
pattern = re.compile(regex)

for url in urls:
    htmlfile = urllib.request.urlopen(url)
    htmltext = htmlfile.read()
    # str() turns the bytes into their printable repr so the str pattern can search it
    titles = pattern.search(str(htmltext))
    print(titles)

This version also works; here I added b before the regex to turn it into a bytes pattern:

import urllib.request
import re

urls = ["https://google.com", "https://nytimes.com", "http://CNN.com"]
regex = b'<title>(.+?)</title>'
pattern = re.compile(regex)

for url in urls:
    htmlfile = urllib.request.urlopen(url)
    htmltext = htmlfile.read()
    # bytes pattern on bytes input, so no conversion is needed
    titles = pattern.search(htmltext)
    print(titles)
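Both versions print the match object itself (or None when no title is found). If you want the title text, something like the following sketch may help. It assumes the page bytes are already in hand; `extract_title`, the UTF-8 default, and the sample inputs are hypothetical.

```python
import re

TITLE_RE = re.compile(rb"<title>(.+?)</title>")


def extract_title(raw, encoding="utf-8"):
    """Return the decoded <title> text, or None if the page has no title."""
    match = TITLE_RE.search(raw)
    if match is None:
        return None
    # group(1) is bytes because the pattern is a bytes pattern.
    return match.group(1).decode(encoding, errors="replace")


# Assumed sample pages standing in for htmlfile.read() results.
print(extract_title(b"<html><title>Example Domain</title></html>"))
print(extract_title(b"<html><body>no title</body></html>"))
```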