Python requests response encoded in utf-8 but cannot be decoded

Question

I am trying to scrape my messenger.com (Facebook Messenger) chats using Python. I used Google Chrome developer tools to inspect the POST request for the chat history, and I copied the entire headers and body into a format that requests can use.

I at least get HTTP code 200, which means the request got something, and printing _res.encoding_ shows the returned encoding: utf-8. But I cannot decode it!

Here is the function:

    def download_thread(self, limit, offset, message_timestamp):
        """Download the specified number of messages from the
        provided thread, with an optional offset
        """
        data = request_data(self.thread, offset=offset, limit=limit,
                            group=self.group, timestamp=message_timestamp)
        res = self.ses.post(url_thread, data=data, headers=headers)
        print(res.content)
        thread_contents = json.loads(res.content)
        print(thread_contents)
        return thread_contents

This yields:

_UnicodeDecodeError: 'utf-8' codec can't decode byte 0x87 in position 0: invalid start byte _

when I try to _json.load_ (or loads) the data.

However, _res.encoding_ returns utf-8.

I tried decompressing it with gzip, but it is not gzip-compressed content.

When I try print(res.content), I get:

_Traceback (most recent call last): File "FBChatScraper.py", line 200, in <module> main() File "FBChatScraper.py", line 134, in main fbms.run() 0f\x82\x048\xbb\xb9=\x87\xebK0.\xff\x90\xdd\xeb\xfa\x16\xc6\xbbz\x8b\x82)\xe8\xaaV\x01^\xda\x8b\xbd\x15d-\xb1\x10@\x17\\xd43\xa8\x92w\xe8\xc0\xcdU\xc4\xff\xc7\xfa\x90\xb2\xb3\xf5\x84\x11u\x0b	\x8f\x83r\xf3}\xe5!y$\xe6\xf6c0\xf0\xb4\x98\xcat_\x0c\x08\xb5\xdd\x8ctx\x91\xa9\x95
B%\xe2\x93\xa52\x85_\xa6\x10\xc2\xc9\xa3\xee4SDb\xa5\x18QJ\x83X\x19)\xaa$\xf4\xb4\xb7\x0b\x84\x15&\x88\x08L\xc9iP\xa2\xb9\xf2\xaf\x96\x96N\xd8\xcf=\x05\xc1\x18\x8d\xa0\xf2Y\x8e
\xcf\xc8\x0fE4\xd6)\xa1\xd4\xb7D\xd6{i\xc8P\x96R\x11HC\xac\xbcKyT#~}\x93\xf7@K\xc7r/\x82\xb0\xe4\xefX\xf9j\x08\xa6Hp\xfcn\x06\xfdo\x9a\xd0wJ\xb4fJ(\x89+\x1c\xf6\x0eOI\x90\xac\x9eDD\xfd,\xa5\xe9\x89\x1blh\x86Z\x98\x05\xdd9\xc7\xf4\x80\xfcY\x8e\xad\xee\x99!\x15\x13+\x9b\x07\xe8Fdj\xfc\x11\xfc\xfe7\x06h\x02\x00@>]W\x92\xc9\x02\xb1c3\x82\xcd\xa4\xefN9\x90\xe6\x81y\x9c\x84er\xd4\xc3\x06\x1c\x06\x14\xcf\xc7\x07hj\xbfH\xdc\xf5~\xf7z\x18Ce\xaf^\x8c\xab \xdfV\xce\xb8\x11\xf8\x06\x03' Traceback (most recent call last): File "FBChatScraper.py", line 200, in <module> main() File "FBChatScraper.py", line 134, in main fbms.run() File "FBChatScraper.py", line 43, in run thread_contents = self.download_thread(limit, offset, message_timestamp) File "FBChatScraper.py", line 74, in download_thread thread_contents = json.loads(res.content) File "/Users/silman/anaconda/lib/python3.6/json/__init__.py", line 349, in loads s = s.decode(detect_encoding(s), 'surrogatepass') UnicodeDecodeError: 'utf-8' codec can't decode byte 0x87 in position 0: invalid start byte _

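Strangely, the content is printed in the middle of the traceback; I think invisible characters are pushing the content down.

No matter how I process the response content, I cannot load it as JSON, because it is not in a form the json library can interpret.

In addition, print(res.text) produces garbage: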

_Traceback (most recent call last): File "FBChatScraper.py", line 200, in <module> main() File "FBChatScraper.py", line 134, in main fbms.run() }sP���c���f�u0���\� QZed�C��� M$x�Ҹ�H�����eǘ�]���5���^�*�ӄaM�Y��b���/ڶ�JW/���>H6z�\��l4����t=i��%Ҳu�x��%�x� F <���{1i�#%;�rɲ=Rχm��1B�Z(+�(S-���#��\v�{b�� � f/V�i̴��_��83� �_����*��O�� ������Z��i-�TVeaG54�!v�a?ǯ|gu-g��.���"J$�L`&�tΊ#s)�H����s���q���^׷0��[)���j�ॽ�T���U���J�ЁwW���!eg�#j ��r��$y���3�4��4.��M�@Kb�AX�SDb�QJ�X)�,���a� "Sp�h�����sOA0Vé|�������:%�rKdKC���@ M��.�^ � �g���SWQHӳ.��BӄG�,����@E�������� nras��L�/��ch@>]W���c3�ͤ�N9��y��er����hj�H��~�zCe�^�� �Vθ� Traceback (most recent call last): File "FBChatScraper.py", line 200, in <module> main() File "FBChatScraper.py", line 134, in main fbms.run() File "FBChatScraper.py", line 43, in run thread_contents = self.download_thread(limit, offset, message_timestamp) File "FBChatScraper.py", line 74, in download_thread thread_contents = json.loads(res.content) File "/Users/silman/anaconda/lib/python3.6/json/__init__.py", line 349, in loads s = s.decode(detect_encoding(s), 'surrogatepass') UnicodeDecodeError: 'utf-8' codec can't decode byte 0x87 in position 0: invalid start byte _

Edit:

Here is an MWE, as far as I can provide one. I have omitted some parts because I don't know which data in the POST request is private.

Using this data:

    url_thread = "https://www.messenger.com/api/graphqlbatch/"

    request_data = {
        "batch_name": "MessengerGraphQLThreadFetcher",
        "__user": "<user_id>",
        "__a": "1",
        "__dyn": "<dyn>",
        "__req": "9",
        "__be": "-1",
        "__pc": "PHASED:messengerdotcom_pkg",
        "fb_dtsg": "AQFni7TU2nes:AQGSC8FSDqyw",
        "ttstamp": "265817254666710077746711957586581715370521181008510710777",
        "__rev": "3791607",
        "jazoest": "<jazoest>",
        "queries": "<queries>",
    }

    headers = {
        "authority": "www.messenger.com",
        "method": "POST",
        "path": "/api/graphqlbatch/",
        "scheme": "https",
        "accept": "*/*",
        "accept-encoding": "gzip, deflate, br",
        "accept-language": "en-US,en;q=0.9",
        "cache-control": "no-cache",
        "content-length": "754",
        "content-type": "application/x-www-form-urlencoded",
        "cookie": "<cookies>",
        "Origin": "https://www.messenger.com",
        "pragma": "no-cache",
        "referer": "https://www.messenger.com/t/<chatID>",
        "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36",
    }

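You can get all of the _<items>_ by using Chrome developer tools and looking in the Network tab for a POST request to _Request URL: https://www.messenger.com/api/graphqlbatch/_.

It is easy to find if you scroll up to reload older messages while Chrome developer tools is recording.

Then put together a simple request in Python: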

    import json
    import time

    import requests as rq

    ses = rq.Session()

    # The two values below are placeholders; fill them in before running.
    thread = <ID of thread found in URL of messenger.com>
    conversation_type = <'thread_fbids' if group chat else 'user_ids'>

    data = request_data
    data['messages[{}][{}][offset]'.format(conversation_type, thread)] = 0
    data['messages[{}][{}][timestamp]'.format(conversation_type, thread)] = int(time.time())
    data['messages[{}][{}][limit]'.format(conversation_type, thread)] = 2000

    res = ses.post(url_thread, data=data, headers=headers)
    print(res.content)
    thread_contents = json.loads(res.content)
    print(thread_contents)

In my developer tools the response does come back, and the beginning of the JSON can be seen here.

abarnert · Accepted Answer

The problem is this line in your request headers:

"accept-encoding": "gzip, deflate, br",

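That br requests Brotli compression, a new compression standard (see RFC 7932) that Google is pushing to replace gzip on the web. Chrome asks for Brotli because recent versions of Chrome understand it natively. You are asking for Brotli because you copied the headers from Chrome. But requests does not understand Brotli natively, so the bytes in res.content are still compressed, which is why they cannot be decoded as UTF-8 even though res.encoding reports utf-8.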
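One way to check this diagnosis is to look at what the server says it sent back. The sketch below is illustrative only and not part of the original answer: `diagnose_body` is a hypothetical helper name, and it assumes `headers` is a mapping such as `res.headers` and `body` is the raw bytes from `res.content`. As far as I know, Brotli streams carry no magic number, so only the Content-Encoding header can reveal them, whereas gzip streams start with the bytes 1f 8b.

```python
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def diagnose_body(headers, body):
    """Return a short label for how the response body appears to be compressed."""
    # Normalise header names so a plain dict works as well as res.headers.
    lowered = {name.lower(): value for name, value in headers.items()}
    declared = lowered.get("content-encoding", "").strip().lower()
    if declared:
        # Trust what the server declared, e.g. "br", "gzip" or "deflate".
        return declared
    if body.startswith(GZIP_MAGIC):
        return "gzip"
    return "identity-or-unknown"
```

With the question's request, calling `diagnose_body(res.headers, res.content)` would be expected to return `"br"` if the server honoured the copied accept-encoding header.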

You can pip install brotli and register the decompressor, or just call it manually on res.content. But a simpler solution is to remove the br:
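The manual route could look roughly like the following. This is a minimal sketch under assumptions, not the answerer's code: `decode_json_body` is a hypothetical helper, it relies on the third-party `brotli` package exposing `brotli.decompress`, and it assumes requests has already undone gzip and deflate, so only Brotli is handled here. If your installed stack already decodes Brotli on its own, this step would be unnecessary.

```python
import json


def decode_json_body(content, content_encoding):
    """Decompress a Brotli body if needed, then parse it as UTF-8 JSON."""
    encoding = (content_encoding or "").strip().lower()
    if encoding == "br":
        # Imported lazily so the helper still works without the package
        # when the body is not Brotli-compressed (pip install brotli).
        import brotli

        content = brotli.decompress(content)
    # For gzip/deflate, the body is assumed to be decompressed already.
    return json.loads(content.decode("utf-8"))
```

In the question's function this would replace the direct `json.loads(res.content)` call, for example `thread_contents = decode_json_body(res.content, res.headers.get("Content-Encoding"))`.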

"accept-encoding": "gzip, deflate",

…and you should get back gzip, which you and requests already know how to handle.
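If the headers are pasted from Chrome regularly, the same edit can be applied in code instead of by hand. The helper below is only a sketch; `without_brotli` is a name made up for this example, and it does nothing beyond dropping the br token from any accept-encoding header.

```python
def without_brotli(headers):
    """Return a copy of headers whose accept-encoding no longer offers br."""
    cleaned = dict(headers)
    for name, value in headers.items():
        if name.lower() != "accept-encoding":
            continue
        codings = [coding.strip() for coding in value.split(",")]
        # Compare only the coding name, ignoring any ";q=..." weight.
        kept = [c for c in codings if c.split(";")[0].strip().lower() != "br"]
        cleaned[name] = ", ".join(kept)
    return cleaned
```

The request from the question would then become `ses.post(url_thread, data=data, headers=without_brotli(headers))`, leaving the original headers dict unchanged.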