python 3

Question

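I have a page whose source I need to fetch for use with BS4, but the middle of the page takes about 1 second (possibly less) to load its content, and requests.get captures the page source before that section has loaded. Is there a way to wait one second before getting the data?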

r = requests.get(URL + self.search, headers=USER_AGENT, timeout=5)
soup = BeautifulSoup(r.content, 'html.parser')
a = soup.find_all('section', 'wrapper')

Page:

<section class="wrapper" id="resultado_busca">


Vinícius Aguiar · Accepted Answer

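This does not look like a waiting problem; the element appears to be created by JavaScript. requests cannot handle elements that JavaScript generates dynamically, because it only downloads the HTML the server sends and never runs the page's scripts. I suggest using Selenium together with PhantomJS to get the rendered page source, which you can then parse with BeautifulSoup. The code below does exactly that.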

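from bs4 import BeautifulSoup
from selenium import webdriver  # the package name is lowercase

url = "http://legendas.tv/busca/walking%20dead%20s03e02"

# PhantomJS runs the page's JavaScript before the source is read.
# Note: webdriver.PhantomJS was removed in Selenium 4; there, a headless
# browser such as webdriver.Chrome or webdriver.Firefox plays this role.
browser = webdriver.PhantomJS()
browser.get(url)
html = browser.page_source
browser.quit()

soup = BeautifulSoup(html, 'lxml')  # requires the lxml package
a = soup.find('section', 'wrapper')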

Also, if you are looking for only one element, there is no need to use .findAll; .find returns the first match.
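To illustrate why waiting does not help, here is a minimal sketch using only the standard library. The RAW_HTML string is hypothetical: it stands in for what a server might send when the section is filled in later by a script, and is not the real markup of the page in the question. The helper count_section_children is likewise an illustrative name. A parser that does not execute JavaScript sees the section as empty, however long it waits.

```python
from html.parser import HTMLParser

# Hypothetical markup (an assumption, not the real page): the section is
# present but empty, and a script would fill it in inside a browser.
RAW_HTML = """
<html><body>
<section class="wrapper" id="resultado_busca"></section>
<script>
  document.getElementById("resultado_busca").innerHTML = "<article>result</article>";
</script>
</body></html>
"""


class SectionChildCounter(HTMLParser):
    """Counts start tags inside <section id="resultado_busca">.

    Assumes the target section contains no nested <section> elements.
    """

    def __init__(self):
        super().__init__()
        self.inside = False   # True while between the section's tags
        self.found = False    # True once the target section was seen
        self.children = 0     # start tags seen inside the section

    def handle_starttag(self, tag, attrs):
        if self.inside:
            self.children += 1
        elif tag == "section" and dict(attrs).get("id") == "resultado_busca":
            self.found = True
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "section":
            self.inside = False


def count_section_children(html):
    """Return (found, children) for the target section in the given HTML."""
    parser = SectionChildCounter()
    parser.feed(html)
    return parser.found, parser.children


found, children = count_section_children(RAW_HTML)
print(found, children)
```

The script body is treated as plain text by the parser, so the article element it would create is never counted. Only a tool that actually runs the script, such as a browser driven by Selenium, produces the filled-in section.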

Ingy Swan · Answer

In Python 3, I have found that in practice the urllib module works better than the requests module when loading dynamic web pages.

That is:

import urllib.error
import urllib.request

# url is assumed to be defined earlier as the address of the page to fetch
try:
    with urllib.request.urlopen(url) as response:
        html = response.read().decode('utf-8')  # use whatever encoding the webpage declares
except urllib.error.HTTPError as e:
    if e.code == 404:
        print(f"{url} is not found")
    elif e.code == 503:
        print(f"{url} base webservices are not available")
        # can add authentication here
    else:
        print('http error', e)
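On the encoding comment in the snippet above: rather than hard-coding 'utf-8', one option is to read the charset from the response's Content-Type header. The following is a small sketch of that idea; the function name fetch_text and the fallback_encoding parameter are illustrative, not part of the original answer.

```python
import urllib.request


def fetch_text(url, fallback_encoding="utf-8"):
    """Return the response body decoded with the charset the server declares."""
    with urllib.request.urlopen(url) as response:
        # get_content_charset() reads the charset parameter of the
        # Content-Type header and returns None when it is absent.
        charset = response.headers.get_content_charset() or fallback_encoding
        return response.read().decode(charset)
```

This can be tried without network access by passing a data: URL, which urlopen also accepts, for example fetch_text("data:text/html;charset=utf-8,hello"). Pages that declare their encoding only in a meta tag would still need the fallback or a separate check.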