How to get the JavaScript-rendered HTML source using Selenium

Question

I run a query on a web page and get a result URL. If I right-click and choose "View HTML source", I can see the HTML generated by JS. If I simply use urllib, Python cannot get the JS-generated code, so I looked at some solutions that use Selenium. Here is my code:

    from selenium import webdriver

    url = 'http://www.archives.com/member/Default.aspx?_act=VitalSearchResult&lastName=Smith&state=UT&country=US&deathYear=2004&deathYearSpan=10&location=UT&activityID=9b79d578-b2a7-4665-9021-b104999cf031&RecordType=2'
    # Raw string so the backslashes in the Windows path are not treated as escapes
    driver = webdriver.PhantomJS(executable_path=r'C:\python27\scripts\phantomjs.exe')
    driver.get(url)
    print(driver.page_source)
    >>> <html><head></head><body></body></html>

Obviously this is not right!

The source I need, as shown in the right-click window, is the following (I need the information part):

</script></div><div class="searchColRight"><div id="topActions" class="clearfix noPrint"><div id="breadcrumbs" class="left"><a title="Results Summary" href="Default.aspx? _act=VitalSearchR ...... <<INFORMATION I NEED>> ... to view the entire record.</p></div><script xmlns:msxsl="urn:schemas-Microsoft-com:xslt"> jQuery(document).ready(function() { jQuery(".ancestry-information-tooltip").actooltip({ href: "#AncestryInformationTooltip", orientation: "bottomleft"}); });

=========== So my question is =============== How do I get the information generated by JS?

Victory · Accepted Answer

You need to get the document via JavaScript. You can use Selenium's execute_script function:

    from time import sleep  # this should go at the top of the file

    sleep(5)  # give the page's JavaScript time to render
    html = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
    print(html)

This gets everything inside the <html> tag.
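
The same idea can be wrapped in a small helper. This is only a sketch: `get_rendered_html` is a hypothetical name, `driver` is assumed to be a Selenium WebDriver that has already loaded the page, and `document.documentElement.outerHTML` is swapped in for `innerHTML` so that the `<html>` tag itself is part of the result.

```python
def get_rendered_html(driver):
    """Return the current DOM serialized as HTML, including the <html> tag.

    `driver` is assumed to be a Selenium WebDriver (any object exposing
    execute_script works). The script runs inside the page, so the result
    reflects whatever JavaScript has rendered up to this moment.
    """
    return driver.execute_script("return document.documentElement.outerHTML")
```

It would be called after `driver.get(url)` and after whatever wait the page needs, for example `html = get_rendered_html(driver)`.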

Darius M. · Answer

There is no need for that workaround; you can use this instead:

    from selenium import webdriver

    driver = webdriver.PhantomJS()
    driver.get('http://www.google.com/')
    html = driver.find_element_by_tag_name('html').get_attribute('innerHTML')
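
On Selenium releases where `find_element_by_tag_name` is not available, the generic `find_element(by, value)` call can do the same job. The sketch below assumes `driver` is an already-navigated WebDriver; `inner_html` is a hypothetical helper name, and the string `"tag name"` is the value behind Selenium's `By.TAG_NAME` constant.

```python
def inner_html(driver):
    """Return the innerHTML of the <html> element via the generic locator API.

    "tag name" is the locator strategy string that By.TAG_NAME stands for,
    so this is equivalent to driver.find_element(By.TAG_NAME, "html").
    """
    element = driver.find_element("tag name", "html")
    return element.get_attribute("innerHTML")
```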

Robbie Wareham · Answer

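I think you are getting the source code before the JavaScript has rendered the dynamic HTML.

As a first step, try putting a few seconds of sleep between navigating to the page and getting the page source.

If that works, you can then switch to a different wait strategy.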
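
One possible wait strategy is to poll the page source until some expected piece of markup appears, rather than sleeping for a fixed time. The following is a minimal Python 3 sketch, not a definitive implementation: `wait_for_page_source`, `marker`, `timeout`, and `poll_interval` are hypothetical names, and `driver` is assumed to be any object with a `page_source` attribute, such as a Selenium WebDriver.

```python
import time


def wait_for_page_source(driver, marker, timeout=10.0, poll_interval=0.5):
    """Poll driver.page_source until `marker` appears, then return the source.

    Raises TimeoutError if the marker is still missing once `timeout`
    seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        source = driver.page_source
        if marker in source:
            return source
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "%r not found in page source after %s seconds" % (marker, timeout)
            )
        time.sleep(poll_interval)
```

For the page in the question, a call such as `html = wait_for_page_source(driver, 'searchColRight')` would return as soon as the results column shows up in the source. Selenium also ships its own polling helper, `WebDriverWait` in `selenium.webdriver.support.ui`, which follows the same pattern.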

Harry1992 · Answer

You could try Dryscrape. This browser fully supports heavy JS code. Give it a try; I hope it works for you.

Vida · Answer

I ran into the same problem and finally solved it with desired_capabilities.

    from selenium import webdriver
    from selenium.webdriver.common.proxy import Proxy
    from selenium.webdriver.common.proxy import ProxyType

    # Route PhantomJS traffic through a manually configured HTTP proxy
    proxy = Proxy(
        {
            'proxyType': ProxyType.MANUAL,
            'httpProxy': 'ip_or_host:port'
        }
    )
    desired_capabilities = webdriver.DesiredCapabilities.PHANTOMJS.copy()
    proxy.add_to_capabilities(desired_capabilities)
    driver = webdriver.PhantomJS(desired_capabilities=desired_capabilities)
    driver.get('test_url')
    print(driver.page_source)

kuo chang · Answer

I had the same problem getting JavaScript-generated source from the Internet, and I solved it using Victory's suggestion above.

*First, execute_script

    driver = webdriver.Chrome()
    driver.get(urls)
    innerHTML = driver.execute_script("return document.body.innerHTML")
    #print(driver.page_source)

*Second, parse the HTML with BeautifulSoup (you can install BeautifulSoup with pip)

    import bs4  # BeautifulSoup
    import re
    from time import sleep

    sleep(1)  # wait one second
    root = bs4.BeautifulSoup(innerHTML, "lxml")  # parse the HTML using BeautifulSoup
    # find the values you need
    viewcount = root.find_all("span", attrs={'class': 'short-view-count style-scope yt-view-count-renderer'})

*Third, print the values you need

    for span in viewcount:
        print(span.string)

*Complete code

    from time import sleep
    import re

    import bs4
    import lxml
    from selenium import webdriver

    urls = "http://www.archives.com/member/Default.aspx?_act=VitalSearchResult&lastName=Smith&state=UT&country=US&deathYear=2004&deathYearSpan=10&location=UT&activityID=9b79d578-b2a7-4665-9021-b104999cf031&RecordType=2"
    driver = webdriver.PhantomJS()
    ##driver = webdriver.Chrome()
    driver.get(urls)
    innerHTML = driver.execute_script("return document.body.innerHTML")
    ##print(driver.page_source)

    sleep(1)
    root = bs4.BeautifulSoup(innerHTML, "lxml")
    viewcount = root.find_all("span", attrs={'class': 'short-view-count style-scope yt-view-count-renderer'})

    for span in viewcount:
        print(span.string)
    driver.quit()
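
If installing BeautifulSoup and lxml is not an option, the parsing step above can be approximated with Python 3's standard-library `html.parser`. This is a sketch of a swapped-in technique, not the approach used in this answer: `SpanTextCollector` and `span_texts` are hypothetical names, and the code only collects the text of `<span>` elements that carry a given class.

```python
from html.parser import HTMLParser


class SpanTextCollector(HTMLParser):
    """Collect the text of every <span> whose class list contains `class_name`."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.depth = 0    # greater than 0 while inside a matching <span>
        self.texts = []   # one entry per matching <span>

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or self.class_name in classes:
            if not self.depth:
                self.texts.append("")  # start a new matching span
            self.depth += 1            # also tracks spans nested inside it

    def handle_endtag(self, tag):
        if tag == "span" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.texts[-1] += data


def span_texts(html, class_name):
    """Return the stripped text of each <span> that has `class_name`."""
    parser = SpanTextCollector(class_name)
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.texts]
```

With the `innerHTML` string from the first step, `span_texts(innerHTML, 'short-view-count')` would play the role of the `find_all` call followed by the print loop.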