How to scrape data from multiple pages of a website using Python and BeautifulSoup4

Question

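I'm trying to scrape the PGA.com website to build a table of all the golf courses in the United States. The CSV table should include each golf course's name, address, ownership, website, and phone number. I want to use this data to geocode the courses and place them on a map, and to keep a local copy on my computer.

I used Python and Beautiful Soup 4 to extract the data. I've gotten as far as extracting the data and importing it into a CSV, but now I'm having trouble scraping data from multiple pages of the PGA website. I want to extract all the golf courses, but my script is limited to a single page. I want to loop it so that it captures all the golf course data from every page on the PGA site. There are about 18,000 golf courses spread over 900 pages to capture.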

I need help writing code that captures all the data from multiple pages of the PGA website rather than just one, so that I end up with data for every golf course in the United States.

Here is my script:

    import csv
    import requests
    from bs4 import BeautifulSoup

    url = "http://www.pga.com/golf-courses/search?searchbox=Course+Name&searchbox_Zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0"
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")

    g_data1 = soup.find_all("div", {"class": "views-field-nothing-1"})
    g_data2 = soup.find_all("div", {"class": "views-field-nothing"})

    courses_list = []

    for item in g_data2:
        # Each field falls back to an empty string when it is missing
        try:
            name = item.contents[1].find_all("div", {"class": "views-field-title"})[0].text
        except (IndexError, AttributeError):
            name = ''
        try:
            address1 = item.contents[1].find_all("div", {"class": "views-field-address"})[0].text
        except (IndexError, AttributeError):
            address1 = ''
        try:
            address2 = item.contents[1].find_all("div", {"class": "views-field-city-state-Zip"})[0].text
        except (IndexError, AttributeError):
            address2 = ''
        try:
            website = item.contents[1].find_all("div", {"class": "views-field-website"})[0].text
        except (IndexError, AttributeError):
            website = ''
        try:
            Phonenumber = item.contents[1].find_all("div", {"class": "views-field-work-phone"})[0].text
        except (IndexError, AttributeError):
            Phonenumber = ''

        course = [name, address1, address2, website, Phonenumber]
        courses_list.append(course)

    # Text mode with newline='' is what the csv module expects on Python 3
    with open('filename5.csv', 'w', newline='') as file:
        writer = csv.writer(file)
        for row in courses_list:
            writer.writerow(row)

    #for item in g_data1:
        #try:
            #print(item.contents[1].find_all("div", {"class": "views-field-counter"})[0].text)
        #except:
            #pass
        #try:
            #print(item.contents[1].find_all("div", {"class": "views-field-course-type"})[0].text)
        #except:
            #pass

    #for item in g_data2:
        #try:
            #print(item.contents[1].find_all("div", {"class": "views-field-title"})[0].text)
        #except:
            #pass
        #try:
            #print(item.contents[1].find_all("div", {"class": "views-field-address"})[0].text)
        #except:
            #pass
        #try:
            #print(item.contents[1].find_all("div", {"class": "views-field-city-state-Zip"})[0].text)
        #except:
            #pass

This script only captures 20 courses at a time. I want a single script that captures everything, covering all 18,000 golf courses across the 900 pages.

liamdiprose · Accepted Answer

The PGA website's search results span multiple pages, and the URL follows this pattern:

http://www.pga.com/golf-courses/search?page=1 # Additional info after page parameter here

This means you can read the content of one page, then increase the value of page by 1 and read the next page.

    import csv
    import requests
    from bs4 import BeautifulSoup

    for i in range(907):  # Number of pages plus one
        url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course+Name&searchbox_Zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(i)
        r = requests.get(url)
        soup = BeautifulSoup(r.content)
        # Your code for each individual page here
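As a small sketch of the URL pattern described above, the query string can also be built from a dictionary instead of a long format string. The names `BASE_URL`, `SEARCH_PARAMS`, and `build_page_url` are made up for this illustration; the parameter values are copied from the URL in the question.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.pga.com/golf-courses/search"

# Query parameters copied from the URL in the question.
SEARCH_PARAMS = {
    "searchbox": "Course Name",
    "searchbox_Zip": "ZIP",
    "distance": 50,
    "price_range": 0,
    "course_type": "both",
    "has_events": 0,
}


def build_page_url(page):
    """Return the search URL for the given value of the page parameter."""
    params = {"page": page, **SEARCH_PARAMS}
    # urlencode encodes the space in "Course Name" as "+"
    return BASE_URL + "?" + urlencode(params)


print(build_page_url(1))
```

Keeping the parameters in one place makes it easier to change a search option later without editing the URL string by hand.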

Mr.Bones · Answer

If you are still reading this post, you can also try this code:

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    file = "Details.csv"
    f = open(file, "w")
    Headers = "Name,Address,City,Phone,Website\n"
    f.write(Headers)

    for page in range(1, 5):
        url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course%20Name&searchbox_Zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(page)
        html = urlopen(url)
        soup = BeautifulSoup(html, "html.parser")
        Title = soup.find_all("div", {"class": "views-field-nothing"})
        for i in Title:
            try:
                name = i.find("div", {"class": "views-field-title"}).get_text()
                address = i.find("div", {"class": "views-field-address"}).get_text()
                city = i.find("div", {"class": "views-field-city-state-Zip"}).get_text()
                phone = i.find("div", {"class": "views-field-work-phone"}).get_text()
                website = i.find("div", {"class": "views-field-website"}).get_text()
                print(name, address, city, phone, website)
                # Commas inside fields are replaced so they do not break the CSV columns
                f.write("{}".format(name).replace(",", "|") + ",{}".format(address) + ",{}".format(city).replace(",", " ") + ",{}".format(phone) + ",{}".format(website) + "\n")
            except AttributeError:
                # Skip entries that are missing one of the fields
                pass

    f.close()

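Where it says range(1, 5), just change it to run from 0 to the last page and you will get all the details in the CSV. I tried very hard to get the data into a proper format, but it is difficult :).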
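One possible way to ease the formatting trouble mentioned above is to let the standard library's `csv.writer` do the quoting, so commas inside a field do not need to be replaced by hand. This is only a sketch: the helper name `write_courses` and the sample row are invented for illustration and are not data from the PGA site.

```python
import csv
import io


def write_courses(rows, out_file):
    """Write a header plus one CSV row per course to an open text file."""
    writer = csv.writer(out_file)
    writer.writerow(["Name", "Address", "City", "Phone", "Website"])
    for row in rows:
        # Strip stray whitespace; csv.writer quotes fields that contain commas
        writer.writerow([field.strip() for field in row])


# Made-up sample row, only to show how embedded commas are handled
sample = [
    ("Example Golf Club, North", " 1 Fairway Dr ", "Springfield, IL 62701",
     "(555) 010-0000", "example.com"),
]
buffer = io.StringIO()
write_courses(sample, buffer)
print(buffer.getvalue())
```

To write to disk instead of an in-memory buffer, the same function can be given a file opened with `open("Details.csv", "w", newline="")`.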

Leb · Answer

You are linking to a single page; it is not going to iterate through each page on its own.

Page 1:

url = "http://www.pga.com/golf-courses/search?searchbox=Course+Name&searchbox_Zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0"

Page 2:

http://www.pga.com/golf-courses/search?page=1&searchbox=Course%20Name&searchbox_Zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0

Page 907:

http://www.pga.com/golf-courses/search?page=906&searchbox=Course%20Name&searchbox_Zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0

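Since you are only running page 1, you only get 20 results. You need to create a loop that runs through each page.

Start by writing a function that handles one page, then iterate over that function.

Starting with the second page, page=1 appears right after search? in the URL, and it keeps increasing until page=906 for page 907.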
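The "one function per page, then iterate" idea could be sketched roughly as follows. The function names `parse_page` and `scrape_pages` and the `fetch_html` callable are hypothetical; the class names come from the script in the question, and the sketch assumes the page markup still uses them.

```python
from bs4 import BeautifulSoup

# Class names taken from the script in the question.
FIELD_CLASSES = [
    "views-field-title",
    "views-field-address",
    "views-field-city-state-Zip",
    "views-field-website",
    "views-field-work-phone",
]


def parse_page(html):
    """Return one list of field values per course found in a page's HTML."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.find_all("div", {"class": "views-field-nothing"}):
        row = []
        for css_class in FIELD_CLASSES:
            tag = item.find("div", {"class": css_class})
            # A missing field becomes an empty string instead of raising
            row.append(tag.get_text(strip=True) if tag else "")
        rows.append(row)
    return rows


def scrape_pages(fetch_html, page_numbers):
    """Call fetch_html(page) for each page number and collect all rows."""
    courses = []
    for page in page_numbers:
        courses.extend(parse_page(fetch_html(page)))
    return courses
```

Here `fetch_html` is any callable that returns the HTML for a page number, for example a small wrapper around `requests.get(...).text` with the page value substituted into the URL. Passing it in keeps the parsing logic separate from the network code, which also makes the parser easy to try on saved HTML.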

Mark M · Answer

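I had exactly the same problem, and the solutions above did not work for me. I solved it by accounting for cookies. A requests session helps: create a session, and it will send the cookie along with every numbered page request, so all the pages you need get pulled.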

    import csv
    import requests
    from bs4 import BeautifulSoup

    url = "http://www.pga.com/golf-courses/search?searchbox=Course+Name&searchbox_Zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0"
    s = requests.Session()
    r = s.get(url)
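The snippet above stops after the first request. A minimal sketch of how the same session might be reused for the numbered pages is shown below; the helper name `fetch_pages` and its `extra_params` argument are assumptions made for this illustration, not part of the original answer.

```python
import requests

SEARCH_URL = "http://www.pga.com/golf-courses/search"


def fetch_pages(session, page_numbers, extra_params=None):
    """Fetch each page through one shared session and return the HTML texts."""
    pages = []
    for page in page_numbers:
        params = {"page": page, **(extra_params or {})}
        # Cookies stored on the session are sent with every request
        response = session.get(SEARCH_URL, params=params, timeout=30)
        response.raise_for_status()  # stop early on HTTP errors
        pages.append(response.text)
    return pages


# Usage (makes real network requests, so it is left commented out):
# with requests.Session() as session:
#     session.get(SEARCH_URL, timeout=30)  # first request stores any cookies
#     html_pages = fetch_pages(session, range(1, 5))
```

Whether the site actually requires cookies is the answerer's observation; the sketch only shows how one session object can be shared across the page loop.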

Kurtis Pykes · Answer

I noticed that the first solution repeats the first set of results, because page 0 and page 1 are the same page. This can be solved by specifying the start page in the range function. Example below...

    for i in range(1, 907):  # Number of pages plus one
        url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course+Name&searchbox_Zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(i)
        r = requests.get(url)
        soup = BeautifulSoup(r.content, "html5lib")  # Can use whichever parser you prefer
        # Your code for each individual page here