Scrapy - parsing items that are paginated

Question

I have a URL of the form:

example.com/foo/bar/page_1.html

There are 53 pages in total, and each page has up to 20 rows.

I basically want to get all the rows from all the pages, i.e. about 53 * 20 items.

I have working code in my parse method that parses a single page and also goes one page deeper per item to get more information about that item:

    def parse(self, response):
        hxs = HtmlXPathSelector(response)

        restaurants = hxs.select('//*[@id="contenido-resbus"]/table/tr[position()>1]')

        for rest in restaurants:
            item = DegustaItem()
            item['name'] = rest.select('td[2]/a/b/text()').extract()[0]
            # some items don't have category associated with them
            try:
                item['category'] = rest.select('td[3]/a/text()').extract()[0]
            except:
                item['category'] = ''
            item['urbanization'] = rest.select('td[4]/a/text()').extract()[0]

            # get profile url
            rel_url = rest.select('td[2]/a/@href').extract()[0]
            # join with base url since profile url is relative
            base_url = get_base_url(response)
            follow = urljoin_rfc(base_url,rel_url)

            request = Request(follow, callback = parse_profile)
            request.meta['item'] = item
            return request

    def parse_profile(self, response):
        item = response.meta['item']
        # item['address'] = figure out xpath
        return item

The question is: how do I crawl each page?

example.com/foo/bar/page_1.html
example.com/foo/bar/page_2.html
example.com/foo/bar/page_3.html
...
...
...
example.com/foo/bar/page_53.html

Achim · Accepted Answer

You have two options to solve your problem. The general one is to use yield instead of return to generate new requests. That way you can issue more than one new request from a single callback. Check the second example at http://doc.scrapy.org/en/latest/topics/spiders.html#basespider-example.
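The difference between return and yield inside the row loop can be sketched in plain Python 3, without Scrapy installed. The Request namedtuple is only a stand-in for Scrapy's request class, and the profile links are hypothetical; the point is just that a return inside the loop stops after the first row, while yield produces one request per row.

```python
from collections import namedtuple
from urllib.parse import urljoin

# Stand-in for Scrapy's Request, used only so this sketch runs without Scrapy.
Request = namedtuple("Request", ["url", "callback"])


def parse_with_return(base_url, hrefs):
    """Returns inside the loop, so only the first row's request is produced."""
    for href in hrefs:
        return Request(urljoin(base_url, href), "parse_profile")


def parse_with_yield(base_url, hrefs):
    """Yields inside the loop, so every row produces its own request."""
    for href in hrefs:
        yield Request(urljoin(base_url, href), "parse_profile")


page = "http://example.com/foo/bar/page_1.html"
rows = ["profile_a.html", "profile_b.html", "profile_c.html"]  # hypothetical links

single = parse_with_return(page, rows)      # one Request, remaining rows are skipped
many = list(parse_with_yield(page, rows))   # one Request per row
```

In a real spider the same change would mean replacing the return at the end of the loop body with a yield.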

In your case there is probably a simpler solution: just generate the list of start URLs from a pattern, like this:

    class MySpider(BaseSpider):
        start_urls = ['http://example.com/foo/bar/page_%s.html' % page for page in xrange(1,54)]
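As a quick check of the pattern, here is a minimal Python 3 sketch of the same list (range replaces the Python 2 xrange). The helper name build_start_urls and its parameters are made up for illustration; the upper bound is last_page + 1 because range excludes its end value.

```python
def build_start_urls(first_page=1, last_page=53):
    """Build one listing URL per page number, inclusive on both ends."""
    template = "http://example.com/foo/bar/page_%d.html"
    # range() stops before its second argument, hence the + 1.
    return [template % page for page in range(first_page, last_page + 1)]


start_urls = build_start_urls()
```

The resulting list could be assigned to the spider's start_urls attribute, so that parse is called once for each of the 53 pages.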

bslima · Answer

You could use CrawlSpider instead of BaseSpider and use SgmlLinkExtractor to extract the pagination pages.

For instance:

    start_urls = ["www.example.com/page1"]
    rules = (
        Rule(SgmlLinkExtractor(restrict_xpaths=('//a[@class="next_page"]',)), follow=True),
        Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@class="foto_imovel"]',)), callback='parse_call')
    )

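The first rule tells scrapy to follow the links matched by the xpath expression. The second rule tells scrapy to call parse_call on the links matched by its xpath expression, in case you want to parse something on each page.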

For more information, see the documentation: http://doc.scrapy.org/en/latest/topics/spiders.html#crawlspider

Santosh Pillai · Answer

There can be two use cases for "Scrapy - parsing items that are paginated".

A) We just want to move across the table and fetch data. This is relatively straightforward.

    class TrainSpider(scrapy.Spider):
        name = "trip"
        start_urls = ['somewebsite']

        def parse(self, response):
            ''' do something with this parser '''
            next_page = response.xpath("//a[@class='next_page']/@href").extract_first()
            if next_page is not None:
                next_page = response.urljoin(next_page)
                yield scrapy.Request(next_page, callback=self.parse)

Observe the last 4 lines. Here:

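1. We get the link to the next page from the xpath of the "Next" pagination button.
2. The if condition checks that we have not reached the end of the pagination.
3. We join this link (the one from step 1) with the main URL using url join.
4. We make a recursive call to the parse callback method.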
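The four steps above can be sketched without Scrapy as a loop over a small in-memory "site". Everything here is an assumption made for illustration: the PAGES dictionary, its URLs, and the regular expression that stands in for the xpath lookup (a real spider would use response.xpath and response.urljoin as in the answer's code).

```python
import re
from urllib.parse import urljoin

# Hypothetical in-memory site: URL -> HTML body.
PAGES = {
    "http://example.com/list/page_1.html": '<a class="next_page" href="page_2.html">Next</a>',
    "http://example.com/list/page_2.html": '<a class="next_page" href="page_3.html">Next</a>',
    "http://example.com/list/page_3.html": "<p>last page, no next link</p>",
}

# Simplified stand-in for the xpath //a[@class='next_page']/@href.
NEXT_LINK = re.compile(r'<a class="next_page" href="([^"]+)"')


def find_next_page(html):
    """Step 1: return the relative href of the 'Next' button, or None."""
    match = NEXT_LINK.search(html)
    return match.group(1) if match else None


def crawl(start_url):
    """Visit pages in order until a page has no 'Next' link."""
    visited = []
    url = start_url
    while url is not None:
        visited.append(url)
        next_href = find_next_page(PAGES[url])
        if next_href is None:              # step 2: end of pagination
            url = None
        else:
            url = urljoin(url, next_href)  # step 3: make the link absolute
        # step 4: the next loop iteration plays the role of the recursive callback
    return visited


visited = crawl("http://example.com/list/page_1.html")
```

In Scrapy the loop is implicit: each yielded request is scheduled and parse is called again with the next response.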

B) We want not only to move across pages, but also to extract data from one or more links on each page.

    class StationDetailSpider(CrawlSpider):
        name = 'train'
        start_urls = [someOtherWebsite]
        rules = (
            Rule(LinkExtractor(restrict_xpaths="//a[@class='next_page']"), follow=True),
            Rule(LinkExtractor(allow=r"/trains/\d+$"), callback='parse_trains')
        )

        def parse_trains(self, response):
            '''do your parsing here'''

Here, observe that:

We are using CrawlSpider, a subclass of the scrapy.Spider parent class.
We have set "rules".

a) The first rule just checks whether a "next_page" link is available and follows it.

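b) The second rule requests all links on a page that have a format such as /trains/12343, and then calls parse_trains to do the parsing.

Important: note that we do not use the regular parse method here, because we are using the CrawlSpider subclass. That class also has a parse method, so we do not want to override it. Just remember to name your callback method something other than parse.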
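To see which links the pattern in the second rule would select, the regular expression can be tried on its own. This sketch assumes the pattern is applied as a regular-expression search against each link's URL; the sample links are invented for illustration.

```python
import re

# Same pattern as the allow argument in the second rule.
TRAIN_LINK = re.compile(r"/trains/\d+$")

# Hypothetical links found on a listing page.
links = [
    "http://example.com/trains/12343",
    "http://example.com/trains/12343/seats",
    "http://example.com/trains/",
    "http://example.com/stations/77",
    "http://example.com/trains/8",
]

# Keep only URLs that end in /trains/<digits>.
matching = [link for link in links if TRAIN_LINK.search(link)]
```

Because of the trailing $, a URL with anything after the digits (a further path segment, for example) is not selected.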