How can I get all the plain text from a website with Scrapy?

Question

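I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with the Scrapy framework. With xpath('//body//text()') I'm able to get it, but with the HTML tags, and I only want the text. Is there a solution for this?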

alecxe · Accepted Answer

The easiest option is to extract //body//text() and join everything found:

    ''.join(sel.xpath("//body//text()").extract()).strip()

where sel is a Selector instance.

Another option is to use nltk's clean_html() (available in NLTK 2.x; NLTK 3 removed it, and calling it raises NotImplementedError):

    >>> import nltk
    >>> html = """
    ... <div class="post-text" itemprop="description">
    ...
    ... <p>I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with Scrapy framework.
    ... With <code>xpath('//body//text()')</code> I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this? Thanks !</p>
    ...
    ... </div>"""
    >>> nltk.clean_html(html)
    "I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with Scrapy framework.\nWith xpath('//body//text()') I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this? Thanks !"

Another option is to use BeautifulSoup's get_text():

get_text()

If you only want the text part of a document or tag, you can use the get_text() method. It returns all the text in a document or beneath a tag, as a single Unicode string.

    >>> from bs4 import BeautifulSoup
    >>> soup = BeautifulSoup(html, "html.parser")
    >>> print(soup.get_text().strip())
    I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with Scrapy framework.
    With xpath('//body//text()') I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this? Thanks !
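Because get_text() also returns the contents of script and style elements, a page scraped from a real site may need those removed first. The following is a minimal sketch under that assumption; the helper name text_without_scripts and the sample markup are made up for illustration.

```python
from bs4 import BeautifulSoup


def text_without_scripts(html):
    """Return the text of an HTML fragment, ignoring <script> and <style>."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove the elements whose contents are never shown to the reader.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # separator=" " keeps text from adjacent tags apart; strip=True trims each piece.
    return soup.get_text(separator=" ", strip=True)


# Hypothetical sample fragment.
fragment = "<div><p>Hello <b>world</b></p><script>var x = 1;</script></div>"
print(text_without_scripts(fragment))
```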

Another option is to use lxml.html's text_content():

.text_content()

Returns the text content of the element, including the text content of its children, with no markup.

    >>> import lxml.html
    >>> tree = lxml.html.fromstring(html)
    >>> print(tree.text_content().strip())
    I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with Scrapy framework.
    With xpath('//body//text()') I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this? Thanks !
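If none of these libraries is available, the same idea can be approximated with the standard library alone. This is only a sketch, not something the answer proposes: the class TextExtractor and the function html_to_text are hypothetical names, and the parser here is far less forgiving of broken markup than lxml or BeautifulSoup.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text data, ignoring anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the text of an HTML string with runs of whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


print(html_to_text("<div><p>Hello <code>x</code> world</p><style>p{}</style></div>"))
```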

Pedro Lobito · Answer

Have you tried this?

    xpath('//body//text()').re(r'(\w+)')

or

    xpath('//body//text()').extract()

Shubhankar Mohan · Answer

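xpath('//body//text()') does not always descend deeper into the nodes under the last tag you used (body, in your case). If you run xpath('//body/node()/text()').extract(), you will see the text of the nodes that sit directly inside your HTML body. You can also try xpath('//body/descendant::text()').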
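The three expressions can be compared side by side on a small document. The sketch below uses lxml with a made-up sample; in XPath, // is shorthand for /descendant-or-self::node()/, so on this input the first and third expressions select the same text nodes, while node()/text() only reaches the text of body's direct children.

```python
import lxml.html

# Hypothetical sample: text directly in body, one level down, and two levels down.
doc = lxml.html.fromstring(
    "<html><body>top<div>child<span>deep</span></div></body></html>"
)

all_text = doc.xpath("//body//text()")
child_text = doc.xpath("//body/node()/text()")
descendant_text = doc.xpath("//body/descendant::text()")

print(all_text)
print(child_text)
print(descendant_text)
```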