BeautifulSoup4 get_text still has JavaScript

Question

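I am trying to remove all the HTML/JavaScript using bs4, but the JavaScript is not removed. I still see it in the text. How can I get around this?

I tried nltk, which works fine, but clean_html and clean_url will be removed in a future release. Is there a way to get the same result using soup.get_text()?

I looked at these other pages:

BeautifulSoup get_text does not strip all tags and JavaScript

Currently I am using nltk's deprecated functions.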

EDIT

Here is an example:

    # Python 2 (urllib.urlopen moved to urllib.request.urlopen in Python 3)
    import urllib
    from bs4 import BeautifulSoup

    url = "http://www.cnn.com"
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html)
    print soup.get_text()

I still see the following for CNN:

$j(function() { "use strict"; if ( window.hasOwnProperty('safaripushLib') && window.safaripushLib.checkEnv() ) { var pushLib = window.safaripushLib, current = pushLib.currentPermissions(); if (current === "default") { pushLib.checkPermissions("helloClient", function() {}); } } }); /*globals MainLocalObj*/ $j(window).load(function () { 'use strict'; MainLocalObj.init(); });

How can I remove the JS?

The only other option I found is:

https://github.com/aaronsw/html2text

The problem with html2text is that it is really slow at times and creates noticeable lag.

Hugh Bothwell · Accepted Answer

Based partly on "Can I remove script tags with BeautifulSoup?":

    # Python 2 (urllib.urlopen moved to urllib.request.urlopen in Python 3)
    import urllib
    from bs4 import BeautifulSoup

    url = "http://www.cnn.com"
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html)

    # kill all script and style elements
    for script in soup(["script", "style"]):
        script.decompose()    # rip it out

    # get text
    text = soup.get_text()

    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each (the separator is two spaces)
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)

    print(text)
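The same steps can be tried without a network connection. The sketch below assumes Python 3 and the built-in "html.parser" backend. The function name `visible_text` and the `SAMPLE_HTML` string are made up for illustration and are not part of the answer above. As far as I know, Beautiful Soup 4.9.0 and later already leave script and style content out of get_text() when html.parser or lxml is used, but removing the elements explicitly should behave the same on older and newer versions.

```python
from bs4 import BeautifulSoup


def visible_text(html):
    """Return the text of `html` without <script>/<style> content, one chunk per line."""
    soup = BeautifulSoup(html, "html.parser")

    # Remove script and style elements together with everything inside them.
    for element in soup(["script", "style"]):
        element.decompose()

    text = soup.get_text()
    # Strip leading/trailing whitespace on every line.
    lines = (line.strip() for line in text.splitlines())
    # Split on two consecutive spaces so side-by-side headlines get their own line.
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # Drop empty chunks and join the rest with newlines.
    return "\n".join(chunk for chunk in chunks if chunk)


# Hypothetical sample input standing in for a downloaded page.
SAMPLE_HTML = """
<html>
  <head>
    <title>Example</title>
    <style>body { color: red; }</style>
    <script>var pushLib = window.safaripushLib;</script>
  </head>
  <body>
    <p>First headline</p>
    <p>Second headline</p>
  </body>
</html>
"""

cleaned = visible_text(SAMPLE_HTML)
print(cleaned)
```

Wrapping the steps in a function keeps the parsing and the whitespace cleanup in one place, so the fetch step (urllib, requests, a file on disk) can change without touching the cleanup.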

bumpkin · Answer

To prevent encoding errors at the end...

    # Python 2 (urllib.urlopen moved to urllib.request.urlopen in Python 3)
    import urllib
    from bs4 import BeautifulSoup

    url = "http://www.cnn.com"
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html)

    # kill all script and style elements
    for script in soup(["script", "style"]):
        script.extract()    # rip it out

    # get text
    text = soup.get_text()

    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each (the separator is two spaces)
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)

    # encode explicitly so printing non-ASCII text does not raise an encoding error
    print(text.encode('utf-8'))
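Under Python 3, print(text.encode('utf-8')) prints the repr of a bytes object (b'...') rather than the text itself. One possible way to get the same effect there is to write the encoded bytes to a binary stream. This is a minimal sketch, not part of the answer above: the helper name `write_utf8` and the sample string are hypothetical.

```python
import io


def write_utf8(text, stream):
    """Write `text` plus a newline to a binary stream as UTF-8 bytes."""
    stream.write(text.encode("utf-8") + b"\n")


# Demonstration with an in-memory binary stream and non-ASCII sample text.
buffer = io.BytesIO()
write_utf8("Zürich café", buffer)
encoded = buffer.getvalue()
print(len(encoded))

# For real output, the binary layer of stdout can be passed instead:
#     import sys
#     write_utf8(text, sys.stdout.buffer)
```

Writing bytes directly sidesteps whatever encoding the terminal or pipe was configured with, which is where such errors usually come from.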