Python: nltk.clean_html is not implemented

I tried the following:

myNews=urlopen(url).read()    
myNews=nltk.clean_html(myNews)

It fails with this error:

File "/usr/local/lib/python2.7/dist-packages/nltk-3.0.0-py2.7.egg/nltk/util.py", line 346, in clean_html
    raise NotImplementedError ("To remove HTML markup, use BeautifulSoup's get_text() function")
NotImplementedError: To remove HTML markup, use BeautifulSoup's get_text() function

Looking at util.py, I can see that the function body is just a stub:

def clean_html(html):
    raise NotImplementedError ("To remove HTML markup, use BeautifulSoup's get_text() function")

Is it not implemented?

15

clean_html() and clean_url() were cute little functions in NLTK. They were dropped because BeautifulSoup does a better job of parsing markup languages (see https://github.com/nltk/nltk/commit/39a303e5ddc4cdb1a0b00a3be426239b1c24c8bb).

The BeautifulSoup documentation is here: http://www.crummy.com/software/BeautifulSoup/bs4/doc/
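
As a minimal sketch of the get_text() approach recommended above, the helper below stands in for the removed nltk.clean_html. The name strip_markup and the sample HTML are made up for illustration, and the choice of the built-in "html.parser" backend is an assumption; any parser that Beautiful Soup supports should work.

```python
from bs4 import BeautifulSoup


def strip_markup(html):
    """Return the text content of an HTML string (hypothetical helper)."""
    # "html.parser" ships with Python, so no extra parser package is needed.
    soup = BeautifulSoup(html, "html.parser")
    # separator=" " keeps text from adjacent tags apart; strip=True trims
    # the whitespace around each piece of text.
    return soup.get_text(separator=" ", strip=True)


# Illustrative input, not taken from the question.
sample = "<html><body><h1>Fed news</h1><p>Rates may rise.</p></body></html>"
print(strip_markup(sample))
```

Passing the parser name explicitly also avoids the warning that Beautiful Soup prints when it has to guess which parser to use.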

17
alvas

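As the other answers note, nltk removed this function and recommends: "To remove HTML markup, use BeautifulSoup's get_text() function". Beautiful Soup works well when you want to extract text from specific elements, but if you need the text of the whole page, the old nltk function does a better job. Here is a comparison of the two approaches: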

import mechanize
import nltk
from bs4 import BeautifulSoup
from html2text import html2text 
import re


def clean_html(html):
    """
    Copied from NLTK package.
    Remove HTML markup from the given string.

    :param html: the HTML string to be cleaned
    :type html: str
    :rtype: str
    """

    # First we remove inline JavaScript/CSS:
    cleaned = re.sub(r"(?is)<(script|style).*?>.*?(</\1>)", "", html.strip())
    # Then we remove html comments. This has to be done before removing regular
    # tags since comments can contain '>' characters.
    cleaned = re.sub(r"(?s)<!--(.*?)-->[\n]?", "", cleaned)
    # Next we can remove the remaining tags:
    cleaned = re.sub(r"(?s)<.*?>", " ", cleaned)
    # Finally, we deal with whitespace
    cleaned = re.sub(r"&nbsp;", " ", cleaned)
    cleaned = re.sub(r"  ", " ", cleaned)
    cleaned = re.sub(r"  ", " ", cleaned)
    return cleaned.strip()

url = "http://www.nytimes.com/2015/08/31/business/challenged-on-left-and-right-the-fed-faces-a-decision-on-rates.html"
br = mechanize.Browser()
br.set_handle_robots(False)
br.addheaders = [('User-agent', 'Firefox')]
html = br.open(url).read().decode('utf-8')
cleanhtml = clean_html(html)
text = html2text(cleanhtml)
soup = BeautifulSoup(html)
text2 = soup.get_text()

The nltk function gives a nice, clean result (I had to put that output on Pastebin because including it pushed this post past the 30,000-character limit). And here is what Beautiful Soup returns:

u'\n  \n\n\n\n\nChallenged on Left and Right, the Fed Faces a Decision on Rates - The New York Times\nwindow.NREUM||(NREUM={}),__nr_require=function(n,e,t){function r(t){if(!e[t]){var o=e[t]={exports:{}};n[t][0].call(o.exports,function(e){var o=n[t][1][e];return r(o?o:e)},o,o.exports)}return e[t].exports}if("function"==typeof __nr_require)return __nr_require;for(var o=0;o<t.length;o++)r(t[o]);return r}({QJf3ax:[function(n,e){function t(n){function e(e,t,a){n&&n(e,t,a),a||(a={});for(var u=c(e),f=u.length,s=i(a,o,r),p=0;f>p;p++)u[p].apply(s,t);return s}function a(n,e){f[n]=c(n).concat(e)}function c(n){return f[n]||[]}function u(){return t(e)}var f={};return{on:a,emit:e,create:u,listeners:c,_events:f}}function r(){return{}}var o="nr@context",i=n("gos");e.exports=t()},{gos:"7eSDFh"}],ee:[function(n,e){e.exports=n("QJf3ax")},{}],3:[function(n,e){function t(n){return function(){r(n,[(new Date).getTime()].concat(i(arguments)))}}var r=n("handle"),o=n(1),i=n(2);"undefined"==typeof window.newrelic&&(newrelic=window.NREUM);var a=["setPageViewName","addPageAction","setCustomAttribute","finished","addToTrace","inlineHit","noticeError"];o(a,function(n,e){window.NREUM[e]=t("api-"+e)}),e.exports=window.NREUM},{1:12,2:13,handle:"D5DuLP"}],gos:[function(n,e){e.exports=n("7eSDFh")},{}],"7eSDFh":[function(n,e){function t(n,e,t){if(r.call(n,e))return n[e];var o=t();if(Object.defineProperty&&Object.keys)try{return Object.defineProperty(n,e,{value:o,writable:!0,enumerable:!1}),o}catch(i){}return n[e]=o,o}var r=Object.prototype.hasOwnProperty;e.exports=t},{}],D5DuLP:[function(n,e){function t(n,e,t){return r.listeners(n).length?r.emit(n,e,t):(o[n]||(o[n]=[]),void o[n].Push(e))}var r=n("ee").create(),o={};e.exports=t,t.ee=r,r.q=o},{ee:"QJf3ax"}],handle:[function(n,e){e.exports=n("D5DuLP")},{}],XL7HBI:[function(n,e){function t(n){var e=typeof n;return!n||"object"!==e&&"function"!==e?-1:n===window?0:i(n,o,function(){return r++})}var 
r=1,o="nr@id",i=n("gos");e.exports=t},{gos:"7eSDFh"}],id:[function(n,e){e.exports=n("XL7HBI")},{}],loader:[function(n,e){e.exports=n("G9z0Bl")},{}],G9z0Bl:[function(n,e){function t(){var n=h.info=NREUM.info;if(n&&n.licenseKey&&n.applicationID&&f&&f.body){c(l,function(e,t){e in n||(n[e]=t)}),h.proto="https"===d.split(":")[0]||n.sslForHttp?"https://":"http://",a("mark",["onload",i()]);var e=f.createElement("script");e.src=h.proto+n.agent,f.body.appendChild(e)}}function r(){"complete"===f.readyState&&o()}function o(){a("mark",["domContent",i()])}function i(){return(new Date).getTime()}var a=n("handle"),c=n(1),u=(n(2),window),f=u.document,s="addEventListener",p="attachEvent",d=(""+location).split("?")[0],l={beacon:"bam.nr-data.net",errorBeacon:"bam.nr-data.net",agent:"js-agent.newrelic.com/nr-593.min.js"},h=e.exports={offset:i(),Origin:d,features:{}};f[s]?(f[s]("DOMContentLoaded",o,!1),u[s]("load",t,!1)):(f[p]("onreadystatechange",r),u[p]("onload",t)),a("mark",["firstbyte",i()])},{1:12,2:3,handle:"D5DuLP"}],12:[function(n,e){function t(n,e){var t=[],o="",i=0;for(o in n)r.call(n,o)&&(t[i]=e(o,n[o]),i+=1);return t}var r=Object.prototype.hasOwnProperty;e.exports=t},{}],13:[function(n,e){function t(n,e,t){e||(e=0),"undefined"==typeof t&&(t=n?n.length:0);for(var r=-1,o=t-e||0,i=Array(0>o?0:o);++r<o;)i[r]=n[e+r];return i}e.exports=t},{}]},{},["G9z0Bl"]);\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n{"pageconfig":{"ledeMediaSize":"large","keywords":["article-medium","has-embedded-interactive"]}}\n\n            []    \n\nvar googletag=googletag||{};googletag.cmd=googletag.cmd||[],function(){var t=document.createElement("script");t.async=!0,t.type="text/javascript";t.src="http://www.googletagservices.com/tag/js/gpt.js";var o=document.getElementsByTagName("script")[0];o.parentNode.insertBefore(t,o)}();\n\n\n[\n    {\n        
"testId": "0012",\n        "testName": "tallWatchingModule",\n        "throttle": 1.0,\n        "allocation": 0.9,\n        "variants": 1,\n        "applications": ["homepage"]\n    },\n    {\n        "testId": "0033",\n        "testName": "recommendedLabelTest",\n        "throttle": 1,\n        "allocation": 0.833,\n        "variants": 5,\n        "applications": ["article"]\n    },\n    {\n        "testId": "0036",\n        "testName": "velcroSocialFollow",\n        "throttle": 0.1,\n        "allocation": 0.5,\n        "variants": 1,\n        "applications": ["article", "homepage"]\n    },\n    {\n        "testId": "0050",\n        "testName": "styledMostEmailed",\n        "throttle": 1,\n        "allocation": 0.667,\n        "variants": 2,\n        "applications": ["article"]\n    },\n    {\n        "testId": "0051",\n        "testName": "shuffleRecommendations",\n        "throttle": 1.0,\n        "allocation": 0.667,\n        "variants": 1,\n        "applications": ["article"]\n    },\n    {\n        "testId": "0052",\n        "testName": "paidPostDriver",\n        "throttle": 1.0,\n        "allocation": 0.875,\n        "variants": 7,\n        "applications": ["article"]\n    },\n    {\n        "testId": "0061",\n        "testName": "paidPostFivePack",\n        "throttle": 0,\n        "allocation": 0,\n        "variants": 1,\n        "applications": ["homepage"]\n    }\n]\n\n\n\n{ "meta": {},\n  "data": {\n    "id": "0",\n    "name": "",\n    "subscription": ["","_RPV"],\n    "demographics": {}\n  }\n}\n\n\nvar require = {\n    baseUrl: \'http://a1.nyt.com/assets/\',\n    waitSeconds: 20,\n    paths: {\n        \'foundation\': \'article/20150828-192044/js/foundation\',\n        \'shared\': \'article/20150828-192044/js/shared\',\n        \'article\': \'article/20150828-192044/js/article\',\n        \'application\': \'article/20150828-192044/js/article/article\',\n        \'videoFactory\': \'http://static01.nyt.com/js2/build/video/2.0/videofactoryrequire\',\n     
   \'videoPlaylist\': \'http://static01.nyt.com/js2/build/video/players/extended/2.0/appRequire\',\n        \'auth/mtr\': \'http://static01.nyt.com/js/mtr\',\n        \'auth/growl\': \'http://static01.nyt.com/js/auth/growl/default\',\n        \'vhs\': \'http://static01.nyt.com/video/vhs/build/vhs-2.x.min\'\n    },\n    map: {\n        \'*\': {\n            \'article/main\': \'article/article/main\'\n        }\n    }\n};\n\n\n\n\n\n\nwindow.magnum.processFlags(["limitFabrikSave","moreFollowSuggestions","dfpAds","dfpWhitelist","criticsPickAdditionalInfo","restaurantAttributes","theaterAttributes","movieAttributes","followFeature","restaurantReviewAdditionalDetails","theaterReviewAdditionalDetails","restaurantReviewHideInfoBox","theaterReviewHideInfoBox","restaurantReviewShowRestaurantName","restaurantReviewShowGoogleMap","restaurantReviewShowNotes","restaurantReviewShowLastUpdated","styledMostEmailed","videoVHSCover","restaurantReviewShowMenuLink","allTheEmphases","androidDeepLinks","autoPlayVideos","restaurantOpenStatus","standaloneSlideshowPromo","showNewTMagLogo"]);\n\n\nrequire([\'foundation/main\'], function () {\n    require([\'auth/mtr\', \'auth/growl\']);\n});\n\n\n\n\n    .lt-ie10 .messenger.suggestions {\n        display: block !important;\n        height: 50px;\n    }\n\n    .lt-ie10 .messenger.suggestions .message-bed {\n        background-color: #f8e9d2;\n        border-bottom: 1px solid #ccc;\n    }\n\n    .lt-ie10 .messenger.suggestions .message-container {\n        padding: 11px 18px 11px 30px;\n    }\n\n    .lt-ie10 .messenger.suggestions .action-link {\n        font-family: "nyt-franklin", arial, helvetica, sans-serif;\n        font-size: 10px;\n        font-weight: bold;\n        color: #a81817;\n        text-transform: uppercase;\n    }\n\n    .lt-ie10 .messenger.suggestions .alert-icon {\n        background: url(\'http://i1.nyt.com/images/icons/icon-alert-12x12-a81817.png\') no-repeat;\n        width: 12px;\n        height: 12px;\n        
display: inline-block;\n        margin-top: -2px;\n        float: none;\n    }\n\n    .lt-ie10 .masthead,\n    .lt-ie10 .navigation,\n    .lt-ie10 .comments-panel {\n        margin-top: 50px !important;\n    }\n\n    .lt-ie10 .ribbon {\n        margin-top: 97px !important;\n    }\n\n\n\n\n\n\nNYTimes.com no longer supports Internet Explorer 9 or earlier. Please upgrade your browser.\nLEARN MORE \xbb\n\n\n\n\n\n\n\n\n\nSections\n\nHome\n\nSearch\nSkip to content\nSkip to navigation\nView mobile version\n\n\n\n\nThe New York Times\n\n\nwindow.magnum.writeLogo(\'small\', \'http://a1.nyt.com/assets/article/20150828-192044/images/foundation/logos/\', \'business\', \'masthead-theme-standard\', \'standard\', \'branding-heading-link\');\n\n\nEconomy|Challenged on Left and Right, the Fed Faces a Decision on Rates\n\n\n\nAdvertisement\n\n\n\n\n\n\n\nSearch\n\n\nLog In\n0\nSettings\n\n\n\n\nClose search\n\nsearch sponsored by\n\n\n\n\n\n\nSearch NYTimes.com\n\n\n\nClear this text input\n\n\n\nGo\n\n\n\n\n\n\nhttp://nyti.ms/1VpLa1D\n\n\n\n\nLoading...\n\n\n\n\nSee next articles\n\n\n\n\n\nSee previous articles\n\n\n\n\n\n\n\n\n \n\n\n\n\n\n\n\nAdvertisement\n\n\n\n\n\n\nEconomy \nChallenged on Left and Right, the Fed Faces a Decision on Rates\n\nBy BINYAMIN APPELBAUMAUG. 30, 2015\n\n\nInside\n\n\n\nSupported by\n\n\n\n\n\n\n\n\nPhoto\n\n\n\n\n\n\nJanet L. Yellen, the Federal Reserve chairwoman.\n\nCredit\n            Stephen Crowley/The New York Times        \n\n\n\n\nAdvertisement\n\nContinue reading the main story\n\n\n\n\n\nContinue reading the main story\nShare This Page\n\nContinue reading the main story\n\n\nContinue reading the main story\n\n\n\nJACKSON HOLE, Wyo. 
\u2014  Conservative activists who want the Federal Reserve to raise interest rates distributed chocolate coins in golden wrappers at the local airport last week as Fed officials arrived for their annual policy retreat.Liberal activists in green \u201cWhose Recovery?\u201d T-shirts formed a receiving line at the resort hotel in the heart of Grand Teton National Park where the meeting was held, to personalize their argument that the Fed should wait.Sometime soon \u2014 possibly as early as mid-September and probably no later than the end of the year \u2014 the Fed plans to raise its benchmark interest rate one-quarter of one percentage point, a mathematically minor move that has become a very big deal.Investors, who always pay attention to the Fed, are paying particular attention now. The central bank has held short-term rates near zero since December 2008; the impending end of that era is one cause of recent financial market turmoil. \n\nContinue reading the main story\n\n\n                            Related Coverage\n                    \n\n\n\n\n\n\n\n\n\n\nOptimistic About Inflation, Stanley Fischer Suggests That Fed Will Stick to Plan on RatesAUG. 29, 2015\n\n\n\n\n\n\n\nBut the Fed\u2019s plans have also become the latest point of contention in a broader debate about the government\u2019s management of the American economy, pitting liberals who see a need for more aggressive measures to bolster growth against conservatives concerned that Washington and the Fed are already doing much too much. \nContinue reading the main story\n\n\n\n                When Will the Fed Raise Rates?            \n\n                More than seven years ago the Federal Reserve put its benchmark interest rate close to zero, as a way to bolster the economy. But that policy is about to change.            
\n\n\n\n\n\n\n\n\n\n\n\n\u201cThere shouldn\u2019t be this intense interest in a quarter-point increase, and there shouldn\u2019t be this intense interest in whether it comes in September or December,\u201d said Alan S. Blinder, a Princeton economist and the Fed\u2019s vice chairman in the mid-1990s. \u201cBut the Fed remains the center of the financial universe. People stare at it like they stare at the North Star.\u201dAnd so, as Fed officials conferred with other central bankers and academics, the liberal activists held two days of \u201cFed Up\u201d teach-ins in a room directly below the main conference, while the conservatives convened a \u201cJackson Hole Summit\u201d at a nearby dude ranch.In the decades before the financial crisis, policy makers generally agreed that central banks should focus on moderating inflation. Now, both that goal and the best way to achieve it are subjects of debate. Liberals argue that the Fed should aim more broadly to lower unemployment and encourage rising living standards. Conservatives want to strengthen the focus on inflation by requiring officials to follow rules in making policy.\nAdvertisement\n\nContinue reading the main story\nWith the critics lining up outside, central bankers found no escape inside the main conference, where a series of academics warned policy makers that their view of inflation was oversimplified, and that their policies were less effective as a consequence.\u201cThe conference was more about what we don\u2019t know, about a candid willingness to analyze what we don\u2019t know,\u201d said Lucrezia Reichlin, a professor at London Business School and former director general of research at the European Central Bank. \u201cIt did not really inspire confidence\u201d in monetary policy.The formal program, on \u201cInflation Dynamics and Monetary Policy,\u201d was devoted to the vexing reality that inflation in recent years has not behaved as economists predicted. 
The basic paradigm, known as the Phillips Curve, is that inflation falls as unemployment rises, and rises as unemployment falls. But inflation did not fall as much as expected during the Great Recession, and it has remained surprisingly weak during the recovery.\nAdvertisement\n\nContinue reading the main story\nOver the course of two days, the invited academics argued that the real story was more complicated. One study, for example, presented evidence that prices fall more slowly during recessions because cash-short firms actually tend to increase prices in the face of declining demand for their products.\u201cOnce you integrate all these dynamics, it may turn out that life is not that simple,\u201d said Eric M. Leeper, an economist at Indiana University and co-author of a paper arguing that central banks need better economic models.Central bankers, however, have shown little interest in paradigm shifts. Several said that the basic understanding of inflation, while obviously imperfect, remains more functional than any alternatives.\u201cI don\u2019t think the folks at the Fed are of a mind to redesign monetary policy just because of what happened during the crisis,\u201d said Jon Faust, a professor of economics at Johns Hopkins University and a former adviser to the Fed\u2019s chairwoman, Janet L. Yellen, and her predecessor, Ben S. Bernanke.Indeed, V\xedtor Const\xe2ncio, vice president of the European Central Bank, said the euro area was currently experiencing \u201ca renaissance of the Phillips Curve.\u201dStanley Fischer, vice chairman of the Federal Reserve, painted a somewhat more complicated picture of inflation, arguing that the role of labor market slack is easily overstated, and that exchange rates play an important role.\nContinue reading the main story\nVideo\n\nThe Fed\u2019s Button on the Economy\n\nWhen it comes to raising or lowering interest rates, what the Fed is really trying to do is balance growth and inflation. 
But they have a limited set of tools to accomplish their goal.\n\n                    By Andrew Ross Sorkin, Aaron Byrd and Erica Berenstein on                                                                Publish Date July 29, 2015.\n                                    \n\n                                            Photo by Aaron Byrd/The New York Times.\n                                    \nWatch in Times Video \xbb\n\n\n\nBut his bottom line, too, was that the Fed understands inflation well enough to predict its movements. While domestic inflation has been surprisingly sluggish for years now, Mr. Fischer said on Friday that his confidence in an eventual rebound remained \u201cpretty high.\u201dThe organizers of the fringe conferences acknowledged the odds against their more radical proposals.\u201cFed Up\u201d is mostly funded by the foundation of a Facebook co-founder, Dustin Moskovitz, which said: \u201cOur best guess is that the campaign is unlikely to have an impact on the Fed\u2019s monetary policy, but that if it does, the benefits would be very large.\u201dJim DeMint, president of the Heritage Foundation, spoke at the conservative conference of \u201ca long and difficult battle that we can and must win.\u201dThe Center for Public Democracy, which organized the \u201cFed Up\u201d campaign, wants the Fed to keep rates near zero even as overall unemployment falls, to spur wage gains and help members of minorities, in particular, find jobs. It brought about 50 people to Jackson Hole as part of an effort to engage community groups that generally focus on civil rights or local issues like minimum wage laws.Dawn O\u2019Neal, 48, makes $8.50 an hour as a day care worker in suburban Atlanta; her husband has not found regular construction work in a year. When Ms. 
O\u2019Neal needs a refill on her asthma medication, she cuts back on food, buying hot dogs instead of beef and canned vegetables instead of fresh vegetables.\u201cI don\u2019t feel like anyone at the Fed has ever had to make a decision about whether to eat or get medication, and so when I hear that they\u2019re going to raise interest rates in September, it angers me and it scares me,\u201d Ms. O\u2019Neal said.\nAdvertisement\n\nContinue reading the main story\n\n\nAdvertisement\n\nContinue reading the main story\nThe protesters struck a chord with some officials at the main meeting. Jason Furman, President Obama\u2019s chief economic adviser, went downstairs and delivered an impromptu speech. \u201cWe don\u2019t comment on monetary policy, but what I can say is that monetary policy matters,\u201d he told the activists. The prosperity of the late 1990s, he added, resulted in part from \u201ca set of decisions made by the Federal Reserve that allowed that to happen.\u201dOther officials, however, said the Push for low rates was misguided.\u201cThe biggest risk for those that are less fortunate is that we would go back into recession,\u201d said James Bullard, president of the Federal Reserve Bank of St. Louis, who said he leaned toward raising rates in September. \u201cI\u2019m hoping my policy would lengthen out the expansion longer.\u201dThe conservative conference was aligned with efforts by congressional Republicans to impose new restrictions on the Fed\u2019s conduct of monetary policy. A leading proposal would require the Fed to choose a formula for setting rates and stick with it.This view has few fans among the central bankers, who see their own judgment as an essential part of policy making.Mr. Blinder said part of the disconnect between the officials and the activists may reflect that broader concerns motivate liberals and conservatives. 
Conservatives see the Fed as enabling the growth of the federal debt, while liberals see the Fed as contributing to the rise of inequality.Mr. Blinder said the central bank had little power to reverse either trend. \u201cThey overstate the importance and power of the Federal Reserve,\u201d he said. All it can do, he added, is \u201caddress these problems around the edges.\u201d\n\n\nA version of this article appears in print on August 31, 2015, on page A1 of the New York edition with the headline: Left and Right Work to Shift Fed\u2019s Direction.  Order Reprints| Today\'s Paper|Subscribe\n\n\n\n\n\n\n\n\n\n\n\nLoading...\n\n\n\n\n\n\n\n\n\nGo to Home Page \xbb\n\nSite Index\n\nThe New York Times\n\n\nwindow.magnum.writeLogo(\'small\', \'http://a1.nyt.com/assets/article/20150828-192044/images/foundation/logos/\', \'\', \'\', \'standard\', \'site-index-branding-link\');\n\n\n\n\nNews\n\n\nWorld\n\n\nU.S.\n\n\nPolitics\n\n\nN.Y.\n\n\nBusiness\n\n\nTech\n\n\nScience\n\n\nHealth\n\n\nSports\n\n\nEducation\n\n\nObituaries\n\n\nToday\'s Paper\n\n\nCorrections\n\n\n\n\nOpinion\n\n\nToday\'s Opinion\n\n\nOp-Ed Columnists\n\n\nEditorials\n\n\nContributing Writers\n\n\nOp-Ed Contributors\n\n\nOpinionator\n\n\nLetters\n\n\nSunday Review\n\n\nTaking Note\n\n\nRoom for Debate\n\n\nPublic Editor\n\n\nVideo: Opinion\n\n\n\n\nArts\n\n\nToday\'s Arts\n\n\nArt & Design\n\n\nArtsBeat\n\n\nBooks\n\n\nDance\n\n\nMovies\n\n\nMusic\n\n\nN.Y.C. Events Guide\n\n\nTelevision\n\n\nTheater\n\n\nVideo Games\n\n\nVideo: Arts\n\n\n\n\nLiving\n\n\nAutomobiles\n\n\nCrossword\n\n\nFood\n\n\nEducation\n\n\nFashion & Style\n\n\nHealth\n\n\nJobs\n\n\nMagazine\n\n\nN.Y.C. Events Guide\n\n\nReal Estate\n\n\nT Magazine\n\n\nTravel\n\n\nWeddings & Celebrations\n\n\n\n\nListings & More\n\n\nClassifieds\n\n\nTools & Services\n\n\nTimes Topics\n\n\nPublic Editor\n\n\nN.Y.C. 
Events Guide\n\n\nTV Listings\n\n\nBlogs\n\n\nCartoons\n\n\nMultimedia\n\n\nPhotography\n\n\nVideo\n\n\nNYT Store\n\n\nTimes Journeys\n\n\nSubscribe\n\n\nManage My Account\n\n\n\n\nSubscribe\n\nSubscribe\n\n\nTimes Premier\n\n\n\nHome Delivery\n\n\n\nDigital Subscriptions\n\n\n\nNYT Opinion\n\n\n\nCrossword\n\n\n\n\nEmail Newsletters\n\n\nAlerts\n\n\nGift Subscriptions\n\n\nCorporate Subscriptions\n\n\nEducation Rate\n\n\n\n\nMobile Applications\n\n\nReplica Edition\n\n\nInternational New York Times\n\n\n\n\n\n\n\n\n\n\n\n                    \xa9 2015 The New York Times Company\n\n\nHome\nSearch\nContact Us\nWork With Us\nAdvertise\nYour Ad Choices\nPrivacy\nTerms of Service\nTerms of Sale\n\n\n\n\nSite Map\nHelp\nSite Feedback\nSubscriptions\n\n\n\n\n\n\nrequire([\'foundation/main\'], function () {\n    require([\'article/main\']);\n    require([\'jquery/nyt\', \'foundation/views/page-manager\'], function ($, pageManager) {\n        if (window.location.search.indexOf(\'disable_tagx\') > 0) {\n            return;\n        }\n        $(document).ready(function () {\n            require([\'http://static01.nyt.com/bi/js/tagx/tagx.js\'], function () {\n                pageManager.trackingFireEventQueue();\n            });\n        });\n    });\n});\n\n\n\n\n\n\n\n\n\n\nwindow.NREUM||(NREUM={});NREUM.info={"beacon":"bam.nr-data.net","licenseKey":"b5bcf2eba4","applicationID":"4491457","transactionName":"YwFXZhRYVhAEVUZcX1pLYEAPFlkTFRhCXUA=","queueTime":0,"applicationTime":305,"ttGuid":"","agentToken":"","userAttributes":"","errorBeacon":"bam.nr-data.net","agent":"js-agent.newrelic.com\\/nr-593.min.js"}\n\n'

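As you can see if you scroll through it, the Beautiful Soup version contains a lot of text that is never visible on the page, such as inline JavaScript and CSS. It is not very clean.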
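
One way to get closer to the regex result while staying with Beautiful Soup is to delete the script and style elements before calling get_text(). The sketch below assumes that approach; visible_text and the sample page are hypothetical names used only for illustration.

```python
from bs4 import BeautifulSoup


def visible_text(html):
    """Extract text after dropping <script> and <style> (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all accepts a list of tag names; decompose() removes each
    # matching tag and everything inside it from the tree.
    for tag in soup.find_all(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)


# Illustrative page with inline CSS and JavaScript in the head.
sample = (
    "<html><head><style>p { color: red; }</style>"
    "<script>var tracker = 1;</script></head>"
    "<body><p>Visible paragraph.</p></body></html>"
)
print(visible_text(sample))
```

Newer Beautiful Soup releases (the documentation mentions 4.9.0 and later) appear to leave script and style contents out of get_text() already when html.parser or lxml is used, so the explicit removal mainly matters on older versions like the one used for the output above.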

11
Michael

If your code was:

raw = nltk.clean_html(html)
tokens = nltk.word_tokenize(raw)

you can use:

raw = BeautifulSoup(html).get_text()
tokens = nltk.word_tokenize(raw)

instead. See the other answers for the reasons.
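
For a quick end-to-end check of this replacement, here is a small sketch on a made-up HTML string. Note one swap: it uses nltk's regex-based wordpunct_tokenize instead of word_tokenize, on the assumption that the tokenizer model data that word_tokenize relies on (normally fetched with nltk.download) may not be installed. The explicit "html.parser" argument is also an addition.

```python
from bs4 import BeautifulSoup
from nltk.tokenize import wordpunct_tokenize

# Illustrative input, not taken from the question.
html = "<html><body><p>The Fed raised rates.</p></body></html>"

# Strip the markup with Beautiful Soup instead of nltk.clean_html.
raw = BeautifulSoup(html, "html.parser").get_text()

# wordpunct_tokenize splits on word characters and punctuation with a
# regular expression, so it needs no downloaded data.
tokens = wordpunct_tokenize(raw)
print(tokens)
```

With the tokenizer data installed, nltk.word_tokenize(raw) can be used on the same string as in the answer above.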

3
Statham