Nutch, Solr, and Ubuntu Server 12.04 LTS

Question

I am using Ubuntu Server 12.04 LTS and would like to know which versions of Nutch and Solr are compatible.

Is there a solution?

Noosrep · Answer

Nutch 1.5 and Solr 3.6.0 are compatible.

HowTo:

1) Install the JDK

sudo apt-get install openjdk-7-jdk
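To confirm the JDK actually landed on the PATH before going further, a quick check along these lines may help. The helper name have_cmd is hypothetical and not part of any package; this is only a sketch.

```shell
#!/bin/sh
# have_cmd is a hypothetical helper: succeeds if the named command is on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Report whether the Java launcher and compiler are available.
for tool in java javac; do
    if have_cmd "$tool"; then
        echo "$tool: found"
    else
        echo "$tool: missing"
    fi
done
```

If javac is reported missing, only a JRE is installed and the openjdk-7-jdk package is probably not in place.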

2) Download and extract Solr

mkdir -p ~/tmp/solr
cd ~/tmp/solr
wget http://mirror.lividpenguin.com/pub/apache/lucene/solr/3.6.0/apache-solr-3.6.0.tgz
tar -xzvf apache-solr-3.6.0.tgz

Solr ships with Jetty by default. To try it, run the following from the example directory (shut it down with Ctrl-C):

cd apache-solr-3.6.0/example
java -jar start.jar

Check http://localhost:8983/solr
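Jetty takes a few seconds to come up, so checking the URL immediately can fail. A minimal sketch of a polling check, assuming curl is installed; wait_for_url is a hypothetical helper name:

```shell
#!/bin/sh
# wait_for_url is a hypothetical helper: polls a URL once per second until it
# answers without an HTTP error, or until the attempt budget runs out.
wait_for_url() {
    url=$1
    attempts=${2:-10}
    while [ "$attempts" -gt 0 ]; do
        # -s: silent, -f: treat HTTP errors (>= 400) as failure,
        # -o /dev/null: discard the body, --max-time: do not hang forever
        if curl -sf --max-time 5 -o /dev/null "$url"; then
            return 0
        fi
        attempts=$((attempts - 1))
        [ "$attempts" -gt 0 ] && sleep 1
    done
    return 1
}
```

For example: wait_for_url http://localhost:8983/solr 30 && echo "Solr is up"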

3) Download and extract Nutch

mkdir -p ~/tmp/nutch
cd ~/tmp/nutch
wget http://mirror.rmg.io/apache/nutch/1.5/apache-nutch-1.5-bin.tar.gz
tar -xzvf apache-nutch-1.5-bin.tar.gz

4) Configure Nutch

From the extracted Nutch directory:

chmod +x bin/nutch
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386
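The JAVA_HOME path above is for a 32-bit install; on a 64-bit system the directory typically ends in -amd64 instead. Rather than hard-coding it, the value can be derived from the resolved java binary. This is a sketch; java_home_from_bin is a hypothetical helper, and readlink -f assumes GNU coreutils (the default on Ubuntu).

```shell
#!/bin/sh
# java_home_from_bin is a hypothetical helper: given the fully resolved path
# of the java binary, strip the trailing /bin/java and, if present, /jre.
java_home_from_bin() {
    path=${1%/bin/java}
    path=${path%/jre}
    printf '%s\n' "$path"
}
```

Usage: export JAVA_HOME=$(java_home_from_bin "$(readlink -f "$(command -v java)")")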

Add the following to conf/nutch-site.xml, inside the <configuration> element:

<property>
  <name>http.agent.name</name>
  <value>My Nutch Spider</value>
</property>

Save the file and exit the editor.
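For a scripted setup, the same file can be generated instead of edited by hand. This is a sketch only: write_nutch_site is a hypothetical helper, it overwrites the target file, and it assumes the file needs nothing beyond http.agent.name. The agent string is inserted as-is, so it should not contain XML special characters.

```shell
#!/bin/sh
# write_nutch_site is a hypothetical helper: writes a minimal nutch-site.xml
# containing only the http.agent.name property. It OVERWRITES the target file.
write_nutch_site() {
    file=$1
    agent=$2
    cat > "$file" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>$agent</value>
  </property>
</configuration>
EOF
}
```

Usage: write_nutch_site conf/nutch-site.xml "My Nutch Spider"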

mkdir -p urls
cd urls
touch seed.txt
nano seed.txt

Add the URLs to crawl, for example:

http://nutch.apache.org/
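If you would rather not open an editor, the seed list can be written non-interactively. A minimal sketch; write_seeds is a hypothetical helper that replaces any existing seed.txt:

```shell
#!/bin/sh
# write_seeds is a hypothetical helper: creates the seed directory and writes
# one URL per line into seed.txt, replacing any previous contents.
write_seeds() {
    dir=$1
    shift
    mkdir -p "$dir"
    printf '%s\n' "$@" > "$dir/seed.txt"
}
```

Usage, from the Nutch directory: write_seeds urls http://nutch.apache.org/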

In conf/regex-urlfilter.txt, replace

# accept anything else
+.

with a regular expression that matches the domain you want to crawl. For example, to limit the crawl to the nutch.apache.org domain, the line should read:

+^http://([a-z0-9]*\.)*nutch.apache.org/
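Before starting a crawl it can be useful to see which URLs such a pattern lets through. The sketch below tests URLs with grep -E. Note the assumptions: Nutch evaluates these filters as Java regular expressions, so grep is only an approximation that agrees for a pattern this simple; the leading + from the filter file is dropped because it is Nutch's "accept" marker, not part of the regex; and the dots in the host name are escaped here so they match only a literal dot. url_allowed is a hypothetical helper name.

```shell
#!/bin/sh
# The filter pattern without Nutch's leading "+" accept marker.
PATTERN='^http://([a-z0-9]*\.)*nutch\.apache\.org/'

# url_allowed is a hypothetical helper: succeeds if the URL matches PATTERN.
url_allowed() {
    printf '%s\n' "$1" | grep -Eq "$PATTERN"
}
```

For example, url_allowed http://nutch.apache.org/ succeeds, while url_allowed http://example.com/ fails. An https:// URL also fails, since the pattern only accepts http://.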

5) Configure Solr

In ~/tmp/solr/apache-solr-3.6.0/example/solr/conf, add the following to schema.xml:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

<field name="digest" type="text" stored="true" indexed="true"/>
<field name="boost" type="text" stored="true" indexed="true"/>
<field name="segment" type="text" stored="true" indexed="true"/>
<field name="host" type="text" stored="true" indexed="true"/>
<field name="site" type="text" stored="true" indexed="true"/>
<field name="content" type="text" stored="true" indexed="true"/>
<field name="tstamp" type="text" stored="true" indexed="false"/>
<field name="url" type="string" stored="true" indexed="true"/>
<field name="anchor" type="text" stored="true" indexed="false" multiValued="true"/>

Also in schema.xml, change <uniqueKey>id</uniqueKey> to <uniqueKey>url</uniqueKey>.

In solrconfig.xml, add:

<requestHandler name="/nutch" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf">content^0.5 anchor^1.0 title^1.2</str>
    <str name="pf">content^0.5 anchor^1.5 title^1.2 site^1.5</str>
    <str name="fl">url</str>
    <int name="ps">100</int>
    <bool name="hl">true</bool>
    <str name="q.alt">*:*</str>
    <str name="hl.fl">title url content</str>
    <str name="f.title.hl.fragsize">0</str>
    <str name="f.title.hl.alternateField">title</str>
    <str name="f.url.hl.fragsize">0</str>
    <str name="f.url.hl.alternateField">url</str>
    <str name="f.content.hl.fragmenter">regex</str>
  </lst>
</requestHandler>
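With this many fields it is easy to miss one when pasting. A rough sanity check might look like the sketch below. It assumes the field names are the lowercase ones Nutch emits (digest, boost, segment, host, site, content, tstamp, url, anchor) and that each declaration starts with <field name="...">. missing_fields is a hypothetical helper, and a plain grep is no substitute for letting Solr parse the schema on startup.

```shell
#!/bin/sh
# missing_fields is a hypothetical helper: prints each expected field name
# that has no <field name="..."> declaration in the given schema file.
missing_fields() {
    schema=$1
    for f in digest boost segment host site content tstamp url anchor; do
        grep -q "<field name=\"$f\"" "$schema" || echo "$f"
    done
}
```

Usage: missing_fields ~/tmp/solr/apache-solr-3.6.0/example/solr/conf/schema.xml
No output means every listed field was found.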

6) Run the Nutch crawl and index into Solr (make sure Solr is running). Run this from the Nutch directory:

bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5

Check the indexed files at http://localhost:8983/solr
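To check from the command line rather than the browser, one option is to ask Solr's standard select handler for a match-all query with zero rows, so only the hit count comes back. A sketch, assuming curl is installed; solr_select_url is a hypothetical helper name:

```shell
#!/bin/sh
# solr_select_url is a hypothetical helper: builds a select URL that matches
# every document (*:*) but returns zero rows, so only the count is reported.
solr_select_url() {
    base=${1%/}    # drop one trailing slash, if any
    printf '%s/select?q=*:*&rows=0\n' "$base"
}
```

Usage: curl -s "$(solr_select_url http://localhost:8983/solr/)"
The numFound attribute in the response should be greater than zero after a successful crawl.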
