How to store keywords in ES with the bulk API from Python

Question

I need to store some messages in ElasticSearch from my Python program. This is how I currently store a message:

    d = {"message": "this is message"}
    for index_nr in range(1, 5):
        ElasticSearchAPI.addToIndex(index_nr, d)
        print(d)

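That means that if I have 10 messages, I have to repeat my code 10 times. So what I want to do is create a script file or batch file. I checked the ElasticSearch Guide, and the Bulk API can be used for this. The format should look something like this: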

    { "index" : { "_index" : "test", "_type" : "type1", "_id" : "1" } }
    { "field1" : "value1" }
    { "delete" : { "_index" : "test", "_type" : "type1", "_id" : "2" } }
    { "create" : { "_index" : "test", "_type" : "type1", "_id" : "3" } }
    { "field1" : "value3" }
    { "update" : {"_id" : "1", "_type" : "type1", "_index" : "index1"} }
    { "doc" : {"field2" : "value2"} }

What I did is:

    {"index":{"_index":"test1","_type":"message","_id":"1"}}
    {"message":"it is red"}
    {"index":{"_index":"test2","_type":"message","_id":"2"}}
    {"message":"it is green"}

I also use the curl tool to store the documents:

$ curl -s -XPOST localhost:9200/_bulk --data-binary @message.json
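
A file such as message.json does not have to be written by hand. The following is a minimal sketch, assuming the messages are already held in a Python list, of how the action/source line pairs shown above could be generated. The `to_bulk_ndjson` helper, the sequential ids, and the single target index are hypothetical choices made for illustration and are not part of any library.

```python
import json


def to_bulk_ndjson(messages, index, doc_type="message"):
    """Serialize messages into the newline-delimited bulk format.

    Hypothetical helper: each message becomes an "index" action line
    followed by its source line; ids are assigned from 1 upwards.
    """
    lines = []
    for doc_id, text in enumerate(messages, start=1):
        action = {"index": {"_index": index, "_type": doc_type, "_id": str(doc_id)}}
        lines.append(json.dumps(action))
        lines.append(json.dumps({"message": text}))
    # The bulk body must end with a newline character.
    return "\n".join(lines) + "\n"


payload = to_bulk_ndjson(["it is red", "it is green"], index="test1")
```

Writing `payload` to message.json should give a file that the curl command above can send.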

Now I want to use my Python code to store the file in Elastic Search.

Justina Chen · Answer

    from datetime import datetime

    from elasticsearch import Elasticsearch
    from elasticsearch import helpers

    es = Elasticsearch()

    # One action per document: the target index, type and id, plus the document body.
    actions = [
        {
            "_index": "tickets-index",
            "_type": "tickets",
            "_id": j,
            "_source": {
                "any": "data" + str(j),
                "timestamp": datetime.now()}
        }
        for j in range(0, 10)
    ]

    helpers.bulk(es, actions)

Diolor · Answer

@justinachen's code helped me get started with py-elasticsearch, but after looking at the source code, let me suggest a simple improvement:

    from datetime import datetime

    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch()
    j = 0
    actions = []
    while j <= 10:
        action = {
            "_index": "tickets-index",
            "_type": "tickets",
            "_id": j,
            "_source": {
                "any": "data" + str(j),
                "timestamp": datetime.now()
            }
        }
        actions.append(action)
        j += 1

    # A single call sends all collected actions.
    helpers.bulk(es, actions)

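helpers.bulk() already does the segmentation for you. By segmentation I mean the chunks that are sent to the server on each request. If you want to reduce the number of documents sent per chunk, do: helpers.bulk(es, actions, chunk_size=100)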
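
To make the idea of segmentation concrete, here is a small pure-Python sketch of splitting a stream of actions into chunks. It only illustrates the concept and is not the library's own implementation; the `chunked` function is a hypothetical name introduced for this example.

```python
from itertools import islice


def chunked(actions, chunk_size):
    """Yield lists of at most chunk_size actions from any iterable."""
    iterator = iter(actions)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


# 250 placeholder actions split into chunks of 100.
actions = ({"_index": "tickets-index", "_id": j} for j in range(250))
chunk_lengths = [len(chunk) for chunk in chunked(actions, chunk_size=100)]
```

With 250 actions and a chunk size of 100, this yields two full chunks and one chunk holding the remaining 50 actions.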

Some handy info to get started:

helpers.bulk() is just a wrapper around helpers.streaming_bulk, but the former accepts a list, which makes it handy.

helpers.streaming_bulk is itself based on Elasticsearch.bulk(), so you do not really need to worry about which one to choose.

So in most cases, helpers.bulk() should be all you need.

Ethan · Answer

(The other approaches mentioned in this thread build a Python list for the ES update, which is not a good solution today, especially when you need to add millions of records to ES.)

A better approach is to use Python generators, which let you process gigabytes of data without running out of memory or compromising much on speed.

Below is an example snippet from a practical use case: adding data from an nginx log file to ES for analysis.

    from elasticsearch import Elasticsearch, helpers

    def decode_nginx_log(_nginx_fd):
        for each_line in _nginx_fd:
            # Filter out the below from each log line
            remote_addr = ...
            timestamp = ...
            ...

            # Index for elasticsearch. Typically timestamp.
            idx = ...

            es_fields_keys = ('remote_addr', 'timestamp', 'url', 'status')
            es_fields_vals = (remote_addr, timestamp, url, status)

            # We return a dict holding values from each line
            es_nginx_d = dict(zip(es_fields_keys, es_fields_vals))

            # Return the row on each iteration
            yield idx, es_nginx_d   # <- Note the usage of 'yield'

    def es_add_bulk(nginx_file):
        # The nginx file can be gzip or just text. Open it appropriately.
        ...

        es = Elasticsearch(hosts=[{'host': 'localhost', 'port': 9200}])

        # NOTE the (...) round brackets. This is for a generator.
        k = ({
                "_index": "nginx",
                "_type": "logs",
                "_id": idx,
                "_source": es_nginx_d,
             } for idx, es_nginx_d in decode_nginx_log(_nginx_fd))
        helpers.bulk(es, k)

    # Now, just run it.
    es_add_bulk('./nginx.1.log.gz')

This skeleton demonstrates the use of generators. You can use it even on a bare machine if you need to, and you can extend it to tailor it to your needs quickly.
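
For readers who want something they can run without a cluster, the sketch below fills in the generator part only. It is an illustration under stated assumptions, not the answer author's code: the regular expression assumes the common nginx "combined" log layout, and `LOG_LINE`, `log_actions`, and the sample line are hypothetical names and data. The timestamp is used as the id, following the comment in the skeleton above.

```python
import re

# Assumed pattern for the start of an nginx "combined" log line.
LOG_LINE = re.compile(
    r'(?P<remote_addr>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"\S+ (?P<url>\S+)[^"]*" (?P<status>\d{3})'
)


def log_actions(lines, index="nginx"):
    """Lazily turn log lines into bulk actions, skipping unparsable lines."""
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue
        fields = match.groupdict()
        yield {
            "_index": index,
            "_type": "logs",
            # Lines that share a timestamp would overwrite each other under this id.
            "_id": fields["timestamp"],
            "_source": fields,
        }


sample = [
    '203.0.113.7 - - [10/Oct/2020:13:55:36 +0000] '
    '"GET /index.html HTTP/1.1" 200 612 "-" "curl/7.68.0"',
    "not a log line",
]
actions = log_actions(sample)
first = next(actions)
```

Because `log_actions` is a generator, nothing is parsed until the consumer asks for the next action, so a file object can be passed in place of `sample` without loading the whole file into memory.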

See the Python Elasticsearch reference here.

Rafal Enden · Answer

There are two options I can think of at the moment:

1. Define the index name and document type with each entity:

    es_client = Elasticsearch()

    body = []
    for entry in entries:
        # Action line naming the index, type and id, followed by the document itself.
        body.append({'index': {'_index': index, '_type': 'doc', '_id': entry['id']}})
        body.append(entry)

    response = es_client.bulk(body=body)

2. Provide the default index and document type with the method:

    es_client = Elasticsearch()

    body = []
    for entry in entries:
        # Only the id is given per entry; index and type come from the call below.
        body.append({'index': {'_id': entry['id']}})
        body.append(entry)

    response = es_client.bulk(index='my_index', doc_type='doc', body=body)

Works with:

ES version: 6.4.0

ES python lib: 6.3.1
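
The two options above differ only in the metadata placed on each action line. A minimal sketch of a helper covering both, assuming each entry is a dict with an 'id' key as in the snippets above (`build_bulk_body` and the sample entries are hypothetical, introduced only for illustration):

```python
def build_bulk_body(entries, index=None, doc_type="doc"):
    """Build the alternating action/document list used as a bulk body.

    With index given, every action line names its own index and type
    (option 1); without it, only the id is set (option 2).
    """
    body = []
    for entry in entries:
        meta = {"_id": entry["id"]}
        if index is not None:
            meta["_index"] = index
            meta["_type"] = doc_type
        body.append({"index": meta})
        body.append(entry)
    return body


entries = [{"id": 1, "message": "it is red"}, {"id": 2, "message": "it is green"}]
per_entity = build_bulk_body(entries, index="my_index")
defaults_only = build_bulk_body(entries)
```

Either list could then be passed as `body` to `es_client.bulk(...)` in the way the corresponding option shows.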