Filename search with ElasticSearch

Question

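I want to use ElasticSearch to search filenames (not the files' content). That means I need to find a part of the filename (exact matching, no fuzzy search).

Example:
I have files with the following names: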

    My_first_file_created_at_2012.01.13.doc
    My_second_file_created_at_2012.01.13.pdf
    Another file.txt
    And_again_another_file.docx
    foo.bar.txt

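Now I want to search for 2012.01.13 and get the first two files.
A search for file or ile should return all filenames except the last one.

How can I accomplish that with ElasticSearch?

This is what I have tested, but it always returns zero results: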

    curl -X DELETE localhost:9200/files
    curl -X PUT    localhost:9200/files -d '
    {
      "settings" : {
        "index" : {
          "analysis" : {
            "analyzer" : {
              "filename_analyzer" : {
                "type" : "custom",
                "tokenizer" : "lowercase",
                "filter" : ["filename_stop", "filename_ngram"]
              }
            },
            "filter" : {
              "filename_stop" : {
                "type" : "stop",
                "stopwords" : ["doc", "pdf", "docx"]
              },
              "filename_ngram" : {
                "type" : "nGram",
                "min_gram" : 3,
                "max_gram" : 255
              }
            }
          }
        }
      },
      "mappings": {
        "files": {
          "properties": {
            "filename": {
              "type": "string",
              "analyzer": "filename_analyzer"
            }
          }
        }
      }
    }
    '

    curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "My_first_file_created_at_2012.01.13.doc" }'
    curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "My_second_file_created_at_2012.01.13.pdf" }'
    curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "Another file.txt" }'
    curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "And_again_another_file.docx" }'
    curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "foo.bar.txt" }'
    curl -X POST "http://localhost:9200/files/_refresh"

    FILES='
    http://localhost:9200/files/_search?q=filename:2012.01.13
    '

    for file in ${FILES}
    do
      echo; echo; echo ">>> ${file}"
      curl "${file}&pretty=true"
    done

DrTech · Accepted Answer

There are several problems with what you pasted.

1) Incorrect mapping

When creating the index, you specify:

    "mappings": {
      "files": {

But your type is actually file, not files. If you check the mapping, you will see that immediately:

    curl -XGET 'http://127.0.0.1:9200/files/_mapping?pretty=1'

    # {
    #    "files" : {
    #       "files" : {
    #          "properties" : {
    #             "filename" : {
    #                "type" : "string",
    #                "analyzer" : "filename_analyzer"
    #             }
    #          }
    #       },
    #       "file" : {
    #          "properties" : {
    #             "filename" : {
    #                "type" : "string"
    #             }
    #          }
    #       }
    #    }
    # }

2) Incorrect analyzer definition

You specified the lowercase tokenizer, but it removes anything that isn't a letter (see the docs), so your numbers are dropped completely.

You can check this with the analyze API:

    curl -XGET 'http://127.0.0.1:9200/_analyze?pretty=1&text=My_file_2012.01.13.doc&tokenizer=lowercase'

    # {
    #    "tokens" : [
    #       {
    #          "end_offset" : 2,
    #          "position" : 1,
    #          "start_offset" : 0,
    #          "type" : "word",
    #          "token" : "my"
    #       },
    #       {
    #          "end_offset" : 7,
    #          "position" : 2,
    #          "start_offset" : 3,
    #          "type" : "word",
    #          "token" : "file"
    #       },
    #       {
    #          "end_offset" : 22,
    #          "position" : 3,
    #          "start_offset" : 19,
    #          "type" : "word",
    #          "token" : "doc"
    #       }
    #    ]
    # }
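The same effect can be reproduced without a cluster. This is only a rough Python approximation of a letters-only, lowercasing tokenizer: the name `lowercase_tokenize` is made up for illustration, and the regex merely approximates what Lucene treats as a letter.

```python
import re

# Hypothetical stand-in for the lowercase tokenizer: keep runs of letters,
# drop everything else (digits, dots, underscores), then lowercase.
# [^\W\d_] reads as "a word character that is neither a digit nor '_'".
LETTER_RUN = re.compile(r"[^\W\d_]+")


def lowercase_tokenize(text):
    """Return the lowercased letter-only tokens found in text."""
    return [match.group(0).lower() for match in LETTER_RUN.finditer(text)]


if __name__ == "__main__":
    # The date part of the filename produces no tokens at all.
    print(lowercase_tokenize("My_file_2012.01.13.doc"))
    print(lowercase_tokenize("2012.01.13"))
```

With this model, a query for 2012.01.13 has nothing left to match against, which is consistent with the zero results you are seeing.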

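3) Ngrams at search time

You include the ngram token filter in both the index analyzer and the search analyzer. That is fine for the index analyzer, because you want the ngrams to be indexed. But when you search, you want to search for the whole string, not for each ngram.

For example, if you index "abcd" with ngrams of length 1 to 4, you end up with these tokens: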

    a b c d ab bc cd abc bcd abcd

But if you search for "dcba" (which should not match) and your search terms are also analyzed with ngrams, then you are actually searching for:

    d c b a dc cb ba dcb cba dcba

So a, b, c and d will match!
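To make the overlap concrete, here is a small Python sketch that generates the ngrams for both strings and intersects them. The `ngrams` helper is written for this illustration and is not an Elasticsearch API.

```python
def ngrams(text, min_gram, max_gram):
    """Return every substring of text whose length is in [min_gram, max_gram]."""
    grams = []
    for size in range(min_gram, min(max_gram, len(text)) + 1):
        for start in range(len(text) - size + 1):
            grams.append(text[start:start + size])
    return grams


indexed = set(ngrams("abcd", 1, 4))   # what ends up in the index
searched = set(ngrams("dcba", 1, 4))  # what an ngram-analyzed query looks for

# Tokens shared by the indexed value and the query.
shared = sorted(indexed & searched)

if __name__ == "__main__":
    print(shared)
```

Only the single-character ngrams overlap, but one shared token is enough for a query that matches on any token.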

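Solution

First, you need to choose the right analyzer. Your users will probably search for words, numbers or dates, but they probably won't expect ile to match file. Instead, it will probably be more useful to use edge ngrams, which anchor the ngram to the start (or end) of each word.

Also, why exclude docx etc.? Surely a user might well want to search on the file type?

So let's break each filename into smaller tokens by removing anything that isn't a letter or a number (using the pattern tokenizer):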

My_first_file_2012.01.13.doc => my first file 2012 01 13 doc

Then, for the index analyzer, we also apply edge ngrams to each of those tokens:

    my    => m my
    first => f fi fir firs first
    file  => f fi fil file
    2012  => 2 20 201 2012
    01    => 0 01
    13    => 1 13
    doc   => d do doc
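The two steps (split on non-alphanumerics, then expand each token into its prefixes) can be sketched in a few lines of Python. The function names are invented for this illustration, and `[\W_]+` is only an approximation of a letters-and-digits split, which is good enough for these ASCII filenames.

```python
import re

# Split on anything that is not a letter or a digit. Python's re module has
# no \p{L}, so [\W_]+ is used as a stand-in (an assumption, see above).
SEPARATORS = re.compile(r"[\W_]+")


def search_tokens(filename):
    """Tokens as the search side would see them: split, then lowercase."""
    return [part.lower() for part in SEPARATORS.split(filename) if part]


def edge_ngrams(token, min_gram=1, max_gram=20):
    """Prefixes of token, anchored at the front."""
    longest = min(max_gram, len(token))
    return [token[:size] for size in range(min_gram, longest + 1)]


def index_tokens(filename):
    """Tokens as the index side would see them: edge ngrams of each token."""
    return [gram for token in search_tokens(filename) for gram in edge_ngrams(token)]


if __name__ == "__main__":
    name = "My_first_file_2012.01.13.doc"
    print(search_tokens(name))
    print(index_tokens(name))
```

Because only prefixes are indexed, a query such as fil can match file, while ile cannot.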

We create the index as follows:

    curl -XPUT 'http://127.0.0.1:9200/files/?pretty=1'  -d '
    {
       "settings" : {
          "analysis" : {
             "analyzer" : {
                "filename_search" : {
                   "tokenizer" : "filename",
                   "filter" : ["lowercase"]
                },
                "filename_index" : {
                   "tokenizer" : "filename",
                   "filter" : ["lowercase","edge_ngram"]
                }
             },
             "tokenizer" : {
                "filename" : {
                   "pattern" : "[^\\p{L}\\d]+",
                   "type" : "pattern"
                }
             },
             "filter" : {
                "edge_ngram" : {
                   "side" : "front",
                   "max_gram" : 20,
                   "min_gram" : 1,
                   "type" : "edgeNGram"
                }
             }
          }
       },
       "mappings" : {
          "file" : {
             "properties" : {
                "filename" : {
                   "type" : "string",
                   "search_analyzer" : "filename_search",
                   "index_analyzer" : "filename_index"
                }
             }
          }
       }
    }
    '
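The request above uses the syntax of the Elasticsearch release that was current when this was written. As far as I know, later major versions dropped mapping types, replaced the string field type with text, renamed index_analyzer to analyzer, and spell the filter type edge_ngram. Under those assumptions, the equivalent request body might look like the sketch below. It is built in Python so that it can at least be checked as valid JSON, it has not been run against a live cluster, and the filter name `filename_edge_ngram` is my own choice.

```python
import json

# Assumed modern equivalent of the index definition. Not verified against a
# running cluster; adjust it to the Elasticsearch version you actually use.
index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "filename_search": {
                    "tokenizer": "filename",
                    "filter": ["lowercase"],
                },
                "filename_index": {
                    "tokenizer": "filename",
                    "filter": ["lowercase", "filename_edge_ngram"],
                },
            },
            "tokenizer": {
                # Raw string: the regex itself contains single backslashes.
                "filename": {"type": "pattern", "pattern": r"[^\p{L}\d]+"},
            },
            "filter": {
                # Hypothetical filter name; "edge_ngram" is the filter type.
                "filename_edge_ngram": {
                    "type": "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 20,
                },
            },
        }
    },
    "mappings": {
        "properties": {
            "filename": {
                "type": "text",
                "analyzer": "filename_index",
                "search_analyzer": "filename_search",
            }
        }
    },
}

if __name__ == "__main__":
    # json.dumps doubles the backslashes in the pattern, as JSON requires.
    print(json.dumps(index_body, indent=2))
```

The printed JSON could then be sent as the body of the PUT request that creates the index.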

Next, test that the analyzers are working correctly.

filename_search:

    curl -XGET 'http://127.0.0.1:9200/files/_analyze?pretty=1&text=My_first_file_2012.01.13.doc&analyzer=filename_search'
    [results snipped]
    "token" : "my"
    "token" : "first"
    "token" : "file"
    "token" : "2012"
    "token" : "01"
    "token" : "13"
    "token" : "doc"

filename_index:

    curl -XGET 'http://127.0.0.1:9200/files/_analyze?pretty=1&text=My_first_file_2012.01.13.doc&analyzer=filename_index'
    "token" : "m"
    "token" : "my"
    "token" : "f"
    "token" : "fi"
    "token" : "fir"
    "token" : "firs"
    "token" : "first"
    "token" : "f"
    "token" : "fi"
    "token" : "fil"
    "token" : "file"
    "token" : "2"
    "token" : "20"
    "token" : "201"
    "token" : "2012"
    "token" : "0"
    "token" : "01"
    "token" : "1"
    "token" : "13"
    "token" : "d"
    "token" : "do"
    "token" : "doc"

OK, that seems to be working correctly. So let's add some docs:

    curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "My_first_file_created_at_2012.01.13.doc" }'
    curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "My_second_file_created_at_2012.01.13.pdf" }'
    curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "Another file.txt" }'
    curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "And_again_another_file.docx" }'
    curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "foo.bar.txt" }'
    curl -X POST "http://localhost:9200/files/_refresh"

And try a search:

    curl -XGET 'http://127.0.0.1:9200/files/file/_search?pretty=1'  -d '
    {
       "query" : {
          "text" : {
             "filename" : "2012.01"
          }
       }
    }
    '

    # {
    #    "hits" : {
    #       "hits" : [
    #          {
    #             "_source" : {
    #                "filename" : "My_second_file_created_at_2012.01.13.pdf"
    #             },
    #             "_score" : 0.06780553,
    #             "_index" : "files",
    #             "_id" : "PsDvfFCkT4yvJnlguxJrrQ",
    #             "_type" : "file"
    #          },
    #          {
    #             "_source" : {
    #                "filename" : "My_first_file_created_at_2012.01.13.doc"
    #             },
    #             "_score" : 0.06780553,
    #             "_index" : "files",
    #             "_id" : "ER5RmyhATg-Eu92XNGRu-w",
    #             "_type" : "file"
    #          }
    #       ],
    #       "max_score" : 0.06780553,
    #       "total" : 2
    #    },
    #    "timed_out" : false,
    #    "_shards" : {
    #       "failed" : 0,
    #       "successful" : 5,
    #       "total" : 5
    #    },
    #    "took" : 4
    # }

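Success!

UPDATE

I realised that a search for 2012.01 would match both 2012.01.12 and 2012.12.01, so I tried changing the query to use a text phrase query instead. However, this didn't work. It turns out that the edge ngram filter increments the position count for each ngram (I would have expected each ngram to keep the position of the start of its word).

The issue mentioned in point (3) above is only a problem when using a query_string, field, or text query, which tries to match ANY token. A text_phrase query, however, tries to match ALL of the tokens, and in the correct order.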
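The position behaviour is easier to see in a toy model. The Python sketch below is not how Lucene implements phrase queries: it simply assumes, as described above, that every edge ngram takes the next position, and then compares an any-token match with a positional phrase match for analyzers modelled on filename_search and filename_index. All function names are invented for this illustration, and max_gram is ignored because no word here is longer than 20 characters.

```python
import re

SEPARATORS = re.compile(r"[\W_]+")  # stand-in for the letters-and-digits split


def words(text):
    return [part.lower() for part in SEPARATORS.split(text) if part]


def search_analyze(text):
    """Model of filename_search: one token per word, positions 0, 1, 2, ..."""
    return list(enumerate(words(text)))


def index_analyze(text):
    """Model of filename_index: each edge ngram takes the next position."""
    grams = [word[:size] for word in words(text) for size in range(1, len(word) + 1)]
    return list(enumerate(grams))


def matches_any(doc_tokens, query_tokens):
    """Any-token match: at least one query term occurs in the document."""
    doc_terms = {term for _, term in doc_tokens}
    return any(term in doc_terms for _, term in query_tokens)


def matches_phrase(doc_tokens, query_tokens):
    """Phrase match: all query terms occur at the same relative positions."""
    if not query_tokens:
        return False
    doc_pairs = set(doc_tokens)
    first_position = query_tokens[0][0]
    for start, _ in doc_tokens:
        shifted = ((start + pos - first_position, term) for pos, term in query_tokens)
        if all(pair in doc_pairs for pair in shifted):
            return True
    return False


if __name__ == "__main__":
    january = index_analyze("My_first_file_created_at_2012.01.13.doc")
    december = index_analyze("My_third_file_created_at_2012.12.01.doc")
    query = "2012.01"
    for label, doc in (("2012.01.13", january), ("2012.12.01", december)):
        print(
            label,
            matches_any(doc, search_analyze(query)),
            matches_phrase(doc, search_analyze(query)),
            matches_phrase(doc, index_analyze(query)),
        )
```

In this model the any-token match accepts both dates, the phrase match with search-side tokens accepts neither (01 never sits directly after 2012 in the index), and the phrase match with index-side tokens accepts only the January file. That is why the phrase query needs to analyze the search text with filename_index.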

To demonstrate the problem, index another doc with a different date:

    curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "My_third_file_created_at_2012.12.01.doc" }'
    curl -X POST "http://localhost:9200/files/_refresh"

And run the same search as above:

    curl -XGET 'http://127.0.0.1:9200/files/file/_search?pretty=1'  -d '
    {
       "query" : {
          "text" : {
             "filename" : {
                "query" : "2012.01"
             }
          }
       }
    }
    '

    # {
    #    "hits" : {
    #       "hits" : [
    #          {
    #             "_source" : {
    #                "filename" : "My_third_file_created_at_2012.12.01.doc"
    #             },
    #             "_score" : 0.22097087,
    #             "_index" : "files",
    #             "_id" : "xmC51lIhTnWplOHADWJzaQ",
    #             "_type" : "file"
    #          },
    #          {
    #             "_source" : {
    #                "filename" : "My_first_file_created_at_2012.01.13.doc"
    #             },
    #             "_score" : 0.13137488,
    #             "_index" : "files",
    #             "_id" : "ZUezxDgQTsuAaCTVL9IJgg",
    #             "_type" : "file"
    #          },
    #          {
    #             "_source" : {
    #                "filename" : "My_second_file_created_at_2012.01.13.pdf"
    #             },
    #             "_score" : 0.13137488,
    #             "_index" : "files",
    #             "_id" : "XwLNnSlwSeyYtA2y64WuVw",
    #             "_type" : "file"
    #          }
    #       ],
    #       "max_score" : 0.22097087,
    #       "total" : 3
    #    },
    #    "timed_out" : false,
    #    "_shards" : {
    #       "failed" : 0,
    #       "successful" : 5,
    #       "total" : 5
    #    },
    #    "took" : 5
    # }

The first result has the date 2012.12.01, which is not the best match for 2012.01. So, to match only that exact phrase, we can do:

    curl -XGET 'http://127.0.0.1:9200/files/file/_search?pretty=1'  -d '
    {
       "query" : {
          "text_phrase" : {
             "filename" : {
                "query" : "2012.01",
                "analyzer" : "filename_index"
             }
          }
       }
    }
    '

    # {
    #    "hits" : {
    #       "hits" : [
    #          {
    #             "_source" : {
    #                "filename" : "My_first_file_created_at_2012.01.13.doc"
    #             },
    #             "_score" : 0.55737644,
    #             "_index" : "files",
    #             "_id" : "ZUezxDgQTsuAaCTVL9IJgg",
    #             "_type" : "file"
    #          },
    #          {
    #             "_source" : {
    #                "filename" : "My_second_file_created_at_2012.01.13.pdf"
    #             },
    #             "_score" : 0.55737644,
    #             "_index" : "files",
    #             "_id" : "XwLNnSlwSeyYtA2y64WuVw",
    #             "_type" : "file"
    #          }
    #       ],
    #       "max_score" : 0.55737644,
    #       "total" : 2
    #    },
    #    "timed_out" : false,
    #    "_shards" : {
    #       "failed" : 0,
    #       "successful" : 5,
    #       "total" : 5
    #    },
    #    "took" : 7
    # }

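Or, if you still want to match all three files (because the user might remember some of the words in the filename, but in the wrong order), you can run both queries but boost the importance of filenames that have the words in the correct order: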

    curl -XGET 'http://127.0.0.1:9200/files/file/_search?pretty=1'  -d '
    {
       "query" : {
          "bool" : {
             "should" : [
                {
                   "text_phrase" : {
                      "filename" : {
                         "boost" : 2,
                         "query" : "2012.01",
                         "analyzer" : "filename_index"
                      }
                   }
                },
                {
                   "text" : {
                      "filename" : "2012.01"
                   }
                }
             ]
          }
       }
    }
    '

    # [Fri Feb 24 16:31:02 2012] Response:
    # {
    #    "hits" : {
    #       "hits" : [
    #          {
    #             "_source" : {
    #                "filename" : "My_first_file_created_at_2012.01.13.doc"
    #             },
    #             "_score" : 0.56892186,
    #             "_index" : "files",
    #             "_id" : "ZUezxDgQTsuAaCTVL9IJgg",
    #             "_type" : "file"
    #          },
    #          {
    #             "_source" : {
    #                "filename" : "My_second_file_created_at_2012.01.13.pdf"
    #             },
    #             "_score" : 0.56892186,
    #             "_index" : "files",
    #             "_id" : "XwLNnSlwSeyYtA2y64WuVw",
    #             "_type" : "file"
    #          },
    #          {
    #             "_source" : {
    #                "filename" : "My_third_file_created_at_2012.12.01.doc"
    #             },
    #             "_score" : 0.012931341,
    #             "_index" : "files",
    #             "_id" : "xmC51lIhTnWplOHADWJzaQ",
    #             "_type" : "file"
    #          }
    #       ],
    #       "max_score" : 0.56892186,
    #       "total" : 3
    #    },
    #    "timed_out" : false,
    #    "_shards" : {
    #       "failed" : 0,
    #       "successful" : 5,
    #       "total" : 5
    #    },
    #    "took" : 4
    # }

Chris Rode · Answer

I believe this is because of the tokenizer being used.

http://www.elasticsearch.org/guide/reference/index-modules/analysis/lowercase-tokenizer.html

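The lowercase tokenizer splits on word boundaries, so 2012.01.13 will be indexed as "2012", "01" and "13". Searching for the string "2012.01.13" will obviously not match.

One option would be to apply the tokenization to the search as well. Then a search for "2012.01.13" is tokenized into the same tokens as in the index, and it will match. This is also convenient because you then don't need to lowercase your searches in code every time.

The second option would be to use an n-gram tokenizer instead of the filter. That means word boundaries are ignored (and you will get the "_" too), but you may run into problems with case mismatches, which is presumably why you added the lowercase tokenizer in the first place.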
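As a rough illustration of the second option, here is a Python sketch that builds character ngrams over the whole lowercased filename and looks up the search term as a single lowercased token. This is only a model of the idea, not Elasticsearch code. The min_gram and max_gram values are taken from the question, the helper names are made up, and lowercasing both sides is my assumption for dealing with the case mismatch.

```python
def char_ngrams(text, min_gram=3, max_gram=255):
    """All substrings of text with a length between min_gram and max_gram."""
    longest = min(max_gram, len(text))
    return {
        text[start:start + size]
        for size in range(min_gram, longest + 1)
        for start in range(len(text) - size + 1)
    }


FILENAMES = [
    "My_first_file_created_at_2012.01.13.doc",
    "My_second_file_created_at_2012.01.13.pdf",
    "Another file.txt",
    "And_again_another_file.docx",
    "foo.bar.txt",
]

# Index time: ngrams over the whole lowercased filename, separators included.
INDEX = {name: char_ngrams(name.lower()) for name in FILENAMES}


def search(term):
    """Search time: lowercase the term, but do not split it into ngrams."""
    needle = term.lower()
    return [name for name in FILENAMES if needle in INDEX[name]]


if __name__ == "__main__":
    print(search("2012.01.13"))
    print(search("ile"))
```

In this model 2012.01.13 finds the first two files and ile finds everything except foo.bar.txt, which is the behaviour asked for in the question. The price is a much larger set of indexed tokens per filename.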