[Memo] Indexing PDFs into Elasticsearch - どこにでもいるSEの備忘録

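I had intended to put basically only text data into Elasticsearch, but sooner or later you will probably want to index PDFs, Excel files, and the like. There is a slightly particular way to do it, so I'm noting it down here.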

I referred to the following articles.

qiita.com

dev.classmethod.jp

tombo2.hatenablog.com

ingest
Mapping definition
- Encoding conversion
Indexing
Verification
Thoughts

ingest

This time I'll assume the file is a PDF and walk through the steps.

Add the following line to the Dockerfile.

RUN elasticsearch-plugin install ingest-attachment

www.elastic.co

With Elasticsearch 6.2, it seems the ingest-attachment plugin can break a file's contents apart and index them.
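As a quick sanity check that the plugin really ended up in the image, something like the following might help. It is only a sketch: the `has_attachment_plugin` helper and the `ES_URL` variable are names made up for illustration, and it assumes the node listens on localhost:9200 like the rest of this memo. The `_cat/plugins` endpoint lists the plugins installed on each node.

```shell
# Assumption: Elasticsearch is reachable at localhost:9200 (override with ES_URL).
ES_URL="${ES_URL:-localhost:9200}"

# has_attachment_plugin (hypothetical helper): reads `_cat/plugins` output on
# stdin and succeeds if any line mentions ingest-attachment.
has_attachment_plugin() {
  grep -q 'ingest-attachment'
}

# Usage (requires a running node):
#   curl -s "$ES_URL/_cat/plugins" | has_attachment_plugin && echo "installed"
```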

Mapping definition

$ curl -X PUT 'localhost:9200/_ingest/pipeline/attachment' -H 'Content-Type: application/json' -d'
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "indexed_chars" : -1,
        "properties" : [
         "content",
         "content_type"
        ]
      }
    }
  ]
}
'
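Before indexing anything for real, the pipeline can probably be tried out with the simulate API, which runs a pipeline against sample documents without storing them. The sketch below only builds the request body; `simulate_body` and `ES_URL` are hypothetical names, and the PDF path in the usage comment is a placeholder.

```shell
# Assumption: Elasticsearch is reachable at localhost:9200 (override with ES_URL).
ES_URL="${ES_URL:-localhost:9200}"

# simulate_body FILE (hypothetical helper): prints a _simulate request body
# whose single test document carries FILE's contents, Base64-encoded and with
# line breaks stripped, in the "data" field that the pipeline reads.
simulate_body() {
  local encoded
  encoded=$(base64 < "$1" | tr -d '\n')
  printf '{ "docs" : [ { "_source" : { "data" : "%s" } } ] }\n' "$encoded"
}

# Usage (requires a running node and the "attachment" pipeline):
#   simulate_body /path/to/some.pdf |
#     curl -s -X POST "$ES_URL/_ingest/pipeline/attachment/_simulate?pretty" \
#          -H 'Content-Type: application/json' --data-binary @-
```

If the plugin and pipeline are working, the response should contain an `attachment` object with the extracted `content` and `content_type`.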

Encoding conversion

$ file_path='/path/to/your/**.pdf'

$ file=$(base64 $file_path | perl -pe 's/\n//g')

$ echo -e "{ \"index\" : { \"_index\" : \"shop\", \"_type\" : \"goods\", \"_id\" : \"1\", \"pipeline\": \"attachment\" } }\n{ \"data\" : \"$file\" }" > request_bulk.json
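When there is more than one PDF, the same two-line pattern (an action line followed by a source line) simply repeats. A minimal sketch, assuming the `_index` / `_type` / `_id` metadata keys of the Elasticsearch 6.x bulk API and numbering the documents 1, 2, 3, ... in argument order; `bulk_lines` and `build_bulk_request` are hypothetical helper names, not part of any tool.

```shell
# bulk_lines ID FILE (hypothetical helper): prints the action line and the
# source line for one file, routed through the "attachment" pipeline.
bulk_lines() {
  local id="$1" file="$2" encoded
  # Base64-encode the file and strip line breaks so the JSON stays on one line.
  encoded=$(base64 < "$file" | tr -d '\n')
  printf '{ "index" : { "_index" : "shop", "_type" : "goods", "_id" : "%s", "pipeline" : "attachment" } }\n' "$id"
  printf '{ "data" : "%s" }\n' "$encoded"
}

# build_bulk_request FILE... (hypothetical helper): emits one NDJSON stream
# covering all given files; every line ends with a newline, as _bulk requires.
build_bulk_request() {
  local id=1 file
  for file in "$@"; do
    bulk_lines "$id" "$file"
    id=$((id + 1))
  done
}

# Usage:
#   build_bulk_request /path/to/your/*.pdf > request_bulk.json
```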

Indexing

$ curl -H "Content-type: application/x-ndjson" -X POST 'http://localhost:9200/_bulk?refresh=false' --data-binary @request_bulk.json

$ curl -X POST 'localhost:9200/shop/_refresh'
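One thing worth checking here: the bulk API answers with HTTP 200 even when individual items fail, and reports that through a top-level `errors` flag in the response body. A rough check could look like this; `bulk_succeeded` is a hypothetical helper, and the plain `grep` is an assumption that would misfire if the same text happened to appear inside an item's error message.

```shell
# bulk_succeeded (hypothetical helper): reads a _bulk response on stdin and
# succeeds only if it contains "errors": false.
bulk_succeeded() {
  grep -q '"errors"[[:space:]]*:[[:space:]]*false'
}

# Usage (requires a running node):
#   curl -s -H 'Content-Type: application/x-ndjson' -X POST \
#        'localhost:9200/_bulk?refresh=false' --data-binary @request_bulk.json |
#     bulk_succeeded || echo 'some items failed' >&2
```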

Verification

$ curl -X GET 'localhost:9200/shop?pretty'
{
  "shop" : {
    "aliases" : { },
    "mappings" : {
      "goods" : {
        "properties" : {
          "attachment" : {
            "properties" : {
              "content" : {
                "type" : "text",
                "fields" : {
                  "keyword" : {
                    "type" : "keyword",
                    "ignore_above" : 256
                  }
                }
              },
              "content_type" : {
                "type" : "text",
                "fields" : {
                  "keyword" : {
                    "type" : "keyword",
                    "ignore_above" : 256
                  }
                }
              }
            }
          },
          "data" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          }
        }
      }
    },
    "settings" : {
      "index" : {
        "creation_date" : "1558211470064",
        "number_of_shards" : "5",
        "number_of_replicas" : "1",
        "uuid" : "EHqISnJpTwGOefPWoHf-yg",
        "version" : {
          "created" : "6020299"
        },
        "provided_name" : "shop"
      }
    }
  }
}
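The mapping shows that the extracted text lives in `attachment.content`, so a full-text search against that field should find the PDF. A minimal sketch of building such a query follows; `search_body` is a hypothetical helper, and it assumes the search term contains no double quotes or backslashes, since nothing is JSON-escaped. The `_source` filter keeps the large Base64 `data` field out of the response.

```shell
# search_body TERM (hypothetical helper): prints a match query against the
# extracted text, returning only attachment.content_type from each hit.
# Assumption: TERM needs no JSON escaping.
search_body() {
  printf '{ "_source" : [ "attachment.content_type" ], "query" : { "match" : { "attachment.content" : "%s" } } }\n' "$1"
}

# Usage (requires a running node with the "shop" index populated):
#   search_body 'Elasticsearch' |
#     curl -s -X POST 'localhost:9200/shop/_search?pretty' \
#          -H 'Content-Type: application/json' --data-binary @-
```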

Thoughts

This was just a memo, so nothing in particular.

That said, I have absolutely no idea what is actually going on...