Thursday, 21 November 2013

Reindexing Content in Elasticsearch

One of the crucial parts of any search application is the way you map your content to the analyzers. The mapping determines which query terms match the terms that are indexed with the documents. Sometimes during development you might notice that you didn't get this right from the beginning and need to reindex your data with a new mapping. While for some applications you can easily start the indexing process again, this becomes more difficult for others. Luckily, Elasticsearch by default stores the original content in the _source field. In this short article I will show you how to use a script developed by Simon Willnauer that lets you retrieve all the data and reindex it with a new mapping.

You can do the same thing in an easier way using only the stream2es utility. Have a look at this post if you are interested.

Reindexing

Suppose you have indexed documents in Elasticsearch. Imagine that there are so many of them that they cannot easily be indexed again from the original source, or that reindexing from scratch would take a lot of time.

curl -XPOST "http://localhost:9200/twitter/tweet/" -d'
{
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elastic Search"
}'
curl -XPOST "http://localhost:9200/twitter/tweet/" -d'
{
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:14:14",
    "message" : "Elasticsearch works!"
}'

Initially this will automatically create a mapping that is derived from the values:

curl -XGET "http://localhost:9200/twitter/tweet/_mapping?pretty=true"
{
  "tweet" : {
    "properties" : {
      "message" : {
        "type" : "string"
      },
      "post_date" : {
        "type" : "date",
        "format" : "dateOptionalTime"
      },
      "user" : {
        "type" : "string"
      }
    }
  }
}

Now if you notice that you would like to change one of the existing fields to another type, you need to reindex, as Elasticsearch doesn't allow you to modify the mapping for existing fields. Adding new fields is fine, but changing existing ones is not (a rejected request of this kind is sketched below, after the search output). For reindexing you can leverage the _source field that you can also see when querying a document:

curl -XGET "http://localhost:9200/twitter/tweet/_search?q=user:kimchy&pretty=true&size=1"
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.30685282,
    "hits" : [ {
      "_index" : "twitter",
      "_type" : "tweet",
      "_id" : "oaFqxMnqSrex6T7_Ut-erw",
      "_score" : 0.30685282, "_source" : {
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elastic Search"
}

    } ]
  }
}
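To see why the detour via reindexing is necessary at all: trying to change the existing user field in place, for example to make it not_analyzed, is rejected by Elasticsearch. A minimal sketch, assuming the put mapping API of the Elasticsearch versions current at the time; the exact error message differs between versions:

curl -XPUT "http://localhost:9200/twitter/tweet/_mapping" -d'
{
    "tweet" : {
        "properties" : {
            "user" : { "type" : "string", "index" : "not_analyzed" }
        }
    }
}'
# rejected with a merge conflict because the index setting of the existing user field would change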

For his "no slides no bullshit introduction to Elasticsearch", Simon Willnauer has implemented a script that retrieves the _source fields for all documents of an index. After installing the prerequisites, you can use it by passing in your index name:

fetchSource.sh twitter > result.json

It prints all the documents to stdout, which can be redirected to a file. We can now delete our index and recreate it using a different mapping.

curl -XDELETE http://localhost:9200/twitter
curl -XPOST "http://localhost:9200/twitter" -d'
{
    "mappings" : {
        "tweet" : {
            "properties" : {
                "user" : { "type" : "string", "index" : "not_analyzed" }
            }
        }
    }
}'
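Before sending the data back in you can check that the new mapping is in place, using the same mapping request as above. At this point only the user field is mapped explicitly; message and post_date will be added dynamically again once the documents are reindexed:

curl -XGET "http://localhost:9200/twitter/tweet/_mapping?pretty=true"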

The file we just created can now be sent to Elasticsearch again using the handy stream2es utility.

stream2es stdin --target "http://localhost:9200/twitter/tweet" < result.json

All your documents are now indexed using the new mapping.
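If you want to make sure nothing got lost along the way, you can compare the document count with the number of lines in result.json, one line per document. A quick sanity check (the count might lag behind for a moment until the index is refreshed):

curl -XGET "http://localhost:9200/twitter/tweet/_count?pretty=true"
wc -l result.json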

Implementation

Let's look at the details of the script. At the time of writing this post the relevant part of the script looks like this:

SCROLL_ID=`curl -s -XGET 'localhost:9200/'${INDEX_NAME}'/_search?search_type=scan&scroll=11m&size=250' -d '{"query" : {"match_all" : {} }}' | jq '._scroll_id' | sed s/\"//g`
RESULT=`curl -s -XGET 'localhost:9200/_search/scroll?scroll=10m' -d ${SCROLL_ID}`

while [[ `echo ${RESULT} | jq -c '.hits.hits | length'` -gt 0 ]] ; do
  #echo "Processed batch of " `echo ${RESULT} | jq -c '.hits.hits | length'`
  SCROLL_ID=`echo $RESULT | jq '._scroll_id' | sed s/\"//g`
  echo $RESULT | jq -c '.hits.hits[] | ._source + {_id}' 
  RESULT=$(eval "curl -s -XGET 'localhost:9200/_search/scroll?scroll=10m' -d ${SCROLL_ID}")
done

It uses scrolling to efficiently traverse the documents. Processing of the JSON output is done using jq, a lightweight and flexible command-line JSON processor, which I should have used as well when querying the SonarQube REST API.

The first line in the script starts a scan search that uses scrolling. The scroll will be valid for 11 minutes, returns up to 250 documents per shard on each request and matches all documents via the match_all query. The response to this first request doesn't contain any documents yet, only the _scroll_id, which is then extracted with jq. The final sed command removes the quotes around it.
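As an aside, jq can strip the quotes itself using its raw output flag -r, which would make the sed step unnecessary. A sketch for the twitter index used above, keeping the rest of the request as in the original script:

SCROLL_ID=`curl -s -XGET 'localhost:9200/twitter/_search?search_type=scan&scroll=11m&size=250' -d '{"query" : {"match_all" : {} }}' | jq -r '._scroll_id'`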

The scroll id is then used to send the follow-up queries to Elasticsearch. On each iteration the loop checks whether there are any hits at all; if there are, the response contains a new scroll id for the next batch. The result is echoed to the console: .hits.hits[] returns the list of all hits, and the pipe symbol in jq processes each hit with the filter on the right, which prints the source as well as the id of the hit.
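To see what that filter does, you can run it on a minimal hand-crafted response; the document values are taken from the examples above, the id is shortened:

echo '{"hits":{"hits":[{"_id":"1","_source":{"user":"kimchy","post_date":"2009-11-15T14:12:12","message":"trying out Elastic Search"}}]}}' | jq -c '.hits.hits[] | ._source + {_id}'
# {"user":"kimchy","post_date":"2009-11-15T14:12:12","message":"trying out Elastic Search","_id":"1"}

Each line of result.json has this shape: the original document plus its _id, ready to be picked up by stream2es.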

Conclusion

The script is a very useful addition to your Elasticsearch toolbox. You can use it to reindex or just export your content. I am glad I looked at the details of the implementation, as jq will surely come in handy in the future as well.

About Florian Hopf

I am working as a freelance software developer and consultant in Karlsruhe, Germany, and have written a German book about Elasticsearch. If you liked this post you can follow me on Twitter or subscribe to my feed to get notified of new posts. If you think I could help you and your company and you'd like to work with me, please contact me directly.
