Thursday, November 28, 2013

Reindexing Content in Elasticsearch with stream2es

Last week I wrote about reindexing content in Elasticsearch using a script that extracts the source field from the indexed content. You can use it when your mapping changes or when you need to adjust the index settings. After publishing the post, Drew Raines mentioned that there is an easier way using only the stream2es utility. Time to have a look at it!

stream2es can be used to stream content from several inputs to Elasticsearch. In my last post I used it to stream a file containing the sources of documents to an Elasticsearch index. Besides that it can index data from Wikipedia, from Twitter, or directly from another Elasticsearch index, which is what we will look at now.
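For example, the Wikipedia input can be used to fill a test index with real documents. This is only a hedged sketch based on the stream2es README at the time; the target index name and the --max-docs option are assumptions, not something used in this post:

# index a limited number of Wikipedia articles into a local test index
stream2es wiki --max-docs 500 --target http://localhost:9200/wikipedia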

Again, we are indexing some documents:

curl -XPOST "http://localhost:9200/twitter/tweet/" -d'
{
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elastic Search"
}'
curl -XPOST "http://localhost:9200/twitter/tweet/" -d'
{
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:14:14",
    "message" : "Elasticsearch works!"
}'

Now, if we need to adjust the mapping we can just create a new index with the new mapping:

curl -XPOST "http://localhost:9200/twitter2" -d'
{
    "mappings" : {
        "tweet" : {
            "properties" : {
                "user" : { "type" : "string", "index" : "not_analyzed" }
            }
        }
    }
}'

You can now use stream2es to transfer the documents from the old index to the new one:

stream2es es --source http://localhost:9200/twitter/ --target http://localhost:9200/twitter2/

This will make our documents available in the new index:

curl -XGET http://localhost:9200/twitter2/_count?pretty=true
{                                                                  
  "count" : 2,                                                      
  "_shards" : {                                                    
    "total" : 5,                                                    
    "successful" : 5,                                               
    "failed" : 0                                                   
  }
}

You can now delete the old index. To keep your data available under the old index name, you can also create an alias that points to your new index:

curl -XDELETE http://localhost:9200/twitter
curl -XPOST 'http://localhost:9200/_aliases' -d '
{
    "actions" : [
        { "add" : { "index" : "twitter2", "alias" : "twitter" } }
    ]
}'

Looking at the mapping, you can see that the twitter name now points to our updated version:

curl -XGET http://localhost:9200/twitter/tweet/_mapping?pretty=true
{
  "tweet" : {
    "properties" : {
      "bytes" : {
        "type" : "long"
      },
      "message" : {
        "type" : "string"
      },
      "offset" : {
        "type" : "long"
      },
      "post_date" : {
        "type" : "date",
        "format" : "dateOptionalTime"
      },
      "user" : {
        "type" : "string",
        "index" : "not_analyzed",
        "omit_norms" : true,
        "index_options" : "docs"
      }
    }
  }
}

Thursday, November 21, 2013

Reindexing Content in Elasticsearch

One of the crucial parts of any search application is the way you map your content to the analyzers, as it determines which query terms match the terms that are indexed with the documents. Sometimes during development you might notice that you didn't get this right from the beginning and need to reindex your data with a new mapping. While for some applications you can easily start the indexing process again, this becomes more difficult for others. Luckily, Elasticsearch by default stores the original content in the _source field. In this short article I will show you how to use a script developed by Simon Willnauer that lets you retrieve all the data and reindex it with a new mapping.

You can do the same thing in an easier way using only the stream2es utility. Have a look at this post if you are interested.

Reindexing

Suppose you have indexed documents in Elasticsearch. Imagine that there are a lot of them, that they cannot easily be indexed again from the original source, or that reindexing would take a long time.

curl -XPOST "http://localhost:9200/twitter/tweet/" -d'
{
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elastic Search"
}'
curl -XPOST "http://localhost:9200/twitter/tweet/" -d'
{
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:14:14",
    "message" : "Elasticsearch works!"
}'

Initially this will create a mapping that is determined from the indexed values.

curl -XGET "http://localhost:9200/twitter/tweet/_mapping?pretty=true"
{
  "tweet" : {
    "properties" : {
      "message" : {
        "type" : "string"
      },
      "post_date" : {
        "type" : "date",
        "format" : "dateOptionalTime"
      },
      "user" : {
        "type" : "string"
      }
    }
  }
}

Now, if you notice that you would like to change some of the existing fields to another type, you need to reindex, as Elasticsearch doesn't allow you to modify the mapping of existing fields. Adding new fields is fine, but changing existing ones is not. For reindexing you can leverage the _source field that you can also see when querying a document.

curl -XGET "http://localhost:9200/twitter/tweet/_search?q=user:kimchy&pretty=true&size=1"
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.30685282,
    "hits" : [ {
      "_index" : "twitter",
      "_type" : "tweet",
      "_id" : "oaFqxMnqSrex6T7_Ut-erw",
      "_score" : 0.30685282, "_source" : {
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elastic Search"
}

    } ]
  }
}

For his "no slides no bullshit introduction to Elasticsearch" Simon Willnauer has implemented a script that retrieves the _source fields for all documents of an index. After installing the prerequisites you can use it by passing in your index name:

fetchSource.sh twitter > result.json

It prints all the documents to stdout which can be redirected to a file. We can now delete our index and recreate it using a different mapping.

curl -XDELETE http://localhost:9200/twitter
curl -XPOST "http://localhost:9200/twitter" -d'
{
    "mappings" : {
        "tweet" : {
            "properties" : {
                "user" : { "type" : "string", "index" : "not_analyzed" }
            }
        }
    }
}'

The file we just created can now be sent to Elasticsearch again using the handy stream2es utility.

stream2es stdin --target "http://localhost:9200/twitter/tweet" < result.json

All your documents are now indexed using the new mapping.

Implementation

Let's look at the details of the script. At the time of writing this post the relevant part of the script looks like this:

SCROLL_ID=`curl -s -XGET 'localhost:9200/'${INDEX_NAME}'/_search?search_type=scan&scroll=11m&size=250' -d '{"query" : {"match_all" : {} }}' | jq '._scroll_id' | sed s/\"//g`
RESULT=`curl -s -XGET 'localhost:9200/_search/scroll?scroll=10m' -d ${SCROLL_ID}`

while [[ `echo ${RESULT} | jq -c '.hits.hits | length'` -gt 0 ]] ; do
  #echo "Processed batch of " `echo ${RESULT} | jq -c '.hits.hits | length'`
  SCROLL_ID=`echo $RESULT | jq '._scroll_id' | sed s/\"//g`
  echo $RESULT | jq -c '.hits.hits[] | ._source + {_id}' 
  RESULT=$(eval "curl -s -XGET 'localhost:9200/_search/scroll?scroll=10m' -d ${SCROLL_ID}")
done

It uses scrolling to efficiently traverse the documents. Processing of the JSON output is done using jq, a lightweight and flexible command-line JSON processor, which I should have used as well when querying the SonarQube REST API.

The first line in the script creates a scan search that uses scrolling. The scroll will be valid for 11 minutes, returns 250 documents per shard on each request, and queries all documents via the match_all query. The response doesn't contain any documents yet, only the _scroll_id, which is then extracted with jq. The final sed command removes the quotes around it.

The scroll id is now used to send follow-up requests to Elasticsearch. On each iteration the loop checks whether there are any hits at all; if there are, the response contains a new scroll id for the next batch. The result is echoed to the console: .hits.hits[] returns the list of all hits, and the pipe symbol in jq processes each hit with the filter on the right, which prints the source as well as the id of the hit.
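To see what this jq filter does in isolation, here is a minimal sketch with a hand-written hit (the values are made up for illustration):

echo '{"_id":"1","_source":{"user":"kimchy","message":"Elasticsearch works!"}}' | jq -c '._source + {_id}'
{"user":"kimchy","message":"Elasticsearch works!","_id":"1"}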

Conclusion

The script is a very useful addition to your Elasticsearch toolbox. You can use it to reindex or just to export your content. I am glad I looked at the details of the implementation, as jq can come in really handy in the future as well.

Saturday, November 16, 2013

Devoxx in Tweets

For the first time in several years I unfortunately had to skip this year's Devoxx. There are so many tweets that remind me of the good talks going on there that I thought I would do something useful with them. So again I indexed them in Elasticsearch using the Twitter river, which means I can look at them using Kibana. David Pilato has also set up a public instance, and I could imagine that there will be a more thorough analysis done by the Devoxx team, but here are my thoughts on this year's Devoxx without having been there.

I'll be looking at three things: the top 10 mentions, the top 10 hashtags, and the tweet distribution over time. For the mentions I have excluded @Devoxx and @java, and for the hashtags I have excluded #devoxx, #dvx13 and #dv13, as those mentions and tags are too dominant and don't tell a lot. I have collected all tweets mentioning the term devoxx, so there will be some tweets I missed. Each retweet counts as a separate tweet.
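For those curious how such a ranking can be pulled out of Elasticsearch 0.90, a terms facet with an exclude list is one way to do it. This is only a hedged sketch; the index name devoxx and the field name hashtag.text (as created by the Twitter river) are assumptions, not details from my actual setup:

curl -XPOST "http://localhost:9200/devoxx/_search?pretty=true" -d'
{
    "size" : 0,
    "facets" : {
        "top_hashtags" : {
            "terms" : {
                "field" : "hashtag.text",
                "size" : 10,
                "exclude" : ["devoxx", "dvx13", "dv13"]
            }
        }
    }
}'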

Overall Trends

Looking at the timeline of the whole week you can see that the number of tweets is high at the beginning and continually rises, with Thursday having even more tweets than Wednesday, which is quite a surprise to me. I would have thought that the first conference day would be the one with the most tweets.

Stephan007, the founder of Devoxx, has the most mentions, which is no surprise. Chet Haase and Romain Guy follow. I have never seen a talk by either of them, but I probably should. The Dart language is the dominant hashtag, with a lot of buzz around its 1.0 release. Java, Android and Scala are still hot technologies. Android is a bit of a surprise here. It's nice that the Devoxx4Kids initiative ranks quite high.

Daily Analysis

Monday

On Monday the top mention is @AngularJS. Of course this is caused by the two AngularJS university sessions that lasted nearly the whole day. Angular is a hot topic, but I am not yet planning to do any work with it. The session on Java EE 7 also created a lot of interest, as can be seen by the mentions of its hosts Arun Gupta and Antonio Goncalves. They especially encouraged people to participate on Twitter, which seems to have been received very well. Scala is another hot topic, with the university session by Dick Wall and Joshua Suereth that I would really have liked to see.

Tuesday

Tuesday is dominated by the two excellent speakers Matt Raible and Venkat Subramaniam. I especially regret that I couldn't see Venkat in action, who I consider to be one of the best speakers around. I am not sure what the tag hackergarden is referring to, as I didn't find an event on Monday evening or Tuesday. There is also quite some interest in Reactor, the reactive framework of the Spring ecosystem.

Wednesday

Brian Goetz got a lot of mentions for the keynote. I think it's a surprise that there are so many mentions of David Blevins for his talk on EJB 3.2 and beyond, which I wouldn't have expected to be that popular. The big event of the day was the launch of Ceylon 1.0, as can be seen from the hashtag. I have heard good things about Ceylon, but I still consider it an underdog among the alternative JVM languages.

Thursday

Romain Guy is leading the mentions with his very popular talk "Filthy Rich Android Clients", followed by Jonas Boner of Akka fame and Venkat Subramaniam. The launch of Dart 1.0 dominates the keywords. The Javaposse still ranks in the top 10 with their popular traditional session.

Friday

Friday normally has fewer participants than the other days. Joshua Suereth received a lot of tweets for his Scala talk, ranking high both in mentions and hashtags. The session on Google Glass was also very popular. I am not sure which session caused the mentions of Dagger.

Programming Language Popularity

Niko Schmuck proposed adding the language popularity over the week. As this is quite interesting, here is the totally unscientific popularity chart that should of course determine which language you learn next. I am not querying the hashtags but any mention of the terms.
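The raw number for a single language can be retrieved with a simple count query. Again only a hedged sketch; the index name devoxx and the field name text (the tweet body as indexed by the Twitter river) are assumptions:

curl -XGET "http://localhost:9200/devoxx/_count?q=text:scala&pretty=true"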

Java dominates but JavaScript is very strong on Monday and Thursday. Ceylon has its share on Wednesday while Thursday is the Dart day. Scala is very popular on Monday and Friday.

A ranked version:

Java: 1234
JS: 584
Dart: 490
Scala: 252
Ceylon: 172
Groovy: 171
Clojure: 94
Kotlin: 16

The Drink Tweets

As there are quite a few people tweeting, we can see some trends with regard to the drink tweets. First, the coffee tweets:

Quite a spike on Monday, with people either mentioning that they need coffee or complaining about the coffee. This repeats on Tuesday and Wednesday in the morning; people seem to have accepted the situation by Thursday.

Another common topic, especially since the conference is located in Belgium, is beer.

Surprise, surprise: people tend to tweet about beer in the evening. I like the huge Javaposse spike on Thursday, with a lot of mentions of the beer sponsor Atlassian.

Conclusion

Though I wasn't there, I could get a small glimpse of the trends at this year's Devoxx. As soon as the videos are available I will buy the account for this year's conference, not only because there are so many interesting talks to see, but also because the Devoxx team is doing a fantastic job that I'd like to support in any way I can.

Updates
  • 17.11. Added a section on programming language popularity
  • 18.11. Updated the weekly diagram with a more accurate version

Sunday, November 10, 2013

Lucene Solr Revolution 2013 in Dublin

I just returned from Lucene Solr Revolution Europe, the conference on everything Lucene and Solr, which was held in Dublin this year. I always like to recap what I took away from a conference, so here are some impressions.

The Venue

In the spirit of last year's conference, which was merged with ApacheCon and held in a soccer stadium in Sinsheim, this year's venue was a rugby stadium. It seems to be quite common for conferences to be organized there, and the location was well suited. For some of the room changes you had to walk quite a distance, but that's nothing that couldn't be managed.

The Talks

As there were four tracks in parallel, choosing which talk to attend could prove difficult. There were so many interesting things to choose from. Fortunately all the talks have been recorded and will be made available for free on the conference website.

The following is a selection of the talks that were most valuable to me.

Keynote: Michael Busch on Lucene at Twitter

Michael Busch is a regular speaker at search conferences because Twitter is doing some interesting things. On the one hand they have to handle near-realtime search, massive data sets and lots of requests. On the other hand they can always be sure that their documents are of a certain size. They maintain two different data stores as Lucene indices: the realtime index that contains the most recent data and the archive index that makes older tweets searchable. They introduced the archive index only a few months ago, which in my opinion led to a far more reliable search experience. They have done some really interesting things, like encoding the position info of a term together with the doc id, because they only need a few bits to address positions in a 140-character document. They also changed some aspects of the posting list encoding because they always display results sorted by date. They are trying to make their changes more general so that those can be contributed back to Lucene.

Solr Indexing and Analysis Tricks by Erik Hatcher

I always enjoy listening to Erik Hatcher's talks, probably also because his university session at Devoxx 2009 was the driving factor for me to start using Solr. In this year's talk he presented lots of useful aspects of indexing data in Solr. One of the most interesting facts I took from this talk is the use of the ScriptUpdateProcessor that has been included in Solr since version 4. You can define scripts that are executed during indexing and can manipulate the document. This is a valuable alternative to copyFields, especially if you would like to have the content stored as well. By default you implement the logic in JavaScript, but there are alternatives available.

Hacking Lucene and Solr for Fun and Profit by Grant Ingersoll

Grant Ingersoll presented some applications of Lucene and Solr not directly involving search like Classification, Recommendations and Analytics. Some examples had been taken from his excellent book Taming Text (watch this blog for a review of the book in the near future).

Schemaless Solr and the Solr Schema REST API by Steve Rowe

One of the factors behind the success of Elasticsearch is its ease of use. You can download it and start indexing documents immediately without doing any configuration work. One of the features that enables this is the autodiscovery of field types from values. Starting with Solr 4.4 you can now use Solr in a similar way: you can configure Solr to manage your schema, and unknown fields are then created automatically based on the first value that is extracted by configured parsers. As with Elasticsearch, you shouldn't rely on this feature exclusively, so there is also a way to add new fields of a certain type via the Schema REST API. When Solr is in managed mode it will modify the schema.xml, so you might lose changes you made manually. For the future the developers are even thinking about moving away from XML for the managed mode, as there are better options when readability doesn't matter.
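To get a feeling for the Schema REST API, here is a hedged sketch of what listing and adding a field looked like around Solr 4.4. The core name collection1 and the field definition are assumptions for illustration, and the exact request format may differ in your version:

# list the fields currently known to the schema (read-only part of the Schema API)
curl -XGET "http://localhost:8983/solr/collection1/schema/fields?wt=json"

# add a new field of a given type to the managed schema
curl -XPUT "http://localhost:8983/solr/collection1/schema/fields/author" -H 'Content-type:application/json' -d '{"type":"string", "stored":true, "indexed":true}'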

Stump the Chump with Chris Hostetter

This seems to be a tradition at Lucene Solr Revolution. Chris Hostetter has to find solutions to problems that have been submitted before or are posted by the audience. It's a fun event but you can also learn a lot.

Query Latency Optimization with Lucene by Stefan Pohl

Stefan first introduced some basic latency factors and how to measure them. He recommended not instrumenting the low-level Lucene classes when profiling your application, as those rely heavily on HotSpot optimizations. Besides introducing the basic mechanisms of how conjunction (AND) and disjunction (OR) queries work, he described some recent Lucene improvements that can speed up your application, among those LUCENE-4571, the new minShouldMatch implementation, and LUCENE-4752, which allows custom ordering of documents in the index.

Relevancy Hacks for eCommerce by Varun Thacker

Varun introduced the basics of relevancy sorting in Lucene and Solr and how those might affect product searches. TF/IDF is sometimes not the best solution ("IDF is a measurement of rarity, not necessarily importance"). He also showed ways to influence the relevancy: implementing a custom Similarity class, boosting, and function queries.
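As an illustration of the last two options, a function query can be folded into the score via the edismax boost parameter. This is just a hedged sketch; the core name and the fields name, description and popularity are assumptions, not examples taken from the talk:

# multiply the text relevancy score with a function of a numeric popularity field
curl -XGET "http://localhost:8983/solr/collection1/select?q=running+shoes&defType=edismax&qf=name+description&boost=log(sum(popularity,1))&wt=json"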

What is in a Lucene Index by Adrien Grand

Adrien started with the basics of a Lucene index and how it differs from a database index: the dictionary structure, segments and merging. He then moved on to topics like the structure of the posting list, term vectors, the FST-based terms index and the difference between stored fields and doc values. This is a talk full of interesting details on the internal workings of Lucene and the implications for the performance of your application.

Conclusion

As said before, I couldn't attend all the talks I would have liked to. I heard especially good things about the following talks, which I will watch as soon as they are available:

  • Integrate Solr with Real-Time Stream Processing Applications by Timothy Potter
  • The Typed Index by Christoph Goller
  • Implementing a Custom Search Syntax Using Solr, Lucene and Parboiled by John Berryman

I really enjoyed Lucene Solr Revolution. Not only were there a lot of interesting talks to listen to, it was also a good opportunity to meet new people. On both evenings there were get-togethers with free drinks and food, which must have cost LucidWorks a fortune. I couldn't attend the closing remarks, but I heard they announced that they want to move to smaller, national events in Europe instead of one central conference. I hope those will still be events that attract as many committers and interesting people.
