Tuesday, May 28, 2013

Getting Started with ElasticSearch: Part 1 - Indexing

ElasticSearch is gaining huge momentum, with large installations like GitHub and Stack Overflow switching to it for its search capabilities. Its distributed nature makes it an excellent choice for large datasets with high availability requirements. In this two-part article I'd like to share what I learned building a small Java application just for search.

The example I am showing here is part of an application I use in talks to show the capabilities of Lucene, Solr and ElasticSearch. It's a simple webapp that can search user group talks. You can find the sources on GitHub.

Some experience with Solr can be helpful when starting with ElasticSearch, but there are also times when it's best not to stick to your old knowledge.

Installing ElasticSearch

There is no real installation process involved when starting with ElasticSearch. It's only a jar file that can be started immediately, either directly using the java command or via the shell scripts that are included with the binary distribution. You can pass the location of the configuration files and the index data using system properties. This is a Gradle snippet I am using to start an ElasticSearch instance:

task runES(type: JavaExec) {
    // the main class of the standalone server
    main = 'org.elasticsearch.bootstrap.ElasticSearch'
    classpath = sourceSets.main.runtimeClasspath
    // keep the ElasticSearch home and data directories inside the project
    systemProperties = ["es.path.home":'' + projectDir + '/elastichome',
                        "es.path.data":'' + projectDir + '/elastichome/data']
}

You might expect that ElasticSearch uses a bundled Jetty instance, as that has become rather common nowadays. But no: it implements the entire transport layer with the asynchronous networking library Netty, so you never deploy it to a Servlet container.

After you have started ElasticSearch it will be available at http://localhost:9200. Any further instances you start will automatically connect to the existing cluster and even pick another port automatically, so there is no need for configuration and you won't see any "Address already in use" problems.

You can check that your installation works using some curl commands.
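
A plain GET on the root URL should return some basic information about the node, including the version number:

curl -XGET 'http://localhost:9200'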

Index some data:

curl -XPOST 'http://localhost:9200/jug/talk/' -d '{
    "speaker" : "Alexander Reelsen",
    "date" : "2013-06-12T19:15:00",
    "title" : "Elasticsearch"
}'

And search it:

curl -XGET 'http://localhost:9200/jug/talk/_search?q=elasticsearch'

The URL contains two path segments that determine the index name (jug) and the type (talk). You can have multiple indices per ElasticSearch instance and multiple types per index. Each type has its own mapping (schema), but you can also search across multiple types and multiple indices. Note that we didn't create the index or the type: ElasticSearch figures out the index name and the mapping automatically from the URL and the structure of the indexed data.
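
For example, dropping the type searches all types in the jug index, and dropping the index as well searches across all indices:

curl -XGET 'http://localhost:9200/jug/_search?q=elasticsearch'
curl -XGET 'http://localhost:9200/_search?q=elasticsearch'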

Java Client

There are several alternative clients available when working with ElasticSearch from Java, such as Jest, which provides a POJO marshalling mechanism for indexing and for search results. In this example we are using the Client that is included in ElasticSearch. By default the client doesn't use the REST API but connects to the cluster as a normal node that just doesn't store any data. It knows about the state of the cluster and can route requests to the correct node, but supposedly consumes more memory. For our application this doesn't make a huge difference, but for production systems it's something to think about.

This is an example setup for a Client object that can then be used for indexing and searching:

Client client = NodeBuilder.nodeBuilder().client(true).node().client();
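
Note that this one-liner throws away the reference to the underlying Node. If you want to shut the client node down cleanly when your application stops, keep it around; a minimal sketch:

Node node = NodeBuilder.nodeBuilder().client(true).node();
Client client = node.client();
// ... index and search ...
node.close();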

You can use the client to create an index:

client.admin().indices().prepareCreate(INDEX).execute().actionGet();

Note that actionGet() isn't named this way because it is an HTTP GET request; it is a call on the Future object returned by execute(), so this is the blocking part of the call.
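
If you don't want to block, you can pass an ActionListener to execute() instead; roughly like this:

client.admin().indices().prepareCreate(INDEX).execute(new ActionListener<CreateIndexResponse>() {
    @Override
    public void onResponse(CreateIndexResponse response) {
        // the index was created
    }

    @Override
    public void onFailure(Throwable t) {
        // creating the index failed
    }
});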

Mapping

As you have seen with the indexing operation above, ElasticSearch doesn't require an explicit schema like Solr does. It automatically determines the likely types from the JSON you are sending to it. Of course this might not always be correct, and you might want to define custom analyzers for your content, so you can also adjust the mappings to your needs.

I was so used to the way Solr does this that I was looking for a way to add the mapping configuration via a file in the server config. This is indeed something you can do, using a file called default-mapping.json, or via index templates. On the other hand you can also use the REST-based put mapping API, which has the benefit that you don't need to distribute the file to all nodes manually and you don't need to restart the server. The mapping then is part of the cluster state and gets distributed to all nodes automatically.

ElasticSearch provides most of its API via Builder classes. Surprisingly I didn't find a Builder for the mapping. One way to construct it is to use the generic JSON builder:

XContentBuilder builder = XContentFactory.jsonBuilder().
  startObject().
    startObject(TYPE).
      startObject("properties").
        startObject("path").
          field("type", "string").field("store", "yes").field("index", "not_analyzed").
        endObject().
        startObject("title").
          field("type", "string").field("store", "yes").field("analyzer", "german").
        endObject().
        // more mapping
      endObject().
    endObject().
  endObject();
client.admin().indices().preparePutMapping(INDEX).setType(TYPE).setSource(builder).execute().actionGet();
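
With TYPE set to talk this produces JSON along these lines:

{
    "talk" : {
        "properties" : {
            "path" : { "type" : "string", "store" : "yes", "index" : "not_analyzed" },
            "title" : { "type" : "string", "store" : "yes", "analyzer" : "german" }
        }
    }
}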

Another way I have seen is to put the mapping in a file and just read it into a String, e.g. by using the Guava Resources class.
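
A minimal sketch of that approach, assuming the mapping JSON is available on the classpath as talk-mapping.json (the file name is made up for this example):

String mapping = Resources.toString(Resources.getResource("talk-mapping.json"), Charsets.UTF_8);
client.admin().indices().preparePutMapping(INDEX).setType(TYPE).setSource(mapping).execute().actionGet();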

After you have adjusted the mapping you can have a look at the result at the _mapping endpoint of the index at http://localhost:9200/jug/_mapping?pretty=true.

Indexing

Now we are ready to index some data. In the example application I am using simple data classes that represent the talks to be indexed. Again, you have different options for how to transform your objects into the JSON ElasticSearch understands. You can build it by hand, e.g. with the XContentBuilder we have already seen above, or, more conveniently, by using something like the JSON processor Jackson, which can serialize and deserialize Java objects to and from JSON. This is what it looks like when using the XContentBuilder:

XContentBuilder sourceBuilder = XContentFactory.jsonBuilder().startObject()
  .field("path", talk.path)
  .field("title", talk.title)
  .field("date", talk.date)
  .field("content", talk.content)
  .array("category", talk.categories.toArray(new String[0]))
  .array("speaker", talk.speakers.toArray(new String[0]))
  .endObject();
// the path doubles as the document id
IndexRequest request = new IndexRequest(INDEX, TYPE).id(talk.path).source(sourceBuilder);
client.index(request).actionGet();
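
If you prefer Jackson, a minimal sketch could look like this, assuming the fields of the Talk class serialize cleanly with the default settings:

ObjectMapper mapper = new ObjectMapper();
// writeValueAsString throws a JsonProcessingException on serialization errors
String json = mapper.writeValueAsString(talk);
client.prepareIndex(INDEX, TYPE, talk.path).setSource(json).execute().actionGet();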

You can also use the BulkRequest to avoid sending a separate request for each document.
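
A sketch of what that could look like, assuming a hypothetical buildSource(talk) helper that returns the XContentBuilder shown above:

BulkRequestBuilder bulkRequest = client.prepareBulk();
for (Talk talk : talks) {
    bulkRequest.add(client.prepareIndex(INDEX, TYPE, talk.path).setSource(buildSource(talk)));
}
BulkResponse bulkResponse = bulkRequest.execute().actionGet();
if (bulkResponse.hasFailures()) {
    // at least one document could not be indexed, inspect the item responses
}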

With ElasticSearch you don't need to commit after you have indexed. By default it will refresh the index every second, which is fast enough for most use cases. If you want to be able to search the data as soon as possible you can also trigger a refresh explicitly. This can be really useful when writing tests, where you don't want to wait a second between indexing and searching.
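
The refresh is triggered via the indices admin API, for example:

client.admin().indices().prepareRefresh(INDEX).execute().actionGet();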

This concludes the first part of this article on getting started with ElasticSearch using Java. The second part contains more information on searching the data we indexed.