Tuesday, May 28, 2013

Getting Started with ElasticSearch: Part 1 - Indexing

ElasticSearch is gaining huge momentum, with large installations like GitHub and Stack Overflow switching to it for their search capabilities. Its distributed nature makes it an excellent choice for large datasets with high availability requirements. In this two-part article I'd like to share what I learned while building a small Java application just for search.

The example I am showing here is part of an application I use in talks to show the capabilities of Lucene, Solr and ElasticSearch. It's a simple webapp that searches user group talks. You can find the sources on GitHub.

Some experience with Solr can be helpful when starting with ElasticSearch, but there are also times when it's better not to cling to your old knowledge.

Installing ElasticSearch

There is no real installation process involved when starting with ElasticSearch. It's just a JAR file that can be started immediately, either directly using the java command or via the shell scripts that are included with the binary distribution. You can pass the location of the configuration files and the index data using system properties. This is a Gradle snippet I am using to start an ElasticSearch instance:

task runES(type: JavaExec) {
    main = 'org.elasticsearch.bootstrap.ElasticSearch'
    classpath = sourceSets.main.runtimeClasspath
    systemProperties = ["es.path.home":'' + projectDir + '/elastichome',
                        "es.path.data":'' + projectDir + '/elastichome/data']
}

You might expect ElasticSearch to use a bundled Jetty instance, as that has become rather common nowadays. But no, it implements the entire transport layer with the asynchronous networking library Netty, so you never deploy it to a servlet container.

After you have started ElasticSearch it will be available at http://localhost:9200. Any further instances you start will automatically join the existing cluster and bind to another port, so there is no configuration needed and you won't see any "Address already in use" problems.

You can check that your installation works using some curl commands.

Index some data:

curl -XPOST 'http://localhost:9200/jug/talk/' -d '{
    "speaker" : "Alexander Reelsen",
    "date" : "2013-06-12T19:15:00",
    "title" : "Elasticsearch"
}'

And search it:

curl -XGET 'http://localhost:9200/jug/talk/_search?q=elasticsearch'

The URL contains two path fragments that determine the index name (jug) and the type (talk). You can have multiple indices per ElasticSearch instance and multiple types per index. Each type has its own mapping (schema), but you can also search across multiple types and multiple indices. Note that we didn't create the index or the type explicitly: ElasticSearch derives the index name from the URL and the mapping from the structure of the indexed data.

Java Client

There are several alternative clients available when working with ElasticSearch from Java, like Jest, which provides POJO marshalling for indexing and for search results. In this example we are using the Client that is included in ElasticSearch. By default the client doesn't use the REST API but connects to the cluster as a normal node that just doesn't store any data. It knows about the state of the cluster and can route requests to the correct node, but it supposedly consumes more memory. For our application this doesn't make a huge difference, but for production systems it's something to think about.

This is an example setup for a Client object that can then be used for indexing and searching:

Client client = NodeBuilder.nodeBuilder().client(true).node().client();
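
If you'd rather not have your client join the cluster as a node, the TransportClient connects remotely over the transport protocol instead. A minimal sketch, assuming the 0.90-era API, the default cluster name and the default transport port 9300:

Settings settings = ImmutableSettings.settingsBuilder()
        .put("cluster.name", "elasticsearch") // adjust if you changed the cluster name
        .build();
Client transportClient = new TransportClient(settings)
        .addTransportAddress(new InetSocketTransportAddress("localhost", 9300));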

You can use the client to create an index:

client.admin().indices().prepareCreate(INDEX).execute().actionGet();

Note that actionGet() isn't named this way because it is an HTTP GET request; it is a call on the Future object returned by execute(), and it is the blocking part of the call.
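
If you don't want to block, you can hand execute() an ActionListener instead. A small sketch, assuming you handle the failure case (e.g. an already existing index) yourself:

client.admin().indices().prepareCreate(INDEX).execute(new ActionListener<CreateIndexResponse>() {
    @Override
    public void onResponse(CreateIndexResponse response) {
        // index was created, continue asynchronously
    }

    @Override
    public void onFailure(Throwable t) {
        // handle the failure, e.g. the index already exists
    }
});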

Mapping

As you have seen with the indexing operation above, ElasticSearch doesn't require an explicit schema like Solr does. It automatically determines the likely types from the JSON you are sending to it. Of course this might not always be correct, and you might want to define custom analyzers for your content, so you can also adjust the mappings to your needs. Being used to the way Solr does this, I was looking for a way to add the mapping configuration via a file in the server config. This is indeed something you can do, using a file called default-mapping.json or via index templates. On the other hand you can also use the REST-based put mapping API, which has the benefit that you don't need to distribute the file to all nodes manually and you don't need to restart the server. The mapping then is part of the cluster state and will be distributed to all nodes automatically.

ElasticSearch provides most of its API via builder classes. Surprisingly, I didn't find a builder for the mapping. One way to construct it is to use the generic JSON builder:

XContentBuilder builder = XContentFactory.jsonBuilder().
  startObject().
    startObject(TYPE).
      startObject("properties").
        startObject("path").
          field("type", "string").field("store", "yes").field("index", "not_analyzed").
        endObject().
        startObject("title").
          field("type", "string").field("store", "yes").field("analyzer", "german").
        endObject().
        // more mapping
      endObject().
    endObject().
  endObject();
client.admin().indices().preparePutMapping(INDEX).setType(TYPE).setSource(builder).execute().actionGet();

Another way I have seen is to put the mapping in a file and just read it to a String, e.g. by using the Guava Resources class.
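
A minimal sketch of that approach, assuming a hypothetical talk-mapping.json on the classpath and Guava's Resources and Charsets classes on the build path:

// read the mapping definition from the classpath and send it as a String
String mapping = Resources.toString(Resources.getResource("talk-mapping.json"), Charsets.UTF_8);
client.admin().indices().preparePutMapping(INDEX).setType(TYPE).setSource(mapping).execute().actionGet();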

After you have adjusted the mapping you can inspect the result via the _mapping endpoint of the index: http://localhost:9200/jug/_mapping?pretty=true.

Indexing

Now we are ready to index some data. In the example application I am using simple data classes that represent the talks to be indexed. Again, you have different options for transforming your objects into the JSON ElasticSearch understands. You can build it by hand, e.g. with the XContentBuilder we have already seen above, or, more conveniently, by using something like the JSON processor Jackson that can serialize and deserialize Java objects to and from JSON. This is what it looks like when using the XContentBuilder:

XContentBuilder sourceBuilder = XContentFactory.jsonBuilder().startObject()
  .field("path", talk.path)
  .field("title", talk.title)
  .field("date", talk.date)
  .field("content", talk.content)
  .array("category", talk.categories.toArray(new String[0]))
  .array("speaker", talk.speakers.toArray(new String[0]));
IndexRequest request = new IndexRequest(INDEX, TYPE).id(talk.path).source(sourceBuilder);
client.index(request).actionGet();
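
For comparison, a sketch of the Jackson variant, assuming Jackson's ObjectMapper is on the classpath and the Talk fields are visible to it:

ObjectMapper mapper = new ObjectMapper();
String json = mapper.writeValueAsString(talk); // serializes the Talk object to a JSON string
IndexRequest request = new IndexRequest(INDEX, TYPE).id(talk.path).source(json);
client.index(request).actionGet();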

You can also use the BulkRequest to avoid sending a separate request for each document.
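
A minimal sketch of bulk indexing, assuming a collection called talks and a hypothetical buildSource() helper that wraps the XContentBuilder code from above:

BulkRequestBuilder bulkRequest = client.prepareBulk();
for (Talk talk : talks) {
    // reuse the source building shown above for each document
    bulkRequest.add(new IndexRequest(INDEX, TYPE).id(talk.path).source(buildSource(talk)));
}
BulkResponse bulkResponse = bulkRequest.execute().actionGet();
if (bulkResponse.hasFailures()) {
    // some documents could not be indexed, see bulkResponse.buildFailureMessage()
}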

With ElasticSearch you don't need to commit after indexing. By default it will refresh the index every second, which is fast enough for most use cases. If you want to be able to search the data as soon as possible you can also trigger a refresh explicitly via the client. This can be really useful when writing tests, where you don't want to wait a second between indexing and searching.
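
An explicit refresh via the indices admin API looks like this (a small sketch, blocking just like the calls above):

// make everything indexed so far immediately visible to searches
client.admin().indices().prepareRefresh(INDEX).execute().actionGet();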

This concludes the first part of this article on getting started with ElasticSearch using Java. The second part contains more information on searching the data we indexed.

Tuesday, February 19, 2013

Softwerkskammer Rhein-Main Open Space

On Saturday I attended an Open Space in Wiesbaden, organized by members of Softwerkskammer Rhein-Main, a very active chapter of the German software craftsmanship community. The event took place in the offices of Seibert Media above a shopping mall, with a nice view of the city.

The Format

Open Space conferences are special in that there is no predefined agenda. All attendees can bring ideas, propose them in the opening session and choose a time slot and room. Sessions are not necessarily normal presentations but rather discussions, so it's even OK to just propose a question you have or a topic you'd like to learn more about from the attendees. There are also some guidelines and rules: sessions don't need to start and end on time, you can always leave a session if you feel you can't contribute, and you shouldn't be disappointed if nobody shows up for your proposed session.

Personal Kanban

Dennis Traub presented a session on Personal Kanban. As I had already done Kanban-style development in one project I was eager to learn how to apply the principles to personal organization. Basically it all works the same as normal Kanban. Tasks are visualized on a board where a swimlane defines the state of a task, with work items flowing from left (todo) to right (done). You can define swimlanes to fit your habits, e.g. one for todos, one for in progress and one for blocked. The in-progress lane needs a Work in Progress limit, the maximum number of tasks you work on in parallel. An important aspect is that you don't have to put all your backlog items into the todo lane; you can also keep them in a separate place. This keeps you from getting overwhelmed when looking at the board.

It sounds like Kanban is a good way to organize your daily life. For me personally the biggest hindrance is that I work from my living room, and I'd rather not put a Kanban board there. If I had a separate office I guess I'd try it immediately.

Open Source

An attendee wanted to hear about experiences with Open Source communities. Two full-time committers, Ollie for Spring and Marcel for Eclipse, shared some of theirs. I am still surprised that a lot of Open Source projects have quite a few bugs in their trackers that could easily be fixed by newcomers. A lot of people like Open Source software, but not that many seem to be interested in contributing to a project continuously. Most of the interaction with users in the issue trackers consists of one-time reports: people report one bug and move on. Even for big projects like Spring and Eclipse it's hard to find committers. One way to motivate people is to organize hack days where users learn to work with the sources of the projects, but this also needs quite some preparation.

Freelancing

The topic of freelancing was discussed throughout the day. Markus Tacker presented his idea of the kybernetic agency, a plan to form a freelance network of people who can work on projects together. We discussed benefits and possible problems, mainly legal ones. A quite inspiring session that also made me think about the difference between freelancing in the Java enterprise world and PHP development. Most of the freelancers I know would prefer not to work five days a week exclusively for one client, but that is often a prerequisite for projects in the enterprise world.

Learning

Learning is a topic that is very important to me, so I proposed a session on it. I had already switched from 5 to 4 days during the last months of my employment at synyx because I felt the need to invest more time in learning, which is often not possible when working on client projects. Even now as a freelancer I keep one day for learning only. What works best for me is writing blog posts that contain some sample code. I can build something, and when writing the post I make sure that I have a deep understanding of the topic I am writing about. Other people also said that the most important aspect is to have something to work on; reading or watching screencasts alone is not a sustainable activity. I also liked the technique of another freelancer: whenever he notices that he could do something differently on the current project he stops tracking the time for the customer and tries to find ways to improve the project, probably learning a new approach. This is something you do implicitly as a freelancer anyway, as you often spend some of your spare time thinking about client work, but I like the explicit approach.

Summary

All in all this was a really fruitful, but also exhausting, day. Though I chose meta topics exclusively I gained a lot from attending. Thanks a lot to the organizers (mainly Benjamin), moderators, sponsors and all the attendees who made this event possible. I am looking forward to meeting a lot of these people again at Socrates this year.

Friday, February 1, 2013

Book Review: Gradle Effective Implementation Guide

PacktPub kindly offered me a free review edition of Gradle Effective Implementation Guide, written by Hubert Klein Ikkink (mrhaki). As I had planned to read it anyway I agreed to write a review.

Maven was huge for Java development. It brought dependency management, sane conventions and platform-independent builds to the mainstream. If there is a Maven pom file available for an open source project you can be quite sure you'll manage to build it on your local machine in no time.

But there are cases where it doesn't work that well. Its phase model is rather strict, and the one-artifact-per-build restriction can get in your way for more unusual build setups. You can work around some of these problems using profiles and assemblies, but it feels like Maven is primarily suited to a certain kind of project.

Gradle is different. It's more flexible, but there's also a learning curve involved. Groovy as its build DSL is easy to read but probably not that easy to write at first, because there are often multiple ways to do something. A standard Java developer like me might be unsure about the proper way of doing things.

There are a lot of helpful resources online, namely the forum and the excellent user guide, but as I prefer to read longer sections offline I am really glad that there is now a book available that contains extensive information and can get you started with Gradle.

Content

The book starts with a general introduction to Gradle. You'll get a high-level overview of its features, learn how to install it and write your first build file. You'll also learn some important options of the gradle executable that I hadn't been aware of.

Chapter 2 explains tasks and how to write build files. This is a very important chapter if you are not that deep into the Groovy language. You'll learn about the implicitly available Task and Project instances and the different ways of accessing methods and properties and of defining tasks and dependencies between them.

Working with files is an important part of any build system. Chapter 3 contains detailed information on accessing and modifying files, file collections and file trees. This is also where the benefit of using Groovy becomes really obvious: the ease of working with collections can lead to very concise build definitions, and you have all the power of Groovy and the JVM at your hands. The different log levels are useful to know and can come in handy when you'd like to diagnose a build.

While understanding tasks is an important foundation for working with Gradle, it's likely that you want to use it with programming languages. Nearly all of the remaining chapters cover different aspects of builds for JVM languages. Chapter 4 starts with a look at the Java plugin and its additional concepts. You'll see how to compile and package Java applications and how to work with sourceSets.

Nearly no application is an island. The Java world provides masses of useful libraries that can help you build your application. Proper dependency management, as introduced in Chapter 5, is important for easy build setups and for making sure that you do not introduce incompatible combinations of libraries. Gradle supports Maven, Ivy and local file based repositories. Configurations are used to group dependencies, e.g. to define dependencies that are only necessary for tests. If you need to influence the version you are retrieving for a certain dependency you can configure resolution strategies, version ranges and exclusions for transitive dependencies.

Automated testing is a crucial part of any modern software development process. Gradle can work with JUnit and TestNG out of the box. Test execution times can be improved a lot by the incremental build support and the parallelization of tests. I guess this can lead to dramatically shorter build times, something I plan to try on an example project with a lot of tests in the near future. This chapter also introduces the different ways to run an application, create distributions and how to publish artifacts.

The next chapter shows you how to structure your application in separate projects. Gradle has clever ways to find out which projects need to be rebuilt before and after building a certain project.

Chapter 8 contains information on how to work with Scala and Groovy code. The necessary compiler versions can be defined in the build so there is no need to have additional installations. I've heard good things about the Scala integration so Gradle seems to be a viable alternative to sbt.

The check task can be used to gather metrics on your project using many of the available open source projects for code quality measurement. Chapter 9 shows you how to include tools like Checkstyle, PMD and FindBugs to analyze your project sources, either standalone or by sending data to Sonar.

If you need additional functionality that is not available you can start implementing your own tasks and plugins. Chapter 10 introduces the important classes for writing custom plugins and how to use them from Groovy and Java.

Gradle can be used on several Continuous Integration systems. As I've been working with Hudson/Jenkins exclusively over the last few years it was interesting to also read about the commercial alternatives TeamCity and Bamboo in Chapter 11.

The final chapter contains a lot of in-depth information on the Eclipse and IDEA plugins. Honestly, it contains more information on the Eclipse file format than I wanted to know, but I guess that can be really useful for users. Unfortunately the excellent NetBeans plugin is not described in the book.

Summary

The book is an excellent introduction to working effectively with Gradle. It has helped me to get a far better understanding of the concepts. If you are thinking about Gradle or have already started working with it, I highly recommend getting a copy. There are a lot of detailed example files that you can use immediately. Many of them are very close to real-world use cases and can help you think about additional ways Gradle can be useful for organizing your builds.

Thursday, January 24, 2013

Make your Filters Match: Faceting in Solr

Facets are a great search feature that lets users easily navigate to the documents they are looking for. Solr makes them really easy to use, though when naively querying for facet values you might see some unexpected behaviour. Read on to learn the basics of what happens when you pass in filter queries for faceting. I'll also show how you can leverage local params to choose a different query parser when selecting facet values.

Introduction

Facets are a way to display categories next to a user's search results, often with a count of how many results are in each category. The user can then select one of those facet values to retrieve only the results that are assigned to this category. This way he doesn't have to know which category he is looking for when entering the search term, as all the available categories are delivered with the search results. This approach is really popular on sites like Amazon and eBay and is a great way to guide the user.

Solr brought faceting to the Lucene world, and arguably the feature was an important driving factor for its success (Lucene 3.4 introduced faceting as well). Facets can be built from terms in the index, custom queries and ranges, though in this post we will only look at field facets.

As a very simple example consider this schema definition:

<fields>
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="text" type="text_general" indexed="true" stored="true"/>
<field name="author" type="string" indexed="true" stored="false"/>
</fields>

There are three fields: the id, a text field that we'd probably like to search on, and an author. The author is defined as a string field, which means no analyzing at all. The faceting mechanism uses the term value and not a stored value, so we want to make sure that the original value is preserved. I explicitly don't store the author information to make it clear that we are working with the indexed value.

Let's index some book data with curl (see this GitHub repo for the complete example including some unit tests that execute the same functionality using Java).

curl http://localhost:8082/solr/update -H "Content-Type: text/xml" --data-binary \
'<add><doc>
<field name="id">1</field>
<field name="text">On the Shortness of Life</field>
<field name="author">Seneca</field>
</doc>
<doc>
<field name="id">2</field>
<field name="text">What I Talk About When I Talk About Running</field>
<field name="author">Haruki Murakami</field>
</doc>
<doc>
<field name="id">3</field>
<field name="text">The Dude and the Zen Master</field>
<field name="author">Jeff "The Dude" Bridges</field>
</doc>
</add>'
curl http://localhost:8082/solr/update -H "Content-Type: text/xml" --data-binary '<commit />'

And verify that the documents are available:

curl http://localhost:8082/solr/query?q=*:*
{
"responseHeader":{
"status":0,
"QTime":3,
"params":{
"q":"*:*"}},
"response":{"numFound":3,"start":0,"docs":[
{
"id":"1",
"text":"On the Shortness of Life"},
{
"id":"2",
"text":"What I Talk About When I Talk About Running"},
{
"id":"3",
"text":"The Dude and the Zen Master"}]
}}

I'll omit parts of the response in the following examples. We can also have a look at the shiny new administration view of Solr 4 to see all terms that are indexed for the field author.

Each of the author names is indexed as one term.

Faceting

Let's move on to the faceting part. To let the user drill down on search results there are two steps involved. First you tell Solr that you would like to retrieve facets with the results. Facets are contained in an extra section of the response and consist of the indexed term as well as a count. As with most Solr parameters you can either send the necessary options with the query or preconfigure them in solrconfig.xml. This query has faceting on the author field enabled:

curl "http://localhost:8082/solr/query?q=*:*&facet=on&facet.field=author"
{
  "responseHeader":{...},
  "response":{"numFound":3,"start":0,"docs":[
      {
        "id":"1",
        "text":"On the Shortness of Life"},
      {
        "id":"2",
        "text":"What I Talk About When I Talk About Running"},
      {
        "id":"3",
        "text":"The Dude and the Zen Master"}]
  },
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{
      "author":[
        "Haruki Murakami",1,
        "Jeff \"The Dude\" Bridges",1,
        "Seneca",1]},
    "facet_dates":{},
    "facet_ranges":{}}}

And this is what a configuration in solrconfig looks like:

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="q">*:*</str>  
    <str name="echoParams">none</str>
    <int name="rows">10</int>
    <str name="df">text</str>
    <str name="facet">on</str>
    <str name="facet.field">author</str>
    <str name="facet.mincount">1</str>
  </lst>
</requestHandler>

This way we don't have to pass the parameters with the query anymore and can see which parts of the query change.

Common Filtering

When a user chooses a facet you issue the same query again, this time adding a filter query that restricts the search results to those that have this particular value set for the field. In our case the user would then only see books by one particular author. Let's start simple and pretend that a user can't handle the massive amount of 3 search results and is only interested in books by Seneca:

curl 'http://localhost:8082/solr/select?fq=author:Seneca'
{
  "responseHeader":{...},
  "response":{"numFound":1,"start":0,"docs":[
      {
        "id":"1",
        "text":"On the Shortness of Life"}]
  },
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{
      "author":[
        "Seneca",1]},
    "facet_dates":{},
    "facet_ranges":{}}}

Works fine. We added a filter query that restricts the results to only those written by Seneca. Note that there is only one facet left because the search results don't contain any books by other authors. Let's see what happens when we try to filter the results to show only books by Haruki Murakami. We need to URL encode the blank; the rest of the query stays the same:

curl 'http://localhost:8082/solr/select?fq=author:Haruki%20Murakami'
{
  "responseHeader":{...},
  "response":{"numFound":0,"start":0,"docs":[]
  },
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{
      "author":[]},
    "facet_dates":{},
    "facet_ranges":{}}}

No results. Why is that? The default query parser for filter queries is the Lucene query parser. It tokenizes the query on whitespace, so even though the field is indexed unanalyzed, this is probably not the query we expect. The query that results from the parsing process is not a term query as in our first example. It's a boolean query that consists of two term queries: author:Haruki text:murakami. If you are familiar with the Lucene query syntax this won't be a surprise to you. If you prefix a term with a field name and a colon it will search on this field, otherwise it will search on the default field we declared in solrconfig.xml.

How can we fix it? Simple, just turn it into a phrase by surrounding the words with double quotes:

curl 'http://localhost:8082/solr/select?fq=author:"Haruki%20Murakami"'
{
  "responseHeader":{...},
  "response":{"numFound":1,"start":0,"docs":[
      {
        "id":"2",
        "text":"What I Talk About When I Talk About Running"}]
  },
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{
      "author":[
        "Haruki Murakami",1]},
    "facet_dates":{},
    "facet_ranges":{}}}

Or, if you prefer, you can also escape the blank using the backslash, which yields the same result:

curl 'http://localhost:8082/solr/select?fq=author:Haruki\%20Murakami'

Fun fact: I am not that good at picking examples. If we are filtering on our last author we will be surprised (at least I scratched my head for a while):

curl 'http://localhost:8082/solr/select?fq=author:Jeff%20"The%20Dude"%20Bridges'
{
  "responseHeader":{...},
  "response":{"numFound":1,"start":0,"docs":[
      {
        "id":"3",
        "text":"The Dude and the Zen Master"}]
  },
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{
      "author":[
        "Jeff \"The Dude\" Bridges",1]},
    "facet_dates":{},
    "facet_ranges":{}}}

This actually seemed to work even though we neither turned the value into a phrase nor escaped the blanks. If we look at how the Lucene query parser handles this query we see immediately why it returns a result. As in the last example, this is turned into a boolean query, and only the first clause is executed against the author field. The other two tokens search the default field, and in this case "The Dude" matches the text field: author:Jeff text:"the dude" text:bridges. If you just want to match on the author field you can escape the blanks as we did in the example before:

curl 'http://localhost:8082/solr/select?fq=author:Jeff\%20\"The\%20Dude\"\%20Bridges'

I'll spare you the response.

Using Local Params to set the Query Parser

At ApacheCon Europe in November Erik Hatcher gave a really interesting presentation on query parsers in Solr where he introduced another, probably cleaner way to do this: you can use the local params syntax to choose a different query parser. As we have learnt, the query parser defaults to the Lucene query parser. You can change the query parser for the main query by setting the defType parameter, either via request parameters or in solrconfig.xml, but I am not aware of any way to set it for the filter queries. As we have unanalyzed terms, the correct thing to do is to use a TermQuery, which can be built using the TermQParserPlugin. To use this parser we can explicitly set it in the filter query:

curl 'http://localhost:8082/solr/select?fq={!term%20f=author%20v='Jeff%20"The%20Dude"%20Bridges'}'

Or, for better readability, without the URL encoding:

curl 'http://localhost:8082/solr/select?fq={!term f=author v='Jeff "The Dude" Bridges'}'

The local params are enclosed in curly braces. The value term is shorthand for type='term', f is the field the TermQuery should be built for and v the value. Though this might look quirky at first, it is a really powerful feature, especially since you can reference other request parameters from the local params. Consider this configuration of a request handler:

<requestHandler name="/selectfiltered" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="q">*:*</str>  
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="wt">json</str>
    <str name="indent">true</str>
    <str name="df">text</str>
    <str name="facet">on</str>
    <str name="facet.field">author</str>
    <str name="facet.mincount">1</str>
  </lst>
  <lst name="appends">
    <str name="fq">{!term f=author v=$author}</str>
  </lst>
</requestHandler>

The default configuration is the same as we were using above. Only the appends section is new; it adds additional parameters to the request. The local params are the same as we were using via curl, but the literal filter value is replaced by the variable $author. The value can now be passed in cleanly via an aptly named parameter:

curl 'http://localhost:8082/solr/selectfiltered?author=Jeff%20"The%20Dude"%20Bridges'
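
For the Java side, here is a minimal SolrJ sketch of the same request; it assumes the SolrJ 4.x API (HttpSolrServer) and the /selectfiltered handler configured above:

SolrServer server = new HttpSolrServer("http://localhost:8082/solr");
SolrQuery query = new SolrQuery();
query.setRequestHandler("/selectfiltered");       // routes the request to our handler
query.set("author", "Jeff \"The Dude\" Bridges"); // referenced as $author in the appended fq
QueryResponse response = server.query(query);
FacetField authorFacet = response.getFacetField("author");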

There are a lot of powerful features in Solr that are not that commonly used. To see this example in Java, have a look at the GitHub repository for this blog post.

Thursday, January 10, 2013

JUnit Rule for ElasticSearch

While I am using Solr a lot in my current engagement, I recently started a pet project with ElasticSearch to learn more about it. Some of its functionality is rather different from Solr, so there is quite some experimentation involved. I like to start small and write tests when I want to find out how things work (see this post on how to write tests for Solr).

ElasticSearch internally uses TestNG, and the test classes are not available in the distributed jar files. Fortunately it is really easy to start an ElasticSearch instance from within a test, so it's no problem to do something similar in JUnit. Felix Müller posted some useful code snippets on how to do this, obviously targeted at a Maven build. The ElasticSearch instance is started in a setUp method and stopped in a tearDown method:

private EmbeddedElasticsearchServer embeddedElasticsearchServer;

@Before
public void startEmbeddedElasticsearchServer() {
    embeddedElasticsearchServer = new EmbeddedElasticsearchServer();
}

@After
public void shutdownEmbeddedElasticsearchServer() {
    embeddedElasticsearchServer.shutdown();
}

As it is rather cumbersome to add these methods to all tests I transformed the code into a JUnit rule. Rules can execute code before and after a test is run and influence its execution. There are some base classes available that make it really easy to get started with custom rules.

Our ElasticSearch example can be easily modeled using the base class ExternalResource (see the full example code on GitHub):

public class ElasticsearchTestNode extends ExternalResource {

    private Node node;
    private Path dataDirectory;
    
    @Override
    protected void before() throws Throwable {
        try {
            dataDirectory = Files.createTempDirectory("es-test", new FileAttribute []{});
        } catch (IOException ex) {
            throw new IllegalStateException(ex);
        }

        ImmutableSettings.Builder elasticsearchSettings = ImmutableSettings.settingsBuilder()
                .put("http.enabled", "false")
                .put("path.data", dataDirectory.toString());

        node = NodeBuilder.nodeBuilder()
                .local(true)
                .settings(elasticsearchSettings.build())
                .node();
    }

    @Override
    protected void after() {
        node.close();
        try {
            FileUtils.deleteDirectory(dataDirectory.toFile());
        } catch (IOException ex) {
            throw new IllegalStateException(ex);
        }
    }
    
    public Client getClient() {
        return node.client();
    }
}

The before method is executed before the test is run so we can use it to start ElasticSearch. All data is written to a temporary folder. The after method is used to stop ElasticSearch and delete the folder.

In your test you can now just use the rule, either with the @Rule annotation to have it triggered on each test method, or using @ClassRule to execute it only once per class:

public class CoreTest {

    @Rule
    public ElasticsearchTestNode testNode = new ElasticsearchTestNode();
    
    @Test
    public void indexAndGet() throws IOException {
        testNode.getClient().prepareIndex("myindex", "document", "1")
                .setSource(jsonBuilder().startObject().field("test", "123").endObject())
                .execute()
                .actionGet();
        
        GetResponse response = testNode.getClient().prepareGet("myindex", "document", "1").execute().actionGet();
        assertThat(response.getSource().get("test")).isEqualTo("123");
    }
}
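
If you prefer starting the node only once per class, a sketch of the @ClassRule variant (note that the rule field has to be static):

public class CoreClassRuleTest {

    // started once before the first test method and closed after the last one
    @ClassRule
    public static ElasticsearchTestNode testNode = new ElasticsearchTestNode();

    // all test methods share the same node and its index data
}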

As it is really easy to implement custom rules I think this is a feature I will be using more often in the future.

Thursday, January 3, 2013

12 Conferences of 2012

I went to a lot of conferences in 2012, probably too many. As the year is over now I'd like to summarize some of my impressions; maybe there's a conference you didn't know about and would like to attend this year.

FOSDEM

FOSDEM is the Free and Open Source Software Developers' European Meeting, a yearly event that takes place in Brussels, Belgium. There are multiple tracks and developer rooms on a multitude of topics, ranging from databases to programming languages and open source tools. The rooms are spread across several buildings at the university, so there might be some walking involved when switching tracks. What is rather special is that there's no registration involved: you just go there and that's it. The amount of people can be overwhelming, especially in the main entrance area. Unfortunately I was rather disappointed with the talks I chose. The event is a very good fit if you are working on an Open Source project and, as the name of the conference suggests, want to meet other developers of the project.

Berlin Expert Days

BED-Con is a rather young conference organized by the Java User Group Berlin-Brandenburg. I wasn't there in the first year, but in 2012 it still had a small and informal feel. The conference takes place in three rooms at the Freie Universität Berlin; the content selection was an excellent mixture of technical and process/meta talks, most of them in German. If you can afford the trip to Berlin I'd definitely recommend going.

JAX

The largest and best known German Java conference. There are two editions, JAX in April in Mainz (the one I attended) and W-JAX in November in Munich. There's one huge hall and several smaller rooms and a wide variety of topics you can choose from. I never planned to go there as the admission fee is rather high (thanks to Software & Support for sponsoring my ticket) but I have to admit that it really can be worth the money. There were excellent talks by Charles Nutter, Tim Berglund and many more. The infrastructure (food, coffee, schedules) is very good, if you are on a business budget you can gain a lot by visiting.

Berlin Buzzwords

A niche conference on search, scale and big data. A lot of people come from overseas just to attend. If you are interested in these technologies, definitely go there. For more information see this post.

Barcamp Karlsruhe

My first real Barcamp. A really fun event, though of course there are always some sessions that are not as interesting as anticipated. Topics ranged from computers and work to softer content. I always thought this was a nerds-only event, but as there is so much to choose from, Barcamps might even be interesting for people who are not that much into computers. Very well organized, no admission fee, interesting sessions.

Socrates

The International Software Craftsmanship and Testing Conference. An awesome setting and the first Open Space conference I attended. It takes place in a seminar center in the middle of nowhere, which makes it a very intense experience. Besides the sessions there are a lot of informal discussions going on throughout the day with the very enthusiastic attendees. The 2012 event started Thursday evening with a world cafe, continued with Open Space sessions on Friday and Saturday, and offered an optional code retreat on Sunday. I'd say there were three kinds of sessions: informal discussion rounds, practical hands-on sessions and talks. It seems that most people liked the practical sessions best, so if I could choose again I'd go to more of those. Thanks to the sponsors, all we had to pay was the accommodation and one meal, which additionally makes it an incredibly cheap event. Be quick with registration as space is limited.

FrOSCon

The Free and Open Source Software Conference is a great community weekend event with different tracks on admin and development topics. I like it a lot because of the variety of talks and the very informal setting. It's a mixture of holiday and learning and, for me, a chance to get information on topics that are not presented at the other developer conferences I attend. Talks are partly English, partly German. You can stay either in St. Augustin or in Bonn, it's only a short tram ride.

JAX On Tour

JAX On Tour is another event I attended because of the generous sponsorship of Software & Support. It's not a conference but a training event with longer talks that are grouped together. It's a small event and there's always time for questions. I learned a lot, mainly about documenting architectures. This is a really good alternative to a normal conference to grasp a topic in depth.

OpenCms-Days

If you are into OpenCms you probably already heard about OpenCms-Days, and if not this is probably not for you. Two days of OpenCms only, used by Alkacon to present new features and by the community to present extensions and projects. I am always impressed that there are people who fly around the world just to attend, but of course this is the only conference of its kind worldwide. There is always something new to learn and it's fun to meet the community.

DevFest Karlsruhe

A one-day event on Google technologies. There have been several such events in different cities worldwide, this one organized by the Google Developer Group Karlsruhe. The organizers were really unlucky as multiple speakers canceled on short notice; nevertheless there were some really good talks. Kudos to the organizers, who managed to put this event together in a really short time frame.

ApacheCon Europe

I originally went to ApacheCon because it took place in Sinsheim, which is close to Karlsruhe. Fortunately this year the conference also hosted the LuceneCon Europe, so there were lots of interesting talks for me. Additionally, I had been voted in as a committer on the ODFToolkit just before the conference, and I was able to meet some of the other people involved in the project. The location was really special (a soccer stadium), but I think a lot of non-locals suffered a bit because of the lack of hotels and taxis. This community event can be really interesting even if you are only a user of a project.

Devoxx

Devoxx is the largest Java conference in Europe, organized by the Java User Group Belgium. They attract a lot of high-class speakers, so this is the place to stay informed about the Java universe. It is located in a large multiplex cinema in a suburb of Antwerp, with very comfy chairs and huge screens. The week starts with two university days that contain longer talks and are usually less crowded. This year I went there on Wednesday and, due to a train strike in Belgium, arrived quite late. I have to admit that it probably wasn't worth the hassle for only 1.5 days. If you are going, I recommend staying the whole week.

2013

So this was a lot last year. I don't plan to visit that many conferences again. I am sure that I will be going to Berlin Buzzwords, Socrates and FrOSCon. There will probably be more, but it won't be 13 this year :).

Finally, I couldn't have afforded to pay for all those conferences myself, so thanks to synyx, who paid for FOSDEM and BEDCon while I was still employed there and provided me with a free ticket to OpenCms-Days. Thanks to Software & Support for letting me attend JAX and JAX On Tour for free; those guys are fantastic supporters of the Java User Group Karlsruhe. Also, thanks to the Devoxx team for letting me attend as an ambassador of our JUG.

Thursday, December 20, 2012

Gradle is too Clever for my Plans

While writing this post about the Lucene Codec API I noticed something strange when running the tests with Gradle. When experimenting with a library feature I usually write unit tests that validate my expectations. This is a habit I learned from Lucene in Action, and it can also be useful in real-world scenarios, e.g. to make sure that nothing breaks when you update a library.

OK, what happened? This time I didn't only want the test result but also ran the test for a side effect: I wanted a Lucene index to be written to the /tmp directory so I could manually have a look at it. This worked fine the first time, but not afterwards, e.g. after my machine was rebooted and the directory cleared.

It turns out that the Gradle developers know that a test shouldn't be used to execute stuff. So once the test has run successfully it is just not run again until its input changes! Though this bit me this time, it is a really nice feature to speed up your builds. And if you really need to execute the tests, you can always run gradle cleanTest test.