Freitag, 27. September 2013

Feature Toggles in JSP with Togglz

Feature Toggles are a useful pattern when you are working on several features but want to keep your application in a deployable state. One of the implementations of the pattern available for Java is Togglz. It provides ways to check if a feature is enabled programmatically, from JSF or JSP pages or even when wiring Spring beans. I couldn't find a single example on how to use the JSP support so I created an example project and pushed it to GitHub. In this post I will show you the basics of Togglz and how to use it in Java Server Pages.

Togglz

Features that you want to make configurable are described with a Java Enum. This is an example with two features that can be enabled or disabled:

import org.togglz.core.Feature;
import org.togglz.core.context.FeatureContext;

public enum ToggledFeature implements Feature {

    TEXT,
    MORE_TEXT;

    public boolean isActive() {
        return FeatureContext.getFeatureManager().isActive(this);
    }
}

This Enum can then be used to check if a feature is enabled in any part of your code:

if (ToggledFeature.TEXT.isActive()) {
    // do something clever
}

The config class is used to wire the feature enum with a configuration mechanism:

import java.io.File;

import org.togglz.core.Feature;
import org.togglz.core.manager.TogglzConfig;
import org.togglz.core.repository.StateRepository;
import org.togglz.core.repository.file.FileBasedStateRepository;
import org.togglz.core.user.UserProvider;
import org.togglz.servlet.user.ServletUserProvider;

public class ToggledFeatureConfiguration implements TogglzConfig {

    public Class<? extends Feature> getFeatureClass() {
        return ToggledFeature.class;
    }

    public StateRepository getStateRepository() {
        return new FileBasedStateRepository(new File("/tmp/features.properties"));
    }

    public UserProvider getUserProvider() {
        return new ServletUserProvider("ADMIN_ROLE");
    }
}

The StateRepository is responsible for storing and retrieving the state of your features, i.e. for enabling and disabling them. We are using a file-based one here but there are other implementations available.
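
If you just want to experiment or write tests without touching the file system, Togglz also ships an in-memory repository. This is a minimal sketch of an alternative configuration using InMemoryStateRepository (the name InMemoryFeatureConfiguration is just made up for this example):

import org.togglz.core.Feature;
import org.togglz.core.manager.TogglzConfig;
import org.togglz.core.repository.StateRepository;
import org.togglz.core.repository.mem.InMemoryStateRepository;
import org.togglz.core.user.UserProvider;
import org.togglz.servlet.user.ServletUserProvider;

public class InMemoryFeatureConfiguration implements TogglzConfig {

    public Class<? extends Feature> getFeatureClass() {
        return ToggledFeature.class;
    }

    public StateRepository getStateRepository() {
        // keeps the feature state in memory only, so it is lost on restart
        return new InMemoryStateRepository();
    }

    public UserProvider getUserProvider() {
        return new ServletUserProvider("ADMIN_ROLE");
    }
}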

To configure Togglz for your webapp you can either use CDI or Spring, or you can do the configuration manually in the web.xml:

<web-app xmlns="http://java.sun.com/xml/ns/javaee"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://java.sun.com/xml/ns/javaee http://java.sun.com/xml/ns/javaee/web-app_3_0.xsd"
    version="3.0">

    <context-param>
        <param-name>org.togglz.core.manager.TogglzConfig</param-name>
        <param-value>de.fhopf.togglz.ToggledFeatureConfiguration</param-value>
    </context-param>

    <filter>
        <filter-name>TogglzFilter</filter-name>
        <filter-class>org.togglz.servlet.TogglzFilter</filter-class>
    </filter>
    <filter-mapping>
        <filter-name>TogglzFilter</filter-name>
        <url-pattern>/*</url-pattern>
    </filter-mapping>

</web-app>

In my example I had to add the filter manually, though with Servlet 3.0 this shouldn't be necessary. I am not sure whether this is caused by the way Gradle runs Jetty or whether it is always the case when doing the configuration via a context-param.

Togglz with Java Server Pages

For the integration of Togglz in JSPs you need to add the dependency togglz-jsp to your project. It contains a tag that can be used to wrap content that is only rendered when the corresponding feature is enabled. A simple example for our ToggledFeature:

<%@ taglib uri="http://togglz.org/taglib" prefix="togglz" %>

This is some text that is always shown.

<togglz:feature name="TEXT">
This is the text of the TEXT feature.
</togglz:feature>

<togglz:feature name="MORE_TEXT">
This is the text of the MORE_TEXT feature.
</togglz:feature>

Both features will be disabled by default so you will only see the first sentence. You can control which features are enabled (even at runtime) in /tmp/features.properties. This is what it looks like when the TEXT feature is enabled:

TEXT=true
MORE_TEXT=false
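
Alternatively you can also toggle features from code, e.g. from an admin action or a test. This is a minimal sketch that assumes the configuration shown above is in place; the new state is written through the configured StateRepository:

import org.togglz.core.context.FeatureContext;
import org.togglz.core.repository.FeatureState;

public class FeatureSwitcher {

    public static void enableMoreText() {
        // persists the new state using the configured StateRepository
        FeatureContext.getFeatureManager()
                .setFeatureState(new FeatureState(ToggledFeature.MORE_TEXT, true));
    }
}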

A Word of Caution

I am just starting to use feature toggles in an application so I wouldn't call myself experienced. But I have the impression that you need to be really disciplined when using them. Old feature toggles that are no longer used should be removed as soon as possible. Unfortunately the huge benefit of compile-time safety in Java when removing a feature from the enum is gone with JSPs; the names of the features are only Strings so you will have to do some file searches when removing a feature.

Freitag, 20. September 2013

Kibana and Elasticsearch: See What Tweets Can Say About a Conference

In my last post I showed how you can index tweets for an event in Elasticsearch and how to do some simple queries on it using its HTTP API. This week I will show how you can use Kibana 3 to visualize the data and make it explorable without having to learn the Elasticsearch API.

Installing Kibana

Kibana 3 is a pure HTML/JS frontend for Elasticsearch that you can use to build dashboards for your data. We are still working with the example data that is indexed using the Twitter River. It consists of tweets for FrOSCon but could be anything, especially data that contains some kind of timestamp, as is the case for tweets. To install Kibana you can just fetch it from the GitHub repository (Note: there are now also prepackaged archives available that you can download without cloning the repository):

git clone https://github.com/elasticsearch/kibana.git

You will now have a folder kibana that contains the HTML files as well as all the assets needed. The files need to be served by a webserver, so you can just copy the folder to a directory that e.g. Apache is serving. If you don't have a webserver installed you can simply serve the current directory using Python:

python -m SimpleHTTPServer 8080

This will make Kibana available at http://localhost:8080/kibana/src. With the default configuration Elasticsearch needs to be running on the same machine as well.

Dashboards

A dashboard in Kibana consists of rows that can contain different panels. Each panel can either display data, control which data is being displayed or both. Panels do not stand on their own; the results that are getting displayed are the same for the whole dashboard. So if you choose something in one panel you will notice that the other panels on the page will also get updated with new values.

When accessing Kibana you are directed to a welcome page from where you can choose between several dashboard templates. As Kibana is often used for logfile analytics there is an existing dashboard that is preconfigured to work with Logstash data. Another generic dashboard can be used to query some data from the index but we'll use the option "Unconfigured Dashboard" which gives some hints on which panels you might want to have.

This will present you with a dashboard that contains some rows and panels already.

Starting from the top it contains these rows:

  • The "Options" row that contains one text panel
  • The "Query" row that contains a text query panel
  • A hidden "Filter" row that contains a text panel and the filter panel. The row can be toggled visible by clicking on the text Filter on the left.
  • The "Graph" row that contains two text panels
  • The large "Table" row with one text panel.

Those panels are already laid out in a way that they can display the widgets that are described in the text. We will now add those to get some data from the event tweets.

Building a Dashboard

The text panels are only there to guide you when adding the widgets you need and can then be removed. To add or remove panels for a row you can click the little gear next to the title of the row. This will open an options menu. For the top row we are choosing a timepicker panel with a default mode of absolute. This gives you the opportunity to choose a begin and end date for your data. The field that contains the timestamp is called "created_at". After saving you can also remove the text panel on the second tab.

If you now open the "Filter" row you will see that a filter is displayed. It is best to keep this row open to see which filters are currently applied. You can remove the text panel in the row.

In the graph section we will add two graph panels instead of the text panels: a pie chart that displays the terms of the tweet texts and a date histogram that shows how many tweets there are for a certain time. For the pie chart we use the field "text" and exclude some common terms again. Note that if you add terms to the excluded terms after the panel has been created you need to initiate another query, e.g. by clicking the button in the timepicker. For the date histogram we are again choosing the timestamp field "created_at".

Finally, in the last row we are adding a table to display the resulting tweet documents. Besides adding the columns "text", "user.screen_name" and "created_at" we can leave the settings as proposed.

We now have a dashboard to play with the data and see the results immediately. Data can be explored by using any of the displays, you can click in the pie chart to choose a certain term or choose a time range in the date histogram. This makes it really easy to work with the data.

Answering questions

Now we have a visual representation of all the terms and the time of day people are tweeting most. As you can see, people are tweeting slightly more during the beginning of the day.

You can now check for any relevant terms that you might be interested in. For example, let's see when people tweet about beer. As we have tweets in multiple languages (German, English and people from Cologne) we need to add some variation. We can enter the query

text:bier* OR text:beer* OR text:kölsch
in the query box.

There are only a few tweets about it but it will be a total surprise to you that most of the tweets about beer tend to be sent later during the day (I won't go into detail why there are so many tweets mentioning the terms horse and piss when talking about Kölsch).

Some more surprising facts: There is not a single tweet mentioning Java but a lot of tweets that mention php, especially during the first day. This day seems to be far more successful for the PHP dev room.

Summary

I hope that I could give you some hints on how powerful Kibana can be when it comes to analytics of data, not only with log data. If you'd like to read another detailed step by step guide on using Kibana to visualize Twitter data have a look at this article by Laurent Broudoux.

Mittwoch, 11. September 2013

Simple Event Analytics with ElasticSearch and the Twitter River

Tweets can say a lot about an event. The hashtags that are used and the times when people tweet can be interesting to see. Some of the questions you might want answers to:

  • Who tweeted the most?
  • What are the dominant keywords/hashtags?
  • When is the time people are tweeting the most?
  • And, most importantly: Is there a correlation between the time and the amount of tweets mentioning coffee or beer?

During this year's FrOSCon I indexed all relevant tweets in ElasticSearch using the Twitter River. In this post I'll show you how you can index tweets in ElasticSearch to have a dataset you can do analytics with. We will see how we can get answers to the first two questions using the ElasticSearch Query DSL. Next week I will show how Kibana can help you to get a visual representation of the data.

Indexing Tweets in ElasticSearch

To run ElasticSearch you need to have a recent version of Java installed. Then you can just download the archive and unpack it. It contains a bin directory with the necessary scripts to start ElasticSearch:

bin/elasticsearch -f

-f makes sure that ElasticSearch starts in the foreground so you can also stop it using Ctrl-C. You can check that your installation is working by opening http://localhost:9200 in your browser.

After stopping it again we need to install the ElasticSearch Twitter River that uses the Twitter streaming API to get all the tweets we are interested in.

bin/plugin -install elasticsearch/elasticsearch-river-twitter/1.4.0

Twitter doesn't allow anonymous access to its API anymore so you need to register for the OAuth access at https://dev.twitter.com/apps. Choose a name for your application and generate the key and token. Those will be needed to configure the plugin via the REST API. In the configuration you need to pass your OAuth information as well as any keyword you would like to track and the index that should be used to store the data.

curl -XPUT localhost:9200/_river/frosconriver/_meta -d '
{
    "type" : "twitter",
    "twitter" : {
        "oauth" : {
            "consumer_key" : "YOUR_KEY",
            "consumer_secret" : "YOUR_SECRET",
            "access_token" : "YOUR_TOKEN",
            "access_token_secret" : "YOUR_TOKEN_SECRET"
        },
        "filter" : {
            "tracks" : "froscon"
        }
    },
    "index" : {
        "index" : "froscon",
        "type" : "tweet",
        "bulk_size" : 1
    }
}
'

The index doesn't need to exist yet, it will be created automatically. I am using a bulk size of 1 as there aren't really many tweets. If you are indexing a lot of data you might consider setting this to a higher value.

After issuing the call you should see some information in the logs that the river is starting and receiving data. You can see how many tweets there are in your index by issuing a count query:

curl 'localhost:9200/froscon/_count?pretty=true'

You can see the basic structure of the documents created by looking at the mapping that is created automatically.

http://localhost:9200/froscon/_mapping?pretty=true

The result is quite long so I am not replicating it here but it contains all the relevant information you might be interested in like the user who tweeted, the location of the user, the text, the mentions and any links in it.

Doing Analytics Using the ElasticSearch REST API

Once you have enough tweets indexed you can already do some analytics using the ElasticSearch REST API and the Query DSL. This requires you to have some understanding of the query syntax but you should be able to get started by skimming through the documentation.

Top Tweeters

First, we'd like to see who tweeted the most. This can be done by querying for all documents and faceting on the user name. This will give us the names and counts in a facet section of the response.

curl -X POST "http://localhost:9200/froscon/_search?pretty=true" -d '
  {
    "size": 0,
    "query" : {
      "match_all" : {}
    },
    "facets" : {
      "user" : { 
        "terms" : {
          "field" : "user.screen_name"
        } 
      }                            
    }
  }
'
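
If you prefer to stay in Java, roughly the same request can be built with the ElasticSearch Java client. This is a minimal sketch that assumes ElasticSearch is reachable on the default transport port 9300 and uses the froscon index from above:

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.facet.FacetBuilders;
import org.elasticsearch.search.facet.terms.TermsFacet;

public class TopTweeters {

    public static void main(String[] args) {
        TransportClient client = new TransportClient()
                .addTransportAddress(new InetSocketTransportAddress("localhost", 9300));
        try {
            // match all documents and facet on the screen name, just like the curl example
            SearchResponse response = client.prepareSearch("froscon")
                    .setSize(0)
                    .setQuery(QueryBuilders.matchAllQuery())
                    .addFacet(FacetBuilders.termsFacet("user").field("user.screen_name"))
                    .execute().actionGet();
            TermsFacet users = response.getFacets().facet(TermsFacet.class, "user");
            for (TermsFacet.Entry entry : users.getEntries()) {
                System.out.println(entry.getTerm() + ": " + entry.getCount());
            }
        } finally {
            client.close();
        }
    }
}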

Those are the top tweeters for FrOSCon:

Dominant Keywords

The dominant keywords can also be retrieved using a facet query, this time on the text of the tweet. As there are a lot of German tweets for FrOSCon and the text field is processed using the StandardAnalyzer, which only removes English stopwords, it might be necessary to exclude some terms. Also, you might want to remove some other common terms that indicate retweets or are part of URLs.

curl -X POST "http://localhost:9200/froscon/_search?pretty=true" -d '
  {
    "size": 0,
    "query" : {
      "match_all" : {}
    },
    "facets" : {
      "keywords" : { 
        "terms" : {
          "field" : "text", 
          "exclude" : ["froscon", "rt", "t.co", "http", "der", "auf", "ich", "my", "die", "und", "wir", "von"] 
        }
      }                            
    }
  }
'

Those are the dominant keywords for FrOSCon:

  • talk (no surprise for a conference)
  • slashme
  • teamix (a company that does very good marketing. Unfortunately in this case this is more because their fluffy tux got stolen. The tweet about it is the most retweeted tweet of the data.)

Summary

Using the Twitter River it is really easy to get some data into ElasticSearch. The Query DSL makes it easy to extract some useful information. Next week we will have a look at Kibana that doesn't necessarily require a deep understanding of the ElasticSearch queries and can visualize our data.

Mittwoch, 4. September 2013

Developing with CoreMedia

A while ago I had the chance to attend a training on web development with CoreMedia. It's a quite enterprisey commercial Content Management System that powers large corporate websites like telekom.com as well as news sites like Bild.de (well, you can't hold CoreMedia responsible for the kind of "content" people put into their system). As I have been working with different Java based Content Management Systems over the years I was really looking forward to learning about a system I had heard really good things about. In this post I'll describe the basic structure of the system as well as what it feels like to develop with it.

System Architecture

As CoreMedia is built to scale to really large sites the architecture is built around redundant and distributed components. The part of the system the editors are working on is separated from the parts that serve the content to the internet audience. A publication process copies the content from the editorial system to the live system.

The heart of CoreMedia is the Content Server. It stores all the content in a database and makes it retrievable. You rarely access it directly but only via other applications that talk to it in the background via CORBA. Editors traditionally work with CoreMedia using a Java client (formerly called the Editor, now known as the Site Manager); starting with CoreMedia 7 there is also the web-based Studio that is used to create and edit content. A preview application can be used to see how the site looks before being published. Workflows, which are managed by the Workflow Server, can be used to control the processes around editing as well as publication.

The live system consists of several components that are mostly laid out in a redundant way. There is one Master Live Server as well as 0 to n Replication Live Servers that are used for load distribution as well as fault tolerance. These servers are accessed from the Content Application Engine (CAE), which contains all the delivery and additional logic for your website. One or more Solr instances provide the search services for your application.

Document Model

The document model for your application describes the content types that are available in the system. CoreMedia provides a blueprint application that contains a generic document model that can be used as a basis for your application, but you are also free to build something completely different. The document model is used throughout the whole system as it describes the way your content is stored. The model is object oriented in nature with documents that consist of attributes. There are six attribute types, e.g. String (fixed-length Strings), XML (variable-length Strings) and Blob (binary data), that form the basis of all your types. An XML configuration file is used to describe your specific document model. This is an example of an article that contains a title, the text and a list of related articles:

<DocType Name="Article">
  <StringProperty Name="title"/>
  <XmlProperty Grammar="coremedia-richtext-1.0" Name="text"/>
  <LinkListProperty LinkType="Article" Name="related"/>
</DocType>

Content Application Engine

Most of the code you will be writing is the delivery code that is part of the Content Application Engine, either for preview or for the live site. This is a standard Java webapp that is assembled from different Maven based modules. CAE code is heavily based on Spring MVC with the CoreMedia specific View Dispatcher that takes care of rendering the different documents. The document model is made available using so-called Contentbeans that can be generated from the document model. Contentbeans access the content on demand and can contain additional business logic. So these are not plain POJOs but rather active objects, similar to Active Record entities in the Rails world.

Our example above would translate to a Contentbean with getters for the title (a java.lang.String), the text (a com.coremedia.xml.Markup) and a getter for a java.util.List that is typed to de.fhopf.Article.
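
Translated to code, such a Contentbean might look roughly like this simplified sketch (the actual generated classes and their base interfaces depend on your project setup):

import java.util.List;

import com.coremedia.xml.Markup;

// simplified sketch of a Contentbean for the Article document type
public interface Article {

    String getTitle();

    Markup getText();

    List<Article> getRelated();
}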

Rendering of the Contentbeans happens in JSPs that are named according to classes or interfaces, with a specific logic to determine which JSP should be used. An object Article that resides in the package de.fhopf would then be found in the path de/fhopf/Article.jsp; if you want to add a special rendering mechanism for List this would be in java/util/List.jsp. Different renderings of an object can be selected by using a view name. An Article that is rendered as a link would then be found in de/fhopf/Article.link.jsp.

This is done using one of the custom Spring components of CoreMedia, the View Dispatcher, a View Resolver that determines the correct view to be invoked based on the content element in the model. The JSP that is used can then contain further includes on other elements of the content, be it documents in the sense of CoreMedia or one of the attributes that are available. Those includes are again routed through the View Dispatcher.

Let's see an example for rendering the list of related articles for an article. Say you call the CAE with a certain content id, that is an Article. The standard mechanism routes this request to the Article.jsp described above. It might contain the following fragment to include the related articles:

<cm:include self="${self.related}"/>

Note that we do not tell which JSP to include. CoreMedia automatically figures out that we are including a List, for example a java.util.ArrayList. As there is no JSP available at java/util/ArrayList.jsp, CoreMedia will automatically look for any interfaces that are implemented by that class; in this case it will find java/util/List.jsp. This could then contain the following fragment:

<ul>
<c:forEach items="${self}" var="item">
  <li><cm:include self="${item}" view="link"/></li>
</c:forEach>
</ul>

As the List in our case contains Article implementations this will then hit the Article.link.jsp that would finally render the link. This is a very flexible approach with a high degree of reusability for the fragments. The List.jsp we are seeing above has no connection to the Article. You can use it for any objects that should be rendered in a List structure, the View Dispatcher of CoreMedia takes care of which JSP to include for a certain type.

To minimize the load on the Content Server you can also add caching via configuration settings. Data Views, which are a layer on top of the Contentbeans, are then held in memory and contain prefilled beans that don't need to access the Content Server anymore. This object cache approach is different from the HTML fragment caching a lot of other systems are doing.

Summary

Though this is only a very short introduction you should have seen that CoreMedia really is a nice system to work with. The distributed nature not only makes it scalable but also has implications when developing for it: when you are working on the CAE you are only changing code in this component. You only need to start the more heavyweight Content Server once and can afterwards work with the lightweight CAE, which can be run using the Maven Jetty plugin. Restarts don't take a long time so you have short turnaround times. The JSPs are very cleanly structured and don't need to include scriptlets (I heard that this has been different in earlier versions). As most of the application is built around Spring MVC you can reuse a lot of knowledge that is already around.

Mittwoch, 28. August 2013

FrOSCon 8 2013 - Free and Open Source Software Conference

Last weekend I attended FrOSCon, the Free and Open Source Software Conference taking place in St. Augustin near Bonn. It's a community organized conference with an especially low entrance fee and a relaxed vibe. The talks are a good mixture of development and system administration topics.

Some of the interesting talks I attended:

Fixing Legacy Code by Kore Nordmann and Benjamin Eberlein

Though this session was part of the PHP track it contained a lot of valuable information related to working with legacy code in any language. Besides strategies for getting an application under test the speakers showed some useful refactorings that can make sense to start with. Slides

Building Awesome Ruby Command Line Apps by Christian Vervoorts

Christian first showed some of the properties that make up a good command line app. You should choose sane default values but make those configurable. Help functionality is crucial for a good user experience, via a -h parameter and a man page. In the second part Christian introduced some Ruby gems that can be used to build command line apps. GLI seems to be the most interesting one, with a nice DSL and its scaffolding functionality.

Talking People Into Creating Patches by Isabel Drost-Fromm

Isabel, who is very active in the Apache community, introduced some of her findings when trying to get students, researchers and professionals to participate in Open Source. The participants were a mixture of people running open source projects and developers interested in contributing to open source. I have been especially interested in this talk because I wouldn't mind having more people help with the Odftoolkit I am also working on. When working with professionals, who are the main target, it is important to respond quickly to mails or issues as people might move on to other projects and might not be able to help later on. Also, it's nice to have some easy tasks in the bugtracker that can be picked up by newbies.

MySQL Performance Schema by Carsten Thalheimer

Performance Schema is a new feature in MySQL 5.5 that has been activated by default since 5.6. It monitors a lot of the internal functionality like file access and queries so you can later see which parts you can optimize. Some performance measurements done by the MySQL developers showed that keeping it activated has a performance impact of around 5%. Though this doesn't sound that good at first, I think you can gain a lot more performance through the insight you get into the inner workings. Working with Performance Schema is supposed to be rather complex ("Take two weeks to work with it"); ps_helper is a more beginner-friendly addition that can get you started with some useful metrics.

Summary

FrOSCon is one of the most relaxing conferences I know. It is my go-to place for seeing stuff that is not directly related to Java development. The low fee makes it a no-brainer to attend. If you are interested in any of this year's talks, they will also be made available online.

Mittwoch, 21. August 2013

Getting Started with ElasticSearch: Part 2 - Querying

This is the second part of the article on things I learned while building a simple Java-based search application on top of ElasticSearch. In the first part of this article we looked at how to index data in ElasticSearch and what the mapping is. Though ElasticSearch is often called schema-free, specifying the mapping is still a crucial part of creating a search application. This time we will look at the query side and see how we can get our indexed talks out of it again.

Simple Search

Recall that our documents consist of a title, the date and the speaker of a talk. We have adjusted the mapping so that the title is indexed using the German analyzer, which stems the terms so we can search on variations of words. This curl request creates a similar index:

curl -XPUT "http://localhost:9200/blog" -d'
{
    "mappings" : {
        "talk" : {
            "properties" : {
                "title" : { "type" : "string", "store" : "yes", "analyzer" : "german" }
            }
        }
    }
}'

Let's see how we can search on our content. We are indexing another document with a German title.

curl -XPOST "http://localhost:9200/blog/talk/" -d'
{
    "speaker" : "Florian Hopf",
    "date" : "2012-07-04T19:15:00",
    "title" : "Suchen und Finden mit Lucene und Solr"
}'

All searching is done on the _search endpoint that is available on the type or index level (you can also search on multiple types and indexes by separating them with a comma). As the title field uses the German analyzer we can search on variations of the words, e.g. suche which stems to the same root as suchen, such.

curl -XGET "http://localhost:9200/blog/talk/_search?q=title:suche&pretty=true"                                                                       
{
  "took" : 14, 
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.15342641,
    "hits" : [ {
      "_index" : "blog",
      "_type" : "talk",
      "_id" : "A2Qv3fN3TkeYEhxA4zicgw",
      "_score" : 0.15342641, "_source" : {
        "speaker" : "Florian Hopf",
        "date" : "2012-07-04T19:15:00",
        "title" : "Suchen und Finden mit Lucene und Solr"
      }
    } ]
  }
}

The _all field

Now that this works, we might want to search on multiple fields. ElasticSearch provides the convenience functionality of copying all field content to the _all field that is used when omitting the field name in the query. Let's try the query again:

curl -XGET "http://localhost:9200/blog/talk/_search?q=suche&pretty=true"
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

No results. Why is that? Of course we have set the analyzer correctly for the title as we have seen above. But this doesn't mean that the content is analyzed in the same way for the _all field. As we didn't specify an analyzer for this field it still uses the StandardAnalyzer, which tokenizes the text but doesn't do any stemming. If you want consistent behavior for the title and the _all field you need to set the analyzer in the mapping:

curl -XPUT "http://localhost:9200/blog/talk/_mapping" -d'
{
    "talk" : {
        "_all" : {"analyzer" : "german"},
        "properties" : {
            "title" : { "type" : "string", "store" : "yes", "analyzer" : "german" }
        }
    }
}'

Note that, as with most mapping changes, you can't change the analyzer of the _all field once the index has been created. You need to delete the index, put the new mapping and reindex your data. Afterwards our search will return the same results for the two queries.
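
With the Java client the delete and recreate cycle might look roughly like this. This is only a sketch; it assumes an existing Client instance like the esClient used further below and simply hardcodes the mapping as a JSON string:

import org.elasticsearch.client.Client;

public class MappingUpdater {

    // drops the blog index and recreates it with the analyzer set for _all;
    // afterwards all documents need to be indexed again
    public static void recreateBlogIndex(Client client) {
        client.admin().indices().prepareDelete("blog").execute().actionGet();

        String talkMapping = "{ \"talk\" : { \"_all\" : { \"analyzer\" : \"german\" }, "
                + "\"properties\" : { \"title\" : { \"type\" : \"string\", "
                + "\"store\" : \"yes\", \"analyzer\" : \"german\" } } } }";

        client.admin().indices().prepareCreate("blog")
                .addMapping("talk", talkMapping)
                .execute().actionGet();
    }
}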

_source

You might have noticed from the example above that ElasticSearch returns the special _source field for each result. This is very convenient as you don't need to specify which fields should be stored. But be aware that this might become a problem for large fields that you don't need for each search request (content section of articles, images that you might store in the index). You can either disable the use of the source field and indicate which fields should be stored in the mapping for your indexed type or you can specify in the query which fields you'd like to retrieve:

curl -XGET "http://localhost:9200/blog/talk/_search?q=suche&pretty=true&fields=speaker,title"
{
  "took" : 6,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.15342641,
    "hits" : [ {
      "_index" : "blog",
      "_type" : "talk",
      "_id" : "MA2oYAqnTdqJhbjnCNq2zA",
      "_score" : 0.15342641
    }, {
      "_index" : "blog",
      "_type" : "talk",
      "_id" : "aGdDy24cSImz6DVNSQ5iwA",
      "_score" : 0.076713204,
      "fields" : {
        "speaker" : "Florian Hopf",
        "title" : "Suchen und Finden mit Lucene und Solr"
      }
    } ]
  }
}

The same can be done if you are not using the simple query parameters but the more advanced query DSL:

curl -XPOST "http://localhost:9200/blog/talk/_search" -d'
{
    "fields" : ["title", "speaker"],
    "query" : {
        "term" : { "speaker" : "florian" }
    }
}'

Querying from Java

Besides the JSON based Query DSL you can also query ElasticSearch using Java. The default ElasticSearch Java client provides builders for creating different parts of the query that can then be combined. For example, if you'd like to query on two fields using the multi_match query, this is what it looks like using curl:

curl -XPOST "http://localhost:9200/blog/_search" -d'
{
    "query" : {
        "multi_match" : {
            "query" : "Solr",
            "fields" : [ "title", "speaker" ]
        }
    }
}'

The Java version maps quite well to this. Once you have found the builders you need, you can use the excellent documentation of the Query DSL for your Java code as well.

import static org.elasticsearch.index.query.QueryBuilders.multiMatchQuery;

QueryBuilder multiMatch = multiMatchQuery("Solr", "title", "speaker");
SearchResponse response = esClient.prepareSearch("blog")
        .setQuery(multiMatch)
        .execute().actionGet();
assertEquals(1, response.getHits().getTotalHits());
SearchHit hit = response.getHits().getAt(0);
assertEquals("Suchen und Finden mit Lucene und Solr", hit.getSource().get("title"));

The same QueryBuilder we are constructing above can also be used in other parts of the query: for example it can be passed as a parameter to create a QueryFilterBuilder or can be used to construct a QueryFacetBuilder. This composition is a very powerful way to build flexible applications. It is easier to reason about the components of the query and you can even test parts of the query on their own.
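
Reusing the multi_match query from above, such a composition might look roughly like this (a sketch based on the 0.90-era builder API used in this post; the facet name solrTalks is just made up for the example):

import static org.elasticsearch.index.query.FilterBuilders.queryFilter;
import static org.elasticsearch.index.query.QueryBuilders.multiMatchQuery;
import static org.elasticsearch.search.facet.FacetBuilders.queryFacet;

// the same builder instance can be reused in different places
QueryBuilder multiMatch = multiMatchQuery("Solr", "title", "speaker");

// ... as a filter ...
QueryFilterBuilder filter = queryFilter(multiMatch);

// ... or inside a query facet
QueryFacetBuilder solrTalks = queryFacet("solrTalks").query(multiMatch);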

Faceting

One of the most prominent features of ElasticSearch is its excellent faceting support that not only is used for building search applications but also for doing analytics of large data sets. You can use different kinds of faceting, e.g. for certain terms, using the TermsFacet, or for queries, using the query facet. The query facet would accept the same QueryBuilder that we used above.

import static org.elasticsearch.index.query.QueryBuilders.queryString;
import static org.elasticsearch.search.facet.FacetBuilders.termsFacet;

TermsFacetBuilder facet = termsFacet("speaker").field("speaker");
QueryBuilder query = queryString("solr");
SearchResponse response = esClient.prepareSearch("blog")
        .addFacet(facet)
        .setQuery(query)
        .execute().actionGet();
assertEquals(1, response.getHits().getTotalHits());
SearchHit hit = response.getHits().getAt(0);
assertEquals("Suchen und Finden mit Lucene und Solr", hit.getSource().get("title"));
TermsFacet resultFacet = response.getFacets().facet(TermsFacet.class, "speaker");
assertEquals(1, resultFacet.getEntries().size());

Conclusion

ElasticSearch has a really nice Java API, be it for indexing or for querying. You can get started with indexing and searching in no time though you need to know some concepts or the results might not be what you expect.

Mittwoch, 14. August 2013

The Pragmatic Programmers Rubber Duck of the 19th Century

In their influential book "The Pragmatic Programmer" Andy Hunt and Dave Thomas describe a technique for finding solutions to hard problems you are struggling with. They recommend just telling the problem to somebody, not to get an answer, but because while explaining the problem you think about it differently. And if there is nobody around, get yourself a rubber duck you can talk to, hence the name of the tip.

It's obvious that this is not a new discovery made by the authors. Everybody has experienced similar situations where they found a solution to a problem while explaining it to someone. But I was surprised to read this in the essay "Über die allmähliche Verfertigung der Gedanken beim Reden" by Heinrich von Kleist, dating from 1805/1806 (translated from German by me):

If you want to know something and you can't find it in meditation I advise you [...] to tell it to the next acquaintance you meet. He doesn't need to be a keen thinker and I don't mean you should ask him: no! Rather you should tell him about it in the first place.

1806. The same tip (without the duck). This is another case where we are relearning things that have been discovered before, something especially computer science is prone to.

So, what is the big achievement of the authors? It's not that they only come up with new ideas. We don't need that many new ideas. There is a lot of stuff around that is just waiting to be applied to our work as software developers. Those old ideas need to be put into context. There is even a benefit in stating obvious things that might trigger rethinking your habits.