Friday, March 28, 2014

JavaLand 2014: The Conference, the Talks, the Tweets

On Tuesday and Wednesday this week parts of the German Java community met at JavaLand, their own conference. In this post I would like to talk about the conference in general and some of the talks I attended, and show you some facts about what happened on Twitter.

Location and Organization

The conference has the most unusual setting one can imagine: a theme park, Phantasialand in Brühl. Though I was rather sceptical about this, it really turned out to be a good choice. The rooms for the talks were mostly well suited and the park was a nice place for meeting outside. On Tuesday evening and at some times during the day some of the attractions were open for use. Though I am not really into theme parks I have to admit that some of them really were fun.

Another unusual choice by the organizers: there were no lunch breaks. Every talk started on the full hour and lasted 45 minutes, from 09:00 in the morning until the evening. You had 15 minutes to change rooms and get a coffee. Attendees could decide for themselves whether to eat a really quick lunch in 15 minutes or to skip one of the slots completely. This circumvents some of the problems other conferences have with long queues, or with breaks that are too long for some participants.

The quality of the food was excellent, with buffets at lunch time and even on Tuesday evening. There were several main dishes, different variations of salads and dessert. Some of the best conference catering I have ever seen.

There were quite a few community events, e.g. the Java User Group café or the innovation lab with an Oculus Rift and a quadrocopter.

The Talks

Most of the talks were in German but there were also some in English. I doubt that you get a lot of value if you don't speak German and go to JavaLand just for the talks, though there were several high-profile non-German speakers. On several partner stages companies presented talks given by their employees. I had an especially good impression of the Codecentric stage and regret that I couldn't see Patrick Peschlow talking about Elasticsearch, a talk I heard good things about.

Some of the talks had a common theme:

Reactive Programming

There was a surprising number of talks on reactive programming. First Roland Kuhn, the project lead of the Akka project, introduced the principles of building reactive applications. He showed how choosing an event-driven approach can lead to scalable, resilient and responsive applications. He didn't go into technical details but only introduced the concepts.

At the end of day one, Michael Pisula gave a technical introduction to the actor framework Akka, also referring to a lot of the concepts Roland Kuhn had mentioned in the morning. Though I already have a good grasp of Akka, there were some details that were really useful.

At the beginning of the second day Niko Köbler and Heiko Spindler gave another talk on reactive programming. It was a mixture of concepts and technical details, showing MeteorJS, a framework that combines JavaScript on the client side with the server side, and again Akka.

Profiling

There were two sessions related to profiling. First, Fabian Lange of Codecentric showed different aspects of profiling in general and some details on microbenchmarking. I will especially take a closer look at jmh, a microbenchmarking tool in OpenJDK.

In "JVM and application bottleneck troubleshooting with simple tools" Daniel Witkowski of Azul demoed an example process of analyzing a problematic application. He used system tools and VisualVM to find and identify problems. A rather basic talk but nevertheless it is good to keep some of the practices in mind.

Building Web Applications

Felix Müller and Nils Wloka presented two different web frameworks. Play, presented by Felix Müller, is a framework that is mostly developed by Typesafe, the company behind Scala and Akka. You can use it from Java as well as Scala, though the templates are always written in Scala. I started building an application using Play a while ago and got a quite good impression. If I had a use case for a full-stack framework I would definitely have another look.

Nils Wloka did a session that obviously wasn't for everyone: 45 minutes of live coding in Clojure, building a complete web application for voting on conference talks using Ring, Compojure and Hiccup. I am currently working on a simple application using the same stack, so I could follow at least most of his explanations. I especially liked that he managed to finish the app, deploy it to OpenShift and use it to let the audience vote on his talk. Quite impressive.

I like that both frameworks support rapid development. You don't need to restart the server as often as with common Java web development; a lot of the changes are available instantly.

Miscellaneous

There were two more talks I'd like to mention. On the second day Patrick Baumgartner gave an excellent introduction to Neo4J, the graph database on the JVM. He showed several use cases (including a graph of whiskey brands that spawned quite a discussion about the properties of whiskey in the audience) and some of the technical details. Though I just recently attended a long talk on Neo4J at the JUG Karlsruhe and had already seen a similar talk by Patrick at BEDCon last year, it was really entertaining and good for refreshing some of the knowledge.

Finally, the highlight of the conference for me: on the first day Stefan Tilkov gave one of his excellent talks on architecture. He showed how splitting applications and building for replacement instead of building for reuse can lead to cleaner applications, and gave several examples of companies that have employed these principles. I have never attended a bad talk by Stefan, so if he's giving a talk at a conference you are attending you should definitely go see it.

The Tweets

As I have done before for Devoxx and other conferences, I tracked all of the tweets with the hashtags #javaland and #javaland14, stored them in Elasticsearch and could therefore analyze them with Kibana. I started tracking several months before the conference, but I am only showing some of my findings for the conference days here, as those can give good insight into which topics were hot. Each retweet counts as a separate tweet, so a single tweet that gets retweeted a lot has a strong impact on the numbers.

Timing

Looking at the distribution of the tweets over the two conference days we can see that there are a lot of tweets in the morning of the first day and, surprisingly, in the evening of the first day. I suspect those are mostly tweets by people announcing that they are now starting to ride the Black Mamba. Of course this might also be related to the start of the Java 8 launch event, but I like the Black Mamba theory better.

Mentions

A good indication of popular speakers are the mentions. It's a huge surprise that the Twitter handle of the conference is only in third place. The two speakers Arun Gupta and David Blevins seem to be rather popular on Twitter (or they just had a really long discussion).

Hashtags and Common Terms

To see the trends we can have a look at the hashtags people used. I excluded #javaland as it is contained in every tweet anyway.

The Java 8 launch event is dominant, but Java EE 7 and JavaFX are both strong as well. There was quite a lot of interest in Stephen Chin and his Nighthacking sessions. Neo4J and Asciidoctor are quite a surprise (but not #wegotbeer, a perfectly reasonable hashtag).

Hashtags are one thing but we can also look at all the terms in the text of the tweets. I excluded a long list of common words so this is my curated list of the important terms.

"Great", "thanks" and "cool" ... I am not that deep into statistics but it seems to me that people liked the conference.

Conclusion

JavaLand was a fantastic conference. I had some doubts about the setting in the theme park but it was really great. I will definitely go there again next year; if you are in the area you should think about going as well. Thanks to all the organizers for doing a great job.

Friday, March 21, 2014

Book Review: Instant Apache Solr for Indexing Data How-to

Indexing, the process of putting data into a search engine, is often the foundation of any search-based application. With Instant Apache Solr for Indexing Data How-to Alexandre Rafalovitch has written a book dedicated to different aspects of indexing.

The book is written in a cookbook style, with tasks that are solved using different features of Apache Solr. Each task is classified with a difficulty level (simple, intermediate, advanced). The book shows you how to work with collections, how to index text and binary content (using HTTP and Java) and how to use the Data Import Handler. You will learn about the difference between soft and hard commits, how the UpdateRequestProcessor works (showing the useful ScriptUpdateProcessor) and the functionality of atomic updates. The final task describes an approach to indexing content in multiple languages.
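To give you an idea of the kind of recipes in the book, take atomic updates: instead of resending a whole document you only send the fields that changed. A minimal sketch of how this looks on the command line, assuming a local Solr on the default port with stored fields (the id and price values are made up for illustration):

curl http://localhost:8983/solr/update?commit=true -H 'Content-type:application/json' -d '
[
  {"id":"book-1", "price":{"set":12.99}}
]'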

Though it is quite short the book contains some really useful information. As it is dedicated to indexing alone you can't really use it to learn all aspects of Apache Solr, but you can get some bang for the buck for your daily work. The only thing I missed in the book is a more detailed description of more of the filters and tokenizers. Nevertheless you get quite some value from the book for a reasonable price.

If you are looking for more in-depth information I recommend Apache Solr 3 Enterprise Search Server (which covers an older version) or the recently finished Solr in Action.

Friday, March 14, 2014

Building a Navigation from a Search Index

Though the classic use case for search engines is keyword search, nowadays they are often used to serve content that doesn't resemble a search result list at all. As they are optimized for read access they make good candidates for serving any part of a website that needs a dynamic query mechanism. Faceting is most often used to display filters for refining the search results, but it can also be used to build hierarchical navigations from the results. I am using Solr in this post but the concept can also be used with Elasticsearch or even plain Lucene.

Browsing Search Results

What we are trying to achieve can often be seen on e-commerce sites but is also useful for serving content. You will be familiar with the navigation that Amazon provides: you can browse the categories that are displayed in a hierarchical navigation.

Of course I am not familiar with the implementation details of how they are storing their content, but search engines can be used to build something like this. You can index different types (editorial content and products) and tag those with categories. The navigation and the page content are then built from the search results that are returned.

For a simple example we are indexing just products, two books and one record. One of the two books by Haruki Murakami is in the category Books/Non-Fiction, the other in Books/Fiction. The record by 13 & God is in the category Music/Downbeat. The resulting navigation should then be something like this:

  • Books
    • Non-Fiction
      • Haruki Murakami
    • Fiction
      • Haruki Murakami
  • Music
    • Downbeat
      • 13 & God

PathHierarchyTokenizer

Lucene provides the PathHierarchyTokenizer that can be used to split path-like hierarchies. It takes a path as input and creates segments from it. For example, when indexing Books/Non-Fiction/Haruki Murakami it will emit the following tokens:

  • Books
  • Books/Non-Fiction
  • Books/Non-Fiction/Haruki Murakami

What is important: It doesn't just split the string on a delimiter but creates a real hierarchy with all the parent paths. This can be used to build our navigation.

Solr

Let's see an example with Solr. The configuration and a unit test are also available on GitHub.

We are using a very simple schema with documents that only contain a title and a category:

<fields>
    <field name="title" type="string" indexed="true" stored="true" required="true" multiValued="false" /> 
    <field name="category" type="category" indexed="true" stored="false"/>
</fields>

The title is just a string field but the category is a custom field that uses the PathHierarchyTokenizerFactory.

<fieldType name="category" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="/"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
    </analyzer>
</fieldType>

When indexing data, the path is split according to the rules of the PathHierarchyTokenizer. When querying, we take the query term as it is, so we get exact matches.

Suppose we are now indexing three documents that are in the three categories we have seen above:

URL=http://localhost:8082/solr/update/json
curl $URL -H 'Content-type:application/json' -d '
[
  {"title":"What I Talk About When I Talk About Running", "category":"Books/Non-Fiction/Haruki Murakami"}, 
  {"title":"South of the Border, West of the Sun", "category":"Books/Fiction/Haruki Murakami"}, 
  {"title":"Own Your Ghost", "category":"Music/Downbeat/13&God"}
]'
curl "$URL?commit=true"

We can easily request the navigation using facets. We query on all documents but return no documents at all. A facet is returned that contains our navigation structure:

curl "http://localhost:8082/solr/collection1/select?q=*%3A*&rows=0&wt=json&indent=true&facet=true&facet.field=category"
{
  "responseHeader":{
    "status":0,
    "QTime":1},
  "response":{"numFound":3,"start":0,"docs":[]
  },
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{
      "category":[
        "Books",2,
        "Books/Fiction",1,
        "Books/Fiction/Haruki Murakami",1,
        "Books/Non-Fiction",1,
        "Books/Non-Fiction/Haruki Murakami",1,
        "Music",1,
        "Music/Downbeat",1,
        "Music/Downbeat/13&God",1]},
    "facet_dates":{},
    "facet_ranges":{}}}

When displaying the navigation you can now simply split the paths according to their hierarchies. The queries that are executed for displaying the content contain a filter query that filters on the currently selected navigation item.

curl "http://localhost:8082/solr/collection1/select?q=*%3A*&wt=json&fq=category:Books"
{
  "responseHeader":{
    "status":0,
    "QTime":27},
  "response":{"numFound":2,"start":0,"docs":[
      {
        "title":"What I Talk About When I Talk About Running"},
      {
        "title":"South of the Border, West of the Sun"}]
  }}

Using tags and exclusions you can even build the navigation using the same request that queries for the filtered documents.

curl "http://localhost:8082/solr/collection1/select?q=*%3A*&wt=json&fq=%7B%21tag%3Dcat%7Dcategory:Books&facet=true&facet.field=%7B%21ex%3Dcat%7Dcategory"
{
  "responseHeader":{
    "status":0,
    "QTime":2},
  "response":{"numFound":2,"start":0,"docs":[
      {
        "title":"What I Talk About When I Talk About Running"},
      {
        "title":"South of the Border, West of the Sun"}]
  },
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{
      "category":[
        "Books",2,
        "Books/Fiction",1,
        "Books/Fiction/Haruki Murakami",1,
        "Books/Non-Fiction",1,
        "Books/Non-Fiction/Haruki Murakami",1,
        "Music",1,
        "Music/Downbeat",1,
        "Music/Downbeat/13&God",1]},
    "facet_dates":{},
    "facet_ranges":{}}}

If the navigation you are retrieving from the search engine is your single source of truth you might also have to add a sort mechanism; by default all facets are sorted by their count. This works in our simple case but not for the real world. To have the categories sorted in a defined way you can add numeric identifiers: you would then index paths like 100|Books/110|Non-Fiction/111|Haruki Murakami and request alphanumeric sorting via facet.sort=index. When displaying the result you just strip the numeric prefix from each segment.
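A minimal sketch of what this could look like, reusing the setup from above (the numeric prefixes and the port are just illustrative assumptions):

URL=http://localhost:8082/solr/update/json
curl $URL -H 'Content-type:application/json' -d '
[
  {"title":"What I Talk About When I Talk About Running", "category":"100|Books/110|Non-Fiction/111|Haruki Murakami"}
]'
curl "$URL?commit=true"
# request the category facet in alphanumeric (index) order instead of by count
curl "http://localhost:8082/solr/collection1/select?q=*%3A*&rows=0&wt=json&facet=true&facet.field=category&facet.sort=index"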

As you are now using the search engine to build the navigation you immediately get the benefits of its filtering mechanisms. Only show categories that have online articles? Add a filter query fq=online:true. Make sure that there are no categories with products that are out of stock? fq=inStock:true.
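Such a filter is simply added to the faceting request from above; note that the online field is hypothetical and not part of the minimal schema used in this post:

curl "http://localhost:8082/solr/collection1/select?q=*%3A*&rows=0&wt=json&facet=true&facet.field=category&fq=online:true"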

Conclusion

Search engines offer great flexibility when delivering content. A lot of their functionality can be used to build applications that pull most of their content from the index.

Friday, March 7, 2014

Prefix and Suffix Matches in Solr

Search engines are all about looking up strings. The user enters a query term that is then looked up in the inverted index. Sometimes a user is looking for a value that is only a substring of values in the index, and the user might be interested in those matches as well. This is especially important for languages like German that contain compound words like Semmelknödel, where Knödel means dumpling and Semmel specifies the kind.

Wildcards

To demo the approaches I am using a very simple schema. Documents consist of a text field and an id. The configuration as well as a unit test is also available on GitHub.

<fields>
    <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
    <field name="text" type="text_general" indexed="true" stored="false"/>
</fields>
<uniqueKey>id</uniqueKey>
<types>
    <fieldType name="string" class="solr.StrField" sortMissingLast="true" />

    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
        <analyzer>
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
    </fieldType>
</types>

One approach that is quite popular for prefix or suffix matches is to use wildcards when querying. This can be done programmatically, but you need to take care that any user input is escaped correctly. Suppose you have the term dumpling in the index and a user enters the term dump. If you want to make sure that the query term matches the document in the index, you can just add a wildcard to the user query in the code of your application, so the resulting query would be dump*.

Generally you should be careful when doing too much magic like this: if a user is in fact looking for documents containing the word dump she might not be interested in documents containing dumpling. You need to decide for yourself if you would like to have only matches the user is interested in (precision) or show the user as many probable matches as possible (recall). This heavily depends on the use cases for your application.

You can improve the user experience a bit by boosting exact matches for your term. You need to create a slightly more complicated query, but this way documents containing an exact match will score higher:

dump^2 OR dump*

When creating a query like this you should also take care that the user can't add terms that would make the query invalid. The SolrJ method escapeQueryChars of the class ClientUtils can be used to escape the user input.
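Put together, a request for the boosted wildcard query could look like this. This is only a sketch: the port and collection name are assumptions for a local test setup, and in a real application the term dump would be the escaped user input:

curl "http://localhost:8082/solr/collection1/select?q=text:dump%5E2+OR+text:dump*&wt=json&indent=true"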

If you also take suffix matches into account the query can get quite complicated, and creating a query like this on the client side is not for everyone. Depending on your application another approach can be the better solution: you can create another field containing NGrams during indexing.

Prefix Matches with NGrams

NGrams are substrings of your indexed terms that you can put in an additional field. Those substrings can then be used for lookups so there is no need for any wildcards. Using the (e)dismax handler you can automatically set a boost on your field that is used for exact matches so you get the same behaviour we have seen above.

For prefix matches we can use the EdgeNGramFilter that is configured for an additional field:

...
    <field name="text_prefix" type="text_prefix" indexed="true" stored="false"/>
...
    <copyField source="text" dest="text_prefix"/>
...    
    <fieldType name="text_prefix" class="solr.TextField" positionIncrementGap="100">
        <analyzer type="index">
            <tokenizer class="solr.LowerCaseTokenizerFactory"/>
            <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15" side="front"/>
        </analyzer>
        <analyzer type="query">
            <tokenizer class="solr.LowerCaseTokenizerFactory"/>
        </analyzer>
    </fieldType>

At indexing time the value of the text field is copied to the text_prefix field and analyzed using the EdgeNGramFilter. Grams are created for any length between 3 and 15, starting from the front of the string. When indexing the term dumpling those would be:

  • dum
  • dump
  • dumpl
  • dumpli
  • dumplin
  • dumpling

During query time the term is not split again so that the exact match for the substring can be used. As usual, the analyze view of the Solr admin backend can be a great help for seeing the analyzing process in action.

Using the dismax handler you can now pass in the user query as it is and just advise it to search your fields by adding the parameter qf=text^2 text_prefix.
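As a sketch of the full request (using edismax; the port and collection name are again assumptions for a local test setup):

curl "http://localhost:8082/solr/collection1/select?defType=edismax&q=dump&qf=text%5E2+text_prefix&wt=json&indent=true"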

Suffix Matches

With languages that have compound words it's a common requirement to also do suffix matches. If a user queries for the term Knödel (dumpling) it is expected that documents that contain the term Semmelknödel also match.

Using Solr versions up to 4.3 this is no problem. You can use the EdgeNGramFilterFactory to create grams starting from the back of the string.

...
    <field name="text_suffix" type="text_suffix" indexed="true" stored="false"/>
...    
    <copyField source="text" dest="text_suffix"/>
...
    <fieldType name="text_suffix" class="solr.TextField" positionIncrementGap="100">
        <analyzer type="index">
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15" side="back"/>
        </analyzer>
        <analyzer type="query">
            <tokenizer class="solr.KeywordTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
    </fieldType>
...

This creates suffixes of the indexed term that also contain the term knödel, so our query works.

But using more recent versions of Solr you will encounter a problem at indexing time:

java.lang.IllegalArgumentException: Side.BACK is not supported anymore as of Lucene 4.4, use ReverseStringFilter up-front and afterward
    at org.apache.lucene.analysis.ngram.EdgeNGramTokenFilter.<init>(EdgeNGramTokenFilter.java:114)
    at org.apache.lucene.analysis.ngram.EdgeNGramTokenFilter.<init>(EdgeNGramTokenFilter.java:149)
    at org.apache.lucene.analysis.ngram.EdgeNGramFilterFactory.create(EdgeNGramFilterFactory.java:52)
    at org.apache.lucene.analysis.ngram.EdgeNGramFilterFactory.create(EdgeNGramFilterFactory.java:34)

You can't use the EdgeNGramFilterFactory for suffix ngrams anymore. But fortunately the stack trace also advises us how to fix the problem: we have to combine it with the ReverseStringFilter:

<fieldType name="text_suffix" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.LowerCaseTokenizerFactory"/>
        <filter class="solr.ReverseStringFilterFactory"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15" side="front"/>
        <filter class="solr.ReverseStringFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.LowerCaseTokenizerFactory"/>
    </analyzer>
</fieldType>

This will now yield the same results as before.
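A quick way to check this (again just a sketch: the URLs, port and core follow the same assumed local setup as in my previous posts, and the document is made up):

URL=http://localhost:8082/solr/update/json
curl $URL -H 'Content-type:application/json' -d '[{"id":"1", "text":"Semmelknödel"}]'
curl "$URL?commit=true"
# the suffix field now matches the plain term knödel
curl "http://localhost:8082/solr/collection1/select?q=text_suffix:kn%C3%B6del&wt=json"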

Conclusion

Whether you manipulate your query by adding wildcards or use the NGram approach depends heavily on your use case and is also a matter of taste. Personally I am using NGrams most of the time, as disk space normally isn't a concern for the kind of projects I am working on. Wildcard search has become a lot faster in Lucene 4, so I doubt there is a real benefit there anymore. Nevertheless I tend to do as much processing as I can at indexing time.
