Wednesday, April 30, 2014

What Is Special About This? Significant Terms in Elasticsearch

I have used Elasticsearch a few times now to do analytics on Twitter data for conferences. Popular hashtags and mentions that can be extracted using facets show what is hot at a conference. But you can go even further and see what makes each hashtag special. In this post I would like to show you the significant terms aggregation that is available with Elasticsearch 1.1. I am using the tweets of last year's Devoxx as those contain enough documents to play around with.

Aggregations

Elasticsearch 1.0 introduced aggregations, which can be used similarly to facets but are far more powerful. To see why they are useful, let's take a step back and look at facets, which are often used to extract statistical values and distributions. One useful example for facets is the total count of a hashtag:

curl -XGET "http://localhost:9200/devoxx/tweet/_search" -d'
{
    "size": 0,  
    "facets": {
       "hashtags": {
          "terms": {
             "field": "hashtag.text",
             "size": 10,
             "exclude": [
                "devoxx", "dv13"
             ]
          }
       }
    }
}'

We request a facet called hashtags that uses the terms of hashtag.text and returns the top 10 values with their counts. We are excluding the hashtags devoxx and dv13 as those are very frequent. This is an excerpt of the result with the popular hashtags:

   "facets": {
      "hashtags": {
         "_type": "terms",
         "missing": 0,
         "total": 19219,
         "other": 17908,
         "terms": [
            {
               "term": "dartlang",
               "count": 229
            },
            {
               "term": "java",
               "count": 216
            },
            {
               "term": "android",
               "count": 139
            },
    [...]

Besides retrieving statistical information like this, facets are often used to offer refinement of search results, for example by displaying the categories or features of products on e-commerce sites.

Starting with Elasticsearch 1.0 you can have the same behaviour by using one of the new aggregations, in this case a terms aggregation:

curl -XGET "http://localhost:9200/devoxx/tweet/_search" -d'
{
    "size" : 0,
    "aggs" : {
        "hashtags" : {
            "terms" : { 
                "field" : "hashtag.text", 
                "exclude" : "devoxx|dv13"
            }
        }
    }
}'

Instead of requesting facets we are now requesting a terms aggregation for the field hashtag.text. The exclusion is now based on a regular expression instead of a list. The result looks similar to the facet return values:

   "aggregations": {
      "hashtags": {
         "buckets": [
            {
               "key": "dartlang",
               "doc_count": 229
            },
            {
               "key": "java",
               "doc_count": 216
            },
            {
               "key": "android",
               "doc_count": 139
            },
    [...]

Each value forms a so-called bucket that contains a key and a doc_count.

But aggregations are not only a replacement for facets. Multiple aggregations can be combined to give more information on the distribution of different fields. For example, we can see which users used a certain hashtag by adding a nested terms aggregation for the field user.screen_name:

curl -XGET "http://localhost:9200/devoxx/tweet/_search" -d'
{
    "size" : 0,
    "aggs" : {
        "hashtags" : {
            "terms" : { 
                "field" : "hashtag.text", 
                "exclude" : "devoxx|dv13"
            },
            "aggs" : {
                "hashtagusers" : {
                    "terms" : {
                        "field" : "user.screen_name"
                    }
                }
            }
        }
    }
}'

Using this nested aggregation we now get a list of buckets for each hashtag. This list contains the users that used the hashtag. This is a short excerpt for the #scala hashtag:

 
               "key": "scala",
               "doc_count": 130,
               "hashtagusers": {
                  "buckets": [
                     {
                        "key": "jaceklaskowski",
                        "doc_count": 74
                     },
                     {
                        "key": "ManningBooks",
                        "doc_count": 3
                     },
    [...]

We can see that one user is responsible for more than half of the tweets with this hashtag. A very dedicated user.

Using aggregations we can get information that we were not able to get with facets alone. If you are interested in more details about aggregations in general or the metrics aggregations I haven't touched on here, Chris Simpson has written a nice post on the feature, there is a nice visual one on the Found blog, another one here, and of course there is the official documentation on the Elasticsearch website.

Significant Terms

Elasticsearch 1.1 contains a new aggregation, the significant terms aggregation. It allows you to do something very useful: For each bucket that is created you can see the terms that make this bucket special.

Significant terms are calculated by comparing a foreground frequency (the frequency of a term in the bucket you are interested in) with a background frequency (which in Elasticsearch 1.1 is always the frequency in the complete index). The aggregation collects terms that are frequent in the current bucket but not in the complete index.

For our example we can now check for the hashtags that are often used together with a certain mention. This is not the same as what the terms aggregation gives us: the significant terms aggregation will only return those terms that occur often for a certain user but not as frequently across all users. This is what Mark Harwood calls the uncommonly common.

curl -XGET "http://localhost:9200/devoxx/tweet/_search" -d'
{
    "size" : 0,
    "aggs" : {
        "mentions" : {
            "terms" : { 
                "field" : "mention.screen_name" 
            },
            "aggs" : {
                "uncommonhashtags" : {
                    "significant_terms" : {
                        "field" : "hashtag.text"
                    }
                }
            }
        }
    }
}'

We request a normal terms aggregation for the mentioned users. Using a nested significant_terms aggregation we can see any hashtags that are often used with the mentioned user but not so often in the whole index. This is a snippet for the account of Brian Goetz:

            {
               "key": "BrianGoetz",
               "doc_count": 173,
               "uncommonhashtags": {
                  "doc_count": 173,
                  "buckets": [
                     {
                        "key": "lambda",
                        "doc_count": 13,
                        "score": 1.8852860861614915,
                        "bg_count": 33
                     },
                     {
                        "key": "jdk8",
                        "doc_count": 8,
                        "score": 0.7193691737111163,
                        "bg_count": 32
                     },
                     {
                        "key": "java",
                        "doc_count": 21,
                        "score": 0.6601749139630457,
                        "bg_count": 216
                     },
                     {
                        "key": "performance",
                        "doc_count": 4,
                        "score": 0.6574225667412876,
                        "bg_count": 9
                     },
                     {
                        "key": "keynote",
                        "doc_count": 9,
                        "score": 0.5442707998673785,
                        "bg_count": 52
                     },
        [...]

You can see that there are some hashtags that are used a lot in tweets mentioning Brian Goetz, many of them related to his keynote, but are not that common in the whole index.

Some more ideas for what we could look at with the significant terms aggregation:

  • Find users that use a certain hashtag disproportionately often (see the sketch after this list).
  • Find terms that are often used with a certain hashtag.
  • Find terms that are used by a certain user.
  • ...
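
A sketch for the first idea (untested, but it simply inverts the query from above: buckets per hashtag, significant users inside each bucket; the aggregation name uncommonusers is made up):

curl -XGET "http://localhost:9200/devoxx/tweet/_search" -d'
{
    "size" : 0,
    "aggs" : {
        "hashtags" : {
            "terms" : {
                "field" : "hashtag.text",
                "exclude" : "devoxx|dv13"
            },
            "aggs" : {
                "uncommonusers" : {
                    "significant_terms" : {
                        "field" : "user.screen_name"
                    }
                }
            }
        }
    }
}'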

Besides these analytics features, significant terms can also be used in search applications. A useful example is given in the Elasticsearch documentation itself: if a user searches for "bird flu", automatically display a link to a search for H5N1, a term which should be very common in the result documents but not in the corpus as a whole.
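
In Elasticsearch the foreground set of a top-level significant_terms aggregation is simply the result of the query, so a sketch of this use case could look like the following; the index news, the type article and the field content are made up for the example:

curl -XGET "http://localhost:9200/news/article/_search" -d'
{
    "size" : 0,
    "query" : {
        "match" : { "content" : "bird flu" }
    },
    "aggs" : {
        "related_terms" : {
            "significant_terms" : {
                "field" : "content"
            }
        }
    }
}'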

Conclusion

With significant terms Elasticsearch has again added a feature that might very well enable surprising new applications and use cases for search. Not only is it useful for analytics, it can also be used to improve classic search applications. Mark Harwood has collected some really interesting use cases on the Elasticsearch blog. If you'd like to read more on the topic, have a look at this post on the QBox blog that introduces significant terms as well as the percentile and cardinality aggregations.

Friday, April 11, 2014

The Absolute Basics of Indexing Data

Ever wondered how a search engine works? In this post I would like to show you a high level view of the internal workings of a search engine and how it can be used to give fast access to your data. I won't go into any technical details, what I am describing here holds true for any Lucene based search engine, be it Lucene itself, Solr or Elasticsearch.

Input

Normally a search engine is agnostic to the original source of the data it indexes. Most often you push data into it via an API, and the data already needs to be in the expected format, mostly strings and simple data types like integers. It doesn't matter if this data originally resides in a document in the filesystem, on a website or in a database.

Search engines are working with documents that consist of fields and values. Though not always used directly you can think of documents as JSON documents. For this post imagine we are building a book database. In our simplified world a book just consists of a title and one or more authors. This would be two example documents:

{
    "title" : "Search Patterns",
    "authors" : [ "Morville", "Callender" ],
}
{
    "title" : "Apache Solr Enterprise Search Server",
    "authors" : [ "Smiley", "Pugh" ]
}

Even though the structure of both documents is the same in our case, the format of the documents doesn't need to be fixed. Both documents could have totally different attributes and nevertheless both could be stored in the same index. In reality you will try to keep the documents similar; after all, you need a way to handle them in your application.

Lucene itself doesn't even have the concept of a key. But of course you need a key to identify your documents when updating them. Both Solr and Elasticsearch have ids that can either be chosen by the application or be autogenerated.
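
For Elasticsearch, getting the two example documents into an index could look like this; the index name books and the type book are arbitrary choices for this sketch:

# id chosen by the application
curl -XPUT "http://localhost:9200/books/book/1" -d'
{
    "title" : "Search Patterns",
    "authors" : [ "Morville", "Callender" ]
}'

# id generated by Elasticsearch
curl -XPOST "http://localhost:9200/books/book/" -d'
{
    "title" : "Apache Solr Enterprise Search Server",
    "authors" : [ "Smiley", "Pugh" ]
}'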

Analyzing

For every field that is indexed a special process called analyzing is employed. What it does can differ from field to field. For example, in a simple case it might just split the terms on whitespace and remove any punctuation so Search Patterns would become two terms, Search and Patterns.
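
You can inspect this step with the _analyze API of Elasticsearch. Using the built-in whitespace analyzer the title from our first document is split into exactly those two tokens:

curl -XGET "http://localhost:9200/_analyze?analyzer=whitespace" -d 'Search Patterns'

The response lists the tokens Search and Patterns together with their positions and offsets.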

Index Structure

An inverted index, the structure search engines use, is similar to a map that contains search terms as keys and references to documents as values. This way the process of searching is just a lookup of a term in the index, a very fast operation. These are the terms that might be indexed for our example documents:

Field     Term        Document Id
title     Apache      2
          Enterprise  2
          Patterns    1
          Search      1, 2
          Server      2
          Solr        2
author    Callender   1
          Morville    1
          Pugh        2
          Smiley      2

A real index contains more information like position information to enable phrase queries and frequencies for calculating the relevancy of a document for a certain search term.

As we can see the index holds a reference to the document. This document, which is also stored with the search index, doesn't necessarily have to be the same as our input document. You can determine for each field whether you would like to keep the original content, which is normally controlled via an attribute named stored. As a general rule, you should store all the fields that you would like to display with the search results. When indexing the full text of lots of books that you don't need to display on a results page, it might be better not to store it at all. You can still search it, as the terms are available in the index, but you can't access the original content.
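
In Elasticsearch the attribute is called store. A mapping for a large full-text field that should be searchable but not retrievable could look roughly like this (the field name content is just an example; also keep in mind that Elasticsearch additionally keeps the complete _source document by default):

"content" : {
    "type" : "string",
    "store" : "no"
}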

More on Analyzing

Looking at the index structure above we can already imagine how the search process for a book might work. The user enters a term, e.g. Solr, this term is then used to lookup the documents that contain the term. This works fine for cases when the user types the term correctly. A search for solr won't match for our current example.

To mitigate those difficulties we can use the analyzing process already mentioned above. Besides the tokenization that splits the field value into tokens we can do further preprocessing like removing tokens, adding tokens or modifying tokens (TokenFilter).

For our book case it might at first be enough to apply lowercasing to the incoming data. A field value Solr will then be stored as solr in the index. To enable the user to also search for Solr with an uppercase letter we need to do analyzing for the query as well. Often the same process is used as for indexing, but there are also cases where a different analyzer makes sense.
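
In Elasticsearch such an analyzer could be defined in the analysis section of the index settings. This is only a sketch and the name book_analyzer is made up:

"analysis" : {
    "analyzer" : {
        "book_analyzer" : {
            "type" : "custom",
            "tokenizer" : "whitespace",
            "filter" : [ "lowercase" ]
        }
    }
}

The analyzer is then referenced in the mapping of a field; if you want a different analyzer at query time Elasticsearch lets you configure it separately via search_analyzer.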

The analyzing process not only depends on the content of the documents (field types, language of text fields) but also on your application. Take one common scenario: adding synonyms to the index. You might think that you can just take a huge list of synonyms like WordNet and add all of them to your analyzer. This might in fact decrease the search experience of your users as it produces too many false positives. Also, for certain terms from the domain of your users WordNet might not contain the correct synonyms at all.

Duplication

When designing the index structure there are two competing forces: Often you either optimize for query speed or for index size. If you have a lot of data you probably need to take care that you only store data that you really need and even only put terms in the index that are necessary for lookups. Oftentimes for smaller datasets the index size doesn't matter that much and you can design your index for query performance.

Let's look at an example that can make sense for both cases. In our book information system we would like to display an alphabetic navigation for the last name of the author. If the user clicks on A, all the books of authors starting with the letter A should be displayed. When using the Lucene query syntax you can do something like this with its wildcard support: Just issue a query that contains the letter the user clicked and a trailing *, e.g. a*.
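
With Elasticsearch this could be expressed using the Lucene query syntax via the q parameter; a sketch, again using the hypothetical books index and the author field from the mapping shown below:

curl -XGET "http://localhost:9200/books/book/_search?q=author:a*"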

Wildcard queries have become very fast with recent Lucene versions, but there is still a query-time cost. You can also choose another way: when indexing the data you can add another field that just stores the first letter of the name. This is what the relevant configuration might look like in Elasticsearch, but the concept is the same for Lucene and Solr:

"author": {
    "type": "multi_field",
    "fields": {
        "author" : {
            "type": "string"
        },
        "letter" : {
            "type": "string",
            "analyzer": "single_char_analyzer"
        }
    }
}
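
The single_char_analyzer used here is not a built-in analyzer and its definition is not part of this post. One possible way to build it (an assumption, not necessarily the original configuration) is a custom analyzer combining the keyword tokenizer with a truncate token filter of length 1; the filter name first_letter is made up:

"analysis" : {
    "filter" : {
        "first_letter" : {
            "type" : "truncate",
            "length" : 1
        }
    },
    "analyzer" : {
        "single_char_analyzer" : {
            "type" : "custom",
            "tokenizer" : "keyword",
            "filter" : [ "first_letter" ]
        }
    }
}

With this definition the value Morville would be indexed as the single term M in author.letter.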

Under the hood, another term dictionary for the field author.letter will be created. For our example it will look like this:

Field          Term  Document Id
author.letter  C     1
               M     1
               P     2
               S     2

Now, instead of issuing a wildcard query on the author field we can directly query the author.letter field with the letter. You can even build the navigation from all the terms in the index, using techniques like faceting to extract all the available terms for a field from the index.
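
Sticking with the hypothetical books index from above, the lookup for a clicked letter and the navigation itself could be sketched like this:

# all books whose author starts with A (term queries are not analyzed)
curl -XGET "http://localhost:9200/books/book/_search" -d'
{
    "query" : {
        "term" : { "author.letter" : "A" }
    }
}'

# collect the available letters for the navigation
curl -XGET "http://localhost:9200/books/book/_search" -d'
{
    "size" : 0,
    "aggs" : {
        "letters" : {
            "terms" : { "field" : "author.letter" }
        }
    }
}'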

Conclusion

These are the basics of indexing data for a search engine. The inverted index structure makes searching really fast by moving some of the processing to indexing time. When we are not bound by index size concerns we can design our index for query performance and add additional fields that duplicate some of the data. This query-oriented design is what makes search engines similar to the way many NoSQL solutions are used.

If you'd like to go deeper in the topics I recommend watching the talk What is in a Lucene index by Adrien Grand. He shows some of the concepts I have mentioned here (and a lot more) but also how those are implemented in Lucene.

Monday, April 7, 2014

BEDCon - Berlin Expert Days 2014

BEDCon is over again. This marks the third year I have been there, and it is still the cheapest general conference I have seen in Germany. If you are looking for a nice conference in Berlin you should definitely consider it.

Interesting Talks

The three most interesting talks I attended:

  • Java App Servers are Dead by Eberhard Wolff. Contains some good arguments why deploying applications to app servers might not be the best idea. I especially liked the idea that the target application server is a dependency of your project (because you need a certain version) and dependencies should be explicit in your source tree.
  • Resilience with Hystrix by Uwe Friedrichsen. A more advanced talk that was perfect for me because I had seen an introduction to fault tolerance by Uwe Friedrichsen at XPDays Germany last year. Hystrix is a library I will definitely check out.
  • Log Management with Graylog2 by Lennart Koopmann. Graylog2 is a complete application for log management that includes Elasticsearch, MongoDB and a Play application. Lennart mentioned some interesting numbers about existing installations; it is a system running at an impressive scale.

Talks I would have liked to see:

My Talks

Surprisingly I had two talks accepted for BEDCon. I am especially glad that the more important talk on Search-Driven Applications with Tobias of Exensio went really well and we talked to a packed room. For me, giving a talk as a team is far less stressful than giving a talk alone. I am looking forward to giving this talk again at Entwicklertag Karlsruhe in May. The slides are available on Speaker Deck.

My second, shorter talk on Akka also went OK; the slides are online. If you are interested in Akka you can also have a look at my blog posts on message passing concurrency with Akka and supervision in Akka.

Tweets

I know, I am repeating myself, but again I stored all the tweets for the conference in Elasticsearch so I can look at them using Kibana. The usual rules apply: each retweet counts as a separate tweet. I am using a sharded version of the index so some of the counts might not be totally exact.

Distribution

There seem to have been more tweets on the first day. I also had the impression that fewer people were there on the second day. The first day has a spike that is probably related to the blackout at around 12.

Hashtags

This is a surprise: elasticsearch beats every other hashtag. logstash, kibana, springmvc, roca ... there are some very specialized hashtags as well. As you can see from the numbers there weren't that many tweets overall.

Mentions

An even bigger surprise for me: I seem to have been mentioned a lot. Looking into this, it turns out it is caused by retweets counting as mentions as well. I had some tweets that got retweeted a few times.

Negativity

There is one thing that had me quite upset. During a short power failure (which was not related to BEDCon at all) I was watching a short talk. The speaker had nothing better to do than to insult the technicians who were there to help him. I don't get this attitude and I hope never to see this speaker again at any conference I am attending.

BEDCon is a great conference, all the people involved are doing a great job. I hope in the following years I can find the time to go there again.
