Mittwoch, 18. Dezember 2013

Book Review: Taming Text

This is text. As I presume you are human, you will understand the words and their meaning. Some words have multiple meanings, like the word "like". Also, as English isn't my native tongue, there will be errors in my writing, but you will understand it anyway. Our brain does a fantastic job at inferring meaning from context. This is something that is far more difficult for machines.

Grant S. Ingersoll, Thomas S. Morton and Andrew L. Farris have written a book about all the difficulties you might encounter when processing text with machines, and ways to solve them. Taming Text not only shows you the theory of extracting, searching and classifying information in text but also introduces different open source projects that you can integrate into your application.

Each chapter focuses on one problem space and most of them can even be read in isolation. You will learn about the difficulties in understanding text, mostly caused by ambiguous meanings and the context words appear in. Tokenization and entity recognition are introduced along with some basics of linguistics. Searching in text is covered in detail, including analysis, the inverted index and the vector space model, which is also important for clustering and classification. Fuzzy string matching, the process of looking up similar strings, is shown using the famous Levenshtein distance, n-grams and tries. A larger part of the book finally focuses on text clustering, the unsupervised process of grouping documents into clusters, and on classification and categorization, a learning process that needs some precategorized data.
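To get a feeling for what the fuzzy matching chapter is about: the Levenshtein distance counts the minimal number of single-character edits between two strings. Here is a minimal standalone sketch of the classic dynamic-programming algorithm (the book itself works with library implementations; this version is just for illustration):

```java
public class Levenshtein {

    // Classic dynamic-programming edit distance: the number of single-character
    // insertions, deletions and substitutions needed to turn one string into the other.
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("tamed", "taming")); // 3
    }
}
```

In practice you would use a library such as Lucene's fuzzy query support rather than rolling your own, but the core idea is exactly this table.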

Throughout the chapters the authors introduce sample applications in Java using one or more of the open source projects that are covered. You will see an application that searches text in Mary Shelley's Frankenstein using Apache Lucene and performs entity recognition with Apache OpenNLP to identify people and places. Apache Solr is mostly used for searching, while OpenNLP handles extensive analysis of text like tokenization, sentence detection and part-of-speech tagging. Content is extracted from different file formats using Apache Tika. Text clustering is shown using Carrot² for search result clustering in Solr; Apache Mahout is mainly used for document clustering and classification, with some help from Lucene, Solr and OpenNLP. The final example of the book builds on the knowledge of all the preceding chapters, showing an example question answering system similar to IBM Watson that accepts natural language questions and tries to give correct answers from a data set extracted from Wikipedia.

This book is exceptional in that it covers many different topics, yet the authors manage to combine them in a coherent example. It is one of the books in this year's Jolt Awards for a good reason. If you are doing anything with text, be it searching or analytics, you are advised to get a copy for yourself. I know that I will come back to mine in the future when I need to refresh some of the information.

Donnerstag, 28. November 2013

Reindexing Content in Elasticsearch with stream2es

Last week I wrote about reindexing content in Elasticsearch using a script that extracts the source field from the indexed content. You can use it for cases when your mapping changes or you need to adjust the index settings. After publishing the post, Drew Raines mentioned that there is an easier way using only the stream2es utility. Time to have a look at it!

stream2es can be used to stream content from several inputs to Elasticsearch. In my last post I used it to stream a file containing the sources of documents to an Elasticsearch index. Besides that, it can index data from Wikipedia or Twitter, or read from Elasticsearch directly, which is what we will look at now.

Again, we are indexing some documents:

curl -XPOST "http://localhost:9200/twitter/tweet/" -d'
{
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elastic Search"
}'
curl -XPOST "http://localhost:9200/twitter/tweet/" -d'
{
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:14:14",
    "message" : "Elasticsearch works!"
}'

Now, if we need to adjust the mapping we can just create a new index with the new mapping:

curl -XPOST "http://localhost:9200/twitter2" -d'
{
    "mappings" : {
        "tweet" : {
            "properties" : {
                "user" : { "type" : "string", "index" : "not_analyzed" }
            }
        }
    }
}'

You can now use stream2es to transfer the documents from the old index to the new one:

stream2es es --source http://localhost:9200/twitter/ --target http://localhost:9200/twitter2/

This will make our documents available in the new index:

curl -XGET http://localhost:9200/twitter2/_count?pretty=true
{                                                                  
  "count" : 2,                                                      
  "_shards" : {                                                    
    "total" : 5,                                                    
    "successful" : 5,                                               
    "failed" : 0                                                   
  }
}

You can now delete the old index. To keep your data available under the old index name you can also create an alias that points to the new index:

curl -XDELETE http://localhost:9200/twitter
curl -XPOST 'http://localhost:9200/_aliases' -d '
{
    "actions" : [
        { "add" : { "index" : "twitter2", "alias" : "twitter" } }
    ]
}'

Looking at the mapping, you can see that the twitter index now points to our updated version:

curl -XGET http://localhost:9200/twitter/tweet/_mapping?pretty=true
{
  "tweet" : {
    "properties" : {
      "bytes" : {
        "type" : "long"
      },
      "message" : {
        "type" : "string"
      },
      "offset" : {
        "type" : "long"
      },
      "post_date" : {
        "type" : "date",
        "format" : "dateOptionalTime"
      },
      "user" : {
        "type" : "string",
        "index" : "not_analyzed",
        "omit_norms" : true,
        "index_options" : "docs"
      }
    }
  }
}

Donnerstag, 21. November 2013

Reindexing Content in Elasticsearch

One of the crucial parts of any search application is the way you map your content to the analyzers. It determines which query terms match the terms that are indexed with the documents. Sometimes during development you might notice that you didn't get this right from the beginning and need to reindex your data with a new mapping. While for some applications you can easily start the indexing process again, this becomes more difficult for others. Luckily, Elasticsearch by default stores the original content in the _source field. In this short article I will show you how to use a script developed by Simon Willnauer that lets you retrieve all the data and reindex it with a new mapping.

You can do the same thing in an easier way using only the stream2es utility. Look at this post if you are interested.

Reindexing

Suppose you have indexed documents in Elasticsearch. Imagine that there are a lot of them, so they cannot easily be reindexed from the original source, or reindexing would take some time.

curl -XPOST "http://localhost:9200/twitter/tweet/" -d'
{
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elastic Search"
}'
curl -XPOST "http://localhost:9200/twitter/tweet/" -d'
{
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:14:14",
    "message" : "Elasticsearch works!"
}'

Initially this will create a mapping that is inferred from the field values.

curl -XGET "http://localhost:9200/twitter/tweet/_mapping?pretty=true"
{
  "tweet" : {
    "properties" : {
      "message" : {
        "type" : "string"
      },
      "post_date" : {
        "type" : "date",
        "format" : "dateOptionalTime"
      },
      "user" : {
        "type" : "string"
      }
    }
  }
}

Now, if you notice that you would like to change some of the existing fields to another type, you need to reindex, as Elasticsearch doesn't allow you to modify the mapping for existing fields. Adding fields is fine, but changing existing ones is not. You can leverage the _source field that you can also see when querying a document.

curl -XGET "http://localhost:9200/twitter/tweet/_search?q=user:kimchy&pretty=true&size=1"
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.30685282,
    "hits" : [ {
      "_index" : "twitter",
      "_type" : "tweet",
      "_id" : "oaFqxMnqSrex6T7_Ut-erw",
      "_score" : 0.30685282, "_source" : {
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elastic Search"
}

    } ]
  }
}

For his "no slides no bullshit introduction to Elasticsearch" Simon Willnauer has implemented a script that retrieves the _source field for all documents of an index. After installing the prerequisites you can use it by passing in your index name:

fetchSource.sh twitter > result.json

It prints all the documents to stdout, which can be redirected to a file. We can now delete our index and recreate it using a different mapping.

curl -XDELETE http://localhost:9200/twitter
curl -XPOST "http://localhost:9200/twitter" -d'
{
    "mappings" : {
        "tweet" : {
            "properties" : {
                "user" : { "type" : "string", "index" : "not_analyzed" }
            }
        }
    }
}'

The file we just created can now be sent to Elasticsearch again using the handy stream2es utility.

stream2es stdin --target "http://localhost:9200/twitter/tweet" < result.json

All your documents are now indexed using the new mapping.

Implementation

Let's look at the details of the script. At the time of writing this post the relevant part of the script looks like this:

SCROLL_ID=`curl -s -XGET 'localhost:9200/'${INDEX_NAME}'/_search?search_type=scan&scroll=11m&size=250' -d '{"query" : {"match_all" : {} }}' | jq '._scroll_id' | sed s/\"//g`
RESULT=`curl -s -XGET 'localhost:9200/_search/scroll?scroll=10m' -d ${SCROLL_ID}`

while [[ `echo ${RESULT} | jq -c '.hits.hits | length'` -gt 0 ]] ; do
  #echo "Processed batch of " `echo ${RESULT} | jq -c '.hits.hits | length'`
  SCROLL_ID=`echo $RESULT | jq '._scroll_id' | sed s/\"//g`
  echo $RESULT | jq -c '.hits.hits[] | ._source + {_id}' 
  RESULT=$(eval "curl -s -XGET 'localhost:9200/_search/scroll?scroll=10m' -d ${SCROLL_ID}")
done

It uses scrolling to efficiently traverse the documents. Processing of the JSON output is done using jq, a lightweight and flexible command-line JSON processor that I should have used as well when querying the SonarQube REST API.

The first line in the script creates a scan search that uses scrolling. The scroll will be valid for 11 minutes, return 250 documents per request and match all documents, as requested with the match_all query. This response doesn't contain any documents, only the _scroll_id, which is then extracted with jq. The final sed command removes the quotes around it.

The scroll id is then used to send queries to Elasticsearch. On each iteration the script checks whether there are any hits at all. If there are, the request will return a new scroll id for the next batch and the result is echoed to the console. .hits.hits[] returns the list of all hits. The pipe symbol in jq processes each hit with the filter on the right, which prints the source as well as the id of the hit.
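Stripped of curl and jq, the control flow of the loop is simply "keep consuming batches until one comes back empty". Here is that pattern sketched in plain Java, with in-memory lists standing in for the scroll responses (a hypothetical stand-in, no Elasticsearch involved):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class ScrollLoop {

    // Each inner list stands in for the hits.hits array of one scroll response.
    // Like the shell loop, we stop as soon as a batch comes back empty.
    static int drain(List<List<String>> responses) {
        int processed = 0;
        for (List<String> hits : responses) {
            if (hits.isEmpty()) {      // corresponds to the jq '.hits.hits | length' check
                break;
            }
            processed += hits.size();  // here the script echoes each _source + _id
        }
        return processed;
    }

    public static void main(String[] args) {
        List<List<String>> responses = Arrays.asList(
                Arrays.asList("doc1", "doc2"),
                Arrays.asList("doc3"),
                Collections.<String>emptyList()); // the empty response ends the scroll
        System.out.println(drain(responses));
    }
}
```

The empty final batch is exactly how Elasticsearch signals that a scroll is exhausted, which is why the length check is all the termination logic the script needs.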

Conclusion

The script is a very useful addition to your Elasticsearch toolbox. You can use it to reindex or just export your content. I am glad I looked at the details of the implementation as in the future jq can come in really handy as well.

Samstag, 16. November 2013

Devoxx in Tweets

For the first time in several years I unfortunately had to skip this year's Devoxx. There are so many tweets that remind me of the good talks going on there, and I thought I would do something useful with them. So again I indexed them in Elasticsearch using the Twitter river, and therefore can look at them using Kibana. David Pilato has also set up a public instance, and I could imagine that there will be a more thorough analysis done by the Devoxx team, but here are my thoughts on this year's Devoxx without having been there.

I'll be looking at three things: the top 10 mentions, the top 10 hashtags and the tweet distribution over time. For the mentions I have excluded @Devoxx and @java; for the hashtags I have excluded #devoxx, #dvx13 and #dv13, as those mentions and tags are too dominant and don't tell a lot. I have collected all tweets mentioning the term devoxx, so there will be a lot I missed. Each retweet counts as a separate tweet.

Overall Trends

Looking at the timeline of the whole week you can see that the number of tweets is high at the beginning and continually rises, with Thursday having even more tweets than Wednesday, which is quite a surprise to me. I would have thought that the first conference day is the one that has the most tweets.

Stephan007, the founder of Devoxx, has the most mentions, which is no surprise. Chet Haase and Romain Guy follow. I have never seen a talk by them but I probably should. The Dart language is the dominant hashtag, with a lot of buzz around its 1.0 release. Java, Android and Scala are still hot technologies; Android is a bit of a surprise here. It's nice that the initiative Devoxx4Kids ranks quite high.

Daily Analysis

Monday

On Monday the top mention is @AngularJS. Of course this is caused by the two AngularJS university sessions that lasted nearly the whole day. Angular is a hot topic but I am not yet planning to do any work with it. The session on Java EE 7 also created a lot of interest, as can be seen by the mentions of its hosts Arun Gupta and Antonio Goncalves. They especially encouraged people to participate on Twitter, which seems to have been received very well. Scala is another hot topic with the university session by Dick Wall and Joshua Suereth that I would really have liked to see.

Tuesday

Tuesday is dominated by the two excellent speakers Matt Raible and Venkat Subramaniam. I especially regret that I couldn't see Venkat in action, who I consider to be one of the best speakers around. I am not sure what the tag hackergarden is referring to, as I didn't find an event on Monday evening or Tuesday. There is also quite some interest in Reactor, the reactive framework of the Spring ecosystem.

Wednesday

Brian Goetz got a lot of mentions for the keynote. I think it's a surprise that there are so many mentions of David Blevins for his talk "EJB 3.2 and beyond", which I wouldn't have expected to be that popular. The big event of the day was the launch of Ceylon 1.0, as can be seen from the hashtag. I have heard good things about Ceylon, but I still consider it an underdog among the alternative JVM languages.

Thursday

Romain Guy leads the mentions with his very popular talk "Filthy Rich Android Clients", followed by Jonas Boner of Akka fame and Venkat Subramaniam. The launch of Dart 1.0 dominates the keywords. The Javaposse still ranks in the top 10 with their popular traditional session.

Friday

Friday normally has fewer participants than the other days. Joshua Suereth received a lot of tweets for his Scala talk, ranking high both in mentions and hashtags. The session on Google Glass was also very popular. I am not sure which session caused the mention of dagger.

Programming Language Popularity

Niko Schmuck proposed adding the language popularity over the week. As this is quite interesting, here is the totally unscientific popularity chart that of course should determine which language you are learning next. I am not querying the hashtags but any mention of the terms.

Java dominates but JavaScript is very strong on Monday and Thursday. Ceylon has its share on Wednesday while Thursday is the Dart day. Scala is very popular on Monday and Friday.

A ranked version:

Java: 1234
JS: 584
Dart: 490
Scala: 252
Ceylon: 172
Groovy: 171
Clojure: 94
Kotlin: 16

The Drink Tweets

As there are quite some people tweeting we can see some trends with regards to the drink tweets. First the coffee tweets:

Quite a spike on Monday, with people either mentioning that they need coffee or complaining about the coffee. This repeats on Tuesday and Wednesday mornings; people seem to have accepted the situation by Thursday.

Another common topic, especially since the conference is located in Belgium, is beer.

Surprise, surprise: people tend to tweet about beer in the evening. I like the huge Javaposse spike on Thursday, with a lot of mentions of the beer sponsor Atlassian.

Conclusion

Though I haven't been there I could get a small glimpse of the trends at this year's Devoxx. As soon as the videos are available I will buy the account for this year's conference, not only because there are so many interesting talks to see but also because the Devoxx team is doing a fantastic job that I'd like to support in any way I can.

Updates
  • 17.11. Added a section on programming language popularity
  • 18.11. Updated the weekly diagram with a more accurate version

Sonntag, 10. November 2013

Lucene Solr Revolution 2013 in Dublin

I just returned from Lucene Solr Revolution Europe, the conference on everything Lucene and Solr, which this year was held in Dublin. I always like to recap what I took from a conference, so here are some impressions.

The Venue

In the spirit of last year's conference, which was merged with ApacheCon and held in a soccer stadium in Sinsheim, this year's venue was a rugby stadium. It seems to be quite common that conferences are organized in stadiums, and the location was well suited. For some of the room changes you had to walk quite a distance, but that's nothing that couldn't be managed.

The Talks

As there were four tracks in parallel, choosing which talk to attend could prove difficult. There were so many interesting things to choose from. Fortunately all the talks have been recorded and will be made available for free on the conference website.

The following is a selection of the talks that were most valuable to me.

Keynote: Michael Busch on Lucene at Twitter

Michael Busch is a regular speaker at search conferences because Twitter is doing some interesting things. On the one hand they have to handle near-realtime search, massive data sets and lots of requests. On the other hand they can always be sure that their documents are of a certain size. They maintain two different data stores as Lucene indices: the realtime index that contains the most recent data and the archive index that makes older tweets searchable. They introduced the archive index only a few months ago, which in my opinion led to a far more reliable search experience. They have done some really interesting things, like encoding the position info of a term with the doc id, because they only need a few bits to address positions in a 140-character document. They also changed some aspects of the posting list encoding because they always display results sorted by date. They are trying to make their changes more general so those can be contributed back to Lucene.
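The bit-packing idea is easy to picture: a position in a 140-character tweet fits into 8 bits, so it can share a machine word with the doc id. The following is a toy sketch of that idea, not Twitter's actual encoding:

```java
public class PackedPosting {

    // Pack a doc id and a position (< 256, enough for a 140-character tweet)
    // into one long: upper bits hold the doc id, the lowest 8 bits the position.
    static long pack(long docId, int position) {
        return (docId << 8) | (position & 0xFF);
    }

    static long docId(long packed) {
        return packed >>> 8;
    }

    static int position(long packed) {
        return (int) (packed & 0xFF);
    }

    public static void main(String[] args) {
        long p = pack(123456789L, 77);
        System.out.println(docId(p) + " " + position(p)); // 123456789 77
    }
}
```

Storing the position alongside the doc id in this way saves a separate positions lookup, which matters when you serve the query volume Twitter does.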

Solr Indexing and Analysis Tricks by Erik Hatcher

I always enjoy listening to the talks of Erik Hatcher, probably also because his university session at Devoxx 2009 was the driving factor for me starting to use Solr. In this year's talk he presented lots of useful techniques for indexing data in Solr. One of the most interesting facts I took from this talk is the use of the ScriptUpdateProcessor that has been included in Solr since version 4. You can define scripts that are executed during indexing and can manipulate the document. This is a valuable alternative to copyFields, especially if you would like to have the content stored as well. By default you implement the logic in JavaScript, but there are alternatives available.

Hacking Lucene and Solr for Fun and Profit by Grant Ingersoll

Grant Ingersoll presented some applications of Lucene and Solr not directly involving search, like classification, recommendations and analytics. Some examples were taken from his excellent book Taming Text (watch this blog for a review of the book in the near future).

Schemaless Solr and the Solr Schema REST API by Steve Rowe

One of the factors in the success of Elasticsearch is its ease of use. You can download it and start indexing documents immediately without doing any configuration work. One of the features that enables you to do this is the autodiscovery of field types by value. Starting with Solr 4.4 you can now use Solr in a similar way. You can configure Solr to manage your schema; unknown fields are then created automatically based on the first value that is extracted by configured parsers. As with Elasticsearch, you shouldn't rely on this feature exclusively, so there is also a way to add new fields of a certain type via the Schema REST API. When Solr is in managed mode it will modify the schema.xml, so you might lose changes you made manually. For the future the developers are even thinking about moving away from XML for the managed mode, as there are better options when readability doesn't matter.

Stump the Chump with Chris Hostetter

This seems to be a tradition at Lucene Solr Revolution: Chris Hostetter has to find solutions to problems that have been submitted beforehand or are posted by the audience. It's a fun event, but you can also learn a lot.

Query Latency Optimization with Lucene by Stefan Pohl

Stefan first introduced some basic latency factors and how to measure them. He recommended not instrumenting the low-level Lucene classes when profiling your application, as those rely heavily on HotSpot optimizations. Besides introducing the basic mechanisms of how conjunction (AND) and disjunction (OR) queries work, he described some recent Lucene improvements that can speed up your application, among them LUCENE-4571, the new minShouldMatch implementation, and LUCENE-4752, which allows custom ordering of documents in the index.

Relevancy Hacks for eCommerce by Varun Thacker

Varun introduced the basics of relevancy sorting in Lucene and Solr and how those might affect product searches. TF/IDF is sometimes not the best solution ("IDF is a measurement of rarity, not necessarily importance"). He also showed ways to influence the relevancy: implementing a custom Similarity class, boosting and function queries.
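The rarity point is easy to see from Lucene's classic idf formula, idf = 1 + ln(numDocs / (docFreq + 1)), as used by the DefaultSimilarity of that era. A tiny standalone sketch (the catalog numbers are made up for illustration):

```java
public class IdfDemo {

    // Lucene's classic idf: rare terms score high, ubiquitous terms approach 1.
    static double idf(int numDocs, int docFreq) {
        return 1 + Math.log((double) numDocs / (docFreq + 1));
    }

    public static void main(String[] args) {
        // In a 1000-document product catalog, a stopword-like term might occur
        // in 999 documents, a rare brand name in 3. Rarity drives the score,
        // which says nothing about how important the term is to shoppers.
        System.out.println(idf(1000, 999)); // exactly 1.0
        System.out.println(idf(1000, 3));   // roughly 6.5
    }
}
```

This is why product search often tones down or replaces TF/IDF: a rare typo in a product title would outscore the brand name every customer actually types.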

What is in a Lucene Index by Adrien Grand

Adrien started with the basics of a Lucene index and how it differs from a database index: the dictionary structure, segments and merging. He then moved on to topics like the structure of the posting list, term vectors, the FST terms index and the difference between stored fields and doc values. This is a talk full of interesting details on the internal workings of Lucene and the implications for the performance of your application.

Conclusion

As said before, I couldn't attend all the talks I would have liked to. I especially heard good things about the following talks, which I will watch as soon as they are available:

  • Integrate Solr with Real-Time Stream Processing Applications by Timothy Potter
  • The Typed Index by Christoph Goller
  • Implementing a Custom Search Syntax Using Solr, Lucene and Parboiled by John Berryman

I really enjoyed Lucene Solr Revolution. Not only were there a lot of interesting talks to listen to, but it was also a good opportunity to meet new people. On both evenings there were get-togethers with free drinks and food, which must have cost LucidWorks a fortune. I couldn't attend the closing remarks, but I heard they announced that they want to move to smaller, national events in Europe instead of one central conference. I hope those will still be events that attract so many committers and interesting people.

Donnerstag, 31. Oktober 2013

Switch Off Legacy Code Violations in SonarQube

While I don't believe in putting numbers on source code quality, SonarQube (formerly known as Sonar) can be a really useful tool during development. It enforces a consistent style across your team, has discovered several possible bugs for me and is a great tool to learn from: you can browse the violations and see why a certain expression or code block can be a problem.

To make sure that your code base stays in a consistent state you can also go as far as mandating that there should be no violations in the code developers check in. One of the problems with this is that a lot of projects are not green-field projects and you have a lot of existing code. If your violation count is already high, it is difficult to judge whether new violations were introduced.

In this post I will show you how you can start with zero violations for existing code without touching the sources, something I was inspired to do by Jens Schauder in his great talk Working with Legacy Teams. We will ignore all violations based on the line in the file, so if anybody touches the file the violations will show up again and the developer is responsible for fixing the legacy violations.

The Switch Off Violations Plugin

We are using the Switch Off Violations plugin for SonarQube. It can be configured with different exclusion patterns for the issues. You can define regular expressions for code blocks that should be ignored, or deactivate violations entirely or on a per-file or per-line basis.

For existing code you want to ignore all violations for certain files and lines. This can be done by inserting something like this in the Exclusion patterns text area:

de.fhopf.akka.actor.IndexingActor;pmd:SignatureDeclareThrowsException;[23]

This will exclude the violation for throwing raw Exceptions in line 23 of the IndexingActor class. When analyzing the code again, this violation will be ignored.

Retrieving violations via the API

Besides the nice dashboard, SonarQube also offers an API that can be used to retrieve all the violations for a project. If you are not keen to look up all existing violations in your code base and insert them by hand, you can use it to generate the exclusion patterns automatically. All of the violations can be found at /api/violations, e.g. http://localhost:9000/api/violations.

I am sure there are other ways to do it, but I used jsawk to parse the JSON response. (On Ubuntu you have to install SpiderMonkey instead of the default js interpreter. And you have to compile it yourself. And I had to use a specific version. Sigh.)

Once you have set up all the components you can now use jsawk to create the exclusion patterns for all existing violations:

curl -XGET 'http://localhost:9000/api/violations?depth=-1' | ./jsawk -a 'return this.join("\n")' 'return this.resource.key.split(":")[1] + ";*;[" + this.line + "]"' | sort | uniq

This will produce a list that can just be pasted into the text area of the Switch Off Violations plugin or checked in to the repository as a file. With the next analysis you will then hopefully see zero violations. When somebody changes a file by inserting a line, the violations will be shown again and should be fixed. Unfortunately some violations are not line based and will yield the line number 'undefined'. Currently I just removed those manually, so you still might see some violations.
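If you can't get jsawk to run, the transformation itself is trivial in any language. Here is a hedged sketch in plain Java that turns (resource key, line) pairs, as delivered by /api/violations, into exclusion patterns; the hard-coded input and the "myproject" prefix are made up for illustration, and violations without a line number are skipped, mirroring the manual cleanup described above:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ExclusionPatterns {

    // Build "class;*;[line]" patterns from pairs of { resourceKey, line },
    // where resourceKey looks like "myproject:de.fhopf.Foo".
    static List<String> patterns(List<String[]> violations) {
        List<String> result = new ArrayList<String>();
        for (String[] v : violations) {
            String resourceKey = v[0];
            String line = v[1];
            if (line == null) {
                continue; // violations without a line number are handled manually
            }
            String className = resourceKey.substring(resourceKey.indexOf(':') + 1);
            result.add(className + ";*;[" + line + "]");
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> violations = Arrays.asList(
                new String[] { "myproject:de.fhopf.akka.actor.IndexingActor", "23" },
                new String[] { "myproject:de.fhopf.akka.actor.Master", null });
        for (String p : patterns(violations)) {
            System.out.println(p);
        }
    }
}
```

The splitting on ":" is the same trick the jsawk one-liner uses to strip the project part from the resource key.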

Conclusion

I presented one way to reset your legacy code base to zero violations. With SonarQube 4.0 the functionality of the Switch Off Violations plugin will be available in the core, so it will be easier to use. I am still looking for the best way to keep the exclusion patterns up to date. Once somebody has fixed the violations for an existing file, the pattern should be removed.

Update 09.01.2014

Starting with SonarQube 4 this approach doesn't work anymore. Some features of the SwitchOffViolations plugin have been moved to the core, but excluding violations by line is no longer possible and will not be implemented. The developers recommend only looking at the trends of the project and not the overall violation count. This can be done nicely using the differentials.

Freitag, 25. Oktober 2013

Elasticsearch at Scale - Kiln and GitHub

Most of us are not exposed to data at real scale. It is getting more common, but I still appreciate that more progressive companies that have to fight with large volumes of data are open about it and talk about their problems and solutions. GitHub and Fog Creek are two of the larger users of Elasticsearch, and both have published articles and interviews on their setups. It's interesting that both of these companies are using it for a very specialized use case: source code search. As I have recently read the article on Kiln as well as the interview with the folks at GitHub, I'd like to summarize some of the points they made. Visit the original links for in-depth information.

Elasticsearch at Fog Creek for Kiln

In this article on InfoQ, Kevin Gessnar, a developer at Fog Creek, describes the process of migrating the code search of Kiln to Elasticsearch.

Initial Position

Kiln allows you to search on commit messages, filenames and file contents. For commit messages and filenames they were initially using the full text search features of SQL Server. For the file content search they were using a tool called OpenGrok that leverages Ctags to analyze the code and stores it in a Lucene index. This provided them with all of the features they needed, but unfortunately the solution couldn't scale with their requirements. Queries took several seconds, up to the timeout value of 30 seconds.

It's interesting to see that they decided against Solr because of poor read performance under heavy writes. It would be interesting to see if this is still the case for current versions.

Scale

They are indexing several million documents every day, which comes to terabytes of data. They are still running their production system on only two nodes. These are numbers that really surprised me. I would have guessed that you need more nodes for this amount of data (well, probably those are really big machines). They only seem to be using Elasticsearch for indexing and search, and retrieve the data for displaying results from their primary storage layer.

Elasticsearch at GitHub

Andrew Cholakian, who is doing a great job writing his book Exploring Elasticsearch in the open, published an interview with Tim Pease and Grant Rodgers of GitHub on their Elasticsearch setup, going through a lot of details.

Initial Position

GitHub used to have their search based on Solr. As the volume of data and searches increased, they needed a solution that scales. Again, I would be interested whether current versions of SolrCloud could handle this volume.

Scale

They are really searching big data: 44 Amazon EC2 instances power search on 2 billion documents, which make up 30 terabytes of data. 8 instances don't hold any data but are only there to distribute the queries. They are planning to move from the 44 Amazon instances to 8 larger physical machines. Besides their user-facing data, they are indexing internal data like audit logs and exceptions (it isn't clear to me from the interview whether Elasticsearch is the primary data store in this case, which would be remarkable). They are using different clusters for different data types so that the external search is not affected when there are a lot of exceptions.

Challenges

Shortly after launching their new search feature, people started discovering that you could also search for files people had accidentally committed, like private ssh keys or passwords. This is an interesting phenomenon where just the possibility of better retrieval made a huge difference. All the information had been there before, but it just couldn't be found easily. This led to an increase in search volume that was not anticipated. Due to some configuration issues (a suboptimal Java version, no setting for the minimum number of master nodes) their cluster became unstable and they had to disable search for the whole site.

Further Takeaways
  • Use routing to keep your data together on one shard
  • Thrift seems to be far more complicated from an ops point of view compared to HTTP
  • Use the slow query log
  • Time slicing your indices is a good idea if the data allows
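
The routing takeaway can be made concrete with a small sketch: Elasticsearch decides which shard a document lands on by hashing the routing value (by default the document id), so everything that shares a routing value ends up on the same shard and a query for it only needs to hit that shard. The hash function below is not the one Elasticsearch actually uses; it is only meant to illustrate the idea:

```java
public class RoutingSketch {

    // Simplified version of how a routing value is mapped to a shard.
    // Elasticsearch uses its own hash function internally; String.hashCode()
    // is only used here for illustration.
    static int shardFor(String routing, int numberOfShards) {
        return Math.floorMod(routing.hashCode(), numberOfShards);
    }

    public static void main(String[] args) {
        int shards = 5;
        // All documents routed by the same user id end up on one shard ...
        System.out.println(shardFor("user-42", shards) == shardFor("user-42", shards));
        // ... so a query routed by that user id only has to hit a single shard.
        System.out.println("user-42 -> shard " + shardFor("user-42", shards));
    }
}
```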

A Common Theme

Both of these articles have some observations in common:

  • Elasticsearch is easy to get started with
  • Scaling is not an issue
  • The HTTP interface is good for debugging and operations
  • The Elasticsearch community and the company are really helpful when it comes to problems

Freitag, 11. Oktober 2013

Cope with Failure - Actor Supervision in Akka

A while ago I showed an example of how to use Akka to scale a simple application with multiple threads. Tasks can be split across several actors that communicate via immutable messages. State is encapsulated and each actor can be scaled independently. While implementing an actor you don't have to take care of low-level building blocks like threads and synchronization, so it is far easier to reason about the application.

Besides these obvious benefits, fault tolerance is another important aspect. In this post I'd like to show you how you can leverage some of Akka's characteristics to make our example more robust.

The Application

To recap, we are building a simple web site crawler in Java that indexes pages in Lucene. The full code of the examples is available on GitHub. We are using three actors: one that keeps track of which pages are still to be visited and which have been visited already, one that downloads and parses the pages, and one that indexes the pages in Lucene.

By using several actors to download and parse pages we could see some good performance improvements.

What could possibly go wrong?

Things will fail. We are relying on an external service (the page we are crawling) and therefore on the network. Requests could time out or our parser could choke on the input. To make our example somewhat reproducible I simply simulated an error: a new PageRetriever, the ChaosMonkeyPageRetriever, sometimes just throws an exception:

@Override
public PageContent fetchPageContent(String url) {
    // this error rate is derived from scientific measurements
    if (System.currentTimeMillis() % 20 == 0) {
        throw new RetrievalException("Something went horribly wrong when fetching the page.");
    }
    return super.fetchPageContent(url);
}

You can surely imagine what happens when we use this retriever in the sequential example that uses neither Akka nor threads: as we didn't take care of the failure, our application just stops when the exception occurs. One way to mitigate this is to surround statements with try/catch blocks, but this will soon intermingle a lot of recovery and fault-handling code with our application logic. And once the application runs in multiple threads, fault handling gets a lot harder: there is no easy way to notify other threads or to save the state of the failing thread.
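
To make that intermingling concrete, here is a sketch of what the sequential variant starts to look like once every call site guards the fetch itself. The types and the failing retriever are simplified stand-ins for the classes from the example project, not the actual code:

```java
public class NaiveRetry {

    static class RetrievalException extends RuntimeException {
        RetrievalException(String message) { super(message); }
    }

    // Stand-in for the page retriever of the example project: fails the
    // first two times to simulate the chaos monkey deterministically.
    private static int calls = 0;

    static String fetchPageContent(String url) {
        if (++calls < 3) {
            throw new RetrievalException("Something went horribly wrong when fetching the page.");
        }
        return "content of " + url;
    }

    // Recovery logic starts to crowd out the business logic: every call
    // site needs its own retry loop and error handling like this.
    static String fetchWithRetry(String url, int maxAttempts) {
        RetrievalException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return fetchPageContent(url);
            } catch (RetrievalException e) {
                last = e;
            }
        }
        throw last;
    }

    public static void main(String[] args) {
        System.out.println(fetchWithRetry("http://example.com", 5));
    }
}
```

And this still doesn't help once the work is spread across threads, which is exactly where supervision comes in.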

Supervision

Let's see Akka's behavior in case of an error. I added some logging that indicates the current state of the visited pages.

1939 [default-akka.actor.default-dispatcher-5] INFO de.fhopf.akka.actor.Master - inProgress:  55, allPages:  60
1952 [default-akka.actor.default-dispatcher-4] INFO de.fhopf.akka.actor.Master - inProgress:  54, allPages:  60
[ERROR] [10/10/2013 06:47:39.752] [default-akka.actor.default-dispatcher-5] [akka://default/user/$a/$a] Something went horribly wrong when fetching the page.
de.fhopf.akka.RetrievalException: Something went horribly wrong when fetching the page.
        at de.fhopf.akka.actor.parallel.ChaosMonkeyPageRetriever.fetchPageContent(ChaosMonkeyPageRetriever.java:21)
        at de.fhopf.akka.actor.PageParsingActor.onReceive(PageParsingActor.java:26)
        at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:167)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
        at akka.actor.ActorCell.invoke(ActorCell.scala:456)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

1998 [default-akka.actor.default-dispatcher-8] INFO de.fhopf.akka.actor.Master - inProgress:  53, allPages:  60
2001 [default-akka.actor.default-dispatcher-12] INFO de.fhopf.akka.actor.PageParsingActor - Restarting PageParsingActor because of class de.fhopf.akka.RetrievalException
2001 [default-akka.actor.default-dispatcher-2] INFO de.fhopf.akka.actor.PageParsingActor - Restarting PageParsingActor because of class de.fhopf.akka.RetrievalException
2001 [default-akka.actor.default-dispatcher-10] INFO de.fhopf.akka.actor.PageParsingActor - Restarting PageParsingActor because of class de.fhopf.akka.RetrievalException
[...]
2469 [default-akka.actor.default-dispatcher-12] INFO de.fhopf.akka.actor.Master - inProgress:   8, allPages:  78
2487 [default-akka.actor.default-dispatcher-7] INFO de.fhopf.akka.actor.Master - inProgress:   7, allPages:  78
2497 [default-akka.actor.default-dispatcher-5] INFO de.fhopf.akka.actor.Master - inProgress:   6, allPages:  78
2540 [default-akka.actor.default-dispatcher-13] INFO de.fhopf.akka.actor.Master - inProgress:   5, allPages:  78

We can see each exception that happens in the log file, but our application keeps running. That is because of Akka's supervision support. Actors form hierarchies: our PageParsingActor is a child of the Master actor because it is created from its context. The Master is responsible for determining the fault strategy for its children. By default it will restart an actor in case of an exception, which makes sure that the next message is processed correctly. So even in case of an error Akka tries to keep the system in a running state.

The reaction to a failure is determined by the method supervisorStrategy() in the parent actor. Based on the exception class you can choose between several outcomes:

  • resume: Keep the actor running as if nothing had happened
  • restart: Replace the failing actor with a new instance
  • stop: Stop the failing actor
  • escalate: Let your own parent decide on what to do

A supervisor that restarts the actor for our exception and escalates otherwise can be added like this:

// allow 100 restarts in 1 minute ... this is a lot but the chaos monkey is rather busy
private SupervisorStrategy supervisorStrategy = new OneForOneStrategy(100, Duration.create("1 minute"),
        new Function<Throwable, Directive>() {

    @Override
    public Directive apply(Throwable t) throws Exception {
        if (t instanceof RetrievalException) {
            return SupervisorStrategy.restart();
        }
        // it would be best to model the default behaviour in other cases
        return SupervisorStrategy.escalate();
    }

});

@Override
public SupervisorStrategy supervisorStrategy() {
    return supervisorStrategy;
}

Let's come back to our example. Though Akka takes care of restarting our failing actors, the end result doesn't look good: the application continues to run after several exceptions but then just stops and hangs. This is caused by our business logic. The Master actor keeps all pages to visit in the VisitedPageStore and only commits the Lucene index once all pages are visited. As several retrievals failed we never received the results for those pages and the Master keeps waiting.

One way to fix this is to resend the message once the actor is restarted. Each Actor class can implement methods that hook into the actor's lifecycle; in preRestart() we can just send the message again:

@Override
public void preRestart(Throwable reason, Option<Object> message) throws Exception {
    logger.info("Restarting PageParsingActor and resending message '{}'", message);
    if (message.nonEmpty()) {
        getSelf().forward(message.get(), getContext());
    }
    super.preRestart(reason, message);
}

Now if we run this example we can see our actors recover from the failures. Though some exceptions are still happening, all pages get visited eventually and everything will be indexed and committed in Lucene.

Though resending seems to be the solution to our failures, you need to be careful not to break your system with it: for some applications the message itself might be the cause of the failure, and by resending it you keep your system busy with it in a livelock state. When using this approach you should at least add a count to the message that you increment on restart. Once the message has been resent too often you can escalate the failure to have it handled in a different way.
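
A minimal sketch of such a message with a retry count. This is plain Java, independent of the example project; the class and method names are made up for illustration:

```java
public class FetchPageMessage {

    private final String url;
    private final int retries;

    public FetchPageMessage(String url) {
        this(url, 0);
    }

    private FetchPageMessage(String url, int retries) {
        this.url = url;
        this.retries = retries;
    }

    public String getUrl() { return url; }

    public int getRetries() { return retries; }

    // Akka messages should be immutable, so incrementing the count
    // means creating a copy instead of modifying the message.
    public FetchPageMessage retried() {
        return new FetchPageMessage(url, retries + 1);
    }

    // Checked before resending; once this returns true the failure
    // should be escalated instead.
    public boolean exceeded(int maxRetries) {
        return retries >= maxRetries;
    }

    public static void main(String[] args) {
        FetchPageMessage message = new FetchPageMessage("http://example.com");
        System.out.println("retries after two restarts: " + message.retried().retried().getRetries());
    }
}
```

In preRestart() you would then forward message.retried() instead of the original message and escalate once exceeded() returns true.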

Conclusion

We have only handled one specific type of failure, but you can already see how powerful Akka can be when it comes to fault tolerance: recovery code is completely separated from the business code. To learn more about the different aspects of error handling, read the Akka documentation on supervision and fault tolerance or this excellent article by Daniel Westheide.

Freitag, 4. Oktober 2013

Brian Foote on Prototyping

Big Ball of Mud is a collection of patterns by Brian Foote, published in 1999. The title stems from one of the patterns, Big Ball of Mud, the "most frequently deployed of software architectures". Though this might sound like a joke at first, the article contains a lot of really useful information on the forces at work when dealing with large codebases and legacy code. I especially like his take on prototyping applications.

Did you ever find yourself in the following situation? A customer agrees to have a prototype of an application built, to learn and to see something in action. After the prototype is finished the customer tries to force you to reuse it, "because it already does what we need". Cold sweat: you probably took lots of shortcuts in the prototype and didn't build it with maintainability in mind. This is what Brian recommends to avoid such a situation:

One way to minimize the risk of a prototype being put into production is to write the prototype using a language or tool that you couldn't possibly use for a production version of your product.

Three observations:

  1. Nowadays the choice of language doesn't matter that much for running the code in production, as virtual machines support a lot of languages.
  2. This only holds true for prototypes that are used to explore the domain. When doing a technical proof of concept at least some parts of the prototype need to use the intended technology.
  3. Prototypes are sometimes also used to make the team familiar with a new technology that is set for the project.

Nevertheless this is really useful advice to keep in mind.

Freitag, 27. September 2013

Feature Toggles in JSP with Togglz

Feature Toggles are a useful pattern when you are working on several features but want to keep your application in a deployable state. One of the implementations of the pattern available for Java is Togglz. It provides ways to check if a feature is enabled programmatically, from JSF or JSP pages or even when wiring Spring beans. I couldn't find a single example on how to use the JSP support so I created an example project and pushed it to GitHub. In this post I will show you the basics of Togglz and how to use it in Java Server Pages.

Togglz

Features that you want to make configurable are described with a Java Enum. This is an example with two features that can be enabled or disabled:

public enum ToggledFeature implements Feature {

    TEXT,
    MORE_TEXT;

    public boolean isActive() {
        return FeatureContext.getFeatureManager().isActive(this);
    }
}

This Enum can then be used to check if a feature is enabled in any part of your code:

if (ToggledFeature.TEXT.isActive()) {
    // do something clever
}

The config class is used to wire the feature enum with a configuration mechanism:

public class ToggledFeatureConfiguration implements TogglzConfig {

    public Class<? extends Feature> getFeatureClass() {
        return ToggledFeature.class;
    }

    public StateRepository getStateRepository() {
        return new FileBasedStateRepository(new File("/tmp/features.properties"));
    }

    public UserProvider getUserProvider() {
        return new ServletUserProvider("ADMIN_ROLE");
    }
}

The StateRepository is used for enabling and disabling features. We are using a file based one but there are others available.

To configure Togglz for your webapp you can either do it using CDI, Spring or via manual configuration in the web.xml:

<web-app xmlns="http://java.sun.com/xml/ns/javaee"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://java.sun.com/xml/ns/javaee http://java.sun.com/xml/ns/javaee/web-app_3_0.xsd"
    version="3.0">

    <context-param>
        <param-name>org.togglz.core.manager.TogglzConfig</param-name>
        <param-value>de.fhopf.togglz.ToggledFeatureConfiguration</param-value>
    </context-param>

    <filter>
        <filter-name>TogglzFilter</filter-name>
        <filter-class>org.togglz.servlet.TogglzFilter</filter-class>
    </filter>
    <filter-mapping>
        <filter-name>TogglzFilter</filter-name>
        <url-pattern>/*</url-pattern>
    </filter-mapping>

</web-app>

In my example I had to add the filter manually, though with Servlet 3.0 this shouldn't be necessary. I am not sure whether this is caused by the way Gradle runs Jetty or whether it is always the case when doing the configuration via a context-param.

Togglz with Java Server Pages

For the integration of Togglz in JSPs you need to add the dependency togglz-jsp to your project. It contains a tag that can be used to group code which can then be enabled or disabled. A simple example for our ToggledFeature:

<%@ taglib uri="http://togglz.org/taglib" prefix="togglz" %>

This is some text that is always shown.

<togglz:feature name="TEXT">
This is the text of the TEXT feature.
</togglz:feature>

<togglz:feature name="MORE_TEXT">
This is the text of the MORE_TEXT feature.
</togglz:feature>

Both features are disabled by default, so you will only see the first sentence. You can control which features are enabled (even at runtime) in /tmp/features.properties. This is what the file looks like when the TEXT feature is enabled:

TEXT=true
MORE_TEXT=false
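
Under the hood this is just a java.util.Properties file. A sketch of what the lookup could do, using a simplified stand-in rather than Togglz's actual FileBasedStateRepository implementation:

```java
import java.io.StringReader;
import java.util.Properties;

public class FeatureStateSketch {

    private final Properties properties;

    public FeatureStateSketch(Properties properties) {
        this.properties = properties;
    }

    // A feature is enabled if its entry reads "true"; features that are
    // missing from the file default to disabled, as in the example above.
    public boolean isActive(String featureName) {
        return Boolean.parseBoolean(properties.getProperty(featureName, "false"));
    }

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // in the real setup this would be loaded from /tmp/features.properties
        props.load(new StringReader("TEXT=true\nMORE_TEXT=false"));
        FeatureStateSketch state = new FeatureStateSketch(props);
        System.out.println("TEXT active: " + state.isActive("TEXT"));
        System.out.println("MORE_TEXT active: " + state.isActive("MORE_TEXT"));
    }
}
```

Because Togglz re-reads the state repository, flipping a value in the file takes effect without redeploying the application.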

A Word of Caution

I am just starting to use feature toggles in an application, so I wouldn't call myself experienced. But I have the impression that you need to be really disciplined when using them: old feature toggles that are no longer used should be removed as soon as possible. Unfortunately, the huge benefit of compile-time safety in Java when removing a feature from the enum is gone with JSPs; the names of the features are only Strings, so you will have to do some file searches when removing a feature.

Freitag, 20. September 2013

Kibana and Elasticsearch: See What Tweets Can Say About a Conference

In my last post I showed how you can index tweets for an event in Elasticsearch and how to do some simple queries on it using its HTTP API. This week I will show how you can use Kibana 3 to visualize the data and make it explorable without having to learn the Elasticsearch API.

Installing Kibana

Kibana 3 is a pure HTML/JS frontend for Elasticsearch that you can use to build dashboards for your data. We are still working with the example data that is indexed using the Twitter River. It consists of tweets about FrOSCon but can be anything, especially data that contains some kind of timestamp, as is the case for tweets. To install Kibana you can just fetch it from the GitHub repository (note: there are now also prepackaged archives available that you can download without cloning the repository):

git clone https://github.com/elasticsearch/kibana.git

You will now have a folder kibana that contains the HTML files as well as all the assets needed. The files need to be served by a webserver, so you can just copy the folder into a directory that e.g. Apache is serving. If you don't have a webserver installed you can simply serve the current directory using Python:

python -m SimpleHTTPServer 8080

This will make Kibana available at http://localhost:8080/kibana/src. With the default configuration Elasticsearch needs to be running on the same machine as well.

Dashboards

A dashboard in Kibana consists of rows that can contain different panels. Each panel can either display data, control which data is being displayed or both. Panels do not stand on their own; the results that are getting displayed are the same for the whole dashboard. So if you choose something in one panel you will notice that the other panels on the page will also get updated with new values.

When accessing Kibana you are directed to a welcome page from where you can choose between several dashboard templates. As Kibana is often used for logfile analytics there is an existing dashboard that is preconfigured to work with Logstash data. Another generic dashboard can be used to query some data from the index but we'll use the option "Unconfigured Dashboard" which gives some hints on which panels you might want to have.

This will present you with a dashboard that contains some rows and panels already.

Starting from the top it contains these rows:

  • The "Options" row that contains one text panel
  • The "Query" row that contains a text query panel
  • A hidden "Filter" row that contains a text panel and the filter panel. The row can be toggled visible by clicking on the text Filter on the left.
  • The "Graph" row with two text panels
  • The large "Table" row with one text panel

Those panels are already laid out in a way that they can display the widgets that are described in the text. We will now add those to get some data from the event tweets.

Building a Dashboard

The text panels are only there to guide you when adding the widgets you need and can then be removed. To add or remove panels for a row you can click the little gear next to the title of the row. This will open an options menu. For the top row we are choosing a timepicker panel with a default mode of absolute. This gives you the opportunity to choose a begin and end date for your data. The field that contains the timestamp is called "created_at". After saving you can also remove the text panel on the second tab.

If you now open the "Filters" row you will see that there now is a filter displayed. It is best to keep this row open to see which filters are currently applied. You can remove the text panel in the row.

In the graph section we will add two graph panels instead of the text panels: a pie chart that displays the terms of the tweet texts and a date histogram that shows how many tweets there are for a certain time. For the pie chart we use the field "text" and again exclude some common terms. Note that if you add terms to the excluded terms after the panel has been created, you need to initiate another query, e.g. by clicking the button in the timepicker. For the date histogram we again choose the timestamp field "created_at".

Finally, in the last row we add a table to display the resulting tweet documents. Besides adding the columns "text", "user.screen_name" and "created_at" we can leave the settings as proposed.

We now have a dashboard to play with the data and see the results immediately. Data can be explored by using any of the displays, you can click in the pie chart to choose a certain term or choose a time range in the date histogram. This makes it really easy to work with the data.

Answering questions

Now we have a visual representation of all the terms and the time of day people are tweeting most. As you can see, people are tweeting slightly more during the beginning of the day.

You can now check for any relevant terms that you might be interested in. For example, let's see when people tweet about beer. As we have tweets in multiple languages (German, English and people from Cologne) we need to add some variation. We can enter the query

text:bier* OR text:beer* OR text:kölsch
in the query box.

There are only a few tweets about it, but it will be a total surprise to you that most of the tweets about beer tend to be sent later in the day (I won't go into detail on why there are so many tweets mentioning the terms horse and piss when talking about Kölsch).

Some more surprising facts: there is not a single tweet mentioning Java but a lot of tweets that mention PHP, especially during the first day. That day seems to have been far more successful for the PHP dev room.

Summary

I hope that I could give you some hints on how powerful Kibana can be when it comes to analytics of data, not only with log data. If you'd like to read another detailed step by step guide on using Kibana to visualize Twitter data have a look at this article by Laurent Broudoux.

Mittwoch, 11. September 2013

Simple Event Analytics with ElasticSearch and the Twitter River

Tweets can say a lot about an event. The hashtags that are used and the times at which people tweet can be interesting to look at. Some of the questions you might want answers to:

  • Who tweeted the most?
  • What are the dominant keywords/hashtags?
  • When is the time people are tweeting the most?
  • And, most importantly: Is there a correlation between the time and the amount of tweets mentioning coffee or beer?

During this year's FrOSCon I indexed all relevant tweets in ElasticSearch using the Twitter River. In this post I'll show you how you can index tweets in ElasticSearch to get a dataset you can do analytics with. We will see how we can get answers to the first two questions using the ElasticSearch Query DSL. Next week I will show how Kibana can help you get a visual representation of the data.

Indexing Tweets in ElasticSearch

To run ElasticSearch you need to have a recent version of Java installed. Then you can just download the archive and unpack it. It contains a bin directory with the necessary scripts to start ElasticSearch:

bin/elasticsearch -f

-f takes care that ElasticSearch starts in the foreground, so you can also stop it using Ctrl-C. You can see whether your installation is working by calling http://localhost:9200 in your browser.

After stopping it again we need to install the ElasticSearch Twitter River that uses the Twitter streaming API to get all the tweets we are interested in.

bin/plugin -install elasticsearch/elasticsearch-river-twitter/1.4.0

Twitter doesn't allow anonymous access to its API anymore, so you need to register for OAuth access at https://dev.twitter.com/apps. Choose a name for your application and generate the key and token. Those will be needed to configure the plugin via the REST API. In the configuration you pass your OAuth information as well as any keyword you would like to track and the index that should be used to store the data.

curl -XPUT localhost:9200/_river/frosconriver/_meta -d '
{
    "type" : "twitter",
    "twitter" : {
        "oauth" : {
            "consumer_key" : "YOUR_KEY",
            "consumer_secret" : "YOUR_SECRET",
            "access_token" : "YOUR_TOKEN",
            "access_token_secret" : "YOUR_TOKEN_SECRET"
        },
        "filter" : {
            "tracks" : "froscon"
        }
    },
    "index" : {
        "index" : "froscon",
        "type" : "tweet",
        "bulk_size" : 1
    }
}
'

The index doesn't need to exist yet, it will be created automatically. I am using a bulk size of 1 as there aren't really many tweets. If you are indexing a lot of data you might consider setting this to a higher value.

After issuing the call you should see some information in the logs that the river is starting and receiving data. You can see how many tweets there are in your index by issuing a count query:

curl 'localhost:9200/froscon/_count?pretty=true'

You can see the basic structure of the documents created by looking at the mapping that is created automatically.

http://localhost:9200/froscon/_mapping?pretty=true

The result is quite long so I am not replicating it here but it contains all the relevant information you might be interested in like the user who tweeted, the location of the user, the text, the mentions and any links in it.

Doing Analytics Using the ElasticSearch REST API

Once you have enough tweets indexed you can already do some analytics using the ElasticSearch REST API and the Query DSL. This requires you to have some understanding of the query syntax but you should be able to get started by skimming through the documentation.

Top Tweeters

First, we'd like to see who tweeted the most. This can be done by querying for all documents and faceting on the user name. This gives us the names and counts in a separate section of the response.

curl -X POST "http://localhost:9200/froscon/_search?pretty=true" -d '
  {
    "size": 0,
    "query" : {
      "match_all" : {}
    },
    "facets" : {
      "user" : { 
        "terms" : {
          "field" : "user.screen_name"
        } 
      }                            
    }
  }
'

Those are the top tweeters for FrOSCon:

Dominant Keywords

The dominant keywords can also be retrieved using a facet query, this time on the text of the tweet. As there are a lot of German tweets for FrOSCon and the text field is processed using the StandardAnalyzer, which only removes English stopwords, it might be necessary to exclude some terms. You might also want to remove some other common terms that indicate retweets or are part of URLs.

curl -X POST "http://localhost:9200/froscon/_search?pretty=true" -d '
  {
    "size": 0,
    "query" : {
      "match_all" : {}
    },
    "facets" : {
      "keywords" : { 
        "terms" : {
          "field" : "text", 
          "exclude" : ["froscon", "rt", "t.co", "http", "der", "auf", "ich", "my", "die", "und", "wir", "von"] 
        }
      }                            
    }
  }
'

Those are the dominant keywords for FrOSCon:

  • talk (no surprise for a conference)
  • slashme
  • teamix (a company that does very good marketing; unfortunately in this case it's more because their fluffy Tux got stolen. The tweet about it is the most retweeted tweet in the data.)
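
What the terms facet computes can be mimicked in a few lines of plain Java: tokenize the texts, drop the excluded terms, count and sort by frequency. This is only an illustration of the concept, not how Elasticsearch implements it:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class TermsFacetSketch {

    // Count term frequencies across the texts, skipping excluded terms,
    // and return the terms ordered by descending count.
    static List<String> topTerms(List<String> texts, Set<String> exclude) {
        Map<String, Integer> counts = new HashMap<>();
        for (String text : texts) {
            for (String term : text.toLowerCase().split("\\W+")) {
                if (term.isEmpty() || exclude.contains(term)) {
                    continue;
                }
                counts.merge(term, 1, Integer::sum);
            }
        }
        List<String> terms = new ArrayList<>(counts.keySet());
        terms.sort((a, b) -> counts.get(b) - counts.get(a));
        return terms;
    }

    public static void main(String[] args) {
        List<String> tweets = Arrays.asList(
                "RT great talk at froscon",
                "talk about open source at froscon");
        Set<String> exclude = new HashSet<>(Arrays.asList("froscon", "rt", "at"));
        System.out.println(topTerms(tweets, exclude));
    }
}
```

The real facet of course works on the terms stored in the inverted index, so it sees exactly what the analyzer produced at index time.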

Summary

Using the Twitter River it is really easy to get some data into ElasticSearch. The Query DSL makes it easy to extract some useful information. Next week we will have a look at Kibana that doesn't necessarily require a deep understanding of the ElasticSearch queries and can visualize our data.

Mittwoch, 4. September 2013

Developing with CoreMedia

A while ago I had the chance to attend a training on web development with CoreMedia. It's a quite enterprisey commercial Content Management System that powers large corporate websites like telekom.com as well as news sites like Bild.de (well, you can't hold CoreMedia responsible for the kind of "content" people put into their system). As I have been working with different Java-based Content Management Systems over the years, I was really looking forward to learning about a system I had heard really good things about. In this post I'll describe the basic structure of the system as well as what it feels like to develop with it.

System Architecture

As CoreMedia is built to scale to really large sites, the architecture is built around redundant and distributed components. The part of the system the editors are working on is separated from the parts that serve the content to the internet audience. A publication process copies the content from the editorial system to the live system.

The heart of CoreMedia is the Content Server. It stores all the content in a database and makes it retrievable. You rarely access it directly but only via other applications that talk to it in the background via CORBA. Editors used to work with CoreMedia using a Java client (formerly called the Editor, now known as the Site Manager); starting with CoreMedia 7 there is also the web-based Studio that is used to create and edit content. A preview application can be used to see what the site looks like before it is published. Workflows, managed by the Workflow Server, can be used to control the processes around editing as well as publication.

The live system consists of several components that are mostly laid out in a redundant way. There is one Master Live Server as well as 0 to n Replication Live Servers that are used for distributing the load as well as fault tolerance. The Content Management Servers are accessed from the Content Application Engine (CAE) that contains all the delivery and additional logic for your website. One or more Solr instances are used to provide the search services for your application.

Document Model

The document model for your application describes the content types that are available in the system. CoreMedia provides a blueprint application with a generic document model that can be used as a basis for your application, but you are also free to build something completely different. The document model is used throughout the whole system as it describes the way your content is stored. The model is object-oriented in nature, with documents that consist of attributes. There are six attribute types, like String (fixed-length strings), XML (variable-length strings) and Blob (binary data), that form the basis of all your types. An XML configuration file is used to describe your specific document model. This is an example of an article that contains a title, the text and a list of related articles:

<DocType Name="Article">
  <StringProperty Name="title"/>
  <XmlProperty Grammar="coremedia-richtext-1.0" Name="text"/>
  <LinkListProperty LinkType="Article" Name="related"/>
</DocType>

Content Application Engine

Most of the code you will be writing is delivery code that is part of the Content Application Engine, either for the preview or for the live site. This is a standard Java webapp that is assembled from different Maven-based modules. CAE code is heavily based on Spring MVC, with the CoreMedia-specific View Dispatcher that takes care of rendering the different documents. The document model is made available through so-called Contentbeans that can be generated from the document model. Contentbeans access the content on demand and can contain additional business logic, so they are not POJOs but rather active objects, similar to Active Record entities in the Rails world.

Our example above would translate to a Contentbean with getters for the title (a java.lang.String), the text (a com.coremedia.xml.Markup) and a getter for a java.util.List that is typed to de.fhopf.Article.
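
As a plain-Java sketch (leaving out the generated base classes and using String in place of com.coremedia.xml.Markup), the accessors would look roughly like this. In a real CAE the class would extend a generated base class and fetch its values from the Content Server on demand instead of holding them in fields:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Simplified stand-in for the generated Contentbean of the Article DocType.
public class Article {

    private final String title;
    private final String text; // com.coremedia.xml.Markup in CoreMedia
    private final List<Article> related;

    public Article(String title, String text, List<Article> related) {
        this.title = title;
        this.text = text;
        this.related = new ArrayList<>(related);
    }

    public String getTitle() { return title; }

    public String getText() { return text; }

    public List<Article> getRelated() {
        return Collections.unmodifiableList(related);
    }

    public static void main(String[] args) {
        Article other = new Article("Other", "<p>more</p>", Collections.<Article>emptyList());
        Article article = new Article("Title", "<p>text</p>", Collections.singletonList(other));
        System.out.println(article.getTitle() + " has " + article.getRelated().size() + " related article(s)");
    }
}
```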

Rendering of the Contentbeans happens in JSPs that are named according to classes or interfaces, with a specific logic to determine which JSP should be used. An object Article that resides in the package de.fhopf would be rendered by the JSP at the path de/fhopf/Article.jsp; if you want to add a special rendering mechanism for List, this would be java/util/List.jsp. Different renderings of an object can be selected using a view name: an Article that is rendered as a link would be de/fhopf/Article.link.jsp.

This is done using one of the custom Spring components of CoreMedia, the View Dispatcher, a View Resolver that determines the correct view to be invoked for a certain model based on the content element in the Model. The JSP that is used can then contain further includes on other elements of the content, be it documents in the sense of CoreMedia or one of the attributes that are available. Those includes are again routed through the View Dispatcher.

Let's see an example for rendering the list of related articles for an article. Say you call the CAE with a certain content id, that is an Article. The standard mechanism routes this request to the Article.jsp described above. It might contain the following fragment to include the related articles:

<cm:include self="${self.related}"/>

Note that we do not specify which JSP to include. CoreMedia automatically figures out that we are including a List, for example a java.util.ArrayList. As there is no JSP available at java/util/ArrayList.jsp, CoreMedia automatically looks for interfaces implemented by that class and in this case finds java/util/List.jsp. This could then contain the following fragment:

<ul>
<c:forEach items="${self}" var="item">
  <li><cm:include self="${item}" view="link"/></li>
</c:forEach>
</ul>

As the List in our case contains Article implementations, this finally hits Article.link.jsp, which renders the link. This is a very flexible approach with a high degree of reusability for the fragments: the List.jsp above has no connection to the Article. You can use it for any objects that should be rendered as a list; the View Dispatcher takes care of which JSP to include for a certain type.

To minimize the load on the Content Server you can also add caching via configuration settings. Data Views, a layer on top of the Contentbeans, are then held in memory as prefilled beans that don't need to access the Content Management Server anymore. This object-cache approach differs from the HTML fragment caching many other systems do.

Summary

Though this is only a very short introduction, you should have seen that CoreMedia really is a nice system to work with. The distributed nature not only makes it scalable but also has implications for development: when you are working on the CAE you are only changing code in this component. You can start the more heavyweight Content Server once and afterwards work with the lightweight CAE, which can be run using the Maven Jetty plugin. Restarts don't take long, so you have short turnaround times. The JSPs are very cleanly structured and don't need to include scriptlets (I heard that this was different in earlier versions). As most of the application is built around Spring MVC, you can reuse a lot of knowledge that is already out there.

Wednesday, August 28, 2013

FrOSCon 8 2013 - Free and Open Source Software Conference

Last weekend I attended FrOSCon, the Free and Open Source Software Conference, taking place in St. Augustin near Bonn. It's a community-organized conference with an especially low entrance fee and a relaxed vibe. The talks are a good mixture of development and system administration topics.

Some of the interesting talks I attended:

Fixing Legacy Code by Kore Nordmann and Benjamin Eberlein

Though this session was part of the PHP track, it contained a lot of valuable information on working with legacy code in any language. Besides strategies for getting an application under test, the speakers showed some useful refactorings that make sense to start with. Slides

Building Awesome Ruby Command Line Apps by Christian Vervoorts

Christian first showed some of the properties that make up a good command line app. You should choose sane default values but make them configurable. Help functionality is crucial for a good user experience, both via the -h parameter and a man page. In the second part Christian introduced some Ruby gems that can be used to build command line apps. GLI seems to be the most interesting, with a nice DSL and its scaffolding functionality.

Talking People Into Creating Patches by Isabel Drost-Fromm

Isabel, who is very active in the Apache community, shared some of her findings on getting students, researchers and professionals to participate in Open Source. The participants were a mixture of people running open source projects and developers interested in contributing to open source. I was especially interested in this talk because I wouldn't mind having more people help with the Odftoolkit I am also working on. When working with professionals, who are the main target, it is important to respond quickly to mails and issues, as contributors might move on to other projects and may not be able to help later on. It also helps to have some easy tasks in the bug tracker that newcomers can pick up.

MySQL Performance Schema by Carsten Thalheimer

Performance Schema is a new feature in MySQL 5.5 and is activated by default since 5.6. It monitors a lot of internal functionality like file access and queries, so you can later see which parts to optimize. Measurements by the MySQL developers showed that keeping it activated has a performance impact of around 5%. Though this doesn't sound great at first, I think you can gain far more performance through the insight into the inner workings. Working with Performance Schema is supposed to be rather complex ("Take two weeks to work with it"); ps_helper is a more beginner-friendly layer that can get you started with some useful metrics.

Summary

FrOSCon is one of the most relaxing conferences I know. It is my go-to place for seeing things that are not directly related to Java development. The low fee makes it a no-brainer to attend. If you are interested in any of this year's talks, they will also be made available online.

Wednesday, August 21, 2013

Getting Started with ElasticSearch: Part 2 - Querying

This is the second part of the article on things I learned while building a simple Java-based search application on top of ElasticSearch. In the first part of this article we looked at how to index data in ElasticSearch and what the mapping is. Though ElasticSearch is often called schema-free, specifying the mapping is still a crucial part of creating a search application. This time we will look at the query side and see how we can get our indexed talks out again.

Simple Search

Recall that our documents consist of the title, the date and the speaker of a talk. We have adjusted the mapping so that the title uses the German analyzer, which stems the terms so we can search on variations of words. This curl request creates an index with such a mapping:

curl -XPUT "http://localhost:9200/blog" -d'
{
    "mappings" : {
        "talk" : {
            "properties" : {
                "title" : { "type" : "string", "store" : "yes", "analyzer" : "german" }
            }
        }
    }
}'

Let's see how we can search our content. First we index another document with a German title.

curl -XPOST "http://localhost:9200/blog/talk/" -d'
{
    "speaker" : "Florian Hopf",
    "date" : "2012-07-04T19:15:00",
    "title" : "Suchen und Finden mit Lucene und Solr"
}'

All searching is done via the _search endpoint that is available on the type or index level (you can also search across multiple types and indexes by separating them with commas). As the title field uses the German analyzer we can search on variations of the words, e.g. suche, which stems to the same root as suchen: such.

curl -XGET "http://localhost:9200/blog/talk/_search?q=title:suche&pretty=true"                                                                       
{
  "took" : 14, 
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.15342641,
    "hits" : [ {
      "_index" : "blog",
      "_type" : "talk",
      "_id" : "A2Qv3fN3TkeYEhxA4zicgw",
      "_score" : 0.15342641, "_source" : {
        "speaker" : "Florian Hopf",
        "date" : "2012-07-04T19:15:00",
        "title" : "Suchen und Finden mit Lucene und Solr"
      }
    } ]
  }
}

The _all field

Now that this works, we might want to search on multiple fields. ElasticSearch provides the convenience functionality of copying all field content to the _all field, which is used when you omit the field name in the query. Let's try the query again without the field name:

curl -XGET "http://localhost:9200/blog/talk/_search?q=suche&pretty=true"
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

No results. Why is that? We did set the analyzer correctly for the title, as we have seen above. But this doesn't mean the content is analyzed the same way for the _all field. As we didn't specify an analyzer for it, the _all field still uses the StandardAnalyzer, which tokenizes the text but doesn't do any stemming. If you want consistent behavior for the title and the _all field, you need to set the analyzer in the mapping:

curl -XPUT "http://localhost:9200/blog" -d'
{
    "mappings" : {
        "talk" : {
            "_all" : {"analyzer" : "german"},
            "properties" : {
                "title" : { "type" : "string", "store" : "yes", "analyzer" : "german" }
            }
        }
    }
}'

Note that, as with all mapping changes, you can't change the analysis settings of the _all field once the index is created. You need to delete the index, put the new mapping and reindex your data. Afterwards our search will return the same results for the two queries.
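In curl terms the cycle looks like this (assuming the blog index and the talk document from above; the requests need a running ElasticSearch instance, so no output is shown):

```shell
# Delete the existing index; this removes all indexed documents.
curl -XDELETE "http://localhost:9200/blog"

# Recreate it with the adjusted mapping shown above, then reindex, e.g.:
curl -XPOST "http://localhost:9200/blog/talk/" -d'
{
    "speaker" : "Florian Hopf",
    "date" : "2012-07-04T19:15:00",
    "title" : "Suchen und Finden mit Lucene und Solr"
}'
```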

_source

You might have noticed in the examples above that ElasticSearch returns the special _source field for each result. This is very convenient as you don't need to specify which fields should be stored. But be aware that this might become a problem for large fields that you don't need for every search request (the content section of articles, or images you might store in the index). You can either disable the _source field and indicate in the mapping which fields should be stored for your indexed type, or specify in the query which fields you'd like to retrieve:

curl -XGET "http://localhost:9200/blog/talk/_search?q=suche&pretty=true&fields=speaker,title"
{
  "took" : 6,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.15342641,
    "hits" : [ {
      "_index" : "blog",
      "_type" : "talk",
      "_id" : "MA2oYAqnTdqJhbjnCNq2zA",
      "_score" : 0.15342641
    }, {
      "_index" : "blog",
      "_type" : "talk",
      "_id" : "aGdDy24cSImz6DVNSQ5iwA",
      "_score" : 0.076713204,
      "fields" : {
        "speaker" : "Florian Hopf",
        "title" : "Suchen und Finden mit Lucene und Solr"
      }
    } ]
  }
}

The same can be done if you are not using the simple query parameters but the more advanced query DSL:

curl -XPOST "http://localhost:9200/blog/talk/_search" -d'
{
    "fields" : ["title", "speaker"],
    "query" : {
        "term" : { "speaker" : "florian" }
    }
}'

Querying from Java

Besides the JSON-based Query DSL you can also query ElasticSearch from Java. The default ElasticSearch Java client provides builders for creating the different parts of a query that can then be combined. For example, if you'd like to query two fields using the multi_match query, this is what it looks like using curl:

curl -XPOST "http://localhost:9200/blog/_search" -d'
{
    "query" : {
        "multi_match" : {
            "query" : "Solr",
            "fields" : [ "title", "speaker" ]
        }
    }
}'

The Java version maps quite well to this. Once you have found the builders you need, you can use the excellent documentation of the Query DSL for your Java client as well.

import static org.elasticsearch.index.query.QueryBuilders.multiMatchQuery;

QueryBuilder multiMatch = multiMatchQuery("Solr", "title", "speaker");
SearchResponse response = esClient.prepareSearch("blog")
        .setQuery(multiMatch)
        .execute().actionGet();
assertEquals(1, response.getHits().getTotalHits());
SearchHit hit = response.getHits().getAt(0);
assertEquals("Suchen und Finden mit Lucene und Solr", hit.getSource().get("title"));

The same QueryBuilder we constructed above can also be used in other parts of the query: for example it can be passed as a parameter to create a QueryFilterBuilder, or used to construct a QueryFacetBuilder. This composition is a very powerful way to build flexible applications: it is easier to reason about the components of the query and you can even test parts of the query on their own.

Faceting

One of the most prominent features of ElasticSearch is its excellent faceting support, which is used not only for building search applications but also for doing analytics on large data sets. There are different kinds of facets, e.g. for certain terms, using the terms facet, or for queries, using the query facet. The query facet would accept the same QueryBuilder that we used above.

import static org.elasticsearch.index.query.QueryBuilders.queryString;
import static org.elasticsearch.search.facet.FacetBuilders.termsFacet;

TermsFacetBuilder facet = termsFacet("speaker").field("speaker");
QueryBuilder query = queryString("solr");
SearchResponse response = esClient.prepareSearch("blog")
        .addFacet(facet)
        .setQuery(query)
        .execute().actionGet();
assertEquals(1, response.getHits().getTotalHits());
SearchHit hit = response.getHits().getAt(0);
assertEquals("Suchen und Finden mit Lucene und Solr", hit.getSource().get("title"));
TermsFacet resultFacet = response.getFacets().facet(TermsFacet.class, "speaker");
assertEquals(1, resultFacet.getEntries().size());

Conclusion

ElasticSearch has a really nice Java API, be it for indexing or for querying. You can get started with indexing and searching in no time, though you need to know some of the concepts or the results might not be what you expect.
