Thursday, November 28, 2013

Reindexing Content in Elasticsearch with stream2es

Last week I wrote about reindexing content in Elasticsearch using a script that extracts the source field from the indexed content. You can use it when your mapping changes or you need to adjust the index settings. After publishing the post Drew Raines mentioned that there is an easier way using only the stream2es utility. Time to have a look at it!

stream2es can be used to stream content from several inputs to Elasticsearch. In my last post I used it to stream a file containing the sources of documents to an Elasticsearch index. Besides that, it can index data from Wikipedia, from Twitter, or directly from another Elasticsearch index, which is what we will look at now.
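As an aside, the other inputs are just as easy to use; streaming Wikipedia articles, for example, needs nothing more than a target. This is only a sketch, assuming an index named wikipedia; check stream2es --help for the exact options of your version:

stream2es wikipedia --target http://localhost:9200/wikipedia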

Again, we are indexing some documents:

curl -XPOST "http://localhost:9200/twitter/tweet/" -d'
{
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elastic Search"
}'
curl -XPOST "http://localhost:9200/twitter/tweet/" -d'
{
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:14:14",
    "message" : "Elasticsearch works!"
}'

Now, if we need to adjust the mapping we can just create a new index with the new mapping:

curl -XPOST "http://localhost:9200/twitter2" -d'
{
    "mappings" : {
        "tweet" : {
            "properties" : {
                "user" : { "type" : "string", "index" : "not_analyzed" }
            }
        }
    }
}'

You can now use stream2es to transfer the documents from the old index to the new one:

stream2es es --source http://localhost:9200/twitter/ --target http://localhost:9200/twitter2/

This will make our documents available in the new index:

curl -XGET http://localhost:9200/twitter2/_count?pretty=true
{
  "count" : 2,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  }
}

You can now delete the old index. To keep the data available under the old index name you can also create an alias that points to the new index:

curl -XDELETE http://localhost:9200/twitter
curl -XPOST 'http://localhost:9200/_aliases' -d '
{
    "actions" : [
        { "add" : { "index" : "twitter2", "alias" : "twitter" } }
    ]
}'

Looking at the mapping you can see that the twitter index now points to our updated version:

curl -XGET http://localhost:9200/twitter/tweet/_mapping?pretty=true
{
  "tweet" : {
    "properties" : {
      "bytes" : {
        "type" : "long"
      },
      "message" : {
        "type" : "string"
      },
      "offset" : {
        "type" : "long"
      },
      "post_date" : {
        "type" : "date",
        "format" : "dateOptionalTime"
      },
      "user" : {
        "type" : "string",
        "index" : "not_analyzed",
        "omit_norms" : true,
        "index_options" : "docs"
      }
    }
  }
}

Thursday, November 21, 2013

Reindexing Content in Elasticsearch

One of the crucial parts of any search application is the way you map your content to the analyzers. The mapping determines which query terms match the terms that are indexed with the documents. Sometimes during development you might notice that you didn't get this right from the beginning and need to reindex your data with a new mapping. While for some applications you can easily start the indexing process again, this becomes more difficult for others. Luckily Elasticsearch by default stores the original content in the _source field. In this short article I will show you how to use a script developed by Simon Willnauer that lets you retrieve all the data and reindex it with a new mapping.

You can do the same thing in an easier way using only the stream2es utility. Look at this post if you are interested.

Reindexing

Suppose you have indexed documents in Elasticsearch. Imagine there are a lot of them, that they cannot easily be indexed again from the original source, or that reindexing would take a long time.

curl -XPOST "http://localhost:9200/twitter/tweet/" -d'
{
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elastic Search"
}'
curl -XPOST "http://localhost:9200/twitter/tweet/" -d'
{
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:14:14",
    "message" : "Elasticsearch works!"
}'

Initially this will create a mapping that is derived from the values:

curl -XGET "http://localhost:9200/twitter/tweet/_mapping?pretty=true"
{
  "tweet" : {
    "properties" : {
      "message" : {
        "type" : "string"
      },
      "post_date" : {
        "type" : "date",
        "format" : "dateOptionalTime"
      },
      "user" : {
        "type" : "string"
      }
    }
  }
}

Now if you notice that you would like to change one of the existing fields to another type, you need to reindex, as Elasticsearch doesn't allow you to modify the mapping for existing fields. Adding fields is fine, but changing existing ones is not.
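You can see the restriction for yourself by trying to switch the user field to not_analyzed in place. This is only a sketch; the exact response depends on your Elasticsearch version, but the request will be rejected with a mapping merge conflict:

curl -XPUT "http://localhost:9200/twitter/tweet/_mapping" -d'
{
    "tweet" : {
        "properties" : {
            "user" : { "type" : "string", "index" : "not_analyzed" }
        }
    }
}'

Luckily you can leverage the _source field that you can also see when querying a document: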

curl -XGET "http://localhost:9200/twitter/tweet/_search?q=user:kimchy&pretty=true&size=1"
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.30685282,
    "hits" : [ {
      "_index" : "twitter",
      "_type" : "tweet",
      "_id" : "oaFqxMnqSrex6T7_Ut-erw",
      "_score" : 0.30685282, "_source" : {
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elastic Search"
}

    } ]
  }
}

For his "no slides no bullshit introduction to Elasticsearch" Simon Willnauer has implemented a script that retrieves the _source fields for all documents of an index. After installing the prerequisites you can use it by passing in your index name:

fetchSource.sh twitter > result.json

It prints all the documents to stdout, one JSON document per line including its id, which can be redirected to a file.
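For our two example tweets the file will contain one line per document, roughly like this (the ids are generated, so yours will differ):

{"user":"kimchy","post_date":"2009-11-15T14:12:12","message":"trying out Elastic Search","_id":"oaFqxMnqSrex6T7_Ut-erw"}
{"user":"kimchy","post_date":"2009-11-15T14:14:14","message":"Elasticsearch works!","_id":"x2NkPr6hRXWRrSzxEdNyRg"}

We can now delete our index and recreate it using a different mapping.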

curl -XDELETE http://localhost:9200/twitter
curl -XPOST "http://localhost:9200/twitter" -d'
{
    "mappings" : {
        "tweet" : {
            "properties" : {
                "user" : { "type" : "string", "index" : "not_analyzed" }
            }
        }
    }
}'

The file we just created can now be sent to Elasticsearch again using the handy stream2es utility.

stream2es stdin --target "http://localhost:9200/twitter/tweet" < result.json

All your documents are now indexed using the new mapping.
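To double check that nothing got lost you can compare the document count with the number of lines in result.json, for example using the _count API:

curl -XGET "http://localhost:9200/twitter/_count?pretty=true"

For our example the count should be 2.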

Implementation

Let's look at the details of the script. At the time of writing this post the relevant part of the script looks like this:

SCROLL_ID=`curl -s -XGET 'localhost:9200/'${INDEX_NAME}'/_search?search_type=scan&scroll=11m&size=250' -d '{"query" : {"match_all" : {} }}' | jq '._scroll_id' | sed s/\"//g`
RESULT=`curl -s -XGET 'localhost:9200/_search/scroll?scroll=10m' -d ${SCROLL_ID}`

while [[ `echo ${RESULT} | jq -c '.hits.hits | length'` -gt 0 ]] ; do
  #echo "Processed batch of " `echo ${RESULT} | jq -c '.hits.hits | length'`
  SCROLL_ID=`echo $RESULT | jq '._scroll_id' | sed s/\"//g`
  echo $RESULT | jq -c '.hits.hits[] | ._source + {_id}' 
  RESULT=$(eval "curl -s -XGET 'localhost:9200/_search/scroll?scroll=10m' -d ${SCROLL_ID}")
done

It uses scrolling to efficiently traverse the documents. Processing of the JSON output is done using jq, a lightweight and flexible command-line JSON processor that I should have used as well when querying the SonarQube REST API.

The first line in the script creates a scan search that uses scrolling. The scroll will be valid for 11 minutes and matches all documents via the match_all query; the size of 250 determines how many documents are returned with each request (for a scan search this applies per shard, so a request may return up to 250 times the number of shards). The response doesn't contain any documents yet, only the _scroll_id, which is then extracted with jq. The final sed command removes the quotes around it.

The scroll id is now used to send queries to Elasticsearch. On each iteration the script checks whether there are any hits at all; if there are, the request returns a new scroll id for the next batch. The result is echoed to the console. .hits.hits[] returns the list of all hits. The pipe symbol in jq processes each hit with the filter on its right, which prints the source as well as the id of the hit.
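You can try the filter in isolation on a trimmed-down, hypothetical response to see what it does:

echo '{"hits":{"hits":[{"_id":"1","_source":{"user":"kimchy"}}]}}' | jq -c '.hits.hits[] | ._source + {_id}'
{"user":"kimchy","_id":"1"}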

Conclusion

The script is a very useful addition to your Elasticsearch toolbox. You can use it to reindex or just export your content. I am glad I looked at the details of the implementation, as jq will come in really handy in the future as well.

Saturday, November 16, 2013

Devoxx in Tweets

For the first time in several years I unfortunately had to skip this year's Devoxx. There are so many tweets that remind me of the good talks going on there that I thought I would do something useful with them. So again I indexed them in Elasticsearch using the Twitter river and can therefore look at them using Kibana. David Pilato has also set up a public instance, and I could imagine that there will be a more thorough analysis done by the Devoxx team, but here are my thoughts on this year's Devoxx without having been there.

I'll be looking at three things: the top 10 mentions, the top 10 hashtags and the tweet distribution over time. For the mentions I have excluded @Devoxx and @java, for the hashtags I have excluded #devoxx, #dvx13 and #dv13, as those mentions and tags are too dominant and don't tell a lot. I have collected all tweets containing the term devoxx, so there will be some I missed. Each retweet counts as a separate tweet.
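For reference, the river definition looks roughly like this. This is only a sketch: the index name is made up, the OAuth credentials are left out, and the exact settings depend on the version of the Twitter river plugin:

curl -XPUT 'http://localhost:9200/_river/devoxx/_meta' -d '
{
    "type" : "twitter",
    "twitter" : {
        "oauth" : {
            "consumer_key" : "...",
            "consumer_secret" : "...",
            "access_token" : "...",
            "access_token_secret" : "..."
        },
        "filter" : {
            "tracks" : "devoxx"
        }
    },
    "index" : {
        "index" : "devoxx",
        "type" : "status"
    }
}'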

Overall Trends

Looking at the timeline of the whole week you can see that the number of tweets is high at the beginning and continually rises, with Thursday having even more tweets than Wednesday, which is quite a surprise to me. I would have thought that the first conference day would be the one with the most tweets.

Stephan007, the founder of Devoxx, has the most mentions, which is no surprise. Chet Haase and Romain Guy follow. I have never seen a talk by them but I probably should. The Dart language is the dominant hashtag, with a lot of buzz around its 1.0 release. Java, Android and Scala are still hot technologies, Android being a bit of a surprise here. It's nice that the initiative Devoxx4Kids ranks quite high.

Daily Analysis

Monday

On Monday the top mention is @AngularJS. Of course this is caused by the two AngularJS university sessions that lasted nearly the whole day. Angular is a hot topic, but I am not yet planning to do any work with it. The session on Java EE 7 also created a lot of interest, as can be seen from the mentions of its hosts Arun Gupta and Antonio Goncalves. They especially encouraged people to participate on Twitter, which seems to have been received very well. Scala is another hot topic, with the university session by Dick Wall and Joshua Suereth that I would really have liked to see.

Tuesday

Tuesday is dominated by the two excellent speakers Matt Raible and Venkat Subramaniam. I especially regret that I couldn't see Venkat in action, whom I consider to be one of the best speakers around. I am not sure what the tag hackergarden is referring to, as I didn't find an event on Monday evening or Tuesday. There is also quite some interest in Reactor, the reactive framework of the Spring ecosystem.

Wednesday

Brian Goetz got a lot of mentions for the keynote. It's a surprise to me that there are so many mentions of David Blevins for his talk on EJB 3.2 and beyond, which I wouldn't have expected to be that popular. The big event of the day was the launch of Ceylon 1.0, as can be seen from the hashtag. I have heard good things about Ceylon but I still consider it an underdog among the alternative JVM languages.

Thursday

Romain Guy is leading the mentions with his very popular talk "Filthy Rich Android Clients", followed by Jonas Boner of Akka fame and Venkat Subramaniam. The launch of Dart 1.0 dominates the keywords. The Javaposse still ranks in the top 10 with their popular traditional session.

Friday

Friday normally has fewer participants than the other days. Joshua Suereth received a lot of tweets for his Scala talk, ranking high both in mentions and hashtags. The session on Google Glass was also very popular. I am not sure which session caused the mention of dagger.

Programming Language Popularity

Niko Schmuck proposed adding the language popularity over the week. As this is quite interesting, here is the totally unscientific popularity chart that of course should determine which language you are learning next. I am not querying the hashtags but any mention of the terms.

Java dominates but JavaScript is very strong on Monday and Thursday. Ceylon has its share on Wednesday while Thursday is the Dart day. Scala is very popular on Monday and Friday.
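If you want to reproduce counts like the ones below, a count query per term is enough. A sketch, assuming the tweets live in an index named devoxx:

curl -XGET 'http://localhost:9200/devoxx/_search?q=scala&search_type=count&pretty=true'

The total in the hits section is the number of tweets mentioning the term.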

A ranked version:

Java       1234
JS          584
Dart        490
Scala       252
Ceylon      172
Groovy      171
Clojure      94
Kotlin       16

The Drink Tweets

As quite a few people are tweeting, we can see some trends with regard to the drink tweets. First, the coffee tweets:

Quite a spike on Monday, with people either mentioning that they need coffee or complaining about the coffee. This repeats on Tuesday and Wednesday in the morning; people seem to have accepted the situation by Thursday.

Another common topic, especially since the conference is located in Belgium, is beer.

Surprise, surprise, people tend to tweet about beer in the evening. I like the huge Javaposse spike on Thursday with a lot of mentions of the beer sponsor Atlassian.

Conclusion

Though I haven't been there I could get a small glimpse of the trends at this year's Devoxx. As soon as the videos are available I will buy the account for this year's conference, not only because there are so many interesting talks to see but also because the Devoxx team is doing a fantastic job that I'd like to support in any way I can.

Updates
  • 17.11. Added a section on programming language popularity
  • 18.11. Updated the weekly diagram with a more accurate version

Sunday, November 10, 2013

Lucene Solr Revolution 2013 in Dublin

I just returned from Lucene Solr Revolution Europe, the conference on everything Lucene and Solr, which was held in Dublin this year. I always like to recap what I took from a conference, so here are some impressions.

The Venue

In the spirit of last year's conference, which was merged with ApacheCon and held in a soccer stadium in Sinsheim, this year's venue was a rugby stadium. It seems to be quite common for conferences to be organized in stadiums, and the location was well suited. For some of the room changes you had to walk quite a distance, but that's nothing that couldn't be managed.

The Talks

As there were four tracks in parallel, choosing which talk to attend could prove difficult; there were so many interesting things to choose from. Fortunately all the talks were recorded and will be made available for free on the conference website.

The following is a selection of the talks that I think were most valuable to me.

Keynote: Michael Busch on Lucene at Twitter

Michael Busch is a regular speaker at search conferences because Twitter is doing some interesting things. On the one hand they have to handle near-realtime search, massive data sets and lots of requests. On the other hand they can always be sure that their documents are of a certain size. They maintain two different data stores as Lucene indices: the realtime index that contains the most recent data and the archive index that makes older tweets searchable. They introduced the archive index only a few months ago, which in my opinion led to a far more reliable search experience. They have done some really interesting things, like encoding the position info of a term with the doc id, because they only need a few bits to address positions in a 140 character document. They also changed some aspects of the posting list encoding because they always display results sorted by date. They are trying to make their changes more general so those can be contributed back to Lucene.

Solr Indexing and Analysis Tricks by Erik Hatcher

I always enjoy listening to the talks of Erik Hatcher, probably also because his university session at Devoxx 2009 was the driving factor for me starting to use Solr. In this year's talk he presented lots of useful aspects of indexing data in Solr. One of the most interesting facts I took from this talk is the use of the ScriptUpdateProcessor that has been included in Solr since version 4. You can define scripts that are executed during indexing and can manipulate the document. This is a valuable alternative to copyFields, especially if you would like to have the content stored as well. By default you implement the logic in JavaScript, but there are alternatives available.

Hacking Lucene and Solr for Fun and Profit by Grant Ingersoll

Grant Ingersoll presented some applications of Lucene and Solr that don't directly involve search, like classification, recommendations and analytics. Some examples were taken from his excellent book Taming Text (watch this blog for a review of the book in the near future).

Schemaless Solr and the Solr Schema REST API by Steve Rowe

One of the factors in the success of Elasticsearch is its ease of use. You can download it and start indexing documents immediately without doing any configuration work. One of the features that enables this is the autodiscovery of field types by value. Starting with Solr 4.4 you can use Solr in a similar way: you can configure Solr to manage your schema, and unknown fields are then created automatically based on the first value that is extracted by the configured parsers. As with Elasticsearch you shouldn't rely on this feature exclusively, so there is also a way to add new fields of a certain type via the Schema REST API. When Solr is in managed mode it will modify the schema.xml, so you might lose changes you made manually. For the future the developers are even thinking about moving away from XML for the managed mode, as there are better options when readability doesn't matter.
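Adding a field via the Schema REST API looks roughly like this. Treat this as a sketch with a made-up field name; the exact URL and payload are version dependent, so check the Solr reference guide:

curl -XPUT "http://localhost:8983/solr/collection1/schema/fields/category" -H 'Content-type: application/json' -d '{"type":"string", "stored":true}'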

Stump the Chump with Chris Hostetter

This seems to be a tradition at Lucene Solr Revolution: Chris Hostetter has to find solutions to problems that have been submitted beforehand or are posted by the audience. It's a fun event but you can also learn a lot.

Query Latency Optimization with Lucene by Stefan Pohl

Stefan first introduced some basic latency factors and how to measure them. He recommended not instrumenting the low level Lucene classes when profiling your application, as those rely heavily on HotSpot optimizations. Besides introducing the basic mechanisms of how conjunction (AND) and disjunction (OR) queries work, he described some recent Lucene improvements that can speed up your application, among them LUCENE-4571, the new minShouldMatch implementation, and LUCENE-4752, which allows custom ordering of documents in the index.

Relevancy Hacks for eCommerce by Varun Thacker

Varun introduced the basics of relevancy sorting in Lucene and Solr and how those might affect product searches. TF/IDF is sometimes not the best solution ("IDF is a measurement of rarity, not necessarily importance"). He also showed ways to influence the relevancy: implementing a custom Similarity class, boosting, and function queries.

What is in a Lucene Index by Adrien Grand

Adrien started with the basics of a Lucene index and how it differs from a database index: the dictionary structure, segments and merging. He then moved on to topics like the structure of the posting list, term vectors, the FST terms index and the difference between stored fields and doc values. This is a talk full of interesting details on the internal workings of Lucene and the implications for the performance of your application.

Conclusion

As said before, I couldn't attend all the talks I would have liked to. I heard especially good things about the following talks, which I will watch as soon as they are available:

  • Integrate Solr with Real-Time Stream Processing Applications by Timothy Potter
  • The Typed Index by Christoph Goller
  • Implementing a Custom Search Syntax Using Solr, Lucene and Parboiled by John Berryman

I really enjoyed Lucene Solr Revolution. Not only were there a lot of interesting talks to listen to, it was also a good opportunity to meet new people. On both evenings there were get-togethers with free drinks and food, which must have cost LucidWorks a fortune. I couldn't attend the closing remarks, but I heard they announced that they want to move to smaller, national events in Europe instead of one central conference. I hope those will still be events that attract as many committers and interesting people.

Thursday, October 31, 2013

Switch Off Legacy Code Violations in SonarQube

While I don't believe in putting numbers on source code quality, SonarQube (formerly known as Sonar) can be a really useful tool during development. It enforces a consistent style across your team, has discovered several possible bugs for me, and is a great tool to learn from: you can browse the violations and see why a certain expression or code block can be a problem.

To make sure that your code base stays in a consistent state you can go as far as mandating that there be no violations in the code developers check in. One problem with this is that a lot of projects are not green field projects and have a lot of existing code. If your violation count is already high it is difficult to judge whether new violations were introduced.

In this post I will show you how to start with zero violations for existing code without touching the sources, something I was inspired to do by Jens Schauder in his great talk Working with Legacy Teams. We will ignore all violations based on the line in the file, so if anybody touches the file the violations will show up again and the developer is responsible for fixing the legacy violations.

The Switch Off Violations Plugin

We are using the Switch Off Violations plugin for SonarQube. It can be configured with different exclusion patterns for the issues: you can define regular expressions for code blocks that should be ignored, or deactivate violations entirely, on a per-file or per-line basis.

For existing code you want to ignore all violations for certain files and lines. This can be done by inserting something like this in the text area Exclusion patterns:

de.fhopf.akka.actor.IndexingActor;pmd:SignatureDeclareThrowsException;[23]

This will exclude the violation for throwing raw Exceptions in line 23 of the IndexingActor class. When analyzing the code again this violation will be ignored.

Retrieving violations via the API

Besides the nice dashboard, SonarQube also offers an API that can be used to retrieve all the violations for a project. If you are not keen to look up all existing violations in your code base and insert them by hand, you can use it to generate the exclusion patterns automatically. All of the violations can be found at /api/violations, e.g. http://localhost:9000/api/violations.

I am sure there are other ways to do it, but I used jsawk to parse the JSON response. (On Ubuntu you have to install SpiderMonkey instead of the default js interpreter. And you have to compile it yourself. And I had to use a specific version. Sigh.)

Once you have set up all the components you can now use jsawk to create the exclusion patterns for all existing violations:

curl -XGET 'http://localhost:9000/api/violations?depth=-1' | ./jsawk -a 'return this.join("\n")' 'return this.resource.key.split(":")[1] + ";*;[" + this.line + "]"' | sort | uniq

This will produce a list that can be pasted into the text area of the Switch Off Violations plugin or checked in to the repository as a file. With the next analysis you will then hopefully see zero violations. When somebody changes a file by inserting a line, the violations will be shown again and should be fixed. Unfortunately some violations are not line based and will yield the line number 'undefined'. Currently I just remove those manually, so you still might see some violations.
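As an aside, the same transformation is a one-liner in jq, the JSON processor I mention in a later post, which is far easier to install than jsawk. A sketch assuming the same response format (violations without a line number show up as 'null' here and need the same manual cleanup):

curl -s 'http://localhost:9000/api/violations?depth=-1' | jq -r '.[] | (.resource.key | split(":")[1]) + ";*;[" + (.line | tostring) + "]"' | sort | uniq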

Conclusion

I presented one way to reset your legacy code base to zero violations. With SonarQube 4.0 the functionality of the Switch Off Violations plugin will be available in the core, so it will be easier to use. I am still looking for the best way to keep the exclusion patterns up to date: once somebody has fixed the violations for an existing file, the pattern should be removed.

Update 09.01.2014

Starting with SonarQube 4 this approach doesn't work anymore. Some features of the Switch Off Violations plugin have been moved to the core, but excluding violations by line is not possible anymore and will not be implemented. The developers recommend only looking at the trends of the project and not the overall violation count. This can be done nicely using the differentials.

Friday, October 25, 2013

Elasticsearch at Scale - Kiln and GitHub

Most of us are not exposed to data at real scale. It is getting more common, but I still appreciate that the more progressive companies that have to fight with large volumes of data are open about it and talk about their problems and solutions. GitHub and Fog Creek are two of the larger users of Elasticsearch, and both have published articles and interviews on their setups. It's interesting that both of these companies are using it for a very specialized use case: source code search. As I have recently read the article on Kiln as well as the interview with the folks at GitHub, I'd like to summarize some of the points they made. Visit the original links for in-depth information.

Elasticsearch at Fog Creek for Kiln

In this article on InfoQ, Kevin Gessnar, a developer at Fog Creek, describes the process of migrating the code search of Kiln to Elasticsearch.

Initial Position

Kiln allows you to search on commit messages, filenames and file contents. For commit messages and filenames they were initially using the full text search features of SQL Server. For the file content search they were using a tool called OpenGrok that leverages Ctags to analyze the code and stores it in a Lucene index. This provided them with all of the features they needed, but unfortunately the solution couldn't scale with their requirements. Queries took several seconds, up to the timeout value of 30 seconds.

It's interesting to see that they decided against Solr because of poor read performance under heavy writes. It would be interesting to know if this is still the case for current versions.

Scale

They are indexing several million documents every day, which comes to terabytes of data, and they are still running their production system on only two nodes. These are numbers that really surprised me; I would have guessed that you need more nodes for this amount of data (well, probably those are really big machines). They seem to be using Elasticsearch only for indexing and search and retrieve the data for the result display from their primary storage layer.

Elasticsearch at GitHub

Andrew Cholakian, who is doing a great job writing his book Exploring Elasticsearch in the open, published an interview with Tim Pease and Grant Rodgers of GitHub on their Elasticsearch setup, going through a lot of details.

Initial Position

GitHub used to have their search based on Solr. As the volume of data and searches increased they needed a solution that scales. Again, I would be interested whether current versions of SolrCloud could handle this volume.

Scale

They are really searching big data: 44 Amazon EC2 instances power search on 2 billion documents, which make up 30 terabytes of data. 8 of those instances don't hold any data but only distribute the queries. They are planning to move from the 44 Amazon instances to 8 larger physical machines. Besides their user facing data they are indexing internal data like audit logs and exceptions (it isn't clear to me from the interview whether Elasticsearch is the primary data store in this case, which would be remarkable). They are using different clusters for different data types so that the external search is not affected when there are a lot of exceptions.

Challenges

Shortly after they launched their new search feature, people started discovering that you could also search for files that had been committed accidentally, like private ssh keys or passwords. This is an interesting phenomenon where just the possibility of better retrieval made a huge difference: all the information had been there before, but it just couldn't be found easily. This led to an increase in search volume that was not anticipated. Due to some configuration issues (a suboptimal Java version, no setting for the minimum number of master nodes) their cluster became unstable and they had to disable search for the whole site.

Further Takeaways
  • Use routing to keep related data together on one shard (see the sketch after this list)
  • Thrift seems to be far more complicated from an ops point of view compared to HTTP
  • Use the slow query log
  • Time slicing your indices is a good idea if the data allows
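The routing advice boils down to passing the same routing value for documents that belong together and for the queries that target them. A minimal sketch with a made-up index and routing value:

# all documents of one repository end up on the same shard
curl -XPOST 'http://localhost:9200/code/file?routing=repo42' -d '{ "name" : "Readme.md" }'
# a search with the same routing value only has to query that shard
curl -XGET 'http://localhost:9200/code/file/_search?routing=repo42&q=name:Readme'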

A Common Theme

Both of these articles have some observations in common:

  • Elasticsearch is easy to get started with
  • Scaling is not an issue
  • The HTTP interface is good for debugging and operations
  • The Elasticsearch community and the company behind it are really helpful when it comes to problems

Friday, October 11, 2013

Cope with Failure - Actor Supervision in Akka

A while ago I showed an example of how to use Akka to scale a simple application with multiple threads. Tasks can be split into several actors that communicate via immutable messages. State is encapsulated and each actor can be scaled independently. While implementing an actor you don't have to take care of low level building blocks like threads and synchronization, so it is far easier to reason about the application.

Besides these obvious benefits, fault tolerance is another important aspect. In this post I'd like to show you how you can leverage some of Akka's characteristics to make our example more robust.

The Application

To recap, we are building a simple web site crawler in Java that indexes pages in Lucene. The full code of the examples is available on GitHub. We are using three actors: one that carries the information on the pages to be visited and those visited already, one that downloads and parses the pages, and one that indexes the pages in Lucene.

By using several actors to download and parse pages we could see some good performance improvements.

What could possibly go wrong?

Things will fail. We are relying on an external service (the page we are crawling) and therefore on the network. Requests could time out or our parser could choke on the input. To make our example somewhat reproducible I simulated an error: a new PageRetriever, the ChaosMonkeyPageRetriever, sometimes just throws an Exception:

@Override
public PageContent fetchPageContent(String url) {
    // this error rate is derived from scientific measurements
    if (System.currentTimeMillis() % 20 == 0) {
      throw new RetrievalException("Something went horribly wrong when fetching the page.");
    }
    return super.fetchPageContent(url);
}

You can surely imagine what happens when we use this retriever in the sequential example that doesn't use Akka or threads. As we didn't take care of the failure, our application just stops when the Exception occurs. One way to mitigate this is by surrounding statements with try/catch blocks, but this will soon intermingle a lot of recovery and fault processing code with our application logic. Once we have an application that is running in multiple threads, fault processing gets a lot harder: there is no easy way to notify other threads or save the state of the failing thread.

Supervision

Let's see Akka's behavior in case of an error. I added some logging that indicates the current state of the visited pages.

1939 [default-akka.actor.default-dispatcher-5] INFO de.fhopf.akka.actor.Master - inProgress:  55, allPages:  60
1952 [default-akka.actor.default-dispatcher-4] INFO de.fhopf.akka.actor.Master - inProgress:  54, allPages:  60
[ERROR] [10/10/2013 06:47:39.752] [default-akka.actor.default-dispatcher-5] [akka://default/user/$a/$a] Something went horribly wrong when fetching the page.
de.fhopf.akka.RetrievalException: Something went horribly wrong when fetching the page.
        at de.fhopf.akka.actor.parallel.ChaosMonkeyPageRetriever.fetchPageContent(ChaosMonkeyPageRetriever.java:21)
        at de.fhopf.akka.actor.PageParsingActor.onReceive(PageParsingActor.java:26)
        at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:167)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
        at akka.actor.ActorCell.invoke(ActorCell.scala:456)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

1998 [default-akka.actor.default-dispatcher-8] INFO de.fhopf.akka.actor.Master - inProgress:  53, allPages:  60
2001 [default-akka.actor.default-dispatcher-12] INFO de.fhopf.akka.actor.PageParsingActor - Restarting PageParsingActor because of class de.fhopf.akka.RetrievalException
2001 [default-akka.actor.default-dispatcher-2] INFO de.fhopf.akka.actor.PageParsingActor - Restarting PageParsingActor because of class de.fhopf.akka.RetrievalException
2001 [default-akka.actor.default-dispatcher-10] INFO de.fhopf.akka.actor.PageParsingActor - Restarting PageParsingActor because of class de.fhopf.akka.RetrievalException
[...]
2469 [default-akka.actor.default-dispatcher-12] INFO de.fhopf.akka.actor.Master - inProgress:   8, allPages:  78
2487 [default-akka.actor.default-dispatcher-7] INFO de.fhopf.akka.actor.Master - inProgress:   7, allPages:  78
2497 [default-akka.actor.default-dispatcher-5] INFO de.fhopf.akka.actor.Master - inProgress:   6, allPages:  78
2540 [default-akka.actor.default-dispatcher-13] INFO de.fhopf.akka.actor.Master - inProgress:   5, allPages:  78

We can see each exception that is happening in the log file, but our application keeps running. That is because of Akka's supervision support. Actors form hierarchies: our PageParsingActor is a child of the Master actor because it is created from its context. The Master is responsible for determining the fault strategy for its children. By default it will restart the actor in case of an exception, which makes sure that the next message is processed correctly. This means that even in case of an error Akka tries to keep the system in a running state.

The reaction to a failure is determined by the method supervisorStrategy() in the parent actor. Based on the Exception class you can choose between several outcomes:

  • resume: Keep the actor running as if nothing had happened
  • restart: Replace the failing actor with a new instance
  • stop: Terminate the failing actor
  • escalate: Let your own parent decide what to do

A supervisor that would restart the actor for our exception and escalate otherwise could be added like this:

// allow 100 restarts in 1 minute ... this is a lot, but the chaos monkey is rather busy
private SupervisorStrategy supervisorStrategy = new OneForOneStrategy(100, Duration.create("1 minute"), new Function<Throwable, Directive>() {

    @Override
    public Directive apply(Throwable t) throws Exception {
        if (t instanceof RetrievalException) {
            return SupervisorStrategy.restart();
        }
        // it would be best to model the default behaviour in other cases
        return SupervisorStrategy.escalate();
    }

});

@Override
public SupervisorStrategy supervisorStrategy() {
    return supervisorStrategy;
}

Let's come back to our example. Though Akka takes care of restarting our failing actors, the end result doesn't look good: the application continues to run after several exceptions but then just stops making progress and hangs. This is caused by our business logic. The Master actor keeps all pages to visit in the VisitedPageStore and only commits the Lucene index when all pages are visited. As we had several failures we never received the results for those pages and the Master keeps waiting.

One way to fix this is to resend the message once the actor is restarted. Each Actor class can implement methods that hook into the actor's lifecycle. In preRestart() we can just send the message again:

@Override
public void preRestart(Throwable reason, Option<Object> message) throws Exception {
    logger.info("Restarting PageParsingActor and resending message '{}'", message);
    if (message.nonEmpty()) {
        getSelf().forward(message.get(), getContext());
    }
    super.preRestart(reason, message);
}

Now if we run this example we can see our actors recover from the failure. Though some exceptions happen, all pages get visited eventually and everything is indexed and committed in Lucene.

Though resending seems to be the solution to our failures, you need to be careful not to break your system with it: for some applications the message itself might be the cause of the failure, and by resending it you will keep your system busy with it in a livelock state. When using this approach you should at least add a counter to the message that is incremented on every restart. Once the message has been resent too often you can escalate the failure and have it handled in a different way.

Conclusion

We have only handled one specific type of failure, but you can already see how powerful Akka can be when it comes to fault tolerance. Recovery code is completely separated from the business code. To learn more about different aspects of error handling, read the Akka documentation on supervision and fault tolerance or this excellent article by Daniel Westheide.