Thursday, October 31, 2013

Switch Off Legacy Code Violations in SonarQube

While I don't believe in putting numbers on source code quality, SonarQube (formerly known as Sonar) can be a really useful tool during development. It enforces a consistent style across your team, has discovered several possible bugs for me, and is a great tool to learn from: you can browse the violations and see why a certain expression or code block can be a problem.

To make sure that your code base stays in a consistent state you can go as far as mandating that there should be no violations in the code developers check in. One of the problems with this is that a lot of projects are not greenfield projects and carry a lot of existing code. If your violation count is already high it is difficult to tell whether new violations have been introduced.

In this post I will show you how you can start with zero violations for existing code without touching the sources, an idea inspired by Jens Schauder and his great talk Working with Legacy Teams. We will ignore all existing violations based on their line in the file, so as soon as anybody touches the file the violations will show up again and the developer is responsible for fixing the legacy violations.

The Switch Off Violations Plugin

We are using the Switch Off Violations Plugin for SonarQube. It can be configured with different exclusion patterns for issues. You can define regular expressions for code blocks that should be ignored, or deactivate violations entirely, on a file basis or on a line basis.

For existing code you want to ignore all violations for certain files and lines. This can be done by inserting something like this in the Exclusion patterns text area:

de.fhopf.akka.actor.IndexingActor;pmd:SignatureDeclareThrowsException;[23]

This will exclude the violation for throwing raw Exceptions in line 23 of the IndexingActor class. When analyzing the code again this violation will be ignored.

Retrieving violations via the API

Besides the nice dashboard, SonarQube also offers an API that can be used to retrieve all the violations for a project. If you are not keen on looking up all existing violations in your code base and inserting them by hand, you can use it to generate the exclusion patterns automatically. All of the violations can be found at /api/violations, e.g. http://localhost:9000/api/violations.

I am sure there are other ways to do it but I used jsawk to parse the JSON response (on Ubuntu you have to install SpiderMonkey instead of the default js interpreter. And you have to compile it yourself. And I had to use a specific version. Sigh.).

Once you have set up all the components you can use jsawk to create the exclusion patterns for all existing violations:

curl -XGET 'http://localhost:9000/api/violations?depth=-1' | ./jsawk -a 'return this.join("\n")' 'return this.resource.key.split(":")[1] + ";*;[" + this.line + "]"' | sort | uniq

This will produce a list that can be pasted directly into the text area of the Switch Off Violations plugin or checked in to the repository as a file. With the next analysis you will then hopefully see zero violations. When somebody changes a file by inserting a line, the violations will show up again and should be fixed. Unfortunately some violations are not line based and will yield a line number of 'undefined'. I currently just remove those manually, so you might still see some violations.
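
If you prefer to stay on the JVM instead of setting up jsawk, roughly the same transformation can be done in a few lines of Java. The following is only a sketch, assuming Jackson 2 is on the classpath and that the second segment of resource.key contains the class name, mirroring the jsawk command above; the class name ExclusionPatternGenerator is made up for illustration.

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.net.URL;
import java.util.TreeSet;

public class ExclusionPatternGenerator {

    public static void main(String[] args) throws Exception {
        // fetch all violations, same endpoint as the curl call above
        JsonNode violations = new ObjectMapper()
                .readTree(new URL("http://localhost:9000/api/violations?depth=-1"));

        // a TreeSet gives us the sorted, de-duplicated output of 'sort | uniq'
        TreeSet<String> patterns = new TreeSet<String>();
        for (JsonNode violation : violations) {
            String className = violation.path("resource").path("key").asText().split(":")[1];
            // skip violations that are not bound to a line (the 'undefined' case)
            if (violation.has("line")) {
                patterns.add(className + ";*;[" + violation.get("line").asInt() + "]");
            }
        }
        for (String pattern : patterns) {
            System.out.println(pattern);
        }
    }
}

The output is the same list of className;*;[line] patterns, so it can be pasted into the plugin configuration as described above.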

Conclusion

I presented one way to reset your legacy code base to zero violations. With SonarQube 4.0 the functionality of the Switch Off Violations plugin will be available in the core, so it will be easier to use. I am still looking for the best way to keep the exclusion patterns up to date: once somebody has fixed the violations in an existing file the corresponding pattern should be removed.

Update January 9, 2014

Starting with SonarQube 4 this approach doesn't work anymore. Some features of the Switch Off Violations plugin have been moved to the core, but excluding violations by line is no longer possible and will not be implemented. The developers recommend looking only at the trends of the project rather than the overall violation count. This can be done nicely using the differentials.

Friday, October 25, 2013

Elasticsearch at Scale - Kiln and GitHub

Most of us are not exposed to data at real scale. It is getting more common, but I still appreciate that more progressive companies that have to fight with large volumes of data are open about it and talk about their problems and solutions. GitHub and Fog Creek are two of the larger users of Elasticsearch, and both have published articles and interviews on their setups. It's interesting that both of these companies are using it for a very specialized use case: source code search. As I have recently read the article on Kiln as well as the interview with the folks at GitHub, I'd like to summarize some of the points they made. Visit the original links for in-depth information.

Elasticsearch at Fog Creek for Kiln

In this article on InfoQ, Kevin Gessnar, a developer at Fog Creek, describes the process of migrating the code search of Kiln to Elasticsearch.

Initial Position

Kiln allows you to search on commit messages, filenames and file contents. For commit messages and filenames they were initially using the full text search features of SQL Server. For the file content search they were using a tool called OpenGrok that leverages Ctags to analyze the code and stores it in a Lucene index. This provided them with all of the features they needed, but unfortunately the solution couldn't scale with their requirements. Queries took several seconds, up to the timeout value of 30 seconds.

It's interesting to see that they decided against Solr because of poor read performance under heavy writes. It would be interesting to know whether this is still the case for current versions.

Scale

They are indexing several million documents every day, which comes to terabytes of data. They are still running their production system on only two nodes. These are numbers that really surprised me; I would have guessed that you need more nodes for this amount of data (well, probably those are really big machines). They seem to use Elasticsearch only for indexing and search and retrieve the data for the result display from their primary storage layer.

Elasticsearch at GitHub

Andrew Cholakian, who is doing a great job writing his book Exploring Elasticsearch in the open, published an interview with Tim Pease and Grant Rodgers of GitHub on their Elasticsearch setup, going through a lot of details.

Initial Position

GitHub used to have their search based on Solr. As the volume of data and searches increased they needed a solution that scales. Again, I would be interested whether current versions of SolrCloud could handle this volume.

Scale

They are really searching big data: 44 Amazon EC2 instances power search on 2 billion documents, which make up 30 terabytes of data. 8 of these instances don't hold any data but are only there to distribute the queries. They are planning to move from the 44 Amazon instances to 8 larger physical machines. Besides their user facing data they are indexing internal data like audit logs and exceptions (it isn't clear to me from the interview whether Elasticsearch is the primary data store in this case, which would be remarkable). They are using different clusters for different data types so that the external search is not affected when there are a lot of exceptions.

Challenges

Shortly after launching their new search feature, people started discovering that you could also search for files others had accidentally committed, like private SSH keys or passwords. This is an interesting phenomenon where just the possibility of better retrieval made a huge difference: all the information had been there before, it just couldn't be found easily. This led to an increase in search volume that was not anticipated. Due to some configuration issues (a suboptimal Java version, no setting for the minimum number of master nodes) their cluster became unstable and they had to disable search for the whole site.

Further Takeaways
  • Use routing to keep your data together on one shard (see the sketch after this list)
  • Thrift seems to be far more complicated from an ops point of view compared to HTTP
  • Use the slow query log
  • Time slicing your indices is a good idea if the data allows
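
To illustrate the first point, here is a rough sketch of what routing looks like with the Java API of an Elasticsearch version of that era (0.90.x). The index name, type and routing value are made up for illustration; documents indexed with the same routing value end up on the same shard, and searches that pass the routing value only have to query that shard.

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.index.query.QueryBuilders;

public class RoutingExample {

    public static void main(String[] args) {
        Client client = new TransportClient()
                .addTransportAddress(new InetSocketTransportAddress("localhost", 9300));

        // all documents of one repository share the routing value and
        // therefore end up on the same shard
        client.prepareIndex("code", "file", "1")
                .setRouting("my-repository")
                .setSource("{\"content\": \"public class Foo {}\"}")
                .execute().actionGet();

        // passing the same routing value at search time means only a single
        // shard has to be queried instead of all of them
        SearchResponse response = client.prepareSearch("code")
                .setRouting("my-repository")
                .setQuery(QueryBuilders.matchQuery("content", "Foo"))
                .execute().actionGet();

        System.out.println(response.getHits().getTotalHits());
        client.close();
    }
}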

A Common Theme

Both of these articles have some observations in common:

  • Elasticsearch is easy to get started with
  • Scaling is not an issue
  • The HTTP interface is good for debugging and operations
  • The Elasticsearch community and the company are really helpful when it comes to problems

Friday, October 11, 2013

Cope with Failure - Actor Supervision in Akka

A while ago I showed an example of how to use Akka to scale a simple application with multiple threads. Tasks can be split into several actors that communicate via immutable messages. State is encapsulated and each actor can be scaled independently. While implementing an actor you don't have to take care of low level building blocks like threads and synchronization, so it is far easier to reason about the application.

Besides these obvious benefits, fault tolerance is another important aspect. In this post I'd like to show you how you can leverage some of Akka's characteristics to make our example more robust.

The Application

To recap, we are building a simple web site crawler in Java to index pages in Lucene. The full code of the examples is available on GitHub. We are using three actors: one that keeps track of the pages to be visited and those visited already, one that downloads and parses the pages, and one that indexes the pages in Lucene.

By using several actors to download and parse pages we could see some good performance improvements.

What could possibly go wrong?

Things will fail. We are relying on external services (the page we are crawling) and therefore the network. Requests could time out or our parser could choke on the input. To make our example somewhat reproducible I just simulated an error: a new PageRetriever, the ChaosMonkeyPageRetriever, sometimes just throws an exception:

@Override
public PageContent fetchPageContent(String url) {
    // this error rate is derived from scientific measurements
    if (System.currentTimeMillis() % 20 == 0) {
      throw new RetrievalException("Something went horribly wrong when fetching the page.");
    }
    return super.fetchPageContent(url);
}

You can surely imagine what happens when we use this retriever in the sequential example that doesn't use Akka or threads. As we didn't take care of the failure, our application just stops when the exception occurs. One way we could mitigate this is by surrounding statements with try/catch blocks, but this soon intermingles a lot of recovery and fault processing code with our application logic. Once we have an application that is running in multiple threads, fault processing gets a lot harder: there is no easy way to notify other threads or save the state of the failing thread.

Supervision

Let's see Akka's behavior in case of an error. I added some logging that indicates the current state of the visited pages.

1939 [default-akka.actor.default-dispatcher-5] INFO de.fhopf.akka.actor.Master - inProgress:  55, allPages:  60
1952 [default-akka.actor.default-dispatcher-4] INFO de.fhopf.akka.actor.Master - inProgress:  54, allPages:  60
[ERROR] [10/10/2013 06:47:39.752] [default-akka.actor.default-dispatcher-5] [akka://default/user/$a/$a] Something went horribly wrong when fetching the page.
de.fhopf.akka.RetrievalException: Something went horribly wrong when fetching the page.
        at de.fhopf.akka.actor.parallel.ChaosMonkeyPageRetriever.fetchPageContent(ChaosMonkeyPageRetriever.java:21)
        at de.fhopf.akka.actor.PageParsingActor.onReceive(PageParsingActor.java:26)
        at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:167)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
        at akka.actor.ActorCell.invoke(ActorCell.scala:456)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

1998 [default-akka.actor.default-dispatcher-8] INFO de.fhopf.akka.actor.Master - inProgress:  53, allPages:  60
2001 [default-akka.actor.default-dispatcher-12] INFO de.fhopf.akka.actor.PageParsingActor - Restarting PageParsingActor because of class de.fhopf.akka.RetrievalException
2001 [default-akka.actor.default-dispatcher-2] INFO de.fhopf.akka.actor.PageParsingActor - Restarting PageParsingActor because of class de.fhopf.akka.RetrievalException
2001 [default-akka.actor.default-dispatcher-10] INFO de.fhopf.akka.actor.PageParsingActor - Restarting PageParsingActor because of class de.fhopf.akka.RetrievalException
[...]
2469 [default-akka.actor.default-dispatcher-12] INFO de.fhopf.akka.actor.Master - inProgress:   8, allPages:  78
2487 [default-akka.actor.default-dispatcher-7] INFO de.fhopf.akka.actor.Master - inProgress:   7, allPages:  78
2497 [default-akka.actor.default-dispatcher-5] INFO de.fhopf.akka.actor.Master - inProgress:   6, allPages:  78
2540 [default-akka.actor.default-dispatcher-13] INFO de.fhopf.akka.actor.Master - inProgress:   5, allPages:  78

We can see each exception that is happening in the log file, but our application keeps running. That is because of Akka's supervision support. Actors form hierarchies: our PageParsingActor is a child of the Master actor because it is created from its context. The Master is responsible for determining the fault strategy for its children. By default it will restart the actor in case of an exception, which makes sure that the next message is processed correctly. This means that even in case of an error Akka tries to keep the system in a running state.
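
For illustration, this is roughly what that parent-child relationship looks like in code. It is a simplified sketch, not the code from the example project (which also passes dependencies to its actors and uses a router); it only shows that creating an actor via getContext() instead of the ActorSystem makes the Master its supervisor. The Akka 2.2 Java API is assumed, and Props.create additionally assumes a no-argument constructor.

import akka.actor.ActorRef;
import akka.actor.Props;
import akka.actor.UntypedActor;

public class Master extends UntypedActor {

    // created from the Master's context, so the child sits below the Master
    // in the supervision hierarchy and the Master handles its failures
    private final ActorRef parser =
            getContext().actorOf(Props.create(PageParsingActor.class), "parser");

    @Override
    public void onReceive(Object message) {
        // hand the work to the child; if it throws, the Master's supervisor
        // strategy decides whether to resume, restart, stop or escalate
        parser.tell(message, getSelf());
    }
}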

The reaction to a failure is determined by the method supervisorStrategy() in the parent actor. Based on the exception class you can choose between several outcomes:

  • resume: Keep the actor running as if nothing had happened
  • restart: Replace the failing actor with a new instance
  • stop: Terminate the failing actor
  • escalate: Let your own parent decide what to do

A supervisor that would restart the actor for our exception and escalate otherwise could be added like this:

// allow 100 restarts in 1 minute ... this is a lot but the chaos monkey is rather busy
private SupervisorStrategy supervisorStrategy = new OneForOneStrategy(100, Duration.create("1 minute"), new Function<Throwable, Directive>() {

    @Override
    public Directive apply(Throwable t) throws Exception {
        if (t instanceof RetrievalException) {
            return SupervisorStrategy.restart();
        }
        // it would be best to model the default behaviour in other cases
        return SupervisorStrategy.escalate();
    }

});

@Override
public SupervisorStrategy supervisorStrategy() {
    return supervisorStrategy;
}

Let's come back to our example. Though Akka takes care of restarting our failing actors, the end result doesn't look good: the application survives the exceptions but then just hangs. This is caused by our business logic. The Master actor keeps all pages to visit in the VisitedPageStore and only commits the Lucene index when all pages are visited. As several requests failed we never receive the results for those pages, and the Master keeps waiting.

One way to fix this is to resend the message once the actor is restarted. Each actor class can implement methods that hook into the actor lifecycle. In preRestart() we can just send the message again.

@Override
public void preRestart(Throwable reason, Option<Object> message) throws Exception {
    logger.info("Restarting PageParsingActor and resending message '{}'", message);
    if (message.nonEmpty()) {
        getSelf().forward(message.get(), getContext());
    }
    super.preRestart(reason, message);
}

Now if we run this example we can see our actors recover from the failure. Though some exceptions are happening, all pages get visited eventually and everything will be indexed and committed in Lucene.

Though resending seems to be the solution to our failures, you need to be careful not to break your system with it: for some applications the message might be the cause of the failure, and by resending it you will keep your system busy with it in a livelock state. When using this approach you should at least add a count to the message that you can increment on restart. Once it has been resent too often you can then escalate the failure to have it handled in a different way.
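
A tiny sketch of what such a message could look like (hypothetical, not part of the example project): the retry count travels with the immutable message, and preRestart() would only resend while the count is below some threshold, escalating otherwise.

public final class RetriedMessage {

    private final Object payload;
    private final int attempts;

    public RetriedMessage(Object payload, int attempts) {
        this.payload = payload;
        this.attempts = attempts;
    }

    public Object getPayload() {
        return payload;
    }

    public int getAttempts() {
        return attempts;
    }

    // returns a new instance instead of mutating so the message stays immutable
    public RetriedMessage retried() {
        return new RetriedMessage(payload, attempts + 1);
    }
}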

Conclusion

We have only handled one specific type of failure, but you can already see how powerful Akka can be when it comes to fault tolerance. Recovery code is completely separated from the business code. To learn more about different aspects of error handling, read the Akka documentation on supervision and fault tolerance or this excellent article by Daniel Westheide.

Friday, October 4, 2013

Brian Foote on Prototyping

Big Ball of Mud is a collection of patterns by Brian Foote and Joseph Yoder, published in 1999. The title stems from one of the patterns, Big Ball of Mud, the "most frequently deployed of software architectures". Though this might sound like a joke at first, the article contains a lot of really useful information on the forces at work when dealing with large codebases and legacy code. I especially like Brian's take on prototyping applications.

Did you ever find yourself in the following situation? You agree with a customer to build a prototype of an application, to learn and to see something in action. After the prototype is finished the customer tries to force you to reuse it, "because it already does what we need". Cold sweat: you probably took lots of shortcuts in the prototype and didn't build it with maintainability in mind. This is what Brian recommends to avoid this situation:

One way to minimize the risk of a prototype being put into production is to write the prototype using a language or tool that you couldn't possibly use for a production version of your product.

Three observations:

  1. Nowadays the choice of language doesn't matter that much for running code in production, since virtual machines support a lot of languages.
  2. This only holds true for prototypes that are used to explore the domain. When doing a technical proof of concept, at least some parts of the prototype need to use the intended technology.
  3. Prototypes are sometimes also used to familiarize the team with a new technology that has already been chosen for the project.

Nevertheless this is really useful advice to keep in mind.
