Thursday, December 20, 2012

Gradle is too Clever for my Plans

While writing this post about the Lucene Codec API I noticed something strange when running the tests with Gradle. When experimenting with a library feature I write unit tests most of the time to validate my expectations. This is a habit I learned from Lucene in Action and it can also be useful in real world scenarios, e.g. to make sure that nothing breaks when you update a library.

OK, what happened? This time I did not only want the test result but also ran the test for a side effect: I wanted a Lucene index to be written to the /tmp directory so I could have a look at it manually. This worked fine the first time, but not afterwards, e.g. after my machine was rebooted and the directory cleared.

It turns out that the Gradle developers know that a test shouldn't be used to execute stuff. So once a test has run successfully it is just not run again until its input changes! Though this bit me this time, it is a really nice feature to speed up your builds. And if you really need to execute the tests, you can always run gradle cleanTest test.

Friday, December 7, 2012

Looking at a Plaintext Lucene Index

The Lucene file format is one of the reasons why Lucene is as fast as it is. An index consists of several binary files that you can't really inspect unless you use tools like the fantastic Luke.

Starting with Lucene 4 the format for these files can be configured using the Codec API. Several implementations are provided with the release, among those the SimpleTextCodec that can be used to write the files in plaintext for learning and debugging purposes.

To configure the Codec you just set it on the IndexWriterConfig:

StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
// recreate the index on each execution
config.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
config.setCodec(new SimpleTextCodec());

The rest of the indexing process stays exactly the same as it used to be:

Directory luceneDir = FSDirectory.open(plaintextDir);
try (IndexWriter writer = new IndexWriter(luceneDir, config)) {
    writer.addDocument(Arrays.asList(
            new TextField("title", "The title of my first document", Store.YES),
            new TextField("content", "The content of the first document", Store.NO)));

    writer.addDocument(Arrays.asList(
            new TextField("title", "The title of the second document", Store.YES),
            new TextField("content", "And this is the content", Store.NO)));
}

After running this code the index directory contains several files. Those are not the same type of files that are created using the default codec.

ls /tmp/lucene-plaintext/
_1_0.len  _1_1.len  _1.fld  _1.inf  _1.pst  _1.si  segments_2  segments.gen

The segments_x file is the starting point (x depends on the number of times you have written to the index before and starts with 1). This is still a binary file; it contains the name of the Codec that was used to write each segment.

The rest of the index files are all plaintext. They contain the same information as their binary cousins, just in a readable form. For example the .pst file represents the complete posting list, the structure you normally mean when talking about an inverted index:

field content
  term content
    doc 0
      freq 1
      pos 1
    doc 1
      freq 1
      pos 4
  term document
    doc 0
      freq 1
      pos 5
  term first
    doc 0
      freq 1
      pos 4
field title
  term document
    doc 0
      freq 1
      pos 5
    doc 1
      freq 1
      pos 5
  term first
    doc 0
      freq 1
      pos 4
  term my
    doc 0
      freq 1
      pos 3
  term second
    doc 1
      freq 1
      pos 4
  term title
    doc 0
      freq 1
      pos 1
    doc 1
      freq 1
      pos 1
END

The content that is marked as stored resides in the .fld file:

doc 0
  numfields 1
  field 0
    name title
    type string
    value The title of my first document
doc 1
  numfields 1
  field 0
    name title
    type string
    value The title of the second document
END

If you'd like to have a look at the rest of the files, check out the code on GitHub.

The SimpleTextCodec is only an interesting byproduct. The Codec API can be used for a lot of useful things. For example, the feature to read indices of older Lucene versions is implemented using separate codecs. Also, you can mix several Codecs in an index, so reindexing on version updates should not be necessary immediately. I am sure more useful codecs will pop up in the future.

Thursday, August 23, 2012

Getting rid of synchronized: Using Akka from Java

I've been giving an internal talk on Akka, the Actor framework for the JVM, at my former company synyx. For the talk I implemented a small example application, kind of a web crawler, using Akka. I published the source code on Github and will explain some of the concepts in this post.

Motivation

To see why you might need something like Akka, suppose you want to implement a simple web crawler for offline search. You download pages from a certain location, parse and index the content and follow any links that you haven't parsed and indexed yet. I am using HtmlParser for downloading and parsing pages and Lucene for indexing them. The logic is contained in two service objects, PageRetriever and Indexer, that can be used from our main application.

A simple sequential execution might then look something like this:

public void downloadAndIndex(String path, IndexWriter writer) {
    VisitedPageStore pageStore = new VisitedPageStore();
    pageStore.add(path);
        
    Indexer indexer = new IndexerImpl(writer);
    PageRetriever retriever = new HtmlParserPageRetriever(path);
        
    String page;
    while ((page = pageStore.getNext()) != null) {
        PageContent pageContent = retriever.fetchPageContent(page);
        pageStore.addAll(pageContent.getLinksToFollow());
        indexer.index(pageContent);
        pageStore.finished(page);
    }
        
    indexer.commit();
}

We start with one page, extract the content and the links, index the content and store all links that are still to be visited in the VisitedPageStore. This class contains the logic to determine which links have already been visited. We loop as long as there are more links to follow; once we are done we commit the Lucene IndexWriter.
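
The VisitedPageStore referenced here is a plain Java class; the real implementation is part of the example project on GitHub, but a rough, simplified sketch of how it might look (a hypothetical version, not thread-safe, which is fine as long as only one thread or one Actor touches it) could be:

import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Queue;
import java.util.Set;

public class VisitedPageStore {

    // every link we have ever seen, used to avoid visiting a page twice
    private final Set<String> allPages = new HashSet<String>();
    // links that still need to be fetched
    private final Queue<String> toVisit = new LinkedList<String>();
    // links that are currently being fetched or indexed
    private final Set<String> inProgress = new HashSet<String>();

    public void add(String page) {
        if (allPages.add(page)) {
            toVisit.add(page);
        }
    }

    public void addAll(Collection<String> pages) {
        for (String page : pages) {
            add(page);
        }
    }

    public String getNext() {
        String next = toVisit.poll();
        if (next != null) {
            inProgress.add(next);
        }
        return next;
    }

    public Collection<String> getNextBatch() {
        List<String> batch = new ArrayList<String>();
        String next;
        while ((next = getNext()) != null) {
            batch.add(next);
        }
        return batch;
    }

    public void finished(String page) {
        inProgress.remove(page);
    }

    public boolean isFinished() {
        return toVisit.isEmpty() && inProgress.isEmpty();
    }
}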

This implementation works fine; running on my outdated laptop it finishes in around 3 seconds for an example page. (Note that the times I am giving are by no means meant as a benchmark but are just there to give you some idea of the numbers.)

So are we done? No, of course we can do better by making use of the resources we have available. Let's try to improve this solution by splitting it into several tasks that can be executed in parallel.

Shared State Concurrency

The normal way in Java would be to implement several Threads that do parts of the work and access the state via guarded blocks, e.g. by synchronizing methods. So in our case there might be several Threads that access our global state that is stored in the VisitedPageStore.

This model is what Venkat Subramaniam calls Synchronize and Suffer in his great book Programming Concurrency on the JVM. Working with Threads and building correct solutions might not seem that hard at first but is inherently difficult. I like these two tweets that illustrate the problem:

Brian Goetz of course being the author of the de-facto standard book on the new Java concurrency features, Java Concurrency in Practice.

Akka

So what is Akka? It's an Actor framework for the JVM that is implemented in Scala but that is something that you rarely notice when working from Java. It offers a nice Java API that provides most of the functionality in a convenient way.

Actors are a concept that was introduced in the seventies but became widely known as one of the core features of Erlang, a language for building fault-tolerant, self-healing systems. Actors employ the concept of Message Passing Concurrency. That means that Actors only communicate by means of messages that are passed into an Actor's mailbox. Actors can contain state that they shield from the rest of the system. The only way to change the state is by passing in messages. Actors are executed on separate Threads by the framework but provide a higher level of abstraction than working with Threads directly.

When implementing Actors you put the behaviour in the onReceive() method that acts on incoming messages. You can then reply asynchronously to the sender or send messages to any other Actor.
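
As a minimal illustration of that shape, here is a made-up EchoActor that is not part of the crawler but shows how an actor looks with the Java API:

import akka.actor.UntypedActor;

public class EchoActor extends UntypedActor {

    @Override
    public void onReceive(Object message) throws Exception {
        if (message instanceof String) {
            // reply asynchronously to the Actor that sent the message
            getSender().tell("echo: " + message, getSelf());
        } else {
            // let the framework deal with messages we don't understand
            unhandled(message);
        }
    }
}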

For our problem at hand an Actor setup might look something like this:

There is one Master Actor that also contains the global state. It sends a message to fetch a certain page to a PageParsingActor that asynchronously responds to the Master with the PageContent. The Master can then send the PageContent to an IndexingActor which responds with another message. With this setup we have taken a first step towards scaling our solution. There are now three Actors that can run on different cores of your machine.

Actors are instantiated from other Actors. At the top there's the ActorSystem that is provided by the framework. The MasterActor is instantiated from the ActorSystem:

ActorSystem actorSystem = ActorSystem.create();
final CountDownLatch countDownLatch = new CountDownLatch(1);
ActorRef master = actorSystem.actorOf(new Props(new UntypedActorFactory() {

    @Override
    public Actor create() {
        return new SimpleActorMaster(new HtmlParserPageRetriever(path), writer, countDownLatch);
    }
}));

master.tell(path);
try {
    countDownLatch.await();
    actorSystem.shutdown();
} catch (InterruptedException ex) {
    throw new IllegalStateException(ex);
}

Ignore the CountDownLatch as it is only included to make it possible to terminate the application. Note that we are not referencing an instance of our class but an ActorRef, a reference to an actor. You will see later why this is important.

The MasterActor contains references to the other Actors and creates them from its context. This makes the two Actors children of the Master:

public SimpleActorMaster(final PageRetriever pageRetriever, final IndexWriter indexWriter,
    final CountDownLatch latch) {

    super(latch);
    this.indexer = getContext().actorOf(new Props(new UntypedActorFactory() {

        @Override
        public Actor create() {

            return new IndexingActor(new IndexerImpl(indexWriter));
        }
    }));

    this.parser = getContext().actorOf(new Props(new UntypedActorFactory() {

        @Override
        public Actor create() {

           return new PageParsingActor(pageRetriever);
        }
    }));
}

The PageParsingActor acts on messages to fetch pages and sends a message with the result to the sender:

public void onReceive(Object o) throws Exception {
    if (o instanceof String) {
        PageContent content = pageRetriever.fetchPageContent((String) o);
        getSender().tell(content, getSelf());
    } else {
        // fail on any message we don't expect
        unhandled(o);
    }
}

The IndexingActor contains some state with the Indexer. It acts on messages to index pages and to commit the indexing process.

public void onReceive(Object o) throws Exception {
    if (o instanceof PageContent) {
        PageContent content = (PageContent) o;
        indexer.index(content);
        getSender().tell(new IndexedMessage(content.getPath()), getSelf());
    } else if (COMMIT_MESSAGE == o) {
        indexer.commit();
        getSender().tell(COMMITTED_MESSAGE, getSelf());
    } else {
        unhandled(o);
    }
}
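
The IndexedMessage used for the reply is just a small immutable message class that carries the path of the indexed page. A sketch of how it might look (the real class is in the example project):

public class IndexedMessage {

    public final String path;

    public IndexedMessage(String path) {
        this.path = path;
    }
}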

The MasterActor finally orchestrates the other Actors in its onReceive() method. It starts with one page and sends it to the PageParsingActor. It keeps the valuable state of the application in the VisitedPageStore. When no more pages are to be fetched and indexed it sends a commit message and terminates the application.

public void onReceive(Object message) throws Exception {

    if (message instanceof String) {
        // start
        String start = (String) message;
        visitedPageStore.add(start);
        getParser().tell(visitedPageStore.getNext(), getSelf());
    } else if (message instanceof PageContent) {
        PageContent content = (PageContent) message;
        getIndexer().tell(content, getSelf());
        visitedPageStore.addAll(content.getLinksToFollow());

        if (visitedPageStore.isFinished()) {
            getIndexer().tell(IndexingActor.COMMIT_MESSAGE, getSelf());
        } else {
            for (String page : visitedPageStore.getNextBatch()) {
                getParser().tell(page, getSelf());
            }
        }
    } else if (message instanceof IndexedMessage) {
        IndexedMessage indexedMessage = (IndexedMessage) message;
        visitedPageStore.finished(indexedMessage.path);

        if (visitedPageStore.isFinished()) {
            getIndexer().tell(IndexingActor.COMMIT_MESSAGE, getSelf());
        }
    } else if (message == IndexingActor.COMMITTED_MESSAGE) {
        logger.info("Shutting down, finished");
        getContext().system().shutdown();
        countDownLatch.countDown();
    }
}

What happens if we run this example? Unfortunately it now takes around 3.5 seconds on my dual core machine. Though we are now able to run on both cores we have actually decreased the speed of the application. This is probably an important lesson. When building scalable applications it might happen that you are introducing some overhead that decreases the performance when running in the small. Scalability is not about increasing performance but about the ability to distribute the load.

So was it a failure to switch to Akka? Not at all. It turns out that most of the time the application is fetching and parsing pages. This includes waiting for the network. Indexing in Lucene is blazing fast and the Master mostly just dispatches messages. So what can we do about it? We have already split our application into smaller chunks. Fortunately the PageParsingActor doesn't contain any state at all. That means we can easily parallelize its tasks.

This is where talking to references becomes important. For an Actor it's transparent whether there is one Actor or a million Actors behind a reference. There is one mailbox for an Actor reference that can dispatch the messages to any number of Actors.

We only need to change the instantiation of the Actor; the rest of the application stays the same:

parser = getContext().actorOf(new Props(new UntypedActorFactory() {

        @Override
        public Actor create() {

            return new PageParsingActor(pageRetriever);
        }
}).withRouter(new RoundRobinRouter(10)));

By using a router the Akka framework automatically makes sure that there are 10 Actors available. The messages are distributed to any available Actor. This takes the runtime down to 2 seconds.

A word on Blocking

Note that the way I am doing network requests here is not recommended with Akka. HtmlParser does blocking network IO, which should be carefully reconsidered when designing a reactive system. In fact, as this application is highly network bound, we might even gain more benefit by just using an asynchronous networking library. But hey, then I wouldn't be able to tell you how nice it is to use Akka. In a future post I will highlight some more Akka features that can help to make our application more robust and fault tolerant.

Thursday, July 5, 2012

Slides and demo code for my talk at JUG KA available

I just uploaded the (German) slides as well as the example code for yesterday's talk on Lucene and Solr at our local Java User Group.

The demo application contains several subprojects for indexing and searching with Lucene and Solr as well as a simple Dropwizard application that demonstrates some search features. See the README files in the source tree to find out how to run the application.

Friday, June 29, 2012

Dropwizard Encoding Woes

I have been working on an example application for Lucene and Solr for my upcoming talk at the Java User Group Karlsruhe. As a web framework I wanted to try Dropwizard, a lightweight application framework that can expose resources via JAX-RS, provides out-of-the-box monitoring support and can render resource representations using Freemarker. It's really easy to get started; there's a good tutorial and a manual.

An example resource might look like this:

import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;

@Path("/example")
@Produces(MediaType.TEXT_HTML)
public class ExampleResource {

    @GET
    public ExampleView illustrate() {
        return new ExampleView("Mot\u00f6rhead");
    }

}

The resource produces HTML using Freemarker, which is possible if you add the view bundle to the service. There is one method that is called when the resource is requested using GET. Inside the method we create a view object accepting a message that in this case contains the umlaut 'ö'.
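
How the view bundle is registered depends on the Dropwizard version. With the 0.6-style Bootstrap API a service might look roughly like the following sketch; the class names and the use of the plain Configuration class are assumptions for the example, not taken from the demo project:

import com.yammer.dropwizard.Service;
import com.yammer.dropwizard.config.Bootstrap;
import com.yammer.dropwizard.config.Configuration;
import com.yammer.dropwizard.config.Environment;
import com.yammer.dropwizard.views.ViewBundle;

public class ExampleService extends Service<Configuration> {

    @Override
    public void initialize(Bootstrap<Configuration> bootstrap) {
        // makes Freemarker rendering available for resources that return View subclasses
        bootstrap.addBundle(new ViewBundle());
    }

    @Override
    public void run(Configuration configuration, Environment environment) {
        environment.addResource(new ExampleResource());
    }
}

The view class that is returned by the resource method looks like this: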

import com.yammer.dropwizard.views.View;

public class ExampleView extends View {

    private final String message;

    public ExampleView(String message) {
        super("example.fmt");
        this.message = message;
    }

    public String getMessage() {
        return message;
    }
}

It accepts a message as constructor parameter. The template name is passed to the parent class. This view class is now available in a Freemarker template; an easy variant looks like this:

<html>
    <body>
        <h1>${message} rocks!</h1>
    </body>
</html>

If I run this on my machine and access it with Firefox it doesn't work as expected. The umlaut character is broken, something Lemmy surely doesn't approve of:

Accessing the resource using curl works flawlessly:

curl http://localhost:8080/example
<html>
    <body>
        <h1>Motörhead rocks!</h1>
    </body>
</html>

Why is that? It's Servlet Programming 101: you need to set the character encoding of the response. My Firefox defaults to ISO-8859-1; curl seems to use UTF-8 by default. How can we fix it? Tell the client which encoding we are using, which can be done using the Produces annotation:

@Produces("text/html; charset=utf-8")

So what does this have to do with Dropwizard? Nothing really, it's a JAX-RS thing. All components in Dropwizard (notably Jetty and Freemarker) use UTF-8 by default.

Wednesday, June 20, 2012

Running and Testing Solr with Gradle

A while ago I blogged about testing Solr with Maven on the synyx blog. In this post I will show you how to set up a similar project with Gradle that can start the Solr webapp and execute tests against your configuration.

Running Solr

Solr runs as a webapp in any Java EE servlet container like Tomcat or Jetty. The index and search configuration resides in a directory commonly referred to as Solr home that can be outside of the webapp directory. This is also the place where the Lucene index files are created. The location of Solr home can be set using a system property.

The Solr war file is available in Maven Central. This post describes how to run a war file that is deployed in a Maven repository using Gradle. Let's see what the Gradle build file for running Solr looks like:

import org.gradle.api.plugins.jetty.JettyRunWar

apply plugin: 'java'
apply plugin: 'jetty'

repositories {
    mavenCentral()
}

// custom configuration for running the webapp
configurations {
    solrWebApp
}

dependencies {
    solrWebApp "org.apache.solr:solr:3.6.0@war"
}

// custom task that configures the jetty plugin
task runSolr(type: JettyRunWar) {
    webApp = configurations.solrWebApp.singleFile

    // jetty configuration
    httpPort = 8082
    contextPath = 'solr'
}

// executed before jetty starts
runSolr.doFirst {
    System.setProperty("solr.solr.home", "./solrhome")
}

We are creating a custom configuration that contains the Solr war file. In the task runSolr we configure the Jetty plugin. To set the Solr home property we can use the approach described by Sebastian Himberger: we add a code block that is executed before Jetty starts and sets the system property using standard Java mechanisms. You can now start Solr using gradle runSolr. You will see some errors regarding multiple versions of slf4j that are very likely caused by this bug.

Testing the Solr configuration

Solr provides some classes that start an embedded instance using your configuration. You can use these classes in any setup as they do not depend on the Gradle jetty plugin. Starting with Solr 3.2 the test framework is not included in solr-core anymore. This is what the relevant part of the dependency section looks like now:

testCompile "junit:junit:4.10"
testCompile "org.apache.solr:solr-test-framework:3.6.0"

Now you can place a test in src/test/java that either uses the convenience methods provided by SolrTestCaseJ4 or instantiates an EmbeddedSolrServer and executes any SolrJ actions. Both ways will use your custom config. This way you can easily validate that configuration changes don't break existing functionality. An example of using the convenience methods:

import org.apache.solr.SolrTestCaseJ4;
import org.apache.solr.client.solrj.SolrServerException;
import org.junit.BeforeClass;
import org.junit.Test;
import java.io.IOException;

public class BasicConfigurationTest extends SolrTestCaseJ4 {

    @BeforeClass
    public static void initCore() throws Exception {
        SolrTestCaseJ4.initCore("solrhome/conf/solrconfig.xml", "solrhome/conf/schema.xml", "solrhome/");
    }

    @Test
    public void noResultInEmptyIndex() throws SolrServerException {
        assertQ("test query on empty index",
                req("text that is not found")
                , "//result[@numFound='0']"
        );
    }

    @Test
    public void pathIsMandatory() throws SolrServerException, IOException {
        assertFailedU(adoc("title", "the title"));
    }

    @Test
    public void simpleDocumentIsIndexedAndFound() throws SolrServerException, IOException {
        assertU(adoc("path", "/tmp/foo", "content", "Some important content."));
        assertU(commit());

        assertQ("added document found",
                req("important")
                , "//result[@numFound='1']"
        );
    }

}

We extend the class SolrTestCaseJ4 which is responsible for creating the core and instantiating the runtime using the paths we provide with the method initCore(). Using the available assert methods you can execute queries and validate the result using XPath expressions.

An example that instantiates a SolrServer might look like this:

import org.apache.solr.SolrTestCaseJ4;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.common.params.SolrParams;
import org.junit.After;
import org.junit.Before;
import org.junit.BeforeClass;
import org.junit.Test;

import java.io.IOException;

public class ServerBasedTalkTest extends SolrTestCaseJ4 {

    private EmbeddedSolrServer server;

    @BeforeClass
    public static void initCore() throws Exception {
        SolrTestCaseJ4.initCore("solr/conf/solrconfig.xml", "solr/conf/schema.xml");
    }

    @Before
    public void initServer() {
        server = new EmbeddedSolrServer(h.getCoreContainer(), h.getCore().getName());
    }

    @Test
    public void queryOnEmptyIndexNoResults() throws SolrServerException {
        QueryResponse response = server.query(new SolrQuery("text that is not found"));
        assertTrue(response.getResults().isEmpty());
    }

    @Test
    public void singleDocumentIsFound() throws IOException, SolrServerException {
        SolrInputDocument document = new SolrInputDocument();
        document.addField("path", "/tmp/foo");
        document.addField("content", "Mein Hut der hat 4 Ecken");

        server.add(document);
        server.commit();

        SolrParams params = new SolrQuery("ecke");
        QueryResponse response = server.query(params);
        assertEquals(1L, response.getResults().getNumFound());
        assertEquals("/tmp/foo", response.getResults().get(0).get("path"));
    }

    @After
    public void clearIndex() {
        super.clearIndex();
    }
}

The tests can now be executed using gradle test.

Testing your Solr configuration is important as changes in one place might easily lead to side effects on other search functionality. I recommend adding tests even for basic functionality and evolving the tests with your project.

Saturday, June 16, 2012

Reading term values for fields from a Lucene Index

Sometimes when using Lucene you might want to retrieve all term values for a given field. Think of categories that you want to display as search links or in a filtering dropdown box. Indexing might look something like this:

IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36));
IndexWriter writer = new IndexWriter(directory, config);

Document doc = new Document();

doc.add(new Field("Category", "Category1", Field.Store.NO, Field.Index.NOT_ANALYZED));
doc.add(new Field("Category", "Category2", Field.Store.NO, Field.Index.NOT_ANALYZED));
doc.add(new Field("Author", "Florian Hopf", Field.Store.NO, Field.Index.NOT_ANALYZED));
writer.addDocument(doc);

doc.add(new Field("Category", "Category3", Field.Store.NO, Field.Index.NOT_ANALYZED));
doc.add(new Field("Category", "Category2", Field.Store.NO, Field.Index.NOT_ANALYZED));
doc.add(new Field("Author", "Theo Tester", Field.Store.NO, Field.Index.NOT_ANALYZED));
writer.addDocument(doc);

writer.close();

We are adding two documents, one that is assigned Category1 and Category2 and one that is assigned Category2 and Category3. Note that we are adding both fields unanalyzed so the Strings are added to the index as they are. Lucene's index looks something like this afterwards:

Field      Term            Documents
Author     Florian Hopf    1
           Theo Tester     2
Category   Category1       1
           Category2       1, 2
           Category3       2

The fields are sorted alphabetically by field name first and then by term value. You can access the values using the IndexReader's terms() method that returns a TermEnum. You can instruct the IndexReader to start with a certain term so you can jump directly to the category without having to iterate over all values. But before we do this let's look at how we are used to accessing Enumeration values in Java:

Enumeration en = ...;
while(en.hasMoreElements()) {
    Object obj = en.nextElement();
    ...
}

In a while loop we check if there is another element and retrieve it inside the loop. As this pattern is very common, when iterating the terms with Lucene you might end up with something like this (note that all the examples here are missing the stop condition: if there are more fields, the terms of those fields will also be iterated):

TermEnum terms = reader.terms(new Term("Category"));
// this code is broken, don't use
while(terms.next()) {
    Term term = terms.term();
    System.out.println(term.text());
}

The next() method returns a boolean indicating whether there are more elements and advances to the next element. The term() method can then be used to retrieve the Term. But this doesn't work as expected: the code only finds Category2 and Category3 but skips Category1. Why is that? The Lucene TermEnum works differently from the Java Enumerations we are used to. When the TermEnum is returned it already points to the first element, so with next() we skip this first element.

This snippet instead works correctly using a for loop:

TermEnum terms = reader.terms(new Term("Category"));
for(Term term = terms.term(); term != null; terms.next(), term = terms.term()) {
    System.out.println(term.text());
}

Or you can use a do while loop with a check for the first element:

TermEnum terms = reader.terms(new Term("Category"));
if (terms.term() != null) {
    do {
        Term term = terms.term();
        System.out.println(term.text());
    } while(terms.next());
}

You can't really blame Lucene for this as the methods are aptly named. It's our habits that lead to minor errors like this.

Wednesday, June 6, 2012

Berlin Buzzwords 2012

Berlin Buzzwords is an annual conference on search, store and scale technology. I've heard good things about it before and finally got convinced to go there this year. The conference itself lasts for two days but there are additional events before and afterwards so if you like you can spend a whole week.

The Barcamp

As I had to travel on Sunday anyway I took an earlier train to attend the barcamp in the early evening. It started with a short introduction of the concept and the scheduling. Participants could suggest topics that they either would be willing to introduce themselves or that they were just interested in. There were three rooms prepared, one larger and two smaller ones.

Among others I attended sessions on HBase, designing hybrid applications, Apache Tika and Apache Jackrabbit Oak.

HBase is a distributed database built on top of the Hadoop filesystem. It seems to be used more often than I would have expected. It was interesting to hear about the problems and solutions of other people.

The next session on hybrid relational and NoSQL applications stayed rather high level. I liked the remark by one guy that Solr, the underdog of NoSQL, often is the first application where people are ok with dismissing some guarantees regarding their data. Adding NoSQL should be exactly like this.

I only started using Tika directly quite recently so it was really interesting to see where the project is heading in the future. I was surprised to hear that there now also is a TikaServer that can do similar things to those I described for Solr. That's something I want to try in action.

Jackrabbit Oak is a next generation content repository that is mostly driven by the Day team at Adobe. Some of the ideas sound really interesting but I got the feeling that it can still take some time until this can really be used. Jukka Zitting also gave a lightning talk on this topic at the conference; the slides are available here.

The atmosphere in the sessions was really relaxed, so even though I expected to only listen I took the chance to participate and ask some questions. This is probably the part that makes a barcamp as effective as it is. As you are constantly participating you stay really concentrated on the topic.

Day 1

The first day started with a great keynote by Leslie Hawthorn on building and maintaining communities. She compared a lot of the aspects of community work with gardening and introduced OpenMRS, a successful project building a medical record platform. Though I currently am not actively involved in an open source project I could relate to a lot of the situations she described. All in all an inspiring start of the main conference.

Next I attended a talk on building hybrid applications with MongoDB. Nothing new for me but I am glad that a lot of people now recommend splitting monolithic applications into smaller services. This also is a way to experiment with different languages and techniques without having to migrate large parts of an application.

A JCR view of the world provided some examples of how to model different structures using a content tree. Though really introductory it was interesting to see what kinds of applications can be built using a content repository. I also liked the attitude of the speaker: the presentation was delivered using Apache Sling, which uses JCR under the hood.

Probably the highlight of the first day was the talk by Grant Ingersoll on Large Scale Search, Discovery and Analytics. He introduced all the parts that make up larger search systems and showed the open source tools he uses. To increase the relevance of the search results you have to integrate solutions that adapt to the behaviour of the users. That's probably one of the big takeaways of the whole conference for me: always collect data on your users' searches so that it is available when you want to tune the relevance, either manually or through some learning technique. The slides of the talk are worth looking at.

The rest of the day I attended several talks on the internals of Lucene. Hardcore stuff, I would be lying if I said I would have understood everything but it was interesting nevertheless. I am glad that some really smart people are taking care that Lucene stays as fast and feature rich as it is.

The day ended with interesting discussions and some beer at the Buzz Party and BBQ.

Day 2

The first talk of the second day on Smart Autocompl... by Anne Veling was fantastic. Anne demonstrated a rather simple technique for doing semantic analysis of search queries for specialized autocompletion for the largest travel information system in the Netherlands. The query gets tokenized and then each field of the index (e.g. street or city) is queried for each of the tokens. This way you can already guess which fields might be good matches.

Another talk introduced Hydra, a scalable tool for preprocessing documents. It stores the documents as well as mapping data in a MongoDB instance and you can parallelize the processing steps. The concept sounds really interesting; I hope I can find time to have a closer look.

In the afternoon I attended several talks on Elasticsearch, the scalable search server. Interestingly a lot of people seem to use it more as a storage engine than for searching.

One of the tracks was cancelled, so Ted Dunning introduced new stuff in Mahout instead. He's a really funny speaker and though I am not deep into machine learning I was glad to hear that you are allowed to use and even contribute to Mahout even if you don't have a PhD.

In the last track of the day Alex Pinkin showed 10 problems and solutions that you might encounter when building a large app using Solr. Quite some useful advice.

The location

The event took place at Urania, a smaller conference center and theatre. Mostly it was well suited but some of the talks were so full that you either had to sit on the floor or weren't even able to enter the room. I understand that it is difficult to predict how many people will attend a certain talk but some talks should probably have been scheduled in different rooms.

The food was really good and though it first looked like the distribution was a bottleneck this worked pretty well.

The format

This year Berlin Buzzwords had a rather unusual format. Most of the talks were only 20 minutes long, with some exceptions that were 40 minutes. I have mixed feelings about this: on the one hand it was great to have a lot of different topics, on the other hand some of the concepts definitely would have needed more time to fully explain and grasp. Respect to all the speakers who had to decide what to cover in such a short timeframe.

Berlin Buzzwords is a fantastic conference and I will definitely go there again.

Friday, May 11, 2012

Content Extraction with Apache Tika

Sometimes you need access to the content of documents, be it that you want to analyze it, store the content in a database or index it for searching. Different formats like Word documents, PDFs and HTML pages need different treatment. Apache Tika is a project that combines several open source projects for reading content from a multitude of file formats and makes the textual content as well as some metadata available using a uniform API. I will show two ways to leverage the power of Tika for your projects.

Accessing Tika programmatically

First, Tika can of course be used as a library. Surprisingly the user docs on the website explain a lot of the functionality that you might be interested in when writing custom parsers for Tika but don't show directly how to use it.

I am using Maven again, so I add a dependency for the most recent version:

<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.1</version>
    <type>jar</type>
</dependency>

tika-parsers also includes all the other projects that are used so be patient when Maven fetches all the transitive dependencies.

Let's see what some test code for extracting data from a PDF document called slides.pdf, which is available on the classpath, looks like.

Parser parser = new PDFParser();
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
InputStream content = getClass().getResourceAsStream("/slides.pdf");
parser.parse(content, handler, metadata, new ParseContext());
assertEquals("Solr Vortrag", metadata.get(Metadata.TITLE));
assertTrue(handler.toString().contains("Lucene"));

First, we need to instantiate a Parser that is capable of reading the format, in this case the PDFParser that uses PDFBox for extracting the content. The parse method expects some parameters to configure the parsing process as well as an InputStream that contains the data of the document. After parsing is finished, Metadata will contain all the metadata for the document, e.g. the title or the author.

Tika uses XHTML as the internal representation for all parsed content. This XHTML document can be processed by a SAX ContentHandler. A custom implementation BodyContentHandler returns all the text in the body area, which is the main content. The last parameter ParseContext can be used to configure the underlying parser instance.
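
One detail worth knowing, taken from the Tika javadocs rather than from the example above: the default BodyContentHandler caps the collected text at around 100,000 characters. For large documents you can pass -1 to disable the limit or hand in your own Writer:

// collect the whole body, no matter how large the document is
BodyContentHandler unlimitedHandler = new BodyContentHandler(-1);

// or stream the body text into a file instead of keeping it in memory
BodyContentHandler fileHandler = new BodyContentHandler(new FileWriter("/tmp/slides.txt"));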

The Metadata class consists of a Map-like structure with some common keys like the title as well as optional format specific information. You can look at the contents with a simple loop:

for (String name: metadata.names()) { 
    System.out.println(name + ": " + metadata.get(name));
}

This will produce an output similar to this:

xmpTPg:NPages: 17
Creation-Date: 2010-11-20T09:47:28Z
title: Solr Vortrag
created: Sat Nov 20 10:47:28 CET 2010
producer: OpenOffice.org 2.4
Content-Type: application/pdf
creator: Impress

The textual content of the document can be retrieved by calling the toString() method on the BodyContentHandler.

This is all fine if you know for sure that you only want to retrieve data from PDF documents. But you probably don't want to introduce a huge switch block for determining the parser to use depending on the file name or some other information. Fortunately Tika also provides an AutoDetectParser that employs different strategies for determining the content type of the document. All the code above stays the same, you just use a different parser:

Parser parser = new AutoDetectParser();

This way you don't have to know what kind of document you are currently processing; Tika will provide you with the metadata as well as the content. You can pass in additional hints for the parser, e.g. the filename or the content type, by setting them in the Metadata object.
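
A short sketch of what passing those hints might look like, reusing the slides.pdf example from above:

Parser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
// hints for the detection: the original file name and, if already known, the content type
metadata.set(Metadata.RESOURCE_NAME_KEY, "slides.pdf");
metadata.set(Metadata.CONTENT_TYPE, "application/pdf");
InputStream content = getClass().getResourceAsStream("/slides.pdf");
parser.parse(content, handler, metadata, new ParseContext());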

Extracting content using Solr

If you are using the search server Solr you can also leverage its REST API for extracting the content. The default configuration has a request handler configured for /update/extract that you can send a document to and it will return the content it extracted using Tika. You just need to add the necessary libraries for the extraction. I am still using Maven so I have to add an additional dependency:

<dependency>
    <groupId>org.apache.solr</groupId>
    <artifactId>solr</artifactId>
    <version>3.6.0</version>
    <type>war</type>
</dependency>
<dependency>
    <groupId>org.apache.solr</groupId>
    <artifactId>solr-cell</artifactId>
    <version>3.6.0</version>
    <type>jar</type>
</dependency>

This will include all of the Tika dependencies as well as all necessary third party libraries.

Solr Cell, the request handler, is normally used to index binary files directly but you can also use it just for extraction. To transfer the content you can use any tool that speaks HTTP; with curl this might look like this:

curl -F "file=@slides.pdf" "localhost:8983/solr/update/extract?extractOnly=true&extractFormat=text"

By setting the parameter extractOnly to true we tell Solr that we don't want to index the content but only want to have it extracted into the response. The result will be the standard Solr XML format that contains the body content as well as the metadata.

You can also use the Java client library SolrJ for doing the same:

ContentStreamUpdateRequest request = new ContentStreamUpdateRequest("/update/extract");
request.addFile(new File("slides.pdf"));
request.setParam("extractOnly", "true");
request.setParam("extractFormat", "text");
NamedList<Object> result = server.request(request);

The NamedList will contain entries for the body content as well as another NamedList with the metadata.
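
Reading both entries from the NamedList could look like this; the keys correspond to the file name that was sent, as can be seen in the responses in the update below:

String extractedText = (String) result.get("slides.pdf");
NamedList<?> fileMetadata = (NamedList<?>) result.get("slides.pdf_metadata");
System.out.println(fileMetadata.get("title"));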


Update

Robert has asked in the comments what the response looks like. Solr uses configurable response writers for marshalling the message. The default format is XML but it can be changed by passing the wt parameter with the request. A simplified standard response looks like this:

curl -F "file=@slides.pdf" "localhost:8983/solr/update/extract?extractOnly=true&extractFormat=text"
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">1952</int></lst><str name="slides.pdf">

Features

HTTP­Schnittstelle
XML­basierte Konfiguration
Facettierung
Sammlung nützlicher Lucene­Module/Dismax

Features

HTTP­Schnittstelle
XML­basierte Konfiguration
Facettierung
Sammlung nützlicher Lucene­Module/Dismax
Java­Client SolrJ

[... more content ...]

</str><lst name="slides.pdf_metadata"><arr name="xmpTPg:NPages"><str>17</str></arr><arr name="Creation-Date"><str>2010-11-20T09:47:28Z</str></arr><arr name="title"><str>Solr Vortrag</str></arr><arr name="stream_source_info"><str>file</str></arr><arr name="created"><str>Sat Nov 20 10:47:28 CET 2010</str></arr><arr name="stream_content_type"><str>application/octet-stream</str></arr><arr name="stream_size"><str>425327</str></arr><arr name="producer"><str>OpenOffice.org 2.4</str></arr><arr name="stream_name"><str>slides.pdf</str></arr><arr name="Content-Type"><str>application/pdf</str></arr><arr name="creator"><str>Impress</str></arr></lst>
</response>

The response contains some metadata (how long the processing took), the content of the file as well as the metadata that was extracted from the document.

If you pass the wt parameter and set it to json, the response is wrapped in a JSON structure:

curl -F "file=@slides.pdf" "localhost:8983/solr/update/extract?extractOnly=true&extractFormat=text&wt=json"             
{"responseHeader":{"status":0,"QTime":217},"slides.pdf":"\n\n\n\n\n\n\n\n\n\n\n\nSolr Vortrag\n\n   \n\nEinfach mehr finden mit\n\nFlorian Hopf\n29.09.2010\n\n\n   \n\nSolr?\n\n\n   \n\nSolr?\n\nServer­ization of Lucene\n\n\n   \n\nApache Lucene?\n\nSearch engine library\n\n\n   \n\nApache Lucene?\n\nSearch engine library\nTextbasierter Index\n\n\n   \n\nApache Lucene?\n\nSearch engine library\nTextbasierter Index\nText Analyzer\n\n\n   \n\nApache Lucene?\n\nSearch engine library\nTextbasierter Index\nText Analyzer\nQuery Syntax \n\n\n   \n\nApache Lucene?\n\nSearch engine library\nTextbasierter Index\nText Analyzer\nQuery Syntax \nScoring\n\n\n   \n\nFeatures\n\nHTTP­Schnittstelle\n\n\n   \n\nArchitektur\n\nClient SolrWebapp Lucene\nhttp\n\nKommunikation über XML, JSON, JavaBin, Ruby, ...\n\n\n   \n\nFeatures\n\nHTTP­Schnittstelle\nXML­basierte Konfiguration\n\n\n   \n\nFeatures\n\nHTTP­Schnittstelle\nXML­basierte Konfiguration\nFacettierung\n\n\n   \n\nFeatures\n\nHTTP­Schnittstelle\nXML­basierte Konfiguration\nFacettierung\nSammlung nützlicher Lucene­Module/Dismax\n\n\n   \n\nFeatures\n\nHTTP­Schnittstelle\nXML­basierte Konfiguration\nFacettierung\nSammlung nützlicher Lucene­Module/Dismax\nJava­Client SolrJ\n\n\n   \n\nDemo\n\n\n   \n\nWas noch?\nAdmin­Interface\nCaching\nSkalierung\nSpellchecker\nMore­Like­This\nData Import Handler\nSolrCell\n\n\n   \n\nRessourcen\nhttp://lucene.apache.org/solr/\n\n\n\n","slides.pdf_metadata":["xmpTPg:NPages",["17"],"Creation-Date",["2010-11-20T09:47:28Z"],"title",["Solr Vortrag"],"stream_source_info",["file"],"created",["Sat Nov 20 10:47:28 CET 2010"],"stream_content_type",["application/octet-stream"],"stream_size",["425327"],"producer",["OpenOffice.org 2.4"],"stream_name",["slides.pdf"],"Content-Type",["application/pdf"],"creator",["Impress"]]}

There are quite a few ResponseWriters available for different languages, e.g. for Ruby. You can have a look at them at the bottom of this page: http://wiki.apache.org/solr/QueryResponseWriter

Monday, May 7, 2012

Importing Atom feeds in Solr using the Data Import Handler

I am working on a search solution that makes some of the content I am producing available through one search interface. One of the content stores is the blog you are reading right now, which among other options makes the content available here using Atom.

Solr, my search server of choice, provides the Data Import Handler that can be used to import data on a regular basis from sources like databases via JDBC or remote XML sources, like Atom.

Data Import Handler used to be a core part of Solr but starting from 3.1 it is shipped as a separate jar and not included in the standard war anymore. I am using Maven with overlays for development so I have to add a dependency for it:

<dependencies>
  <dependency>
    <groupId>org.apache.solr</groupId>
    <artifactId>solr</artifactId>
    <version>3.6.0</version>
    <type>war</type>
  </dependency>
  <dependency>
    <groupId>org.apache.solr</groupId>
    <artifactId>solr-dataimporthandler</artifactId>
    <version>3.6.0</version>
    <type>jar</type>
  </dependency>
</dependencies>

To enable the data import handler you have to add a request handler to your solrconfig.xml. Request handlers are registered for a certain url and, as the name suggests, are responsible for handling incoming requests:

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>

The file data-config.xml that is referenced here contains the mapping logic as well as the endpoint to access:

<?xml version="1.0" encoding="UTF-8" ?>
<dataConfig>
    <dataSource type="URLDataSource" encoding="UTF-8" connectionTimeout="5000" readTimeout="10000"/>
    <document>
        <entity name="blog"
                pk="url"
                url="http://fhopf.blogspot.com/feeds/posts/default?max-results=100"
                processor="XPathEntityProcessor"
                forEach="/feed/entry" transformer="DateFormatTransformer,HTMLStripTransformer,TemplateTransformer">
            <field column="title" xpath="/feed/entry/title"/>
            <field column="url" xpath="/feed/entry/link[@rel='alternate']/@href"/>
            <!-- 2012-03-07T21:35:51.229-08:00 -->
            <field column="last_modified" xpath="/feed/entry/updated" 
                dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss.SSS" locale="en"/>
            <field column="text" xpath="/feed/entry/content" stripHTML="true"/>
            <field column="category" xpath="/feed/entry/category/@term"/>
            <field column="type" template="blog"/> 
        </entity>
    </document>
</dataConfig>

First we configure which datasource to use. This is where you alternatively would use another implementation when fetching documents from a database.

The document element describes the fields that will be stored in the index. The attributes of the entity element determine where and how to fetch the data, most importantly the url and the processor. forEach contains an XPath expression to identify the elements we'd like to loop over. The transformer attribute is used to specify some classes that are then available when mapping the remote XML to the Solr fields.

The field elements contain the mapping between the Atom document and the Solr index fields. The column attribute determines the name of the index field, xpath determines the node to use in the remote XML document. You can use advanced XPath options like selecting an attribute of an element that is identified by another attribute. E.g. /feed/entry/link[@rel='alternate']/@href points to the href attribute of the link element that contains an alternative representation of a blog post entry:

<feed ...> 
  ...
  <entry> 
    ...
    <link rel='alternate' type='text/html' href='http://fhopf.blogspot.com/2012/03/testing-akka-actors-from-java.html' title='Testing Akka actors from Java'/>
    ...
  </entry>
...
</feed>

For the column last_modified we are transforming the remote date format to the internal Solr representation using the DateFormatTransformer. I am not sure yet if this is the correct solution as it seems to me I'm losing the timezone information. For the text field we first remove all HTML elements that are contained in the blog post using the HTMLStripTransformer. Finally, the type field contains a hardcoded value that is set using the TemplateTransformer.

To have everything in one place, let's see what the schema for our index looks like:

<field name="url" type="string" indexed="true" stored="true" required="true"/>
<field name="title" type="text_general" indexed="true" stored="true"/>
<field name="category" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="last_modified" type="date" indexed="true" stored="true"/>
<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
<field name="type" type="string" indexed="true" stored="false"/>

Finally, how can you trigger the data import? There is an option described in the Solr wiki, but a simple solution will probably be enough for you. I am using a shell script that is triggered by a cron job. These are the contents:

#!/bin/bash
curl localhost:8983/solr/dataimport?command=full-import

The Data Import Handler is really easy to set up and you can use it to import quite a lot of data sources into your index. If you need more advanced crawling features you might want to have a look at Apache ManifoldCF, a connector framework for plugging content repositories into search engines like Apache Solr.

Wednesday, March 7, 2012

Testing Akka actors from Java

If you're looking for a general introduction to using Akka from Java have a look at this post

In a recent project I've been using Akka for a concurrent producer-consumer setup. It is an actor framework for the JVM that is implemented in Scala but provides a Java API, so normally you don't notice that you're dealing with a Scala library.

Most of my business code is encapsulated in services that don't depend on Akka and can therefore be tested in isolation. But for some cases I've been looking for a way to test the behaviour of the actors. As I struggled with this for a while and didn't find a real howto on testing Akka actors from Java I hope my notes might be useful for other people as well.

The main problem when testing actors is that they are managed objects and you can't just instantiate them. Akka comes with a module for tests that is documented well for use from Scala. But besides the note that it's possible you don't find a lot of information on using it from Java.

When using Maven you need to make sure that you have the akka-testkit dependency in place:

<dependency>
    <groupId>com.typesafe.akka</groupId>
    <artifactId>akka-testkit</artifactId>
    <version>2.1-SNAPSHOT</version>
    <scope>test</scope>
</dependency>

I will show you how to implement a test for the actors that are introduced in the Akka java tutorial. It involves one actor that does a substep of calculating Pi for a certain start number and a given number of elements.

To test this actor we need a way to set it up. Akka-testkit provides a helper, TestActorRef, that can be used for this. Using Scala this seems to be rather simple:

val testActor = TestActorRef[Worker]

If you try to do this from Java you will notice that you can't use a similar call. I have to admit that I am not quite sure yet what is going on. I would have expected that there is an apply() method on the TestActorRef companion object that uses some kind of implicits to instantiate the Worker object. But when inspecting the sources the thing that comes closest to it is this definition:

def apply[T <: Actor](factory: ⇒ T)(implicit system: ActorSystem)

No sign of an implicit for the factory. Something I still have to investigate further.

To use it from Java you can use the apply method that takes a reference to a Function0 and an actor system. The actor system can be set up easily using

actorSystem = ActorSystem.apply();

The apply() method is very important in Scala as it's kind of the default method for objects. For example myList(1) internally uses myList.apply(1).

If you're like me and expect that Function0 is a single method interface you will be surprised. It contains a lot of strange looking methods that you really don't want to have cluttering your test code:

TestActorRef workerRef = TestActorRef.apply(new Function0() {

    @Override
    public Worker apply() {
        throw new UnsupportedOperationException("Not supported yet.");
    }

    @Override
    public void apply$mcV$sp() {
        throw new UnsupportedOperationException("Not supported yet.");
    }

    @Override
    public boolean apply$mcZ$sp() {
        throw new UnsupportedOperationException("Not supported yet.");
    }

    @Override
    public byte apply$mcB$sp() {
        throw new UnsupportedOperationException("Not supported yet.");
    }

    @Override
    public short apply$mcS$sp() {
        throw new UnsupportedOperationException("Not supported yet.");
    }

    @Override
    public char apply$mcC$sp() {
        throw new UnsupportedOperationException("Not supported yet.");
    }

    @Override
    public int apply$mcI$sp() {
        throw new UnsupportedOperationException("Not supported yet.");
    }

    @Override
    public long apply$mcJ$sp() {
        throw new UnsupportedOperationException("Not supported yet.");
    }

    @Override
    public float apply$mcF$sp() {
        throw new UnsupportedOperationException("Not supported yet.");
    }

    @Override
    public double apply$mcD$sp() {
        throw new UnsupportedOperationException("Not supported yet.");
    }
}, actorSystem);

The only method we really are interested in is the normal apply method. Where do those other methods come from? There is no obvious hint in the scaladocs.

During searching for the solution I found a mailing list thread that explains some of the magic. The methods are performance optimizations for boxing and unboxing that are automatically generated by the scala compiler for the @specialized annotation. Still, I am unsure about why this is happening exactly. According to this presentation I would have expected that I am using the specialized instance for Object, maybe that is something special regarding traits?

Fortunately we don't really need to implement the interface ourselves: there's an adapter class, AbstractFunction0, that makes your code look much nicer:

@Before
public void initActor() {
    actorSystem = ActorSystem.apply();
    actorRef = TestActorRef.apply(new AbstractFunction0() {

        @Override
        public Pi.Worker apply() {
            return new Pi.Worker();
        }

    }, actorSystem);
}

This is like I would have expected it to behave in the first place.

Now that we have set up our test we can use the TestActorRef to really test the actor. For example we can test that the actor doesn't do anything for a String message:

@Test
public void doNothingForString() {
    TestProbe testProbe = TestProbe.apply(actorSystem);
    actorRef.tell("Hello", testProbe.ref());

    testProbe.expectNoMsg(Duration.apply(100, TimeUnit.MILLISECONDS));
}

TestProbe is another helper that can be used to check the messages that are sent between cooperating actors. In this example we are checking that no message is passed to the sender for 100 milliseconds, which should be enough for execution.

Let's test some real functionality: send a message to the actor and check that the result message is sent:

@Test
public void calculatePiFor0() {
    TestProbe testProbe = TestProbe.apply(actorSystem);
    Pi.Work work = new Pi.Work(0, 0);
    actorRef.tell(work, testProbe.ref());

    testProbe.expectMsgClass(Pi.Result.class);
    TestActor.Message message = testProbe.lastMessage();
    Pi.Result resultMsg = (Pi.Result) message.msg();
    assertEquals(0.0, resultMsg.getValue(), 0.0000000001);
}

Now we use the TestProbe to block until a message arrives. Once it's there we can have a look at it using lastMessage().

You can look at the rest of the test on GitHub. Comments are more than welcome as I am pretty new to Scala as well as Akka.

Update

As Jonas Bonér points out I've been using the Scala API. Using the Props class the setup is easier:

@Before
public void initActor() {
    actorSystem = ActorSystem.apply();
    actorRef = TestActorRef.apply(new Props(Pi.Worker.class), actorSystem);
}

Sonntag, 19. Februar 2012

Legacy Code Retreat

Yesterday I attended the first German Legacy Code Retreat in Bretten. The event was organized by Softwerkskammer, the German software craftsmanship community.

A legacy code retreat doesn't work like a regular code retreat, where you implement the same functionality again and again. Instead it starts with some really flawed code, and the participants apply different refactoring steps to make it more testable and maintainable. There are six iterations of 45 minutes, each with a different task or aim. For each iteration you work with a different partner, and after a short retrospective with all participants you usually start again from the original code.

The GitHub repository for the legacy code contains the code in several languages, among them Java, C++, C# and Ruby.

Iteration 1

The first iteration was used to get to know the functionality of the code. There were no real rules so the participants were free to explore the code in any way they liked.

I paired with Heiko Seebach, whom I already knew to be a Ruby guy. We looked at the code with a plain text editor, which already felt quite unfamiliar compared to standard Java IDE work. I know enough Ruby to understand code when I see it, so this was no problem. For quite some time we tried to understand a certain behaviour that occurred when running the code; it turned out to be a bug in the Ruby version. Next we tried to set up RSpec and get started with some tests.

During this iteration I didn't learn that much about the legacy code but more about some Ruby stuff.

Iteration 2

The goal of the second iteration was to prepare a golden master test that could be used during all of the following iterations. The original legacy code is driven by random input (in the Java version using java.util.Random) and writes all of its state to System.out. The task was to capture the output for a certain input sequence and write it to a file. That file can then automatically be compared to the output of a modified version: if both outputs are the same, there are likely no regressions in the code.
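Just to illustrate the idea, here is a minimal sketch of what such a golden master test could look like. The entry point GameRunner.run(Random) and the file name golden-master.txt are assumptions for illustration only; the actual retreat code is structured differently:

import static org.junit.Assert.assertEquals;

import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Random;

import org.junit.Test;

public class GoldenMasterTest {

    @Test
    public void outputMatchesGoldenMaster() throws Exception {
        // capture everything the legacy code prints to System.out
        PrintStream originalOut = System.out;
        ByteArrayOutputStream captured = new ByteArrayOutputStream();
        System.setOut(new PrintStream(captured, true, "UTF-8"));
        try {
            // hypothetical entry point that takes the random source as a parameter
            GameRunner.run(new Random(42));
        } finally {
            System.setOut(originalOut);
        }

        // compare against the recorded output of the original, unmodified code
        String expected = new String(
                Files.readAllBytes(Paths.get("golden-master.txt")), "UTF-8");
        assertEquals(expected, captured.toString("UTF-8"));
    }
}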

I paired with another Java guy and we worked on my machine in Netbeans. I noticed how unfamiliar I am with the standard Netbeans project setup, as I use Maven most of the time. We wrote the test and started some refactorings, all in all a quite productive iteration. Things I learned: java.util.Random derives its numbers entirely from the seed, so if you use the same seed again you always get the same sequence (see the snippet below). Also, when doing file handling in plain Java I really miss commons-io.
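This determinism is easy to see in a small test (just a sketch, not part of the retreat code):

@Test
public void sameSeedProducesTheSameSequence() {
    Random first = new Random(42);
    Random second = new Random(42);
    for (int i = 0; i < 5; i++) {
        // identically seeded generators yield the same value on every call
        assertEquals(first.nextInt(6), second.nextInt(6));
    }
}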

Iteration 3

In iteration 3 we were supposed to use an antipattern for testing: Subclass to Test. You take the original class and override some methods that are called from the method under test.
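A minimal sketch of the idea, using hypothetical names (a Game class with a currentCategory() method) rather than the actual retreat code:

import static org.junit.Assert.assertEquals;

import org.junit.Test;

// hypothetical legacy class; in real legacy code currentCategory() would depend on internal state
class Game {

    protected String currentCategory() {
        return "Pop";
    }

    public String askQuestion() {
        return "The category is " + currentCategory();
    }
}

public class GameTest {

    @Test
    public void usesTheOverriddenCategory() {
        // Subclass to Test: override the collaborating method to control its result
        Game game = new Game() {
            @Override
            protected String currentCategory() {
                return "Science";
            }
        };
        assertEquals("The category is Science", game.askQuestion());
    }
}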

It turned out that the original code is not well suited for this approach. There are only a few methods that really rely on other methods; most of them access the state via the fields directly. My partner and I therefore didn't really override the methods but instead used an instance initializer block to prepare the state for the method calls. This is similar to an approach for Map initialization that I only started to apply recently:

Map<String, String> data = new HashMap<String, String>() {
    {
        put("key", "value");
    }
};

The approach worked quite well for the given code, but it's probably true that such tests won't stay maintainable.

Iteration 4

Iteration 4 built on the previous one: all the methods that had been overridden for testing should be moved to delegates and passed into the original class using a dependency injection approach.
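Continuing the hypothetical Game example from above, the refactoring roughly looks like this: the overridden method moves into a small collaborator interface that is handed to the class in its constructor, so a test can inject a stub instead of subclassing (again only a sketch, not the actual retreat code):

import static org.junit.Assert.assertEquals;

import org.junit.Test;

// the extracted delegate
interface CategorySource {
    String currentCategory();
}

// the legacy class now asks the injected delegate instead of computing the value itself
class Game {

    private final CategorySource categorySource;

    Game(CategorySource categorySource) {
        this.categorySource = categorySource;
    }

    public String askQuestion() {
        return "The category is " + categorySource.currentCategory();
    }
}

public class GameTest {

    @Test
    public void usesTheInjectedCategory() {
        Game game = new Game(new CategorySource() {
            @Override
            public String currentCategory() {
                return "Science";
            }
        });
        assertEquals("The category is Science", game.askQuestion());
    }
}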

I paired with a C++ guy who does embedded work in C++ in his day job. It turned out that we had quite different opinions and experiences. He was very focused on performance and couldn't understand why you would want to move methods to another class just to delegate to them, as the extra method call introduces overhead.

I haven't done any C++ programming since university. Eclipse seems well suited for C++ development, but compared to the Java tooling it still seems to lack a lot of convenience functionality.

Iteration 5


In iteration 5 I paired with Tilman, a Clean Code aficionado whom I already knew from our local Java User Group. We were supposed to change as many methods as possible into real functions that don't work on fields but only on parameter values.

A lot of people struggled with this approach at first. But it turns out that once you have done it, you are in a really good position to do further refactorings more easily.
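Sticking with hypothetical names, the transformation looks roughly like this: instead of reading and writing fields, the method receives everything it needs as parameters and returns its result, which makes it trivial to test in isolation:

// before: the method depends on the places and currentPlayer fields
private void movePlayer(int roll) {
    places[currentPlayer] = (places[currentPlayer] + roll) % 12;
}

// after: a real function that only works on its parameters
static int newPlace(int currentPlace, int roll, int boardSize) {
    return (currentPlace + roll) % boardSize;
}

The caller then becomes places[currentPlayer] = newPlace(places[currentPlayer], roll, 12); and the function itself can be tested without setting up any object state.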

My partner did most of the coding with some input from me. We took some directions I wouldn't have taken by myself, but the resulting code was really well structured and could be reduced in size. We also worked with an interesting Eclipse plugin I had seen before: Infinitest runs the tests continuously in the background, so there is no need to trigger them manually. I have to check whether something like this is available for Netbeans as well.

Iteration 6

To be honest, I don't know what the goal of the sixth iteration really was. I paired with a developer who was still fighting with failing tests from the previous iteration, and we spent most of the time trying to get them running again. In the last few minutes we managed to extract some classes and clean up some code.

Conclusion

The first German Legacy Code Retreat really was a great experience. I learned a lot and, probably even more importantly, had a lot of fun.

The food and the location were both excellent. Thanks to the organizers, Nicole and Andreas, as well as the sponsors for making it possible. It's great to be able to attend a high-quality event completely free of charge.

Dienstag, 10. Januar 2012

Running my Tests again

For some time I've been bugged by a Netbeans problem that I couldn't find a solution to. When running a unit test from within Netbeans it happened from time to time that the tests just failed: they seemed to be executed in an old state, as if parts of the project hadn't been recompiled, and running them again didn't help either. When I executed the tests from a command line Maven build there were never any problems, and afterwards the tests could be run from Netbeans again. The problem occurred only infrequently, but it was really annoying nevertheless. I stopped running the tests from Netbeans altogether and only used Maven. That is not a good solution either, as you either run all tests or have to edit the command line all the time to run only a single test.

Recently I noticed what caused the problem: Netbeans has its Compile on Save feature enabled for tests. This means it uses its internal incremental compilation, which doesn't seem to work reliably for at least some project setups.


You can disable it in the project properties on the Build/Compile node. I haven't seen the problem since disabling it, and being able to run the tests from the IDE again saves me a lot of time.