Friday, June 29, 2012

Dropwizard Encoding Woes

I have been working on an example application for Lucene and Solr for my upcoming talk at the Java User Group Karlsruhe. As a web framework I wanted to try Dropwizard, a lightweight application framework that can expose resources via JAX-RS, provides out-of-the-box monitoring support and can render resource representations using Freemarker. It's really easy to get started: there is a good tutorial and a manual.

An example resource might look like this:

import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;

@Path("/example")
@Produces(MediaType.TEXT_HTML)
public class ExampleResource {

    @GET
    public ExampleView illustrate() {
        return new ExampleView("Mot\u00f6rhead");
    }

}

The resource produces HTML using Freemarker, which works once you add the view bundle to the service. There is one method that is called when the resource is requested using GET. Inside the method we create a view object with a message that in this case contains the umlaut 'ö'. The view class that is returned by the method looks like this:

import com.yammer.dropwizard.views.View;

public class ExampleView extends View {

    private final String message;

    public ExampleView(String message) {
        super("example.fmt");
        this.message = message;
    }

    public String getMessage() {
        return message;
    }
}

It accepts a message as a constructor parameter and passes the template name to the parent class. The view object is then available in the Freemarker template. A simple template looks like this:

<html>
    <body>
        <h1>${message} rocks!</h1>
    </body>
</html>

If I run this on my machine and access it with Firefox, it doesn't work as expected. The umlaut character is broken, something Lemmy surely wouldn't approve of.

Accessing the resource using curl works flawlessly:

curl http://localhost:8080/example
<html>
    <body>
        <h1>Motörhead rocks!</h1>
    </body>
</html>

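Why is that? It's Servlet Programming 101: you need to set the character encoding of the response. The bytes on the wire are UTF-8 in both cases, but the response doesn't say so. My Firefox falls back to ISO-8859-1, while curl doesn't decode anything and simply hands the bytes to my terminal, which evidently interprets them as UTF-8. How can we fix it? Tell the client which encoding we are using, which can be done using the Produces annotation: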

@Produces("text/html; charset=utf-8")

So what does it have to do with Dropwizard? Nothing really, it's a JAX-RS thing. All components in Dropwizard (Jetty and Freemarker notably) are using UTF-8 by default.
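
The broken umlaut can be reproduced without a browser. The following is a minimal sketch using only the JDK; the class and method names are made up for this illustration. It encodes the message as UTF-8, as the server does, and then decodes the bytes with the charset a client assumes:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetMismatchDemo {

    // Encodes the text as UTF-8 (what the server sends) and decodes the
    // bytes with the charset the client assumes.
    static String decodeUtf8BytesAs(String text, Charset assumedByClient) {
        byte[] bytesOnTheWire = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytesOnTheWire, assumedByClient);
    }

    public static void main(String[] args) {
        String message = "Mot\u00f6rhead";
        // Client guesses ISO-8859-1: the two UTF-8 bytes of the umlaut
        // (0xC3 0xB6) turn into two separate characters.
        System.out.println(decodeUtf8BytesAs(message, StandardCharsets.ISO_8859_1));
        // Client is told the real charset: the text survives the round trip.
        System.out.println(decodeUtf8BytesAs(message, StandardCharsets.UTF_8));
    }
}
```

With ISO-8859-1 the result is the string "Mot\u00c3\u00b6rhead", which is the kind of garbage a browser shows when the response carries no charset.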

Wednesday, June 20, 2012

Running and Testing Solr with Gradle

A while ago I blogged on testing Solr with Maven on the synyx blog. In this post I will show you how to set up a similar project with Gradle that can start the Solr webapp and execute tests against your configuration.

Running Solr

Solr runs as a webapp in any Java EE servlet container like Tomcat or Jetty. The index and search configuration resides in a directory commonly referred to as Solr home, which can be outside of the webapp directory. This is also the place where the Lucene index files are created. The location of Solr home can be set using the Java system property solr.solr.home.

The Solr war file is available in Maven Central. This post describes how to run a war file that is deployed in a Maven repository using Gradle. Let's see what the Gradle build file for running Solr looks like:

import org.gradle.api.plugins.jetty.JettyRunWar

apply plugin: 'java'
apply plugin: 'jetty'

repositories {
    mavenCentral()
}

// custom configuration for running the webapp
configurations {
    solrWebApp
}

dependencies {
    solrWebApp "org.apache.solr:solr:3.6.0@war"
}

// custom task that configures the jetty plugin
task runSolr(type: JettyRunWar) {
    webApp = configurations.solrWebApp.singleFile

    // jetty configuration
    httpPort = 8082
    contextPath = 'solr'
}

// executed before jetty starts
runSolr.doFirst {
    System.setProperty("solr.solr.home", "./solrhome")
}

We are creating a custom configuration that contains the Solr war file. In the task runSolr we configure the Jetty plugin. To set the Solr home system property we can use the approach described by Sebastian Himberger: we add a code block that is executed before Jetty starts and sets the property using standard Java mechanisms. You can now start Solr using gradle runSolr. You will see some errors regarding multiple versions of slf4j that are very likely caused by this bug.
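
Setting the property works because the webapp runs in the same JVM and can read it back. The following is a simplified sketch of such a lookup in plain Java. It is not Solr's actual code: the class, the method and the fallback directory are assumptions for illustration only. It also shows that a relative value like ./solrhome is resolved against the working directory of the process:

```java
import java.io.File;

public class SolrHomeLookup {

    static final String SOLR_HOME_PROPERTY = "solr.solr.home";

    // Reads the system property and falls back to the given directory
    // (a hypothetical default) when the property is not set.
    static File resolveSolrHome(String fallbackDirectory) {
        String configured = System.getProperty(SOLR_HOME_PROPERTY);
        String location = (configured != null) ? configured : fallbackDirectory;
        // Relative paths are resolved against the current working directory.
        return new File(location).getAbsoluteFile();
    }

    public static void main(String[] args) {
        // Same call as in the doFirst block of the build file.
        System.setProperty(SOLR_HOME_PROPERTY, "./solrhome");
        System.out.println(resolveSolrHome("solr"));
    }
}
```

If Solr cannot find your configuration, printing the absolute path like this is a quick way to check which directory a relative value points to.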

Testing the Solr configuration

Solr provides some classes that start an embedded instance using your configuration. You can use these classes in any setup as they do not depend on the gradle jetty plugin. Starting with Solr 3.2 the test framework is not included in solr-core anymore. This is what the relevant part of the dependency section looks like now:

testCompile "junit:junit:4.10"
testCompile "org.apache.solr:solr-test-framework:3.6.0"

Now you can place a test in src/test/java that either uses the convenience methods provided by SolrTestCaseJ4 or instantiates an EmbeddedSolrServer and executes any SolrJ actions. Both approaches use your custom config, so you can easily validate that configuration changes don't break existing functionality. An example of using the convenience methods:

import org.apache.solr.SolrTestCaseJ4;
import org.apache.solr.client.solrj.SolrServerException;
import org.junit.BeforeClass;
import org.junit.Test;
import java.io.IOException;

public class BasicConfigurationTest extends SolrTestCaseJ4 {

    @BeforeClass
    public static void initCore() throws Exception {
        SolrTestCaseJ4.initCore("solrhome/conf/solrconfig.xml", "solrhome/conf/schema.xml", "solrhome/");
    }

    @Test
    public void noResultInEmptyIndex() throws SolrServerException {
        assertQ("test query on empty index",
                req("text that is not found")
                , "//result[@numFound='0']"
        );
    }

    @Test
    public void pathIsMandatory() throws SolrServerException, IOException {
        assertFailedU(adoc("title", "the title"));
    }

    @Test
    public void simpleDocumentIsIndexedAndFound() throws SolrServerException, IOException {
        assertU(adoc("path", "/tmp/foo", "content", "Some important content."));
        assertU(commit());

        assertQ("added document found",
                req("important")
                , "//result[@numFound='1']"
        );
    }

}

We extend the class SolrTestCaseJ4, which is responsible for creating the core and instantiating the runtime using the paths we pass to the method initCore(). Using the available assert methods you can execute queries and validate the result using XPath expressions.
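
To get a feeling for what such an XPath assertion checks, here is a small sketch that uses only the JDK's XPath support, without Solr. The XML is a hypothetical, heavily shortened response, and the class and method names are made up for this example:

```java
import java.io.StringReader;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class XPathResponseCheck {

    // Hypothetical, shortened response for a query that matched one document.
    static final String SAMPLE_RESPONSE =
            "<response>"
          + "<result name=\"response\" numFound=\"1\" start=\"0\">"
          + "<doc><str name=\"path\">/tmp/foo</str></doc>"
          + "</result>"
          + "</response>";

    // Returns true if the expression selects at least one node in the XML.
    static boolean matches(String xml, String expression) {
        try {
            Object result = XPathFactory.newInstance().newXPath().evaluate(
                    expression, new InputSource(new StringReader(xml)), XPathConstants.BOOLEAN);
            return (Boolean) result;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(matches(SAMPLE_RESPONSE, "//result[@numFound='1']")); // true
        System.out.println(matches(SAMPLE_RESPONSE, "//result[@numFound='0']")); // false
    }
}
```

The expressions are the same ones that are passed to assertQ in the test above.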

An example that instantiates a SolrServer might look like this:

import org.apache.solr.SolrTestCaseJ4;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.common.params.SolrParams;
import org.junit.After;
import org.junit.Before;
import org.junit.BeforeClass;
import org.junit.Test;

import java.io.IOException;

public class ServerBasedTalkTest extends SolrTestCaseJ4 {

    private EmbeddedSolrServer server;

    @BeforeClass
    public static void initCore() throws Exception {
        SolrTestCaseJ4.initCore("solr/conf/solrconfig.xml", "solr/conf/schema.xml");
    }

    @Before
    public void initServer() {
        server = new EmbeddedSolrServer(h.getCoreContainer(), h.getCore().getName());
    }

    @Test
    public void queryOnEmptyIndexNoResults() throws SolrServerException {
        QueryResponse response = server.query(new SolrQuery("text that is not found"));
        assertTrue(response.getResults().isEmpty());
    }

    @Test
    public void singleDocumentIsFound() throws IOException, SolrServerException {
        SolrInputDocument document = new SolrInputDocument();
        document.addField("path", "/tmp/foo");
        document.addField("content", "Mein Hut der hat 4 Ecken");

        server.add(document);
        server.commit();

        SolrParams params = new SolrQuery("ecke");
        QueryResponse response = server.query(params);
        assertEquals(1L, response.getResults().getNumFound());
        assertEquals("/tmp/foo", response.getResults().get(0).get("path"));
    }

    @After
    public void clearIndex() {
        super.clearIndex();
    }
}

The tests can now be executed using gradle test.

Testing your Solr configuration is important as changes in one place can easily cause side effects in other search functionality. I recommend adding tests even for basic functionality and evolving the tests with your project.

Saturday, June 16, 2012

Reading term values for fields from a Lucene Index

Sometimes when using Lucene you might want to retrieve all term values for a given field. Think of categories that you want to display as search links or in a filtering dropdown box. Indexing might look something like this:

IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36));
IndexWriter writer = new IndexWriter(directory, config);

Document doc = new Document();

doc.add(new Field("Category", "Category1", Field.Store.NO, Field.Index.NOT_ANALYZED));
doc.add(new Field("Category", "Category2", Field.Store.NO, Field.Index.NOT_ANALYZED));
doc.add(new Field("Author", "Florian Hopf", Field.Store.NO, Field.Index.NOT_ANALYZED));
writer.addDocument(doc);

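// start a fresh document, otherwise the fields of the first one are indexed again
doc = new Document();
doc.add(new Field("Category", "Category3", Field.Store.NO, Field.Index.NOT_ANALYZED));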
doc.add(new Field("Category", "Category2", Field.Store.NO, Field.Index.NOT_ANALYZED));
doc.add(new Field("Author", "Theo Tester", Field.Store.NO, Field.Index.NOT_ANALYZED));
writer.addDocument(doc);

writer.close();

We are adding two documents, one that is assigned Category1 and Category2 and one that is assigned Category2 and Category3. Note that we are adding both fields unanalyzed so the Strings are added to the index as they are. Lucene's index looks something like this afterwards:

Field     Term          Documents
Author    Florian Hopf  1
          Theo Tester   2
Category  Category1     1
          Category2     1, 2
          Category3     2

The terms are sorted alphabetically by field name first and then by term value. You can access the values using the IndexReader's terms() method, which returns a TermEnum. You can instruct the IndexReader to start with a certain term so you can jump directly to the category without having to iterate over all values. But before we do this let's look at how we usually access Enumeration values in Java:

Enumeration en = ...;
while(en.hasMoreElements()) {
    Object obj = en.nextElement();
    ...
}

In a while loop we check whether there is another element and retrieve it inside the loop. As this pattern is very common, you might end up with something like this when iterating the terms with Lucene (note that all the examples here are missing the stop condition: if there are more fields, the terms of those fields will also be iterated):

TermEnum terms = reader.terms(new Term("Category"));
// this code is broken, don't use
while(terms.next()) {
    Term term = terms.term();
    System.out.println(term.text());
}

The next() method moves to the next element and returns a boolean that tells whether there is one. The term() method can then be used to retrieve the Term. But this doesn't work as expected: the code only finds Category2 and Category3 but skips Category1. Why is that? The Lucene TermEnum works differently from what we are used to from Java Enumerations. When the TermEnum is returned it already points to the first element, so with next() we skip this first element.

This snippet instead works correctly using a for loop:

TermEnum terms = reader.terms(new Term("Category"));
for(Term term = terms.term(); term != null; terms.next(), term = terms.term()) {
    System.out.println(term.text());
}

Or you can use a do while loop with a check for the first element:

TermEnum terms = reader.terms(new Term("Category"));
if (terms.term() != null) {
    do {
        Term term = terms.term();
        System.out.println(term.text());
    } while(terms.next());
}

You can't really blame Lucene for this as the methods are aptly named. It's our habits that lead to minor errors like this.
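
Both pitfalls, the skipped first term and the missing stop condition, can be tried out without an index. The following sketch is a plain-Java model and not Lucene's API: Entry and Cursor are made-up stand-ins for Term and TermEnum, and the Title field is a hypothetical extra field that sorts after Category.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TermCursorModel {

    // One entry of the modelled term dictionary: field name and term text.
    static final class Entry {
        final String field;
        final String text;

        Entry(String field, String text) {
            this.field = field;
            this.text = text;
        }
    }

    // Stand-in for TermEnum: right after creation it already points at the
    // first entry whose field is not smaller than the requested one.
    static final class Cursor {
        private final List<Entry> sortedEntries;
        private int position;

        Cursor(List<Entry> sortedEntries, String field) {
            this.sortedEntries = sortedEntries;
            while (position < sortedEntries.size()
                    && sortedEntries.get(position).field.compareTo(field) < 0) {
                position++;
            }
        }

        // Current entry, or null once the cursor has moved past the end.
        Entry term() {
            return position < sortedEntries.size() ? sortedEntries.get(position) : null;
        }

        // Advances the cursor and reports whether it still points at an entry.
        boolean next() {
            position++;
            return position < sortedEntries.size();
        }
    }

    // Enumeration-style loop: advances before reading, so the first term is
    // lost, and it runs on into the following fields.
    static List<String> brokenRead(List<Entry> entries, String field) {
        Cursor cursor = new Cursor(entries, field);
        List<String> texts = new ArrayList<String>();
        while (cursor.next()) {
            texts.add(cursor.term().text);
        }
        return texts;
    }

    // Reads the current term first and stops as soon as the field changes.
    static List<String> correctRead(List<Entry> entries, String field) {
        Cursor cursor = new Cursor(entries, field);
        List<String> texts = new ArrayList<String>();
        Entry current = cursor.term();
        while (current != null && current.field.equals(field)) {
            texts.add(current.text);
            cursor.next();
            current = cursor.term();
        }
        return texts;
    }

    // Entries sorted by field first and then by term text.
    static List<Entry> sampleDictionary() {
        return Arrays.asList(
                new Entry("Author", "Florian Hopf"),
                new Entry("Author", "Theo Tester"),
                new Entry("Category", "Category1"),
                new Entry("Category", "Category2"),
                new Entry("Category", "Category3"),
                new Entry("Title", "Some title")); // hypothetical extra field
    }

    public static void main(String[] args) {
        // prints [Category2, Category3, Some title]
        System.out.println(brokenRead(sampleDictionary(), "Category"));
        // prints [Category1, Category2, Category3]
        System.out.println(correctRead(sampleDictionary(), "Category"));
    }
}
```

In real Lucene code the stop condition would be the same kind of check on the field name of the current Term inside the loop.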

Wednesday, June 6, 2012

Berlin Buzzwords 2012

Berlin Buzzwords is an annual conference on search, store and scale technology. I've heard good things about it before and finally got convinced to go there this year. The conference itself lasts for two days but there are additional events before and afterwards so if you like you can spend a whole week.

The Barcamp

As I had to travel on Sunday anyway I took an earlier train to attend the barcamp in the early evening. It started with a short introduction of the concepts and the scheduling. Participants could suggest topics that they either would be willing to introduce themselves or were just interested in. There were three rooms prepared, a larger one and two smaller ones.

Among others I attended sessions on HBase, designing hybrid applications, Apache Tika and Apache Jackrabbit Oak.

HBase is a distributed database built on top of the Hadoop filesystem. It seems to be used more often than I would have expected. It was interesting to hear about the problems and solutions of other people.

The next session on hybrid relational and NoSQL applications stayed rather high level. I liked the remark by one guy that Solr, the underdog of NoSQL, often is the first application where people are ok with dismissing some guarantees regarding their data. Adding NoSQL should be exactly like this.

I only recently started to use Tika directly so it was really interesting to see where the project is heading in the future. I was surprised to hear that there now also is a TikaServer that can do things similar to those I described for Solr. That's something I want to try in action.

Jackrabbit Oak is a next generation content repository that is mostly driven by the Day team of Adobe. Some of the ideas sound really interesting but I got the feeling that it still can take some time until this really can be used. Jukka Zitting also gave a lightning talk on this topic at the conference, the slides are available here.

The atmosphere in the sessions was really relaxed so even though I expected to only listen I took the chance to participate and ask some questions. This probably is the part that makes a barcamp as effective as it is. As you are constantly participating you stay really concentrated on the topic.

Day 1

The first day started with a great keynote by Leslie Hawthorn on building and maintaining communities. She compared a lot of the aspects of community work with gardening and introduced OpenMRS, a successful project building a medical record platform. Though I currently am not actively involved in an open source project I could relate to a lot of the situations she described. All in all an inspiring start of the main conference.

Next I attended a talk on building hybrid applications with MongoDB. Nothing new for me but I am glad that a lot of people now recommend splitting monolithic applications into smaller services. This also is a way to experiment with different languages and techniques without having to migrate large parts of an application.

A JCR view of the world provided some examples on how to model different structures using a content tree. Though really introductory it was interesting to see what kind of applications can be built using a content repository. I also liked the attitude of the speaker: The presentation was delivered using Apache Sling which uses JCR under the hood.

Probably the highlight of the first day was the talk by Grant Ingersoll on Large Scale Search, Discovery and Analytics. He introduced all the parts that make up larger search systems and showed the open source tools he uses. To increase the relevance of the search results you have to integrate solutions that adapt to the behaviour of the users. That's probably one of the big takeaways of the whole conference for me: Always collect data on your users' searches to have it available when you want to tune the relevance, either manually or through some learning techniques. The slides of the talk are worth looking at.

The rest of the day I attended several talks on the internals of Lucene. Hardcore stuff, I would be lying if I said I understood everything but it was interesting nevertheless. I am glad that some really smart people are taking care that Lucene stays as fast and feature rich as it is.

The day ended with interesting discussions and some beer at the Buzz Party and BBQ.

Day 2

The first talk of the second day on Smart Autocompl... by Anne Veling was fantastic. Anne demonstrated a rather simple technique for doing semantic analysis of search queries for specialized autocompletion for the largest travel information system in the Netherlands. The query gets tokenized and then each field of the index (e.g. street or city) is queried for each of the tokens. This way you can already guess which might be good field matches.
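
To make the idea a bit more concrete, here is a rough sketch of how I understood it. This is not the implementation shown in the talk: the in-memory map stands in for the index, the prefix match stands in for a real query, and the field values are made-up sample data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class FieldGuessSketch {

    // For every token of the query, collects the fields that contain at
    // least one value starting with that token.
    static Map<String, List<String>> guessFields(String query, Map<String, List<String>> index) {
        Map<String, List<String>> guesses = new LinkedHashMap<String, List<String>>();
        for (String token : query.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // an empty query yields no tokens
            }
            List<String> matchingFields = new ArrayList<String>();
            for (Map.Entry<String, List<String>> field : index.entrySet()) {
                for (String value : field.getValue()) {
                    if (value.toLowerCase(Locale.ROOT).startsWith(token)) {
                        matchingFields.add(field.getKey());
                        break; // one hit per field is enough for a guess
                    }
                }
            }
            guesses.put(token, matchingFields);
        }
        return guesses;
    }

    public static void main(String[] args) {
        // Made-up sample data, one list of values per field.
        Map<String, List<String>> index = new LinkedHashMap<String, List<String>>();
        index.put("street", Arrays.asList("Damrak", "Utrechtsestraat"));
        index.put("city", Arrays.asList("Amsterdam", "Utrecht"));

        // prints {utrecht=[street, city], dam=[street]}
        System.out.println(guessFields("Utrecht dam", index));
    }
}
```

A token that only matches a single field is a strong hint for what the user means, while a token that matches several fields needs further ranking.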

Another talk introduced a scalable tool for preprocessing of documents, Hydra. It stores the documents as well as mapping data in a MongoDB instance and you can parallelize the processing steps. The concept sounds really interesting, I hope I can find time to have a closer look.

In the afternoon I attended several talks on Elasticsearch, the scalable search server. Interestingly a lot of people seem to use it more as a storage engine than for searching.

One of the tracks was cancelled, so Ted Dunning introduced new stuff in Mahout instead. He's a really funny speaker and though I am not deep into machine learning I was glad to hear that you are allowed to use and even contribute to Mahout even if you don't have a PhD.

In the last track of the day Alex Pinkin showed 10 problems and solutions that you might encounter when building a large app using Solr. Quite some useful advice.

The location

The event took place at Urania, a smaller conference center and theatre. Mostly it was well suited but some of the talks were so full that you either had to sit on the floor or weren't even able to enter the room. I understand that it is difficult to predict how many people attend a certain event but some talks probably should have been scheduled in different rooms.

The food was really good and though it first looked like the distribution was a bottleneck this worked pretty well.

The format

This year Berlin Buzzwords had a rather unusual format. Most of the talks were only 20 minutes long with some exceptions that were 40 minutes long. I have mixed feelings about this: On the one hand it was great to have a lot of different topics. On the other hand some of the concepts definitely would have needed more time to fully explain and grasp. Respect to all the speakers who had to think about what they would talk about in such a short timeframe.

Berlin Buzzwords is a fantastic conference and I will definitely go there again.
