Friday, December 7, 2012

Looking at a Plaintext Lucene Index

The Lucene file format is one of the reasons why Lucene is as fast as it is. An index consists of several binary files that you can't really inspect without tools like the fantastic Luke.

Starting with Lucene 4 the format for these files can be configured using the Codec API. Several implementations are provided with the release, among those the SimpleTextCodec that can be used to write the files in plaintext for learning and debugging purposes.

To configure the Codec you just set it on the IndexWriterConfig:

StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
// recreate the index on each execution
config.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
config.setCodec(new SimpleTextCodec());

The rest of the indexing process stays exactly the same as it used to be:

Directory luceneDir = FSDirectory.open(plaintextDir);
try (IndexWriter writer = new IndexWriter(luceneDir, config)) {
    writer.addDocument(Arrays.asList(
            new TextField("title", "The title of my first document", Store.YES),
            new TextField("content", "The content of the first document", Store.NO)));

    writer.addDocument(Arrays.asList(
            new TextField("title", "The title of the second document", Store.YES),
            new TextField("content", "And this is the content", Store.NO)));
}

After running this code the index directory contains several files. These are not the same types of files that the default codec creates.

ls /tmp/lucene-plaintext/
_1_0.len  _1_1.len  _1.fld  _1.inf  _1.pst  _1.si  segments_2  segments.gen

The segments_x file is the starting point (x depends on how many times you have written to the index before and starts at 1). This is still a binary file; it records the name of the Codec that was used to write each segment.

The rest of the index files are all plaintext. They are not organized exactly like their binary cousins. For example, the .pst file contains the complete postings list, the structure you normally mean when talking about an inverted index:

field content
  term content
    doc 0
      freq 1
      pos 1
    doc 1
      freq 1
      pos 4
  term document
    doc 0
      freq 1
      pos 5
  term first
    doc 0
      freq 1
      pos 4
field title
  term document
    doc 0
      freq 1
      pos 5
    doc 1
      freq 1
      pos 5
  term first
    doc 0
      freq 1
      pos 4
  term my
    doc 0
      freq 1
      pos 3
  term second
    doc 1
      freq 1
      pos 4
  term title
    doc 0
      freq 1
      pos 1
    doc 1
      freq 1
      pos 1
END

The content that is marked as stored resides in the .fld file:

doc 0
  numfields 1
  field 0
    name title
    type string
    value The title of my first document
doc 1
  numfields 1
  field 0
    name title
    type string
    value The title of the second document
END
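Searching the plaintext index works exactly the same way as searching a binary one: the codec name recorded in the segments file is picked up automatically at read time (as long as the SimpleTextCodec is on the classpath). A minimal sketch, reusing the plaintextDir from above:

import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

DirectoryReader reader = DirectoryReader.open(FSDirectory.open(plaintextDir));
IndexSearcher searcher = new IndexSearcher(reader);

// "second" only occurs in the title of the second document
TopDocs topDocs = searcher.search(new TermQuery(new Term("title", "second")), 10);
for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
    Document document = searcher.doc(scoreDoc.doc);
    System.out.println(document.get("title"));
}

reader.close();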

If you'd like to have a look at the rest of the files, check out the code on Github.

The SimpleTextCodec is only an interesting byproduct. The Codec API can be used for a lot of useful things. For example, the ability to read indices of older Lucene versions is implemented using separate codecs. Also, you can mix several Codecs in one index, so reindexing immediately on version updates should not be necessary. I am sure more useful codecs will pop up in the future.

Thursday, August 23, 2012

Getting rid of synchronized: Using Akka from Java

I've been giving an internal talk on Akka, the Actor framework for the JVM, at my former company synyx. For the talk I implemented a small example application, kind of a web crawler, using Akka. I published the source code on Github and will explain some of the concepts in this post.

Motivation

To see why you might need something like Akka, suppose you want to implement a simple web crawler for offline search. You download pages from a certain location, parse and index the content, and follow any links that you haven't parsed and indexed yet. I am using HtmlParser for downloading and parsing pages and Lucene for indexing them. The logic is contained in two service objects, PageRetriever and Indexer, that can be used from our main application.

A simple sequential execution might then look something like this:

public void downloadAndIndex(String path, IndexWriter writer) {
    VisitedPageStore pageStore = new VisitedPageStore();
    pageStore.add(path);
        
    Indexer indexer = new IndexerImpl(writer);
    PageRetriever retriever = new HtmlParserPageRetriever(path);
        
    String page;
    while ((page = pageStore.getNext()) != null) {
        PageContent pageContent = retriever.fetchPageContent(page);
        pageStore.addAll(pageContent.getLinksToFollow());
        indexer.index(pageContent);
        pageStore.finished(page);
    }
        
    indexer.commit();
}

We start with one page, extract the content and the links, index the content, and store all links that still have to be visited in the VisitedPageStore. This class contains the logic to determine which links have already been visited. We loop as long as there are more links to follow; once we are done, we commit the Lucene IndexWriter.

This implementation works fine; running on my outdated laptop it finishes in around 3 seconds for an example page. (Note that the times I am giving are by no means meant as a benchmark; they are just there to give you an idea of the numbers.)

So are we done? Of course not, we can make better use of the resources we have available. Let's try to improve this solution by splitting it into several tasks that can be executed in parallel.

Shared State Concurrency

The usual way in Java would be to implement several Threads that do parts of the work and access the shared state via guarded blocks, e.g. by synchronizing methods. In our case there might be several Threads that access the global state stored in the VisitedPageStore.
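To make that concrete, here is a minimal sketch of what a fully synchronized VisitedPageStore could look like. Only the method names are taken from the usage above; everything else is an assumption, the actual implementation is in the Github repository:

import java.util.Collection;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.Queue;
import java.util.Set;

// Hypothetical guarded version: every access to the shared collections
// is synchronized on the store instance.
public class VisitedPageStore {

    private final Set<String> visited = new HashSet<String>();
    private final Set<String> inProgress = new HashSet<String>();
    private final Queue<String> toVisit = new LinkedList<String>();

    public synchronized void add(String page) {
        if (!visited.contains(page) && !inProgress.contains(page) && !toVisit.contains(page)) {
            toVisit.add(page);
        }
    }

    public synchronized void addAll(Collection<String> pages) {
        for (String page : pages) {
            add(page);
        }
    }

    public synchronized String getNext() {
        String next = toVisit.poll();
        if (next != null) {
            inProgress.add(next);
        }
        return next;
    }

    public synchronized void finished(String page) {
        inProgress.remove(page);
        visited.add(page);
    }

    public synchronized boolean isFinished() {
        return toVisit.isEmpty() && inProgress.isEmpty();
    }
}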

This model is what Venkat Subramaniam calls Synchronize and Suffer in his great book Programming Concurrency on the JVM. Working with Threads and building correct solutions might not seem that hard at first but is inherently difficult. I like these two tweets that illustrate the problem:

Brian Goetz of course being the author of the de-facto standard book on the new Java concurrency features, Java Concurrency in Practice.

Akka

So what is Akka? It's an Actor framework for the JVM that is implemented in Scala, but that is something you rarely notice when working from Java. It offers a nice Java API that provides most of the functionality in a convenient way.

Actors are a concept that was introduced in the seventies but became widely known as one of the core features of Erlang, a language for building fault tolerant, self healing systems. Actors employ the concept of Message Passing Concurrency: Actors communicate only by means of messages that are passed into an Actor's mailbox. Actors can contain state that they shield from the rest of the system; the only way to change that state is by passing in messages. Actors are executed on threads managed by the framework, but they provide a higher level of abstraction than working with Threads directly.

When implementing Actors with the Java API you put the behaviour in a method onReceive() that acts on incoming messages. You can then reply asynchronously to the sender or send messages to any other Actor.
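A minimal Actor, just to show the shape (EchoActor is a made-up example, not part of the crawler), looks like this:

import akka.actor.UntypedActor;

public class EchoActor extends UntypedActor {

    @Override
    public void onReceive(Object message) throws Exception {
        if (message instanceof String) {
            // reply asynchronously to whoever sent the message
            getSender().tell("echo: " + message, getSelf());
        } else {
            // let the framework deal with anything we don't understand
            unhandled(message);
        }
    }
}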

For our problem at hand an Actor setup might look something like this:

There is one Master Actor that also contains the global state. It sends a message to fetch a certain page to a PageParsingActor, which asynchronously responds to the Master with the PageContent. The Master can then send the PageContent to an IndexingActor, which responds with another message. With this setup we have taken a first step towards scaling our solution: there are now three Actors that can run on different cores of your machine.

Actors are instantiated from other Actors. At the top there's the ActorSystem that is provided by the framework. The MasterActor is instantiated from the ActorSystem:

ActorSystem actorSystem = ActorSystem.create();
final CountDownLatch countDownLatch = new CountDownLatch(1);
ActorRef master = actorSystem.actorOf(new Props(new UntypedActorFactory() {

    @Override
    public Actor create() {
        return new SimpleActorMaster(new HtmlParserPageRetriever(path), writer, countDownLatch);
    }
}));

master.tell(path);
try {
    countDownLatch.await();
    actorSystem.shutdown();
} catch (InterruptedException ex) {
    throw new IllegalStateException(ex);
}

Ignore the CountDownLatch, it is only included to make it possible to terminate the application. Note that we are not referencing an instance of our class but an ActorRef, a reference to an Actor. You will see later why this is important.

The MasterActor contains references to the other Actors and creates them from its context. This makes the two Actors children of the Master:

public SimpleActorMaster(final PageRetriever pageRetriever, final IndexWriter indexWriter,
    final CountDownLatch latch) {

    super(latch);
    this.indexer = getContext().actorOf(new Props(new UntypedActorFactory() {

        @Override
        public Actor create() {

            return new IndexingActor(new IndexerImpl(indexWriter));
        }
    }));

    this.parser = getContext().actorOf(new Props(new UntypedActorFactory() {

        @Override
        public Actor create() {

           return new PageParsingActor(pageRetriever);
        }
    }));
}

The PageParsingActor acts on messages to fetch pages and sends a message with the result to the sender:

public void onReceive(Object o) throws Exception {
    if (o instanceof String) {
        PageContent content = pageRetriever.fetchPageContent((String) o);
        getSender().tell(content, getSelf());
    } else {
        // fail on any message we don't expect
        unhandled(o);
    }
}

The IndexingActor contains some state, the Indexer. It acts on messages to index pages and to commit the indexing process.

public void onReceive(Object o) throws Exception {
    if (o instanceof PageContent) {
        PageContent content = (PageContent) o;
        indexer.index(content);
        getSender().tell(new IndexedMessage(content.getPath()), getSelf());
    } else if (COMMIT_MESSAGE == o) {
        indexer.commit();
        getSender().tell(COMMITTED_MESSAGE, getSelf());
    } else {
        unhandled(o);
    }
}
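The IndexedMessage used here is just a small immutable value object that carries the path of the page that has been indexed. A sketch of what it might look like (the real class is in the Github repository):

public class IndexedMessage {

    // messages passed between Actors should be immutable
    public final String path;

    public IndexedMessage(String path) {
        this.path = path;
    }
}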

The MasterActor finally orchestrates the other Actors in its onReceive() method. It starts with one page and sends it to the PageParsingActor. It keeps the valuable state of the application in the VisitedPageStore. When no more pages are to be fetched and indexed, it sends a commit message and terminates the application.

public void onReceive(Object message) throws Exception {

    if (message instanceof String) {
        // start
        String start = (String) message;
        visitedPageStore.add(start);
        getParser().tell(visitedPageStore.getNext(), getSelf());
    } else if (message instanceof PageContent) {
        PageContent content = (PageContent) message;
        getIndexer().tell(content, getSelf());
        visitedPageStore.addAll(content.getLinksToFollow());

        if (visitedPageStore.isFinished()) {
            getIndexer().tell(IndexingActor.COMMIT_MESSAGE, getSelf());
        } else {
            for (String page : visitedPageStore.getNextBatch()) {
                getParser().tell(page, getSelf());
            }
        }
    } else if (message instanceof IndexedMessage) {
        IndexedMessage indexedMessage = (IndexedMessage) message;
        visitedPageStore.finished(indexedMessage.path);

        if (visitedPageStore.isFinished()) {
            getIndexer().tell(IndexingActor.COMMIT_MESSAGE, getSelf());
        }
    } else if (message == IndexingActor.COMMITTED_MESSAGE) {
        logger.info("Shutting down, finished");
        getContext().system().shutdown();
        countDownLatch.countDown();
    }
}

What happens if we run this example? Unfortunately it now takes around 3.5 seconds on my dual core machine. Though we are now able to run on both cores, we have actually made the application slower. This is probably an important lesson: when building scalable applications you might introduce overhead that decreases performance when running in the small. Scalability is not about increasing performance but about the ability to distribute load.

So was it a failure to switch to Akka? Not at all. It turns out that most of the time the application is fetching and parsing pages, which includes waiting for the network. Indexing in Lucene is blazing fast and the Master mostly just dispatches messages. So what can we do about it? We have already split our application into smaller chunks. Fortunately the PageParsingActor doesn't contain any state at all, which means we can easily parallelize its work.

This is where talking to references becomes important. For an Actor it's transparent whether there is one Actor or a million Actors behind a reference. There is one mailbox for an Actor reference that can dispatch the messages to any number of Actors.

We only need to change the instantiation of the Actor; the rest of the application remains the same:

parser = getContext().actorOf(new Props(new UntypedActorFactory() {

        @Override
        public Actor create() {

            return new PageParsingActor(pageRetriever);
        }
}).withRouter(new RoundRobinRouter(10)));

By using a router, Akka automatically makes sure that there are 10 Actors available. The messages are distributed to any available Actor. This takes the runtime down to 2 seconds.

A word on Blocking

Note that the way I am doing network requests here is not recommended in Akka. HtmlParser does blocking networking, which should be carefully reconsidered when designing a reactive system. In fact, as this application is highly network bound, we might gain even more by just using an asynchronous networking library. But hey, then I wouldn't be able to tell you how nice it is to use Akka. In a future post I will highlight some more Akka features that can help make our application more robust and fault tolerant.

Thursday, July 5, 2012

Slides and demo code for my talk at JUG KA available

I just uploaded the (German) slides as well as the example code for yesterday's talk on Lucene and Solr at our local Java User Group.

The demo application contains several subprojects for indexing and searching with Lucene and Solr as well as a simple Dropwizard application that demonstrates some search features. See the README files in the source tree to find out how to run the application.

Friday, June 29, 2012

Dropwizard Encoding Woes

I have been working on an example application for Lucene and Solr for my upcoming talk at the Java User Group Karlsruhe. As a web framework I wanted to try Dropwizard, a lightweight application framework that can expose resources via JAX-RS, provides out-of-the-box monitoring support and can render resource representations using Freemarker. It's really easy to get started; there's a good tutorial and the manual.

An example resource might look like this:

import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;

@Path("/example")
@Produces(MediaType.TEXT_HTML)
public class ExampleResource {

    @GET
    public ExampleView illustrate() {
        return new ExampleView("Mot\u00f6rhead");
    }

}

The resource produces HTML using Freemarker, which is possible if you add the view bundle to the service. There is one method that is called when the resource is addressed using GET. Inside the method we create a view object and pass it a message that in this case contains the umlaut 'ö'. The view class that is returned by the method looks like this:

import com.yammer.dropwizard.views.View;

public class ExampleView extends View {

    private final String message;

    public ExampleView(String message) {
        super("example.fmt");
        this.message = message;
    }

    public String getMessage() {
        return message;
    }
}

It accepts a message as a constructor parameter. The template name is passed to the parent class. This view class is now available in a Freemarker template; a simple variant looks like this:

<html>
    <body>
        <h1>${message} rocks!</h1>
    </body>
</html>

If I run this on my machine and access it with Firefox it doesn't work as expected. The umlaut character is broken, something Lemmy surely doesn't approve of:

Accessing the resource using curl works flawlessly:

curl http://localhost:8080/example
<html>
    <body>
        <h1>Motörhead rocks!</h1>
    </body>
</html>

Why is that? It's Servlet Programming 101: you need to set the character encoding of the response. My Firefox defaults to ISO-8859-1, while curl seems to use UTF-8 by default. How can we fix it? Tell the client which encoding we are using, which can be done using the Produces annotation:

@Produces("text/html; charset=utf-8")

So what does this have to do with Dropwizard? Nothing really, it's a JAX-RS thing. All components in Dropwizard (notably Jetty and Freemarker) use UTF-8 by default.

Wednesday, June 20, 2012

Running and Testing Solr with Gradle

A while ago I blogged about testing Solr with Maven on the synyx blog. In this post I will show you how to set up a similar project with Gradle that can start the Solr webapp and execute tests against your configuration.

Running Solr

Solr runs as a webapp in any Java EE servlet container like Tomcat or Jetty. The index and search configuration resides in a directory commonly referred to as Solr home, which can be outside of the webapp directory. This is also the place where the Lucene index files are created. The location of Solr home can be set using the system property solr.solr.home.

The Solr war file is available in Maven Central. This post describes how to use Gradle to run a war file that is deployed in a Maven repository. Let's see what the Gradle build file for running Solr looks like:

import org.gradle.api.plugins.jetty.JettyRunWar

apply plugin: 'java'
apply plugin: 'jetty'

repositories {
    mavenCentral()
}

// custom configuration for running the webapp
configurations {
    solrWebApp
}

dependencies {
    solrWebApp "org.apache.solr:solr:3.6.0@war"
}

// custom task that configures the jetty plugin
task runSolr(type: JettyRunWar) {
    webApp = configurations.solrWebApp.singleFile

    // jetty configuration
    httpPort = 8082
    contextPath = 'solr'
}

// executed before jetty starts
runSolr.doFirst {
    System.setProperty("solr.solr.home", "./solrhome")
}

We create a custom configuration that contains the Solr war file. In the task runSolr we configure the Jetty plugin. To set Solr home we can use the approach described by Sebastian Himberger: a code block that is executed before Jetty starts sets the system property using standard Java mechanisms. You can now start Solr using gradle runSolr. You will see some errors regarding multiple versions of slf4j that are very likely caused by this bug.

Testing the Solr configuration

Solr provides some classes that start an embedded instance using your configuration. You can use these classes in any setup as they do not depend on the Gradle Jetty plugin. Starting with Solr 3.2 the test framework is no longer included in solr-core. This is what the relevant part of the dependency section looks like now:

testCompile "junit:junit:4.10"
testCompile "org.apache.solr:solr-test-framework:3.6.0"

Now you can place a test in src/test/java that either uses the convenience methods provided by SolrTestCaseJ4 or instantiates an EmbeddedSolrServer and executes any SolrJ actions. Both ways use your custom config, so you can easily validate that configuration changes don't break existing functionality. An example using the convenience methods:

import org.apache.solr.SolrTestCaseJ4;
import org.apache.solr.client.solrj.SolrServerException;
import org.junit.BeforeClass;
import org.junit.Test;
import java.io.IOException;

public class BasicConfigurationTest extends SolrTestCaseJ4 {

    @BeforeClass
    public static void initCore() throws Exception {
        SolrTestCaseJ4.initCore("solrhome/conf/solrconfig.xml", "solrhome/conf/schema.xml", "solrhome/");
    }

    @Test
    public void noResultInEmptyIndex() throws SolrServerException {
        assertQ("test query on empty index",
                req("text that is not found")
                , "//result[@numFound='0']"
        );
    }

    @Test
    public void pathIsMandatory() throws SolrServerException, IOException {
        assertFailedU(adoc("title", "the title"));
    }

    @Test
    public void simpleDocumentIsIndexedAndFound() throws SolrServerException, IOException {
        assertU(adoc("path", "/tmp/foo", "content", "Some important content."));
        assertU(commit());

        assertQ("added document found",
                req("important")
                , "//result[@numFound='1']"
        );
    }

}

We extend the class SolrTestCaseJ4, which is responsible for creating the core and instantiating the runtime using the paths we provide to the method initCore(). Using the available assert methods you can execute queries and validate the results using XPath expressions.

An example that instantiates a SolrServer might look like this:

import org.apache.solr.SolrTestCaseJ4;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.common.params.SolrParams;
import org.junit.After;
import org.junit.Before;
import org.junit.BeforeClass;
import org.junit.Test;

import java.io.IOException;

public class ServerBasedTalkTest extends SolrTestCaseJ4 {

    private EmbeddedSolrServer server;

    @BeforeClass
    public static void initCore() throws Exception {
        SolrTestCaseJ4.initCore("solr/conf/solrconfig.xml", "solr/conf/schema.xml");
    }

    @Before
    public void initServer() {
        server = new EmbeddedSolrServer(h.getCoreContainer(), h.getCore().getName());
    }

    @Test
    public void queryOnEmptyIndexNoResults() throws SolrServerException {
        QueryResponse response = server.query(new SolrQuery("text that is not found"));
        assertTrue(response.getResults().isEmpty());
    }

    @Test
    public void singleDocumentIsFound() throws IOException, SolrServerException {
        SolrInputDocument document = new SolrInputDocument();
        document.addField("path", "/tmp/foo");
        document.addField("content", "Mein Hut der hat 4 Ecken");

        server.add(document);
        server.commit();

        SolrParams params = new SolrQuery("ecke");
        QueryResponse response = server.query(params);
        assertEquals(1L, response.getResults().getNumFound());
        assertEquals("/tmp/foo", response.getResults().get(0).get("path"));
    }

    @After
    public void clearIndex() {
        super.clearIndex();
    }
}

The tests can now be executed using gradle test.

Testing your Solr configuration is important, as changes in one place might easily lead to side effects in other search functionality. I recommend adding tests even for basic functionality and evolving the tests with your project.

Saturday, June 16, 2012

Reading term values for fields from a Lucene Index

Sometimes when using Lucene you might want to retrieve all term values for a given field. Think of categories that you want to display as search links or in a filtering dropdown box. Indexing might look something like this:

IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36));
IndexWriter writer = new IndexWriter(directory, config);

Document doc = new Document();

doc.add(new Field("Category", "Category1", Field.Store.NO, Field.Index.NOT_ANALYZED));
doc.add(new Field("Category", "Category2", Field.Store.NO, Field.Index.NOT_ANALYZED));
doc.add(new Field("Author", "Florian Hopf", Field.Store.NO, Field.Index.NOT_ANALYZED));
writer.addDocument(doc);

doc.add(new Field("Category", "Category3", Field.Store.NO, Field.Index.NOT_ANALYZED));
doc.add(new Field("Category", "Category2", Field.Store.NO, Field.Index.NOT_ANALYZED));
doc.add(new Field("Author", "Theo Tester", Field.Store.NO, Field.Index.NOT_ANALYZED));
writer.addDocument(doc);

writer.close();

We are adding two documents, one that is assigned Category1 and Category2 and one that is assigned Category2 and Category3. Note that we are adding both fields unanalyzed, so the Strings are added to the index as they are. Lucene's index looks something like this afterwards:

Field      Term            Documents
Author     Florian Hopf    1
           Theo Tester     2
Category   Category1       1
           Category2       1, 2
           Category3       2

The fields are sorted alphabetically by field name first and then by term value. You can access the values using the IndexReader's terms() method, which returns a TermEnum. You can instruct the IndexReader to start at a certain term, so you can jump directly to the category without having to iterate over all values. But before we do this, let's look at how we are used to accessing Enumeration values in Java:

Enumeration en = ...;
while(en.hasMoreElements()) {
    Object obj = en.nextElement();
    ...
}

In a while loop we check whether there is another element and retrieve it inside the loop. As this pattern is very common, when iterating the terms with Lucene you might end up with something like this (note that all the examples here are missing the stop condition; if there are more fields, their terms will also be iterated):

TermEnum terms = reader.terms(new Term("Category"));
// this code is broken, don't use
while(terms.next()) {
    Term term = terms.term();
    System.out.println(term.text());
}

The next() method returns a boolean indicating whether there are more elements and advances to the next element. The term() method can then be used to retrieve the Term. But this doesn't work as expected: the code only finds Category2 and Category3 but skips Category1. Why is that? The Lucene TermEnum works differently from what we are used to with Java Enumerations. When the TermEnum is returned it already points to the first element, so by calling next() first we skip this element.

This snippet instead works correctly using a for loop:

TermEnum terms = reader.terms(new Term("Category"));
for(Term term = terms.term(); term != null; terms.next(), term = terms.term()) {
    System.out.println(term.text());
}

Or you can use a do-while loop with a check for the first element:

TermEnum terms = reader.terms(new Term("Category"));
if (terms.term() != null) {
    do {
        Term term = terms.term();
        System.out.println(term.text());
    } while(terms.next());
}
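For completeness, the missing stop condition only requires checking the field of each term, so a version restricted to the Category field might look like this (also closing the TermEnum when done):

TermEnum terms = reader.terms(new Term("Category"));
try {
    Term term = terms.term();
    // stop as soon as we leave the Category field
    while (term != null && "Category".equals(term.field())) {
        System.out.println(term.text());
        term = terms.next() ? terms.term() : null;
    }
} finally {
    terms.close();
}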

You can't really blame Lucene for this as the methods are aptly named. It's our habits that lead to minor errors like this.

Wednesday, June 6, 2012

Berlin Buzzwords 2012

Berlin Buzzwords is an annual conference on search, store and scale technology. I've heard good things about it before and finally got convinced to go there this year. The conference itself lasts for two days but there are additional events before and afterwards so if you like you can spend a whole week.

The Barcamp

As I had to travel on Sunday anyway I took an earlier train to attend the barcamp in the early evening. It started with a short introduction of the concept and the scheduling. Participants could suggest topics that they either were willing to introduce themselves or that they were just interested in. There were three rooms prepared, one larger and two smaller ones.

Among others I attended sessions on HBase, designing hybrid applications, Apache Tika and Apache Jackrabbit Oak.

HBase is a distributed database built on top of the Hadoop filesystem. It seems to be used more often than I would have expected. It was interesting to hear about other people's problems and solutions.

The next session on hybrid relational and NoSQL applications stayed rather high level. I liked one attendee's remark that Solr, the underdog of NoSQL, is often the first application where people are ok with dismissing some guarantees regarding their data. Adding NoSQL should be exactly like this.

I only recently started to use Tika directly, so it was really interesting to see where the project is heading. I was surprised to hear that there now also is a TikaServer that can do things similar to those I described for Solr. That's something I want to try in action.

Jackrabbit Oak is a next generation content repository that is mostly driven by the Day team at Adobe. Some of the ideas sound really interesting, but I got the feeling that it can still take some time until it is really usable. Jukka Zitting also gave a lightning talk on this topic at the conference; the slides are available here.

The atmosphere in the sessions was really relaxed, so even though I expected to only listen I took the chance to participate and ask some questions. This is probably the part that makes a barcamp as effective as it is: as you are constantly participating, you stay really concentrated on the topic.

Day 1

The first day started with a great keynote by Leslie Hawthorn on building and maintaining communities. She compared a lot of aspects of community work with gardening and introduced OpenMRS, a successful project building a medical record platform. Though I am currently not actively involved in an open source project I could relate to a lot of the situations she described. All in all an inspiring start to the main conference.

Next I attended a talk on building hybrid applications with MongoDB. Nothing new for me, but I am glad that a lot of people now recommend splitting monolithic applications into smaller services. This is also a way to experiment with different languages and techniques without having to migrate large parts of an application.

A JCR view of the world provided some examples of how to model different structures using a content tree. Though really introductory, it was interesting to see what kind of applications can be built using a content repository. I also liked the attitude of the speaker: the presentation was delivered using Apache Sling, which uses JCR under the hood.

Probably the highlight of the first day was the talk by Grant Ingersoll on Large Scale Search, Discovery and Analytics. He introduced all the parts that make up larger search systems and showed the open source tools he uses. To increase the relevance of the search results you have to integrate solutions that adapt to the behaviour of the users. That's probably one of the big takeaways of the whole conference for me: always collect data on your users' searches so it is available when you want to tune relevance, either manually or through some learning technique. The slides of the talk are worth looking at.

The rest of the day I attended several talks on the internals of Lucene. Hardcore stuff; I would be lying if I said I understood everything, but it was interesting nevertheless. I am glad that some really smart people are taking care that Lucene stays as fast and feature rich as it is.

The day ended with interesting discussions and some beer at the Buzz Party and BBQ.

Day 2

The first talk of the second day, on Smart Autocompl... by Anne Veling, was fantastic. Anne demonstrated a rather simple technique for doing semantic analysis of search queries for specialized autocompletion in the largest travel information system in the Netherlands. The query gets tokenized and then each field of the index (e.g. street or city) is queried for each of the tokens. This way you can already guess which fields might be good matches.

Another talk introduced Hydra, a scalable tool for preprocessing documents. It stores the documents as well as mapping data in a MongoDB instance, and you can parallelize the processing steps. The concept sounds really interesting; I hope I can find time to have a closer look.

In the afternoon I attended several talks on Elasticsearch, the scalable search server. Interestingly a lot of people seem to use it more as a storage engine than for searching.

One of the tracks was cancelled, so Ted Dunning introduced new stuff in Mahout instead. He's a really funny speaker, and though I am not deep into machine learning I was glad to hear that you are allowed to use and even contribute to Mahout even if you don't have a PhD.

In the last track of the day Alex Pinkin showed 10 problems and solutions that you might encounter when building a large app using Solr. Quite some useful advice.

The location

The event took place at Urania, a smaller conference center and theatre. Mostly it was well suited, but some of the talks were so full that you either had to sit on the floor or weren't even able to enter the room. I understand that it is difficult to predict how many people will attend a certain event, but some talks should probably have been scheduled in different rooms.

The food was really good, and though at first it looked like distribution would be a bottleneck, it worked pretty well.

The format

This year Berlin Buzzwords had a rather unusual format. Most of the talks were only 20 minutes long, with some exceptions at 40 minutes. I have mixed feelings about this: on the one hand it was great to have a lot of different topics; on the other hand some of the concepts definitely would have needed more time to fully explain and grasp. Respect to all the speakers who had to think about what to cover in such a short timeframe.

Berlin Buzzwords is a fantastic conference and I will definitely go there again.