Thursday, January 24, 2013

Make your Filters Match: Faceting in Solr

Facets are a great search feature that lets users easily navigate to the documents they are looking for. Solr makes them really easy to use, though when naively querying for facet values you might see some unexpected behaviour. Read on to learn the basics of what happens when you pass in filter queries for faceting. I'll also show how you can leverage local params to choose a different query parser when selecting facet values.

Introduction

Facets are a way to display categories next to a user's search results, often with a count of how many results are in each category. The user can then select one of those facet values to retrieve only the results that are assigned to that category. This way they don't have to know which category they are looking for when entering the search term, as all the available categories are delivered with the search results. This approach is really popular on sites like Amazon and eBay and is a great way to guide the user.

Solr brought faceting to the Lucene world, and arguably the feature was an important driving factor for its success (Lucene 3.4 introduced faceting as well). Facets can be built from terms in the index, custom queries and ranges, though in this post we will only look at field facets.

As a very simple example consider this schema definition:

<fields>
  <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
  <field name="text" type="text_general" indexed="true" stored="true"/>
  <field name="author" type="string" indexed="true" stored="false"/>
</fields>

There are three fields: the id, a text field containing the title that we'd probably like to search on, and the author. The author is defined as a string field, which means no analysis at all. The faceting mechanism uses the indexed term value, not the stored value, so we want to make sure the original value is preserved. I explicitly don't store the author information to make it clear that we are working with the indexed value.

Let's index some book data with curl (see this GitHub repo for the complete example, including some unit tests that exercise the same functionality using Java).

curl http://localhost:8082/solr/update -H "Content-Type: text/xml" --data-binary \
'<add>
  <doc>
    <field name="id">1</field>
    <field name="text">On the Shortness of Life</field>
    <field name="author">Seneca</field>
  </doc>
  <doc>
    <field name="id">2</field>
    <field name="text">What I Talk About When I Talk About Running</field>
    <field name="author">Haruki Murakami</field>
  </doc>
  <doc>
    <field name="id">3</field>
    <field name="text">The Dude and the Zen Master</field>
    <field name="author">Jeff "The Dude" Bridges</field>
  </doc>
</add>'
curl http://localhost:8082/solr/update -H "Content-Type: text/xml" --data-binary '<commit />'

And verify that the documents are available:

curl 'http://localhost:8082/solr/query?q=*:*'
{
  "responseHeader":{
    "status":0,
    "QTime":3,
    "params":{
      "q":"*:*"}},
  "response":{"numFound":3,"start":0,"docs":[
      {
        "id":"1",
        "text":"On the Shortness of Life"},
      {
        "id":"2",
        "text":"What I Talk About When I Talk About Running"},
      {
        "id":"3",
        "text":"The Dude and the Zen Master"}]
  }}

I'll omit parts of the response in the following examples. We can also have a look at the shiny new administration view of Solr 4 to see all terms that are indexed for the field author.

Each of the author names is indexed as one term.

Faceting

Let's move on to the faceting part. To let the user drill down on search results there are two steps involved: first you tell Solr that you would like to retrieve facets with the results, and second you filter on the facet value the user selects (more on that below). Facets are contained in an extra section of the response and consist of the indexed term as well as a count. As with most Solr parameters you can either send the necessary options with the query or preconfigure them in solrconfig.xml. This query has faceting on the author field enabled:

curl "http://localhost:8082/solr/query?q=*:*&facet=on&facet.field=author"
{
  "responseHeader":{...},
  "response":{"numFound":3,"start":0,"docs":[
      {
        "id":"1",
        "text":"On the Shortness of Life"},
      {
        "id":"2",
        "text":"What I Talk About When I Talk About Running"},
      {
        "id":"3",
        "text":"The Dude and the Zen Master"}]
  },
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{
      "author":[
        "Haruki Murakami",1,
        "Jeff \"The Dude\" Bridges",1,
        "Seneca",1]},
    "facet_dates":{},
    "facet_ranges":{}}}

And this is what the corresponding configuration in solrconfig.xml looks like:

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="q">*:*</str>  
    <str name="echoParams">none</str>
    <int name="rows">10</int>
    <str name="df">text</str>
    <str name="facet">on</str>
    <str name="facet.field">author</str>
    <str name="facet.mincount">1</str>
  </lst>
</requestHandler>

This way we don't have to pass the parameters with the query anymore and can see which parts of the query change.
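
If you are using SolrJ from Java, the same facet query looks like this. This is a minimal sketch, assuming the SolrJ 4 client and the Solr instance from the examples above:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FacetExample {

    public static void main(String[] args) throws Exception {
        SolrServer server = new HttpSolrServer("http://localhost:8082/solr");

        // Equivalent to q=*:*&facet=on&facet.field=author&facet.mincount=1
        SolrQuery query = new SolrQuery("*:*");
        query.setFacet(true);
        query.addFacetField("author");
        query.setFacetMinCount(1);

        QueryResponse response = server.query(query);
        for (FacetField.Count count : response.getFacetField("author").getValues()) {
            System.out.println(count.getName() + ": " + count.getCount());
        }
    }
}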

Common Filtering

When a user chooses a facet value you issue the same query again, this time adding a filter query that restricts the search results to those that have this value set for the field. In our case the user would only see books by one certain author. Let's start simple and pretend that a user can't handle the massive amount of 3 search results and is only interested in books by Seneca:

curl 'http://localhost:8082/solr/select?fq=author:Seneca'
{
  "responseHeader":{...},
  "response":{"numFound":1,"start":0,"docs":[
      {
        "id":"1",
        "text":"On the Shortness of Life"}]
  },
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{
      "author":[
        "Seneca",1]},
    "facet_dates":{},
    "facet_ranges":{}}}

Works fine. We added a filter query that restricts the results to only those written by Seneca. Note that there is only one facet value left because the search results don't contain any books by other authors. Let's see what happens when we try to filter the results to see only books by Haruki Murakami. We need to URL-encode the blank; the rest of the query stays the same:

curl 'http://localhost:8082/solr/select?fq=author:Haruki%20Murakami'
{
  "responseHeader":{...},
  "response":{"numFound":0,"start":0,"docs":[]
  },
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{
      "author":[]},
    "facet_dates":{},
    "facet_ranges":{}}}

No results. Why is that? The default query parser for filter queries is the Lucene query parser, which tokenizes the query on whitespace, so even though the field is indexed unanalyzed this is probably not the query we expect. The query that results from the parsing process is not a term query as in our first example. It's a boolean query that consists of two term queries: author:Haruki text:murakami. If you are familiar with the Lucene query syntax this won't be a surprise: if you prefix a term with a field name and a colon it will search on this field, otherwise it will search on the default field we declared in solrconfig.xml.
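
If you want to see the parsed query for yourself you can enable debug output. A sketch, assuming the SolrJ setup from above (the same works via curl by appending &debugQuery=true):

SolrQuery query = new SolrQuery("*:*");
query.addFilterQuery("author:Haruki Murakami");
query.set("debugQuery", "true");
QueryResponse response = server.query(query);
// The parsed_filter_queries entry of the debug section shows the
// boolean query that explains the empty result.
System.out.println(response.getDebugMap().get("parsed_filter_queries"));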

How can we fix it? Simple: just turn the value into a phrase by surrounding the words with double quotes:

curl 'http://localhost:8082/solr/select?fq=author:"Haruki%20Murakami"'
{
  "responseHeader":{...},
  "response":{"numFound":1,"start":0,"docs":[
      {
        "id":"2",
        "text":"What I Talk About When I Talk About Running"}]
  },
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{
      "author":[
        "Haruki Murakami",1]},
    "facet_dates":{},
    "facet_ranges":{}}}

Or, if you prefer, you can also escape the blank using the backslash, which yields the same result:

curl 'http://localhost:8082/solr/select?fq=author:Haruki\%20Murakami'
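
In SolrJ you don't have to do the escaping by hand; ClientUtils.escapeQueryChars escapes everything that is special to the Lucene query parser, including whitespace. A sketch, again assuming the setup from above:

import org.apache.solr.client.solrj.util.ClientUtils;

SolrQuery query = new SolrQuery("*:*");
// Produces the filter query author:Haruki\ Murakami
query.addFilterQuery("author:" + ClientUtils.escapeQueryChars("Haruki Murakami"));
QueryResponse response = server.query(query);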

Fun fact: I am not that good at picking examples. If we filter on our last author we are in for a surprise (at least I scratched my head for a while):

curl 'http://localhost:8082/solr/select?fq=author:Jeff%20"The%20Dude"%20Bridges'
{
  "responseHeader":{...},
  "response":{"numFound":1,"start":0,"docs":[
      {
        "id":"3",
        "text":"The Dude and the Zen Master"}]
  },
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{
      "author":[
        "Jeff \"The Dude\" Bridges",1]},
    "facet_dates":{},
    "facet_ranges":{}}}

This actually seemed to work even though we neither turned the value into a phrase nor escaped the blanks. If we look at how the Lucene query parser handles this query we see immediately why it returns a result. As in the last example it is turned into a boolean query, and only the first term query is executed against the author field. The other two clauses search the default field, and in this case the phrase "The Dude" matches the text field: author:Jeff text:"the dude" text:bridges. If you just want to match on the author field you can escape the blanks as we did in the example before:

curl 'http://localhost:8082/solr/select?fq=author:Jeff\%20\"The\%20Dude\"\%20Bridges'

I'll spare you the response.

Using Local Params to set the Query Parser

At ApacheCon Europe in November, Erik Hatcher gave a really interesting presentation on query parsers in Solr where he introduced another, probably cleaner way to do this: you can use the local params syntax to choose a different query parser. As we have learned, the query parser defaults to the Lucene query parser. You can change the query parser for the main query by setting the defType parameter, either via request parameters or in solrconfig.xml, but I am not aware of any way to set it for the filter queries. As we have unanalyzed terms, the correct thing to do is to use a TermQuery, which can be built using the TermQParserPlugin. To use this parser we can explicitly select it in the filter query:

curl 'http://localhost:8082/solr/select?fq={!term%20f=author%20v=%27Jeff%20%22The%20Dude%22%20Bridges%27}'

Or, for better readability, this is the same filter query without the URL encoding:

fq={!term f=author v='Jeff "The Dude" Bridges'}

The local params are enclosed in curly braces. The value term is shorthand for type=term, f is the field the TermQuery should be built for, and v is the value. Though this might look quirky at first, it is a really powerful feature, especially since you can reference other request parameters from the local params.
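
Before we get to parameter references, here is the same term query parser used from SolrJ. A sketch, assuming the setup from above; note that the value after the closing brace is used verbatim as the term, so no escaping is necessary:

SolrQuery query = new SolrQuery("*:*");
query.addFilterQuery("{!term f=author}Jeff \"The Dude\" Bridges");
QueryResponse response = server.query(query);

Now consider this configuration of a request handler: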

<requestHandler name="/selectfiltered" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="q">*:*</str>  
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="wt">json</str>
    <str name="indent">true</str>
    <str name="df">text</str>
    <str name="facet">on</str>
    <str name="facet.field">author</str>
    <str name="facet.mincount">1</str>
  </lst>
  <lst name="appends">
    <str name="fq">{!term f=author v=$author}</str>
  </lst>
</requestHandler>

The defaults configuration is the same as we were using above. Only the appends section is new; it adds additional parameters to each request. The local params are the same as the ones we used via curl, but the actual value is replaced by a reference to the parameter $author. The author can now be passed in cleanly via an aptly named request parameter:

curl 'http://localhost:8082/solr/selectfiltered?author=Jeff%20"The%20Dude"%20Bridges'
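
The same request via SolrJ is now pleasantly unspectacular. A sketch, assuming the setup from above; setRequestHandler points the query at the preconfigured handler and the author parameter is picked up by the appended filter query:

SolrQuery query = new SolrQuery("*:*");
query.setRequestHandler("/selectfiltered");
// Dereferenced by the configured fq {!term f=author v=$author}
query.set("author", "Jeff \"The Dude\" Bridges");
QueryResponse response = server.query(query);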

There are a lot of powerful features in Solr that are not that commonly used. To see this example in Java, have a look at the GitHub repository for this blog post.

Thursday, January 10, 2013

JUnit Rule for ElasticSearch

While I am using Solr a lot in my current engagement, I recently started a pet project with ElasticSearch to learn more about it. Some of its functionality is rather different from Solr, so there is quite some experimentation involved. I like to start small and implement tests when I want to find out how things work (see this post on how to write tests for Solr).

ElasticSearch internally uses TestNG, and its test classes are not available in the distributed jar files. Fortunately it is really easy to start an ElasticSearch instance from within a test, so it's no problem to do something similar in JUnit. Felix Müller posted some useful code snippets on how to do this, obviously targeted at a Maven build. The ElasticSearch instance is started in a setUp method and stopped in a tearDown method:

private EmbeddedElasticsearchServer embeddedElasticsearchServer;

@Before
public void startEmbeddedElasticsearchServer() {
    embeddedElasticsearchServer = new EmbeddedElasticsearchServer();
}

@After
public void shutdownEmbeddedElasticsearchServer() {
    embeddedElasticsearchServer.shutdown();
}

As it is rather cumbersome to add these methods to every test, I transformed the code into a JUnit rule. Rules can execute code before and after a test is run and influence its execution. There are some base classes available that make it really easy to get started with custom rules.

Our ElasticSearch example can easily be modeled using the base class ExternalResource (see the full example code on GitHub):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.commons.io.FileUtils;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.node.Node;
import org.elasticsearch.node.NodeBuilder;
import org.junit.rules.ExternalResource;

public class ElasticsearchTestNode extends ExternalResource {

    private Node node;
    private Path dataDirectory;

    @Override
    protected void before() throws Throwable {
        // All index data goes to a temporary directory that is
        // deleted again in after().
        try {
            dataDirectory = Files.createTempDirectory("es-test");
        } catch (IOException ex) {
            throw new IllegalStateException(ex);
        }

        // HTTP is disabled as the tests only talk to the node via the Java client.
        ImmutableSettings.Builder elasticsearchSettings = ImmutableSettings.settingsBuilder()
                .put("http.enabled", "false")
                .put("path.data", dataDirectory.toString());

        // A local node runs in the same JVM and is not visible on the network.
        node = NodeBuilder.nodeBuilder()
                .local(true)
                .settings(elasticsearchSettings.build())
                .node();
    }

    @Override
    protected void after() {
        node.close();
        try {
            FileUtils.deleteDirectory(dataDirectory.toFile());
        } catch (IOException ex) {
            throw new IllegalStateException(ex);
        }
    }

    public Client getClient() {
        return node.client();
    }
}

The before method is executed before each test runs, so we can use it to start ElasticSearch. All data is written to a temporary folder. The after method is used to stop ElasticSearch and delete the folder.

In your tests you can now just use the rule, either with the @Rule annotation to have it triggered on each test method, or using @ClassRule to execute it only once per class:

// Static imports assumed: jsonBuilder from org.elasticsearch.common.xcontent.XContentFactory,
// assertThat from a fluent assertion library such as FEST Assert.
public class CoreTest {

    @Rule
    public ElasticsearchTestNode testNode = new ElasticsearchTestNode();
    
    @Test
    public void indexAndGet() throws IOException {
        testNode.getClient().prepareIndex("myindex", "document", "1")
                .setSource(jsonBuilder().startObject().field("test", "123").endObject())
                .execute()
                .actionGet();
        
        GetResponse response = testNode.getClient().prepareGet("myindex", "document", "1").execute().actionGet();
        assertThat(response.getSource().get("test")).isEqualTo("123");
    }
}
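
If you prefer to start the node only once per class, JUnit requires the @ClassRule field to be static. A minimal sketch:

public class CoreTest {

    // The embedded node is started once before the first test method
    // and shut down after the last one.
    @ClassRule
    public static ElasticsearchTestNode testNode = new ElasticsearchTestNode();

    // ... test methods as above ...
}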

As it is really easy to implement custom rules, I think this is a feature I will be using more often in the future.

Thursday, January 3, 2013

12 Conferences of 2012

I went to a lot of conferences in 2012, probably too many. As the year is over now, I'd like to summarize some of my impressions; maybe there's a conference you didn't know about that you'd like to attend this year.

FOSDEM

FOSDEM is the Free and Open source Software Developers' European Meeting, a yearly event that takes place in Brussels, Belgium. There are multiple tracks and developer rooms on a multitude of topics, ranging from databases to programming languages and open source tools. The rooms are spread across several buildings of the university, so there might be some walking involved when switching tracks. What is rather special is that there's no registration involved: you just go there, and that's it. The amount of people can be overwhelming, especially in the main entrance area. Unfortunately I was rather disappointed with the talks I chose. The event is a very good fit if you are working on an open source project and, as the name of the conference suggests, want to meet other developers of the project.

Berlin Expert Days

BED-Con is a rather young conference organized by the Java User Group Berlin-Brandenburg. I wasn't there in its first year, but in 2012 it still had a small and informal feel. The conference takes place in three rooms of the Freie Universität Berlin; the content selection was an excellent mixture of technical and process/meta talks, most of them in German. If you can afford the trip to Berlin I'd definitely recommend going there.

JAX

The largest and best-known German Java conference. There are two editions: JAX in April in Mainz (the one I attended) and W-JAX in November in Munich. There's one huge hall, several smaller rooms, and a wide variety of topics to choose from. I never planned to go there as the admission fee is rather high (thanks to Software & Support for sponsoring my ticket), but I have to admit that it really can be worth the money. There were excellent talks by Charles Nutter, Tim Berglund and many more. The infrastructure (food, coffee, schedules) is very good; if you are on a business budget you can gain a lot by visiting.

Berlin Buzzwords

A niche conference on Search, Scale and Big Data. A lot of people come from overseas just to visit. If you are interested in these technologies, definitely go there. For more information see this post.

Barcamp Karlsruhe

My first real Barcamp. A really fun event, though of course there are always some sessions that are not as interesting as anticipated. Topics ranged from computers and work to softer content. I always thought this was a nerds-only event, but as there is so much to choose from, Barcamps might even be interesting for people who are not that much into computers. Very well organized, no admission fee, interesting sessions.

Socrates

The International Software Craftsmanship and Testing Conference. An awesome setting and the first open space conference I attended. It takes place in a seminar center in the middle of nowhere, which makes it a very intense experience. Besides the sessions there are a lot of informal discussions going on throughout the day with the very enthusiastic attendees. The 2012 event started Thursday evening with a world cafe, continued with open space on Friday and Saturday, and offered an optional code retreat on Sunday. I'd say there were three kinds of sessions: informal discussion rounds, practical hands-on sessions and talks. It seems that most people liked the practical sessions best, so if I could choose again I'd go to more of those. Thanks to the sponsors, all we had to pay was the accommodation and one meal, which additionally makes it an incredibly cheap event. Be quick with registration as space is limited.

FrOSCon

The Free and Open Source Software Conference is a great community weekend event with different tracks on admin and development topics. I like it a lot because of the variety of talks and the very informal setting. It's a mixture of holiday and learning, and for me a chance to get information on topics that are not presented at the other developer conferences I attend. Talks are partly English, partly German. You can either stay in St. Augustin or in Bonn; it's only a short tram ride.

JAX On Tour

JAX On Tour is another event I attended thanks to the generous sponsorship of Software & Support. It's not a conference but a training event with longer talks that are grouped together. It's a small event and there's always time for questions. I learned a lot, mainly about documenting architectures. This is a really good alternative to a normal conference if you want to grasp a topic in depth.

OpenCms-Days

If you are into OpenCms you have probably already heard about OpenCms-Days, and if not, this is probably not for you. Two days of OpenCms only, used by Alkacon to present new features and by the community to present extensions and projects. I am always impressed that there are people who fly around the world just to attend, but of course this is the only conference of its kind worldwide. There is always something new to learn and it's fun to meet the community.

DevFest Karlsruhe

A one-day event on Google technologies. There have been several of these events in different cities worldwide; this one was organized by the Google Developer Group Karlsruhe. The organizers were really unlucky as multiple speakers canceled on short notice; nevertheless there were some really good talks. Kudos to the organizers, who managed to get this event started in a really short time frame.

ApacheCon Europe

I originally went to ApacheCon because it took place in Sinsheim, which is close to Karlsruhe. Fortunately this year the conference also hosted LuceneCon Europe, so there were lots of interesting talks for me. Additionally, I had just been voted in as a committer on the ODFToolkit, and I was able to meet some of the other people involved in the project. The location was really special (a soccer stadium), but I think a lot of non-locals suffered a bit because of the lack of hotels and taxis. This community event can be really interesting even if you are only a user of a project.

Devoxx

Devoxx is the largest Java conference in Europe, organized by the Java User Group Belgium. They attract a lot of high-class speakers, so this is the place to stay informed about the Java universe. It is located in a large multiplex cinema in a suburb of Antwerp, with very comfy chairs and huge screens. The week starts with two university days that contain longer talks and are usually less crowded. This year I went there on Wednesday; due to a train strike in Belgium I arrived quite late. I have to admit that it probably wasn't worth the hassle for only 1.5 days. If you are going there I recommend staying the whole week.

2013

So that was a lot for one year. I don't plan to visit that many conferences again. I am sure I will be going to Berlin Buzzwords, Socrates and FrOSCon. There will probably be more, but it won't be 13 this year :).

Finally, I couldn't have afforded to pay for all those conferences myself, so thanks to synyx, who paid for FOSDEM and BED-Con when I was still employed there and provided me with a free ticket to OpenCms-Days. Thanks to Software & Support for letting me attend JAX and JAX On Tour for free; those guys are fantastic supporters of the Java User Group Karlsruhe. Also, thanks to the Devoxx team for letting me attend as an ambassador of our JUG.