Donnerstag, 5. Juli 2012

Slides and demo code for my talk at JUG KA available

I just uploaded the (German) slides as well as the example code for yesterday's talk on Lucene and Solr at our local Java User Group.

The demo application contains several subprojects for indexing and searching with Lucene and Solr as well as a simple Dropwizard application that demonstrates some search features. See the README files in the source tree to find out how to run the application.

Freitag, 29. Juni 2012

Dropwizard Encoding Woes

I have been working on an example application for Lucene and Solr for my upcoming talk at the Java User Group Karlsruhe. As a web framework I wanted to try Dropwizard, a lightweight application framework that can expose resources via JAX-RS, provides out-of-the-box monitoring support and can render resource representations using Freemarker. It's really easy to get started; there is a good tutorial as well as a manual.

An example resource might look like this:

import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;

@Path("/example")
@Produces(MediaType.TEXT_HTML)
public class ExampleResource {

    @GET
    public ExampleView illustrate() {
        return new ExampleView("Mot\u00f6rhead");
    }

}

The resource produces HTML using Freemarker, which works once you add the view bundle to the service (a sketch of the service class follows below the template). There is one method that is called when the resource is accessed using GET. Inside the method we create a view object with a message that in this case contains the umlaut 'ö'. The view class returned by the method looks like this:

import com.yammer.dropwizard.views.View;

public class ExampleView extends View {

    private final String message;

    public ExampleView(String message) {
        super("example.fmt");
        this.message = message;
    }

    public String getMessage() {
        return message;
    }
}

It accepts a message as a constructor parameter and passes the template name to the parent class. The view object is now available in a Freemarker template; a simple variant looks like this:

<html>
    <body>
        <h1>${message} rocks!</h1>
    </body>
</html>
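
For completeness: rendering views like this requires the Freemarker view bundle to be registered in the service. Here is a minimal sketch of what that might look like, assuming the 0.4-era com.yammer.dropwizard API where bundles are added in the service constructor; the service and configuration class names are made up for this example:

import com.yammer.dropwizard.Service;
import com.yammer.dropwizard.config.Environment;
import com.yammer.dropwizard.views.ViewBundle;

public class ExampleService extends Service<ExampleConfiguration> {

    public ExampleService() {
        super("example");
        // enables rendering of View subclasses through Freemarker templates
        addBundle(new ViewBundle());
    }

    @Override
    protected void initialize(ExampleConfiguration configuration, Environment environment) {
        environment.addResource(new ExampleResource());
    }

    public static void main(String[] args) throws Exception {
        new ExampleService().run(args);
    }
}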
If I run this on my machine and access it with Firefox, it doesn't work as expected. The umlaut character is broken, something Lemmy surely wouldn't approve of:

Accessing the resource using curl works flawlessly:

curl http://localhost:8080/example
<html>
    <body>
        <h1>Motörhead rocks!</h1>
    </body>
</html>

Why is that? It's Servlet Programming 101: you need to set the character encoding of the response. My Firefox defaults to ISO-8859-1; curl seems to use UTF-8 by default. How can we fix it? By telling the client which encoding we are using, which can be done with the Produces annotation:

@Produces("text/html; charset=utf-8")

So what does this have to do with Dropwizard? Nothing, really; it's a JAX-RS thing. All components in Dropwizard (notably Jetty and Freemarker) use UTF-8 by default.

Mittwoch, 20. Juni 2012

Running and Testing Solr with Gradle

A while ago I blogged on testing Solr with Maven on the synyx blog. In this post I will show you how to set up a similar project with Gradle that can start the Solr webapp and execute tests against your configuration.

Running Solr

Solr runs as a webapp in any Java EE servlet container like Tomcat or Jetty. The index and search configuration resides in a directory commonly referred to as Solr home, which can be outside of the webapp directory. This is also the place where the Lucene index files are created. The location of Solr home can be set using the system property solr.solr.home.

The Solr war file is available in Maven Central. This post describes how to run a war file that is deployed in a Maven repository using Gradle. Let's see what the Gradle build file for running Solr looks like:

import org.gradle.api.plugins.jetty.JettyRunWar

apply plugin: 'java'
apply plugin: 'jetty'

repositories {
    mavenCentral()
}

// custom configuration for running the webapp
configurations {
    solrWebApp
}

dependencies {
    solrWebApp "org.apache.solr:solr:3.6.0@war"
}

// custom task that configures the jetty plugin
task runSolr(type: JettyRunWar) {
    webApp = configurations.solrWebApp.singleFile

    // jetty configuration
    httpPort = 8082
    contextPath = 'solr'
}

// executed before jetty starts
runSolr.doFirst {
    System.setProperty("solr.solr.home", "./solrhome")
}

We create a custom configuration that contains the Solr war file. In the task runSolr we configure the Jetty plugin. To set the Solr home system property we can use the approach described by Sebastian Himberger: we add a code block that is executed before Jetty starts and sets the property using standard Java mechanisms. You can now start Solr using gradle runSolr. You will see some errors regarding multiple versions of slf4j that are very likely caused by this bug.

Testing the Solr configuration

Solr provides some classes that start an embedded instance using your configuration. You can use these classes in any setup as they do not depend on the Gradle Jetty plugin. Starting with Solr 3.2 the test framework is not included in solr-core anymore. This is what the relevant part of the dependency section looks like now:

testCompile "junit:junit:4.10"
testCompile "org.apache.solr:solr-test-framework:3.6.0"

Now you can place a test in src/test/java that either uses the convenience methods provided by SolrTestCaseJ4 or instantiates an EmbeddedSolrServer and executes any SolrJ actions. Both approaches will use your custom config, so you can easily validate that configuration changes don't break existing functionality. An example using the convenience methods:

import org.apache.solr.SolrTestCaseJ4;
import org.apache.solr.client.solrj.SolrServerException;
import org.junit.BeforeClass;
import org.junit.Test;
import java.io.IOException;

public class BasicConfigurationTest extends SolrTestCaseJ4 {

    @BeforeClass
    public static void initCore() throws Exception {
        SolrTestCaseJ4.initCore("solrhome/conf/solrconfig.xml", "solrhome/conf/schema.xml", "solrhome/");
    }

    @Test
    public void noResultInEmptyIndex() throws SolrServerException {
        assertQ("test query on empty index",
                req("text that is not found")
                , "//result[@numFound='0']"
        );
    }

    @Test
    public void pathIsMandatory() throws SolrServerException, IOException {
        assertFailedU(adoc("title", "the title"));
    }

    @Test
    public void simpleDocumentIsIndexedAndFound() throws SolrServerException, IOException {
        assertU(adoc("path", "/tmp/foo", "content", "Some important content."));
        assertU(commit());

        assertQ("added document found",
                req("important")
                , "//result[@numFound='1']"
        );
    }

}

We extend the class SolrTestCaseJ4, which is responsible for creating the core and instantiating the runtime using the paths we provide via the method initCore(). Using the available assert methods you can execute queries and validate the results using XPath expressions.

An example that instantiates a SolrServer might look like this:

import org.apache.solr.SolrTestCaseJ4;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.common.params.SolrParams;
import org.junit.After;
import org.junit.Before;
import org.junit.BeforeClass;
import org.junit.Test;

import java.io.IOException;

public class ServerBasedTalkTest extends SolrTestCaseJ4 {

    private EmbeddedSolrServer server;

    @BeforeClass
    public static void initCore() throws Exception {
        SolrTestCaseJ4.initCore("solr/conf/solrconfig.xml", "solr/conf/schema.xml");
    }

    @Before
    public void initServer() {
        server = new EmbeddedSolrServer(h.getCoreContainer(), h.getCore().getName());
    }

    @Test
    public void queryOnEmptyIndexNoResults() throws SolrServerException {
        QueryResponse response = server.query(new SolrQuery("text that is not found"));
        assertTrue(response.getResults().isEmpty());
    }

    @Test
    public void singleDocumentIsFound() throws IOException, SolrServerException {
        SolrInputDocument document = new SolrInputDocument();
        document.addField("path", "/tmp/foo");
        document.addField("content", "Mein Hut der hat 4 Ecken");

        server.add(document);
        server.commit();

        SolrParams params = new SolrQuery("ecke");
        QueryResponse response = server.query(params);
        assertEquals(1L, response.getResults().getNumFound());
        assertEquals("/tmp/foo", response.getResults().get(0).get("path"));
    }

    @After
    public void clearIndex() {
        super.clearIndex();
    }
}

The tests can now be executed using gradle test.

Testing your Solr configuration is important, as changes in one place can easily lead to side effects in other search functionality. I recommend adding tests even for basic functionality and evolving them with your project.

Samstag, 16. Juni 2012

Reading term values for fields from a Lucene Index

Sometimes when using Lucene you might want to retrieve all term values for a given field. Think of categories that you want to display as search links or in a filtering dropdown box. Indexing might look something like this:

IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36));
IndexWriter writer = new IndexWriter(directory, config);

Document doc = new Document();

doc.add(new Field("Category", "Category1", Field.Store.NO, Field.Index.NOT_ANALYZED));
doc.add(new Field("Category", "Category2", Field.Store.NO, Field.Index.NOT_ANALYZED));
doc.add(new Field("Author", "Florian Hopf", Field.Store.NO, Field.Index.NOT_ANALYZED));
writer.addDocument(doc);

doc.add(new Field("Category", "Category3", Field.Store.NO, Field.Index.NOT_ANALYZED));
doc.add(new Field("Category", "Category2", Field.Store.NO, Field.Index.NOT_ANALYZED));
doc.add(new Field("Author", "Theo Tester", Field.Store.NO, Field.Index.NOT_ANALYZED));
writer.addDocument(doc);

writer.close();

We are adding two documents, one that is assigned Category1 and Category2 and one that is assigned Category2 and Category3. Note that we add both fields unanalyzed, so the Strings are added to the index as they are. Lucene's index looks something like this afterwards:

Field       Term            Documents
Author      Florian Hopf    1
            Theo Tester     2
Category    Category1       1
            Category2       1, 2
            Category3       2

The entries are sorted alphabetically by field name first and then by term value. You can access the values using the IndexReader's terms() method, which returns a TermEnum. You can instruct the IndexReader to start with a certain term so you can jump directly to the category without having to iterate over all values. But before we do this, let's look at how we are used to accessing Enumeration values in Java:

Enumeration en = ...;
while(en.hasMoreElements()) {
    Object obj = en.nextElement();
    ...
}

In a while loop we check if there is another element and retrieve it inside the loop. As this pattern is very common, when iterating the terms with Lucene you might end up with something like this (note that the examples here are missing the stop condition; if there are more fields, their terms will also be iterated. A sketch of the stop condition follows at the end of this post):

TermEnum terms = reader.terms(new Term("Category"));
// this code is broken, don't use
while(terms.next()) {
    Term term = terms.term();
    System.out.println(term.text());
}

The next() method returns a boolean indicating whether there are more elements and advances to the next element. The term() method can then be used to retrieve the Term. But this doesn't work as expected: the code only finds Category2 and Category3 but skips Category1. Why is that? Lucene's TermEnum works differently from the Java Enumerations we are used to. When the TermEnum is returned it already points to the first element, so by calling next() first we skip that element.

This snippet instead works correctly using a for loop:

TermEnum terms = reader.terms(new Term("Category"));
for(Term term = terms.term(); term != null; terms.next(), term = terms.term()) {
    System.out.println(term.text());
}

Or you can use a do while loop with a check for the first element:

TermEnum terms = reader.terms(new Term("Category"));
if (terms.term() != null) {
    do {
        Term term = terms.term();
        System.out.println(term.text());
    } while(terms.next());
}
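
For completeness, here is a minimal sketch of the stop condition mentioned above: once the enumeration moves past the Category field we stop iterating. The field check is not part of the original examples, it is just one way to do it:

TermEnum terms = reader.terms(new Term("Category"));
try {
    if (terms.term() != null) {
        do {
            Term term = terms.term();
            // the enumeration continues with the terms of the next field, so stop there
            if (!"Category".equals(term.field())) {
                break;
            }
            System.out.println(term.text());
        } while (terms.next());
    }
} finally {
    terms.close();
}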

You can't really blame Lucene for this as the methods are aptly named. It's our habits that lead to minor errors like this.

Mittwoch, 6. Juni 2012

Berlin Buzzwords 2012

Berlin Buzzwords is an annual conference on search, store and scale technology. I had heard good things about it before and finally got convinced to go this year. The conference itself lasts two days, but there are additional events before and afterwards, so if you like you can spend a whole week.

The Barcamp

As I had to travel on Sunday anyway, I took an earlier train to attend the barcamp in the early evening. It started with a short introduction of the concept and the scheduling. Participants could suggest topics they would either be willing to introduce themselves or were simply interested in. There were three rooms prepared, one larger and two smaller ones.

Among others I attended sessions on HBase, designing hybrid applications, Apache Tika and Apache Jackrabbit Oak.

HBase is a distributed database built on top of the Hadoop filesystem. It seems to be used more often than I would have expected. It was interesting to hear about other people's problems and solutions.

The next session on hybrid relational and NoSQL applications stayed at a rather high level. I liked one participant's remark that Solr, the underdog of NoSQL, is often the first application where people are OK with giving up some guarantees regarding their data. Adding NoSQL should be exactly like this.

I only recently started to use Tika directly, so it was really interesting to see where the project is heading. I was surprised to hear that there is now also a TikaServer that can do things similar to what I described for Solr. That's something I want to try in action.

Jackrabbit Oak is a next-generation content repository that is mostly driven by Adobe's Day team. Some of the ideas sound really interesting, but I got the feeling that it may still take some time until it can really be used. Jukka Zitting also gave a lightning talk on this topic at the conference; the slides are available here.

The atmosphere in the sessions was really relaxed, so even though I expected to only listen, I took the chance to participate and ask some questions. This is probably the part that makes a barcamp as effective as it is: as you are constantly participating, you stay really concentrated on the topic.

Day 1

The first day started with a great keynote by Leslie Hawthorn on building and maintaining communities. She compared a lot of aspects of community work with gardening and introduced OpenMRS, a successful project building a medical record platform. Though I am currently not actively involved in an open source project, I could relate to a lot of the situations she described. All in all, an inspiring start to the main conference.

Next I attended a talk on building hybrid applications with MongoDB. Nothing new for me, but I am glad that a lot of people now recommend splitting monolithic applications into smaller services. This is also a way to experiment with different languages and techniques without having to migrate large parts of an application.

A JCR view of the world provided some examples of how to model different structures using a content tree. Though really introductory, it was interesting to see what kinds of applications can be built using a content repository. I also liked the speaker's attitude: the presentation was delivered using Apache Sling, which uses JCR under the hood.

Probably the highlight of the first day was the talk by Grant Ingersoll on Large Scale Search, Discovery and Analytics. He introduced all the parts that make up larger search systems and showed the open source tools he uses. To increase the relevance of the search results you have to integrate solutions that adapt to the behaviour of your users. That's probably one of my big takeaways from the whole conference: always collect data on your users' searches so it is available when you want to tune relevance, either manually or through some learning techniques. The slides of the talk are worth looking at.

The rest of the day I attended several talks on the internals of Lucene. Hardcore stuff; I would be lying if I said I understood everything, but it was interesting nevertheless. I am glad that some really smart people are making sure that Lucene stays as fast and feature-rich as it is.

The day ended with interesting discussions and some beer at the Buzz Party and BBQ.

Day 2

The first talk of the second day on Smart Autocompl... by Anne Veling was fantastic. Anne demonstrated a rather simple technique for semantic analysis of search queries, used for specialized autocompletion in the largest travel information system in the Netherlands. The query gets tokenized, and then each field of the index (e.g. street or city) is queried for each of the tokens. This way you can already guess which fields might be good matches.

Another talk introduced Hydra, a scalable tool for preprocessing documents. It stores the documents as well as mapping data in a MongoDB instance, and you can parallelize the processing steps. The concept sounds really interesting; I hope I can find time to have a closer look.

In the afternoon I attended several talks on Elasticsearch, the scalable search server. Interestingly a lot of people seem to use it more as a storage engine than for searching.

One of the tracks was cancelled, so Ted Dunning introduced new stuff in Mahout instead. He's a really funny speaker, and though I am not deep into machine learning, I was glad to hear that you are allowed to use and even contribute to Mahout even if you don't have a PhD.

In the last track of the day, Alex Pinkin showed 10 problems and solutions that you might encounter when building a large app using Solr. Quite a bit of useful advice.

The location

The event took place at Urania, a smaller conference center and theatre. Mostly it was well suited, but some of the talks were so full that you either had to sit on the floor or weren't able to enter the room at all. I understand that it is difficult to predict how many people will attend a certain talk, but some talks probably should have been scheduled in different rooms.

The food was really good, and though at first it looked like the distribution would be a bottleneck, it worked pretty well.

The format

This year Berlin Buzzwords had a rather unusual format. Most of the talks were only 20 minutes long, with some exceptions at 40 minutes. I have mixed feelings about this: on the one hand it was great to have a lot of different topics; on the other hand some of the concepts definitely would have needed more time to fully explain and grasp. Respect to all the speakers who had to decide what to cover in such a short timeframe.

Berlin Buzzwords is a fantastic conference and I will definitely go there again.

Freitag, 11. Mai 2012

Content Extraction with Apache Tika

Sometimes you need access to the content of documents, be it to analyze it, store it in a database or index it for searching. Different formats like Word documents, PDFs and HTML pages need different treatment. Apache Tika is a project that combines several open source projects for reading content from a multitude of file formats and makes the textual content as well as some metadata available through a uniform API. I will show two ways to leverage the power of Tika in your projects.

Accessing Tika programmatically

First, Tika can of course be used as a library. Surprisingly, the user docs on the website explain a lot of the functionality you might be interested in when writing custom parsers for Tika, but don't directly show how to use it as a library.

I am using Maven again, so I add a dependency for the most recent version:

<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.1</version>
    <type>jar</type>
</dependency>

tika-parsers also pulls in all the other projects that are used, so be patient while Maven fetches all the transitive dependencies.

Let's see what some test code for extracting data from a PDF document called slides.pdf, which is available on the classpath, looks like:

Parser parser = new PDFParser();
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
InputStream content = getClass().getResourceAsStream("/slides.pdf");
parser.parse(content, handler, metadata, new ParseContext());
assertEquals("Solr Vortrag", metadata.get(Metadata.TITLE));
assertTrue(handler.toString().contains("Lucene"));

First, we need to instantiate a Parser that is capable of reading the format, in this case PDFParser, which uses PDFBox for extracting the content. The parse method expects some parameters that configure the parsing process as well as an InputStream that contains the data of the document. After parsing is finished, Metadata will contain all the metadata for the document, e.g. the title or the author.

Tika uses XHTML as the internal representation for all parsed content. This XHTML document can be processed by a SAX ContentHandler. The custom implementation BodyContentHandler returns all the text in the body area, which is the main content. The last parameter, ParseContext, can be used to configure the underlying parser instance.

The Metadata class consists of a Map-like structure with some common keys like the title as well as optional format-specific information. You can look at the contents with a simple loop:

for (String name: metadata.names()) { 
    System.out.println(name + ": " + metadata.get(name));
}

This will produce an output similar to this:

xmpTPg:NPages: 17
Creation-Date: 2010-11-20T09:47:28Z
title: Solr Vortrag
created: Sat Nov 20 10:47:28 CET 2010
producer: OpenOffice.org 2.4
Content-Type: application/pdf
creator: Impress

The textual content of the document can be retrieved by calling the toString() method on the BodyContentHandler.

This is all fine if you know that you only want to retrieve data from PDF documents. But you probably don't want to introduce a huge switch block to determine which parser to use depending on the file name or some other information. Fortunately Tika also provides an AutoDetectParser that employs different strategies for determining the content type of the document. All the code above stays the same, you just use a different parser:

Parser parser = new AutoDetectParser();

This way you don't have to know what kind of document you are currently processing; Tika will provide you with the metadata as well as the content. You can pass in additional hints for the parser, e.g. the file name or the content type, by setting them in the Metadata object.
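
Here is a minimal sketch of passing such hints, using the RESOURCE_NAME_KEY and CONTENT_TYPE constants of the Metadata class; the rest mirrors the earlier example:

Parser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
// hints for the type detection: the original file name and the declared content type
metadata.set(Metadata.RESOURCE_NAME_KEY, "slides.pdf");
metadata.set(Metadata.CONTENT_TYPE, "application/pdf");
InputStream content = getClass().getResourceAsStream("/slides.pdf");
parser.parse(content, handler, metadata, new ParseContext());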

Extracting content using Solr

If you are using the search server Solr you can also leverage its REST API for extracting the content. The default configuration has a request handler configured for /update/extract that you can send a document to, and it will return the content it extracted using Tika. You just need to add the necessary libraries for the extraction. I am still using Maven, so I have to add additional dependencies:

<dependency>
    <groupId>org.apache.solr</groupId>
    <artifactId>solr</artifactId>
    <version>3.6.0</version>
    <type>war</type>
</dependency>
<dependency>
    <groupId>org.apache.solr</groupId>
    <artifactId>solr-cell</artifactId>
    <version>3.6.0</version>
    <type>jar</type>
</dependency>

This will include all of the Tika dependencies as well as all necessary third party libraries.

Solr Cell, the request handler, is normally used to index binary files directly, but you can also use it just for extraction. To transfer the content you can use any tool that speaks HTTP; with curl it might look like this:

curl -F "file=@slides.pdf" "localhost:8983/solr/update/extract?extractOnly=true&extractFormat=text"

By setting the parameter extractOnly to true we tell Solr that we don't want to index the content but want it extracted into the response. The result will be the standard Solr XML format, containing the body content as well as the metadata.

You can also use the Java client library SolrJ for doing the same:

ContentStreamUpdateRequest request = new ContentStreamUpdateRequest("/update/extract");
request.addFile(new File("slides.pdf"));
request.setParam("extractOnly", "true");
request.setParam("extractFormat", "text");
NamedList<Object> result = server.request(request);

The NamedList will contain entries for the body content as well as another NamedList with the metadata.
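
A minimal sketch of reading those values, assuming the key pattern visible in the response shown below (the extracted text is stored under the name of the uploaded file, the metadata under "<name>_metadata"):

String text = (String) result.get("slides.pdf");
NamedList<?> metadata = (NamedList<?>) result.get("slides.pdf_metadata");
// the metadata entries are multi-valued, so each value is a collection
System.out.println(text);
System.out.println(metadata.get("title"));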


Update

Robert has asked in the comments what the response looks like. Solr uses configurable response writers for marshalling the message. The default format is XML, but it can be changed by passing the wt parameter with the request. A simplified standard response looks like this:

curl -F "file=@slides.pdf" "localhost:8983/solr/update/extract?extractOnly=true&extractFormat=text"
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">1952</int></lst><str name="slides.pdf">

Features

HTTP­Schnittstelle
XML­basierte Konfiguration
Facettierung
Sammlung nützlicher Lucene­Module/Dismax

Features

HTTP­Schnittstelle
XML­basierte Konfiguration
Facettierung
Sammlung nützlicher Lucene­Module/Dismax
Java­Client SolrJ

[... more content ...]

</str><lst name="slides.pdf_metadata"><arr name="xmpTPg:NPages"><str>17</str></arr><arr name="Creation-Date"><str>2010-11-20T09:47:28Z</str></arr><arr name="title"><str>Solr Vortrag</str></arr><arr name="stream_source_info"><str>file</str></arr><arr name="created"><str>Sat Nov 20 10:47:28 CET 2010</str></arr><arr name="stream_content_type"><str>application/octet-stream</str></arr><arr name="stream_size"><str>425327</str></arr><arr name="producer"><str>OpenOffice.org 2.4</str></arr><arr name="stream_name"><str>slides.pdf</str></arr><arr name="Content-Type"><str>application/pdf</str></arr><arr name="creator"><str>Impress</str></arr></lst>
</response>

The response contains the response header (including how long the processing took), the content of the file, as well as the metadata extracted from the document.


If you set the wt parameter to json, the response is contained in a JSON structure:


curl -F "file=@slides.pdf" "localhost:8983/solr/update/extract?extractOnly=true&extractFormat=text&wt=json"             
{"responseHeader":{"status":0,"QTime":217},"slides.pdf":"\n\n\n\n\n\n\n\n\n\n\n\nSolr Vortrag\n\n   \n\nEinfach mehr finden mit\n\nFlorian Hopf\n29.09.2010\n\n\n   \n\nSolr?\n\n\n   \n\nSolr?\n\nServer­ization of Lucene\n\n\n   \n\nApache Lucene?\n\nSearch engine library\n\n\n   \n\nApache Lucene?\n\nSearch engine library\nTextbasierter Index\n\n\n   \n\nApache Lucene?\n\nSearch engine library\nTextbasierter Index\nText Analyzer\n\n\n   \n\nApache Lucene?\n\nSearch engine library\nTextbasierter Index\nText Analyzer\nQuery Syntax \n\n\n   \n\nApache Lucene?\n\nSearch engine library\nTextbasierter Index\nText Analyzer\nQuery Syntax \nScoring\n\n\n   \n\nFeatures\n\nHTTP­Schnittstelle\n\n\n   \n\nArchitektur\n\nClient SolrWebapp Lucene\nhttp\n\nKommunikation über XML, JSON, JavaBin, Ruby, ...\n\n\n   \n\nFeatures\n\nHTTP­Schnittstelle\nXML­basierte Konfiguration\n\n\n   \n\nFeatures\n\nHTTP­Schnittstelle\nXML­basierte Konfiguration\nFacettierung\n\n\n   \n\nFeatures\n\nHTTP­Schnittstelle\nXML­basierte Konfiguration\nFacettierung\nSammlung nützlicher Lucene­Module/Dismax\n\n\n   \n\nFeatures\n\nHTTP­Schnittstelle\nXML­basierte Konfiguration\nFacettierung\nSammlung nützlicher Lucene­Module/Dismax\nJava­Client SolrJ\n\n\n   \n\nDemo\n\n\n   \n\nWas noch?\nAdmin­Interface\nCaching\nSkalierung\nSpellchecker\nMore­Like­This\nData Import Handler\nSolrCell\n\n\n   \n\nRessourcen\nhttp://lucene.apache.org/solr/\n\n\n\n","slides.pdf_metadata":["xmpTPg:NPages",["17"],"Creation-Date",["2010-11-20T09:47:28Z"],"title",["Solr Vortrag"],"stream_source_info",["file"],"created",["Sat Nov 20 10:47:28 CET 2010"],"stream_content_type",["application/octet-stream"],"stream_size",["425327"],"producer",["OpenOffice.org 2.4"],"stream_name",["slides.pdf"],"Content-Type",["application/pdf"],"creator",["Impress"]]}

There are quite a few response writers available for different languages, e.g. for Ruby. You can have a look at them at the bottom of this page: http://wiki.apache.org/solr/QueryResponseWriter

Montag, 7. Mai 2012

Importing Atom feeds in Solr using the Data Import Handler

I am working on a search solution that makes some of the content I am producing available through one search interface. One of the content stores is the blog you are reading right now, which among other options makes the content available here using Atom.

Solr, my search server of choice, provides the Data Import Handler that can be used to import data on a regular basis from sources like databases via JDBC or remote XML sources, like Atom.

The Data Import Handler used to be a core part of Solr, but starting from 3.1 it is shipped as a separate jar and is not included in the standard war anymore. I am using Maven with overlays for development, so I have to add a dependency for it:

<dependencies>
  <dependency>
    <groupId>org.apache.solr</groupId>
    <artifactId>solr</artifactId>
    <version>3.6.0</version>
    <type>war</type>
  </dependency>
  <dependency>
    <groupId>org.apache.solr</groupId>
    <artifactId>solr-dataimporthandler</artifactId>
    <version>3.6.0</version>
    <type>jar</type>
  </dependency>
</dependencies>

To enable the Data Import Handler you have to add a request handler to your solrconfig.xml. Request handlers are registered for a certain URL and, as the name suggests, are responsible for handling incoming requests:

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>

The file data-config.xml that is referenced here contains the mapping logic as well as the endpoint to access:

<?xml version="1.0" encoding="UTF-8" ?>
<dataConfig>
    <dataSource type="URLDataSource" encoding="UTF-8" connectionTimeout="5000" readTimeout="10000"/>
    <document>
        <entity name="blog"
                pk="url"
                url="http://fhopf.blogspot.com/feeds/posts/default?max-results=100"
                processor="XPathEntityProcessor"
                forEach="/feed/entry" transformer="DateFormatTransformer,HTMLStripTransformer,TemplateTransformer">
            <field column="title" xpath="/feed/entry/title"/>
            <field column="url" xpath="/feed/entry/link[@rel='alternate']/@href"/>
            <!-- 2012-03-07T21:35:51.229-08:00 -->
            <field column="last_modified" xpath="/feed/entry/updated" 
                dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss.SSS" locale="en"/>
            <field column="text" xpath="/feed/entry/content" stripHTML="true"/>
            <field column="category" xpath="/feed/entry/category/@term"/>
            <field column="type" template="blog"/> 
        </entity>
    </document>
</dataConfig>

First we configure which data source to use. This is where you would use a different implementation if you were fetching documents from a database.

The document element describes the fields that will be stored in the index. The attributes of the entity element determine where and how to fetch the data, most importantly the url and the processor. forEach contains an XPath expression identifying the elements we'd like to loop over. The transformer attribute specifies some classes that are available when mapping the remote XML to the Solr fields.

The field elements contain the mapping between the Atom document and the Solr index fields. The column attribute determines the name of the index field; xpath determines the node to use in the remote XML document. You can use advanced XPath constructs like selecting an attribute of an element based on the value of another attribute. E.g. /feed/entry/link[@rel='alternate']/@href points to the link element that references an alternative representation of a blog post entry:

<feed ...> 
  ...
  <entry> 
    ...
    <link rel='alternate' type='text/html' href='http://fhopf.blogspot.com/2012/03/testing-akka-actors-from-java.html' title='Testing Akka actors from Java'/>
    ...
  </entry>
...
</feed>

For the column last_modified we transform the remote date format into the internal Solr representation using the DateFormatTransformer. I am not sure yet if this is the correct solution, as it seems to me I'm losing the timezone information. For the text field we first remove all HTML elements contained in the blog post using the HTMLStripTransformer. Finally, the type column contains a hardcoded value that is set using the TemplateTransformer.

To have everything in one place, let's see what the schema for our index looks like:

<field name="url" type="string" indexed="true" stored="true" required="true"/>
<field name="title" type="text_general" indexed="true" stored="true"/>
<field name="category" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="last_modified" type="date" indexed="true" stored="true"/>
<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
<field name="type" type="string" indexed="true" stored="false"/>

Finally, how can you trigger the data import? There is an option described in the Solr wiki, but a simpler solution might be enough for you. I am using a shell script that is triggered by a cron job. These are its contents:

#!/bin/bash
curl localhost:8983/solr/dataimport?command=full-import

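While an import is running, you should be able to check its progress using the status command of the Data Import Handler, for example with curl:

curl localhost:8983/solr/dataimport?command=status
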
The Data Import Handler is really easy to set up, and you can use it to import quite a lot of data sources into your index. If you need more advanced crawling features, you might want to have a look at Apache ManifoldCF, a connector framework for plugging content repositories into search engines like Apache Solr.