Thursday, February 27, 2014

Introduction to the ODF Toolkit

Microsoft Office has been the dominant office suite, and unfortunately it still is. For a long time, not only were the programs closed, but so was the file format.

Open Document

Nevertheless there are open alternatives available, most notably LibreOffice and Apache OpenOffice. In 2005 the OASIS consortium standardized Open Document, an open alternative to the proprietary world of Microsoft. Open Document is heavily influenced by the OpenOffice.org file format but is supported by multiple office suites and viewers.

Open Document files are zip files that contain some XML documents. You can go ahead and unzip any documents you might have:

unzip -l aufwaende-12.ods 
Archive:  aufwaende-12.ods
  Length      Date    Time    Name
---------  ---------- -----   ----
       46  2012-12-31 15:16   mimetype
      815  2012-12-31 15:16   meta.xml
     8680  2012-12-31 15:16   settings.xml
   171642  2012-12-31 15:16   content.xml
     3796  2012-12-31 15:16   Thumbnails/thumbnail.png
        0  2012-12-31 15:16   Configurations2/images/Bitmaps/
        0  2012-12-31 15:16   Configurations2/popupmenu/
        0  2012-12-31 15:16   Configurations2/toolpanel/
        0  2012-12-31 15:16   Configurations2/statusbar/
        0  2012-12-31 15:16   Configurations2/progressbar/
        0  2012-12-31 15:16   Configurations2/toolbar/
        0  2012-12-31 15:16   Configurations2/menubar/
        0  2012-12-31 15:16   Configurations2/accelerator/current.xml
        0  2012-12-31 15:16   Configurations2/floater/
    22349  2012-12-31 15:16   styles.xml
      993  2012-12-31 15:16   META-INF/manifest.xml
---------                     -------
   208321                     16 files

The mimetype file declares what kind of document it is (in this case application/vnd.oasis.opendocument.spreadsheet), and META-INF/manifest.xml lists the files in the archive. The most important file is content.xml, which contains the body of the document.
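
For example, you can peek at the document type without unpacking the whole archive; unzip's -p option prints a single entry to standard output:

unzip -p aufwaende-12.ods mimetype
application/vnd.oasis.opendocument.spreadsheet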

Server Side Processing

Though there are quite a few viewers and editors available for Open Document, the situation on the server side used to be different. For processing Microsoft Office files there is the Java library Apache POI, which provides a lot of functionality to read and manipulate Microsoft Office files. But if you wanted to process Open Document files, nearly your only option was to install OpenOffice.org on the server and talk to it by means of its UNO API. Not exactly an easy thing to do.

ODF Toolkit

Fortunately there is light at the end of the tunnel: the ODF Toolkit project, currently incubating at Apache, provides lightweight access to files in the Open Document format from Java. As the name implies it's a toolkit, consisting of multiple projects.

The heart of it is the schema generator that ingests the Open Document specification, which is available as a RelaxNG schema. It provides a template-based facility to generate files from the ODF specification. Currently it only generates Java classes, but it can also be used to create other artifacts (think of documentation or accessors for other programming languages).

The next layer of the toolkit is ODFDOM. It provides templates that generate classes for DOM access to the elements and attributes of ODF documents. Additionally it provides facilities like packaging and document encryption.

For example, you can list the file paths of an ODF document using the OdfPackage class:

OdfPackage pkg = OdfPackage.loadPackage("aufwaende-12.ods");
Set<String> filePaths = pkg.getFilePaths(); // the paths of all entries in the package
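
Access to the XML content itself works through the generated DOM classes. A minimal sketch (assuming the OdfDocument class from the incubating ODFDOM releases) might look like this:

OdfDocument doc = OdfDocument.loadDocument("aufwaende-12.ods");
OdfFileDom contentDom = doc.getContentDom(); // DOM of content.xml, backed by generated element classes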

If you are familiar with the Open Document spec, ODFDOM will be the only library you need. But if you are like most of us and don't know all the elements and attributes by heart, there is another project for you: the Simple API. It provides easy access to a lot of the features you might expect from a library like this: you can deal with higher-level abstractions like paragraphs for text or rows and cells in the spreadsheet world, or search for and replace text.

This code snippet creates a spreadsheet, adds some cells to it and saves it:

// create a new spreadsheet document with one empty sheet
SpreadsheetDocument doc = SpreadsheetDocument.newSpreadsheetDocument();
Table sheet = doc.getSheetByIndex(0);
// getCellByPosition expects the column index first, then the row index
sheet.getCellByPosition(0, 0).setStringValue("Betrag");
sheet.getCellByPosition(1, 0).setDoubleValue(23.0);
doc.save(File.createTempFile("odf", ".ods"));

Code

If you are interested in seeing more code using the ODF Toolkit, you can have a look at the cookbook, which contains a lot of useful code snippets for the Simple API. Additionally you should keep an eye on this blog for the second part of the series, where we will look at an application that extracts data from spreadsheets.
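
To give you a first idea of what such an extraction can look like, here is a minimal sketch using the Simple API (assuming a spreadsheet file like the one listed above):

// open an existing spreadsheet and print the contents of its second column
SpreadsheetDocument doc = SpreadsheetDocument.loadDocument("aufwaende-12.ods");
Table sheet = doc.getSheetByIndex(0);
for (int row = 0; row < sheet.getRowCount(); row++) {
    System.out.println(sheet.getCellByPosition(1, row).getDisplayText());
}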

Thursday, February 20, 2014

Book Review: Search-Based Applications

Search is moving away from the simple keyword search box with a result list of indexed web pages. Features like faceting and aggregations offer completely new possibilities for data discovery, making search engines relevant for business applications as well.

The Book

Search-Based Applications by Gregory Grefenstette and Laura Wilber describes the changes of the last years in which search engines, traditionally used for indexing web pages and databases, are increasingly being used for business applications.

The short book introduces the motivation for using search engines in business applications, mostly driven by exponential data growth and real-time needs. Several chapters describe what has changed in the database and search engine world, focusing on one aspect per chapter. On the search side it shows that advanced features like faceted search or natural language processing techniques can be valuable for offering real-time access to data that traditionally would have been put into a data warehouse. On the database side it shows that, with the advent of non-relational stores, some databases are moving in the direction of the flexible schema, scalability and specialized access patterns of search engines.

Some common themes in the book for using search-based applications are the aggregation of content from different data sources and the reduction of load on databases by offloading traffic to read-optimized search engines. Mixing content from different data sources can be useful to provide flexible access to multiple legacy systems, increasing the usability of the applications. The document model of search engines and the possibility of incremental indexing lead to applications that provide near-real-time access to data and can be adjusted more quickly to changing needs.

Though most of the book is product-agnostic, one chapter lists some platforms that are available for building search-based applications, mainly focusing on big commercial players like Exalead (the company of the authors), Endeca and Autonomy. The book closes with three case studies that show different aspects of building search-based applications.

Even though it contains some statements that I don't fully agree with, or that even contradict each other, it is a very good book for understanding the reasoning behind building search-based applications. I got some new ideas for applications of search engines, and this alone makes it a worthwhile read.

Open Source Options for Search-Based Applications

Though the book lists quite a few SBA platforms and related technologies, there is not a single mention of Apache Solr, which is quite surprising as it offers a lot of the features the authors define for SBAs. Solr has the Data Import Handler to connect external data sources, semantic technologies (though probably not as rich as some of the commercial options), and complementary open source projects like carrot² for search result clustering or ManifoldCF as a connector framework.

When the book talks about replacing parts of data warehouse applications with SBAs for real-time analytics, this of course reminds me of use cases for Elasticsearch. Kibana or custom dashboards can make the wealth of information contained in the index easily accessible.

Thursday, February 13, 2014

Search Meetups in Germany

I enjoy going to user group events, not only because of the talks that are an integral part of most meetups but also to meet and chat with like-minded people.

Fortunately there are some user groups in Germany that are focused on search technology, a topic I am especially interested in. This post lists those I know of; if there is one I missed, let me know in the comments. For reasons of suspense I am listing the groups from east to west.

Elasticsearch User Group Berlin

Berlin has the luxury of a user group dedicated to Elasticsearch only. The group is organized by people from Trifork, who are seasoned event organizers. The group has been surprisingly successful, with regular meetings and up to 50 participants. This is probably due to the high startup density in Berlin; the ease of use and scalability of Elasticsearch make it very popular among startups.

Search Meetup Munich

Search Meetup Munich is a very active group organized by Alexander Reelsen of Elasticsearch. There are bimonthly meetings at alternating companies with 2 to 3 talks per event. Topics are open source search in general with a strong emphasis on Lucene, Solr and Elasticsearch. Most speakers will give their talk in English if there are people in the audience who don't speak German. The number of participants ranges from 20 to 40 people. I am surprised by the vibrant community in Munich, with a lot of startups doing interesting things with search. Though it is quite a way from Karlsruhe to Munich, I try to attend the meetings as often as I can.

Solr Lucene User Group Deutschland e.V.

Though the name implies it's a national group, Solr Lucene User Group Deutschland e.V. is located in Augsburg. It seems to be mainly organized by members of SHI GmbH, a prominent Lucidworks and Elasticsearch partner. The meetup page is rather quiet so far, with one event last year that had one participant.

Search Meetup Frankfurt

The first search meetup I attended, with around 10 participants, a talk on the indexing pipeline of the Solr-based product search solution Searchperience, and some discussions. There were quite a few people with a non-Java background doing PHP web development. Unfortunately the 2012 event I attended seems to have been the last one that happened. I don't take that personally.

Search Meetup Karlsruhe

Last but not least: as I probably can't travel to Munich all the time and would like to have some exchange with locals, I just started Search Meetup Karlsruhe together with Exensio, long-time Solr users and Elasticsearch partners. I don't expect it to become as huge as Munich or Berlin, but I hope we can start some interesting discussions.

We just scheduled our first meeting with two talks on Linked Data Search and the difference between building applications based on databases vs. search engines. If you are in the area and interested in search you should join us.

elasticsearch.Stuttgart (Update 16.02.2014)

Just a day after publishing this post another Elasticsearch meetup was announced, this time in Stuttgart. The first event is scheduled for March 25 with an Elasticsearch 1.0 release party including a talk by Alexander Reelsen. If this didn't clash with the JavaLand conference I would definitely go there, but I hope there will be more events in the future that I can attend.

Friday, February 7, 2014

Elasticsearch is Distributed by Default

One of the big advantages Elasticsearch has over Solr is that it is really easy to get started with. You can download it, start it, and index and search immediately. Schema discovery and the JSON-based REST API make it a very beginner-friendly tool.

Another aspect: Elasticsearch is distributed by default. You can add nodes that will automatically be discovered, and your index can be distributed across several nodes.

The distributed nature is great to get started with but you need to be aware that there are some consequences. Distribution comes with a cost. In this post I will show you how relevancy of search results might be affected by sharding in Elasticsearch.

Relevancy

As Elasticsearch is based on Lucene, it also uses Lucene's relevancy algorithm by default, TF/IDF. The term frequency (how often a term occurs in a document) and the inverse document frequency (how many of the documents in the index contain the term) are important parts of the relevancy function. You can see the details of the default formula in the Lucene API docs, but for this post it is sufficient to know that the more often a term occurs in a document, the more relevant the document is considered. Terms that occur in many documents of the index are considered less relevant.
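
As a rough sketch, the two factors in Lucene's DefaultSimilarity boil down to the following (this leaves out field norms, boosts and the query normalization):

// simplified sketch of Lucene's DefaultSimilarity factors
float tf(float freq) { return (float) Math.sqrt(freq); }
float idf(long docFreq, long numDocs) { return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0); }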

A Problematic Example

Let's see the problem in action. We start with a fresh Elasticsearch instance and index some test documents. The documents consist of a single field that contains the same text:

curl -XPOST http://localhost:9200/testindex/doc/0 -d '{ "title" : "Mein Hut der hat vier Ecken, 4 Ecken hat mein Hut" }'
{"ok":true,"_index":"testindex","_type":"doc","_id":"0","_version":1}
curl -XPOST http://localhost:9200/testindex/doc/1 -d '{ "title" : "Mein Hut der hat vier Ecken, 4 Ecken hat mein Hut" }'
{"ok":true,"_index":"testindex","_type":"doc","_id":"1","_version":1}
curl -XPOST http://localhost:9200/testindex/doc/2 -d '{ "title" : "Mein Hut der hat vier Ecken, 4 Ecken hat mein Hut" }'
{"ok":true,"_index":"testindex","_type":"doc","_id":"2","_version":1}
curl -XPOST http://localhost:9200/testindex/doc/3 -d '{ "title" : "Mein Hut der hat vier Ecken, 4 Ecken hat mein Hut" }'
{"ok":true,"_index":"testindex","_type":"doc","_id":"3","_version":1}
curl -XPOST http://localhost:9200/testindex/doc/4 -d '{ "title" : "Mein Hut der hat vier Ecken, 4 Ecken hat mein Hut" }'
{"ok":true,"_index":"testindex","_type":"doc","_id":"4","_version":1}

When we search for those documents by text, they are of course returned correctly.

curl -XGET "http://localhost:9200/testindex/doc/_search?q=title:hut&pretty=true"
{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 5,
    "max_score" : 0.10848885,
    "hits" : [ {
      "_index" : "testindex",
      "_type" : "doc",
      "_id" : "4",
      "_score" : 0.10848885, "_source" : { "title" : "Mein Hut der hat vier Ecken, 4 Ecken hat mein Hut" }
    }, {
      "_index" : "testindex",
      "_type" : "doc",
      "_id" : "0",
      "_score" : 0.10848885, "_source" : { "title" : "Mein Hut der hat vier Ecken, 4 Ecken hat mein Hut" }
    }, {
      "_index" : "testindex",
      "_type" : "doc",
      "_id" : "1",
      "_score" : 0.10848885, "_source" : { "title" : "Mein Hut der hat vier Ecken, 4 Ecken hat mein Hut" }
    }, {
      "_index" : "testindex",
      "_type" : "doc",
      "_id" : "2",
      "_score" : 0.10848885, "_source" : { "title" : "Mein Hut der hat vier Ecken, 4 Ecken hat mein Hut" }
    }, {
      "_index" : "testindex",
      "_type" : "doc",
      "_id" : "3",
      "_score" : 0.10848885, "_source" : { "title" : "Mein Hut der hat vier Ecken, 4 Ecken hat mein Hut" }
    } ]
  }
}

Now, let's index five more documents that are similar to the first documents but contain our test term Hut only once.

curl -XPOST http://localhost:9200/testindex/doc/5 -d '{ "title" : "Mein Hut der hat vier Ecken, 4 Ecken hat er nicht" }'
{"ok":true,"_index":"testindex","_type":"doc","_id":"5","_version":1} 
curl -XPOST http://localhost:9200/testindex/doc/6 -d '{ "title" : "Mein Hut der hat vier Ecken, 4 Ecken hat er nicht" }'
{"ok":true,"_index":"testindex","_type":"doc","_id":"6","_version":1}
curl -XPOST http://localhost:9200/testindex/doc/7 -d '{ "title" : "Mein Hut der hat vier Ecken, 4 Ecken hat er nicht" }'
{"ok":true,"_index":"testindex","_type":"doc","_id":"7","_version":1}
curl -XPOST http://localhost:9200/testindex/doc/8 -d '{ "title" : "Mein Hut der hat vier Ecken, 4 Ecken hat er nicht" }'
{"ok":true,"_index":"testindex","_type":"doc","_id":"8","_version":1}
curl -XPOST http://localhost:9200/testindex/doc/9 -d '{ "title" : "Mein Hut der hat vier Ecken, 4 Ecken hat er nicht" }'
{"ok":true,"_index":"testindex","_type":"doc","_id":"9","_version":1}

As the default relevancy formula takes the term frequency within a document into account, those documents should score lower than our original documents. And indeed, if we query for hut again, the results still contain our original documents at the top:

curl -XGET "http://localhost:9200/testindex/doc/_search?q=title:hut&pretty=true"
{
  "took" : 6,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 10,
    "max_score" : 0.2101998,
    "hits" : [ {
      "_index" : "testindex",
      "_type" : "doc",
      "_id" : "4",
      "_score" : 0.2101998, "_source" : { "title" : "Mein Hut der hat vier Ecken, 4 Ecken hat mein Hut" }
    }, {
      [...]
    }, {
      "_index" : "testindex",
      "_type" : "doc",
      "_id" : "3",
      "_score" : 0.2101998, "_source" : { "title" : "Mein Hut der hat vier Ecken, 4 Ecken hat mein Hut" }
    }, {
      "_index" : "testindex",
      "_type" : "doc",
      "_id" : "9",
      "_score" : 0.1486337, "_source" : { "title" : "Mein Hut der hat vier Ecken, 4 Ecken hat er nicht" }
    }, {
      [...]
    }, {
      "_index" : "testindex",
      "_type" : "doc",
      "_id" : "8",
      "_score" : 0.1486337, "_source" : { "title" : "Mein Hut der hat vier Ecken, 4 Ecken hat er nicht" }
    } ]
  }
}

We are still happy. The most relevant documents are at the top of our search results. Now let's index something that is completely different from our original documents:

curl -XPOST http://localhost:9200/testindex/doc/10 -d '{ "title" : "mayhem and chaos" }'
{"ok":true,"_index":"testindex","_type":"doc","_id":"10","_version":1}

Now, if we search again for our test term something strange will happen:

curl -XGET "http://localhost:9200/testindex/doc/_search?q=title:hut&pretty=true"
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 10,
    "max_score" : 0.35355338,
    "hits" : [ {
      "_index" : "testindex",
      "_type" : "doc",
      "_id" : "3",
      "_score" : 0.35355338, "_source" : { "title" : "Mein Hut der hat vier Ecken, 4 Ecken hat mein Hut" }
    }, {
      "_index" : "testindex",
      "_type" : "doc",
      "_id" : "8",
      "_score" : 0.25, "_source" : { "title" : "Mein Hut der hat vier Ecken, 4 Ecken hat er nicht" }
    }, {
      "_index" : "testindex",
      "_type" : "doc",
      "_id" : "4",
      "_score" : 0.2101998, "_source" : { "title" : "Mein Hut der hat vier Ecken, 4 Ecken hat mein Hut" }
    }, {
      [...]
    } ]
  }
}

Though the document we indexed last has nothing to do with our original documents, it influenced our search, and one of the documents that should score lower is now the second result. This is something you wouldn't expect. The behavior is caused by the default sharding of Elasticsearch, which distributes a logical Elasticsearch index across several Lucene indices.

Sharding

When you start a single instance and index some documents, Elasticsearch will by default create five shards under the hood, so there are five Lucene indices. Each of those shards contains some of the documents you are adding to the index. Documents are assigned to shards in a way that distributes them evenly.
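
Conceptually the assignment follows the documented routing formula, where the routing value defaults to the document id:

shard = hash(routing) % number_of_primary_shards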

You can get information about the shards and their document counts using the indices status API or, more visually appealing, using one of the plugins, e.g. elasticsearch-head. There are five shards for our index; once we click on a shard we can see further details about it, including the doc count.
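
If you prefer the command line, the indices status API mentioned above should give you the same per-shard information (a sketch, assuming the Elasticsearch 1.0 API used throughout this post):

curl -XGET "http://localhost:9200/testindex/_status?pretty=true"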

If you check the shards right after indexing the first five documents, you will notice that they are distributed evenly: each shard contains one of the documents. The second batch is again distributed evenly. The final document we index creates some imbalance; one shard now has an additional document.

The Effects on Relevancy

Each shard in Elasticsearch is a Lucene index in itself, and as an Elasticsearch index consists of multiple shards, queries need to be distributed across multiple Lucene indices. The inverse document frequency in particular is difficult to calculate correctly in this case.

Reconsider the Lucene relevancy formula: the term frequency as well as the inverse document frequency are important. When indexing the original 5 documents, all documents had the same term frequency as well as the same idf for our term. The second batch of documents changed the idf, but in the same way on every shard, as each shard received one more document that also contains the term, so the ordering stayed intact.

Now, when indexing the last document, something potentially unexpected happens. The new document is added to one of the shards. On this shard we thereby changed the inverse document frequency, which is calculated from the number of documents that contain the term but also takes the overall document count in the Lucene index into account. On the shard that contains the new document the idf value increased, as there are now more documents in the Lucene index. As idf has quite some weight in the overall relevancy score, we "boosted" the documents of the Lucene index that now contains more documents.
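
A back-of-the-envelope calculation with the simplified formula from above illustrates this (for a single-term query the query normalization cancels one of the idf factors, so a document's score is roughly tf * idf * fieldNorm):

idf on a shard with 2 documents, both containing hut:  1 + ln(2/3) ≈ 0.59
idf on the shard with 3 documents, 2 containing hut:   1 + ln(3/3) = 1.0

With tf = sqrt(2) and a fieldNorm of 0.25, this gives roughly 0.21 before and 0.35 after indexing the unrelated document for the documents that contain Hut twice, matching the scores in the responses above.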

If you'd like to see the details of the relevancy calculation, you can use the explain API or simply add the parameter explain=true to the search request. This will not only tell you all the details of the relevancy calculation for each document but also which shard a document resides on. It can give you really useful information when debugging relevancy problems.
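
For our example query this looks like this (output omitted):

curl -XGET "http://localhost:9200/testindex/doc/_search?q=title:hut&pretty=true&explain=true"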

How to Fix It?

When beginning with Elasticsearch you might fix this by setting your index to use one shard only. Though this will work, it is not a good idea: sharding is a very powerful feature of Elasticsearch and you shouldn't give up on it easily. If you notice that there are problems with your relevancy that are caused by these issues, you should rather try the search_type dfs_query_then_fetch instead of the default query_then_fetch. The difference between those is that dfs first queries all shards for their document frequencies. This way Elasticsearch can calculate the overall document frequency and all results will be in the correct order:

curl -XGET "http://localhost:9200/testindex/doc/_search?q=title:hut&pretty=true&explain=true&search_type=dfs_query_then_fetch"

Conclusion

Though the example we have seen here is artificially constructed, this is something that can occur and that I have already seen in live applications. The behaviour is especially relevant when there are either very few documents or your documents are distributed across the shards in an unfortunate way. It is great that Elasticsearch makes distributed searches as easy and as performant as possible, but you need to be aware that you might not get exact scores.

Zachary Tong has also written a blog post about this behavior on the Elasticsearch blog.
