Tuesday, April 5, 2016

Learning Lucene

I am currently working with a team that is starting a new project based on Lucene. While most of the time I would argue for using either Solr or Elasticsearch instead of plain Lucene, in this case it was a conscious decision. In this post I am compiling some sources for learning Lucene – I hope you will find them helpful, or that you can point out sources I missed.

Project documentation

The first choice of course is the excellent project documentation. It contains the Javadoc for all the modules (core, analyzers-common and queryparser being the most important ones), which also includes further documentation, for example an explanation of a simple demo app and helpful introductions to analysis and to querying and scoring. You might also be interested in the standard index file formats.

Besides the documentation that comes with the releases there is also a lot of information in the project wiki, but you need to know what you are looking for. You can also join the mailing lists to learn about what other users are doing.

When looking at analyzer components the Solr Start website can be useful. Though it is dedicated to Solr, its list of analyzer components can help you choose analyzers for Lucene as well. It also contains a searchable version of the Javadocs.

Books

The classic book on the topic is Lucene in Action. On over 500 pages it explains all the underlying concepts in detail. Unfortunately some of the information is outdated, many of the code examples won't work anymore, and the newer concepts are not included. Still, it's the recommended read for learning Lucene.

Another book I've read is Lucene 4 Cookbook, published by Packt. It contains more current examples but is not well suited for learning the basics. Additionally it felt to me as if no editor worked on this book: there are lots of repetitions, typos and broken sentences. (I make lots of grammar mistakes myself when blogging – but I expect more from a published book.)

You can also learn a lot about different aspects of Lucene by reading a book on one of the search servers based on it. I can recommend Elasticsearch in Action, Solr in Action and Elasticsearch: The Definitive Guide. (If you can read German I am of course inviting you to read my book on Elasticsearch.)

Blogs, Conferences and Videos

There are countless blog posts on Lucene; a very good introduction is Lucene: The Good Parts by Andrew Montalenti. Some blogs publish regular pieces on Lucene; recommended ones are those by Mike McCandless (who now mostly blogs on the Elastic blog), OpenSource Connections, Flax and Uwe Schindler. There is a lot of content about Lucene on the Elastic blog; if you want to hear about current development I can recommend the "This week in Elasticsearch and Apache Lucene" series. There are also some interesting posts on the Lucidworks blog, and I am sure there are lots of other blogs I forgot to mention here.

Lucene is a regular topic at two larger conferences: Lucene/Solr Revolution and Berlin Buzzwords. You can find lots of video recordings of past events on their websites.

Sources

Finally, the project is open source so you can learn a lot about it by reading the source code of either the library or the tests.

Another option is to look at applications using it, such as Solr or Elasticsearch. Of course you need to find your way around the sources of the project, but sometimes this isn't too hard. One example for Elasticsearch: if you would like to learn how the common multi_match query is implemented in terms of Lucene, you will easily find the class MultiMatchQuery that creates the Lucene queries.

What did I miss?

I hope there is something useful for you in this post. I am sure I missed lots of great resources for learning Lucene. If you would like to add one, let me know in the comments or on Twitter.

Wednesday, March 23, 2016

Logging Requests to Elasticsearch

This is something I have wanted to write down for years but never got around to completing. This post can help you a lot with certain Elasticsearch setups by answering two questions using the slow log:

  • Is my application talking to Elasticsearch?
  • What kind of queries are being built by my application?

A while ago I helped a colleague on one of my current projects to debug some problems with Elasticsearch integrated into proprietary software. He was not sure whether any requests were arriving at Elasticsearch at all and what they looked like. We activated the slow log for Elasticsearch, which can be used not only to log slow queries but also to enable debug logging for any query that reaches Elasticsearch.

The slow log, as the name suggests, is there to log slow requests. As "slow" is a subjective term, you define thresholds that need to be exceeded. For example you can specify that any query slower than 50ms is logged at the debug level but any query that takes longer than 500ms is logged at the warn level.
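In elasticsearch.yml the two-level example above could look like this (the threshold values are just illustrative; the setting names are the real slow log settings):

```yaml
# log queries slower than 50ms at debug level
index.search.slowlog.threshold.query.debug: 50ms
# log queries slower than 500ms at warn level
index.search.slowlog.threshold.query.warn: 500ms
```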

Slow query logging can be configured for both phases of query execution: query and fetch. In the query phase only the ids of the documents are retrieved, in the form of a search result list. In the fetch phase the result documents themselves are retrieved.

Besides the slow query log there is also the slow index log, which can be used in the same way but measures the time taken for indexing.

Both of these are index settings. That means they are configured per index and can therefore differ across indices.
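Since they are index settings you can also inspect the values currently stored for a single index via the HTTP API; for example, for an index named conference:

```shell
# show the settings stored for the index "conference"
curl -XGET "http://localhost:9200/conference/_settings?pretty"
```

Note that this only returns settings that were set explicitly for the index, so slow log thresholds will show up here once you have configured them.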

Instance Settings

There are multiple places where you can configure index settings. The first is config/elasticsearch.yml, which contains the configuration of the instance. In older versions of Elasticsearch the file already contains the relevant lines, commented out; in newer versions you need to add them yourself. If you want to log all requests at the debug level you can just add the following lines, setting a threshold of 0s.

index.search.slowlog.threshold.query.debug: 0s
index.search.slowlog.threshold.fetch.debug: 0s
index.indexing.slowlog.threshold.index.debug: 0s

You need to restart the instance for the settings to take effect. Any indexing and search requests will now be logged to separate log files in the log folder. With the default configuration the logs will be at logs/elasticsearch_index_indexing_slowlog.log and logs/elasticsearch_index_search_slowlog.log. The query log will now contain entries like this:

[2016-03-23 06:43:47,231][DEBUG][index.search.slowlog.fetch] took[5.8ms], took_millis[5], types[talk], stats[], search_type[QUERY_THEN_FETCH], total_shards[5], source[{"query":{"match":{"tags":"Java"}}}], extra_source[]

If you are testing this with multiple shards on one instance you might get more log lines than expected: There will be one line for every shard in the query phase and one line for the fetch phase.

Runtime Settings

Besides the setting in elasticsearch.yml, the slow request logs can also be activated using the HTTP API, which doesn't require a restart of the instance and is therefore well suited for debugging production issues. The following request changes the setting for the query log for an index named conference.

curl -XPUT "http://localhost:9200/conference/_settings" -d'
{
    "index.search.slowlog.threshold.query.debug": "0s"
}'

When you are done debugging your issue you can just set a higher threshold again.
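Following the same pattern, raising the threshold again could look like this (500ms here is just an example value, pick whatever makes sense for your setup):

```shell
# restore a stricter threshold so only genuinely slow queries are logged
curl -XPUT "http://localhost:9200/conference/_settings" -d'
{
    "index.search.slowlog.threshold.query.debug": "500ms"
}'
```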