Thursday, February 13, 2014

Search Meetups in Germany

I enjoy going to user group events, not only because of the talks that are an integral part of most meetups but also to meet and chat with like-minded people.

Fortunately there are some user groups in Germany that are focused on search technology, a topic I am especially interested in. This post lists those I know of; if I missed one, let me know in the comments. For reasons of suspense I am listing the groups from east to west.

Elasticsearch User Group Berlin

Berlin has the luxury of a user group dedicated to Elasticsearch only. The group is organized by people from Trifork, who are seasoned event organizers. It seems to be surprisingly successful, with regular meetings and up to 50 participants. This is probably caused by the high startup density in Berlin; the ease of use and scalability of Elasticsearch make it very popular among startups.

Search Meetup Munich

Search Meetup Munich is a very active group organized by Alexander Reelsen of Elasticsearch. There are bimonthly meetings at alternating companies with 2 to 3 talks per event. Topics are open source search in general with a strong emphasis on Lucene, Solr and Elasticsearch. Most speakers will give their talk in English if there are people in the audience who don't speak German. The number of participants ranges from 20 to 40 people. I am surprised by the vibrant community in Munich, with a lot of startups doing interesting things with search. Though it is quite a way from Karlsruhe to Munich, I try to attend the meetings as often as I can.

Solr Lucene User Group Deutschland e.V.

Though the name implies it's a national group, Solr Lucene User Group Deutschland e.V. is located in Augsburg. It seems to be mainly organized by members of SHI GmbH, a prominent Lucidworks and Elasticsearch partner. The meetup page is rather quiet so far, with one event last year that had a single participant.

Search Meetup Frankfurt

This was the first search meetup I attended, with around 10 participants, a talk on the indexing pipeline of the Solr based product search solution Searchperience and some discussions. Quite a few of the participants have a non-Java background and do PHP web development. Unfortunately the 2012 event I attended seems to have been the last one that happened. I don't take that personally.

Search Meetup Karlsruhe

Last but not least: as I probably can't travel to Munich all the time and I would like to have some exchange with locals, I just started Search Meetup Karlsruhe together with Exensio, long-time Solr users and Elasticsearch partners. I don't expect it to be as huge as Munich or Berlin, but I hope we can start some interesting discussions.

We just scheduled our first meeting with two talks on Linked Data Search and the difference between building applications based on databases vs. search engines. If you are in the area and interested in search you should join us.

elasticsearch.Stuttgart (Update 16.02.2014)

Just a day after publishing this post another Elasticsearch meetup was announced, this time in Stuttgart. The first event is scheduled for March 25 with an Elasticsearch 1.0 release party, including a talk by Alexander Reelsen. If this didn't clash with the JavaLand conference I would definitely go, but I hope there will be more events in the future that I can attend.

Friday, February 7, 2014

Elasticsearch is Distributed by Default

One of the big advantages Elasticsearch has over Solr is that it is really easy to get started with. You can download it, start it, and index and search immediately. Schema discovery and the JSON based REST API make it a very beginner friendly tool.

Another aspect: Elasticsearch is distributed by default. You can add nodes that will automatically be discovered, and your index can be distributed across several nodes.

The distributed nature makes it easy to get started, but you need to be aware that it has some consequences: distribution comes with a cost. In this post I will show you how the relevancy of search results can be affected by sharding in Elasticsearch.

Relevancy

As Elasticsearch is based on Lucene it also uses Lucene's default relevancy algorithm, which is based on TF/IDF. The term frequency (how often a term occurs in a document) and the inverse document frequency (how many documents in the index contain the term) are important parts of the relevancy function. You can see the details of the default formula in the Lucene API docs, but for this post it is sufficient to know that the more often a term occurs in a document the more relevant the document is considered, and that terms occurring in many documents of the index are considered less relevant.
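To make the rest of this post easier to follow, here is a strongly simplified sketch of the per-term score in Lucene's default similarity (boosts, norms and the coordination factor are left out):

score(t, d) ≈ tf(t, d) * idf(t)²
tf(t, d) = sqrt(number of occurrences of t in d)
idf(t) = 1 + ln(numDocs / (docFreq(t) + 1))

The detail that matters for this post: numDocs (the total number of documents) and docFreq (the number of documents containing the term) are taken from the Lucene index the document lives in, not from any global statistics.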

A Problematic Example

Let's see the problem in action. We start with a fresh Elasticsearch instance and index some test documents that consist of only one field containing the same text:

curl -XPOST http://localhost:9200/testindex/doc/0 -d '{ "title" : "Mein Hut der hat vier Ecken, 4 Ecken hat mein Hut" }'
{"ok":true,"_index":"testindex","_type":"doc","_id":"0","_version":1}
curl -XPOST http://localhost:9200/testindex/doc/1 -d '{ "title" : "Mein Hut der hat vier Ecken, 4 Ecken hat mein Hut" }'
{"ok":true,"_index":"testindex","_type":"doc","_id":"1","_version":1}
curl -XPOST http://localhost:9200/testindex/doc/2 -d '{ "title" : "Mein Hut der hat vier Ecken, 4 Ecken hat mein Hut" }'
{"ok":true,"_index":"testindex","_type":"doc","_id":"2","_version":1}
curl -XPOST http://localhost:9200/testindex/doc/3 -d '{ "title" : "Mein Hut der hat vier Ecken, 4 Ecken hat mein Hut" }'
{"ok":true,"_index":"testindex","_type":"doc","_id":"3","_version":1}
curl -XPOST http://localhost:9200/testindex/doc/4 -d '{ "title" : "Mein Hut der hat vier Ecken, 4 Ecken hat mein Hut" }'
{"ok":true,"_index":"testindex","_type":"doc","_id":"4","_version":1}

When we search those documents by text they are of course returned correctly.

curl -XGET "http://localhost:9200/testindex/doc/_search?q=title:hut&pretty=true"
{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 5,
    "max_score" : 0.10848885,
    "hits" : [ {
      "_index" : "testindex",
      "_type" : "doc",
      "_id" : "4",
      "_score" : 0.10848885, "_source" : { "title" : "Mein Hut der hat vier Ecken, 4 Ecken hat mein Hut" }
    }, {
      "_index" : "testindex",
      "_type" : "doc",
      "_id" : "0",
      "_score" : 0.10848885, "_source" : { "title" : "Mein Hut der hat vier Ecken, 4 Ecken hat mein Hut" }
    }, {
      "_index" : "testindex",
      "_type" : "doc",
      "_id" : "1",
      "_score" : 0.10848885, "_source" : { "title" : "Mein Hut der hat vier Ecken, 4 Ecken hat mein Hut" }
    }, {
      "_index" : "testindex",
      "_type" : "doc",
      "_id" : "2",
      "_score" : 0.10848885, "_source" : { "title" : "Mein Hut der hat vier Ecken, 4 Ecken hat mein Hut" }
    }, {
      "_index" : "testindex",
      "_type" : "doc",
      "_id" : "3",
      "_score" : 0.10848885, "_source" : { "title" : "Mein Hut der hat vier Ecken, 4 Ecken hat mein Hut" }
    } ]
  }
}

Now, let's index five more documents that are similar to the first documents but contain our test term Hut only once.

curl -XPOST http://localhost:9200/testindex/doc/5 -d '{ "title" : "Mein Hut der hat vier Ecken, 4 Ecken hat er nicht" }'
{"ok":true,"_index":"testindex","_type":"doc","_id":"5","_version":1} 
curl -XPOST http://localhost:9200/testindex/doc/6 -d '{ "title" : "Mein Hut der hat vier Ecken, 4 Ecken hat er nicht" }'
{"ok":true,"_index":"testindex","_type":"doc","_id":"6","_version":1}
curl -XPOST http://localhost:9200/testindex/doc/7 -d '{ "title" : "Mein Hut der hat vier Ecken, 4 Ecken hat er nicht" }'
{"ok":true,"_index":"testindex","_type":"doc","_id":"7","_version":1}
curl -XPOST http://localhost:9200/testindex/doc/8 -d '{ "title" : "Mein Hut der hat vier Ecken, 4 Ecken hat er nicht" }'
{"ok":true,"_index":"testindex","_type":"doc","_id":"8","_version":1}
curl -XPOST http://localhost:9200/testindex/doc/9 -d '{ "title" : "Mein Hut der hat vier Ecken, 4 Ecken hat er nicht" }'
{"ok":true,"_index":"testindex","_type":"doc","_id":"9","_version":1}

As the default relevancy formula takes the term frequency within a document into account, those documents should score lower than our original documents. So if we query for hut again, the results still show our original documents at the top:

curl -XGET "http://localhost:9200/testindex/doc/_search?q=title:hut&pretty=true"
{
  "took" : 6,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 10,
    "max_score" : 0.2101998,
    "hits" : [ {
      "_index" : "testindex",
      "_type" : "doc",
      "_id" : "4",
      "_score" : 0.2101998, "_source" : { "title" : "Mein Hut der hat vier Ecken, 4 Ecken hat mein Hut" }
    }, {
      [...]
    }, {
      "_index" : "testindex",
      "_type" : "doc",
      "_id" : "3",
      "_score" : 0.2101998, "_source" : { "title" : "Mein Hut der hat vier Ecken, 4 Ecken hat mein Hut" }
    }, {
      "_index" : "testindex",
      "_type" : "doc",
      "_id" : "9",
      "_score" : 0.1486337, "_source" : { "title" : "Mein Hut der hat vier Ecken, 4 Ecken hat er nicht" }
    }, {
      [...]
    }, {
      "_index" : "testindex",
      "_type" : "doc",
      "_id" : "8",
      "_score" : 0.1486337, "_source" : { "title" : "Mein Hut der hat vier Ecken, 4 Ecken hat er nicht" }
    } ]
  }
}

We are still happy. The most relevant documents are at the top of our search results. Now let's index something that is completely different from our original documents:

curl -XPOST http://localhost:9200/testindex/doc/10 -d '{ "title" : "mayhem and chaos" }'
{"ok":true,"_index":"testindex","_type":"doc","_id":"10","_version":1}

Now, if we search again for our test term something strange will happen:

curl -XGET "http://localhost:9200/testindex/doc/_search?q=title:hut&pretty=true"
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 10,
    "max_score" : 0.35355338,
    "hits" : [ {
      "_index" : "testindex",
      "_type" : "doc",
      "_id" : "3",
      "_score" : 0.35355338, "_source" : { "title" : "Mein Hut der hat vier Ecken, 4 Ecken hat mein Hut" }
    }, {
      "_index" : "testindex",
      "_type" : "doc",
      "_id" : "8",
      "_score" : 0.25, "_source" : { "title" : "Mein Hut der hat vier Ecken, 4 Ecken hat er nicht" }
    }, {
      "_index" : "testindex",
      "_type" : "doc",
      "_id" : "4",
      "_score" : 0.2101998, "_source" : { "title" : "Mein Hut der hat vier Ecken, 4 Ecken hat mein Hut" }
    }, {
      [...]
    } ]
  }
}

Though the document we indexed last has nothing to do with our original documents, it influenced our search: one of the documents that should score lower is now the second result. This is something you wouldn't expect. The behavior is caused by the default sharding of Elasticsearch, which distributes a logical Elasticsearch index across several Lucene indices.

Sharding

When you start a single instance and index some documents, Elasticsearch will by default create five shards under the hood, so there are five Lucene indices. Each of those shards contains some of the documents you add to the index. Documents are assigned to shards in a way that distributes them evenly.

You can get information about the shards and their document counts using the indices status API or, more visually appealing, using one of the plugins, e.g. elasticsearch-head. There are five shards for our index; once we click on a shard in elasticsearch-head we can see further details, including the doc count.

If you check the shards right after indexing the first five documents you will notice that they are distributed evenly across all shards: each shard contains one of the documents. The second batch is again distributed evenly. The final document we index creates some imbalance; one shard will have one document more than the others.
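If you prefer the command line over a plugin you can get the same numbers from the indices status API mentioned above. A minimal sketch; the exact endpoint and response format depend on your Elasticsearch version, newer versions expose the same information via the _stats API:

curl -XGET "http://localhost:9200/testindex/_status?pretty=true"

The response contains a shards section with a document count for every shard of the index.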

The Effects on Relevancy

Each shard in Elasticsearch is a Lucene index in itself, and as an Elasticsearch index consists of multiple shards, queries need to be distributed across multiple Lucene indices. The inverse document frequency in particular is difficult to calculate correctly in this setup.

Reconsider the Lucene relevancy formula: both the term frequency and the inverse document frequency matter. When indexing the original 5 documents, all of them had the same term frequency and the same idf for our term. The next 5 documents didn't change the picture much either: every document in the index still contained the term, so on each shard all documents shared the same idf and only the term frequency made a difference.

Now, when indexing the last document, something potentially unexpected happens. The new document is added to exactly one of the shards. On this shard we thereby changed the inverse document frequency, which is calculated from the number of documents that contain the term but also takes the overall document count in the Lucene index into account. On the shard that contains the new document the idf value for our term increased, as there are now more documents in that Lucene index while the number of documents containing the term stayed the same. As idf has quite some weight in the overall relevancy score, we effectively "boosted" the documents on the shard that now contains more documents.

If you'd like to see the details of the relevancy calculation you can use the explain API or simply add the parameter explain=true to your search request. This will not only show you all the details of the relevancy function for each document but also tell you which shard a document resides on, which is really useful information when debugging relevancy problems.
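For our example this is as simple as adding the parameter to the query we have been using all along:

curl -XGET "http://localhost:9200/testindex/doc/_search?q=title:hut&pretty=true&explain=true"

Each hit then carries an _explanation section with the tf and idf values that went into its score as well as information about the shard the document lives on.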

How to Fix It?

When beginning with Elasticsearch you might fix this by setting your index to use one shard only. Though this will work, it is not a good idea: sharding is a very powerful feature of Elasticsearch and you shouldn't give up on it easily. If you notice relevancy problems that are caused by these issues you should rather try the search_type dfs_query_then_fetch instead of the default query_then_fetch. The difference is that with dfs Elasticsearch first queries all shards for their document frequencies. This way it can calculate the overall document frequencies and all results will be in the correct order:

curl -XGET "http://localhost:9200/testindex/doc/_search?q=title:hut&pretty=true&explain=true&search_type=dfs_query_then_fetch"

Conclusion

Though the example we have seen here is artificially constructed, this is something that can occur and that I have already seen in live applications. The behaviour is especially relevant when there are either very few documents or your documents are distributed across the shards in an unfortunate way. It is great that Elasticsearch makes distributed search as easy and performant as possible, but you need to be aware that the scores you get might not be exact.

Zachary Tong has written a blog post about this behavior on the Elasticsearch blog as well.

Thursday, January 30, 2014

Proxying Solr

The dominant deployment model for Solr is running it as a standalone webapp. You can use it in embedded mode in Java, but then you are missing some of the goodies like a separate JVM (your garbage collector will thank you for it) and you are of course tied to Java.

Most of the time Solr is considered similar to a database; only custom webapp code can talk to it and it is not exposed to the net. In your webapp you are then using any of the client libraries to access Solr and build your queries.

With the rise of JavaScript on the client side, people sometimes get the idea to expose Solr to the web directly. There is no custom webapp layer in between, only the browser talking to Solr.

In that case a proxy needs to sit in front of the Solr server that only allows certain requests; anything that could potentially modify your index or do other harm must be blocked. This can be done, but you need to be aware of a few things.
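To give you an idea of why such whitelisting is necessary: if the update handler is reachable from the outside, a single request is enough to wipe the whole index. A sketch, assuming the default /solr/update handler is exposed:

curl "http://localhost:8983/solr/update?commit=true" -H 'Content-type:text/xml' -d '<delete><query>*:*</query></delete>'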

Most of the time putting Solr directly on the web is not an option, but you can do it if you are willing to take some risk. I think that especially the possibility of DoS attacks shouldn't be taken lightly, and the more flexibility you want on the query side, the more care needs to be taken to secure the system. If you'd like to do it anyway, see this post on how to use nginx as a proxy for Solr and this list of specialized proxies for Solr. For general instructions on securing your Solr server see the project wiki.

Thursday, January 23, 2014

Analyze your Maven Project Dependencies with dependency:analyze

When working on a larger Maven project it might happen that you lose track of the dependencies in your project. Over time you add new dependencies, remove code or move code to modules, so some of the dependencies become obsolete. Though I have worked on lots of Maven projects I have to admit I didn't know until recently that the dependency plugin contains a useful goal for solving this problem: dependency:analyze.

The dependency:analyze mojo can find dependencies that are declared for your project but are not necessary. Additionally it can find dependencies that are used but undeclared, which happens when you directly use transitive dependencies in your code.

Analyzing Dependencies

I am showing an example with the Odftoolkit project. It contains quite a few dependencies and is old enough that some of them are outdated. ODFDOM is the most important module of the project, providing low level access to the Open Document structure from Java code. Running mvn dependency:tree we can see its dependencies at the time of writing:

mvn dependency:tree
[INFO] Scanning for projects...
[INFO]                                                                         
[INFO] ------------------------------------------------------------------------
[INFO] Building ODFDOM 0.8.10-incubating-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO] 
[INFO] --- maven-dependency-plugin:2.1:tree (default-cli) @ odfdom-java ---
[INFO] org.apache.odftoolkit:odfdom-java:jar:0.8.10-incubating-SNAPSHOT
[INFO] +- org.apache.odftoolkit:taglets:jar:0.8.10-incubating-SNAPSHOT:compile
[INFO] |  \- com.sun:tools:jar:1.7.0:system
[INFO] +- xerces:xercesImpl:jar:2.9.1:compile
[INFO] |  \- xml-apis:xml-apis:jar:1.3.04:compile
[INFO] +- junit:junit:jar:4.8.1:test
[INFO] +- org.apache.jena:jena-arq:jar:2.9.4:compile
[INFO] |  +- org.apache.jena:jena-core:jar:2.7.4:compile
[INFO] |  +- commons-codec:commons-codec:jar:1.5:compile
[INFO] |  +- org.apache.httpcomponents:httpclient:jar:4.1.2:compile
[INFO] |  +- org.slf4j:jcl-over-slf4j:jar:1.6.4:compile
[INFO] |  +- org.apache.httpcomponents:httpcore:jar:4.1.3:compile
[INFO] |  +- org.slf4j:slf4j-api:jar:1.6.4:compile
[INFO] |  +- org.slf4j:slf4j-log4j12:jar:1.6.4:compile
[INFO] |  \- log4j:log4j:jar:1.2.16:compile
[INFO] +- org.apache.jena:jena-core:jar:tests:2.7.4:test
[INFO] +- net.rootdev:java-rdfa:jar:0.4.2:compile
[INFO] |  \- org.apache.jena:jena-iri:jar:0.9.1:compile
[INFO] \- commons-validator:commons-validator:jar:1.4.0:compile
[INFO]    +- commons-beanutils:commons-beanutils:jar:1.8.3:compile
[INFO]    +- commons-digester:commons-digester:jar:1.8:compile
[INFO]    \- commons-logging:commons-logging:jar:1.1.1:compile
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 1.877s
[INFO] Finished at: Mon Jan 20 00:41:05 CET 2014
[INFO] Final Memory: 13M/172M
[INFO] ------------------------------------------------------------------------

The project contains some direct dependencies with a lot of transitive dependencies. When running mvn dependency:analyze on the project we will see that our dependencies don't seem to be correct:

mvn dependency:analyze
[INFO] Scanning for projects...
[INFO]                                                                         
[INFO] ------------------------------------------------------------------------
[INFO] Building ODFDOM 0.8.10-incubating-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO] 
[...] 
[INFO] <<< maven-dependency-plugin:2.1:analyze (default-cli) @ odfdom-java <<<
[INFO] 
[INFO] --- maven-dependency-plugin:2.1:analyze (default-cli) @ odfdom-java ---
[WARNING] Used undeclared dependencies found:
[WARNING]    org.apache.jena:jena-core:jar:2.7.4:compile
[WARNING]    xml-apis:xml-apis:jar:1.3.04:compile
[WARNING] Unused declared dependencies found:
[WARNING]    org.apache.odftoolkit:taglets:jar:0.8.10-incubating-SNAPSHOT:compile
[WARNING]    org.apache.jena:jena-arq:jar:2.9.4:compile
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 4.769s
[INFO] Finished at: Mon Jan 20 00:43:27 CET 2014
[INFO] Final Memory: 28M/295M
[INFO] ------------------------------------------------------------------------

The second part of the warnings is easier to understand: we have declared some dependencies that we never use, the taglets and jena-arq. When comparing this with the output above you will notice that the largest set of transitive dependencies was pulled in by jena-arq. And we don't even need it.

The first part seems more difficult: two used but undeclared dependencies were found. What does that mean? Shouldn't compilation fail if a dependency is undeclared? No, it just means that we are directly using a transitive dependency in our code, one that we should rather declare ourselves.
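The fix is to declare such dependencies explicitly. For jena-core, which shows up as a transitive dependency in the tree above, the declaration would look roughly like this (the version is taken from the tree output):

<dependency>
    <groupId>org.apache.jena</groupId>
    <artifactId>jena-core</artifactId>
    <version>2.7.4</version>
</dependency>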

Breaking the Build on Dependency Problems

If you want to find problems with your dependencies as early as possible it's best to integrate the check in your build. The dependency:analyze goal we have seen above is meant to be used in a standalone way; for automatic execution there is the analyze-only mojo, which binds to the verify phase by default and can be declared like this:

<plugin>
    <artifactId>maven-dependency-plugin</artifactId>
    <version>2.8</version>
    <executions>
        <execution>
            <id>analyze</id>
            <goals>
                <goal>analyze-only</goal>
            </goals>
            <configuration>
                <failOnWarning>true</failOnWarning>
                <outputXML>true</outputXML>
            </configuration>
        </execution>
    </executions>
</plugin>

Now the build will fail if any problems are found. Conveniently, if an undeclared dependency has been found, the plugin will also output the XML that you can then paste into your pom file.

A final word of caution: the default analyzer works on the bytecode level, so in special cases it might not detect a dependency correctly, e.g. when you are only using constants from a dependency that the compiler has inlined.

Friday, January 17, 2014

Geo-Spatial Features in Solr 4.2

Last week I showed how you can use the classic spatial support in Solr. It uses the LatLonType to index locations that can then be used to query, filter or sort by distance. Starting with Solr 4.2 a new approach is available. It uses the Lucene spatial module, which is more powerful but also needs to be used differently. You can still use the old approach, but in this post I will show you how to use the new features to do the same operations we saw last week.

Indexing Locations

Again we are indexing talks that contain a title and a location. For the new spatial support you need to add a different field type:

<fieldType name="location_rpt" class="solr.SpatialRecursivePrefixTreeFieldType"
    distErrPct="0.025"
    maxDistErr="0.000009"
    units="degrees"/>

Contrary to LatLonType, SpatialRecursivePrefixTreeFieldType is not a subfield type but stores the data structure itself. The attribute maxDistErr determines the accuracy of the location; in this case it is 0.000009 degrees, which is close to one meter (a degree of latitude is roughly 111 kilometers) and should be enough for most location searches.

To use the type in our documents of course we also need to add it as a field:

<field name="location" type="location_rpt" indexed="true" stored="true"/>

Now we are indexing some documents with three fields: the path (which is our id), the title of the talk and the location.

curl http://localhost:8082/solr/update/json?commit=true -H 'Content-type:application/json' -d '
[
 {"path" : "1", "title" : "Search Evolution", "location" : "49.487036,8.458001"},
 {"path" : "2", "title" : "Suchen und Finden mit Lucene und Solr", "location" : "49.013787,8.419936"}
]'

Again, the location of the first document is Mannheim, the second Karlsruhe. You can see that the locations are encoded in an ngram-like fashion when looking at the schema browser in the administration backend.

Sorting by Distance

A common use case is to sort the results by distance from a certain location. You can't use the Solr 3 syntax for this field type anymore but need to use the geofilt query parser with score=distance, which maps the distance to the score that you then sort on.

http://localhost:8082/solr/select?q={!geofilt%20score=distance%20sfield=location%20pt=49.487036,8.458001%20d=100}&sort=score%20asc

As the name implies, the geofilt query parser is originally meant for filtering. You still need to pass in a distance d, which is used for filtering, so sorting this way can also affect which results are returned at all. For our example, passing in a distance of 10 kilometers would only yield one result. This is something to be aware of.

Filtering by Distance

We can use the same approach we saw above to filter our results to only match talks in a given area. We can either use the geofilt query parser (which filters by radius) or the bbox query parser (which filters on the bounding box around that radius). As you can imagine, the query looks similar:

http://localhost:8082/solr/select?q=*:*&fq={!geofilt%20score=distance%20sfield=location%20pt=49.013787,8.419936%20d=10}

This will return all talks within a distance of 10 kilometers of Karlsruhe.
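The bbox variant looks nearly identical, only the query parser name changes. A sketch; as it matches on the bounding box around the circle it may return slightly more documents:

http://localhost:8082/solr/select?q=*:*&fq={!bbox%20sfield=location%20pt=49.013787,8.419936%20d=10}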

Doing Fancy Stuff

Besides the features we have looked at in this post you can also do more advanced things. Solr 3 spatial doesn't support multivalued location fields, which is possible with Solr 4.2. You can now also index lines or polygons that can then be queried and intersected. In this presentation Chris Hostetter uses this feature to determine overlaps in time, an interesting use case that you might not think of at first.

Thursday, January 9, 2014

Geo-Spatial Features in Solr 3

Solr is mainly known for its full text search capabilities. You index text and are able to search it in lowercase or stemmed form, depending on your analyzer chain. But besides text Solr can do more: You can use RangeQueries to query numeric fields ("Find all products with a price lower than 2€"), do date arithmetic ("Find me all news entries from last week") or do geospatial queries, which we will look at in this post. What I am describing here is the old spatial search support. Next week I will show you how to do the same things using recent versions of Solr.

Indexing Locations

Suppose we are indexing talks in Solr that contain a title and a location. We need to add the field type for locations to our schema:

<fieldType name="location" class="solr.LatLonType" subFieldSuffix="_coordinate"/>

LatLonType is a subfield type, which means that it not only creates one field but also additional fields, one for latitude and one for longitude. The subFieldSuffix attribute determines the name of those subfields, which follow the pattern <fieldname>_<index><subFieldSuffix>. If the name of our field is location and we are indexing a latitude/longitude pair this leads to three fields: location, location_0_coordinate and location_1_coordinate.

To use the type in our schema we need to add one field and one dynamic field definition for the sub fields:

<field name="location" type="location" indexed="true" stored="true"/>
<dynamicField name="*_coordinate" type="tdouble" indexed="true" stored="false"/>

The dynamic field is of type tdouble, so we need to make sure that this type is also available in our schema. The indexed attribute on location is special in this case: it determines whether the subfields for the coordinates are created at all.

Let's index some documents. We are adding three fields, the path (which is our id), the title of the talk and the location.

curl http://localhost:8983/solr/update/json?commit=true -H 'Content-type:application/json' -d '
[
 {"path" : "1", "title" : "Search Evolution", "location" : "49.487036,8.458001"},
 {"path" : "2", "title" : "Suchen und Finden mit Lucene und Solr", "location" : "49.013787,8.419936"}
]'

The location of the first document is Mannheim, the second Karlsruhe. We can see that our documents are indexed and that the location is stored by querying all documents:

curl "http://localhost:8983/solr/select?q=*%3A*&wt=json&indent=true"

Looking at the schema browser we can also see that the two subfields have been created. Each contains the terms for the Trie field.

Sorting by Distance

One use case you might have when indexing locations is to sort the results by distance from a certain location. This can for example be useful for classifieds or rentals to show the nearest results first.

Sorting can be done via the geodist() function. We need to pass in the location that is used as a basis via the pt parameter and the location field to use in the function via the sfield parameter. We can see this in action by sorting twice, once for a location in Durlach near Karlsruhe and once for Heidelberg, which is near Mannheim:

curl "http://localhost:8983/solr/select?wt=json&indent=true&q=*:*&sfield=location&pt=49.003421,8.483133&sort=geodist%28%29%20asc"
curl "http://localhost:8983/solr/select?wt=json&indent=true&q=*:*&sfield=location&pt=49.399119,8.672479&sort=geodist%28%29%20asc"

Both return the results in the correct order. You can also use the geodist() function to boost results that are closer to your location. See the Solr wiki for details.
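As a rough sketch of what such a boost could look like, using the recip parameters from the wiki example (they would need tuning for a real application):

curl "http://localhost:8983/solr/select?wt=json&indent=true&q={!boost%20b=recip(geodist(),2,200,20)}*:*&pt=49.003421,8.483133&sfield=location"

recip(geodist(),2,200,20) turns the distance into a value that gets smaller the farther away a document is, so closer documents score higher instead of distant ones being filtered out.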

Filtering by Distance

Another common use case is to filter the search results to only show results from a certain area, e.g. within a distance of 10 kilometers. This can either be applied automatically or be offered to the user, for example via facets.

Filtering is done using geofilt. It accepts the same parameters we have seen before, but for filtering you add it as a filter query. The distance can be passed using the parameter d; the unit defaults to kilometers. Suppose you are in Durlach and only want to see talks that are within a distance of 10 kilometers:

curl "http://localhost:8983/solr/select?wt=json&indent=true&q=*:*&fq={!geofilt}&pt=49.003421,8.483133&sfield=location&d=10"

This only returns the result in Karlsruhe. Once we decide that we want to see results within a distance of 100 kilometers we again see both results:

curl "http://localhost:8983/solr/select?wt=json&indent=true&q=*:*&fq={!geofilt}&pt=49.003421,8.483133&sfield=location&d=100"

Pretty useful! If you are interested, there is more on the Solr wiki. Next week I will show you how to do the same using the new spatial support in Solr versions starting from 4.2.

Wednesday, January 1, 2014

20 Months of Freelancing

It's now been 20 months since I started working as a freelancer on my own. With the end of the year upon us it's time to look back on what happened, and I would like to take the chance to write about what I did, what worked and what I would like to do in the future.

Why Freelancing?

During my time at university I started working for a small consulting company that specialized in open source software. I was the first employee and started working part time, but even paused my studies for half a year to work full time with them. The company grew, and in 2006, after finally getting my degree, I joined them full time. I always enjoyed the work and dedicated a lot of my energy and time to it. In 2012, with the company at around 30 employees, I noticed that I needed something else. I had already switched to a 4 day work week in 2011 to have more time for myself, to learn and experiment. Though the company is still great to work with, it just didn't fit me anymore.

After a long time at a company that has partly been home and family it is difficult to just switch to another company. I also wanted to have more control over what kind of projects I am doing and wanted some time of my own to write blog posts and give talks at user groups and conferences. I had always spent time on customer projects and often liked it, so it was an obvious decision to go with freelancing.

The Start

Before I quit I decided that I wanted to do more with search technologies. I had worked a lot on content management systems, and search is always a crucial part of those. Having done several larger projects with Lucene and Solr and even the first Solr integration in OpenCms, I knew that I had the necessary experience and that I liked it.

I had minimal savings when I quit and no customers yet. Other freelancers are often surprised when I tell them this and advise to only quit when you already know who you will be working for next. I guess this was some kind of hubris: I was really determined to do freelancing and knew that there were companies who needed my help.

I started freelancing in May and had already arranged to give a talk at our local Java User Group on Lucene and Solr in July. I wanted to have the full month of May for talk preparation and bootstrapping the business, all the things like getting a website, getting an accountant and so on. Unfortunately I didn't find a project until the beginning of July, with a lot of my savings already spent on living costs and necessary items for the business. Be aware that it can take up to two months from the beginning of a project until you see the first money.

Marketing

The good thing about freelancing: I can call all the activities I like to do marketing and tell myself that those are necessary. The bad thing about it: I don't spend enough time on paid projects.

I spend a lot of my unpaid time on learning: blog posts, books and conferences. I got into a quite frequent rhythm with weekly posts, spoke at several user groups and conferences and joined an open source project, the OdfToolkit. A lot of freelancers don't do any of those things and dedicate all their time to customer projects, but those activities are part of the reason I went with freelancing.

The Projects

When talking about freelancing you probably think of sitting in a coffee shop, doing several projects in parallel. For me this is different: lots of Java projects are rather long term and require you to work in a team, which is best done on site. Though I like the idea of doing more diverse projects I am also happy to have some stability. Having long term clients prevents some of the context switching involved with multiple projects, and you have to spend less time on sales.

My first project involved working on an online shop for a large retailer built on Hybris, a commercial e-commerce engine. I did a lot of Solr work, and though it was rather stressful, working on product search was really interesting. Also, the people are nice.

Though I started with the intention of doing more search projects, I am currently involved in a large CMS project for a retailer, (re)building parts of their online presence. Search only plays a minor part in it, but I like working with the people, it's a great work atmosphere, and some of the problems they face are really interesting. Before taking the project I had to think a lot about whether I wanted to sign such a long term contract, but I am glad I did. Fortunately I still have time to do some short term consulting on the side (mostly single days, mostly Solr).

Where Do I Get The Projects From?

When starting out I thought it would be a lot easier to get projects, but customers are not exactly lining up magically to get my services. I try to avoid working with freelance agents, though a lot of Java projects can only be gotten through them. Most of the project inquiries I get directly are from people who know me from organizing the local Java User Group. I didn't start helping with the user group for the marketing, but I have to admit it really paid off.

Besides that I am still working for customers of my old employer. They contact me with interesting projects and though of course they are taking their share I still earn enough for myself.

Most of the inquiries I get from agencies, mostly through my XING profile, are for Hybris and CoreMedia, two of the commercial systems I have worked with. I enjoy working with CoreMedia and could imagine doing projects with Hybris again, but I would be far happier if agencies contacted me for Lucene, Solr or Elasticsearch.

There have been some inquiries from people who found me through my blog, but never anything that was really doable (mostly overseas). Speaking at user groups and conferences has led to some contacts but never to a real project so far. So you could say that the marketing activities I spend most of my time on didn't pay off. But getting projects directly is not the only benefit of these activities; they are also important for me for learning and growing.

The Future

Freelancing has been exactly the right choice for me. I managed to find projects where I can keep my 4 day work week, leaving enough time for blogging, preparing talks and learning. I managed to do weekly blog posts for quite a few months during the year, but cut back a bit because it became overwhelming. Starting with the new year I hope I can get back to more frequent posts. Also, I'll be submitting talks to conferences again and hope that I can find more time to work on the OdfToolkit.

I'll be staying with my current client for as long as they need me but I am determined to only do search centric projects afterwards. Also I am planning to do a bit of work in other countries in Europe with a special twist. Watch this blog for the announcement.

When starting with freelancing you have a lot of questions and even simple things can take some time to find out on your own. I will compile a list of resources that helped me and publish those on my blog soon. If you are just starting with freelancing you are of course also welcome to contact me anytime.