Friday, 29 August 2014

Use Cases for Elasticsearch: Geospatial Search

In the previous posts we have seen that Elasticsearch can be used to store documents in JSON format and to distribute the data across multiple nodes as shards and replicas. Lucene, the underlying library, provides the implementation of the inverted index that is used to search the documents. Analyzing the text is a crucial step for building a good search application.

In this post we will look at a different kind of feature, one you might not immediately associate Elasticsearch with: the geo features that can be used to build applications that filter and sort documents based on the location of the user.

Locations in Applications

Location based features can be useful for a wide range of applications. A merchant's web site can present the closest point of service to the current user, or offer a search facility for finding points of service near a given location, often integrated with something like Google Maps. For classifieds it can make sense to sort them by distance from the user searching, and the same is true for any search for places like restaurants. Sometimes it also makes sense to only show results within a certain area around the user; in this case we need to filter by distance. Perhaps the user is looking for a new apartment and is not interested in results that are too far away from their workplace. Finally, locations can also be of interest when doing analytics. Social media data can tell you where something interesting is happening just by looking at the number of status messages sent from a certain area.

Most of the time locations are stored as a pair of latitude and longitude, which denotes a point. The combination 48.779506, 9.170045 for example points to the Liederhalle Stuttgart, which happens to be the venue of Java Forum Stuttgart. Geohashes are an alternative means of encoding latitude and longitude. They can be stored with arbitrary precision, so a geohash can also refer to a larger area instead of a single point.

When calculating a geohash the map is divided into several buckets or cells, each identified by a base 32 encoded value. The complete geohash then consists of a sequence of characters, where each following character identifies a bucket within the previous bucket, so you are zooming in on the location. The longer the geohash string, the more precise the location. For example u0wt88j3jwcp is the geohash for the Liederhalle Stuttgart, while the prefix u0wt covers the area of Stuttgart and some of the surrounding cities.

The hierarchical nature of geohashes and the possibility to express them as strings make them a good choice for storing in the inverted index. You can create geohashes using the original geohash service or, more visually appealing, the nice GeohashExplorer.

Locations in Elasticsearch

Elasticsearch accepts lat and lon for specifying latitude and longitude. These are two documents, one for a conference in Stuttgart and one for a conference in Nuremberg.

{
    "title" : "Anwendungsfälle für Elasticsearch",
    "speaker" : "Florian Hopf",
    "date" : "2014-07-17T15:35:00.000Z",
    "tags" : ["Java", "Lucene"],
    "conference" : {
        "name" : "Java Forum Stuttgart",
        "city" : "Stuttgart",
            "coordinates": {
                "lon": "9.170045",
                "lat": "48.779506"
            }
    } 
}
{
    "title" : "Anwendungsfälle für Elasticsearch",
    "speaker" : "Florian Hopf",
    "date" : "2014-07-15T16:30:00.000Z",
    "tags" : ["Java", "Lucene"],
    "conference" : {
        "name" : "Developer Week",
        "city" : "Nürnberg",
            "coordinates": {
                "lon": "11.115358",
                "lat": "49.417175"
            }
    } 
}

Alternatively you can use the GeoJSON format, which accepts an array of longitude and latitude. If you are like me, be prepared to hunt down why your queries aren't working, only to notice that you mixed up the order in the array.
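
For illustration, the coordinates of the Stuttgart document in array form would look roughly like this, with the longitude coming first:

"coordinates": [9.170045, 48.779506]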

The field needs to be mapped with a geo_point field type.

{
    "properties": {
        […],
        "conference": {
            "type": "object",
            "properties": {
                "coordinates": {
                    "type": "geo_point",
                    "geohash": "true",
                    "geohash_prefix": "true"
                }
            }
        }
    }
}

By passing the optional attribute geohash, Elasticsearch will automatically store the geohash for you as well. Depending on your use case you can also store all the parent cells of the geohash using the parameter geohash_prefix. As the values are just strings, this is a normal prefix indexing operation, similar to edge ngrams, which stores the different prefixes of a term, e.g. u, u0, u0w and u0wt for u0wt.

With our documents in place we can now use the geo information for sorting, filtering and aggregating results.

Sorting by Distance

First, let's sort all our documents by distance from a point. This would allow us to build an application that displays the closest location for the current user.

curl -XPOST "http://localhost:9200/conferences/_search " -d'
{
    "sort" : [
        {
            "_geo_distance" : {
                "conference.coordinates" : {
                    "lon": 8.403697,
                    "lat": 49.006616
                },
                "order" : "asc",
                "unit" : "km"
            }
        }
    ]
}'

We are requesting to sort by _geo_distance and are passing in another location, this time Karlsruhe, where I live. Results should be sorted ascending so that closer results come first. As Stuttgart is not far from Karlsruhe, it will be first in the list of results.

The score for each document will be null. Instead there is a field sort that contains the distance of the document's location from the one provided. This can be really handy when displaying the results to the user.
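
A hit in the response then looks roughly like this; the distance value shown is only an approximation of the distance between Karlsruhe and Stuttgart in kilometers:

{
    "_score": null,
    "_source": { […] },
    "sort": [
        61.5
    ]
}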

Filtering by Distance

For some use cases we would like to filter our results by distance. Some online real estate agencies for example provide the option to only display results within a certain distance of a point. We can do the same by passing in a geo_distance filter.

curl -XPOST "http://localhost:9200/conferences/_search" -d'
{
   "filter": {
      "geo_distance": {
         "conference.coordinates": {
            "lon": 8.403697,
            "lat": 49.006616
         },
         "distance": "200km",
         "distance_type": "arc"
      }
   }
}'

We are again passing the location of Karlsruhe. We request that only documents within a distance of 200km are returned and that the arc distance_type is used for calculating the distance, which takes into account that we are living on a globe.

The resulting list will only contain one document, Stuttgart, as Nuremberg is just over 200km away. If we use a distance of 210km, both documents will be returned.

Geo Aggregations

Elasticsearch provides several useful geo aggregations that allow you to retrieve more information on the locations of your documents, e.g. for faceting. Also, as we have the geohash as well as its prefixes indexed, we can retrieve all of the cells our results are in using a simple terms aggregation. This way you can let the user drill down on the results by filtering on a cell.

curl -XPOST "http://localhost:9200/conferences/_search" -d'
{
    "aggregations" : {
        "conference-hashes" : {
            "terms" : {
                "field" : "conference.coordinates.geohash"
            }
        }
    }
}'

Depending on the precision we have chosen while indexing, this will return a long list of geohash prefixes, but the most important part is at the beginning.

[...]
   "aggregations": {
      "conference-hashes": {
         "buckets": [
            {
               "key": "u",
               "doc_count": 2
            },
            {
               "key": "u0",
               "doc_count": 2
            },
            {
               "key": "u0w",
               "doc_count": 1
            },
            [...]
         ]
      }
   }

Stuttgart and Nuremberg both share the parent cells u and u0.

As an alternative to the terms aggregation you can also use specialized geo aggregations, e.g. the geo distance aggregation for forming buckets of distances.
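
A geo distance aggregation for our two documents, again using Karlsruhe as the origin, could look roughly like the following sketch; the aggregation name and the ranges are arbitrary and only for illustration:

curl -XPOST "http://localhost:9200/conferences/_search" -d'
{
    "aggregations" : {
        "conference-distances" : {
            "geo_distance" : {
                "field" : "conference.coordinates",
                "origin" : {
                    "lon": 8.403697,
                    "lat": 49.006616
                },
                "unit" : "km",
                "ranges" : [
                    { "to" : 100 },
                    { "from" : 100, "to" : 300 },
                    { "from" : 300 }
                ]
            }
        }
    }
}'

Stuttgart would then end up in the first bucket and Nuremberg in the second.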

Conclusion

Besides the features we have seen here, Elasticsearch offers a wide range of geo features. You can index shapes and query by arbitrary polygons, either by passing them in or by passing a reference to an indexed polygon. When geohash prefixes are turned on you can also filter by geohash cell.
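
As a sketch of the latter: the geohash_cell filter determines the geohash cell of the requested precision that contains the given point and matches all documents located in that cell or, with neighbors enabled, in one of the adjacent cells. Assuming our mapping with geohash prefixes enabled, a request could look roughly like this:

curl -XPOST "http://localhost:9200/conferences/_search" -d'
{
   "filter": {
      "geohash_cell": {
         "conference.coordinates": {
            "lat": 48.779506,
            "lon": 9.170045
         },
         "precision": 4,
         "neighbors": true
      }
   }
}'

With a precision of 4 this resolves to the cell u0wt from above, so only the Stuttgart document would match.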

With the new HTML5 location features, location aware search and content delivery will become more important. Elasticsearch is a good fit for building these kinds of applications.

Two users in the geo space are Foursquare, a very early user of Elasticsearch, and Gild, a recruitment agency that does some magic with locations.

Wednesday, 13 August 2014

Resources for Freelancers

More than half a year ago I wrote a post on my first years as a freelancer. While writing that post I noticed that there are quite a few resources I would like to recommend, which I deferred to another post that was never written. Last weekend at SoCraTes we had a very productive session on freelancing. We talked about different aspects, from getting started, kinds of and reasons for freelancing, to handling your sales pipeline.

David and I recommended some resources on getting started, so this is the perfect excuse to write the post I originally planned.

I will keep it minimal and only present a short description of each point.

Softwerkskammer Freelancer Group
Our discussion at SoCraTes led to founding a new group in the Softwerkskammer community. We plan to exchange knowledge and probably even work opportunities.
The Freelancers' Show
A podcast on everything freelancing. It started as the Ruby Freelancers but the topics were always general. Fun to listen to; when it comes to software development you might also want to listen to the Ruby Rogues.
Book Yourself Solid
I read this when getting started with freelancing. It helps you decide what you want to do and how to market yourself.
Get Clients Now
A workbook with daily tasks for improving your business. It's a 28 day program that contains some really good ideas and helps you work on getting more work.
Duct Tape Marketing
A book on improving your marketing activities. I took less away from this book than from the other two.
Email Course by Eric Davis
Eric Davis, one of the hosts of the Freelancers' Show also provides a free email course for freelancers.
Mediafon Ratgeber Selbstständige
A German book on all practical issues you have to take care of.

There is also stuff that is only slightly related to freelancing but helped me on the way, either through learning or motivation.

Technical Blogging
A book that can help you get started with blogging. It can be motivating and also contains some good tips.
My Blog Traffic Sucks.
A short book on blogging. This book led to my very frequent blog publishing habit.
Confessions of a Public Speaker
A very good and entertaining read on presenting.
The $100 Startup
Not exactly about freelancing but about small startups of all kinds, a very entertaining read about people working for themselves.

I am sure I forgot some of the things that helped me, but I hope one of these resources can help you and your freelancing business. If you are missing something, feel free to leave a comment.