In the previous posts we have seen that Elasticsearch can be used to store documents in JSON format and distribute the data across multiple nodes as shards and replicas. Lucene, the underlying library, provides the implementation of the inverted index that can be used to search the documents. Analyzing is a crucial step for building a good search application.
In this post we will look at a different set of features, ones you would not immediately associate with Elasticsearch: the geo features, which can be used to build applications that filter and sort documents based on the location of the user.
Locations in Applications
Location based features can be useful for a wide range of applications. A merchant's website can present the closest point of service to the current user, or offer a search facility for finding points of service near a location, often integrated with something like Google Maps. For classifieds it can make sense to sort them by distance from the searching user, and the same is true for any search for places like restaurants. Sometimes it also makes sense to only show results within a certain area around the user; in this case we need to filter by distance. Perhaps the user is looking for a new apartment and is not interested in results that are too far away from their workplace. Finally, locations can also be of interest when doing analytics. Social media data can tell you where something interesting is happening just by looking at the amount of status messages sent from a certain area.
Most of the time locations are stored as a pair of latitude and longitude, which denotes a point. The combination 48.779506, 9.170045, for example, points to the Liederhalle Stuttgart, which happens to be the venue of the Java Forum Stuttgart. Geohashes are an alternative means of encoding latitude and longitude. They can be stored with arbitrary precision, so they can also refer to a larger area instead of a single point.
When calculating a geohash the map is divided into several buckets or cells, each identified by a base32-encoded value. The complete geohash consists of a sequence of characters, where each further character identifies a bucket within the previous bucket, so you are zooming in on the location. The longer the geohash string, the more precise the location. For example, u0wt88j3jwcp is the geohash for the Liederhalle Stuttgart, while the prefix u0wt covers the area of Stuttgart and some of the surrounding cities.
The hierarchical nature of geohashes and the possibility to express them as strings make them a good choice for storing in the inverted index. You can create geohashes using the original geohash service or, more visually appealing, using the nice GeohashExplorer.
Locations in Elasticsearch
Elasticsearch accepts lat and lon for specifying latitude and longitude. These are two documents, one for a conference in Stuttgart and one for a conference in Nuremberg.
{
    "title" : "Anwendungsfälle für Elasticsearch",
    "speaker" : "Florian Hopf",
    "date" : "2014-07-17T15:35:00.000Z",
    "tags" : ["Java", "Lucene"],
    "conference" : {
        "name" : "Java Forum Stuttgart",
        "city" : "Stuttgart",
        "coordinates": {
            "lon": "9.170045",
            "lat": "48.779506"
        }
    }
}

{
    "title" : "Anwendungsfälle für Elasticsearch",
    "speaker" : "Florian Hopf",
    "date" : "2014-07-15T16:30:00.000Z",
    "tags" : ["Java", "Lucene"],
    "conference" : {
        "name" : "Developer Week",
        "city" : "Nürnberg",
        "coordinates": {
            "lon": "11.115358",
            "lat": "49.417175"
        }
    }
}
Alternatively you can use the GeoJSON format, which accepts an array of longitude and latitude. If you are like me, be prepared to hunt down why your queries aren't working, only to notice that you messed up the order in the array.
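For illustration, this is what the coordinates of the Stuttgart document would look like in array form; note that, as in GeoJSON, longitude comes first:

"conference" : {
    "coordinates" : [9.170045, 48.779506]
}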
The field needs to be mapped with the geo_point field type.
{
    "properties": {
        […],
        "conference": {
            "type": "object",
            "properties": {
                "coordinates": {
                    "type": "geo_point",
                    "geohash": "true",
                    "geohash_prefix": "true"
                }
            }
        }
    }
}
By passing the optional attribute geohash, Elasticsearch will automatically store the geohash for you as well. Depending on your use case you can also store all the parent cells of the geohash using the parameter geohash_prefix. As the values are just strings, this is a normal ngram indexing operation that stores the different prefixes of a term, e.g. u, u0, u0w and u0wt for u0wt.
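If you do not need the full precision, the mapping also accepts a geohash_precision parameter that limits the number of indexed cells. A minimal sketch, assuming a precision of 8 characters is enough for your use case:

"coordinates": {
    "type": "geo_point",
    "geohash": "true",
    "geohash_prefix": "true",
    "geohash_precision": 8
}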
With our documents in place we can now use the geo information for sorting, filtering and aggregating results.
Sorting by Distance
First, let's sort all our documents by distance from a point. This would allow us to build an application that displays the closest location for the current user.
curl -XPOST "http://localhost:9200/conferences/_search " -d'
{
"sort" : [
{
"_geo_distance" : {
"conference.coordinates" : {
"lon": 8.403697,
"lat": 49.006616
},
"order" : "asc",
"unit" : "km"
}
}
]
}'
We are requesting to sort by _geo_distance and are passing in another location, this time Karlsruhe, where I live. Results should be sorted ascending so the closer results come first. As Stuttgart is not far from Karlsruhe it will be first in the list of results.

The score for the documents will be empty. Instead there is a field sort that contains the distance of each document's location from the one provided. This can be really handy when displaying the results to the user.
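An abridged response could look like the following; the exact sort values depend on the unit and the distance calculation, so the numbers here are only illustrative:

"hits": [
    {
        [...],
        "_score": null,
        "sort": [
            62.87
        ]
    },
    [...]
]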
Filtering by Distance
For some use cases we would like to filter our results by distance. Some online real estate portals, for example, provide the option to only display results that are within a certain distance of a point. We can do the same by passing in a geo_distance filter.
curl -XPOST "http://localhost:9200/conferences/_search" -d'
{
"filter": {
"geo_distance": {
"conference.coordinates": {
"lon": 8.403697,
"lat": 49.006616
},
"distance": "200km",
"distance_type": "arc"
}
}
}'
We are again passing the location of Karlsruhe. We request that only documents within a distance of 200km are returned and that the arc distance_type is used for calculating the distance, which takes into account that we are living on a globe.

The resulting list will only contain one document, Stuttgart, as Nuremberg is just over 200km away. If we use a distance of 210km both documents will be returned.
Geo Aggregations
Elasticsearch provides several useful geo aggregations that allow you to retrieve more information on the locations of your documents, e.g. for faceting. And as we have the geohash as well as its prefixes enabled, we can retrieve all of the cells our results are in using a simple terms aggregation. This way you can let the user drill down on the results by filtering on a cell.
curl -XPOST "http://localhost:9200/conferences/_search" -d'
{
"aggregations" : {
"conference-hashes" : {
"terms" : {
"field" : "conference.coordinates.geohash"
}
}
}
}'
Depending on the precision we have chosen while indexing this will return a long list of hash prefixes, but the most important part is at the beginning.
[...]
"aggregations": {
    "conference-hashes": {
        "buckets": [
            {
                "key": "u",
                "doc_count": 2
            },
            {
                "key": "u0",
                "doc_count": 2
            },
            {
                "key": "u0w",
                "doc_count": 1
            },
            [...]
        ]
    }
}
Stuttgart and Nuremberg both share the parent cells u and u0.
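To implement the drill down you can then filter on one of the returned cells with a plain term filter, as all prefixes are indexed as terms; a sketch:

curl -XPOST "http://localhost:9200/conferences/_search" -d'
{
    "filter": {
        "term": {
            "conference.coordinates.geohash": "u0w"
        }
    }
}'

This would only return the Stuttgart document, as according to the aggregation above Nuremberg is not in cell u0w.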
As an alternative to the terms aggregation you can also use specialized geo aggregations, e.g. the geo distance aggregation for forming buckets of distances.
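A sketch of such a geo distance aggregation, again using Karlsruhe as the origin; the ranges are just example values:

curl -XPOST "http://localhost:9200/conferences/_search" -d'
{
    "aggregations" : {
        "conference-distances" : {
            "geo_distance" : {
                "field" : "conference.coordinates",
                "origin" : {
                    "lon": 8.403697,
                    "lat": 49.006616
                },
                "unit" : "km",
                "ranges" : [
                    { "to" : 100 },
                    { "from" : 100, "to" : 300 },
                    { "from" : 300 }
                ]
            }
        }
    }
}'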
Conclusion
Besides the features we have seen here Elasticsearch offers a wide range of geo features. You can index shapes and query by arbitrary polygons, either by passing them in directly or by passing a reference to an indexed polygon. When geohash prefixes are turned on you can also filter by geohash cell.
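A sketch of such a geohash cell filter, using the Liederhalle coordinates; precision and neighbors are just example values:

curl -XPOST "http://localhost:9200/conferences/_search" -d'
{
    "filter": {
        "geohash_cell": {
            "conference.coordinates": {
                "lat": 48.779506,
                "lon": 9.170045
            },
            "precision": 4,
            "neighbors": true
        }
    }
}'

The precision determines the length of the geohash prefix to match, and neighbors also includes the surrounding cells.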
With the new HTML5 location features, location-aware search and content delivery will become more important. Elasticsearch is a good fit for building these kinds of applications.
Two users in the geo space are Foursquare, a very early user of Elasticsearch, and Gild, a recruitment agency that does some magic with locations.