Friday, March 14, 2014

Building a Navigation from a Search Index

Though the classic use case for search engines is keyword search, nowadays they are often used to serve content that doesn't resemble a search result list at all. As they are optimized for read access, they make good candidates for serving any part of a website that needs a dynamic query mechanism. Faceting is most often used to display filters for refining search results, but it can also be used to build hierarchical navigations from results. I am using Solr in this post, but the concept can also be used with Elasticsearch or even plain Lucene.

Browsing Search Results

What we are trying to achieve can often be seen on e-commerce sites but is also useful for serving content. You will be familiar with the navigation that Amazon provides: you can browse the categories that are displayed in a hierarchical navigation.

Of course I am not familiar with the implementation details of how they store their content, but search engines can be used to build something like this. You can index different types (editorial content and products) and tag them with categories. The navigation and the page content are then built from the search results that are returned.

For a simple example we index only products: two books and one record. The two books by Haruki Murakami are in the categories Books/Fiction and Books/Non-Fiction, respectively. The record by 13 & God is in the category Music/Downbeat. The resulting navigation, with the author or artist as a third level, should then look something like this:

  • Books
    • Non-Fiction
      • Haruki Murakami
    • Fiction
      • Haruki Murakami
  • Music
    • Downbeat
      • 13 & God

PathHierarchyTokenizer

Lucene provides the PathHierarchyTokenizer, which can be used to split path-like hierarchies. It takes a path as input and creates segments from it. For example, when indexing Books/Non-Fiction/Haruki Murakami it will emit the following tokens:

  • Books
  • Books/Non-Fiction
  • Books/Non-Fiction/Haruki Murakami

What is important: It doesn't just split the string on a delimiter but creates a real hierarchy with all the parent paths. This can be used to build our navigation.
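
To make the token output concrete, here is a minimal Python sketch that mimics what the tokenizer emits for a simple path. It is not the Lucene implementation; the function name path_hierarchy_tokens is made up for illustration, and it assumes paths without leading or trailing delimiters.

```python
def path_hierarchy_tokens(path, delimiter="/"):
    """Return every parent path of `path`, shortest first, plus the path itself."""
    parts = path.split(delimiter)
    return [delimiter.join(parts[:depth]) for depth in range(1, len(parts) + 1)]


# Prints one token per line, from the top-level category down to the full path.
for token in path_hierarchy_tokens("Books/Non-Fiction/Haruki Murakami"):
    print(token)
```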

Solr

Let's see an example with Solr. The configuration and a unit test are also available on GitHub.

We are using a very simple schema with documents that only contain a title and a category:

<fields>
    <field name="title" type="string" indexed="true" stored="true" required="true" multiValued="false" /> 
    <field name="category" type="category" indexed="true" stored="false"/>
</fields>

The title is just a string field but the category is a custom field that uses the PathHierarchyTokenizerFactory.

<fieldType name="category" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="/"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
    </analyzer>
</fieldType>

When indexing data, the path is split according to the rules of the PathHierarchyTokenizer. When querying, we take the query term as it is (the KeywordTokenizer emits the whole input as a single token), so it has to match one of the indexed path tokens exactly.
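
The effect of the two analyzers can be sketched in a few lines of Python. This is only a model of the matching behaviour, not how Solr evaluates queries; indexed_tokens and matches are hypothetical helper names.

```python
def indexed_tokens(path, delimiter="/"):
    # Index side: the path itself and all of its parent paths become tokens.
    parts = path.split(delimiter)
    return {delimiter.join(parts[:depth]) for depth in range(1, len(parts) + 1)}


def matches(query, path):
    # Query side: the query stays a single token and must equal an indexed token.
    return query in indexed_tokens(path)


doc_category = "Books/Fiction/Haruki Murakami"
print(matches("Books", doc_category))          # top-level category
print(matches("Books/Fiction", doc_category))  # intermediate path
print(matches("Fiction", doc_category))        # inner segment on its own
```

A filter on Books or Books/Fiction finds the document, while Fiction alone does not, because no such token was ever indexed.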

Suppose we are now indexing three documents that are in the three categories we have seen above:

URL=http://localhost:8082/solr/update/json
curl $URL -H 'Content-type:application/json' -d '
[
  {"title":"What I Talk About When I Talk About Running", "category":"Books/Non-Fiction/Haruki Murakami"}, 
  {"title":"South of the Border, West of the Sun", "category":"Books/Fiction/Haruki Murakami"}, 
  {"title":"Own Your Ghost", "category":"Music/Downbeat/13&God"}
]'
curl "$URL?commit=true"

We can easily request the navigation using facets. We query for all documents but, with rows=0, return none of them. The response contains a facet that holds our navigation structure:

curl "http://localhost:8082/solr/collection1/select?q=*%3A*&rows=0&wt=json&indent=true&facet=true&facet.field=category"
{
  "responseHeader":{
    "status":0,
    "QTime":1},
  "response":{"numFound":3,"start":0,"docs":[]
  },
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{
      "category":[
        "Books",2,
        "Books/Fiction",1,
        "Books/Fiction/Haruki Murakami",1,
        "Books/Non-Fiction",1,
        "Books/Non-Fiction/Haruki Murakami",1,
        "Music",1,
        "Music/Downbeat",1,
        "Music/Downbeat/13&God",1]},
    "facet_dates":{},
    "facet_ranges":{}}}

When displaying the navigation you can now simply split the paths according to their hierarchies. The queries that are executed for displaying the content contain a filter query that restricts the results to the currently selected navigation item.

curl "http://localhost:8082/solr/collection1/select?q=*%3A*&wt=json&fq=category:Books"
{
  "responseHeader":{
    "status":0,
    "QTime":27},
  "response":{"numFound":2,"start":0,"docs":[
      {
        "title":"What I Talk About When I Talk About Running"},
      {
        "title":"South of the Border, West of the Sun"}]
  }}
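
Coming back to the category facet from the navigation request: its flat list alternates between path and count. The following Python sketch shows one way it could be turned into a nested structure for rendering. The names facets_to_tree and render and the dictionary layout are my own assumptions, not part of Solr.

```python
def facets_to_tree(flat_facets, delimiter="/"):
    """Turn Solr's flat [path, count, path, count, ...] list into nested dicts."""
    tree = {}
    for path, count in zip(flat_facets[0::2], flat_facets[1::2]):
        children = tree
        node = None
        # Walk down the path, creating missing levels on the way.
        for part in path.split(delimiter):
            node = children.setdefault(part, {"count": 0, "children": {}})
            children = node["children"]
        node["count"] = count
    return tree


def render(tree, depth=0):
    """Return the tree as indented text lines, one per navigation entry."""
    lines = []
    for name, node in tree.items():
        lines.append("  " * depth + f"{name} ({node['count']})")
        lines.extend(render(node["children"], depth + 1))
    return lines


category_facet = [
    "Books", 2,
    "Books/Fiction", 1,
    "Books/Fiction/Haruki Murakami", 1,
    "Books/Non-Fiction", 1,
    "Books/Non-Fiction/Haruki Murakami", 1,
    "Music", 1,
    "Music/Downbeat", 1,
    "Music/Downbeat/13&God", 1,
]
print("\n".join(render(facets_to_tree(category_facet))))
```

The sketch does not rely on parents appearing before their children in the facet list, so it also works when the facet is sorted by count.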

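Using tags and exclusions you can even build the navigation with the same request that queries for the filtered documents. The filter query is tagged (here as cat) and the facet excludes that tag, so the facet counts are computed as if the filter were not applied: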

curl "http://localhost:8082/solr/collection1/select?q=*%3A*&wt=json&fq=%7B%21tag%3Dcat%7Dcategory:Books&facet=true&facet.field=%7B%21ex%3Dcat%7Dcategory"
{
  "responseHeader":{
    "status":0,
    "QTime":2},
  "response":{"numFound":2,"start":0,"docs":[
      {
        "title":"What I Talk About When I Talk About Running"},
      {
        "title":"South of the Border, West of the Sun"}]
  },
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{
      "category":[
        "Books",2,
        "Books/Fiction",1,
        "Books/Fiction/Haruki Murakami",1,
        "Books/Non-Fiction",1,
        "Books/Non-Fiction/Haruki Murakami",1,
        "Music",1,
        "Music/Downbeat",1,
        "Music/Downbeat/13&God",1]},
    "facet_dates":{},
    "facet_ranges":{}}}

If the navigation you are retrieving from the search engine is your single source of truth, you might also have to add a sort mechanism; by default all facet values are sorted by their count. This works in our simple case but not in the real world. To have them sorted in a defined way you can add numeric identifiers. You would then index paths like 100|Books/110|Non-Fiction/111|Haruki Murakami and request sorting by indexed term via facet.sort=index. As this order is lexicographic, the identifiers should all have the same number of digits. When displaying the result you just remove the identifier at the front of each path segment.
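
Removing the identifiers for display could look like the following Python sketch. The function name strip_sort_keys is hypothetical, and it assumes that the | character never occurs in the identifier itself.

```python
def strip_sort_keys(path, delimiter="/", separator="|"):
    """Drop the 'id|' prefix from every segment of a category path."""
    segments = path.split(delimiter)
    # Splitting once on the separator keeps segments without a prefix unchanged.
    return delimiter.join(segment.split(separator, 1)[-1] for segment in segments)


print(strip_sort_keys("100|Books/110|Non-Fiction/111|Haruki Murakami"))
```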

As you are now using the search engine to build the navigation, you immediately get the benefits of its filtering mechanisms. Only want to show categories that have online articles? Add a filter query fq=online:true. Want to make sure that there are no categories with products that are out of stock? fq=inStock:true.
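
A request that combines the tagged category filter from above with such additional filter queries could be assembled like this. It is a sketch under a few assumptions: the function name navigation_request is made up, and instead of the category:Books syntax I am assuming Solr's term query parser ({!term f=category}) so that slashes and spaces in deeper paths need no escaping. For a top-level category both forms should select the same documents.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

SELECT_URL = "http://localhost:8082/solr/collection1/select"


def navigation_request(selected_path, extra_filters=()):
    """Build a select URL that returns the filtered documents and the full navigation."""
    params = [
        ("q", "*:*"),
        ("wt", "json"),
        # Tagged category filter: restricts the documents that are returned.
        ("fq", "{!term f=category tag=cat}" + selected_path),
        ("facet", "true"),
        # The facet ignores the tagged filter, so all categories stay visible.
        ("facet.field", "{!ex=cat}category"),
    ]
    # Untagged filters apply to both the documents and the facet counts.
    params.extend(("fq", extra) for extra in extra_filters)
    return SELECT_URL + "?" + urlencode(params)


url = navigation_request("Books/Fiction", extra_filters=["inStock:true"])
print(url)
# Decode the query string again to show the readable filter queries.
print(parse_qs(urlsplit(url).query)["fq"])
```

Letting urlencode do the escaping avoids hand-written sequences like %7B%21 in the URL.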

Conclusion

Search engines offer great flexibility when delivering content. A lot of their functionality can be used to build applications that pull most of their content from the index.

About Florian Hopf

I am working as a freelance software developer and consultant in Karlsruhe, Germany and have written a German book about Elasticsearch. If you liked this post you can follow me on Twitter or subscribe to my feed to get notified of new posts. If you think I could help you and your company and you'd like to work with me please contact me directly.
