Dev Time: Reading term values for fields from a Lucene Index

Sometimes when using Lucene you might want to retrieve all term values for a given field. Think of categories that you want to display as search links or in a filtering dropdown box. Indexing might look something like this:

IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36));
IndexWriter writer = new IndexWriter(directory, config);

Document doc = new Document();

doc.add(new Field("Category", "Category1", Field.Store.NO, Field.Index.NOT_ANALYZED));
doc.add(new Field("Category", "Category2", Field.Store.NO, Field.Index.NOT_ANALYZED));
doc.add(new Field("Author", "Florian Hopf", Field.Store.NO, Field.Index.NOT_ANALYZED));
writer.addDocument(doc);

doc.add(new Field("Category", "Category3", Field.Store.NO, Field.Index.NOT_ANALYZED));
doc.add(new Field("Category", "Category2", Field.Store.NO, Field.Index.NOT_ANALYZED));
doc.add(new Field("Author", "Theo Tester", Field.Store.NO, Field.Index.NOT_ANALYZED));
writer.addDocument(doc);

writer.close();

We are adding two documents, one that is assigned Category1 and Category2 and one that is assigned Category2 and Category3. Note that we are adding both fields unanalyzed so the Strings are added to the index as they are. Lucenes index looks something like this afterwards:

Field	Term	Documents
Author	Florian Hopf	1
	Theo Tester	2
Category	Category1	1
	Category2	1, 2
	Category3	2

The fields are sorted alphabetically by fieldname first and then by term value. You can access the values using the IndexReaders terms() method that returns a TermEnum. You can instruct the IndexReader to start with a certain term so you can directly jump to the category without having to iterate all values. But before we do this let's look at how we are used to access Enumeration values in Java:

Enumeration en = ...;
while(en.hasMoreElements()) {
    Object obj = en.nextElement();
    ...
}

In a while-loop we are checking if there is another element and retrieve it inside the loop. As this pattern is very common when iterating the terms with Lucene you might end with something like this (Note that all the examples here are missing the stop condition. If there are more fields the terms of those fields will also be iterated):

TermEnum terms = reader.terms(new Term("Category"));
// this code is broken, don't use
while(terms.next()) {
    Term term = terms.term();
    System.out.println(term.text());
}

The next() method returns a boolean if there are more elements and points to the next element. The term() method then can be used to retrieve the Term. But this doesn't work as expected. The code only finds Category2 and Category3 but skips Category1. Why is that? The Lucene TermEnum works differently than we are used from Java Enumerations. When the TermEnum is returned it already points to the first element so with next() we skip this first element.

This snippet instead works correctly using a for loop:

TermEnum terms = reader.terms(new Term("Category"));
for(Term term = terms.term(); term != null; terms.next(), term = terms.term()) {
    System.out.println(term.text());
}

Or you can use a do while loop with a check for the first element:

TermEnum terms = reader.terms(new Term("Category"));
if (terms.term() != null) {
    do {
        Term term = terms.term();
        System.out.println(term.text());
    } while(terms.next());
}

You can't really blame Lucene for this as the methods are aptly named. It's our habits that lead to minor errors like this.

About Florian Hopf

I am working as a freelance software developer and consultant in Karlsruhe, Germany and have written a German book about Elasticsearch. If you liked this post you can follow me on Twitter or subscribe to my feed to get notified of new posts. If you think I could help you and your company and you'd like to work with me please contact me directly.

Florian Hopf Freiberuflicher Softwareentwickler

Samstag, 16. Juni 2012

Reading term values for fields from a Lucene Index

About Florian Hopf

Keine Kommentare:

Kommentar veröffentlichen

Labels

Blog-Archiv

Über mich

Florian Hopf Freiberuflicher Softwareentwickler

Samstag, 16. Juni 2012

Reading term values for fields from a Lucene Index

About Florian Hopf

Keine Kommentare:

Kommentar veröffentlichen

Labels

Blog-Archiv

Über mich

Feeds