Friday, December 7, 2012

Looking at a Plaintext Lucene Index

The Lucene file format is one of the reasons why Lucene is as fast as it is. An index consists of several binary files that you can't really inspect unless you use a tool like the fantastic Luke.

Starting with Lucene 4, the format of these files can be configured using the Codec API. Several implementations ship with the release, among them the SimpleTextCodec, which writes the files in plaintext for learning and debugging purposes.

To configure the Codec you just set it on the IndexWriterConfig:

StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
// recreate the index on each execution
config.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
config.setCodec(new SimpleTextCodec());

The rest of the indexing process stays exactly the same as it used to be:

Directory luceneDir = FSDirectory.open(plaintextDir);
try (IndexWriter writer = new IndexWriter(luceneDir, config)) {
    writer.addDocument(Arrays.asList(
            new TextField("title", "The title of my first document", Store.YES),
            new TextField("content", "The content of the first document", Store.NO)));

    writer.addDocument(Arrays.asList(
            new TextField("title", "The title of the second document", Store.YES),
            new TextField("content", "And this is the content", Store.NO)));
}

After running this code, the index directory contains several files. They are not the same types of files that the default codec creates.

ls /tmp/lucene-plaintext/
_1_0.len  _1_1.len  _1.fld  _1.inf  _1.pst  _1.si  segments_2  segments.gen

The segments_x file is the starting point (x depends on the number of times you have written to the index before and starts with 1). It is still a binary file, but it records the name of the Codec that was used to write each segment.
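
If you want to find the current starting point in a directory listing yourself, the following is a minimal sketch. It is not Lucene code: the class SegmentsFiles and its method are made up for this post, and it assumes that the suffix after "segments_" is the generation written in base 36, which is how Lucene 4 names these files as far as I can tell.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of Lucene: finds the newest segments_N file name. */
final class SegmentsFiles {

    private static final String PREFIX = "segments_";

    /** Returns the highest generation among the given file names, or -1 if none match. */
    static long latestGeneration(List<String> fileNames) {
        long latest = -1;
        for (String name : fileNames) {
            if (name.startsWith(PREFIX)) {
                // Assumption: the suffix is the generation in base 36 (Character.MAX_RADIX).
                long generation = Long.parseLong(
                        name.substring(PREFIX.length()), Character.MAX_RADIX);
                latest = Math.max(latest, generation);
            }
        }
        return latest;
    }

    public static void main(String[] args) {
        // File names taken from the directory listing above.
        List<String> listing = Arrays.asList(
                "_1_0.len", "_1_1.len", "_1.fld", "_1.inf", "_1.pst", "_1.si",
                "segments_2", "segments.gen");
        System.out.println("latest generation: " + latestGeneration(listing));
    }
}
```

For the listing above this reports generation 2. The segments.gen file is skipped because it does not carry a generation suffix.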

The rest of the index files are all plaintext. They do not contain the same information as their binary cousins. For example, the .pst file represents the complete postings list, the structure you normally mean when talking about an inverted index:

field content
  term content
    doc 0
      freq 1
      pos 1
    doc 1
      freq 1
      pos 4
  term document
    doc 0
      freq 1
      pos 5
  term first
    doc 0
      freq 1
      pos 4
field title
  term document
    doc 0
      freq 1
      pos 5
    doc 1
      freq 1
      pos 5
  term first
    doc 0
      freq 1
      pos 4
  term my
    doc 0
      freq 1
      pos 3
  term second
    doc 1
      freq 1
      pos 4
  term title
    doc 0
      freq 1
      pos 1
    doc 1
      freq 1
      pos 1
END
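
To make the structure of this listing more tangible, here is a minimal sketch that reads such lines back into nested maps (field, then term, then postings). It is not how Lucene reads the file. The class PlainPostings and everything in it are hypothetical, and it only understands the line types that appear in the listing above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of Lucene: reads plaintext postings like the ones above. */
final class PlainPostings {

    /** One "doc" entry below a term: the document id and the term positions in it. */
    static final class Posting {
        final int docId;
        final List<Integer> positions = new ArrayList<>();

        Posting(int docId) {
            this.docId = docId;
        }
    }

    /** Parses .pst-style lines into field -> term -> postings, keeping the file order. */
    static Map<String, Map<String, List<Posting>>> parse(List<String> lines) {
        Map<String, Map<String, List<Posting>>> index = new LinkedHashMap<>();
        Map<String, List<Posting>> terms = null;
        List<Posting> postings = null;
        Posting current = null;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.equals("END")) {
                break;
            } else if (line.startsWith("field ")) {
                terms = new LinkedHashMap<>();
                index.put(line.substring(6), terms);
            } else if (line.startsWith("term ")) {
                postings = new ArrayList<>();
                terms.put(line.substring(5), postings);
            } else if (line.startsWith("doc ")) {
                current = new Posting(Integer.parseInt(line.substring(4)));
                postings.add(current);
            } else if (line.startsWith("pos ")) {
                current.positions.add(Integer.parseInt(line.substring(4)));
            }
            // "freq" lines are skipped: in the listing the frequency
            // matches the number of "pos" lines that follow it.
        }
        return index;
    }

    public static void main(String[] args) {
        // A shortened copy of the listing above.
        Map<String, Map<String, List<Posting>>> index = parse(Arrays.asList(
                "field content",
                "  term content",
                "    doc 0",
                "      freq 1",
                "      pos 1",
                "    doc 1",
                "      freq 1",
                "      pos 4",
                "END"));
        for (Posting posting : index.get("content").get("content")) {
            System.out.println("doc " + posting.docId + " positions " + posting.positions);
        }
    }
}
```

Looking up a term is then just two map lookups, which is the essence of an inverted index: from the term you get directly to the documents that contain it. The sketch assumes well-formed input and does no error handling.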

The content that is marked as stored resides in the .fld file:

doc 0
  numfields 1
  field 0
    name title
    type string
    value The title of my first document
doc 1
  numfields 1
  field 0
    name title
    type string
    value The title of the second document
END
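
The stored fields can be read back in the same way. Again, this is only a sketch under the assumption that the file looks exactly like the listing above. PlainStoredFields is a hypothetical class and not the reader Lucene uses.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of Lucene: reads plaintext stored fields like the ones above. */
final class PlainStoredFields {

    /** Parses .fld-style lines into docId -> (field name -> stored value). */
    static Map<Integer, Map<String, String>> parse(List<String> lines) {
        Map<Integer, Map<String, String>> docs = new LinkedHashMap<>();
        Map<String, String> fields = null;
        String name = null;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.equals("END")) {
                break;
            } else if (line.startsWith("doc ")) {
                fields = new LinkedHashMap<>();
                docs.put(Integer.parseInt(line.substring(4)), fields);
            } else if (line.startsWith("name ")) {
                name = line.substring(5);
            } else if (line.startsWith("value ")) {
                // Simplification: a repeated field name overwrites the earlier value.
                fields.put(name, line.substring(6));
            }
            // "numfields", "field" and "type" lines are ignored in this sketch.
        }
        return docs;
    }

    public static void main(String[] args) {
        // The first document of the listing above.
        Map<Integer, Map<String, String>> docs = parse(Arrays.asList(
                "doc 0",
                "  numfields 1",
                "  field 0",
                "    name title",
                "    type string",
                "    value The title of my first document",
                "END"));
        System.out.println(docs.get(0).get("title"));
    }
}
```

Only the title shows up here because the content field was indexed with Store.NO, so its text is not kept in the stored fields.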

If you'd like to have a look at the rest of the files, check out the code on GitHub.

The SimpleTextCodec is only an interesting byproduct. The Codec API can be used for a lot of useful things. For example, the ability to read indices of older Lucene versions is implemented using separate codecs. Also, you can mix several Codecs in one index, so reindexing after a version update should not be necessary immediately. I am sure more useful codecs will pop up in the future.

About Florian Hopf

I am working as a freelance software developer and consultant in Karlsruhe, Germany and have written a German book about Elasticsearch. If you liked this post you can follow me on Twitter or subscribe to my feed to get notified of new posts. If you think I could help you and your company and you'd like to work with me please contact me directly.
