Building a Blog Redux - Creating a Lucene.Net Index (Part 8)

Thursday, October 18, 2012

This is the eighth post in a series of posts about how I went about building my blogging application.

  1. Building a Blog Redux - Why Torture Myself (Part 1)
  2. Building a Blog Redux - The tools for the trade (Part 2)
  3. Building a Blog Redux - Entity Framework Code First (Part 3)
  4. Building a Blog Redux - Web fonts with @font-face and CSS3 (Part 4)
  5. Building a Blog Redux - Goodreads feed using Backbone.js (Part 5)
  6. Building a Blog Redux - Mapping View Models to Entities Using AutoMapper (Part 6)
  7. Building a Blog Redux - Setting Up Lucene.Net For Search (Part 7)

In the previous post in the series, I talked about how I set up Lucene.Net in my blogging application using StructureMap. In this post, I am going to continue with the search feature and write about how the indexing is set up.

What Initiates Writing an Index

Researching other implementations, I saw several different approaches to when an index is created and refreshed. I saw examples of applications reindexing when the application starts up, when the content is updated, or on demand from an admin screen. For my first implementation, I chose to keep it simple, so I just created a button on my admin site that kicks off the reindexing when clicked.

Here's the view. Pretty simple.

<h2>Manage Search Index</h2>
<div id="manage-index-panel">
    
    <input id="reindex" type="button" value="Re-index Posts" />
    <div id="index-waiting-panel" style="displaynone;">
        <img src="/Content/images/sb_wait.gif" alt="Waiting"/>
    </div>
    <div id="reindex-status-panel">
        <ul></ul>
    </div>
</div>

The event is kicked off via an AJAX call. I was thinking the reindexing would take a long time, but it is actually pretty fast, so I am not sure I needed to do this, but hey, what the heck.

var indexAdmin = function ($) {
 
    var $waitPanel = $('#index-waiting-panel'),
        $reindex = $('#reindex'),
        $status = $('#reindex-status-panel ul'),
        postUrl = '/manage/index/reindex',
        displayIndexCompletionStatus = function (items) {
            console.log('data returned', items);
            $waitPanel.css({ display: 'none' });
            $.each(items, function () {
                $status.append('<li>' + this + '</li>');
            });
        },
        init = function () {
 
            $reindex.on('click', function () {
                $waitPanel.css({ display: 'inline' });
                $.ajax({
                    url: postUrl,
                    cache: false,
                    contentType: 'application/json; charset=utf-8',
                    dataType: "json",
                    type: "POST",
                    success: function (data) {
                        displayIndexCompletionStatus(data);
                    },
                    error: function (jqXHR, textStatus, errorThrown) {
                        console.log(jqXHR);
                        console.log(textStatus);
                        console.log(errorThrown);
                        displayIndexCompletionStatus([errorThrown]);
                    }
                });
            });
        };
 
 
    return { init: init };
 
}(jQuery);
 
jQuery(document).ready(function() {
    indexAdmin.init();
});

Here is the MVC action that receives the API call.

public JsonResult Reindex()
{
    var errors = _searchIndexService.RebuildIndex();
    if (errors != null && errors.Count > 0)
    {
        var errorList = errors.Select(error => error.Exception.Message).ToList();
        return Json(new { Status = errorList.ToArray() });
    }

    var status = new { Status = new[] { "Success" } };
    return Json(status);
}

Pretty clean action. It just calls my reindexing service, and if there were any errors, it returns them as a list; otherwise it sends a success status back in the response.
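
For context, the action depends on a search index service that is injected into the controller. Based on how the action uses it, the contract looks roughly like the sketch below; the names here are my illustration of the implied shape, not necessarily the exact interface in my repository.

using System;
using System.Collections.Generic;

// Sketch of the contract implied by the Reindex action; the real interface
// and error type may be named differently in the repository.
public class IndexError
{
    public Exception Exception { get; set; }
}

public interface ISearchIndexService
{
    // Rebuilds the whole index from the database and returns any
    // errors that occurred while indexing individual posts.
    IList<IndexError> RebuildIndex();
}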

Creating the Index Document

Before I talk about the code that actually writes to the index, let me first talk about the main Lucene.Net classes that are involved in the process.

  • The Directory is a flat list of files that may only be written to once. What this means is that once a file is created, it cannot be modified; it can only be read from or deleted.
  • Documents are the primary units stored within a directory. A document has a collection of fields which specify what content to index, store, and compare with other content.
  • The IndexWriter is the class that actually writes out the index. It can also remove items from the index.
  • The Analyzer is the object that tokenizes the content so it can be searched later. There are various analyzers that vary in how sophisticated their tokenizing is. I am using the SnowballAnalyzer, which stems words down to their roots, so someone searching on "programming" would get a match on the word "program". (Note: the SnowballAnalyzer is in a different NuGet package called Lucene.Contrib.) A minimal setup sketch follows this list.
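
Here is a minimal sketch of that setup. Bear in mind that in my application the directory and analyzer are actually created and injected by StructureMap, as described in the previous post, and the class name and index path below are just illustrative.

using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Snowball;
using Lucene.Net.Store;

// Hypothetical wiring -- in the real application these objects are
// registered with StructureMap (see Part 7 of this series).
public class SearchIndexComponents
{
    public FSDirectory Directory { get; private set; }
    public Analyzer Analyzer { get; private set; }

    public SearchIndexComponents(string indexPath)
    {
        // The index lives in a folder of flat, write-once files.
        Directory = FSDirectory.Open(new DirectoryInfo(indexPath));

        // The SnowballAnalyzer stems words so a search for "programming"
        // also matches "program". (Newer Lucene.Net versions also take a
        // Lucene version argument in this constructor.)
        Analyzer = new SnowballAnalyzer("English");
    }
}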

The IndexWriter is the first thing I set up. I really like how elegantly this object was set up in SubText, so I am pretty much using the same implementation for my IndexWriter. You can see their version by downloading the source code from their Google Code repository. Essentially, every time you need to call an IndexWriter function, you need to ensure the current thread is the only one using the object during the transaction. So rather than repeating the same locking code for every transaction, SubText creates a function that takes the lock first and then executes the action you pass in against the writer.

Here is the locking function:

        private void EnsureWriterInstance()
        {
            if (writer != null) return;
 
            if (IndexWriter.IsLocked(_directory)) IndexWriter.Unlock(_directory);
 
            writer = new IndexWriter(_directory, _analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
            writer.SetMergePolicy(new LogDocMergePolicy(writer));
            writer.SetMergeFactor(5);
        }

Then every time an action needs to be executed, DoWriterAction is called.

        private void DoWriterAction(Action<IndexWriter> action)
        {
            lock (WriterLock)
            {
                EnsureWriterInstance();
            }
            action(writer);
        }

Be sure the writer is static:

private static IndexWriter writer;

The code returns the current IndexWriter if one already exists; if it does not, it creates the writer and sets the merge policy. The merge factor is a setting that determines how often the segment indices are merged. According to the documentation, "With smaller values, less RAM is used while indexing, and searches on unoptimized indices are faster, but indexing speed is slower. With larger values, more RAM is used during indexing, and searches on unoptimized indices are slower, but indexing is faster."

For my blog site, I would like the user to be able to search on the title, content, tags, and meta description, but I would also like to store other things in the document that I can use when displaying the search results, like the publishing date and the slug for the link. Conversely, some fields I just want to index but not store, like the post content.

//create the fields
            var postId = new Field(
                PostId,
                NumericUtils.IntToPrefixCoded(post.Id),
                Field.Store.YES,
                Field.Index.NOT_ANALYZED,
                Field.TermVector.NO);
 
            var title = new Field(
                Title,
                post.Title,
                Field.Store.YES,
                Field.Index.ANALYZED,
                Field.TermVector.YES);
 
            var body = new Field(
                Body,
                post.PostContent,
                Field.Store.NO,
                Field.Index.ANALYZED,
                Field.TermVector.YES);
 
            var tags = new Field(
                Tags,
                tagDelimited.ToString(),
                Field.Store.NO,
                Field.Index.ANALYZED,
                Field.TermVector.YES);
 
            var metaDescription = new Field(
                MetaDescription,
                post.Description ?? string.Empty,
                Field.Store.NO,
                Field.Index.ANALYZED,
                Field.TermVector.YES
                );
 
            var published = new Field(
                IsPublished,
                post.IsPublished.ToString(),
                Field.Store.NO,
                Field.Index.NOT_ANALYZED,
                Field.TermVector.NO);
 
            var slug = new Field(
                Slug,
                post.Slug,
                Field.Store.YES,
                Field.Index.NO,
                Field.TermVector.NO);
 
            var pubdate = new Field(
                DatePublished,
                pubDate,
                Field.Store.YES,
                Field.Index.NOT_ANALYZED,
                Field.TermVector.NO);

So each document has a collection of fields, and for each field I specify the following:

  • The key (name) of the field.
  • The value of the field.
  • Whether or not I want to store the raw value in the field for later use.
  • Whether or not I want to index the value of this field so it can be searched.
  • Whether or not (and how) I want the field to use term vectors.

I can also weight some fields so that I get a higher match on a word if, for example, it is located in the title versus the body. I do this by calling the SetBoost function and giving it a weighting factor.

So once my fields are created and weighted, I can add them to a document.

            //boost some of the entries
            title.SetBoost(4);
            tags.SetBoost(2);
            body.SetBoost(1);
            slug.SetBoost(2);
            metaDescription.SetBoost(1);
 
            //add the fields to the document
            doc.Add(postId);
            doc.Add(title);
            doc.Add(body);
            doc.Add(tags);
            doc.Add(published);
            doc.Add(pubdate);
            doc.Add(slug);
            doc.Add(metaDescription);

Once the document is prepared, you add it to the index writer, which holds the collection of documents.

 DoWriterAction(indexWriter => indexWriter.AddDocument(CreateDocument(currentPost)));

Once all the documents are added, the Commit function is called, which flushes all of the pending changes to the index.

DoWriterAction(
                indexWriter =>
                    {
                        indexWriter.Commit();
                        if (optimize) indexWriter.Optimize();
                    });

The Optimize function on the IndexWriter reorganizes the index segments so that the index can be queried faster.
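
To tie the pieces together, the rebuild itself might look something like the sketch below. This is a simplified illustration that reuses the hypothetical IndexError type from earlier; the post repository call is a placeholder, and the real method in my repository may differ in the details.

public IList<IndexError> RebuildIndex()
{
    var errors = new List<IndexError>();

    // Hypothetical repository call that returns the posts to index.
    foreach (var currentPost in _postRepository.GetPublishedPosts())
    {
        try
        {
            // Each blog post becomes one Lucene document in the index.
            DoWriterAction(indexWriter => indexWriter.AddDocument(CreateDocument(currentPost)));
        }
        catch (Exception ex)
        {
            errors.Add(new IndexError { Exception = ex });
        }
    }

    // Flush the pending documents and optimize the segments for faster queries.
    DoWriterAction(
        indexWriter =>
            {
                indexWriter.Commit();
                indexWriter.Optimize();
            });

    return errors;
}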

Conclusion:

So now I have functionality where, when I click a button, the current index is refreshed with the latest blog content from the database. In my next post in this series, I will show you how I query this index.

As always you can check out the code at my GitHub repository.
