The Smart Techie was renamed Siliconindia India Edition starting Feb 2012 to continue the nearly two decade track record of excellence of our US edition.

Building next generation Search Applications

Sujay Koduri
Wednesday, January 2, 2008
‘Search’ has become an integral part of every e-commerce application, whether that means searching documentation online, searching pages on the web, or searching archived emails in your mailbox. These are just a few examples; the list goes on. Moreover, e-commerce applications have good reasons to build an in-house search engine for their proprietary data rather than depend on a third-party search tool, which may put the privacy of that data in jeopardy. The data these applications serve every day is also growing at an exponential rate. So there should be a simple and efficient way to search the archived data, the incoming data, and whatever data arrives in the future.

The Lucene search engine, a free, open source information retrieval library originally implemented in Java, tackles most of these search problems. It has since been ported to other languages, including Perl, PHP, and C++. Lucene is just an indexing and search library and does not include crawling or HTML parsing functionality. The lack of these features does not stop you from building a full-fledged search engine: the Apache project Nutch is based on Lucene and provides this functionality. Another Apache project, “Solr”, is a full-fledged search engine that is also based on Lucene.
One advantage of Lucene over other search engines is the way it stores data. Most search engines use a B-tree data structure to maintain their indexes, whereas Lucene, instead of maintaining a single index, builds multiple index segments and merges them periodically. It is smart enough to manage segment sizes on its own (for example, it may merge several smaller segments into one when updates to the index are infrequent). Lucene also has a good compression system of its own, which it uses to optimize index storage. This helps reduce disk I/O without using much CPU. It should be noted that there are other open source indexing systems, such as ‘The MG system’, that provide better performance than Lucene. Unfortunately, MG is written in C and is no longer under much active development, and it carries with it all the traditional problems of a C code base.
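
To make the segment idea concrete, here is a minimal sketch in plain Java. It is not Lucene's code or API: the class SegmentMergeSketch and its methods are hypothetical names, a "segment" is reduced to a sorted map from term to document ids, and the merge is triggered by hand rather than by any merge policy. It also assumes each document id lives in exactly one segment.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; not part of Lucene.
class SegmentMergeSketch {

    // Builds one small segment: term -> ascending list of ids of documents containing it.
    static TreeMap<String, List<Integer>> buildSegment(Map<Integer, String> docs) {
        TreeMap<String, List<Integer>> segment = new TreeMap<>();
        // Visit documents in ascending id order so every posting list stays sorted.
        for (Map.Entry<Integer, String> doc : new TreeMap<>(docs).entrySet()) {
            for (String term : doc.getValue().toLowerCase(Locale.ROOT).split("\\s+")) {
                if (term.isEmpty()) {
                    continue;
                }
                List<Integer> postings = segment.computeIfAbsent(term, t -> new ArrayList<>());
                // Skip the id if this term already occurred earlier in the same document.
                if (postings.isEmpty() || !postings.get(postings.size() - 1).equals(doc.getKey())) {
                    postings.add(doc.getKey());
                }
            }
        }
        return segment;
    }

    // Merges several segments into one larger segment.
    // Assumption: segments hold disjoint document ids, so lists can simply be concatenated.
    static TreeMap<String, List<Integer>> merge(List<TreeMap<String, List<Integer>>> segments) {
        TreeMap<String, List<Integer>> merged = new TreeMap<>();
        for (TreeMap<String, List<Integer>> segment : segments) {
            for (Map.Entry<String, List<Integer>> entry : segment.entrySet()) {
                merged.computeIfAbsent(entry.getKey(), t -> new ArrayList<>()).addAll(entry.getValue());
            }
        }
        // Restore ascending id order, whatever order the segments arrived in.
        for (List<Integer> postings : merged.values()) {
            Collections.sort(postings);
        }
        return merged;
    }

    public static void main(String[] args) {
        TreeMap<String, List<Integer>> first =
                buildSegment(Map.of(1, "lucene builds segments", 2, "segments get merged"));
        TreeMap<String, List<Integer>> second =
                buildSegment(Map.of(3, "lucene merges segments"));
        System.out.println(merge(List.of(first, second)));
    }
}
```

New documents only ever touch a small, fresh segment, and the cost of combining segments is paid later in one sequential pass. That is the general trade-off behind a segmented index, as opposed to updating a single large structure in place.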

One class in the Lucene library that needs special attention is the Analyzer class. Applications differ in the language of the documents they store and in the kinds of words they treat as generic. For example, a very simple search engine can treat the word ‘the’ as generic and choose not to index it, while a more complex search engine that supports exact phrase matching may choose to index the same word. All these nuances of parsing the data are taken care of by the Analyzer class. Analyzers are components that pre-process input text, and they are used at search time as well as at indexing time. Because the search string has to be processed in the same way as the indexed text, it is crucial to use the same Analyzer for both indexing and searching. Using different Analyzers will produce incorrect search results.
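
The effect of mismatched analysis can be shown with a small stand-in written in plain Java. This is only a sketch: AnalyzerSketch, analyze, index and search are hypothetical names and not Lucene's Analyzer API, and the "analysis" here is nothing more than lower-casing, splitting on non-alphanumeric characters, and dropping stop words.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for an analyzer plus a tiny index; not Lucene's classes.
class AnalyzerSketch {

    // Lower-cases the text, splits it on non-letter/non-digit runs, and drops stop words.
    static List<String> analyze(String text, Set<String> stopWords) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty() && !stopWords.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    // Indexes every document with the given stop-word list: term -> ids of matching documents.
    static Map<String, Set<Integer>> index(Map<Integer, String> docs, Set<String> stopWords) {
        Map<String, Set<Integer>> postings = new HashMap<>();
        for (Map.Entry<Integer, String> doc : docs.entrySet()) {
            for (String term : analyze(doc.getValue(), stopWords)) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(doc.getKey());
            }
        }
        return postings;
    }

    // Returns ids of documents that contain every analyzed query term.
    static Set<Integer> search(Map<String, Set<Integer>> postings, String query, Set<String> stopWords) {
        Set<Integer> result = null;
        for (String term : analyze(query, stopWords)) {
            Set<Integer> ids = postings.getOrDefault(term, Set.of());
            if (result == null) {
                result = new TreeSet<>(ids);
            } else {
                result.retainAll(ids);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        Set<String> stopWords = Set.of("the");
        Map<String, Set<Integer>> postings =
                index(Map.of(1, "The Matrix", 2, "Matrix algebra"), stopWords);

        // Same analysis on both sides: 'the' is dropped from the query too, so 'matrix' matches.
        System.out.println(search(postings, "the matrix", stopWords));

        // Different analysis at search time: 'the' is kept in the query but was never indexed,
        // so the same query finds nothing.
        System.out.println(search(postings, "the matrix", Set.of()));
    }
}
```

The second search fails because the query asks for a term that the indexing side deliberately threw away, not because the documents are irrelevant.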

Using Lucene
The best part about Lucene is its ease of use: in a matter of hours, you can build your own search engine. Below is an illustration of how to create, populate, and search indexes in Lucene. (Try to keep the index on a local file system for better performance.)
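
As a rough, library-free picture of these three steps, the sketch below creates an index stored in a local file, populates it, and searches it, using only the Java standard library. Everything in it is an assumption made for illustration: TinyFileIndex, its one-file text format, and its method names are hypothetical and are not Lucene's. In Lucene itself the corresponding work is done by classes such as IndexWriter and IndexSearcher, whose exact signatures vary between versions.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical toy index; class and method names are illustrative, not Lucene's.
class TinyFileIndex {
    private final Path file;  // a single text file holds the whole index
    private final TreeMap<String, TreeSet<Integer>> postings = new TreeMap<>();

    // Step 1: create the index, or reopen it if the file already has content.
    TinyFileIndex(Path file) throws IOException {
        this.file = file;
        if (Files.exists(file)) {
            for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                if (line.isEmpty()) {
                    continue;
                }
                String[] parts = line.split("\t");  // line format: term<TAB>id,id,...
                TreeSet<Integer> ids = new TreeSet<>();
                for (String id : parts[1].split(",")) {
                    ids.add(Integer.parseInt(id));
                }
                postings.put(parts[0], ids);
            }
        }
    }

    // Step 2a: populate the in-memory index with one document.
    void addDocument(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Step 2b: write the index to the local file system, replacing the old file content.
    void commit() throws IOException {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, TreeSet<Integer>> entry : postings.entrySet()) {
            StringBuilder ids = new StringBuilder();
            for (int id : entry.getValue()) {
                if (ids.length() > 0) {
                    ids.append(',');
                }
                ids.append(id);
            }
            lines.add(entry.getKey() + "\t" + ids);
        }
        Files.write(file, lines, StandardCharsets.UTF_8);
    }

    // Step 3: search for documents that contain every term of the query.
    TreeSet<Integer> search(String query) {
        TreeSet<Integer> result = null;
        for (String term : query.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (term.isEmpty()) {
                continue;
            }
            TreeSet<Integer> ids = postings.getOrDefault(term, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(ids);
            } else {
                result.retainAll(ids);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("tiny-index", ".txt");
        TinyFileIndex writer = new TinyFileIndex(file);
        writer.addDocument(1, "Lucene is a search library");
        writer.addDocument(2, "Nutch adds crawling on top of Lucene");
        writer.commit();

        TinyFileIndex reader = new TinyFileIndex(file);  // reopen from disk
        System.out.println(reader.search("lucene search"));  // prints [1]
        Files.delete(file);
    }
}
```

The write side and the read side are deliberately separate objects here, to mirror the usual pattern of building an index in one step and opening it for searching in another.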

