Introduction to Lucene

Lucene is an extremely rich and powerful full-text search library written in Java. You can use Lucene to provide full-text indexing across both database objects and documents in various formats (Microsoft Office documents, PDF, HTML, text, and so on). In this tutorial, we’ll go through the basics of using Lucene to add full-text search functionality to a fairly typical J2EE application: an online accommodation database. The main business object is the Hotel class. In this tutorial, a Hotel has a unique identifier, a name, a city, and a description.

Roughly, supporting full-text search using Lucene requires two steps: (1) creating a Lucene index on the documents and/or database objects and (2) parsing the user query and looking it up in the prebuilt index to answer the query. In the first part of this tutorial, we learn how to create a Lucene index. In the second part, we learn how to use the prebuilt index to answer user queries.

For your convenience, all of the code for this article’s Lucene demo is included in the lucene-tutorial.zip file. In this demo, the class Indexer in src/lucene/demo/search/Indexer.java is responsible for creating the index. The class SearchEngine in src/lucene/demo/search/SearchEngine.java is responsible for supporting user queries. The class Main in src/lucene/demo/Main.java contains test code that builds a Lucene index using a small dataset (the actual data is provided by the Hotel class stored in src/lucene/demo/business/HotelDatabase.java) and performs a simple keyword query on the data using the index. Briefly go over the two Java source files, Indexer.java and SearchEngine.java, to familiarize yourself with the overall structure of the code.

1. Creating an Index

The first step in implementing full-text searching with Lucene is to build an index.
At the heart of Lucene is an Index. You pump your data into the Index, then run searches against the Index to get results out. Document objects are stored in the Index, and it is your job to “convert” your data into Document objects and store them in the Index. That is, you read in each data file (or Web document, database tuple, or whatever), instantiate a Document for it, break the data down into chunks, and store the chunks in the Document as Field objects (name/value pairs). When you’re done building a Document, you write it to the Index using the IndexWriter. Now let us get into the details of how this is done.
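
To make the roles of the Index, Document, and Field concrete, here is a minimal plain-Java sketch of the same pipeline. ToyDocument and ToyIndex are hypothetical stand-ins written for this illustration, not Lucene classes, and the hotel names are made-up sample data. The real library does far more (analysis, scoring, on-disk storage), so treat this only as a mental model.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for a document: a container of name/value fields.
class ToyDocument {
    final Map<String, String> fields = new LinkedHashMap<>();

    ToyDocument add(String name, String value) {
        fields.put(name, value);
        return this;
    }
}

// Hypothetical stand-in for an index: stores documents and maps terms to them.
class ToyIndex {
    private final List<ToyDocument> documents = new ArrayList<>();
    // term -> ids of the documents containing that term (an inverted index)
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Plays the role of the index writer: store the document, index its fields.
    void addDocument(ToyDocument doc) {
        int id = documents.size();
        documents.add(doc);
        for (String value : doc.fields.values()) {
            for (String term : value.toLowerCase(Locale.ROOT).split("\\W+")) {
                if (!term.isEmpty()) {
                    postings.computeIfAbsent(term, k -> new TreeSet<>()).add(id);
                }
            }
        }
    }

    // Looks up a single keyword and returns the matching documents.
    List<ToyDocument> search(String keyword) {
        String key = keyword.toLowerCase(Locale.ROOT);
        List<ToyDocument> hits = new ArrayList<>();
        for (int id : postings.getOrDefault(key, Collections.emptySet())) {
            hits.add(documents.get(id));
        }
        return hits;
    }
}

class ToyIndexDemo {
    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.addDocument(new ToyDocument().add("name", "Hotel Rivoli").add("city", "Paris"));
        index.addDocument(new ToyDocument().add("name", "Grand Hotel").add("city", "Rome"));
        for (ToyDocument hit : index.search("paris")) {
            System.out.println(hit.fields.get("name"));
        }
    }
}
```

Note the two phases: every addDocument call does the indexing work up front, so that search only has to consult the prebuilt term map.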

1.1 IndexWriter Class: Creating Index

To create an index, the first thing you need to do is create an IndexWriter object. The IndexWriter object is used to create the index and to add new index entries (i.e., Documents) to this index. You can create an IndexWriter as follows:

IndexWriter indexWriter = new IndexWriter("index-directory", new StandardAnalyzer(), true);

The first parameter specifies the directory in which the Lucene index will be created, which is index-directory in this case. The second parameter specifies the “document parser” or “document analyzer” that Lucene will use when it indexes your data. Here, we are using the StandardAnalyzer for this purpose. More details on Lucene analyzers follow shortly. The third parameter, true, tells Lucene to create a new index in the directory, replacing any index that already exists there; passing false instead appends to an existing index.

1.2 Analyzer Class: Parsing the Documents
Most likely, the data that you want Lucene to index is plain English text. The job of an Analyzer is to “parse” each field of your data into indexable “tokens” or keywords. Several types of analyzers are provided out of the box.
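
As a rough illustration of what “parsing a field into tokens” means, the sketch below lowercases the text, splits it on anything that is not a letter or digit, and drops a few common words. ToyAnalyzer and its stop-word list are assumptions made up for this example; this is not how Lucene's StandardAnalyzer is implemented, only the general idea.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical toy analyzer; not a Lucene class.
class ToyAnalyzer {
    // A tiny hand-picked stop-word list (an assumption for this sketch).
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "in", "of", "the", "to"));

    // Turns raw field text into a list of indexable tokens.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!raw.isEmpty() && !STOP_WORDS.contains(raw)) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [charming, hotel, heart, paris]
        System.out.println(tokenize("A charming hotel in the heart of Paris"));
    }
}
```

Because the same normalization has to be applied to the user's query at search time, the choice of analyzer affects both indexing and querying.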

1.3 Adding a Document/object to Index
Now you need to index your documents or business objects. To index an object, you use the Lucene Document class, to which you add the fields that you want indexed. As we briefly mentioned before, a Lucene Document is basically a container for a set of indexed fields. This is best illustrated by an example:

Document doc = new Document();
doc.add(new Field("description", hotel.getDescription(), Field.Store.YES, Field.Index.TOKENIZED));

To add a field to a document, you create a new instance of the Field class. A field is made up of a name and a value (the first two parameters in the class constructor). The value may take the form of a String, or a Reader if the object to be indexed is a file.
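
Before writing the Lucene calls, it can help to decide which attributes of the business object become which fields. The plain-Java sketch below maps a hotel's four attributes (identifier, name, city, description) to name/value pairs. HotelSketch and HotelFieldsSketch are hypothetical classes for illustration, and apart from getDescription() the accessor names and the field names are assumptions; the sample hotel is invented.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the tutorial's Hotel business object.
class HotelSketch {
    private final String id;
    private final String name;
    private final String city;
    private final String description;

    HotelSketch(String id, String name, String city, String description) {
        this.id = id;
        this.name = name;
        this.city = city;
        this.description = description;
    }

    String getId() { return id; }
    String getName() { return name; }
    String getCity() { return city; }
    String getDescription() { return description; }
}

class HotelFieldsSketch {
    // One name/value pair per attribute we want to search on or display.
    static Map<String, String> toFields(HotelSketch hotel) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("id", hotel.getId());
        fields.put("name", hotel.getName());
        fields.put("city", hotel.getCity());
        fields.put("description", hotel.getDescription());
        return fields;
    }

    public static void main(String[] args) {
        HotelSketch hotel = new HotelSketch("1", "Hotel Rivoli", "Paris",
                "A small hotel close to the river.");
        // Each entry corresponds to one Field that would be added to the Document.
        toFields(hotel).forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

With Lucene itself, each of these pairs would become one doc.add(new Field(...)) call like the one shown above.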

Log4J

Logging, in the context of program development, means inserting statements into the program that produce output useful to the developer. Inserting log statements into your code is a general method for debugging it. Examples of logging are trace statements, dumps of data structures, and the familiar System.out.println or printf debug statements.

Log4j is an open source logging tool, written in Java, for putting log statements into your application. It can log to a file, a java.io.Writer, or a syslog daemon. Log4j offers a hierarchical way to insert logging statements within a Java program. Multiple output formats and multiple levels of logging information are available.

One of the distinctive features of log4j is the notion of inheritance in loggers. Using a logger hierarchy, it is possible to control which log statements are output at arbitrarily fine granularity, yet with great ease. This helps reduce the volume of logged output and minimize the cost of logging.
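
The idea of level inheritance can be tried without any extra library. The sketch below uses java.util.logging, the logging API bundled with the JDK, rather than log4j itself, because it runs as-is; its dot-separated logger names form a parent/child hierarchy in a comparable way. The logger names are assumptions borrowed from the demo's package names.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

class LoggerHierarchySketch {
    // Keep strong references: java.util.logging only holds loggers weakly.
    static final Logger PARENT = Logger.getLogger("lucene.demo");
    static final Logger CHILD = Logger.getLogger("lucene.demo.search");

    public static void main(String[] args) {
        PARENT.setLevel(Level.WARNING);
        CHILD.setLevel(null); // no level of its own: inherit from "lucene.demo"

        System.out.println(CHILD.isLoggable(Level.INFO));   // false (inherited WARNING)
        System.out.println(CHILD.isLoggable(Level.SEVERE)); // true

        // Setting a level on the child overrides the inherited one for that subtree only.
        CHILD.setLevel(Level.FINE);
        System.out.println(CHILD.isLoggable(Level.INFO));   // true
        System.out.println(PARENT.isLoggable(Level.INFO));  // false
    }
}
```

One setting on a parent logger therefore quiets or enables a whole package, while a single noisy class can still be tuned individually.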

By using a dedicated logging package, the overhead of maintaining thousands of System.out.println statements is alleviated as the logging may be controlled at runtime from configuration scripts.

Its speed and flexibility allow log statements to remain in shipped code while giving the user the ability to enable logging at runtime without modifying the application binary, all without incurring a high performance cost.

Logging does have its drawbacks. It can slow down an application. If too verbose, it can cause scrolling blindness. To alleviate these concerns, log4j is designed to be reliable, fast and extensible. Since logging is rarely the main focus of an application, the log4j API strives to be simple to understand and to use.
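
A common way to keep disabled log statements cheap is to avoid building the message at all when its level is switched off. The sketch below shows this with java.util.logging from the JDK, not with log4j; in log4j 1.x the comparable guard is a check such as logger.isDebugEnabled(). The class, the logger name, and the describeIndex() helper are hypothetical and exist only to count how often the costly message is built.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.logging.Level;
import java.util.logging.Logger;

class CheapDisabledLoggingSketch {
    static final Logger LOG = Logger.getLogger("lucene.demo.indexer");
    // Counts how many times the expensive message was actually built.
    static final AtomicInteger expensiveCalls = new AtomicInteger();

    // Stand-in for a costly computation needed only for a debug message.
    static String describeIndex() {
        expensiveCalls.incrementAndGet();
        return "index summary";
    }

    static void indexOneBatch() {
        // Explicit guard: skip the string building when FINE is disabled.
        if (LOG.isLoggable(Level.FINE)) {
            LOG.fine("state: " + describeIndex());
        }
        // Same effect with a supplier: it is invoked only if FINE is enabled.
        LOG.fine(() -> "state: " + describeIndex());
    }

    public static void main(String[] args) {
        LOG.setLevel(Level.INFO);
        indexOneBatch();
        System.out.println(expensiveCalls.get()); // prints 0

        LOG.setLevel(Level.FINE);
        indexOneBatch();
        System.out.println(expensiveCalls.get()); // prints 2
    }
}
```

With the level at INFO, neither statement pays for describeIndex(), which is why such statements can stay in shipped code.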

Most enterprise architectures adopt a single, system-wide logging framework. The logging framework provides classes and interfaces for logging messages, errors, and debug statements to destinations such as files, databases, e-mail, and consoles throughout the system. The logged information is useful to an end user of an application or to a system administrator. Internationalization in a logging framework is the ability to log messages in different languages. All logging frameworks will be rated on internationalization and extensibility. The following logging frameworks are available:
• Java™ Logging
• Jlog
• Log4j
• Protomatter
• BEA mechanism for logging (technically not a full-featured framework)
• Message LFW
