Lucene is an extremely rich and powerful full-text search library written in Java. You can use Lucene to provide full-text indexing across both database objects and documents in various formats (Microsoft Office documents, PDF, HTML, text, and so on). In this tutorial, we’ll go through the basics of using Lucene to add full-text search functionality to a fairly typical J2EE application: an online accommodation database. The main business object is the Hotel class. In this tutorial, a Hotel has a unique identifier, a name, a city, and a description.
Roughly, supporting full-text search using Lucene requires two steps: (1) creating a Lucene index on the documents and/or database objects and (2) parsing the user query and looking it up in the prebuilt index to answer the query. In the first part of this tutorial, we learn how to create a Lucene index. In the second part, we learn how to use the prebuilt index to answer user queries.
For your convenience, all of the code for this article’s Lucene demo is included in the lucene-tutorial.zip file. In this demo, the class Indexer in src/lucene/demo/search/Indexer.java is responsible for creating the index. The class SearchEngine in src/lucene/demo/search/SearchEngine.java is responsible for supporting user queries. The class Main in src/lucene/demo/Main.java contains test code that builds a Lucene index using a small dataset (the actual data is provided by the Hotel class stored in src/lucene/demo/business/HotelDatabase.java) and performs a simple keyword query on the data using the index. Briefly go over the two Java source files, Indexer.java and SearchEngine.java, to familiarize yourself with the overall structure of the code.
1. Creating an Index
The first step in implementing full-text searching with Lucene is to build an index.
At the heart of Lucene is an Index. You pump your data into the Index, then run searches on the Index to get results out. Document objects are stored in the Index, and it is your job to “convert” your data into Document objects and store them in the Index. That is, you read in each data item (a file, a Web document, a database tuple, or whatever), instantiate a Document for it, break the data down into chunks, and store the chunks in the Document as Field objects (name/value pairs). When you’re done building a Document, you write it to the Index using the IndexWriter. Now let us get into the details of how this is done.
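To make the idea of "pumping data into an index and searching it" concrete, here is a minimal sketch of a toy inverted index in plain Java. It is not Lucene's implementation and uses none of Lucene's classes; the class name ToyIndex, its methods, and the simple word-splitting rule are all assumptions made for illustration. A "document" here is just a map of field names to values.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Toy inverted index: illustrates the concept only, NOT how Lucene works internally.
class ToyIndex {
    // Each stored "document" is a map of field name -> field value.
    private final List<Map<String, String>> documents = new ArrayList<>();
    // Maps "fieldName:term" to the sorted ids of the documents containing that term.
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    // Stores the document, indexes every word of every field, and returns the new id.
    int addDocument(Map<String, String> fields) {
        int docId = documents.size();
        documents.add(fields);
        for (Map.Entry<String, String> field : fields.entrySet()) {
            // Simplistic tokenization (an assumption): lowercase, split on non-alphanumerics.
            String[] terms = field.getValue().toLowerCase(Locale.ROOT).split("[^a-z0-9]+");
            for (String term : terms) {
                if (term.isEmpty()) {
                    continue;
                }
                String key = field.getKey() + ":" + term;
                postings.computeIfAbsent(key, k -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    // Returns the ids of all documents whose given field contains the term.
    List<Integer> search(String fieldName, String term) {
        TreeSet<Integer> hits = postings.get(fieldName + ":" + term.toLowerCase(Locale.ROOT));
        return hits == null ? Collections.<Integer>emptyList() : new ArrayList<>(hits);
    }

    // Looks up a stored document by the id returned from addDocument.
    Map<String, String> getDocument(int docId) {
        return documents.get(docId);
    }
}
```

The point of the sketch is the division of labor: adding a document does the expensive work of breaking text into terms once, so that a later search is a simple lookup. Lucene follows the same broad pattern with far more sophisticated data structures.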
1.1 IndexWriter Class: Creating Index
To create an index, the first thing you need to do is create an IndexWriter object. The IndexWriter object is used to create the index and to add new index entries (i.e., Documents) to this index. You can create an IndexWriter as follows:
IndexWriter indexWriter = new IndexWriter("index-directory", new StandardAnalyzer(), true);
The first parameter specifies the directory in which the Lucene index will be created, which is index-directory in this case. The second parameter specifies the “document parser” or “document analyzer” that will be used when Lucene indexes your data. Here, we are using the StandardAnalyzer for this purpose. More details on Lucene analyzers follow shortly. The third parameter, when true, tells Lucene to create a new index in the directory, replacing any index that already exists there; pass false to add to an existing index instead.
1.2 Analyzer Class: Parsing the Documents
Most likely, the data that you want to index with Lucene is plain English text. The job of an Analyzer is to “parse” each field of your data into indexable “tokens” or keywords. Several types of analyzers are provided out of the box.
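As a rough illustration of what "parsing a field into tokens" can involve, the following plain-Java sketch lowercases the text, splits it on non-alphanumeric characters, and drops a few common stop words. It is a simplified stand-in, not any of Lucene's analyzers: the class name ToyAnalyzer, the stop-word list, and the splitting rule are assumptions chosen for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy analyzer: a simplified illustration, NOT a Lucene Analyzer.
class ToyAnalyzer {
    // A tiny, hypothetical stop-word list; real analyzers define their own.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "in", "of", "the", "to"));

    // Turns raw text into a list of lowercase tokens with stop words removed.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!word.isEmpty() && !STOP_WORDS.contains(word)) {
                tokens.add(word);
            }
        }
        return tokens;
    }
}
```

For example, ToyAnalyzer.analyze("The Grand Hotel in the heart of Paris") yields the tokens grand, hotel, heart, and paris. Whatever analyzer you choose, the same one should normally be applied to both the indexed text and the user's query so that the tokens line up.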
1.3 Adding a Document/object to Index
Now you need to index your documents or business objects. To index an object, you use the Lucene Document class, to which you add the fields that you want indexed. As we briefly mentioned before, a Lucene Document is basically a container for a set of indexed fields. This is best illustrated by an example:
Document doc = new Document();
doc.add(new Field("description", hotel.getDescription(), Field.Store.YES, Field.Index.TOKENIZED));
To add a field to a document, you create a new instance of the Field class. A field is made up of a name and a value (the first two parameters in the class constructor). The value may take the form of a String, or a Reader if the object to be indexed is a file.
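The mapping from a business object to name/value fields can be sketched without Lucene at all. The example below uses a hypothetical stand-in for the Hotel class with the four attributes described in the introduction; apart from getDescription() and the "description" field name, the getter names and field names ("id", "name", "city") are assumptions, and the demo's real Hotel class may differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the tutorial's Hotel business object.
class Hotel {
    private final String id;
    private final String name;
    private final String city;
    private final String description;

    Hotel(String id, String name, String city, String description) {
        this.id = id;
        this.name = name;
        this.city = city;
        this.description = description;
    }

    String getId() { return id; }
    String getName() { return name; }
    String getCity() { return city; }
    String getDescription() { return description; }
}

// Converts a Hotel into ordered name/value pairs, one per field to be indexed.
class HotelFields {
    static Map<String, String> toFields(Hotel hotel) {
        // LinkedHashMap keeps the fields in insertion order.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("id", hotel.getId());
        fields.put("name", hotel.getName());
        fields.put("city", hotel.getCity());
        fields.put("description", hotel.getDescription());
        return fields;
    }
}
```

In the Lucene version, each of these pairs would become one Field added to the Document, in the same way as the description field shown above.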