Posted by: semeru2007 on: October 17, 2007
The following is a write-up about Lucene, written in Sept ’05, when I was still writing code that used Lucene.
——— ——— ——— ——— ——— ——— ——— ———
Lucene, an Introduction
[What is Lucene?]
Lucene is a Java-based text search engine library that provides powerful indexing and querying features. It is not a ready-to-use tool that you can drop into your classpath as a jar and use right away (like log4j); we still have to write some code in order to use Lucene.
[What to write?]
There are two parts of code we need to write to make our application search-enabled. The first part creates an index of the data we want to make searchable. FYI, this index works as an inverted index: it contains the tokens of all documents, and for each token it records which documents the token occurs in. In other words, it doesn’t say “Document A1 contains tokens ax, ay, and az”, but rather “token ax occurs in documents A1, A2, and A3”.
The second part queries the index using Lucene's query features, either programmatically through the API or with the Lucene query syntax.
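To make the inverted index idea concrete, here is a minimal sketch in plain Java. It is only an illustration of the token-to-documents mapping described above, not Lucene's actual index format; the class name InvertedIndexSketch and its methods are made up for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustration only: a toy inverted index, not Lucene's real data structure.
class InvertedIndexSketch {
    // token -> sorted set of ids of the documents that contain it
    private final Map<String, Set<String>> postings = new TreeMap<>();

    // Record that every token in the list occurs in the document docId.
    void add(String docId, List<String> tokens) {
        for (String token : tokens) {
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
        }
    }

    // Look up the documents for one token; empty set if the token is unknown.
    Set<String> documentsFor(String token) {
        return postings.getOrDefault(token, new TreeSet<>());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add("A1", Arrays.asList("ax", "ay", "az"));
        index.add("A2", Arrays.asList("ax"));
        index.add("A3", Arrays.asList("ax", "ay"));
        System.out.println(index.documentsFor("ax")); // prints [A1, A2, A3]
    }
}
```

Searching for a token is then a single map lookup, which is why the index is organized by token rather than by document.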
[The Components]
The following are buzzwords in Lucene-related talks, so understanding these terms helps in getting a clear picture of what Lucene is.
In the indexing part, we have:
[1] IndexWriter
The purpose of this class is to give write access to a Lucene index. It has functionality to create an index, add Lucene Documents to the index, and optimize the index.
[2] Directory
Directory is a representation of the physical Lucene index files. Any operation accessing the index goes through this class. For example, it is supplied as a constructor argument to IndexWriter and IndexSearcher, and as an argument to IndexReader’s static method IndexReader.open(). It is an abstract class with two concrete implementations: FSDirectory for file-system-based indexes, and RAMDirectory for in-memory indexes.
[3] Analyzer
An analyzer is an encapsulation of the analysis process. An analyzer tokenizes text by performing any number of operations on it, which could include extracting words, discarding punctuation, removing accents from characters, lowercasing (also called normalizing), removing common words, reducing words to a root form (stemming), or changing words into their basic form (lemmatization). This process is also called tokenization, and the chunks of text pulled from a stream of text are called tokens. Tokens, combined with their associated field name, are terms.
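As a rough sketch of what such an analysis chain does, the plain Java below splits text into words, discards punctuation, lowercases, and removes common words. It does not use any Lucene class; AnalyzerSketch and its stop-word list are assumptions made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustration only: mimics an analysis chain in plain Java, not a Lucene Analyzer.
class AnalyzerSketch {
    // Hypothetical stop-word list chosen for this example.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "of"));

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Split on anything that is not a letter or digit (this drops punctuation).
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            String token = raw.toLowerCase(Locale.ROOT); // normalize case
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Quick, brown fox is HERE!"));
        // prints [quick, brown, fox, here]
    }
}
```

The same analysis has to be applied to the text at indexing time and to the keywords at query time, otherwise a search for “Quick” would never meet the indexed token “quick”.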
[4] Document
Document is the logical way in which Lucene organizes a collection of information. It represents a single unit of information. For example, indexing a customer database results in a number of Lucene Documents, where each Document represents a customer. A Document in turn holds a collection of fields: key/value pairs that store the information itself.
[5] Field
Field represents a field in a Document and stores the data. In our example, each customer Document can have the fields name, email, address, etc. There are different types of Field we can use, depending on how the field treats its data. A Field can have its data being:
– indexed: if yes, we can search on it; otherwise we just store the data without being able to search on this field.
– stored: if yes, we can retrieve the data from the Document for later use; otherwise we can only search on it, without access to the content within the Document.
– analyzed/tokenized: if yes, the text is processed (tokenized) into terms; otherwise it is kept as is.
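Then, there are four types of Field provided by the Lucene API. In versions earlier than 2.0.0 they are created with these static factory methods:
| factory method | indexed | stored | tokenized |
| Field.Keyword | yes | yes | no |
| Field.UnIndexed | no | yes | no |
| Field.UnStored | yes | no | yes |
| Field.Text | yes | yes | yes |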
In the searching part, we have:
[1] IndexSearcher
IndexSearcher is the object we use to search an index. It opens the index in read-only mode and searches it. The simplest use of IndexSearcher is its search method, which takes a single Query object as a parameter and returns a Hits object.
[2] Term and Query
Term is the basic unit of search criteria and consists of a key/value pair (a field name and a value), just like a Field object. Query is an abstract class with several concrete subclasses, each representing, as its name says, a kind of query. We can build a query either programmatically or by using the Lucene query syntax with QueryParser.
The simplest one is TermQuery, useful when we search for a single value in a certain field.
In query syntax: “fieldName:keywordSearch”
RangeQuery is used when we need to search within a certain range. In query syntax: “fieldName:[lowerValue TO upperValue]”. Use “[” and “]” for inclusive bounds, or “{” and “}” for exclusive bounds.
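The difference between the two bracket styles can be sketched in plain Java. This is only an illustration, assuming terms are compared as strings (which is how I understand RangeQuery to work in these Lucene versions); RangeSketch and inRange are names made up for this example.

```java
// Illustration only: inclusive vs. exclusive bounds, using plain string comparison.
class RangeSketch {
    // Returns true if value lies between lower and upper.
    // inclusive = true models "[lower TO upper]", false models "{lower TO upper}".
    static boolean inRange(String value, String lower, String upper, boolean inclusive) {
        int lo = value.compareTo(lower);
        int hi = value.compareTo(upper);
        return inclusive ? (lo >= 0 && hi <= 0) : (lo > 0 && hi < 0);
    }

    public static void main(String[] args) {
        System.out.println(inRange("20050901", "20050901", "20050930", true));  // prints true
        System.out.println(inRange("20050901", "20050901", "20050930", false)); // prints false
        // Strings compare character by character, so "9" sorts after "10".
        System.out.println(inRange("9", "1", "10", true));                      // prints false
    }
}
```

The last line shows why, under that assumption, numbers and dates are usually stored in a fixed-width, zero-padded form such as 20050901 before indexing.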
PrefixQuery is used to search for values that start with a given string.
In query syntax: “fieldName:keywordSearch*”
BooleanQuery is used for combining queries, equivalent to putting the logical operators AND, OR, and NOT between queries. Lucene uses three static constants in the class org.apache.lucene.search.BooleanClause.Occur to specify the combination: MUST, SHOULD, and MUST_NOT.
This is slightly different from the common logical operators. In Lucene we combine queries as operator-Query pairs, not as Query-operator-Query. For example, it is common to write:
not ( (Query1 and Query2) or Query3)
but in Lucene, we write:
- ((+Query1 +Query2) Query3)
In Lucene versions earlier than 2.0.0, BooleanQuery uses the method add(Query queryToAdd, boolean mandatory, boolean prohibited). We can translate the flag combinations into the constants as follows:
| mandatory | prohibited | equivalent constants |
| true | true | invalid combination |
| true | false | MUST(and) |
| false | true | MUST_NOT(not) |
| false | false | SHOULD(or) |
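The table above can be written down as a small plain-Java sketch. The enum here is a stand-in defined only for this example, not the real BooleanClause.Occur class, and fromFlags and prefix are hypothetical helper names.

```java
// Illustration only: a stand-in enum, not org.apache.lucene.search.BooleanClause.Occur.
class OccurSketch {
    enum Occur { MUST, SHOULD, MUST_NOT }

    // Translate the pre-2.0 (mandatory, prohibited) flags into one constant.
    static Occur fromFlags(boolean mandatory, boolean prohibited) {
        if (mandatory && prohibited) {
            throw new IllegalArgumentException("mandatory and prohibited cannot both be true");
        }
        if (mandatory) return Occur.MUST;
        if (prohibited) return Occur.MUST_NOT;
        return Occur.SHOULD;
    }

    // Prefix used in the query syntax: + for MUST, - for MUST_NOT, nothing for SHOULD.
    static String prefix(Occur occur) {
        switch (occur) {
            case MUST: return "+";
            case MUST_NOT: return "-";
            default: return "";
        }
    }

    public static void main(String[] args) {
        System.out.println(fromFlags(true, false));           // prints MUST
        System.out.println(prefix(fromFlags(false, true)));   // prints -
    }
}
```

Each clause carries its own operator, which is why the written form is a list of prefixed queries such as “+Query1 +Query2” rather than “Query1 AND Query2”.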
PhraseQuery: this query is used to search for a sequence of words. Its advantage over using multiple TermQuery objects is that it takes the slop into account (how far apart the words may be and, in simple words, their order of occurrence) when matching and scoring. In query syntax: “fieldName:"keyword1 keyword2 keyword3"~slopFactor”
WildcardQuery: use this query if you want to allow the wildcards * (any sequence of characters, including none) and ? (exactly one character) in your search keyword.
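The meaning of the two wildcards can be sketched in plain Java by translating the pattern into a regular expression. This only illustrates the matching rule; it is not how Lucene implements WildcardQuery, and WildcardSketch is a name made up for this example.

```java
import java.util.regex.Pattern;

// Illustration only: wildcard matching via java.util.regex, not Lucene's implementation.
class WildcardSketch {
    // '*' matches any run of characters (including none), '?' matches exactly one.
    static boolean matches(String wildcard, String term) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                // Quote every other character so it is matched literally.
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.matches(regex.toString(), term);
    }

    public static void main(String[] args) {
        System.out.println(matches("te?t", "test"));  // prints true
        System.out.println(matches("te?t", "teest")); // prints false
        System.out.println(matches("te*", "term"));   // prints true
    }
}
```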
FuzzyQuery: this query allows us to search for “similar” words. The similarity is calculated using the Levenshtein (edit distance) algorithm. In query syntax: “fieldName:keywordSearch~”
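For reference, the Levenshtein distance itself (the number of single-character insertions, deletions, and substitutions needed to turn one word into another) can be computed as in the sketch below. This is the textbook dynamic-programming version in plain Java; how Lucene turns the distance into a similarity score is not covered here, and LevenshteinSketch is a name made up for this example.

```java
// Illustration only: textbook Levenshtein distance, not Lucene's FuzzyQuery code.
class LevenshteinSketch {
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1]; // distances for the previous row
        int[] curr = new int[b.length() + 1]; // distances for the current row
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; // swap rows for the next iteration
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("roam", "foam"));      // prints 1
        System.out.println(distance("kitten", "sitting")); // prints 3
    }
}
```

A small distance means the two words are “similar”, so a fuzzy search for roam would be expected to also find foam.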
[3] QueryParser
This is an alternative way of constructing a query, using the QueryParser.parse(queryString) method. For example, we can construct a prefix query by calling QueryParser.parse("keyword*"). Please consult the docs/queryparsersyntax.html file in your distribution copy for the syntax reference.
[4] Hits
Hits is a container for the search results; for each hit it stores the score and a pointer to the matching Lucene Document.