2008-01-26: Lucene 2.3 released

Category: search
Posted by: dnaber
I posted a simple benchmark before; now that Lucene 2.3 has been released, I ran it again to make sure there were no performance regressions. The old results were confirmed, so if you use Lucene to index big documents with StandardAnalyzer, you can expect indexing to become about 4 times faster.

An interesting Lucene search benchmark has recently been published by the Danish State and University Library. Among other things, it shows how much you can speed up search using solid-state disks (their benchmark still uses Lucene 2.1).

Category: search
Posted by: dnaber

People have been reporting great performance improvements with a patch that improves Lucene's memory usage. Now another patch has been committed that speeds up the way documents are tokenized. I ran a quick test to see whether the gains from both patches show up not only with small documents but also with larger ones. Here are the results:

Indexing time with Lucene 2.2: 58 sec
Indexing time with Lucene trunk (2007-08-09): 17 sec

Thus indexing is now about 3.4 times faster in my test case!

Details about the small test collection:
Document format: plain text
Total size of documents: 28 MB
Number of documents: 71 (i.e. about 400 KB average document size)
Index size: 11 MB
Heap memory for JVM: 10 MB (i.e. -Xmx10M)
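
The figures above can be sanity-checked with a little arithmetic: 58 sec against 17 sec, and 28 MB spread over 71 documents. Here is a minimal Java sketch of that calculation. The class and method names are made up for illustration, and 1 MB is assumed to mean 1024 KB.

```java
// Back-of-the-envelope check of the benchmark figures quoted above.
// Class and method names are hypothetical; nothing here uses the Lucene API.
final class IndexingBenchmarkMath {

    /** Speedup factor: how many times faster the new run is than the old one. */
    static double speedup(double oldSeconds, double newSeconds) {
        return oldSeconds / newSeconds;
    }

    /** Average document size in KB, assuming 1 MB = 1024 KB. */
    static double averageDocSizeKb(double totalMb, int documentCount) {
        return totalMb * 1024.0 / documentCount;
    }

    public static void main(String[] args) {
        // Indexing times from the test: 58 sec (Lucene 2.2) vs. 17 sec (trunk).
        System.out.println("speedup: " + speedup(58, 17));
        // Collection size from the test: 28 MB in 71 documents.
        System.out.println("average document size in KB: " + averageDocSizeKb(28, 71));
    }
}
```

The speedup comes out at roughly 3.4, and the average document size at roughly 404 KB, which matches the "about 400 KB" given above.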

Notable API calls I used: indexwriter.setRAMBufferSizeMB(5) (to make it work with just 10 MB of heap memory) and indexwriter.setMaxFieldLength(Integer.MAX_VALUE) (to make sure nothing gets cut off)

Category: search
Posted by: dnaber

Jazoon'07 was quite a nice conference in a nice city with unfortunately bad weather. Here are the slides of my presentation -- an overview and short introduction to Lucene, Solr, and Nutch:

The next interesting conference is already on the radar: OOoCon 2007 in Barcelona, 19th - 21st September 2007.

Category: search
Posted by: dnaber

Some interesting search and search-related software has been released recently:

  • Mindquarry 1.1 is open source collaboration software with an integrated wiki, task management, and file sharing. It's web-based, but it also offers an additional Java desktop client. Internally it uses Solr for search. I work at Mindquarry, and I hope we can further improve the search features in the next versions. If you want to try Mindquarry for free without downloading and installing it, have a look at Mindquarry GO, the beta version of the hosted Mindquarry service, launched today. You can apply for one of 333 free beta test accounts.
  • Solr 1.2 includes faster faceted search, a spell checker ("Did you mean..."), easier embedding into Java without using HTTP, and more.
  • Lucene 2.2, the Java search engine library, adds performance improvements for both indexing and searching. It now also features payloads, i.e. arbitrary metadata that can be attached to individual term occurrences. Among other things, this lets you boost certain parts of documents, e.g. headlines.

By the way, if you are interested in an overview of Lucene, Solr, and Nutch, you might want to attend my talk at Jazoon'07 next week.

2007-02-20: Lucene 2.1 released

Category: search
Posted by: dnaber
A new version of Lucene, the powerful full-text search library, has been released. Some highlights:
  • Documents can now be deleted using IndexWriter (not only IndexReader, as before).
  • IndexReaders don't need short-time locks anymore, i.e. they are really read-only.
  • Support for leading wildcard characters in QueryParser (needs to be activated explicitly).
  • An NGramTokenizer has been added.
  • Several optimizations, e.g. for compressed fields (faster indexing) and non-optimized indexes (faster searching).
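
To give an idea of what an n-gram tokenizer produces, here is a minimal plain-Java sketch. It is not Lucene's NGramTokenizer and does not use the Lucene API. The class name NGramSketch, the method ngrams, and the order in which the grams are emitted are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: plain-Java character n-grams, not Lucene's NGramTokenizer.
final class NGramSketch {

    /**
     * Returns all character n-grams of text with lengths from minGram to maxGram,
     * grouped by gram length (all shortest grams first).
     */
    static List<String> ngrams(String text, int minGram, int maxGram) {
        if (minGram < 1 || maxGram < minGram) {
            throw new IllegalArgumentException("need 1 <= minGram <= maxGram");
        }
        List<String> grams = new ArrayList<>();
        for (int n = minGram; n <= maxGram; n++) {
            // Slide a window of length n over the text.
            for (int start = 0; start + n <= text.length(); start++) {
                grams.add(text.substring(start, start + n));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // Prints [lu, uc, ce, en, ne, luc, uce, cen, ene]
        System.out.println(ngrams("lucene", 2, 3));
    }
}
```

Indexing such grams instead of whole words is what makes substring-style matching possible, at the price of a larger index.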

This release includes index format changes, so indexes written by Lucene 2.1 cannot be read by older versions. Lucene 2.1 itself can both read and update older Lucene indexes. Adding to an index with an older format will cause it to be converted to the newer format.

For all the details, check out the changelog (which has become quite long).

2006-05-28: Lucene 2.0 released

Category: search
Posted by: dnaber
Lucene 2.0 has just been released. As planned, it contains bug fixes relative to Lucene 1.9.1, and deprecated methods and classes have been removed. If you're still using Lucene 1.4, you should consider updating: the list of fixes and improvements is very long.

2006-03-03: Lucene 1.9.1 released

Category: search
Posted by: dnaber
Version 1.9.1 fixes a bug in Lucene's I/O methods; an update is strongly recommended for everybody using 1.9. See the Lucene homepage.

2006-02-28: Lucene 1.9 released

Category: search
Posted by: dnaber
Lucene (the powerful text search engine library) version 1.9 has just been released. It contains a large number of improvements. Some methods and classes have been deprecated and will be removed in Lucene 2.0. In fact, Lucene 2.0 is planned to be the same as 1.9 with the deprecated parts removed (and it will probably be released quite soon, i.e. in a few weeks).