2008-01-26: Lucene 2.3 Released
An interesting Lucene search benchmark has recently been published by the Danish State and University Library. Among other things, it shows how much search can be sped up by using solid-state disks (the benchmark still uses Lucene 2.1).
2007-08-10: Lucene indexing just got much faster
People have been reporting great performance improvements with a patch that improves Lucene's memory usage. Now another patch has been committed that speeds up the way documents are tokenized. I ran a quick test to see whether the performance improvements from both patches show up not only with small documents but also with larger ones. Here are the results:
Indexing time with Lucene 2.2: 58 sec
Indexing time with Lucene trunk (2007-08-09): 17 sec
Thus indexing is about 3.4 times faster now in my test case (58 sec / 17 sec)!
Details about the small test collection:
Document format: plain text
Total size of documents: 28 MB
Number of documents: 71 (i.e. about 400 KB average document size)
Index size: 11 MB
Heap memory for JVM: 10 MB (i.e. -Xmx10M)
Notable API calls I used: indexwriter.setRAMBufferSizeMB(5) (to make it work with just 10 MB of heap memory) and indexwriter.setMaxFieldLength(Integer.MAX_VALUE) (to make sure nothing gets cut off)
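The figures above can be checked with a little arithmetic. The following is only a small sketch in plain Java, not part of the original test; the class and method names are made up for illustration, and it assumes 1 MB = 1024 KB.

```java
import java.util.Locale;

// Hypothetical helper class, only used to re-check the numbers reported above.
class IndexingStats {

    /** Ratio of old to new indexing time; values above 1 mean the new run is faster. */
    static double speedup(double oldSeconds, double newSeconds) {
        return oldSeconds / newSeconds;
    }

    /** Average document size in KB, assuming 1 MB = 1024 KB. */
    static double averageDocSizeKb(double totalMb, int docCount) {
        return totalMb * 1024.0 / docCount;
    }

    public static void main(String[] args) {
        // 58 sec with Lucene 2.2 vs. 17 sec with trunk
        System.out.println(String.format(Locale.ROOT, "speedup: %.1fx", speedup(58, 17)));
        // 28 MB spread over 71 documents
        System.out.println(String.format(Locale.ROOT,
                "average document size: %.0f KB", averageDocSizeKb(28, 71)));
    }
}
```

With the reported values this gives a speedup of roughly 3.4 and an average document size of roughly 404 KB, which matches the "about 400 KB" stated above.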
2007-06-29: Lucene Presentation Slides
Jazoon'07 was quite a nice conference in a nice city with unfortunately bad weather. Here are the slides of my presentation -- an overview and short introduction to Lucene, Solr, and Nutch:
The next interesting conference is already on the radar: OOoCon 2007 in Barcelona, 19th - 21st September 2007.
2007-06-20: Search and Collaboration Software News
Some interesting search and search-related software has been released recently:
- Mindquarry 1.1 is an Open Source collaboration software with integrated Wiki, task management, and file sharing. It's web-based, but it also offers an additional Java desktop client. Internally it uses Solr for its search. I work at Mindquarry and I hope that we can further improve the search features in the next versions. If you want to try Mindquarry for free without downloading and installing it, I suggest you have a look at Mindquarry GO, the beta version of the hosted Mindquarry service launched today. You can apply for one of 333 free beta test accounts.
- Solr 1.2 includes faster faceted search, a spell checker ("Did you mean..."), easier embedding into Java without using HTTP, and more.
- Lucene 2.2, the Java search engine library, adds performance improvements for both indexing and searching. It now also features payloads, i.e. arbitrary metadata attached at the term level. This also lets you boost certain parts of documents, e.g. headlines.
By the way, if you are interested in an overview of Lucene, Solr, and Nutch you might want to attend my talk at Jazoon'07 next week.
2007-02-20: Lucene 2.1 released
- Documents can now be deleted using IndexWriter (not only IndexReader, as before).
- IndexReaders don't need short-time locks anymore, i.e. they are really read-only.
- Support for leading wildcard characters in QueryParser (needs to be activated explicitly).
- An NGramTokenizer has been added.
- Several optimizations, e.g. for compressed fields (faster indexing) and non-optimized indexes (faster searching).
This release includes index format changes that are not readable by older versions. It can both read and update older Lucene indexes. Adding to an index with an older format will cause it to be converted to the newer format.
For all the details, check out the changelog (which has become quite long).
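To give an idea of what an n-gram tokenizer such as the NGramTokenizer mentioned in the list above is for, here is a minimal sketch in plain Java that splits a string into character n-grams. It is not Lucene's implementation: the class name, the method signature, and the order in which the n-grams are returned are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of character n-gram tokenization; not Lucene code.
class NGramSketch {

    /**
     * Returns all character n-grams of the text with lengths from minGram to
     * maxGram (inclusive), shorter n-grams first, each group in text order.
     */
    static List<String> ngrams(String text, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int n = minGram; n <= maxGram; n++) {
            // Slide a window of length n over the text.
            for (int start = 0; start + n <= text.length(); start++) {
                grams.add(text.substring(start, start + n));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // prints [lu, uc, ce, en, ne, luc, uce, cen, ene]
        System.out.println(ngrams("lucene", 2, 3));
    }
}
```

Indexing such fragments instead of whole words is one way to support substring-like matching, at the cost of a larger index.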