Thursday, May 10, 2012

jTool: Apache Lucene

I had a full-day client briefing today in St. Louis on Big Data analytics for enterprise research. During the briefing, architects and developers from both sides mentioned Lucene more than a dozen times as a useful tool for text indexing, even at the enterprise level.

It says a lot for an open-source project like Lucene when two 100-year-old companies agree on its importance.


So I am now adding a second Apache project (the first being Apache Hadoop) to the jTool catalog.


Overview of Apache Lucene

Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is suitable for nearly any application that requires full-text search, especially cross-platform applications.

Lucene offers powerful features through a simple API:

Scalable, High-Performance Indexing

  • over 95GB/hour on modern hardware
  • small RAM requirements -- only 1MB heap
  • incremental indexing as fast as batch indexing
  • index size roughly 20-30% the size of text indexed
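
To make the indexing idea concrete, here is a minimal sketch in plain Java of an inverted index, the general data structure behind full-text search libraries. It is not Lucene's API or implementation: the class `TinyInvertedIndex` and its methods `add` and `lookup` are hypothetical names invented for illustration, and the tokenizer is a simplistic assumption (lowercase, split on anything that is not a letter or digit). It does show why incremental indexing is natural: adding a document only touches the posting lists for that document's terms.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy inverted index (hypothetical, not Lucene): term -> sorted ids of documents containing it. */
public class TinyInvertedIndex {
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    /** Adds one document. Can be called at any time, so the index grows incrementally. */
    public void add(int docId, String text) {
        // Assumed tokenizer: lowercase, then split on runs of non-alphanumeric characters.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty()) {
                continue; // split() yields an empty first token when text starts with punctuation
            }
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
        }
    }

    /** Returns the ids of all documents containing the term (case-insensitive). */
    public Set<Integer> lookup(String term) {
        TreeSet<Integer> ids = postings.get(term.toLowerCase(Locale.ROOT));
        return ids == null ? Collections.<Integer>emptySet() : Collections.unmodifiableSet(ids);
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add(1, "Lucene is a text search library");
        index.add(2, "Hadoop processes big data");
        index.add(3, "Text indexing with Lucene"); // added later, no rebuild needed
        System.out.println(index.lookup("lucene")); // prints [1, 3]
        System.out.println(index.lookup("hadoop")); // prints [2]
    }
}
```

A real library adds much more on top of this (compressed on-disk segments, analyzers, term positions), which is where figures like the index-size ratio above come from.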

Powerful, Accurate and Efficient Search Algorithms

  • ranked searching -- best results returned first
  • many powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more
  • fielded searching (e.g., title, author, contents)
  • date-range searching
  • sorting by any field
  • multiple-index searching with merged results
  • allows simultaneous update and searching
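
Two of the search features above, ranked searching and fielded searching, can be sketched in a few lines of plain Java. This is a toy under stated assumptions, not Lucene's API or its scoring model: the class `TinyRankedSearch` and its methods are hypothetical, the score is just the total term frequency of the query terms in the chosen field, and ties are broken by document id.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Toy fielded, ranked search (hypothetical, not Lucene). */
public class TinyRankedSearch {
    // Key "field:term" -> (docId -> how often the term occurs in that field of that doc).
    private final Map<String, Map<Integer, Integer>> postings = new HashMap<>();

    /** Indexes the text of one field of one document. */
    public void add(int docId, String field, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(field + ":" + token, k -> new HashMap<>())
                    .merge(docId, 1, Integer::sum);
        }
    }

    /** Returns ids of documents matching any query term in the field, best score first. */
    public List<Integer> search(String field, String query) {
        final Map<Integer, Integer> scores = new HashMap<>();
        for (String token : query.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty()) {
                continue;
            }
            Map<Integer, Integer> hits =
                    postings.getOrDefault(field + ":" + token, Collections.<Integer, Integer>emptyMap());
            // Assumed scoring: sum of raw term frequencies across query terms.
            hits.forEach((doc, tf) -> scores.merge(doc, tf, Integer::sum));
        }
        List<Integer> ranked = new ArrayList<>(scores.keySet());
        ranked.sort((a, b) -> {
            int byScore = Integer.compare(scores.get(b), scores.get(a)); // higher score first
            return byScore != 0 ? byScore : Integer.compare(a, b);       // then lower doc id
        });
        return ranked;
    }

    public static void main(String[] args) {
        TinyRankedSearch index = new TinyRankedSearch();
        index.add(1, "title", "Lucene in Action");
        index.add(1, "contents", "Lucene indexes text. Lucene searches text.");
        index.add(2, "contents", "Hadoop stores data; Lucene searches it.");
        System.out.println(index.search("contents", "lucene")); // prints [1, 2]
        System.out.println(index.search("title", "lucene"));    // prints [1]
    }
}
```

Searching the `title` field finds only document 1 even though document 2 mentions Lucene in its contents, which is the point of fielded search.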

Cross-Platform Solution

  • Available as Open Source software under the Apache License, which lets you use Lucene in both commercial and Open Source programs
  • 100%-pure Java
  • Index-compatible implementations available in other programming languages
