Tuesday, July 24, 2012

Apache Hadoop as a top batch processing framework / Solr indexing

Hadoop is increasingly gathering attention from the Open Source Java community as a top-rated, highly scalable, inexpensive batch processing framework. Some of the advantages of Hadoop are as follows:
(1) Apache Open Source project
(2) Strong community support
(3) Inexpensive
(4) Runs on standard Windows / Unix server hardware; no special hardware requirements
(5) Commercial support, training and solutions available from a dedicated company, Cloudera (Cloudera.com)

Based on the above advantages, Hadoop has become a very good tool for building large-scale business applications that can process several petabytes of data. In addition, Hadoop can be integrated with databases such as HBase, and more recently Cloudera has offered a solution for connecting Hadoop jobs to Oracle databases as well.
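
To give a feel for what a Hadoop batch job looks like, here is the classic word-count MapReduce example as a minimal sketch; the class names are my own and the input/output HDFS paths are placeholders passed on the command line:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every token in each input line
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reducer: sums the counts emitted for each word
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}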

Hadoop also integrates very well with Apache Solr for indexing documents in large-scale search solutions, and there are many references on the internet for integrating the two. A Hadoop job can upload data to Solr using the SolrJ Java client library, while the Apache Solr search engine stores its indexes on separate servers. A separate front-end application can then provide the search interface, again talking to Apache Solr through SolrJ. So on one side a Hadoop job uploads data (documents) to Apache Solr, and on the other side a Java front end queries Apache Solr to serve searches.
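
To make the two sides concrete, below is a minimal SolrJ sketch, assuming the SolrJ 3.x/4.x HttpSolrServer API (later releases renamed it HttpSolrClient); the Solr URL, core name and field names are placeholders and must match your own schema. The upload half is what a Hadoop job (for example, in its reducer) could do in batches, and the query half is what a front-end application could do:

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class HadoopSolrExample {

    public static void main(String[] args) throws Exception {
        // Placeholder Solr URL and core name - point this at your own Solr server
        SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        // Upload side: a Hadoop job would typically build and send documents in batches
        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");              // unique key field
        doc.addField("title", "Hadoop and Solr"); // example fields; must match your schema
        doc.addField("content", "Document body produced by a Hadoop batch job");
        batch.add(doc);
        solr.add(batch);   // send the batch of documents to Solr
        solr.commit();     // make the new documents searchable

        // Search side: a Java front end queries the same Solr server through SolrJ
        SolrQuery query = new SolrQuery("content:hadoop");
        query.setRows(10); // return at most 10 results
        QueryResponse response = solr.query(query);
        for (SolrDocument hit : response.getResults()) {
            System.out.println(hit.getFieldValue("id") + " : " + hit.getFieldValue("title"));
        }
    }
}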

Hadoop is definitely a technology of the future and something all Java architects should keep an eye on.

Tejas Bavishi
