“The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets”.
It’s almost synonymous with big data: anyone working with data of any kind has probably heard of Hadoop. Yet it’s hardly a single technology for a single problem. It’s a collection of technologies attacking different parts of the data processing problem. The original solution was HDFS, providing redundant storage across multiple machines, paired with MapReduce, which processes the data where it resides. Around them grew a dozen other projects for collecting data, storing it efficiently, processing it in near real time, scheduling data flows, and presenting the results.
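The division of labor between the map and reduce phases can be sketched with the classic word-count example. This is a toy illustration of the programming model in plain Python, not Hadoop’s actual Java API; in a real cluster the framework runs the map tasks on the nodes holding the data blocks and handles the shuffle between phases:

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in a document."""
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework
    would do between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: sum the counts for a single word."""
    return (key, sum(values))

documents = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = (pair for doc in documents for pair in map_phase(doc))
counts = dict(reduce_phase(k, vs) for k, vs in shuffle(mapped).items())
print(counts)  # "the" appears once in each of the three documents
```

The point of the model is that `map_phase` and `reduce_phase` are pure functions over independent inputs, so the framework is free to run them in parallel across the cluster, close to where the data lives.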
Our first encounter was a bit different. We had stored feature vectors of ballistic images in HBase and compared them on GPUs by running MapReduce jobs. That was back in 2010, when you had to compile a specific branch (branch-0.20-append) of Hadoop to run a reliable HBase cluster.
A more “conventional” use of the system was collecting events for an ad network and processing them into periodic reports. That later evolved into a full-scale analytics platform, collecting data from all over the business and uncovering relationships that would otherwise have gone unexamined.
It’s fair to say that Hadoop and the surrounding big data talk passed the second phase of the hype cycle some time ago. Some even argue that it’s gone entirely as a trend or buzzword and has simply become the industry norm. Either way, it’s definitely not as popular as it once was. So now it’s time for experts to pick what’s necessary for a real-world solution and make a real difference, instead of just buying the “full version” and then blaming the technology.