Top 5 Reasons Not to Use Hadoop for Analytics

Original post: [url]http://www.quantivo.com/blog/top-5-reasons-not-use-hadoop-analytics[/url]
As a former diehard fan of Hadoop, I LOVED the fact that you can work with petabytes of data. I loved the ability to scale to thousands of nodes to process a large computation job. I loved the ability to store and load data in a very flexible format. In many ways, I loved Hadoop, until I tried to deploy it for analytics. That’s when I became disillusioned with Hadoop (it just "ain't all that").

At Quantivo, we’ve explored many ways to deploy Hadoop to answer analytical queries (trust me – I made every attempt to include it in my day job). At the end of the day, it became an exercise much like trying to build a house with just a hammer: conceivably it’s possible, but it’s unnecessarily painful and ridiculously cost-inefficient.

Let me share with you my top reasons why Hadoop should not be used for Analytics.



1 - Hadoop is a framework, not a solution – For many reasons, people have an expectation that Hadoop answers Big Data analytics questions right out of the box. For simple queries, this works. For harder analytics problems, Hadoop quickly falls flat and requires you to develop Map/Reduce code directly. For that reason, Hadoop is more like a J2EE programming environment than a business analytics solution.
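
To make that concrete, here is a minimal sketch of the Java MapReduce code a developer ends up writing for even a trivial "count events per user" question. The EventCount class, the field names, and the tab-separated input layout are assumptions for illustration only, not anything from the original post:

```java
// Minimal sketch (assumed input: tab-separated lines whose first field is a user ID).
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class EventCount {

  // Mapper: emits (userId, 1) for every input line.
  public static class EventMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text userId = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split("\t");
      if (fields.length > 0 && !fields[0].isEmpty()) {
        userId.set(fields[0]);
        context.write(userId, ONE);
      }
    }
  }

  // Reducer: sums the 1s for each user ID.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  // Driver: wires everything together and submits the job to the cluster.
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "event count");
    job.setJarByClass(EventCount.class);
    job.setMapperClass(EventMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

In a database, the same question is a one-line GROUP BY query; here it is a class to write, package into a jar, and submit to the cluster.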

2 - Hive and Pig are good, but do not overcome architectural limitations – Both Hive and Pig are very well thought-out tools that enable the lay engineer to become productive with Hadoop quickly. After all, Hive and Pig translate analytics queries written in familiar SQL or scripting syntax into Java Map/Reduce jobs that can be deployed in a Hadoop environment. However, there are limitations in the Map/Reduce framework of Hadoop that prohibit efficient operation, especially when you require inter-node communication (as is the case with sorts and joins).
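
To see why joins in particular hurt, here is a hedged sketch of roughly the shape of reduce-side join that a simple Hive JOIN is typically translated into. The "users" and "orders" layouts and all class names are hypothetical; the point is that every row of both inputs gets tagged, shuffled across the network, and buffered in a reducer before a single joined record comes out:

```java
// Hedged sketch of a reduce-side join (assumed inputs: "users" as userId \t region,
// "orders" as userId \t amount; both layouts are made up for illustration).
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ReduceSideJoin {

  // Tags each user row with "U" so the reducer can tell the two sources apart.
  public static class UserMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] f = value.toString().split("\t");
      if (f.length >= 2) {
        context.write(new Text(f[0]), new Text("U\t" + f[1]));
      }
    }
  }

  // Tags each order row with "O".
  public static class OrderMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] f = value.toString().split("\t");
      if (f.length >= 2) {
        context.write(new Text(f[0]), new Text("O\t" + f[1]));
      }
    }
  }

  // All rows for a given userId arrive here via the shuffle; one side is
  // buffered in memory to emit the matching joined records.
  public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      List<String> regions = new ArrayList<>();
      List<String> amounts = new ArrayList<>();
      for (Text v : values) {
        String[] f = v.toString().split("\t", 2);
        if ("U".equals(f[0])) {
          regions.add(f[1]);
        } else {
          amounts.add(f[1]);
        }
      }
      for (String region : regions) {
        for (String amount : amounts) {
          context.write(key, new Text(region + "\t" + amount));
        }
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "reduce-side join");
    job.setJarByClass(ReduceSideJoin.class);
    job.setReducerClass(JoinReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, UserMapper.class);
    MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, OrderMapper.class);
    FileOutputFormat.setOutputPath(job, new Path(args[2]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```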

3 - Deployment is easy, fast and free, but very costly to maintain and develop – Hadoop is very popular because within an hour, an engineer can download it, install it, and issue a simple query. It’s also an open source project, so there are no software costs, which makes it a very attractive alternative to Oracle and Teradata. The true costs of Hadoop become obvious when you enter the maintenance and development phase. Since Hadoop is mostly a development framework, Hadoop-proficient engineers are required to develop an application as well as optimize it to execute efficiently on a Hadoop cluster. Again, it’s possible, but very hard to do.

4 - Great for data pipelining and summarization, horrible for ad hoc analysis – Hadoop is great at analyzing large amounts of data and summarizing it, or “data pipelining” the raw data into something more useful for another application (like search or text mining) – that’s what it’s built for. However, if you don’t know the analytics question you want to ask, or if you want to explore the data for patterns, Hadoop becomes unmanageable very quickly. Hadoop is flexible enough to answer many types of questions, but only if you spend the cycles to program and execute MapReduce code for each of them.

5 - Performance is great, except when it’s not – By all measures, if you want speed and need to analyze large quantities of data, Hadoop lets you parallelize your computation across thousands of nodes. The potential is definitely there. But not all analytics jobs can easily be parallelized, especially when user interaction drives the analytics. So, unless the Hadoop application is designed and optimized for the question that you want to ask, performance can quickly become very slow – each map/reduce job has to wait until the previous jobs have completed, so a workflow is only ever as fast as its slowest MapReduce job.
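
The stop-and-wait nature of a multi-job workflow is visible right in the driver code. Below is a hedged sketch of a hypothetical two-stage "summarize, then rank" pipeline; the job names, the intermediate path, and the omitted mapper/reducer wiring are illustrative assumptions:

```java
// Hedged sketch of a two-stage workflow driver; stage names and paths are hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TwoStageDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Stage 1: summarize the raw data into an intermediate dataset on HDFS.
    Job summarize = Job.getInstance(conf, "summarize");
    // (setJarByClass / setMapperClass / setReducerClass omitted in this sketch)
    FileInputFormat.addInputPath(summarize, new Path(args[0]));
    FileOutputFormat.setOutputPath(summarize, new Path("/tmp/intermediate"));

    // waitForCompletion blocks until every map and reduce task of stage 1
    // has finished, so the workflow is gated on its slowest task.
    if (!summarize.waitForCompletion(true)) {
      System.exit(1);
    }

    // Stage 2: rank the summaries; it cannot start until stage 1 has
    // written all of its intermediate output to disk.
    Job rank = Job.getInstance(conf, "rank");
    FileInputFormat.addInputPath(rank, new Path("/tmp/intermediate"));
    FileOutputFormat.setOutputPath(rank, new Path(args[1]));
    System.exit(rank.waitForCompletion(true) ? 0 : 1);
  }
}
```

Because waitForCompletion blocks, the second stage cannot begin until every task of the first has finished and its output has landed on HDFS, which is exactly why one straggling task or one poorly optimized job slows down the whole analysis.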

That said, Hadoop is a phenomenal framework for doing some very sophisticated data analysis. Ironically, it’s also a framework that requires a lot of programming effort to get those questions answered.