MapReduce is a parallel computing model proposed by Google for processing large data sets. It has proved to offer high availability, good scalability, and fault tolerance, and has been widely used in recent years. However, the traditional database community has raised objections: D. J. DeWitt et al. argue that MapReduce gives up the performance and efficiency of a DBMS and is a step backwards for data analysis techniques. The HadoopDB article subsequently presented a new type of MapReduce-based distributed database implementation, but it has the following disadvantages: (1) it adds no query execution engine for the Hadoop client, so HadoopDB remains only a pilot project; (2) HadoopDB uses PostgreSQL as its underlying database, which lacks the redundancy and backups that Hadoop's underlying file system, HDFS, provides, so it loses Hadoop's original high availability; (3) it has no complete partitioning mechanism: tables are partitioned manually, which is impractical; (4) table joins in HadoopDB assume the ideal case in which the two tables' partitions reside on the same node, which does not hold in the real world.
This paper overcomes the above problems. Our implementation, named Anthill, keeps the advantages of both shared-nothing MPP database architectures and MapReduce. Experiments show that: (1) Anthill scales better than MPP databases and can be deployed on more than 100 nodes, even 500 nodes; (2) because Anthill uses the column-oriented database MonetDB as its underlying database, which effectively reduces I/O, it achieves better performance than Hadoop and HadoopDB; (3) Anthill adds a complete query execution engine with a parser, optimizer, and planner, so it handles diverse data more intelligently and produces better query plans than HadoopDB. In short, Anthill is a new type of distributed database system with high commercial value.
Keywords: Distributed Database; Anthill; MapReduce; Hadoop
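To make the partitioning and join issues in points (3) and (4) concrete, here is a minimal sketch of automatic hash partitioning and a repartition join in the map/reduce style. It is not Anthill's or HadoopDB's actual code: the function names (`partition_of`, `hash_partition`, `repartition_join`), the CRC32-based hashing scheme, and the tuple row format are all assumptions made for illustration.

```python
from collections import defaultdict
from zlib import crc32


def partition_of(key, num_nodes):
    """Deterministically map a key to a node index (hypothetical scheme)."""
    return crc32(str(key).encode("utf-8")) % num_nodes


def hash_partition(rows, key_index, num_nodes):
    """Split rows (tuples) into num_nodes partitions by hashing one column."""
    partitions = [[] for _ in range(num_nodes)]
    for row in rows:
        partitions[partition_of(row[key_index], num_nodes)].append(row)
    return partitions


def repartition_join(left_parts, right_parts, left_key, right_key, num_nodes):
    """Equi-join two tables whose partitions are not co-located on the join key.

    Map phase: route every row to the node chosen by its join key.
    Reduce phase: each node joins the left and right rows it received.
    """
    # One shuffle buffer per node: join key -> (left rows, right rows).
    shuffled = [defaultdict(lambda: ([], [])) for _ in range(num_nodes)]
    for part in left_parts:
        for row in part:
            key = row[left_key]
            shuffled[partition_of(key, num_nodes)][key][0].append(row)
    for part in right_parts:
        for row in part:
            key = row[right_key]
            shuffled[partition_of(key, num_nodes)][key][1].append(row)

    joined = []
    for node in shuffled:
        for lefts, rights in node.values():
            for left_row in lefts:
                for right_row in rights:
                    joined.append(left_row + right_row)
    return joined
```

In this sketch, a table partitioned on a column other than the join key still joins correctly, because the shuffle step moves matching rows to the same node first. The price is the network transfer that a co-located join would avoid.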
[flash=425,355]http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=anthill-100511101418-phpapp01&stripped_title=anthill-a-distributed-dbms-based-on-mapreduce[/flash]
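This article introduces Anthill, a new type of distributed database system that combines the advantages of shared-nothing MPP database architectures with the MapReduce model and significantly improves scalability. By using MonetDB as its underlying database, Anthill outperforms Hadoop and HadoopDB in read performance, and by introducing a complete query execution engine it can handle diverse data more intelligently and generate better query plans.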