Anthill: A Distributed DBMS Based on MapReduce


License: CC 4.0 BY-SA


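This article introduces Anthill, a new distributed database system that combines the advantages of shared-nothing MPP database architectures with the MapReduce model, significantly improving scalability. By using MonetDB as its underlying database, Anthill outperforms Hadoop and HadoopDB on reads, and by introducing a complete query execution engine it can handle diverse data more intelligently and generate better query plans.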

MapReduce is a parallel computing model proposed by Google for large data sets. It has proven to offer high availability, good scalability, and fault tolerance, and has been widely used in recent years. However, the traditional database community has raised objections: D. J. DeWitt et al. argue that MapReduce gives up the performance and efficiency of a DBMS and is a step backwards for data analysis techniques. Subsequently, the HadoopDB paper presented a new type of MapReduce-based distributed database implementation, but it has the following disadvantages: (1) it adds no query execution engine for the Hadoop client, so HadoopDB is only a pilot project; (2) HadoopDB uses PostgreSQL as its underlying database, which lacks the redundancy and backups provided by Hadoop's underlying file system HDFS, so it loses Hadoop's original high availability; (3) it has no complete partitioning mechanism, and its tables are partitioned manually, which is impractical; (4) table joins in HadoopDB assume the ideal case in which the matching partitions of the two tables reside on the same node, which does not hold in the real world.
This paper overcomes the above problems. Our implementation, named Anthill, keeps the advantages of both shared-nothing MPP database architectures and MapReduce. Experiments show that: (1) Anthill scales better than MPP databases and can be deployed on more than 100 nodes, even 500 nodes; (2) because Anthill uses the column-oriented database MonetDB as its underlying database, which effectively reduces I/O, it achieves better performance than Hadoop and HadoopDB; (3) Anthill adds a complete query execution engine with a parser, optimizer, and planner, so it can handle diverse data more intelligently and produce better query plans than HadoopDB. In short, Anthill is a new type of distributed database system with high commercial value.
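Point (4) above, on joins that assume co-located partitions, can be illustrated with a small sketch. It is not Anthill's or HadoopDB's actual code. It is a generic hash-repartition join, and all names in it (`node_for`, `partition`, `local_join`, `repartition_join`) are hypothetical. The idea is that when matching rows of two tables may sit on different nodes, both tables are first shuffled by join key so that matching rows land on the same node, and only then joined locally.

```python
from collections import defaultdict
from zlib import crc32


def node_for(key, num_nodes):
    """Deterministic hash partitioning: map a join key to a node id."""
    return crc32(str(key).encode("utf-8")) % num_nodes


def partition(rows, key_index, num_nodes):
    """Group rows by the node that owns their join key."""
    parts = defaultdict(list)
    for row in rows:
        parts[node_for(row[key_index], num_nodes)].append(row)
    return parts


def local_join(left_rows, right_rows, left_key, right_key):
    """Hash join of two row lists that live on the same node."""
    index = defaultdict(list)
    for r in right_rows:
        index[r[right_key]].append(r)
    return [l + r for l in left_rows for r in index.get(l[left_key], [])]


def repartition_join(left_parts, right_parts, left_key, right_key, num_nodes):
    """Join two tables whose partitions are placed arbitrarily.

    left_parts / right_parts map a node id to the rows stored there.
    Both tables are shuffled by join key (the "map" side), then each
    node joins only the rows it received (the "reduce" side).
    """
    left_all = [row for rows in left_parts.values() for row in rows]
    right_all = [row for rows in right_parts.values() for row in rows]
    left_shuffled = partition(left_all, left_key, num_nodes)
    right_shuffled = partition(right_all, right_key, num_nodes)
    result = []
    for node in range(num_nodes):
        result.extend(
            local_join(
                left_shuffled.get(node, []),
                right_shuffled.get(node, []),
                left_key,
                right_key,
            )
        )
    return result


if __name__ == "__main__":
    customers = [(1, "ann"), (2, "bob"), (3, "cy")]
    orders = [(10, 1), (11, 2), (12, 1), (13, 4)]
    # Partitions placed without regard to the join key.
    order_parts = {0: orders[:2], 1: orders[2:]}
    customer_parts = {0: customers[2:], 1: customers[:2]}
    print(sorted(repartition_join(order_parts, customer_parts, 1, 0, 3)))
```

With the placement used in the example, joining each node's local partitions without a shuffle would find only one of the three matching rows, which is the situation point (4) describes.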

Keywords: Distributed Database; Anthill; MapReduce; Hadoop

Slides (SlideShare Flash embed): http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=anthill-100511101418-phpapp01&stripped_title=anthill-a-distributed-dbms-based-on-mapreduce