2023-04-13 MonetDB/X100: Hyper-Pipelining Query Execution

This paper investigates in depth why database systems tend to achieve only low IPC (instructions-per-cycle) efficiency on modern CPUs in compute-intensive application areas such as decision support, OLAP, and multimedia retrieval, focusing on the TPC-H benchmark. The authors analyze several relational systems as well as MonetDB, and derive a new set of guidelines for designing a query processor. The second part of the paper introduces the new MonetDB/X100 query engine, which follows these guidelines: on the surface it resembles a classical Volcano-style engine, but by basing all execution on vector processing it achieves high CPU efficiency. On the 100GB TPC-H benchmark, the raw execution power of MonetDB/X100 is one to two orders of magnitude higher than that of previous technology.


https://www.cidrdb.org/cidr2005/papers/P19.pdf

 

MonetDB/X100: Hyper-Pipelining Query Execution

Peter Boncz, Marcin Zukowski, Niels Nes
CWI, Kruislaan 413, Amsterdam, The Netherlands
{P.Boncz,M.Zukowski,N.Nes}@cwi.nl

Abstract

Database systems tend to achieve only low IPC (instructions-per-cycle) efficiency on modern CPUs in compute-intensive application areas like decision support, OLAP and multimedia retrieval. This paper starts with an in-depth investigation of why this happens, focusing on the TPC-H benchmark. Our analysis of various relational systems and MonetDB leads us to a new set of guidelines for designing a query processor. The second part of the paper describes the architecture of our new X100 query engine for the MonetDB system that follows these guidelines. On the surface, it resembles a classical Volcano-style engine, but the crucial decision to base all execution on the concept of vector processing makes it highly CPU efficient. We evaluate the power of MonetDB/X100 on the 100GB version of TPC-H, showing its raw execution power to be between one and two orders of magnitude higher than previous technology.

1 Introduction

Modern CPUs can perform enormous amounts of calculations per second, but only if they can find enough independent work to exploit their parallel execution capabilities. Hardware developments during the past decade have significantly increased the speed difference between a CPU running at full throughput and minimal throughput, which can now easily be an order of magnitude.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.
Proceedings of the 2005 CIDR Conference

One would expect that query-intensive database workloads such as decision support, OLAP, data mining, but also multimedia retrieval, all of which require many independent calculations, should provide modern CPUs the opportunity to get near-optimal IPC (instructions-per-cycle) efficiencies. However, research has shown that database systems tend to achieve low IPC efficiency on modern CPUs in these application areas [6, 3]. We question whether it should really be that way. Going beyond the (important) topic of cache-conscious query processing, we investigate in detail how relational database systems interact with modern super-scalar CPUs in query-intensive workloads, in particular the TPC-H decision support benchmark.

The main conclusion we draw from this investigation is that the architecture employed by most DBMSs inhibits compilers from using their most performance-critical optimization techniques, resulting in low CPU efficiencies. Particularly, the common way to implement the popular Volcano [10] iterator model for pipelined processing leads to tuple-at-a-time execution, which causes both high interpretation overhead and hides opportunities for CPU parallelism from the compiler.

We also analyze the performance of the main-memory database system MonetDB, developed in our group, and its MIL query language [4]. MonetDB/MIL uses a column-at-a-time execution model, and therefore does not suffer from problems generated by tuple-at-a-time interpretation. However, its policy of full column materialization causes it to generate large data streams during query execution. On our decision support workload, we found MonetDB/MIL to become heavily constrained by memory bandwidth, causing its CPU efficiency to drop sharply. Therefore, we argue for combining the column-wise execution of MonetDB with the incremental materialization offered by Volcano-style pipelining.
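The contrast between tuple-at-a-time interpretation and column-at-a-time execution can be sketched in C (hypothetical names and types, not the paper's actual code): the Volcano-style path pays function-call and dispatch overhead for every tuple, while a column-wise primitive runs a tight loop over arrays that the compiler can unroll and pipeline.

```c
#include <stddef.h>

/* Tuple-at-a-time (Volcano style): one call per tuple, so the
   call/dispatch overhead is paid for every single value. */
typedef struct { double price, discount; } tuple_t;

double eval_discounted_price(const tuple_t *t) {
    return t->price * (1.0 - t->discount);
}

/* Column-at-a-time: one call per array of values. The loop body is
   branch-free and independent across iterations, so the compiler can
   unroll, software-pipeline, or SIMD-vectorize it. */
void map_discounted_price(size_t n, const double *price,
                          const double *discount, double *res) {
    for (size_t i = 0; i < n; i++)
        res[i] = price[i] * (1.0 - discount[i]);
}
```

The column-wise variant amortizes interpretation cost over an entire column, which is exactly what makes MonetDB/MIL CPU-efficient but also what generates the large intermediate data streams criticized above.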
We designed and implemented from scratch a new query engine for the MonetDB system, called X100, that employs a vectorized query processing model. (MonetDB is now in open source; see monetdb.cwi.nl.) Apart from achieving high CPU efficiency, MonetDB/X100 is intended to scale up towards non-main-memory (disk-based) datasets. The second part of this paper is dedicated to describing the architecture of MonetDB/X100 and evaluating its performance on the full TPC-H benchmark of size 100GB.

1.1 Outline

This paper is organized as follows. Section 2 provides an introduction to modern super-scalar (or hyper-pipelined) CPUs, covering the issues most relevant for query evaluation performance. In Section 3, we study TPC-H Query 1 as a micro-benchmark of CPU efficiency, first for standard relational database systems, then in MonetDB, and finally we descend into a standalone hand-coded implementation of this query to get a baseline of maximum achievable raw performance. Section 4 describes the architecture of our new X100 query processor for MonetDB, focusing on query execution, but also sketching topics like data layout, indexing and updates. In Section 5, we present a performance comparison of MIL and X100 inside the Monet system on the TPC-H benchmark. We discuss related work in Section 6, before concluding in Section 7.

2 How CPUs Work

Figure 1 displays for each year in the past decade the fastest CPU available in terms of MHz, as well as the highest performance (the one does not necessarily equate to the other), as well as the most advanced chip manufacturing technology in production that year. The root cause for CPU MHz improvements is progress in chip manufacturing process scales, which typically shrink by a factor 1.4 every 18 months (a.k.a. Moore's law [13]). Every smaller manufacturing scale means twice (the square of 1.4) as many, and twice smaller, transistors, as well as 1.4 times smaller wire distances and signal latencies.
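The vectorized processing model mentioned above can be illustrated with a minimal sketch (all names and the interface shape are hypothetical, assumed for illustration): a Volcano-style next() call that returns a vector of values per invocation instead of a single tuple, so interpretation overhead is amortized over the whole vector.

```c
#include <stddef.h>

enum { VECTOR_SIZE = 1024 };  /* one next() call produces up to this many values */

/* Hypothetical vectorized operator interface: like Volcano's next(),
   but each call fills a whole vector, amortizing per-call overhead
   over VECTOR_SIZE tuples instead of paying it per tuple. */
typedef struct Operator Operator;
struct Operator {
    /* Fill 'out' with up to VECTOR_SIZE values; return count, 0 at end. */
    size_t (*next)(Operator *self, double *out);
    void *state;
};

/* Example leaf operator: scans an in-memory column in vector-sized chunks. */
typedef struct { const double *col; size_t len, pos; } ScanState;

size_t scan_next(Operator *self, double *out) {
    ScanState *s = self->state;
    size_t n = s->len - s->pos;
    if (n > VECTOR_SIZE) n = VECTOR_SIZE;
    for (size_t i = 0; i < n; i++)
        out[i] = s->col[s->pos + i];
    s->pos += n;
    return n;
}
```

A consumer pulls vectors in a loop until next() returns 0, keeping only one vector per operator in flight, which is the incremental materialization that avoids MIL's full-column data streams.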
Thus one would expect CPU MHz to increase with inverted signal latencies, but Figure 1 shows that clock speed has increased even further. This is mainly done by pipelining: dividing the work of a CPU instruction into ever more stages. Less work per stage means that the CPU frequency can be increased. While the 1988 Intel 80386 CPU executed one instruction in one (or more) cycles, the 1993 Pentium already had a 5-stage pipeline, to be increased in the 1999 PentiumIII to 14, while the 2004 Pentium4 has 31 pipeline stages.

[Figure 1: A Decade of CPU Performance. Log-scale plots per year (1994-2002) of CPU MHz, CPU performance (SPECcpu int+fp) and inverted gate distance, with process generations from 500nm to 130nm and CPUs ranging from the Alpha 21064/21164 era (pipelining) to Athlon, Pentium4, POWER4 and Itanium2 (hyper-pipelining).]

Pipelines introduce two dangers: (i) if one instruction needs the result of a previous instruction, it cannot be pushed into the pipeline right after it, but must wait until the first instruction has passed through the pipeline (or a significant fraction thereof), and (ii) in case of IF-a-THEN-b-ELSE-c branches, the CPU must predict whether a will evaluate to true or false. It might guess the latter and put c into the pipeline, just after a. Many stages further, when the evaluation of a finishes, it may determine that it guessed wrongly (i.e. mispredicted the branch), and then must flush the pipeline (discard all instructions in it) and start over with b. Obviously, the longer the pipeline, the more instructions are flushed away and the higher the performance penalty. Translated to database systems, branches that are data-dependent, such as those found in a selection operator on da
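The misprediction cost described above is why data-dependent branches in a selection operator are so expensive, and why they can be rewritten branch-free. A sketch, with assumed function names, of the two styles:

```c
#include <stddef.h>

/* Branching selection: the 'if' depends on the data, so at selectivities
   near 50% the branch predictor guesses wrong often and the long
   pipeline is repeatedly flushed. */
size_t select_lt_branch(size_t n, const int *col, int bound, size_t *out) {
    size_t k = 0;
    for (size_t i = 0; i < n; i++)
        if (col[i] < bound)
            out[k++] = i;
    return k;
}

/* Predicated selection: the comparison result is used as an integer
   (0 or 1) to advance the output cursor, so there is no conditional
   branch to mispredict; the loop runs at a selectivity-independent
   rate. 'out' must have room for n entries. */
size_t select_lt_predicated(size_t n, const int *col, int bound, size_t *out) {
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        out[k] = i;
        k += (col[i] < bound);
    }
    return k;
}
```

Both functions produce the same index list; which one is faster depends on the selectivity and the pipeline depth of the CPU, exactly the trade-off this section sets up.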
