Is it time to start using high-level synthesis?


http://blogs.cancom.com/elogic_920000692/2010/04/30/is-it-time-to-start-using-high-level-synthesis/


One big question people have about high-level synthesis (HLS) is whether or not it is ready for mainstream use. In other words, does it really work (yet)? HLS has a long history, starting with products like Synopsys's Behavioral Compiler and Cadence's Visual Architect, which never achieved any serious adoption. Then came a next generation of tools from Synfora and Forte, along with Mentor's Catapult. More recently still, there are AutoESL and Cadence's CtoSilicon.

I met Atul, CEO of AutoESL, last week and he gave me a copy of an interesting report that they had commissioned from Berkeley Design Technology (BDT), who set out to answer the question "does it work?" at least for the AutoESL product, AutoPilot. Since HLS is a competitive market, and the companies in the space are constantly benchmarking against each other at customers and all are making some sales, I think it is reasonable to take this report as a proxy for all the products in the space. Yes, I'm sure each product has its own strengths and weaknesses, and different products have different input languages (for instance, Forte only accepts SystemC, Synfora only accepts C, and AutoESL accepts C, C++ and SystemC).

BDT ran two benchmarks. One was a video motion analysis algorithm and the other was a DQPSK (think wireless router) receiver. Both were synthesized using AutoPilot and then Xilinx’s tool-chain to create a functional FPGA implementation.

The video algorithm was examined in two ways: first, with a fixed workload at 60 frames per second, how “small” a design could be achieved. Second, given the limitations of the FPGA, how high a frame rate could be achieved. The wireless receiver had a spec of 18.75 megasamples/second, and was synthesized to see what the minimal resources required were to meet the required throughput.
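A spec like 18.75 megasamples/second translates directly into a clock and initiation-interval budget for the synthesized datapath. The report doesn't give this arithmetic; the helper below is my own sketch of the relationship, not anything from BDT or AutoESL:

```c
/* Minimal sketch (not from the report): a pipelined datapath that
 * accepts one new sample every II clock cycles sustains a sample
 * rate of f_clk / II.  So meeting a given sample rate requires a
 * clock of at least sample_rate * II. */
double min_clock_hz(double sample_rate_hz, unsigned ii)
{
    return sample_rate_hz * (double)ii;
}
```

At II = 1, the 18.75 MS/s receiver spec needs only an 18.75 MHz clock, which is a very comfortable target for the Spartan-3A DSP part used here; the interesting question BDT asked is how few FPGA resources suffice at that rate.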

For comparison, they implemented the video algorithm using Texas Instruments TMS320 DSP processors, chips that cost roughly the same (mid-$20s) as the FPGA they were using, Xilinx's XC3SD3400A.

The video algorithm used 39% of the FPGA, but achieving the same result with the DSPs required at least 12 of them working in parallel, obviously a much more costly and power-hungry solution. When they looked at how high a frame rate could be achieved, the AutoPilot/FPGA solution reached 183 frames per second, versus 5 frames per second for the DSP. The implementation effort for the two solutions was roughly the same. The maximum-frame-rate version is quite a big design, using ¾ of the FPGA; AutoPilot read 1,600 lines of C and turned them into 38,000 lines of Verilog in 30 seconds.
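To give a flavor of what the C input to a tool like this looks like, here is a sketch of a sum-of-absolute-differences kernel, a typical building block of video motion analysis. This is my own illustration, not code from the report; the pragma syntax shown is the style used by AutoPilot's descendants (today's Vitis HLS), and a software compiler simply ignores it:

```c
#include <stdint.h>

#define BLK 8

/* Illustrative HLS-style C (hypothetical, not from the BDT report):
 * sum of absolute differences between an 8x8 block of the current
 * frame and a candidate block of the reference frame. */
uint32_t sad_8x8(uint8_t cur[BLK][BLK], uint8_t ref[BLK][BLK])
{
    uint32_t acc = 0;
row:
    for (int y = 0; y < BLK; y++) {
col:
        for (int x = 0; x < BLK; x++) {
#pragma HLS PIPELINE II=1  /* ask the tool for one result per clock */
            int d = (int)cur[y][x] - (int)ref[y][x];
            acc += (uint32_t)(d < 0 ? -d : d);
        }
    }
    return acc;
}
```

The point is that this is ordinary, testable C: you debug it with a software compiler, then let the HLS tool unroll, pipeline and schedule it into RTL, which is how 1,600 lines of C can become 38,000 lines of Verilog.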

For the wireless receiver they also had a hand-written RTL implementation for comparison. AutoPilot managed to get the design into 5.6% of the FPGA, and the hand-written implementation achieved 5.9%. I don't think the difference is significant, and I think it is fair to say that AutoPilot is on a par with hand-coded RTL (at least for this example, ymmv). Using HLS also reduced the development effort by at least 50%.

BDT’s conclusion is that they “were impressed with the quality of results that AutoPilot was able to produce given that this has been a historic weakness for HLS tools in general.” The only real negative is that the tool chain is more expensive (since AutoESL doesn’t come bundled with your FPGA or your DSP).

It would, of course, be interesting to see the same reference designs put through other HLS tools, to know whether these results generalize. But it does look as if HLS is able to achieve results comparable with hand-written RTL, at least for this sort of DSP algorithm. Still, to be fair to the hand-coders, these sorts of DSP algorithms, where throughput matters more than latency, are a sweet spot for HLS.

If you want to read the whole report, it’s here.
