    Acceleration Introduction

    Semiconductor scaling is no longer delivering the CPU performance gains that system designers have come to expect. This is especially true for single-threaded execution of general-purpose software. Yet systems continue to see demand for increased compute power, which means developers need to be creative about how to deliver that performance. One option is to offload computationally intensive algorithms into hardware accelerators. This is really only an option for developers who are building SoCs, or systems with available FPGA capacity. But with the ability to move functions off the processor and into hardware, developers can gain significant performance and efficiency benefits.

    Semiconductor scaling

    In the glory days of Moore’s law, developers could regularly expect to get more silicon that ran faster and consumed less power. While Moore’s law carries on, it is no longer delivering faster CPU processing, for a number of reasons. At about the 65 nm node, Dennard scaling began to break down. Up until then, as feature size shrank, power shrank correspondingly, so a given area of silicon consumed about the same amount of power regardless of the feature size of its transistors. Beyond 65 nm, power stopped scaling down at the same rate, and power dissipation became a real problem for designs. The net effect was that processor clock rates stalled, at around 2-3 GHz.

    Unable to increase the clock rate, CPU architects could still take advantage of the growing number of available gates. They created ever more performance-enhancing features: branch prediction, out-of-order execution, speculative execution, and more. But even those features are reaching the point of diminishing returns. There is little more that can be done to a processor to improve single-threaded, general-purpose compute performance.

    And there is the economic side of the equation. Transistors had been getting exponentially cheaper to build, but that trend broke down at 28 nm. Below 28 nm, the per-transistor production cost stopped going down, so each additional transistor now adds directly to the cost of a chip. This, along with the rising costs of mask preparation and NRE, means that for many products it will not be practical to move to smaller geometries, simply for business reasons.

    Domain-Specific Computing

    Without faster processors, how can developers meet the increasing demand for performance? David Patterson, the vice-chair of the board of directors of the RISC-V Foundation, believes that “domain-specific computing” will answer the call. Domain-specific computing means augmenting the capabilities of the general-purpose processor with domain-specific acceleration hardware. This augmentation can come in the form of new instructions, co-processors, or bus-based accelerators.

    Both ARM and RISC-V have ample facilities for developing extensions to their instruction sets. And while new instructions can certainly boost performance, any instruction needs to maintain a certain level of serialization to support what is known as the “programmer’s model”: the concept that the programmer can think of each instruction as executing serially and atomically, even though that is not what really happens in modern processors.

    Co-processors and external accelerators can bring greater parallelism to bear, as they are not bound by the same restrictions that apply to instructions. They can also operate on larger sets of data, while instructions are limited to a subset of the processor’s registers. Profiling an embedded processor on a computationally intensive routine, like a convolution, shows just how limited the parallelism is. A RISC-V Rocket core running a convolution performs one multiply operation every 22 clocks, on average, while a dedicated convolution accelerator can easily perform one multiply per clock. Constructed with parallel multipliers, an accelerator could just as easily perform 10 or 20 multiplications per clock, a dramatic speed-up over software.
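
    To make the opportunity concrete, consider the inner loop of a 1-D convolution. In software, each multiply-accumulate executes one at a time; in hardware, the tap loop can be fully unrolled so that all of the multiplies happen in the same clock. The C++ sketch below is a generic illustration, not the actual code profiled on the Rocket core:

        // 1-D convolution: each output sample needs TAPS multiply-accumulates.
        // A CPU issues these multiplies serially; an accelerator with TAPS
        // parallel multipliers can produce one output sample per clock.
        const int TAPS = 16;

        void convolve(const int in[], const int coeff[TAPS], int out[], int n) {
            for (int i = 0; i + TAPS <= n; ++i) {
                int acc = 0;
                for (int t = 0; t < TAPS; ++t) {   // fully unrollable in hardware
                    acc += in[i + t] * coeff[t];
                }
                out[i] = acc;
            }
        }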

    Building an accelerator

    The first step in building an accelerator is to understand what can and should be accelerated. Profiling the application or system will show where the software bottlenecks are. Profiling can be done in a number of ways. One is to run the system on a virtual prototype; it does not need to be a clock-cycle-accurate model of the system, as any significant bottleneck will still show up. Another option is to use a development board with the target processor in a configuration similar to the system being designed. For systems with relatively little hardware content or dependencies, an instruction set simulator like the ARMulator or QEMU would work.
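
    When none of those environments is available, even a coarse timing harness around a suspected hot spot can confirm a bottleneck. Below is a minimal C++ sketch using std::chrono; the convolve function is the hypothetical candidate from the earlier example:

        #include <chrono>
        #include <cstdio>

        // Hypothetical candidate routine being considered for acceleration.
        void convolve(const int in[], const int coeff[], int out[], int n);

        void profile_candidate(const int *in, const int *coeff, int *out, int n) {
            auto start = std::chrono::steady_clock::now();
            convolve(in, coeff, out, n);
            auto stop = std::chrono::steady_clock::now();
            auto us = std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
            std::printf("convolve took %lld us\n", static_cast<long long>(us.count()));
        }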

    Once the function (or functions) to move into an accelerator is identified, the next step is partitioning and defining the interfaces. This involves selecting the specific code to be moved, identifying how it will be interfaced to the processor and the remaining body of software, and deciding how data will be passed between the software and the accelerator. If the processing of the data is to be shared between hardware and software, then the accelerator should be created as a co-processor. If the accelerator’s processing is at least somewhat independent of the software, or it accesses a significant amount of data, then a bus-based accelerator is preferred.

    Creating the RTL

    Of course, the accelerator could be created by writing Verilog RTL directly. With the function to be accelerated available in software, that code can serve both as an executable specification during development and as a golden reference during verification. But this approach requires a human to interpret the code and translate it into Verilog, which is slow and error-prone.

    Alternatively, the algorithm in its original software form can be leveraged as input to an HLS tool. If it is written in a language supported by the HLS tool, typically C or C++, it can be used directly.
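
    HLS tools generally want statically analyzable code: fixed loop bounds, no dynamic memory allocation, and explicit bit widths. As a sketch of what that looks like, here is the earlier convolution rewritten with bit-accurate types from the Algorithmic C datatype library (the specific widths and the fixed problem size N are illustrative assumptions; pragmas and constraints vary by tool):

        #include <ac_int.h>   // Algorithmic C bit-accurate integer types

        const int TAPS = 16;
        const int N    = 1024;               // fixed, synthesizable problem size

        typedef ac_int<16, true> sample_t;   // 16-bit signed samples
        typedef ac_int<40, true> acc_t;      // wide accumulator to avoid overflow

        void convolve_hls(const sample_t in[N], const sample_t coeff[TAPS],
                          sample_t out[N - TAPS + 1]) {
            SAMPLE: for (int i = 0; i < N - TAPS + 1; ++i) {
                acc_t acc = 0;
                TAP: for (int t = 0; t < TAPS; ++t) {   // candidate for full unroll
                    acc += in[i + t] * coeff[t];
                }
                out[i] = acc;   // truncates to 16 bits; a real design would round or saturate
            }
        }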

    Details on the basics of designing hardware with HLS can be found in the HLS 101 topic.

    Interfacing

    With the core of the accelerator completed, the next stage is to create the interfaces, in both hardware and software, that integrate its capabilities into the system. For co-processors, the hardware interface needs to connect to the co-processor signals on the processor core. These are often ready-valid-data triplets, so they map naturally to ac_channels. For RISC-V based designs, co-processors can access memory through the core’s AXI input port; transactions going through this port use the core’s MMU, its virtual-to-physical translation logic, and its cache hierarchy. Co-processor software interfacing is done through co-processor instructions: there will be instructions for reading and writing data, as well as for controlling the co-processor.
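
    In HLS C++, a channel-based co-processor interface can be expressed directly with ac_channel objects, which synthesize to ready-valid-data handshakes. The operand and result framing below is a hypothetical sketch, not any specific core’s co-processor protocol:

        #include <ac_channel.h>
        #include <ac_int.h>

        typedef ac_int<32, false> word_t;

        // Each ac_channel maps to a ready-valid-data handshake in the RTL.
        void coproc_top(ac_channel<word_t> &from_cpu,   // operands from the core
                        ac_channel<word_t> &to_cpu) {   // results back to the core
            word_t a = from_cpu.read();   // blocking read: waits until data is valid
            word_t b = from_cpu.read();
            to_cpu.write(a * b);          // placeholder computation
        }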

    For bus-based peripherals, a register bank needs to be created on the slave side. This register bank interfaces to control and status signals, as well as data channels. The logic for the register bank is relatively simple, and some HLS tools will create it automatically from the top-level function description. The register bank is connected to the system bus and mapped to a specific address range. Software accesses it through address references: in a bare-metal C program this is a simple pointer dereference, while on OS-based systems it would be done in driver code.
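
    On the software side, a bare-metal driver for such a register bank reduces to volatile pointer dereferences within the mapped address range. The base address and register layout below are invented for illustration:

        #include <cstdint>

        // Hypothetical base address and register map for the accelerator.
        #define ACCEL_BASE   0x40010000u
        #define REG_CTRL     0x00u   // bit 0: start
        #define REG_STATUS   0x04u   // bit 0: done
        #define REG_SRC      0x08u   // source buffer address
        #define REG_DST      0x0Cu   // destination buffer address

        static inline volatile uint32_t *accel_reg(uint32_t offset) {
            return reinterpret_cast<volatile uint32_t *>(ACCEL_BASE + offset);
        }

        void accel_run(uint32_t src, uint32_t dst) {
            *accel_reg(REG_SRC)  = src;
            *accel_reg(REG_DST)  = dst;
            *accel_reg(REG_CTRL) = 1;                     // start the accelerator
            while ((*accel_reg(REG_STATUS) & 1) == 0) {   // poll until done
            }
        }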

    On the master side of a bus-based peripheral, a bus master interface needs to be created. HLS tools can create interfaces for common SoC buses and add them to the top-level module of the generated RTL. This permits the code in the accelerator to reach into system memory to fetch data to be processed and to write back results.
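
    On the C++ side, the usual handle for a bus master is a pointer or array argument on the top-level function, which the HLS tool can map to a master port on the system bus. The sketch below assumes such a mapping; the pragmas or directives that select the bus protocol are tool-specific and omitted:

        // 'mem' stands in for system memory; an HLS tool can map this pointer
        // argument to a bus master port so the accelerator fetches its own data.
        // The offsets and the placeholder processing are illustrative only.
        void accel_master(int *mem, int src_off, int dst_off, int n) {
            for (int i = 0; i < n; ++i) {
                mem[dst_off + i] = mem[src_off + i] * 3;   // placeholder processing
            }
        }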

    Integrating and testing

    Before synthesis, the software can make calls directly to the HLS input code. This verifies both that the HLS code works in the context of the larger body of software and that it matches the behavior and function of the original code. These verification runs go thousands of times faster than HDL or post-synthesis simulation, and any problem found and fixed at this level is one less to debug in later verification stages.
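
    A minimal pre-synthesis check is a harness that drives the original function and the HLS input code with the same stimulus and compares the outputs. The sketch below assumes the hypothetical convolve routine from earlier and an int-typed convolve_hls compiled and linked from the HLS source:

        #include <cstdio>
        #include <cstdlib>

        const int TAPS = 16;
        const int N    = 1024;

        // Original software routine: the golden reference.
        void convolve(const int in[], const int coeff[], int out[], int n);
        // HLS input code, compiled and linked alongside the reference.
        void convolve_hls(const int in[], const int coeff[], int out[], int n);

        int main() {
            static int in[N], coeff[TAPS], ref[N], dut[N];
            for (int i = 0; i < N; ++i)    in[i]    = std::rand() % 256;
            for (int t = 0; t < TAPS; ++t) coeff[t] = std::rand() % 16;

            convolve(in, coeff, ref, N);       // reference result
            convolve_hls(in, coeff, dut, N);   // same stimulus through the HLS code

            for (int i = 0; i + TAPS <= N; ++i) {
                if (ref[i] != dut[i]) {
                    std::printf("MISMATCH at %d: ref=%d dut=%d\n", i, ref[i], dut[i]);
                    return 1;
                }
            }
            std::printf("PASS\n");
            return 0;
        }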

    With the accelerator and its interfaces designed and implemented, they can be added to the processor design and verified. The most straightforward way to undertake validation is to run the original software function in parallel with the accelerator and compare the results. This same setup can be used to measure the performance difference between the original software version and the hardware implementation.

    For systems that contain large bodies of software, such as an embedded Linux operating system or sizable applications, an FPGA prototype or emulation system would be appropriate for speeding up verification at the RTL level.