1. AI/ML Accelerators & Design

    AI Acceleration

    AI inferencing algorithms are ideal candidates for hardware acceleration. They are computationally complex, run slowly in software, and have a high degree of data parallelism. Small neural networks can take millions of multiply-accumulate operations (MACs) to produce a result; larger ones can take billions, and large language models and similarly complex networks can take trillions. A single fully connected layer with 1,000 inputs and 1,000 outputs, for example, already requires a million MACs. To meet growing expectations of capability and prediction accuracy, neural networks are growing larger and more complex, and they are growing faster than Moore’s law and the advancement of silicon technology. For most neural networks, this level of computation is beyond what embedded processors can deliver, driving developers to move from generalized compute resources to more specialized platforms, and even custom-built hardware acceleration.

    In some cases, the computation of these inferences can be off-loaded over a network to a data center. Increasingly, devices have fast and reliable network connections, making this a viable option for many systems. However, many systems have hard real-time requirements that cannot be met by even the fastest and most reliable networks. For example, any system with autonomous mobility – self-driving cars or self-piloted drones – needs to make decisions faster than a round trip to an off-site data center allows. There are also systems that process sensitive data that should not be sent over networks. And anything that goes over a network introduces an additional attack surface for hackers. For all of these reasons - performance, privacy, and security - some inferencing will need to be done on embedded systems.

    When deploying a custom AI accelerator, there are two fundamental things that need to be done to optimize the implementation. One is to reduce the size of the neural network. This reduces the number of computations, the area needed to store the coefficients, and the communication bandwidth to move the weights and intermediate results. The second is to bring as much parallelism to the implementation as possible. This increases performance by executing multiple operations at the same time.
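
    As a sketch of the second point, the hypothetical C++ function below computes a dot product in a form an HLS tool can accept; fully unrolling the inner loop lets the tool build one multiplier per iteration so that all of the MACs execute in parallel (the unroll directive itself is tool-specific, so it appears here only as a comment).

        #include <cstddef>

        // Dot product of a weight vector and an activation vector.
        // With the loop fully unrolled, an HLS tool can instantiate N
        // multipliers and an adder tree that operate in parallel.
        template <size_t N>
        int dot_product(const int (&weights)[N], const int (&activations)[N]) {
            int acc = 0;
            // (tool-specific unroll directive would go here)
            for (size_t i = 0; i < N; ++i) {
                acc += weights[i] * activations[i];   // one MAC per iteration
            }
            return acc;
        }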

    Reducing the size of the neural network is done by “pruning.” Pruning removes weight connections, features, channels, and even entire layers. Counterintuitively, pruning can increase the accuracy of the network. Beyond a certain point, however, it will reduce the accuracy of the model, but that reduction can be quite small compared to the reduction in the size of the network. The best pruning techniques can reduce the size of a network by 95% with less than a 1% loss of accuracy.
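
    A minimal sketch of one common approach, magnitude-based pruning, is shown below; the function name and the idea of a single global threshold are illustrative assumptions rather than a prescription from any particular framework.

        #include <cmath>
        #include <cstddef>

        // Magnitude-based pruning: weights whose absolute value falls below
        // a threshold are forced to zero, shrinking the effective network.
        // Returns the number of weights removed.
        size_t prune_weights(float* weights, size_t count, float threshold) {
            size_t pruned = 0;
            for (size_t i = 0; i < count; ++i) {
                if (std::fabs(weights[i]) < threshold) {
                    weights[i] = 0.0f;   // this connection no longer contributes
                    ++pruned;
                }
            }
            return pruned;
        }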

    Another approach to reducing the size of the network is “quantization.” General-purpose accelerators usually use floating-point numbers, because virtually all neural networks are developed in Python on general-purpose computers using floating-point arithmetic. To ensure correct support of those networks, a general-purpose accelerator must, of course, support floating-point numbers. However, most of the values in a neural network are close to 0 and require a lot of precision there, while large values require less precision. And floating-point multipliers are huge; if they are not needed, omitting them from the design saves a lot of area and power. Moving from a floating-point representation to a fixed-point representation reduces operator area significantly.
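
    The conversion itself is straightforward: a real-valued weight is scaled by 2 to the power of the number of fractional bits and rounded to an integer. The sketch below is a simplified illustration of that idea, with the saturation policy chosen here as an assumption.

        #include <algorithm>
        #include <cmath>
        #include <cstdint>

        // Convert a floating-point weight to a signed fixed-point value with
        // 'frac_bits' fractional bits, saturating to the range representable
        // in a word of 'total_bits' bits.
        int32_t to_fixed(float w, int total_bits, int frac_bits) {
            const float scaled = w * std::ldexp(1.0f, frac_bits);   // w * 2^frac_bits
            const int32_t max_val = (1 << (total_bits - 1)) - 1;
            const int32_t min_val = -(1 << (total_bits - 1));
            const int32_t rounded = static_cast<int32_t>(std::lround(scaled));
            return std::clamp(rounded, min_val, max_val);
        }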

    Some NPUs support integer or fixed-point representations, sometimes in a variety of sizes. But supporting multiple numeric formats adds circuitry, which consumes power and adds propagation delays. Choosing one representation and using it exclusively enables a smaller, faster implementation. Configurable IP allows the selection of just one numeric representation, but usually maintains variable sizes for images and filters, limiting overall customization and thus performance and efficiency.

    When building a bespoke accelerator, one is not limited to 8 bits or 16 bits; any size can be used. Picking the correct numeric representation, or “quantizing” the neural network, allows the data and the operators to be optimally sized. Quantization can significantly reduce the data that needs to be stored, moved, and operated on. Reducing the memory footprint of the weight database and shrinking the multipliers greatly improves the area and power of a design. For example, a 10-bit fixed-point multiplier is about 20 times smaller than a 32-bit floating-point multiplier and, correspondingly, uses about 1/20th the power. This means the design can be much smaller and more energy efficient by using the smaller multiplier, or the designer can spend the same area on 20 multipliers operating in parallel, producing much higher performance with the same resources.
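
    With the bit-accurate integer types from the hlslibs AC datatypes (referenced again below), such a custom width can be expressed directly in C++, so the synthesized multiplier is exactly 10x10 bits rather than a full floating-point unit. The snippet is a sketch assuming those ac_int types; the function name is illustrative.

        #include <ac_int.h>

        // A signed 10-bit by 10-bit multiply. HLS sizes the multiplier to
        // exactly these operands; the full product fits in 20 bits.
        ac_int<20, true> mul10(ac_int<10, true> weight, ac_int<10, true> activation) {
            ac_int<20, true> product = weight * activation;
            return product;
        }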

    Building a bespoke AI accelerator can drastically improve both performance and power consumption compared to a software implementation. But it can also outperform off-the-shelf accelerators and even configurable IP.

    One of the key challenges in building a bespoke machine learning accelerator is that the data scientists who created the neural network usually do not understand hardware design, and the hardware designers do not understand data science. In a traditional design flow, they would use “meetings” and “specifications” to transfer knowledge and share ideas. But, honestly, no one likes meetings or specifications, and neither is particularly effective at exchanging information.

    High-Level Synthesis (HLS) allows an implementation produced by the data scientists to be used, not just as an executable reference, but as a machine-readable input to the hardware design process. This eliminates the manual reinterpretation of the algorithm in the design flow, which is slow and extremely error prone. HLS synthesizes an RTL implementation from an algorithmic description. Usually, the algorithm is described in C++ or SystemC, but a number of design flows like HLS4ML are enabling HLS tools to take neural network descriptions directly from machine learning frameworks.
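
    To make this concrete, the sketch below shows the kind of C++ an HLS tool can take as input: a small fully connected layer with a ReLU activation written as an ordinary function with fixed array sizes. The layer dimensions and names are hypothetical.

        // A small fully connected layer with a ReLU activation, written as
        // synthesizable C++. An HLS tool maps the loops to MAC hardware and
        // the arrays to memories or registers.
        const int IN  = 16;
        const int OUT = 8;

        void dense_relu(const int in[IN], const int weights[OUT][IN],
                        const int bias[OUT], int out[OUT]) {
            for (int o = 0; o < OUT; ++o) {
                int acc = bias[o];
                for (int i = 0; i < IN; ++i) {
                    acc += weights[o][i] * in[i];   // multiply-accumulate
                }
                out[o] = (acc > 0) ? acc : 0;       // ReLU
            }
        }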

    HLS enables a practical exploration of quantization in a way that is not yet practical in machine learning frameworks. Fully understanding the impact of quantization requires a bit-accurate implementation of the algorithm, including characterization of the effects of overflow, saturation, and rounding. Today this is only practical in hardware description languages (HDLs) or with HLS bit-accurate data types (https://hlslibs.org).
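
    As a sketch of what such a bit-accurate model looks like, the snippet below uses the ac_fixed type from the hlslibs AC datatypes; the widths and the rounding (AC_RND) and saturation (AC_SAT) modes shown are illustrative choices, not a recommendation.

        #include <ac_fixed.h>

        // 10 bits total, 2 integer bits, signed, round to nearest, saturate
        // on overflow. Running the algorithm with these types shows exactly
        // how rounding and saturation affect prediction accuracy.
        typedef ac_fixed<10, 2, true, AC_RND, AC_SAT> weight_t;
        typedef ac_fixed<20, 4, true, AC_RND, AC_SAT> acc_t;

        acc_t mac(acc_t acc, weight_t w, weight_t x) {
            return acc + w * x;   // sum is rounded/saturated into acc_t
        }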

    As machine learning becomes ubiquitous, more embedded systems will need to deploy inferencing accelerators. HLS is a practical and proven way to create bespoke accelerators, optimized for a very specific application, that deliver higher performance and efficiency than general-purpose NPUs.