The Hard Problems of Edge AI Hardware
Video is below
Per some feedback that I received, I have edited the script below, taking out the part about "CPUs and GPUs are not seen as acceptable hardware choices for edge AI solutions". This is not true, as CPUs are commonly used for small, sub-$100 items. And GPUs are frequently used in lieu of FPGAs due to their ease of programming. Thanks to Patron Gavin for his input.
I wanted to call out a pretty cool recent episode from the podcast that I guest-host. I talked to a co-founder of the mobile app company Kdan Mobile, based out of my familial hometown of Tainan. Check it out.
Recent deep learning AI models are approaching levels of performance that verge on the uncanny.
When used correctly, tools like GitHub Copilot, GPT-3 and Dall-E 2 can produce occasionally stunning results.
Good for us. But I have a question.
What use is a great model if we can’t use it where it is most needed? Like in the field with the customer.
In this video, we look at the challenges associated with running AI on the edge.
First, we need to explain the Edge. Edge devices are connected to the internet, but are much closer to the actual users than data center devices.
It is a very broad term that covers things like robotics, unmanned aircraft, remote sensing satellites, home entertainment devices like your Amazon Echo, or wearables like your watch. I might also include a smartphone, but would exclude a laptop.
These devices all have differing constraints centering around power management, processing speed, and memory. For instance, an edge AI chip in a self-driving car might prioritize latency - how quickly the AI model can come to a result.
A small commercial aerial drone, on the other hand, only has enough energy for 20-30 minutes of flight. Therefore it can allocate less than 5% of its overall power budget to computing and data processing.
And what kind of jobs are these edge AI devices usually doing? The vast majority of the time, computer vision or audio processing jobs. For instance, natural language processing, facial recognition, traffic prediction, or something along those lines.
These models have gotten larger over the years. In 2012, AlexNet was the state-of-the-art computer vision model and it had 61 million weights.
By contrast, the winning ImageNet model in 2021 - CoCa - now has 2.1 billion weights.
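To put those weight counts in memory terms, here is a quick back-of-the-envelope calculation (assuming each weight is stored as a standard 32-bit float, i.e. 4 bytes):

```python
# Back-of-the-envelope model size, assuming 32-bit (4-byte) floats per weight.
def model_size_mb(num_weights, bytes_per_weight=4):
    """Raw storage needed for a model's weights, in megabytes."""
    return num_weights * bytes_per_weight / 1e6

alexnet_mb = model_size_mb(61e6)   # AlexNet (2012): 61 million weights
coca_mb = model_size_mb(2.1e9)     # CoCa: 2.1 billion weights

print(f"AlexNet: ~{alexnet_mb:.0f} MB")  # ~244 MB
print(f"CoCa:    ~{coca_mb:.0f} MB")     # ~8400 MB, i.e. ~8.4 GB
```

Even the 2012 model would strain a microcontroller with kilobytes of RAM; the modern one dwarfs the memory of most edge devices outright.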
Neural networks inherently consume more energy than non-deep learning alternatives. An object detection neural network can require up to 13,500 times more energy than a Histogram of Oriented Gradients (HOG) based detector. Having so many weights and layers only worsens that burden.
We also have memory constraints. In order to run inference in a meaningful amount of time, the device must repeatedly fetch and store values in memory - the model's weights, the inputs, intermediate results, and so on. This is one of the most energy-intensive actions a chip can take, using up to 200x more energy than a standard multiply-accumulate operation.
So stuffing a model into an edge AI context requires tradeoffs. The model has to be smaller and require less computation, which means worse performance. How much worse depends on various factors, but it will be worse.
Most early tech companies tried side-stepping these formidable issues by offloading everything to the cloud - like Siri or Amazon Echo. The device simply becomes a thin client relaying information between the user and server.
This approach has its upsides, but it also leads to brand new issues. For instance, issues relating to data transmission latency, connection stability, and of course privacy.
There is also a middle hybrid approach in which both the edge and the server share the computational workload. Imagine the edge AI hardware making a first pass of the raw data, and then uploading its results to the cloud for final confirmation.
This can work too. But I also feel that this hybrid approach shares the downsides of both a thin client and onboard processing. You have to maintain models in both edge and server environments.
Optimizations: From the Ground Up
Alright, so you cannot simply run a modern AI model on edge hardware - I've made that clear. But are there things we can do on the software side to make that AI model more suited for an edge environment?
This is a field known as neural network model optimizations and it has gotten to be a very hot space.
The first set of approaches center on the idea of training a compact model from the ground up. Specific examples of neural networks of this type include SqueezeNet or MobileNet.
They often replace traditional neural network structures with new ones in order to reduce the amount of weights in the model. The fewer weights the model has, the smaller its overall size and memory footprint.
In their paper, the SqueezeNet authors claim AlexNet-level accuracy with 50x fewer weights, compressed down to less than half a megabyte.
There have been some interesting studies as of late hinting that neural network models might be larger than they necessarily have to be. So this leads into the second set of approaches: Post-processing a model that has already been trained.
Neural networks are all ultimately exercises in matrix multiplication. So if you can somehow shrink the matrices in a trained model, they take up less memory.
This is the idea behind a methodology called weight quantization. Here we change the way we store a trained model's weights in memory to save space - changing from maybe a 32-bit floating point to 8-bit fixed point.
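As a rough sketch of the idea - a numpy illustration, not any particular framework's implementation - here is what symmetric int8 quantization looks like:

```python
import numpy as np

def quantize_int8(weights):
    """Map float32 weights to int8 plus a scale factor (symmetric quantization)."""
    scale = np.abs(weights).max() / 127.0   # largest magnitude maps to 127
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights for inference."""
    return q.astype(np.float32) * scale

w = np.random.randn(1000).astype(np.float32)  # stand-in for a weight tensor
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print(q.nbytes, "vs", w.nbytes)  # 1000 vs 4000 bytes: a 4x memory saving
print(np.abs(w - w_hat).max())   # small per-weight rounding error
```

Each weight now occupies one byte instead of four, at the cost of a small rounding error per weight - which is exactly the accuracy tradeoff described below.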
Another approach is to try to reduce a model's complexity through pruning - for instance, removing redundant weights. One paper claims that 95% of a neural network's weights can be predicted from a few key weights. Conceptually, you can remove the redundant ones and still retain much of the accuracy.
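A minimal sketch of one common variant, magnitude-based pruning, in numpy. In practice, pruned models are usually fine-tuned afterwards, and the zeroed weights only save memory if stored in a sparse format:

```python
import numpy as np

def prune_by_magnitude(weights, fraction=0.95):
    """Zero out the smallest-magnitude weights, keeping the top (1 - fraction)."""
    threshold = np.quantile(np.abs(weights), fraction)  # cutoff magnitude
    mask = np.abs(weights) >= threshold                 # True = weight survives
    return weights * mask, mask

w = np.random.randn(100, 100).astype(np.float32)  # stand-in for a weight matrix
pruned, mask = prune_by_magnitude(w, fraction=0.95)

print(mask.mean())  # roughly 0.05: only ~5% of weights survive
```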
Regardless of what you do, there is no such thing as a free lunch. You will have tradeoffs between accuracy and memory/power usage. The aforementioned 32-bit floating point to 8-bit fixed point trick apparently results in an accuracy loss of over 12%.
And unfortunately, it seems like the actual outcomes of several optimization methods still consistently miss expectations. It can be difficult to predict an optimization's effect on the model's performance and resource usage. One might meet the spec while the other doesn't.
Device Options: CPUs
Edge AI solution providers need to find the right hardware suited for their particular solution. There are four widely available hardware types capable of serving as edge AI processors - CPUs, GPUs, FPGAs, and ASICs. As always, nothing is perfect. They all have their own uses and drawbacks.
The first are CPUs - a category that also includes microcontrollers, or MCUs. These need little introduction. CPU boards like the Raspberry Pi are easy to program, versatile, low power, and best of all cheap.
The most significant downside of CPUs however is that they are not very parallel, even modern ones with multiple cores. And modern neural network models require a great deal of parallel operations.
With that being said, if the model is small enough to fit into its memory, then even a tiny MCU with something like 100 kilobytes of RAM can run it.
There are some interesting projects like TensorFlow Lite for Microcontrollers.
Which has enabled fun things like a voice-enabled turn controller for cyclists.
Microcontrollers are a big unheralded part of the semiconductor world, with some 250 billion already deployed in the field.
So fields like TinyML, which seek to put relatively sophisticated machine learning onto these extremely constrained hardware environments, have a lot of impact potential.
Device Options: GPUs
Second are the GPUs. Originally designed for gaming, they are massively parallel and easily programmable due to widely used programming platforms like Nvidia's CUDA. This makes them great for training new AI models.
However their parallelism also tends to make them very energy-hungry, which as we talked about makes them less suitable for edge AI inference jobs.
One example of an edge GPU is the Nvidia Jetson. The Jetson Nano is a small, relatively cheap ($99) embedded computer that is kind of like a Raspberry Pi.
Device Options: FPGAs
There are two other edge AI hardware options - FPGAs and ASICs - which are very intriguing.
Field programmable gate arrays or FPGAs have a lot of potential. These integrated circuits are made up of programmable logic blocks and routing interconnections. And like GPUs, they are inherently parallel.
Using hardware description languages like VHDL and Verilog, you can make those logic blocks implement specific logic functions. So you can configure and reconfigure them as needed.
Their flexibility is very useful in certain AI fields like the autonomous car industry, where rules and algorithms can change relatively quickly.
Another advantage of using an FPGA has to do with energy efficiency. As I said, during inference, a neural network model spends the most energy when it accesses memory outside of the chip.
Most modern FPGAs have something called Block RAM - a set of embedded memory blocks - to reduce latency and power consumption. Models that fit entirely within these blocks save significant power.
The big downside of FPGAs is that they have less available memory, bandwidth and computing resources than a GPU. It ranges depending on the device but in some cases, as little as 10 megabytes of on-chip memory.
Furthermore, using them requires a certain design expertise.
CUDA works with popular programming languages like C and C++.
Not as many people are familiar with VHDL and Verilog.
Device Options: ASICs
And finally, we have the ASICs. These are custom processors designed for a very specific task. For instance, AI Chips or AI Accelerators as they are sometimes called are a class of ASICs.
I discussed AI accelerators in a previous video that you might want to watch.
The biggest downside of the ASIC is obvious. You have to invest substantial upfront financial and human resources to design and produce the chip. Designing and producing semiconductor chips with the latest leading edge process nodes can cost millions of dollars.
Furthermore, you cannot change certain architectures after fabrication like you can with FPGAs. Most ASIC makers would try to get around this by building for more generic functionality.
Not many companies will ever consider the option of designing their own edge AI chip from scratch. Luckily, there are a plethora of interesting edge AI Accelerator products available from vendors.
On the tech titan side, you have Intel's Movidius Myriad X VPU, which was announced in 2017. VPU stands for Vision Processing Unit.
It can be used for drones, robotics, smart cameras and so on. Movidius had been a long-running Irish startup specializing in visual processing at the edge before its acquisition by Intel in 2016.
Google has their own Edge TPU, which it says is purpose-built for running inference on the edge.
Various iterations of the product - USB sticks and developer boards - are sold through an initiative called Coral.
Nvidia for their part has the Tegra, a series of system on chips. It probably shouldn't be called an edge AI accelerator, but it is mobile.
The tech giants are joined by a number of small and medium sized companies. It would take way too long to discuss them all, but here are a sampling of a few others that have caught my eye.
Rockchip is a Chinese fabless semiconductor maker based in Fuzhou in Fujian Province. One of their specialty AI chip products is the Rockchip RK1808, a standalone Neural Processing Unit or NPU.
The RK1808 is a chip, but it is also sometimes sold as a USB package called the Toybrick. I reckon that makes it easier to integrate into small projects and whatnot.
Gyrfalcon Technology out in Milpitas, California makes small, low power, and low cost chips.
Their neural accelerators are meant to pair with another processor to handle complicated image recognition and object detection jobs.
Another small one that I want to mention is Kneron out in Hsinchu.
They have been around since 2015. They offer a range of AI chips that can be used for speech and body motion recognition.
Hardware Aware Model & Co-Design
One of the big challenges with delivering Edge AI solutions is that we have to balance between the Hardware and the Model.
They are extremely tightly bound together. Tweaking one messes with the other, and this greatly slows iteration and the rate of progress.
We need to get faster at this. So why not use neural networks to help with it? Recently, there has been some interesting research on something called hardware-aware Neural Architecture Search.
This is where you include certain hardware variables into the neural network model itself so that it can optimally run on a specific hardware - usually a GPU or FPGA.
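A toy sketch of that idea: score candidate architectures on accuracy, but reject any that miss the target device's latency budget. All names and numbers below are made up for illustration - real NAS systems train or estimate these values rather than hardcoding them:

```python
# Toy hardware-aware architecture search: pick the most accurate candidate
# that still meets a hard latency budget on the target device.
candidates = [
    # (name, estimated accuracy, estimated latency in ms on target hardware)
    ("width_x0.5", 0.68, 4.0),
    ("width_x1.0", 0.74, 9.0),
    ("width_x2.0", 0.78, 22.0),
    ("width_x4.0", 0.80, 55.0),
]

LATENCY_BUDGET_MS = 10.0  # hard constraint imposed by the edge device

def search(candidates, budget):
    """Return the highest-accuracy candidate within the latency budget."""
    feasible = [c for c in candidates if c[2] <= budget]
    return max(feasible, key=lambda c: c[1]) if feasible else None

best = search(candidates, LATENCY_BUDGET_MS)
print(best)  # ('width_x1.0', 0.74, 9.0): best accuracy under budget
```

The hardware variables (here, just latency) act as constraints on the search rather than afterthoughts applied to a finished model.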
With ASICs, this search doesn't work so well because the hardware itself can be widely customized. But ASICs open up the intriguing possibility of simultaneously co-designing both the hardware and algorithms together.
This is kind of like Design Technology Co-Optimization, the process of crafting both the chip's process node and chip design with an eye towards shared success. It has a lot of potential for the edge AI hardware space too.
Massive models are more powerful than ever. We can see what they might be capable of. But Edge AI hardware makers face challenging economic and possibly physical limits in accommodating these models.
During the second half of the 20th century, computers helped unlock unprecedented benefits in industry and commerce. AI has the potential to do the same.
But if the edge hardware never gets to a satisfactory point, then I fear that AI's full potential will remain locked away in the ephemeral cloud. I hope the industry continues to evolve and push forward.