EDN: AI at the Edge: It’s Just Getting Started

Artificial intelligence (AI) is expanding rapidly to the edge. This generalization conceals many more specific advances—many kinds of applications, with different processing and memory requirements, moving to different kinds of platforms.

Date Published: January 8, 2025

Figure 1 TinyML inference models are being embedded at the extreme edge, in smart sensors and small consumer devices.

One of the most exciting instances, happening soonest and with the most impact on users, is the appearance of TinyML inference models embedded at the extreme edge—in smart sensors and small consumer devices.

This innovation is enabling valuable functions such as keyword spotting (detecting spoken keywords) or performing environmental-noise cancellation (ENC) with a single microphone. Users treasure the lower latency, reduced energy consumption, and improved privacy.

Local execution of TinyML models depends on the convergence of two advances. The first is the TinyML model itself. While most of the world’s attention is focused on enormous—and still growing—large language models (LLMs), some researchers are developing really small neural-network models built around hundreds of thousands of parameters instead of millions or billions. These TinyML models are proving very capable on inference tasks with predefined inputs and a modest number of inference outputs.
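
To put that scale in perspective, here is a minimal sketch, in PyTorch, of the kind of depthwise-separable convolutional network commonly used for keyword spotting. The architecture, channel width, and 12-keyword output are assumptions chosen for illustration rather than any particular vendor's model, but they land the parameter count in the hundreds of thousands, squarely in TinyML territory.

```python
# Illustrative TinyML-scale keyword-spotting model (assumed architecture,
# not any specific product's network). Input: a 1 x 49 x 10 MFCC feature map.
import torch
import torch.nn as nn

def ds_block(cin: int, cout: int) -> nn.Sequential:
    # Depthwise-separable convolution: frugal with parameters and MACs.
    return nn.Sequential(
        nn.Conv2d(cin, cin, 3, padding=1, groups=cin, bias=False),  # depthwise
        nn.Conv2d(cin, cout, 1, bias=False),                        # pointwise
        nn.BatchNorm2d(cout),
        nn.ReLU(),
    )

class TinyKWS(nn.Module):
    def __init__(self, num_keywords: int = 12, width: int = 192):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, width, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(width),
            nn.ReLU(),
            ds_block(width, width),
            ds_block(width, width),
            ds_block(width, width),
            ds_block(width, width),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(width, num_keywords)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x).flatten(1))

model = TinyKWS()
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params:,} parameters")  # about 160,000: TinyML scale, not LLM scale
```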

The second advance is in highly efficient embedded architectures for executing these tiny models. Instead of a server board or a PC, think of a die small enough to go inside an earbud and efficient enough to not harm battery life.

Several approaches

There are many important tasks involved in neural-network inference, but the computing workload is dominated by matrix multiplication operations. The key to implementing inference at the extreme edge is to perform these multiplications with as little time, power, and silicon area as possible. The key to launching a whole successful product line at the edge is to choose an approach that scales smoothly, in small increments, across the whole range of applications you wish to cover.
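
As a rough illustration of that balance (with an assumed layer size, not a measured profile), the NumPy sketch below runs a single fully connected layer and counts its operations: the multiply-accumulates of the matrix-vector product outnumber everything else by roughly two orders of magnitude.

```python
# Rough illustration of why matrix multiplication dominates inference cost.
# The layer shape (256 inputs, 128 outputs) is an assumption for illustration.
import numpy as np

in_features, out_features = 256, 128
W = np.random.randn(out_features, in_features).astype(np.float32)  # weights
b = np.zeros(out_features, dtype=np.float32)                       # bias
x = np.random.randn(in_features).astype(np.float32)                # activations

y = np.maximum(W @ x + b, 0.0)         # matrix-vector product, bias, ReLU

macs = out_features * in_features      # 32,768 multiply-accumulates
other_ops = 2 * out_features           # 128 bias adds + 128 ReLU comparisons
print(macs, other_ops)                 # the matmul dwarfs everything else
```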

It is the nature of the technology that models get larger over time.

System designers are taking different approaches to this problem. For the tiniest of TinyML models in applications that are not particularly sensitive to latency, a simple microcontroller core will do the job. But even for small models, MCUs with their constant fetching, loading, and storing are not an energy-efficient approach. And scaling to larger models may be difficult or impossible.

For these reasons many choose DSP cores to do the processing. DSPs typically have powerful vector-processing subsystems that can perform hundreds of low-precision multiply-accumulate operations per cycle. They employ automated load/store and direct memory access (DMA) operations cleverly to keep the vector processors fed. And often DSP cores come in scalable families, so designers can add throughput by adding vector processor units within the same architecture.
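
The low-precision multiply-accumulate at the heart of that vector work can be written out in scalar form, as in the sketch below; the vector length and quantization scales are illustrative, and a real DSP would execute many such lanes in a single cycle rather than looping.

```python
# Scalar sketch of the int8 multiply-accumulate that DSP vector units run
# many lanes at a time. Vector length and scale factors are illustrative.
import numpy as np

w_q = np.random.randint(-128, 128, size=256, dtype=np.int8)  # quantized weights
x_q = np.random.randint(-128, 128, size=256, dtype=np.int8)  # quantized inputs

acc = np.int32(0)
for w, x in zip(w_q, x_q):
    acc = acc + np.int32(w) * np.int32(x)   # 8-bit multiply, 32-bit accumulate

w_scale, x_scale = 0.02, 0.05               # assumed quantization scales
y = float(acc) * w_scale * x_scale          # dequantized dot-product result
print(int(acc), y)
```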

But this scaling is coarse-grained, and at some point, it becomes necessary to add a whole DSP core or more to the design, and to reorganize the system as a multicore approach. And, not unlike the MCU, the DSP consumes a great deal of energy in shuffling data between instruction memory and instruction cache and instruction unit, and between data memory and data cache and vector registers.

For even larger models and more latency-sensitive applications, designers can turn to dedicated AI accelerators. These devices, generally either based on GPU-like SIMD processor arrays or on dataflow engines, provide massive parallelism for the matrix operations. They are gaining traction in data centers, but their large size, their focus on performance over power, and their difficulty in scaling down significantly make them less relevant for the TinyML world at the extreme edge.

Another alternative

There is another architecture that has been used with great success to accelerate matrix operations: processing-in-memory (PiM). In this approach, processing elements, rather than being clustered in a vector processor or pipelined in a dataflow engine, are strategically dispersed at intervals throughout the data memory. This has important benefits.

First, since processing units are located throughout the memory, processing is inherently highly parallel. And the degree of parallel execution scales smoothly: the larger the data memory, the more processing elements it will contain. The architecture need not change at all.

In AI processing, 90–95% of the time and energy is consumed by matrix multiplication, as every weight in a layer must be multiplied by the activations flowing through to the next layer. The inefficiency lies largely in the constant movement of weights and activations between memory and processors, and PiM addresses it by eliminating that data movement.

By storing AI model weights directly within memory elements and performing matrix multiplication inside the memory itself as input data arrives, PiM significantly reduces data transfer overhead. This approach not only enhances energy efficiency but also improves processing speed, delivering lower latency for AI computations.
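
A conceptual model of that idea is sketched below. It is purely illustrative (real PiM hardware computes inside the memory array itself, and the tile size here is an assumption), but it shows the key property: weights are written into the tiles once and never move again, so only the small input vector and the accumulated outputs cross the memory boundary.

```python
# Conceptual sketch of processing-in-memory matrix multiplication.
# Illustrative only; tile size and layout are assumptions.
import numpy as np

class PiMTile:
    """A memory tile that stores a slice of the weight matrix and computes
    its partial results locally, next to where the weights are stored."""
    def __init__(self, weight_slice: np.ndarray):
        self.weights = weight_slice              # weights never leave the tile

    def compute(self, x: np.ndarray) -> np.ndarray:
        return self.weights @ x                  # multiply-accumulate in place

class PiMArray:
    """Weights are written into the tiles once; afterwards only the input
    vector and the outputs move across the memory boundary."""
    def __init__(self, W: np.ndarray, rows_per_tile: int = 32):
        self.tiles = [PiMTile(W[i:i + rows_per_tile])
                      for i in range(0, W.shape[0], rows_per_tile)]

    def matvec(self, x: np.ndarray) -> np.ndarray:
        # Tiles work in parallel in real hardware; emulated sequentially here.
        return np.concatenate([tile.compute(x) for tile in self.tiles])

W = np.random.randn(128, 256).astype(np.float32)   # model weights
x = np.random.randn(256).astype(np.float32)        # incoming activations
pim = PiMArray(W)                                  # load weights once
assert np.allclose(pim.matvec(x), W @ x, rtol=1e-4)
```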

To fully leverage the benefits of PiM, a carefully designed neural network processor is crucial. This processor must be optimized to seamlessly interface with PiM memory, unlocking its full performance potential and maximizing the advantages of this innovative technology.

Design case study

The theoretical advantages of PiM are well established for TinyML systems at the network edge. Take the case of Listen VL130, a voice-activated wake-word inference chip, which is also PIMIC's first product. Fabricated on TSMC's standard 22-nm CMOS process, the chip's always-on voice-detection circuitry consumes 20 µA.

This circuit triggers a PiM-based wake-word inference engine that consumes only 30 µA when active. In operation, that comes out to a 17-times reduction in power compared to an equivalent DSP implementation. And the chip is tiny, easily fitting inside a microphone package.

Figure 2 Listen VL130, shown connected to an external MCU in the diagram, is an ultra-low-power keyword-spotting AI chip designed for edge devices.

PIMIC's second chip, Clarity NC100, takes on a more ambitious TinyML model: single-microphone ENC. Consuming less than 200 µA, which is up to 30 times more efficient than a DSP approach, it's also small enough for in-microphone mounting. It is scheduled for engineering samples in January 2025.

Both chips depend for their efficiency upon a TinyML model fitting entirely within an SRAM-based PiM array. But this is not the only way to exploit PiM architectures for AI, nor is it anywhere near the limits of the technology.

LLMs at the far edge?

One of today’s undeclared grand challenges is to bring generative AI—small language models (SLMs) and even some LLMs—to edge computing. And that’s not just to a powerful PC with AI extensions, but to actual edge devices. The benefit to applications would be substantial: generative AI apps would have greater mobility while being impervious to loss of connectivity. They could have lower, more predictable latency; and they would have complete privacy. But compared to TinyML, this is a different order of challenge.

To produce meaningful intelligence, LLMs require billions of trained parameters. At the same time, the demand for AI inference compute is set to surge, driven by the substantial computational needs of agentic AI and advanced text-to-video generation models like Sora and Veo 2. So, achieving significant advancements in performance, power efficiency, and silicon area (PPA) will necessitate breakthroughs in overcoming the memory wall—the primary obstacle to delivering low-latency, high-throughput solutions.
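
The memory wall can be made concrete with a back-of-the-envelope calculation; the model size, weight precision, and token rate below are assumptions for illustration only. Generating each token requires streaming essentially all of the model's weights past the compute units, so sustained weight bandwidth, not arithmetic throughput, quickly becomes the binding constraint.

```python
# Back-of-the-envelope view of the memory wall for edge LLM inference.
# All numbers are illustrative assumptions, not measurements.
params = 7e9            # assumed 7-billion-parameter model
bytes_per_param = 1     # int8 weights
tokens_per_second = 20  # assumed interactive generation rate

weight_bytes = params * bytes_per_param
bandwidth_gb_per_s = weight_bytes * tokens_per_second / 1e9

print(f"Weight footprint: {weight_bytes / 1e9:.0f} GB")
print(f"Weight bandwidth: {bandwidth_gb_per_s:.0f} GB/s sustained")
# Roughly 7 GB of weights and 140 GB/s of sustained weight traffic, before
# counting activations and KV-cache reads: far more than small, low-power
# edge devices typically provide.
```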

Figure 3 A view of the layout of the Listen VL130 chip, which can process 32 wake words and keywords while operating in the tens of microwatts, delivering energy efficiency without compromising performance.

At this technology crossroads, PiM technology is still important, but to a lesser degree. With these vastly larger matrices, the PiM array acts more like a cache, accelerating matrix multiplication piecewise. But much of the heavy lifting is done outside the PiM array, in a massively parallel dataflow architecture. And there is a further issue that must be resolved.
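
That piecewise use of the array can be sketched as a tiled matrix-vector product, as below; the tile capacity is an assumed figure, and the point is simply that a weight matrix far larger than the PiM array is streamed through it block by block while partial sums are accumulated outside.

```python
# Illustrative tiled matrix-vector product: a large weight matrix is streamed
# through a fixed-size PiM array block by block. Tile size is an assumption.
import numpy as np

TILE = 256  # assumed PiM array capacity: TILE x TILE weights at a time

def tiled_matvec(W: np.ndarray, x: np.ndarray) -> np.ndarray:
    out = np.zeros(W.shape[0], dtype=np.float32)
    for r in range(0, W.shape[0], TILE):
        for c in range(0, W.shape[1], TILE):
            tile = W[r:r + TILE, c:c + TILE]         # load one tile into the array
            out[r:r + TILE] += tile @ x[c:c + TILE]  # accumulate partial sums outside
    return out

W = np.random.randn(1024, 2048).astype(np.float32)  # much larger than one tile
x = np.random.randn(2048).astype(np.float32)
assert np.allclose(tiled_matvec(W, x), W @ x, rtol=1e-3)
```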

At the edge, in addition to facilitating model execution, it is of primary importance to resolve the bandwidth and energy issues that come with scaling to massive memory sizes. Meeting all these challenges can improve an edge chip's power-performance-area efficiency by more than 15 times.

PIMIC’s studies indicate that models with hundreds of millions to tens of billions of parameters can in fact be executed on edge devices. It will require 5-nm or 3-nm process technology, PiM structures, and most of all a deep understanding of how data moves in generative-AI models and how it interacts with memory.

PiM is indeed a silver bullet for TinyML at the extreme edge. But it’s just one tool, along with dataflow expertise and deep understanding of model dynamics, in reaching the point where we can in fact execute SLMs and some LLMs effectively at the far edge.

Subi Krishnamurthy is the founder and CEO of PIMIC, an AI semiconductor company developing processing-in-memory (PiM) technology for ultra-low-power AI solutions.
