Led by industry icon Jim Keller, the startup Tenstorrent has assembled a top-notch team of AI and CPU engineers and set ambitious plans involving general-purpose processors and AI accelerators.
Currently, the company is developing the industry's first 8-wide decode RISC-V core capable of handling both client and HPC workloads, which will initially be used in a 128-core high-performance CPU for data centers. The company also has a roadmap for multiple generations of processors, which we will introduce below.
Why RISC-V?
We recently spoke with Tenstorrent's Chief CPU Architect Wei-Han Lien about the company's vision and roadmap. Lien has an impressive background, having worked at NexGen, AMD, PA-Semi, and Apple, most notably for his work on the A6, A7 (the world's first 64-bit Arm SoC), and the M1 CPU microarchitecture and implementation at Apple.
The company has many world-class engineers with extensive experience in x86 and Arm design, and one might ask why Tenstorrent decided to develop a RISC-V CPU, as this instruction set architecture (ISA) does not have as comprehensive a data center software stack as x86 and Arm. Tenstorrent's answer is simple: x86 is controlled by AMD and Intel, and Arm is controlled by Arm Holdings, which limits the pace of innovation.
"There are really only two companies in the world that can produce x86 CPUs," said Wei-Han Lien. "Due to x86 licensing restrictions, innovation is essentially controlled by one or two companies. When companies become very large, they become bureaucratic, and the pace of innovation [slows down]. [...] Arm is a bit similar. They claim they are like a RISC-V company, but if you look at their specifications, [it] becomes so complex. It is actually a bit architect-led. [...] Arm somewhat dictates all possible scenarios, even the architectural [licensing] partners."
In contrast, RISC-V is developing rapidly. According to Tenstorrent, because it is an open-source ISA, it is easier and faster to innovate with it, especially when it comes to emerging and rapidly developing AI solutions.
"I have been looking for a processor solution to match [Tenstorrent's] AI solutions, and then we wanted the BF16 data type, and then we went to Arm and said, 'Hey, can you support us?' They said 'no,' which might take two years of internal discussions and discussions with partners, etc.," Lien explained. "But we talked to SiFive; they just put it out there. So, there are no restrictions, they built it for us, and it is free."
On one hand, Arm Holdings' approach ensures high-quality standards and a comprehensive software stack, but this also means that the pace of ISA innovation slows down, which could be a problem for emerging applications such as AI processors, which are designed for rapid development.
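To make the BF16 mention concrete: bfloat16 keeps FP32's 8-bit exponent but truncates the mantissa to 7 bits, so converting from FP32 essentially amounts to rounding away the lower 16 bits of the bit pattern. The sketch below illustrates the format itself in plain Python; it is generic and says nothing about how SiFive or Tenstorrent implement BF16 in hardware.

```python
# A minimal illustration of the BF16 (bfloat16) data format: same 8-bit exponent
# as FP32, mantissa truncated to 7 bits. Generic format math only, not a
# description of any vendor's hardware.
import struct

def fp32_to_bf16_bits(x: float) -> int:
    """Return the 16-bit bfloat16 pattern for a float (round-to-nearest-even)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    # Rounding bias: add 0x7FFF plus the LSB of the would-be result for ties-to-even.
    bits += 0x7FFF + ((bits >> 16) & 1)
    return (bits >> 16) & 0xFFFF

def bf16_bits_to_fp32(b: int) -> float:
    """Expand a bfloat16 bit pattern back to a float by zero-filling the mantissa."""
    (x,) = struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))
    return x

if __name__ == "__main__":
    for v in (1.0, 3.14159, 1e-3):
        b = fp32_to_bf16_bits(v)
        print(f"{v:>10} -> 0x{b:04X} -> {bf16_bits_to_fp32(b)}")
```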
One Microarchitecture, Five CPU IPs in One Year

Because Tenstorrent aims to address the entire spectrum of AI applications, it needs not only different system-on-chip (SoC) and system-in-package (SiP) implementations, but also various CPU microarchitecture implementations and system-level architectures to hit different power and performance targets. This is precisely the problem that Wei-Han Lien's team is tasked with solving.
Modest consumer-electronics SoCs and powerful server processors have little in common, but they can share the same ISA and microarchitecture (with different physical implementations). This is where Lien's team comes in. Tenstorrent says its CPU team has developed one out-of-order RISC-V microarchitecture and implemented it in five different ways to address different applications.
Tenstorrent now has five different RISC-V CPU core IPs—featuring two-wide, three-wide, four-wide, six-wide, and eight-wide decoding—for its own processors or for licensing to interested parties. For potential customers who need a very basic CPU, the company can provide a small two-wide core, but for those who require higher performance for edge, client PC, and high-performance computing applications, it has the six-wide Alastor and eight-wide Ascalon cores.
Each out-of-order Ascalon (RV64ACDHFMV) core has an eight-wide decoder, six ALUs, two FPUs, and two 256-bit vector units. Considering that modern x86 designs use four-wide (Zen 4) or six-wide (Golden Cove) decoders, we are looking at a very capable core.
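For a rough sense of what those vector resources could mean, here is a back-of-the-envelope peak-throughput calculation based only on the figures above (two 256-bit vector units per core). Whether the units sustain fused multiply-add every cycle, the clock speed, and the 128-core configuration used for scaling are illustrative assumptions, not disclosed Tenstorrent specifications.

```python
# Back-of-the-envelope peak vector throughput for an Ascalon-class core, using only
# the figures mentioned above (two 256-bit vector units per core). The FMA
# assumption, the 3.0 GHz clock, and the 128-core scaling are assumptions made
# for illustration, not disclosed specifications.
VECTOR_UNITS_PER_CORE = 2
VECTOR_WIDTH_BITS = 256

def peak_gflops(element_bits: int, cores: int, ghz: float, fma: bool = True) -> float:
    """Peak GFLOPS assuming every vector unit retires one vector op per cycle."""
    lanes = VECTOR_WIDTH_BITS // element_bits
    flops_per_cycle = VECTOR_UNITS_PER_CORE * lanes * (2 if fma else 1)
    return flops_per_cycle * cores * ghz  # GHz * flops/cycle -> GFLOPS

if __name__ == "__main__":
    # Hypothetical 128-core CPU at an assumed 3.0 GHz clock:
    print(f"FP32: {peak_gflops(32, 128, 3.0) / 1000:.1f} TFLOPS")  # ~12.3 TFLOPS
    print(f"BF16: {peak_gflops(16, 128, 3.0) / 1000:.1f} TFLOPS")  # ~24.6 TFLOPS
```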
Wei-Han Lien is one of the designers behind Apple's "wide" CPU microarchitectures, which can execute up to eight instructions per clock cycle. For example, Apple's A14 and M1 SoCs feature eight-wide high-performance Firestorm CPU cores, which remain among the most energy-efficient designs in the industry even two years after their launch. Lien may be one of the industry's foremost experts on "wide" CPU microarchitectures, and as far as we know, he is the only processor designer to have led a team of engineers in developing an eight-wide high-performance RISC-V CPU core.
In addition to various RISC-V general-purpose cores, Tenstorrent also has a proprietary Tensix core tailored for neural network inference and training. Each Tensix core includes five RISC cores, an array math unit for tensor operations, a SIMD unit for vector operations, 1MB or 2MB of SRAM, and fixed-function hardware for accelerating network packet operations and compression/decompression. The Tensix core supports multiple data formats, including BF4, BF8, INT8, FP16, BF16, and even FP64.
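To put those data formats in perspective, the small sketch below shows how much SRAM a hypothetical one-million-element tensor would occupy in each of them. The bit widths for BF4 and BF8 are inferred from their names (4 and 8 bits), and the 1MB figure is treated as 1,024 KiB for simplicity; none of this describes Tensix's internal data packing.

```python
# Rough storage cost of a hypothetical 1M-element tensor in the data formats
# listed above. BF4/BF8 widths are inferred from the names; this is generic
# arithmetic, not a description of Tensix's internal layout.
FORMAT_BITS = {"BF4": 4, "BF8": 8, "INT8": 8, "FP16": 16, "BF16": 16, "FP64": 64}

def tensor_bytes(elements: int, fmt: str) -> int:
    return elements * FORMAT_BITS[fmt] // 8

if __name__ == "__main__":
    n = 1_000_000  # hypothetical 1M-element activation tensor
    for fmt in FORMAT_BITS:
        kib = tensor_bytes(n, fmt) / 1024
        verdict = "fits in" if kib <= 1024 else "exceeds"
        print(f"{fmt:5s}: {kib:8.0f} KiB ({verdict} a 1 MB Tensix SRAM)")
```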
An Impressive Roadmap
Currently, Tenstorrent has two products: Grayskull, a machine-learning processor that delivers about 315 INT8 TOPS and plugs into a PCIe Gen4 slot, and Wormhole, a network-capable ML processor with about 350 INT8 TOPS of performance that uses a GDDR6 memory subsystem, a PCIe Gen4 x16 interface, and 400GbE connectivity to other machines.
Both devices require a host CPU and can be used either as add-in boards or inside a pre-built Tenstorrent server. A 4U Nebula server packing 32 Wormhole ML cards delivers about 12 INT8 POPS of performance at 6kW.
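The Nebula figure is easy to sanity-check: 32 Wormhole cards at roughly 350 INT8 TOPS each aggregate to around 11,000 TOPS, which is why the server's throughput is quoted in POPS (peta-operations per second) rather than TOPS.

```python
# Sanity check on the Nebula server figure quoted above.
cards = 32
tops_per_card = 350                      # approximate INT8 TOPS per Wormhole card
total_tops = cards * tops_per_card       # 11,200 TOPS
print(f"{total_tops} TOPS ~= {total_tops / 1000:.1f} POPS")  # ~11.2 POPS, i.e. "about 12"
```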
Later this year, the company plans to launch its first standalone CPU+ML solution, Black Hole, which combines 24 SiFive X280 RISC-V cores and multiple third-generation Tensix cores, interconnected by a bi-directional 2D torus network, for machine-learning workloads. The device will provide about 1 INT8 POPS of compute throughput (roughly three times the performance of its predecessor), eight GDDR6 memory channels, 1200 Gb/s Ethernet connectivity, and PCIe Gen5 lanes. In addition, the company plans to add a 2TB/s die-to-die interface for dual-chip solutions and future use. The chip will be made on a 6nm-class process technology (we expect TSMC N6, though Tenstorrent has not confirmed this), yet at 600mm² it will be smaller than its predecessor, which was produced on TSMC's 12nm node. Keep in mind that Tenstorrent has not yet finalized Black Hole, so its final feature set may differ from what the company has disclosed so far.
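As a rough illustration of what eight GDDR6 channels could deliver, the sketch below computes aggregate memory bandwidth under assumed parameters (32-bit channels at 16 Gbps per pin); Tenstorrent has not disclosed these figures, so the result is purely illustrative and is shown next to the stated 2TB/s die-to-die link for comparison.

```python
# Illustrative memory-bandwidth math for Black Hole's eight GDDR6 channels.
# Channel width and per-pin data rate are assumptions, not disclosed specs.
channels = 8
bits_per_channel = 32        # typical GDDR6 channel width (assumption)
gbps_per_pin = 16            # assumed GDDR6 data rate per pin
bandwidth_gbs = channels * bits_per_channel * gbps_per_pin / 8   # bits -> bytes
print(f"Assumed GDDR6 bandwidth: {bandwidth_gbs:.0f} GB/s")      # 512 GB/s
print("Stated die-to-die link:  2 TB/s (2,000 GB/s) for comparison")
```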
Next year, the company plans to release its ultimate product: a multi-chiplet solution called Grendel. It combines an Aegis chiplet containing high-performance, eight-wide Ascalon general-purpose cores based on Tenstorrent's own RISC-V microarchitecture with one or more chiplets packing Tensix cores for ML workloads.
Depending on business needs (and the company's finances), Tenstorrent could implement the AI chiplet on a 3nm-class process technology, taking advantage of higher transistor density to increase the Tensix core count, or it could keep using the Black Hole chiplet for AI workloads (even offloading some work to its 24 SiFive X280 cores, as the company has indicated it could). The chiplets will communicate with each other over the aforementioned 2TB/s interconnect.
The Aegis chiplet features 128 general-purpose eight-wide Ascalon RISC-V cores organized into four 32-core clusters with inter-cluster coherence, and will be made on a 3nm-class process technology. In fact, the Aegis CPU chiplet could be among the first high-performance CPU designs to use a 3nm-class node, potentially putting the company at the forefront of high-performance CPU design.
Meanwhile, Grendel will utilize an LPDDR5 memory subsystem, PCIe, and Ethernet connections, thus providing significantly higher inference and training performance than the company's existing solutions. Speaking of Tensix cores, it should be noted that although all of Tenstorrent's AI cores are called Tensix, these cores are actually evolving.
"[Tensix] changes are gradual, but they do exist," explained the company's founder, Ljubisa Bajic. "[They have added] new data formats, changes in the ratio of FLOPS/SRAM capacity, SRAM bandwidth, on-chip network bandwidth, new sparsity features, and general features."
Interestingly, different Tenstorrent slides mention different memory subsystems for the Black Hole and Grendel products. This is because the company has been looking for the most efficient memory technology and has licensed DRAM controllers and physical interfaces (PHYs), which gives it some flexibility in choosing the exact memory type. In fact, Lien said that Tenstorrent is also developing its own memory controller for future products, but for its 2023-2024 solutions it intends to use third-party memory controllers and PHYs. For now, Tenstorrent does not plan to use more exotic memory, such as HBM, for these products.
Business Model: Selling Solutions and Licensing IP
Although Tenstorrent has five different CPU IPs (albeit based on the same microarchitecture), the only products in its pipeline (not counting fully configured servers) are AI/ML devices using either SiFive's X280 or Tenstorrent's eight-wide Ascalon CPU cores. It is therefore reasonable to ask why it needs so many CPU core implementations.
The short answer is that Tenstorrent has a unique business model that includes licensing IP (as RTL, hard macros, and even GDS), selling chiplets, selling ML accelerator add-in cards and ML solutions combining CPU and ML chiplets, and selling fully configured servers built around those cards. Companies that build their own SoCs can license the RISC-V cores Tenstorrent has developed; such a broad CPU IP portfolio lets them compete in applications requiring different levels of performance and power.
Server vendors can use Tenstorrent's Grayskull and Wormhole accelerator cards or its Black Hole and Grendel ML processors to build their machines. Meanwhile, entities that do not want to build hardware at all can buy pre-built Tenstorrent servers and simply deploy them.
This business model may seem somewhat controversial because, in many cases, Tenstorrent will be competing with its own customers. Then again, Nvidia supplies both add-in cards and pre-built servers based on those cards, and companies like Dell or HPE do not seem too concerned, since they provide solutions for specific customers rather than just building blocks.
Summary
About two years ago, Tenstorrent came into the spotlight with the hiring of Jim Keller. In those two years, the company has recruited a group of top engineers who are developing high-performance RISC-V cores for data-center-grade AI/ML solutions and systems. The development team's achievements include the world's first eight-wide decode RISC-V general-purpose CPU core and appropriate system-level hardware architectures for AI and HPC applications.
The company has a comprehensive roadmap that includes high-performance RISC-V-based CPU chiplets and advanced AI accelerator chiplets, which together are expected to deliver formidable machine-learning solutions. Keep in mind that AI and HPC are among the megatrends expected to see explosive growth, and offering both AI accelerators and high-performance CPU cores looks like a very flexible business model.
The AI and HPC market is highly competitive, so when you want to compete with established competitors (AMD, Intel, Nvidia) and emerging players (Cerebras, Graphcore), you must hire some of the best engineers in the world. Like large chip developers, Tenstorrent has its own general-purpose CPU and AI/ML accelerator hardware, which is a great advantage. At the same time, because the company uses the RISC-V ISA, it currently cannot address some markets and workloads, at least in terms of CPUs.