Source: Compiled by Semiconductor Industry Observation (ID: icbank).
Recently, media reports have indicated that Tesla has established a joint venture with Swiss automotive semiconductor company Annex in Jinan, named Annasis Semiconductor.
The report points out that the registered capital of the joint venture is $150 million, with Annex holding 55% of the shares, Jinan Zurich Annasis Equity Investment Fund Partnership holding 40%, and Tesla holding a 5% stake.
However, The Wall Street Journal quoted a spokesperson for Tesla China as saying that the electric vehicle manufacturer, headquartered in Texas, has "no relationship" with the Annasis Semiconductor joint venture.
Earlier, foreign media outlet Electrek wrote an article questioning the companies involved.
The author of the Electrek piece said that after asking around among sources in the semiconductor industry, no one had heard of Annex Semiconductor. He further noted that, after researching Annex, he could find no information about the company beyond its own website and the recent news reports about the joint venture.
Inspiration from Tesla's Self-developed Chip Architecture
To say that Tesla is merely interested in machine learning would be an understatement. In fact, the electric vehicle maker has built an in-house supercomputer named Dojo, optimized for training its machine learning models.
Unlike many other supercomputers, Dojo does not use off-the-shelf CPUs and GPUs, such as those from AMD, Intel, or Nvidia. Tesla designed its own microarchitecture around its needs, which lets it make trade-offs that a more general architecture cannot.
In this article, we will look at this architecture, based on Tesla's presentation at Hot Chips. The architecture does not have a separate name, so for simplicity, whenever we refer to Dojo below, we are talking about the architecture.

At a high level, Dojo is an 8-wide core with four-way SMT, running at a conservative 2 GHz. Its CPU-style pipeline makes it more tolerant of diverse algorithms and branch-heavy code than a GPU. On the scalar side, Dojo's instruction set resembles RISC-V, but Tesla's engineers have added a set of custom vector instructions focused on accelerating machine learning.
Tesla describes Dojo as a "high-throughput, general-purpose CPU." From a performance standpoint, that certainly makes sense. But to increase compute density, Tesla has made sacrifices, and compared with the CPUs we are familiar with in desktops, laptops, and smartphones, Dojo cores are much harder to use. In some ways, a Dojo core handles work more like the SPEs in IBM's Cell than like a traditional general-purpose CPU core.
Like Cell's SPE?
The IBM Cell processor, introduced in the mid-2000s, featured eight "Synergistic Processing Elements" (SPEs) controlled by a fully featured CPU core (the "Power Processing Element", or PPE). At first glance, Dojo has a lot in common with the SPEs.
Both Dojo and the SPEs are optimized for vector processing and rely on a separate host processor for work distribution. Code running on Dojo or an SPE cannot directly access system memory. Instead, applications are expected to work primarily out of a small local SRAM. This local SRAM is managed by software and cannot be used as a cache. If data from main memory is needed, it has to be brought in with DMA operations.
Finally, both Dojo and Cell's SPEs lack support for virtual memory. We will delve into what this means later, but in short, it makes multitasking very difficult.
Dojo differs in several important ways. Since Dojo was not designed for small-scale deployment, the host processor lives in a separate host system. These host systems carry PCIe cards with interface processors, which in turn connect to Dojo chips over high-speed network links. The Cell's main processor, by contrast, sits on the same chip, which makes it possible to deploy a single Cell chip on its own, something that is not possible with Dojo. Dojo's 1.25 MB local block of SRAM is also much larger and has higher bandwidth than the Cell SPE's 256 KB. The Cell's 256 KB SRAM has a single port capable of delivering 128B per cycle; Dojo's SRAM has five 64B ports. And of course the architectural goals are very different: Dojo is wide and slow, while the Cell SPEs have narrow, deep pipelines designed for high clock speeds.

Front-end: CPU Comforts, etc.
Let's start with a quick walk through Dojo's pipeline from the front-end. There is some kind of branch predictor, since Tesla's diagrams show a BTB (branch target buffer). Its prediction accuracy probably does not reach the level of the high-performance cores from AMD, ARM, and Intel, because Dojo has to prioritize spending die area on vector execution. But even a basic branch predictor is a big improvement over no predictor at all, and Dojo's branch prediction should give it better performance than GPUs on branchy code or larger instruction footprints.
Once the branch predictor generates the next instruction fetch pointer, Dojo can fetch 32 bytes from the "small" instruction cache into each thread's fetch buffer every cycle. This instruction cache may help reduce the instruction bandwidth pressure on local SRAM, ensuring that the data side can access SRAM with as little contention as possible. Additionally, the instruction cache is incoherent. If new code is loaded into local SRAM, the instruction cache must be flushed before branching to this new code.
From the fetch buffers, Dojo's decoder can handle eight instructions from two threads per cycle. I'm a bit confused by what "two threads per cycle" means, since CPUs with SMT usually handle one thread per cycle and switch threads at cycle boundaries. Perhaps Dojo splits the decoder into two clusters and selects two threads to feed them each cycle. That could reduce the decode throughput lost to taken branches.
During decode, some instructions (such as branches, predicated operations, and immediate loads ("list parsing")) can be executed in the front-end and dropped from the pipeline. This is a bit like newer x86 CPUs eliminating register-to-register copies in the renamer. But you read that right: Dojo does not track the "eliminated" instructions through the pipeline to maintain in-order retirement. Other processors track everything to retirement so that they can stop at any instruction boundary and keep all the state needed to resume execution. This capability is called "precise exceptions", and modern operating systems use it to provide all sorts of goodies, such as paging to disk, or telling you exactly where your code went wrong.
Tesla does not care about precise exceptions. Dojo does have a debug mode in which more instructions pass through the pipeline to provide "more precise" exceptions, but there is no in-order retirement logic like a regular out-of-order CPU has.
Dojo's Execution Engine
It might seem a bit strange to see a 4-wide integer execution engine with only two ALUs and two AGUs after seeing a wide front-end. But this funnel-shaped pipeline makes sense because some instructions are executed and discarded in the front-end.
Dojo is also not going into client systems where scalar integer performance matters. The integer side therefore only needs enough throughput to handle control flow and address generation and keep the vector and matrix units fed.

Dojo's vector and matrix execution units sit after the scalar execution engine in the core pipeline, and there are two execution pipelines. Two pipelines may not sound like much, but very wide execution units sit behind them. One pipeline can perform 512-bit vector execution, while the other performs 8x8x4 matrix multiplication. So as long as the instructions expose enough explicit parallelism, Dojo can reach very high throughput, especially when the matrix units are in use. Tesla claims that a chip with 354 Dojo cores reaches 362 BF16 TFLOPS at 2 GHz, which works out to 512 BF16 FLOPS per core per cycle.
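That per-core figure follows directly from the quoted chip-level numbers; here is the arithmetic as a quick sanity check (mine, using only the figures above):

```python
# Back-of-the-envelope check of the per-core figure from the quoted numbers.
chip_flops = 362e12   # BF16 FLOPS per second for one chip (Tesla's figure)
cores = 354           # Dojo cores per chip
clock_hz = 2e9        # 2 GHz

print(chip_flops / (cores * clock_hz))   # ~511.3, i.e. effectively 512 per core per cycle
```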
We are not sure whether Dojo can execute fully out of order. However, Tesla did say that the integer side can run far ahead of the vector side, which suggests it can keep executing past stalled instructions until one of the schedulers fills up. The lack of in-order retirement also points to out-of-order execution capability.
Normally, implementing out-of-order execution brings a lot of complexity, because the CPU still has to appear to execute instructions in order. High-performance CPUs from AMD, ARM, and Intel use large reorder buffers (and other structures) to track instructions so that their results can be committed in program order. That means if a program does something foolish, such as dividing by zero, these cores can show exactly which instruction went wrong, and can present CPU state that reflects the effects of every instruction before the fault and none after it. You can then fix whatever caused the fault and resume execution. Dojo gives up this capability. In exchange, it avoids the power and area overhead of tracking each instruction through the pipeline to ensure results are committed in program order.
SRAM Access
Normally this is where we would talk about caches. But Dojo cannot directly access system memory, so instead we will talk about the 1.25 MB SRAM block. It can handle two 512-bit loads per cycle, matching the per-cycle load bandwidth of Intel CPUs that support AVX-512. Tesla says the SRAM has five 512-bit ports (two load ports, one store port, and two ports facing the mesh stop). However, the scalar side only has two AGUs, which probably means the core cannot sustain two 512-bit loads and one 512-bit store every cycle.
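Taking those port widths at face value, the per-core SRAM bandwidth at 2 GHz works out as follows (my arithmetic, not a figure Tesla quoted):

```python
# Per-core SRAM bandwidth implied by the stated ports at 2 GHz.
clock_hz = 2e9
port_bytes = 512 // 8                  # each port is 512 bits = 64 bytes wide

load_bw = 2 * port_bytes * clock_hz    # two load ports
store_bw = 1 * port_bytes * clock_hz   # one store port
print(load_bw / 1e9, store_bw / 1e9)   # 256.0 GB/s of loads, 128.0 GB/s of stores
```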
Since Dojo's local SRAM block is not a cache, it does not need tags or state bits stored alongside the data. There is also no L1D cache in front of the SRAM, so the SRAM has to be fast enough to handle all load and store instructions without becoming a bottleneck, even though its capacity is closer to that of an L2 cache. Not implementing the SRAM as a cache is probably how Tesla keeps latency low. If we go back to AMD's Hammer architecture, accessing 1 MB of L2 after an L1D miss is detected takes 8 cycles (12 cycles of total latency). Directly addressing 1 MB of SRAM, rather than using it as a cache, could skip three of those stages and bring the latency down to 5 cycles:
A slide from an earlier Hot Chips presentation, showing the pipeline stages involved in L2 cache access. The stages that can be skipped if L2 is not a cache are marked in red.
Given decades of process node improvements since then, and a clock speed target even lower than the Athlon's, it is easy to see how Tesla can access an L2-sized SRAM block with L1-like latency. Skipping a level of cache certainly saves area and power too.
To cut latency, area, and core complexity even further, Dojo has no virtual memory support, so there are no TLBs or page-walk mechanisms. Modern operating systems use virtual memory to give each process its own view of memory. The memory addresses a program uses do not access physical memory directly; the CPU translates them into physical addresses using paging structures set up by the operating system. That is how modern operating systems isolate programs from one another and keep one misbehaving application from taking down the whole system.
Virtual memory is also how you can run more programs than physical memory can hold. When real memory runs out, the operating system unmaps pages, writes them to disk, and frees up the memory your programs need. When a program later touches that memory again, the CPU tries to translate the virtual address to a physical address and finds that the translation does not exist. The CPU throws a page fault exception, and the operating system handles it by reading the evicted page back into physical memory and filling in the page table entry.

None of that is possible on Dojo. The core's 4-way SMT is there to let a single application explicitly expose parallelism rather than to improve multitasking. For example, one thread can do vector computation while another thread asynchronously loads data from system memory into SRAM (via DMA), as in the sketch below.
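To make that pattern concrete, here is a minimal double-buffering sketch. It is a conceptual model only: the dma_load and compute functions are stand-ins I made up, not a real Dojo API, and an ordinary Python worker thread stands in for the second SMT thread.

```python
# Conceptual model of the overlap described above: one "thread" streams the
# next tile of data from (simulated) system memory into local SRAM while
# another computes on the tile loaded previously. Not a real Dojo API.
from concurrent.futures import ThreadPoolExecutor

TILE = 1024                                    # elements per tile held in "local SRAM"
NUM_TILES = 8
system_memory = list(range(NUM_TILES * TILE))  # stand-in for DRAM/HBM

def dma_load(tile_index):
    """Simulate a DMA transfer of one tile from system memory into SRAM."""
    start = tile_index * TILE
    return system_memory[start:start + TILE]

def compute(tile):
    """Simulate the vector/matrix work done on a tile resident in SRAM."""
    return sum(x * x for x in tile)

results = []
with ThreadPoolExecutor(max_workers=1) as dma_thread:
    pending = dma_thread.submit(dma_load, 0)          # prefetch the first tile
    for i in range(1, NUM_TILES + 1):
        tile = pending.result()                       # wait for the outstanding DMA
        if i < NUM_TILES:
            pending = dma_thread.submit(dma_load, i)  # overlap the next load...
        results.append(compute(tile))                 # ...with compute on this tile
print(sum(results))
```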
To simplify the design further, Dojo addresses its SRAM with only 21 address bits, which simplifies Dojo's AGUs and address buses. These trade-offs may be what lets Tesla access this SRAM with low enough latency to avoid putting a separate L1 data cache in front of it.
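As a point of reference (my arithmetic), 21 bits of address space is just enough headroom for the 1.25 MB block:

```python
# 21 address bits cover 2**21 bytes = 2 MiB, comfortably holding the 1.25 MB
# local SRAM without the wider adders a full 64-bit address path would need.
print(2**21, 2**21 / 2**20)   # 2097152 bytes, 2.0 MiB
```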
Memory Access
Speaking of system memory: Dojo chips are not connected directly to memory. Instead, they attach to interface processors equipped with HBM, and those interface processors also handle communication with the host systems.
A Dojo tile with 25 independent chips can access 160 GB of HBM memory.
Tesla says it can move 900 GB/s across each die edge at the tile boundary, which means the interface processors and their HBM can be reached with 4.5 TB/s of link bandwidth. Since getting to HBM means going through a separate chip, access latency is likely to be quite high.
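Those two numbers are consistent if, as I assume, the five dies along one edge of a 5x5 tile each contribute their 900 GB/s toward the interface processors:

```python
# My inference from the quoted figures: five dies along one tile edge,
# each moving 900 GB/s across the boundary, add up to the 4.5 TB/s link.
per_die_edge_bw = 900e9
dies_per_tile_edge = 5        # a 25-die tile is a 5x5 grid (assumption)
print(per_die_edge_bw * dies_per_tile_edge / 1e12)   # 4.5 (TB/s)
```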
How Did It Get So Small?
Dojo is an 8-wide core with some degree of out-of-order execution, decent vector throughput, and a matrix multiplication unit. But even with 1.25 MB of local SRAM, it ends up being a very small core. By comparison, a Fujitsu A64FX core occupies more than twice the area on the same process node.
Processor design is about making the right trade-offs. Tesla wants to maximize machine learning throughput by packing a huge number of cores onto each chip, so the individual core has to be small. Dojo gets its area efficiency with some familiar techniques. It runs at a conservative 2 GHz; circuits designed for lower clocks tend to take less area. It probably has only a basic branch predictor and a small instruction cache, which gives up some performance on programs with large code footprints or lots of branches.
But Tesla goes further, cutting features that are not needed for its internal workloads to shave off more power and area: there is no data-side caching, no virtual memory support, and no precise exceptions.

The result is a processor core that offers the performance flexibility of a modern CPU while being, in many respects, less user- and programmer-friendly than an Intel 8086. Beyond the core, Tesla also saves die area at the chip level by designing the Dojo die specifically for large-scale deployment.
Physical Implementation
Dojo cores are implemented on a very large 645 mm² die known as D1. Unlike the other chips we are used to, a single Dojo die is not self-sufficient: it has no DDR or PCIe controllers. Instead, IO interfaces around the edges of the die let it talk to adjacent dies with a latency of about 100 ns.
To reach system memory, a Dojo D1 die has to communicate with an interface processor that carries on-board HBM; that interface processor in turn connects to the host system over PCIe (the interface processors are mounted on PCIe cards). In theory, the smallest functional Dojo deployment would be one Dojo die, one interface processor card, and a host system. In practice, though, Tesla deploys Dojo dies in tiles of 25, which gives a sense of the intended scale. The Dojo D1 die is designed to be a building block for supercomputers, and nothing more.
That specialization saves even more die area. The Dojo D1 spends no space on DDR or PCIe controllers; aside from the custom IO interfaces along the edges used to connect to adjacent dies, most of the die is filled with a large array of Dojo cores.
By contrast, chips designed for more deployment flexibility spend a lot of space on IO. AMD's Zen 1 "Zeppelin" die is a good example. Zeppelin can connect directly to DDR4 memory, PCIe devices, SATA drives, and USB devices, which suits client requirements well. In servers, IFOP interfaces let it talk to adjacent dies. A large block of SRAM near the IFOPs is probably a snoop filter, which helps maintain cache coherence efficiently in setups with many cores. Dojo does not try to keep caches coherent across cores and spends no SRAM on snoop filtering.

AMD pays for this flexibility by devoting roughly 44% of the "Zeppelin" die area to logic other than cores and cache. Dojo, by comparison, spends only 28.9% of its die area on things other than the cores and their SRAM.
The microarchitecture behind Tesla's Dojo supercomputer shows how to reach very high compute density while still keeping a CPU's ability to handle branchy code. Getting there means giving up most of the comforts that define our modern computing experience. If you could somehow build a desktop around a Dojo core, the experience would probably feel familiar to anyone who has used MS-DOS. You can't run multiple applications at once. A single misbehaving application can force you to reboot the system. And if you don't have enough RAM to run a program, you can forget about running it at all (no paging to disk).
But these trade-offs make a lot of sense in Tesla's supercomputer. Tesla doesn't need Dojo cores to juggle several running applications; Dojo only ever runs internal, trusted code. So Tesla doesn't care about virtual memory support. Likewise, the machine learning programs running on Dojo are written for that specific system. You won't have a grab bag of arbitrary programs that might demand more memory than is available, so you don't need precise exceptions (and virtual memory) to enable techniques such as overcommitting memory, memory compression, or swapping pages out to disk. Precise exceptions are also useful for debugging, but Tesla gets that more cheaply with a separate debug mode.
The trade-offs Tesla made in pursuit of compute density certainly wouldn't fly in consumer or server CPUs. But they are fascinating to see in action, and we have to thank Tesla for taking the time to present at Hot Chips.
Over the past two decades, process node improvements have been slowing, and with them progress in single-threaded performance. Over the past five years, power and cooling limits have been eating into multithreaded performance gains. Yet demand for more compute hasn't slowed, so companies are turning to more specialized hardware to keep up.
The architecture in Tesla's Dojo supercomputer is a great example of how to make trade-offs to increase computational density and how current trends favor the introduction of specialized hardware for throughput-limited applications.