Published at 2024-05-26 | Last Update 2024-05-28
This post summarizes bandwidths for local storage media, networking
infra, as well as remote storage systems. Readers may find this helpful when
identifying bottlenecks in IO-intensive applications
(e.g. AI training and LLM inference).
Fig. Peak bandwidth of storage media, networking, and distributed storage solutions.
Note: this post may contain inaccurate and/or stale information.
Before delving into the specifics of storage, let’s first go through some fundamentals about data transfer protocols.
From wikipedia SATA:
SATA (Serial AT Attachment) is a computer bus interface that connects host bus adapters to mass storage devices such as hard disk drives, optical drives, and solid-state drives.
Fig. SATA interfaces and cables on a computer motherboard. Image source wikipedia
The SATA standard has evolved through multiple revisions.
The current prevalent revision is 3.0, offering a maximum IO bandwidth of 600MB/s:
Table: SATA revisions. Data source: wikipedia
Spec | Raw data rate | Data rate | Max cable length |
---|---|---|---|
SATA Express | 16 Gbit/s | 1.97 GB/s | 1m |
SATA revision 3.0 | 6 Gbit/s | 600 MB/s | 1m |
SATA revision 2.0 | 3 Gbit/s | 300 MB/s | 1m |
SATA revision 1.0 | 1.5 Gbit/s | 150 MB/s | 1m |
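As a quick back-of-the-envelope check, the data rates in this table follow from the raw line rate and SATA's 8b/10b encoding (10 bits on the wire for every data byte). A minimal Python sketch:

```python
# Convert SATA raw line rates (Gbit/s) into usable data rates (MB/s).
# SATA uses 8b/10b encoding: each 8-bit byte is sent as 10 bits on the wire.

SATA_RAW_GBPS = {"SATA 1.0": 1.5, "SATA 2.0": 3.0, "SATA 3.0": 6.0}

def sata_data_rate_mbps(raw_gbps: float) -> float:
    usable_gbps = raw_gbps * 8 / 10   # strip 8b/10b encoding overhead
    return usable_gbps / 8 * 1000     # Gbit/s -> GB/s -> MB/s

for name, raw in SATA_RAW_GBPS.items():
    print(f"{name}: {sata_data_rate_mbps(raw):.0f} MB/s")  # 150 / 300 / 600 MB/s
```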
From wikipedia PCIe (PCI Express):
PCI Express is a high-speed serial computer expansion bus standard.
PCIe (Peripheral Component Interconnect Express) is another kind of system bus, designed to connect a variety of peripheral devices, including GPUs, NICs, sound cards, and certain storage devices.
Fig. Various slots on a computer motherboard, from top to bottom: PCIe x4 (e.g. for an NVMe SSD), PCIe x16 (e.g. for a GPU card), PCIe x1, PCIe x16, conventional PCI (32-bit, 5 V). Image source wikipedia
As shown in the above picture, a PCIe electrical interface is measured by its number of lanes. A lane is a single data send+receive line, functioning similarly to a “one-lane road” with traffic in both directions. Each new PCIe generation doubles the per-lane bandwidth of the previous generation:
Table: PCIe Unidirectional Bandwidth. Data source: trentonsystems.com
Generation | Year of Release | Data Transfer Rate | Bandwidth x1 | Bandwidth x16 |
---|---|---|---|---|
PCIe 1.0 | 2003 | 2.5 GT/s | 250 MB/s | 4.0 GB/s |
PCIe 2.0 | 2007 | 5.0 GT/s | 500 MB/s | 8.0 GB/s |
PCIe 3.0 | 2010 | 8.0 GT/s | 1 GB/s | 16 GB/s |
PCIe 4.0 | 2017 | 16 GT/s | 2 GB/s | 32 GB/s |
PCIe 5.0 | 2019 | 32 GT/s | 4 GB/s | 64 GB/s |
PCIe 6.0 | 2021 | 64 GT/s | 8 GB/s | 128 GB/s |
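The per-lane numbers can be reproduced from the transfer rate and each generation's line encoding: Gen1/2 use 8b/10b, Gen3 to Gen5 use 128b/130b (Gen6 moves to PAM4 signaling with FLIT encoding and is omitted from this rough sketch):

```python
# Approximate PCIe unidirectional bandwidth per lane from the transfer rate (GT/s)
# and the line-encoding efficiency of each generation.

PCIE_GENS = {                       # generation: (GT/s, encoding efficiency)
    "PCIe 1.0": (2.5, 8 / 10),      # 8b/10b encoding
    "PCIe 2.0": (5.0, 8 / 10),
    "PCIe 3.0": (8.0, 128 / 130),   # 128b/130b encoding
    "PCIe 4.0": (16.0, 128 / 130),
    "PCIe 5.0": (32.0, 128 / 130),
}

for gen, (rate, eff) in PCIE_GENS.items():
    x1 = rate * eff / 8             # one bit per transfer per lane -> bytes
    print(f"{gen}: x1 ~{x1:.2f} GB/s, x16 ~{x1 * 16:.1f} GB/s")
# e.g. PCIe 4.0: x1 ~1.97 GB/s, x16 ~31.5 GB/s, matching the ~2/32 GB/s above.
```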
Currently, the most widely used generations are Gen4 and Gen5.

Note: depending on the document you’re referencing, PCIe bandwidth may be presented as either unidirectional or bidirectional, with the latter indicating a bandwidth that is twice that of the former.
With the above knowledge, we can now proceed to discuss the performance characteristics of various storage devices.
HDD: ~200 MB/s
From wikipedia HDD:
A hard disk drive (HDD) is an electro-mechanical data storage device that stores and retrieves digital data using magnetic storage with one or more rigid, rapidly rotating platters coated with magnetic material.
A real-world picture is shown below:
Fig. Internals of a real world HDD. Image source hardwaresecrets.com
HDDs connect to a motherboard over one of several bus types, such as SATA.
Below is a SATA HDD:
Fig. A real world SATA HDD. Image source hardwaresecrets.com
And here is how an HDD connects to a computer motherboard via SATA cables:
Fig. An HDD with SATA cables. Image source datalab247.com
Mechanical factors
HDDs are mechanical devices, and their peak IO performance is inherently limited by various mechanical factors, including the speed at which the actuator arm can move. The current upper limit of HDDs is ~200MB/s, which is significantly below the saturation point of a SATA 3.0 interface (600MB/s).
Table. Latency characteristics typical of HDDs. Data source: wikipedia
Rotational speed (rpm) | Average rotational latency (ms) |
---|---|
15,000 | 2 |
10,000 | 3 |
7,200 | 4.16 |
5,400 | 5.55 |
4,800 | 6.25 |
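The average rotational latency above is simply the time of half a revolution at the given spindle speed:

```python
# Average rotational latency = half a revolution: 0.5 * (60 / rpm) s = 30000 / rpm ms.

for rpm in (15000, 10000, 7200, 5400, 4800):
    print(f"{rpm:>6} rpm -> {30000 / rpm:.2f} ms")   # 2.00 / 3.00 / 4.17 / 5.56 / 6.25 ms
```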
SATA SSD: ~600 MB/s
What’s an SSD? From wikipedia SSD:

A solid-state drive (SSD) is a solid-state storage device. It provides persistent data storage using no moving parts.
Like HDDs, SSDs support several kinds of bus types. Let’s see the first one: the SATA-interfaced SSD, or SATA SSD for short. SSDs are usually smaller than HDDs:
Fig. Size of different drives, left to right: HDD, SATA SSD, NVME SSD. Image source avg.com
SATA bus
The absence of mechanical components (such as actuator arms) allows SATA SSDs to fully utilize the capabilities of the SATA bus. This results in an upper limit of 600MB/s IO bandwidth, which is 3x faster than that of SATA HDDs.
NVMe SSD: ~7 GB/s (PCIe Gen4), ~13 GB/s (PCIe Gen5)
Let’s now explore another type of SSD: the PCIe-based NVMe SSD. NVMe SSDs are even smaller than SATA SSDs, and they connect directly to the PCIe bus via four lanes (x4) instead of SATA cables:
Fig. Size of different drives, left to right: HDD, SATA SSD, NVME SSD. Image source avg.com
PCIe bus
NVMe SSDs have a peak bandwidth of ~7.5GB/s over PCIe Gen4, and ~13GB/s over PCIe Gen5.
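These figures follow from the PCIe table above: an x4 link offers four times the per-lane bandwidth, and real drives land somewhat below that due to protocol, controller, and NAND overhead. A rough sketch (the device figures are the ones quoted in this post, not measurements):

```python
# Theoretical PCIe x4 link bandwidth vs. the peak NVMe SSD rates quoted above.

PCIE_X1_GBS = {"Gen4": 2.0, "Gen5": 4.0}        # per-lane unidirectional bandwidth
NVME_PEAK_GBS = {"Gen4": 7.5, "Gen5": 13.0}     # figures quoted in the text

for gen, x1 in PCIE_X1_GBS.items():
    link = x1 * 4                               # NVMe SSDs typically use 4 lanes
    peak = NVME_PEAK_GBS[gen]
    print(f"{gen} x4 link: {link:.0f} GB/s, fastest drives: ~{peak} GB/s "
          f"({peak / link:.0%} of the link)")
```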
We illustrate the peak bandwidths of the aforementioned three kinds of local storage media in a graph:
Fig. Peak bandwidths of different storage media.
These devices (HDDs, SSDs) are commonly called non-volatile or persistent storage media. As the picture hints, in the next chapters we’ll delve into some other kinds of storage devices.
DDR SDRAM nowadays serves mainly as the main memory in computers.
Fig. Front and back of a DDR RAM module for desktop PCs (DIMM). Image source wikipedia
Fig. Corsair DDR-400 memory with heat spreaders. Image source wikipedia
DDR memory connects to the motherboard via DIMM slots:
Fig. Three SDRAM DIMM slots on a ABIT BP6 computer motherboard. Image source wikipedia
DDR bandwidth is determined by the memory clock, bus width, channel count, etc. Single-channel bandwidth:
 | Transfer rate | Bandwidth |
---|---|---|
DDR4 | 3.2GT/s | 25.6 GB/s |
DDR5 | 4–8GT/s | 32–64 GB/s |
If a multi-channel memory architecture is enabled, the peak (aggregated) bandwidth is multiplied accordingly:

Fig. Dual-channel memory slots, color-coded orange and yellow for this particular motherboard. Image source wikipedia

For example [4], modern multi-channel server platforms reach aggregated memory bandwidths of roughly 307GB/s and 358GB/s.
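A quick sketch of the arithmetic: per-channel bandwidth is the transfer rate times the 64-bit (8-byte) bus width, multiplied by the channel count. The 8-channel server configurations below are assumptions chosen to reproduce the aggregate figures quoted above:

```python
# DDR bandwidth = transfer rate (GT/s) * 8 bytes per channel * number of channels.
# The 8-channel configurations are illustrative assumptions, not a specific product.

def ddr_bandwidth_gbs(gt_per_s: float, channels: int = 1) -> float:
    return gt_per_s * 8 * channels

print(ddr_bandwidth_gbs(3.2))                # DDR4-3200, single channel: 25.6 GB/s
print(ddr_bandwidth_gbs(4.8))                # DDR5-4800, single channel: 38.4 GB/s
print(ddr_bandwidth_gbs(4.8, channels=8))    # 8 x DDR5-4800: 307.2 GB/s
print(ddr_bandwidth_gbs(5.6, channels=8))    # 8 x DDR5-5600: 358.4 GB/s
```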
DDR5 bandwidth in the hierarchy:
Fig. Peak bandwidths of different storage media.
Now let’s see another variant of DDR, commonly used in graphics cards (GPUs).
From wikipedia GDDR SDRAM:
Graphics DDR SDRAM (GDDR SDRAM) is a type of synchronous dynamic random-access memory (SDRAM) specifically designed for applications requiring high bandwidth, e.g. graphics processing units (GPUs). GDDR SDRAM is distinct from the more widely known types of DDR SDRAM, such as DDR4 and DDR5, although they share some of the same features—including double data rate (DDR) data transfers.
Fig. Hynix GDDR SDRAM. Image Source: wikipedia
GDDR bandwidth is determined by factors such as bus width (lanes) and clock rates. Unlike DDR, GDDR is directly integrated with GPU devices, bypassing the need for pluggable PCIe slots. This integration liberates GDDR from the bandwidth limitations imposed by the PCIe bus.
For example, the NVIDIA RTX 4090 (GDDR6X) delivers roughly 1TB/s of memory bandwidth.
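The back-of-the-envelope formula here is bus width times per-pin data rate. The sketch below uses commonly quoted RTX 4090 specs (384-bit bus, 21 Gbit/s per pin GDDR6X); treat them as assumptions for illustration:

```python
# GDDR bandwidth = bus width (bits) / 8 * per-pin data rate (Gbit/s).
# RTX 4090 figures below (384-bit bus, 21 Gbit/s GDDR6X) are illustrative assumptions.

bus_width_bits = 384
pin_rate_gbps = 21.0

bandwidth_gbs = bus_width_bits / 8 * pin_rate_gbps
print(f"~{bandwidth_gbs:.0f} GB/s")   # ~1008 GB/s, i.e. roughly 1 TB/s
```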
With GDDR included:
Fig. Peak bandwidths of different storage media.
If you’d like to achieve even higher bandwidth than GDDR, there is an option: HBM (High Bandwidth Memory).
A great innovation but a terrible name.
HBM is designed to provide a much wider memory bus than GDDR, resulting in higher data transfer rates.
Fig. Cut through a graphics card that uses HBM. Image Source: wikipedia
HBM sits inside the GPU package alongside the GPU die and is stacked – for example, the NVIDIA A800 GPU has 5 active stacks of 8 HBM DRAM dies (8-Hi), each stack providing two 512-bit channels, resulting in a total bus width of 5120 bits (5 active stacks * 2 channels * 512 bits) [3].
As another example, HBM3 (used in the NVIDIA H100) also has a 5120-bit bus, delivering 3.35TB/s of memory bandwidth:
Fig. Bandwidth of several HBM-powered GPUs from NVIDIA. Image source: nvidia.com
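The same bus-width-times-pin-rate arithmetic connects these two numbers; the per-pin data rate below is an assumed value chosen to roughly reproduce the quoted H100 figure:

```python
# HBM bandwidth = bus width (bits) / 8 * per-pin data rate (Gbit/s).
# The ~5.2 Gbit/s per-pin rate is an assumption that roughly reproduces
# the 3.35 TB/s quoted for the H100's HBM3.

stacks, channels_per_stack, bits_per_channel = 5, 2, 512
bus_width_bits = stacks * channels_per_stack * bits_per_channel   # 5120 bits
pin_rate_gbps = 5.2

bandwidth_tbs = bus_width_bits / 8 * pin_rate_gbps / 1000
print(f"bus width: {bus_width_bits} bits, bandwidth: ~{bandwidth_tbs:.2f} TB/s")  # ~3.33 TB/s
```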
The 4 squares on the left and right are just HBM chips:
Fig. AMD Fiji, the first GPU to use HBM. Image Source: wikipedia
HBM bandwidth is likewise determined by factors such as bus width (lanes) and clock rates. From wikipedia HBM:
 | Bandwidth | Year | GPU |
---|---|---|---|
HBM | 128GB/s/package | | |
HBM2 | 256GB/s/package | 2016 | V100 |
HBM2e | ~450GB/s | 2018 | A100, ~2TB/s; Huawei Ascend 910B |
HBM3 | 600GB/s/site | 2020 | H100, 3.35TB/s |
HBM3e | ~1TB/s | 2023 | H200, 4.8TB/s |
HBM is not exclusive to GPU memory; it is also integrated into some CPU models, such as the Intel Xeon CPU Max Series.
This chapter concludes our exploration of dynamic RAM technologies, which include DDR, GDDR, and HBM:
Fig. Peak bandwidths of different storage media.
Next, let’s look at some on-chip static RAM. The term “on-chip” in this post refers to memory that is integrated within the same silicon as the processor unit.
From wikipedia SRAM:
Static random-access memory (static RAM or SRAM) is a type of random-access memory that uses latching circuitry (flip-flop) to store each bit. SRAM is volatile memory; data is lost when power is removed.
The term static differentiates SRAM from DRAM:
 | SRAM | DRAM |
---|---|---|
data freshness | stable in the presence of power | decays in seconds, must be periodically refreshed |
speed (relative) | fast (10x) | slow |
cost (relative) | high | low |
mainly used for | cache | main memory |
SRAM requires more transistors per bit to implement, so it is less dense and more expensive than DRAM and also has a higher power consumption during read or write access. The power consumption of SRAM varies widely depending on how frequently it is accessed.
In the architecture of multi-processor (CPU/GPU/…) systems, a multi-tiered static cache structure is usually used:
Fig. NVIDIA H100 chip layout (L2 cache in the middle, shared by many SM processors). Image source: nvidia.com
Groq LPU: eliminating the memory bottleneck by using SRAM as main memory. From the official website:
Groq is the AI infra company that builds the world’s fastest AI inference technology with both software and hardware. The Groq LPU is designed to overcome two LLM bottlenecks: compute density and memory bandwidth. Eliminating external memory bottlenecks (using on-chip SRAM instead) enables the LPU Inference Engine to deliver orders of magnitude better performance on LLMs compared to GPUs.

Regarding the chip:
Fig. Die photo of 14nm ASIC implementation of the Groq TSP. Image source: groq paper [2]
The East and West hemispheres of the on-chip memory module (MEM) are built from SRAM and provide the memory concurrency necessary to fully utilize the 32 streams in each direction, giving the chip 220 MiBytes of on-chip SRAM in total. As with the other memory technologies above, its bandwidth is determined by factors such as clock rates.

This chapter ends our journey through various physical storage media, from mechanical devices like HDDs all the way to on-chip cache. We illustrate their peak bandwidths in a picture; note that the Y-axis is log10-scaled:
Fig. Speeds of different storage media.
These are the maximum IO bandwidths when performing read/write operations on a local node.
Conversely, when considering remote I/O operations, such as those involved in distributed storage systems like Ceph, AWS S3, or NAS, a new bottleneck emerges: networking bandwidth.
Data center networking: 2 * {25,100,200} Gbps

For traditional data center workloads, the following per-server networking configurations are typically sufficient (see the conversion sketch after the list):

* 2 * 25Gbps bonded NICs: 6.25GB/s unidirectional bandwidth when operating in active-active mode;
* 2 * 100Gbps bonded NICs: 25GB/s unidirectional bandwidth when operating in active-active mode;
* 2 * 200Gbps bonded NICs: 50GB/s unidirectional bandwidth when operating in active-active mode.
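The GB/s figures above are just the bonded Gbit/s rates divided by 8:

```python
# Convert bonded NIC configurations (2 ports, active-active) from Gbit/s to GB/s.

for port_gbps in (25, 100, 200):
    total_gbps = 2 * port_gbps
    print(f"2 x {port_gbps} Gbps -> {total_gbps / 8:.2f} GB/s unidirectional")
# 6.25 / 25.00 / 50.00 GB/s
```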
GPU RDMA networking (IB/RoCE): 8 * {100,400} Gbps

This type of networking facilitates inter-GPU communication and is not intended for general data I/O. The data transfer pathway is as follows:
HBM <---> NIC <---> IB/RoCE <---> NIC <---> HBM
Node1 Node2
Now we add networking bandwidths into our storage performance picture:
Fig. Speeds of different storage media, with networking bandwidth added.
If remote storage solutions (such as distributed file systems) are involved and networking is fast enough, the IO bottleneck shifts to the remote storage itself. That’s why there are some extremely high-performance storage solutions dedicated to today’s AI training workloads.
AlibabaCloud’s Cloud Parallel File Storage (CPFS) is an exemplar of such high-performance storage solutions. It claims to offer up to 2TB/s of aggregated bandwidth. But note that this bandwidth is an aggregate across multiple nodes; no single node can achieve this level of IO speed. You can do some calculations with PCIe bandwidth, networking bandwidth, etc., to understand why; a minimal sketch follows.
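Here is one such calculation; the per-node NIC, PCIe, and memory configuration is an illustrative assumption, not CPFS’s actual setup:

```python
# Per-node IO is capped by the slowest hop on the path, far below the 2 TB/s
# cluster-wide aggregate. The configuration below is an illustrative assumption.

nic_gbs = 2 * 200 / 8    # 2 x 200 Gbps bonded NICs -> 50 GB/s
pcie_gbs = 64            # PCIe Gen5 x16 link -> 64 GB/s
dram_gbs = 307           # 8-channel DDR5-4800 memory -> ~307 GB/s

print(f"per-node IO ceiling: ~{min(nic_gbs, pcie_gbs, dram_gbs):.0f} GB/s "
      f"(vs. 2000 GB/s aggregate)")
```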
An open-source counterpart is Ceph, which also delivers impressive results. For instance, with a cluster configuration of 68 nodes * 2 * 100Gbps/node, a user achieved an aggregated throughput of 1TB/s, as documented.
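As a quick sanity check, the reported throughput can be compared against the theoretical network ceiling of that cluster configuration:

```python
# 68 nodes, each with 2 x 100 Gbps networking, vs. the reported ~1 TB/s aggregate.

nodes = 68
node_gbs = 2 * 100 / 8                    # 25 GB/s per node
ceiling_gbs = nodes * node_gbs            # 1700 GB/s theoretical network limit
achieved_gbs = 1000                       # ~1 TB/s reported

print(f"network ceiling: {ceiling_gbs:.0f} GB/s, "
      f"achieved: {achieved_gbs} GB/s ({achieved_gbs / ceiling_gbs:.0%})")
```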
Now adding distributed storage aggregated bandwidth into our graph:
Fig. Peak bandwidth of storage media, networking, and distributed storage solutions.
This post compiles bandwidth data for local storage media, networking infrastructure, and remote storage systems. With this information as a reference, readers can evaluate potential IO bottlenecks in their own systems more effectively, as in the GPU server IO bottleneck analysis of [1]:
Fig. Bandwidths inside an 8xA100 GPU node