─ Latest accelerators offer market leading HBM3E memory capacity and are supported by partners and customers including Dell Technologies, HPE, Lenovo, Supermicro and others ─
─ AMD Pensando Salina DPU offers 2X generational performance and AMD Pensando Pollara 400 is industry’s first UEC ready NIC─
SAN FRANCISCO, Oct. 10, 2024 (GLOBE NEWSWIRE) — Today, AMD (NASDAQ: AMD) announced the latest accelerator and networking solutions that will power the next generation of AI infrastructure at scale: AMD Instinct
Built on the AMD CDNA
“AMD continues to deliver on our roadmap, offering customers the performance they need and the choice they want, to bring AI infrastructure, at scale, to market faster,” said Forrest Norrod, executive vice president and general manager, Data Center Solutions Business Group, AMD. “With the new AMD Instinct accelerators, EPYC processors and AMD Pensando networking engines, the continued growth of our open software ecosystem, and the ability to tie this all together into optimized AI infrastructure, AMD underscores the critical expertise to build and deploy world class AI solutions.”
AMD Instinct MI325X Extends Leading AI Performance
AMD Instinct MI325X accelerators deliver industry-leading memory capacity and bandwidth, with 256GB of HBM3E supporting 6.0TB/s offering 1.8X more capacity and 1.3x more bandwidth than the H2001. The AMD Instinct MI325X also offers 1.3X greater peak theoretical FP16 and FP8 compute performance compared to H2001.
This leadership memory and compute can provide up to 1.3X the inference performance on Mistral 7B at FP162, 1.2X the inference performance on Llama 3.1 70B at FP83 and 1.4X the inference performance on Mixtral 8x7B at FP16 of the H2004.
AMD Instinct MI325X accelerators are currently on track for production shipments in Q4 2024 and are expected to have widespread system availability from a broad set of platform providers, including Dell Technologies, Eviden, Gigabyte, Hewlett Packard Enterprise, Lenovo, Supermicro and others starting in Q1 2025.
Continuing its commitment to an annual roadmap cadence, AMD previewed the next-generation AMD Instinct MI350 series accelerators. Based on AMD CDNA 4 architecture, AMD Instinct MI350 series accelerators are designed to deliver a 35x improvement in inference performance compared to AMD CDNA 3-based accelerators5.
The AMD Instinct MI350 series will continue to drive memory capacity leadership with up to 288GB of HBM3E memory per accelerator. The AMD Instinct MI350 series accelerators are on track to be available during the second half of 2025.
AMD Next-Gen AI Networking
AMD is leveraging the most widely deployed programmable DPU for hyperscalers to power next-gen AI networking. Split into two parts: the front-end, which delivers data and information to an AI cluster, and the backend, which manages data transfer between accelerators and clusters, AI networking is critical to ensuring CPUs and accelerators are utilized efficiently in AI infrastructure.
To effectively manage these two networks and drive high performance, scalability and efficiency across the entire system, AMD introduced the AMD Pensando
The AMD Pensando Salina DPU is the third generation of the world’s most performant and programmable DPU, bringing up to 2X the performance, bandwidth and scale compared to the previous generation. Supporting 400G throughput for fast data transfer rates, the AMD Pensando Salina DPU is a critical component in AI front-end network clusters, optimizing performance, efficiency, security and scalability for data-driven AI applications.
The UEC-ready AMD Pensando Pollara 400, powered by the AMD P4 Programmable engine, is the industry’s first UEC-ready AI NIC. It supports the next-gen RDMA software and is backed by an open ecosystem of networking. The AMD Pensando Pollara 400 is critical for providing leadership performance, scalability and efficiency of accelerator-to-accelerator communication in back-end networks.
Both the AMD Pensando Salina DPU and AMD Pensando Pollara 400 are sampling with customers in Q4’24 and are on track for availability in the first half of 2025.
AMD AI Software Delivering New Capabilities for Generative AI
AMD continues its investment in driving software capabilities and the open ecosystem to deliver powerful new features and capabilities in the AMD ROCm
Within the open software community, AMD is driving support for AMD compute engines in the most widely used AI frameworks, libraries and models including PyTorch, Triton, Hugging Face and many others. This work translates to out-of-the-box performance and support with AMD Instinct accelerators on popular generative AI models like Stable Diffusion 3, Meta Llama 3, 3.1 and 3.2 and more than one million models at Hugging Face.
Beyond the community, AMD continues to advance its ROCm open software stack, bringing the latest features to support leading training and inference on Generative AI workloads. ROCm 6.2 now includes support for critical AI features like FP8 datatype, Flash Attention 3, Kernel Fusion and more. With these new additions, ROCm 6.2, compared to ROCm 6.0, provides up to a 2.4X performance improvement on inference6 and 1.8X on training for a variety of LLMs7.
Supporting Resources
- Follow AMD on LinkedIn
- Follow AMD on Twitter
- Read more about AMD Next Generation AI Networking here
- Read more about AMD Instinct Accelerators here
- Visit the AMD Advancing AI: 2024 event page
About AMD
For more than 50 years AMD has driven innovation in high-performance computing, graphics, and visualization technologies. Billions of people, leading Fortune 500 businesses, and cutting-edge scientific research institutions around the world rely on AMD technology daily to improve how they live, work, and play. AMD employees are focused on building leadership high-performance and adaptive products that push the boundaries of what is possible. For more information about how AMD is enabling today and inspiring tomorrow, visit the AMD (NASDAQ: AMD) website, blog, LinkedIn, and X pages.
CAUTIONARY STATEMENT
This press release contains forward-looking statements concerning Advanced Micro Devices, Inc. (AMD) such as the features, functionality, performance, availability, timing and expected benefits of AMD products including the AMD Instinct
AMD, the AMD Arrow logo, AMD CDNA, AMD Instinct, Pensando, ROCm, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other names are for informational purposes only and may be trademarks of their respective owners.
________________________________
1MI325-002 -Calculations conducted by AMD Performance Labs as of May 28th, 2024 for the AMD Instinct
Published results on Nvidia H200 SXM (141GB) GPU: 989.4 TFLOPS peak theoretical half precision tensor (FP16 Tensor), 989.4 TFLOPS peak theoretical Bfloat16 tensor format precision (BF16 Tensor), 1,978.9 TFLOPS peak theoretical 8-bit precision (FP8), 1,978.9 TOPs peak theoretical INT8 floating-point performance. BFLOAT16 Tensor Core, FP16 Tensor Core, FP8 Tensor Core and INT8 Tensor Core performance were published by Nvidia using sparsity; for the purposes of comparison, AMD converted these numbers to non-sparsity/dense by dividing by 2, and these numbers appear above.
Nvidia H200 source: https://nvdam.widen.net/s/nb5zzzsjdf/hpc-datasheet-sc23-h200-datasheet-3002446 and https://www.anandtech.com/show/21136/nvidia-at-sc23-h200-accelerator-with-hbm3e-and-jupiter-supercomputer-for-2024
Note: Nvidia H200 GPUs have the same published FLOPs performance as H100 products https://resources.nvidia.com/en-us-tensor-core/.
2 Based on testing completed on 9/28/2024 by AMD performance lab measuring overall latency for Mistral-7B model using FP16 datatype. Test was performed using input length of 128 tokens and an output length of 128 tokens for the following configurations of AMD Instinct
1x MI325X at 1000W with vLLM performance: 0.637 sec (latency in seconds)
Vs.
1x H200 at 700W with TensorRT-LLM: 0.811 sec (latency in seconds)
Configurations:
AMD Instinct
1x AMD Ryzen
Vs
NVIDIA H200 HGX platform:
Supermicro SuperServer with 2x Intel Xeon® Platinum 8468 Processors, 8x Nvidia H200 (140GB, 700W) GPUs [only 1 GPU was used in this test], Ubuntu 22.04), CUDA 12.6 Server manufacturers may vary configurations, yielding different results. Performance may vary based on use of latest drivers and optimizations. MI325-005
3 MI325-006: Based on testing completed on 9/28/2024 by AMD performance lab measuring overall latency for LLaMA 3.1-70B model using FP8 datatype. Test was performed using input length of 2048 tokens and an output length of 2048 tokens for the following configurations of AMD Instinct
1x MI325X at 1000W with vLLM performance: 48.025 sec (latency in seconds)
Vs.
1x H200 at 700W with TensorRT-LLM: 62.688 sec (latency in seconds)
Configurations:
AMD Instinct
1x AMD Ryzen
Vs
NVIDIA H200 HGX platform:
Supermicro SuperServer with 2x Intel Xeon® Platinum 8468 Processors, 8x Nvidia H200 (140GB, 700W) GPUs, Ubuntu 22.04), CUDA 12.6
Server manufacturers may vary configurations, yielding different results. Performance may vary based on use of latest drivers and optimizations.
4 MI325-004: Based on testing completed on 9/28/2024 by AMD performance lab measuring text generated throughput for Mixtral-8x7B model using FP16 datatype. Test was performed using input length of 128 tokens and an output length of 4096 tokens for the following configurations of AMD Instinct
1x MI325X at 1000W with vLLM performance: 4598 (Output tokens / sec)
Vs.
1x H200 at 700W with TensorRT-LLM: 2700.7 (Output tokens / sec)
Configurations:
AMD Instinct
1x AMD Ryzen
Vs
NVIDIA H200 HGX platform:
Supermicro SuperServer with 2x Intel Xeon® Platinum 8468 Processors, 8x Nvidia H200 (140GB, 700W) GPUs [only 1 GPU was used in this test], Ubuntu 22.04) CUDA® 12.6
Server manufacturers may vary configurations, yielding different results. Performance may vary based on use of latest drivers and optimizations.
5 CDNA4-03: Inference performance projections as of May 31, 2024 using engineering estimates based on the design of a future AMD CDNA 4-based Instinct MI350 Series accelerator as proxy for projected AMD CDNA
6 MI300-62: Testing conducted by internal AMD Performance Labs as of September 29, 2024 inference performance comparison between ROCm 6.2 software and ROCm 6.0 software on the systems with 8 AMD Instinct
ROCm 6.2 with vLLM 0.5.5 performance was measured against the performance with ROCm 6.0 with vLLM 0.3.3, and tests were performed across batch sizes of 1 to 256 and sequence lengths of 128 to 2048.
Configurations:
1P AMD EPYC
vs.
1P AMD EPYC 9534 CPU server with 8x AMD Instinct
Server manufacturers may vary configurations, yielding different results. Performance may vary based on factors including but not limited to different versions of configurations, vLLM, and drivers.
7 MI300-61: Measurements conducted by AMD AI Product Management team on AMD Instinct
System Configurations:
– AMD EPYC 9654 96-Core Processor, 8 x AMD MI300X, ROCm
Performance may vary on factors including but not limited to different versions of configurations, vLLM, and drivers.
Contact:
Aaron Grabein
AMD Communications
+1 737-256-9518
aaron.grabein@amd.com
Mitch Haws
AMD Investor Relations
+1 512-944-0790
mitch.haws@amd.com
Wall St Business News, Latest and Up-to-date Business Stories from Newsmakers of Tomorrow