H19611
White Paper
Abstract
This white paper presents an overview of Generative AI, and introduces
Project Helix, a collaboration between Dell Technologies and NVIDIA to
enable high performance, scalable, and modular full-stack generative AI
solutions for large language models in the enterprise.
The information in this publication is provided as is. Dell Inc. makes no representations or warranties of any kind with respect
to the information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a particular
purpose.
Use, copying, and distribution of any software described in this publication requires an applicable software license.
Copyright ©2023 Dell Inc. or its subsidiaries. Other trademarks may be trademarks of their respective owners. Published in
the USA May 2023 White Paper H19611.
Dell Inc. believes the information in this document is accurate as of its publication date. The information is subject to change
without notice.
Contents
Introduction
    Executive summary
    About this document
    Audience
Benefits
    Generative AI benefits
    Dell and NVIDIA advantages
Conclusion
    Generative AI advantage
    We value your feedback
References
    Dell Technologies documentation
    NVIDIA documentation
Introduction
Executive summary
The growth of artificial intelligence (AI) applications and use cases is astounding, with
impacts across nearly all facets of business and personal lives. Generative AI, the branch
of AI that is designed to generate new data, images, code, or other types of content that
humans do not explicitly program, is becoming particularly impactful and influential.
According to one analyst, the global generative AI market size was already estimated at
USD 10.79 billion in 2022. It is projected to approach USD 118 billion by 2032, growing at
a compound annual growth rate (CAGR) of 27 percent from 2023 to 2032.¹
While public generative AI models such as ChatGPT, Google Bard AI, DALL-E, and other,
more specialized offerings are intriguing, there are valid concerns about their use in
the enterprise. These concerns include ownership of output, which encompasses issues
of accuracy, truthfulness, and source attribution.
Therefore, there is a compelling need for enterprises to develop their own Large
Language Models (LLMs) that are trained on proprietary datasets or developed and fine-
tuned from known pretrained models.
NVIDIA brings accelerated computing, AI software with pretrained foundation models
including the NeMo framework, and the expertise to build, customize, and run generative AI.
We are now partnering on a new generative AI project called Project Helix, a joint initiative
between Dell Technologies and NVIDIA, to bring generative AI to the world’s enterprise
data centers. Project Helix is a full-stack solution that enables enterprises to create and
run custom AI models, built with the knowledge of their business. We have designed a
scalable, modular, and high-performance infrastructure that enables enterprises
everywhere to create a wave of generative AI solutions that will reinvent their industries
and give them competitive advantage.
Generative AI is one of the most exciting and rapidly evolving fields in AI today. It is a
transformative technology, and the combination of powerful infrastructure and software
from Dell Technologies with accelerators, AI software, and AI expertise from NVIDIA is
second to none.
About this document
In this white paper, readers can gain a comprehensive overview of generative AI, including
its underlying principles, benefits, architectures, and techniques. They can also learn
about the various types of generative AI models and how they are used in real-world
applications.
This white paper also explores the challenges and limitations of generative AI, such as the
difficulty in training large-scale models, the potential for bias and ethical concerns, and the
trade-off between generating realistic outputs and maintaining data privacy.
This white paper also provides guidance about how to develop and deploy generative AI
models effectively. It includes considerations about hardware and software infrastructure
from Dell Technologies and NVIDIA, data management, and evaluation metrics – all
leading to a scalable, high-performance, production architecture for generative AI in the
enterprise.
Audience
This white paper is intended for business leaders, Chief Technology Officers (CTOs),
Chief Information Officers (CIOs), IT infrastructure managers, and systems architects who
are interested in, involved with, or considering implementation of generative AI.
Background
AI has gone through several phases of development since its inception in the mid-20th
century. The major phases of AI development, along with approximate timeframes, are:
1. Rule-based systems (1950s-1960s)—The first phase of AI development
addressed the creation of rule-based systems, in which experts encoded their
knowledge into a set of rules for the computer to follow. These systems were limited
in their ability to learn from new data or adapt to new situations.
While these phases are not strictly defined or mutually exclusive, they represent major
milestones in the development of AI and demonstrate the increasing complexity and
sophistication of AI algorithms and applications over time.
Definition and overview
Generative AI is a branch of artificial intelligence that builds models that can generate
content (such as images, text, or audio) that is not explicitly programmed by humans and
that is similar in style and structure to existing examples. Generative AI techniques use
deep learning algorithms to learn from large datasets of examples, identify patterns, and
generate new content that is similar to the original data.
One of the significant aspects of generative AI is its ability to create content that is
indistinguishable from content created by humans, which has numerous applications in
industries such as entertainment, design, and marketing. For example, generative AI can
create realistic images of products that do not exist yet, generate music that mimics the
style of a particular artist, or even generate text that is indistinguishable from content
written by humans.
Overall, generative AI has the potential to transform the way we create and consume
content. It has the potential to generate new knowledge and insights in various fields,
making it an exciting area of development in AI.
Evolution
Advances in deep learning algorithms and the availability of large datasets of natural
language text have driven the evolution of natural language generation (NLG) to
generative AI. Early NLG systems relied on rule-based or template-based approaches,
which were limited in their ability to generate diverse and creative content. However, with
the rise of deep learning techniques such as recurrent neural networks (RNNs) and
transformers, it has become possible to build models that can learn from large datasets of
natural language text and generate new text that is more diverse and creative.
A notable example is OpenAI's generative pretrained transformer (GPT), which learns to
generate text similar in style to the original data. Subsequent versions of the model,
including GPT-2 and GPT-3, have pushed the boundaries of what is possible with NLG,
generating text that is increasingly diverse, creative, and even human-like in some cases.
Transformer models
Transformer models are a type of deep learning model commonly used in natural
language processing (NLP) and other applications of generative AI. Transformers were
introduced in a seminal paper by Vaswani and others in 2017. They have since become a
key building block for many state-of-the-art NLP models.
The success of transformer models has led to the development of large-scale, pretrained
language models such as OpenAI's generative pretrained transformer (GPT) series and
Google's Bidirectional Encoder Representations from Transformers (BERT) model. These
pretrained models can be fine-tuned for specific NLP tasks with relatively little additional
training data, making them highly effective for a wide range of NLP applications.
Overall, transformer models have revolutionized the field of NLP and have become a key
building block for many state-of-the-art generative AI models. Their ability to learn
contextual relationships between words in a text sequence has offered new possibilities
for language generation, text understanding, and other NLP tasks.
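For illustration, the following sketch shows how a pretrained transformer can be fine-tuned for a downstream text-classification task using the open-source Hugging Face Transformers library. The checkpoint, dataset, and hyperparameters here are illustrative assumptions only and are not part of the Project Helix software stack.

```python
# Minimal sketch: fine-tuning a small pretrained transformer for text
# classification with the Hugging Face Transformers library. The checkpoint,
# dataset, and hyperparameters are illustrative placeholders.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "bert-base-uncased"                      # example pretrained checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

dataset = load_dataset("imdb")                        # example labeled dataset

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

tokenized = dataset.map(tokenize, batched=True)

args = TrainingArguments(output_dir="./finetuned",
                         per_device_train_batch_size=16,
                         num_train_epochs=1)
trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized["train"].shuffle(seed=42).select(range(2000)))
trainer.train()                                       # adapts the pretrained weights to the task
```

The key point of the sketch is that only a modest labeled dataset and a short training run are needed because the pretrained model already encodes general language knowledge.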
Workload characteristics
Generative AI workloads can be broadly categorized into two types: training and
inferencing. Training uses a large dataset of examples to train a generative AI model,
while inference uses a trained model to generate new content based on an input. Data
preparation before training can also be a significant task in creating custom models. All
these workloads have characteristics that must be considered in the design of solutions
and their infrastructure.
Types of workloads
There are several specific types of generative AI workloads; each has different
requirements. The system configurations described later in this white paper reflect these
requirements.
Inferencing
Inferencing is the process of using a generative AI model to generate new predictive
content based on input. A pretrained model is trained on a large dataset, and when new
data is fed into the model, it makes predictions based on what it has learned during
training. This process involves feeding an input sequence or image into the model and
receiving an output sequence or image as the result. Inferencing is typically faster and
less computationally intensive than training because it does not involve updating the
model parameters.
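The following minimal sketch illustrates inferencing with a small, publicly available causal language model using the Hugging Face Transformers library; the checkpoint, prompt, and generation settings are illustrative assumptions.

```python
# Minimal inference sketch: generating text from a pretrained causal language
# model. No parameters are updated; the model only produces predictions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "gpt2"                                   # small public example model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)
model.eval()                                          # inference mode

prompt = "Generative AI in the enterprise can"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():                                 # no gradients needed for inference
    output_ids = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.9)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```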
Model customization
Pretrained model customization is the process of retraining an existing generative AI
model for task-specific or domain-specific use cases. For large models, it is more efficient
to customize than to train the model on a new dataset. Customization techniques in use
today include fine-tuning, instruction tuning, prompt learning (including prompt tuning and
P-tuning), reinforcement learning with human feedback, transfer learning, and use of
adapters (or adaptable transformers).
The most useful types of customizations are fine-tuning, prompt learning, and transfer
learning.
Fine-tuning
Fine-tuning retrains a pretrained model on a specific task or dataset, adapting its
parameters to improve performance and make it more specialized. This traditional method
of customization either freezes all but one layer and adjusts weights and biases on a new
dataset or adds another layer to the neural network and recalculates weights and
biases on a new dataset.
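As a simple illustration of the freeze-all-but-one-layer pattern described above, the following PyTorch sketch uses a small stand-in network; with a real pretrained transformer, the same pattern applies to its final block or a newly added task head.

```python
# Sketch of the "freeze most layers" style of fine-tuning, using PyTorch.
# The model is a stand-in for a pretrained network.
import torch
import torch.nn as nn

model = nn.Sequential(                     # stand-in for a pretrained network
    nn.Linear(768, 768), nn.ReLU(),
    nn.Linear(768, 768), nn.ReLU(),
    nn.Linear(768, 2),                     # task-specific output layer
)

for param in model.parameters():           # freeze everything first
    param.requires_grad = False
for param in model[-1].parameters():       # then unfreeze only the last layer
    param.requires_grad = True

# Only the unfrozen parameters are passed to the optimizer, so the pretrained
# weights stay fixed while the task layer adapts to the new dataset.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)
```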
Prompt learning
Prompt learning is a strategy that allows pretrained language models to be repurposed for
different tasks without adding new parameters or fine-tuning with labeled data. These
techniques can also be used on large generative AI image models.
Prompt learning can be further categorized into two broader techniques: prompt tuning and
P-tuning.
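Conceptually, prompt tuning and P-tuning learn a small set of continuous "virtual token" embeddings while the language model itself stays frozen. The following PyTorch sketch illustrates the idea; the embedding width and the number of virtual tokens are illustrative assumptions.

```python
# Conceptual sketch of prompt tuning: trainable "virtual token" embeddings are
# prepended to the input embeddings while the language model stays frozen.
import torch
import torch.nn as nn

hidden_size = 768          # embedding width of the frozen language model (assumed)
num_virtual_tokens = 20    # learned soft prompt length (assumed)

soft_prompt = nn.Parameter(torch.randn(num_virtual_tokens, hidden_size) * 0.02)

def prepend_soft_prompt(input_embeddings: torch.Tensor) -> torch.Tensor:
    """Prepend the trainable prompt to a batch of token embeddings."""
    batch_size = input_embeddings.size(0)
    prompt = soft_prompt.unsqueeze(0).expand(batch_size, -1, -1)
    return torch.cat([prompt, input_embeddings], dim=1)

# During customization, only `soft_prompt` receives gradient updates; the
# pretrained model parameters remain frozen, which keeps per-task storage small.
optimizer = torch.optim.AdamW([soft_prompt], lr=1e-3)
```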
In this solution design, the configurations related to customization are optimized for fine-
tuning and P-tuning. However, the scalability and overall architecture design
considerations still apply to other customization techniques and for datasets other than
text.
Training
Training is the process of using a dataset to train a generative AI model initially. Training
feeds the model examples from the dataset and adjusts the model parameters to improve
its performance on the task. Training can be a computationally intensive process,
particularly for large-scale models like GPT-3.
In an end-to-end workflow for generative AI, the exact sequence of these steps depends
on the specific application and requirements. For example, a common workflow for LLMs
might involve:
• Preprocessing and cleaning the training data
• Training a generative AI model on the data
• Evaluating the performance of the trained model
• Fine-tuning the model on a specific task or dataset
• Evaluating the performance of the fine-tuned model
• Deploying the model for inferencing in a production environment
Transfer learning can also be used at various points in this workflow to accelerate the
training process or improve the performance of the model. Overall, the key is to select the
appropriate techniques and tools for each step of the workflow and to optimize the
process for the specific requirements and constraints of the application.
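The following Python skeleton mirrors this workflow; each function is a placeholder that marks a stage of the pipeline rather than a specific tool or API.

```python
# Skeleton of the end-to-end LLM workflow outlined above. Each function is a
# placeholder for a pipeline stage, not a specific product or API.
def preprocess(raw_corpus):
    """Clean, deduplicate, and tokenize the training data."""
    ...

def train(dataset):
    """Train (or continue pretraining) a generative model on the dataset."""
    ...

def evaluate(model, benchmark):
    """Score the model on held-out data or task benchmarks."""
    ...

def fine_tune(model, task_dataset):
    """Adapt the pretrained model to a specific task or domain."""
    ...

def deploy(model):
    """Package and serve the model for inferencing in production."""
    ...

base_model = train(preprocess(raw_corpus="enterprise-text"))
evaluate(base_model, benchmark="general")
tuned_model = fine_tune(base_model, task_dataset="support-tickets")
evaluate(tuned_model, benchmark="task-specific")
deploy(tuned_model)
```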
Types of outputs
The type of data used and the generative AI outcome vary depending on the type of
data being analyzed. While the focus of this project is on LLMs, other types of generative
AI models can produce other types of output.
• Text—LLMs can be used to generate new text based on a specific prompt or to
compile long sections of text into shorter summaries. For example, ChatGPT
can generate news articles or product descriptions from a few key details.
• Image—Generative AI models for images can be used to create realistic
images of people, objects, or environments that do not exist. For example,
StyleGAN2 can generate realistic portraits of nonexistent people.
• Audio—Generative AI models for audio can be used to generate new sounds or
music based on existing audio samples or to create realistic voice simulations.
For example, Tacotron 2 can generate speech that sounds like a specific
person, even if that person never spoke the words.
• Video—Generative AI models for video can be used to create videos based on
existing footage or to generate realistic animations of people or objects. For
example, DALL-E can generate images of objects that do not exist, and these
images can be combined to create animated videos.
In each case, the generative AI model must be trained on large datasets of the
appropriate datatype. The training process is tailored to the specific datatype because
different input and output formats are required for each type of data. Recent
advancements can now integrate differing datatypes, for example, using a text prompt to
generate an image.
The following examples show the challenges that businesses might face when
implementing generative AI models, and potential solutions for addressing those
challenges. It is important to approach each challenge on a case-by-case basis and work
with experts in the field to develop the best possible solutions.
Ownership of content
There are valid concerns in the enterprise about ownership of output and intellectual
property when using some generative AI models. These concerns include issues of
accuracy, truthfulness, and source attribution. Data used for training public models, while
extensive, might be based on incomplete or outdated knowledge or lead to the inability to
verify facts or access real-time information.
Data quality
One of the biggest challenges with any machine learning model is ensuring that the
training data is of high quality. This need is especially important for generative AI models,
which might require large amounts of training data to generate accurate results. To
address this challenge, businesses must ensure that their data is clean, well-labeled, and
representative of the problem they are trying to solve.
Model complexity
Generative AI models can be complex and require significant computational resources to
train and run. This requirement can be a challenge for businesses that do not have
access to powerful hardware or that are working with large datasets.
Ethical considerations
Generative AI models can have ethical implications, especially if they are used to create
content or make decisions that affect people's lives. To address this challenge,
businesses must carefully consider the potential ethical implications of their generative AI
models and work to ensure that they cause no harm.
Sustainability
Large-scale generative AI models require significant computational resources and power
to operate. Training and inference processes for such models can consume substantial
amounts of energy, contributing to increased carbon emissions, cooling demands, and
environmental impact.
Regulatory compliance
Depending on the industry and application, there might be regulatory requirements that
businesses must meet when implementing generative AI models. For example, in
healthcare, there might be regulations for patient privacy and data security. To address
this challenge, businesses must work closely with legal and compliance teams to ensure
that their generative AI models meet all regulatory requirements.
Benefits
Dell and NVIDIA advantages
The advantages that Dell Technologies and NVIDIA provide are significant, as together
we:
• Deliver full-stack generative AI solutions built on the best of Dell infrastructure
and software, with the latest NVIDIA accelerators, NVIDIA AI software, and AI
expertise
• Deliver validated designs that reduce the time and effort to design and specify
AI solutions, accelerating the time to value
• Provide sizing and scaling guidance so that your infrastructure is efficiently
tailored to your needs but can also grow as those needs expand
• Enable enterprises to build, customize, and run purpose-built generative AI on-
premises to solve specific business challenges, and use the same accelerated
computing platform to create leading models
• Assist enterprises with the entire generative AI life cycle, from infrastructure
provisioning and large model training to pretrained model fine-tuning, multisite
model deployment, and large model inferencing
• Enable custom generative AI models that focus on the desired operating
domain, have up-to-date knowledge of your business, have the necessary
skills, and can continuously improve in production
• Include state-of-the-art, pretrained foundation models to rapidly accelerate the
creation of custom generative AI models
• Ensure security and privacy of sensitive and proprietary company data, as well
as compliance with government regulations
• Provide powerful yet performance-optimized server and storage hardware
designs coupled with GPU acceleration, plus system management software that
includes advanced power management, thermal optimization, and overall
energy utilization monitoring
• Include the ability to develop safer and more trustworthy AI with known models
and datasets—a fundamental requirement of enterprises today
Use cases
Generative AI models have the potential to address a wide range of use cases and solve
numerous business challenges across different industries. Generative AI models can be
used for:
• Customer service—To improve chatbot intent identification, summarize
conversations, answer customer questions, and direct customers to appropriate
resources.
• Content creation—To create content such as product descriptions, social
media posts, news articles, and even books. This ability can help businesses
save time and money by automating the content creation process.
• Sales and marketing—To create personalized experiences for customers,
such as customized product recommendations or personalized marketing
messages.
• Product design—To design new products or improve existing products. For
example, a generative AI model can be trained on images of existing products
to generate new designs that meet specific criteria.
• Education—To create personal learning experiences, similar to tutors, and
generate learning plans and custom learning material.
• Fraud detection—To detect and prevent fraud in financial transactions or other
contexts. For example, a generative AI model can be trained to recognize
patterns of fraudulent behavior and flag suspicious transactions.
• Healthcare—To analyze medical images or patient data to aid in diagnosis or
treatment. For example, a generative AI model can be trained to analyze
medical images to identify cancerous cells or analyze protein structures for new
drug discovery.
• Gaming—To create more realistic and engaging gaming experiences. For
example, a generative AI model can be trained to create more realistic
animations or to generate new game levels.
• Software development—To write code from human language, convert code
from one programming language to another, correct erroneous code, or explain
code.
These examples show the many business challenges that generative AI models can help
solve. The key is to identify the specific challenges that are most pressing for a specific
business or industry, and then to determine how generative AI models can be used to
address those challenges.
High-level architecture
Dell Technologies and NVIDIA have been leading the way in delivering joint innovations
for AI and high-performance computing for years. With this project, we have jointly
designed a full-stack, workflow-centric solution that enables enterprises to create and run
generative AI models at any scale—from AI experimentation to AI production.
The architecture is modular and scalable, and it balances performance with efficiency.
The modularity enables the architecture to support numerous different AI workflows, as
explained in the following sections.
Spirit of modularity
The cornerstone of this joint architecture is modularity, offering a flexible design that
caters to a multitude of use cases, sectors, and computational requirements. A truly
modular AI infrastructure is designed to be adaptable and future-proof, with components
that can be mixed and matched based on specific project requirements. The Dell-NVIDIA
solution uses this approach, enabling businesses to focus on certain aspects of
generative AI workloads when building their infrastructure. This modular approach is
accomplished through specific use-case designs for training, model tuning, and inference
that make efficient use of each compute type. Each design starts with the minimum unit
for each use case, with options to expand.
A modular software stack is also critical to allow AI researchers, data scientists, data
engineers, and other users to design their infrastructure quickly and achieve rapid time to
value. The Dell-NVIDIA solution uses the best of NVIDIA AI software, with partner
solutions to build an AI platform that is adaptable and supported at each layer—from the
operating system to the scheduler to multiple AI Operations (AIOps) and Machine
Learning Operations (MLOps) solutions.
The following figure shows a high-level view of the solution architecture, with emphasis on
the software stack, from the infrastructure layer up through the AI application software.
At a high level, the solution architecture starts with the base hardware components from
Dell Technologies and NVIDIA, which are combined in permutations that focus on specific
AI workloads, such as training, fine-tuning, and inferencing. This white paper describes
the individual hardware components in a later section.
Each control plane or compute element supports either Red Hat Enterprise Linux or
Ubuntu as the operating system, which is preloaded with NVIDIA GPU drivers and
Compute Unified Device Architecture (CUDA) for bare metal use.
NVIDIA Base Command Manager (BCM) serves as the cluster manager by installing
software on the host systems in the cluster, deploying Kubernetes, and monitoring the
cluster state. Host provisioning is core to a well-functioning cluster, with the ability to load
the operating system, driver, firmware, and other critical software on each host system.
Kubernetes deployment includes GPU Operator and Network Operator installation, a
critical part of GPU and network fabric enablement. NVIDIA BCM supports both stateful
and stateless host management, tracking each system and its health and collecting
metrics that administrators can view in real time or roll up into reports.
At the top layer of the solution is the NVIDIA AI Enterprise software, which accelerates the
data science pipeline and streamlines development and deployment of production AI,
including generative AI, computer vision, speech AI, and more. Whether developing a new
AI model initially or using one of the reference AI workflows as a template to get started,
NVIDIA AI Enterprise offers secure, stable, end-to-end software that is rapidly growing
and fully supported by NVIDIA.
With Kubernetes deployed in the solution, there are several different MLOps solutions that
can be installed, whether open-source solutions like Kubeflow and MLFlow, or featured
supported solutions such as cnvrg.io, Domino, H2O.ai, Run:ai, and more. Each of these
solutions can be deployed to work in a multicluster and hybrid cloud scenario.
Each of these workflows has distinct compute, storage, network, and software
requirements. The solution design is modular and each of the components can be
independently scaled depending on the customer’s workflow and application
requirements. Also, some modules are optional or swappable with equivalent existing
solutions in an organization’s AI infrastructure such as their preferred MLOps and Data
Prep module or their preferred data module. The following table shows the functional
modules in the solution architecture:
Scalability
In the solution architecture, the functional modules can be scaled according to the use
cases and the capacity requirements. For example, the minimum training module unit for
large model training consists of eight PowerEdge XE9680 servers with 64 NVIDIA H100
GPUs.
As a theoretical example, the training module with an InfiniBand module could train a
175B parameter model in 112 days. To illustrate the scalability, six copies of these
modules could train the same model in 19 days. As another example, if you are training a
40B parameter model, then two copies of the training module are sufficient to train the
model in 14 days.
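These estimates assume near-linear scaling across training modules; the following back-of-the-envelope sketch shows the arithmetic behind the figures quoted above.

```python
# Back-of-the-envelope scaling estimate, assuming near-linear scaling across
# training modules (the assumption behind the figures quoted above).
baseline_days = 112     # one training module, 175B-parameter model (example above)
modules = 6

estimated_days = baseline_days / modules
print(f"{modules} modules: ~{estimated_days:.0f} days")   # ~19 days
```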
There is a similar scalability concept for the InfiniBand module. For example, one module
with two QM9700 switches can support up to 24 PowerEdge XE9680 servers. If you
double the InfiniBand module, in a fat-tree architecture, you can scale up to 48
PowerEdge XE9680 servers. The Ethernet module and Inference modules work similarly.
The Data module is powered by scale-out storage solutions, which can scale linearly to
meet performance and capacity requirements as you increase the number of servers and
GPUs in your Training and Inference modules.
Scalability and modularity are intrinsic to the Dell and NVIDIA design for generative AI
across the board.
Security
The Dell approach to security is intrinsic in nature—it is built-in, not bolted-on later, and it
is integrated into every step through the Dell Secure Development Lifecycle. We strive to
continuously evolve our PowerEdge security controls, features, and solutions to meet the
ever-growing threat landscape, and we continue to anchor security with a Silicon Root of
Trust.
Security features are built into the PowerEdge Cyber Resilient Platform, enabled by the
integrated Dell Remote Access Controller (iDRAC). There are many features added to the
system that span from access control to data encryption to supply chain assurance.
These features include Live BIOS scanning, UEFI Secure Boot Customization, RSA
Secure ID MFA, Secure Enterprise Key Management (SEKM), Secured Component
Verification (SCV), enhanced System Erase, Automatic Certificate Enrollment and
Renewal, Cipher-Select, and CNSA support. All features make extensive use of
intelligence and automation to help you stay ahead of threats, and to enable the scaling
demanded by ever-expanding usage models.
As enterprises move to production AI, maintaining a secure and stable AI platform can be
challenging. This challenge is especially true for enterprises that have built their own AI
platform using open source, unsupported AI libraries and frameworks. To address this
concern and minimize the burden of maintaining an AI platform, the NVIDIA AI Enterprise
software subscription includes continuous monitoring for security vulnerabilities, ongoing
remediation, and security patches, as well as priority notifications of critical vulnerabilities.
This monitoring frees enterprise developers to focus on building innovative AI applications
instead of maintaining their AI development platform. In addition, maintaining API stability
can be challenging due to the many open-source dependencies. With NVIDIA AI
Enterprise, enterprises can count on API stability by using a production branch that
NVIDIA AI experts maintain. Access to NVIDIA Support experts means that AI projects
stay on track.
Accelerators
As mentioned earlier, accelerators such as GPUs are often used to expedite the training
process. These accelerators are designed specifically for parallel processing of large
amounts of data, making them well suited for the matrix multiplication and other
operations required by generative AI models. In addition to specialized hardware, there
are also software-based acceleration techniques such as mixed precision training, which
can expedite the training process by reducing the precision of some of the calculations.
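The following sketch illustrates mixed precision training with PyTorch automatic mixed precision (AMP); the model, data, and optimizer are placeholders for illustration and assume a CUDA-capable GPU.

```python
# Minimal sketch of mixed precision training with PyTorch automatic mixed
# precision (AMP). The model, data, and optimizer are placeholders.
import torch
import torch.nn as nn

device = "cuda"
model = nn.Linear(1024, 1024).to(device)              # stand-in for a large model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()                  # rescales gradients for fp16 stability

for _ in range(10):                                   # placeholder training loop
    inputs = torch.randn(32, 1024, device=device)
    targets = torch.randn(32, 1024, device=device)
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():                   # forward pass in reduced precision
        loss = nn.functional.mse_loss(model(inputs), targets)
    scaler.scale(loss).backward()                     # scale loss to avoid fp16 underflow
    scaler.step(optimizer)
    scaler.update()
```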
Storage
Generative AI models can be sizable, with many parameters and intermediate outputs.
This volume means that the models require significant amounts of storage to hold all the
data. It is common to use distributed data systems such as Hadoop or Spark to store
the training data and intermediate outputs during training. For inferencing, it might be
possible to store the model on a local disk, but for larger models, it might be necessary to
use network-attached storage or cloud-based storage solutions. Scalable, high-capacity,
and low-latency storage components for both file and object stores are essential in AI
systems.
Summary
Generative AI requires significant amounts of computational power and storage, and often
involves the use of specialized accelerators such as GPUs. Also, high-speed networking
solutions are important to minimize latency during distributed training. By carefully
considering these requirements, businesses can build and deploy generative AI models
that are fast, efficient, and accurate.
Dell PowerEdge servers
Dell Technologies offers a range of acceleration-optimized servers and an extensive
acceleration portfolio with NVIDIA GPUs. Two Dell servers are featured in the solution for
generative AI.
The PowerEdge adaptive compute approach enables servers engineered to optimize the
latest technology advances for predictable profitable outcomes. The improvements in the
PowerEdge portfolio include:
• Focus on acceleration—Support for the most complete portfolio of GPUs,
delivering maximum performance for AI, machine learning, and deep learning
training and inferencing, high performance computing (HPC) modeling and
simulation, advanced analytics, and rich-collaboration application suites and
workloads
• Thoughtful thermal design—New thermal solutions and designs to address
dense heat-producing components, and in some cases, front-to-back, air-cooled
designs
• Dell multivector cooling—Streamlined, advanced thermal design for airflow
pathways within the server
Dell file storage
Dell PowerScale supports the most demanding AI workloads with all-flash NVMe file
storage solutions that deliver massive performance and efficiency in a compact form
factor.
There are several models used in the generative AI solution architecture, all powered by
the PowerScale OneFS operating system and supporting inline data compression and
deduplication. The minimum number of PowerScale nodes per cluster is three nodes, and
the maximum cluster size is 252 nodes.
PowerScale F900
PowerScale F900 provides the maximum performance
of all-NVMe drives in a cost-effective configuration to
address the storage needs of demanding AI workloads.
Each node is 2U in height and hosts 24 NVMe SSDs.
PowerScale F900 supports TLC or QLC drives for maximum performance. It enables you
to scale raw storage from 46 TB to 736 TB per node and up to 186 PB of raw capacity per
cluster.
PowerScale F600
PowerScale F600 includes NVMe drives to provide
larger capacity with massive performance in a cost-
effective compact 1U form factor to power demanding
workloads. The PowerScale F600 supports TLC or QLC drives for maximum
performance. Each node allows you to scale raw storage capacity from 15.36 TB to 245
TB and up to 60 PB of raw capacity per cluster.
Dell object storage
Dell Technologies offers a choice of object-based storage products, all of which are
scalable and cost-effective for high volumes of unstructured data for AI workloads.
Dell ECS
ECS enterprise object storage combines the
simplicity of S3 with extreme performance at scale
for modern workloads such as AI, machine
learning, and real-time analytics applications. ECS EXF900 offers all-flash, NVMe
performance with capacity that scales up to 5.898 PB per rack—as well as 21 times faster
performance* than the previous generation. Using ECS to fuel GPU servers with
throughput-optimized storage rapidly exposes training algorithms and applications to more
data than ever before.
*Based on Dell Technologies internal analysis comparing the max bandwidth of the ECS EXF900 (511 MB/s) to
the maximum bandwidth of the ECS EX300 (24 MB/s) for 10 KB writes, November 2020. Actual performance
will vary.
Dell ObjectScale
ObjectScale is software-defined object storage that
delivers performance at scale to support AI
workloads. It delivers datasets at high transfer rates
to the most demanding CPU and GPU servers, exposing AI training algorithms to more
data without introducing the complexity of HPC storage. This storage includes fast stable
support for objects as large as 30 TB. Clusters can be scaled out easily to enhance
performance and capacity linearly. With the ability to deploy on NVMe-based, all-flash
drives, storage performance is no longer a bottleneck. Additionally, object tagging
provides inference models with richer datasets from which to make smarter predictions.
The Dell PowerSwitch Z9432F-ON 100/400GbE fixed switch consists of Dell’s latest
disaggregated hardware and software data center networking solutions, providing state-
of-the-art, high-density 100/400 GbE ports and a broad range of functionality to meet the
growing demands of today’s data center environment. This innovative, next-generation
open networking high-density aggregation switch offers optimum flexibility and cost-
effectiveness for the Web 2.0, enterprise, mid-market, and cloud service providers with
demanding compute and storage traffic environments.
Through integration with the integrated Dell Remote Access Controller (iDRAC)
embedded in all PowerEdge servers, you can set policy-based controls to maximize
resource use and throttle-back power when performance demand ebbs. By using
predefined power policies, OpenManage Enterprise Power Manager can help mitigate
operational risks and ensure that your servers and their key workloads continue to
operate.
CloudIQ integrates data from all your OpenManage Enterprise Power Manager consoles
to monitor the health, capacity, performance, and cybersecurity of Dell components
across all your locations.
The CloudIQ portal displays your Dell infrastructure systems in one view to simplify
monitoring across your core and secondary data centers and edge locations as well as
data protection in public clouds. With CloudIQ, you can easily assure that critical business
workloads get the capacity and performance that they need, spend less time monitoring
and troubleshooting infrastructure, and spend more time innovating and focusing on
projects that add value to your business.
Dell Services
Dell Technologies provides multiple services, linking people, processes, and technology
to accelerate innovation and enable optimal business outcomes for AI solutions and all
your data center needs.
Consulting Services
Consulting Services help you create a competitive advantage for your business. Our
expert consultants work with companies at all stages of data analytics to help you plan,
implement, and optimize solutions that enable you to unlock your data capital and support
advanced techniques, such as AI, machine learning, and deep learning.
Deployment Services
Deployment Services help you streamline complexity and bring new IT investments online
as quickly as possible. Use our over 30 years of experience for efficient and reliable
solution deployment to accelerate adoption and return on investment (ROI) while freeing
IT staff for more strategic work.
Support Services
Support Services driven by AI and deep learning will change the way you think about
support with smart, groundbreaking technology backed by experts to help you maximize
productivity, uptime, and convenience. Experience more than fast problem resolution –
our AI engine proactively detects and prevents issues before they impact performance.
Managed Services
Managed Services can help reduce the cost, complexity, and risk of managing IT so you
can focus your resources on digital innovation and transformation while our experts help
optimize your IT operations and investment.
Residency Services
Residency Services provide the expertise needed to drive effective IT transformation and
keep IT infrastructure running at its peak. Resident experts work tirelessly to address
challenges and requirements, with the ability to adjust as priorities shift.
NVIDIA accelerators
The following NVIDIA GPUs are among the NVIDIA acceleration components used in this
generative AI solution architecture.
For small jobs, the NVIDIA H100 GPU can be partitioned to right-sized Multi-Instance
GPU (MIG) partitions. With Hopper Confidential Computing, this scalable compute power
can secure sensitive applications on shared data center infrastructure. The inclusion of
the NVIDIA AI Enterprise software suite reduces development time, simplifies deployment
of AI workloads, and makes the NVIDIA H100 GPU the most powerful end-to-end AI and
HPC data center platform.
As part of the NVIDIA OVX server platform, the NVIDIA L40 GPU delivers the highest
level of graphics, ray tracing, and simulation performance for NVIDIA Omniverse. With 48
GB of GDDR6 memory, even the most intense graphics applications run with the highest
level of performance.
The NVIDIA L4 GPU is the most efficient NVIDIA accelerator for mainstream use. Servers
equipped with the NVIDIA L4 GPU power up to 120 times higher AI video performance
and 2.5 times more generative AI performance over CPU solutions, as well as over four
times more graphics performance than the previous GPU generation. The NVIDIA L4
GPU’s versatility and energy-efficient, single-slot, low-profile form factor make it ideal for
global deployments, including edge locations.
For even greater scalability, NVIDIA NVSwitch builds on the advanced communication
capability of NVIDIA NVLink to deliver higher bandwidth and reduced latency for compute-
intensive workloads. To enable high-speed, collective operations, each NVIDIA NVSwitch
has 64 NVIDIA NVLink ports equipped with engines for NVIDIA Scalable Hierarchical
Aggregation Reduction Protocol (SHARP) for in-network reductions and multicast
acceleration.
NVIDIA AI software
NVIDIA enterprise software solutions are designed to give IT admins, data scientists,
architects, and designers access to the tools they need to easily manage and optimize
their accelerated systems.
NVIDIA AI Enterprise
NVIDIA AI Enterprise, the software layer of the NVIDIA AI platform, accelerates the data
science pipeline and streamlines development and deployment of production AI including
generative AI, computer vision, speech AI and more. This secure, stable, cloud-native
platform of AI software includes over 100 frameworks, pretrained models, and tools that
accelerate data processing, simplify model training and optimization, and streamline
deployment.
With an extensible and customizable framework, it has seamless integrations with
multiple HPC workload managers, including Slurm, IBM Spectrum LSF, OpenPBS, Univa
Grid Engine, and others. It offers extensive support for container technologies including
Docker, Harbor, Kubernetes, and operators. It also has a robust health management
framework covering metrics, health checks, and actions.
System configurations
Based on the modular, scalable architecture for generative AI described earlier and
powered by Dell and NVIDIA components, there are initially three system configurations in
this family of designs, each focused on a particular use case. The three optimized system
configurations are designed for inferencing, customization, and training use cases.
The following sections describe the system configurations for each area of focus at a high
level. Note that the control plane, data storage, and Ethernet networking for each case are
similar. Therefore, if you are building AI infrastructure that addresses two or more cases,
these core resources can be shared.
Large model inferencing
Many enterprises elect to start with a pretrained model and use it without modification or
conduct some prompt engineering or P-tuning to make better use of the model for a
specific function. Starting with production deployment in mind is critical in the case of
LLMs because there is a heavy demand for compute power. Depending on the size of the
model, many larger models require multiple 8x GPU systems to achieve second or
subsecond-level throughput. The minimum configuration for inferencing pretrained models
starts with a single PowerEdge R760xa server with up to four NVIDIA H100 GPUs or one
PowerEdge XE9680 server with eight NVIDIA H100 GPUs, based on model size and
number of instances. The number of nodes can then scale out as needed for performance
or capacity, though two nodes are recommended for reliability purposes.
When the model is split between GPUs, the communication between GPUs plays
a crucial role in delivering optimum performance. Therefore, the NVIDIA Triton
Inference Server software with multi-GPU deployment using NVIDIA
FasterTransformer technology might be employed.
For large models above 40B parameters, we recommend the PowerEdge XE9680
server. For model sizes less than 40B parameters, the PowerEdge R760xa server
delivers excellent performance.
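A rough sizing heuristic helps explain these recommendations: FP16 weights occupy about 2 bytes per parameter, plus working memory for activations and the key-value cache. The following sketch applies that heuristic; the 80 GB per-GPU memory figure and the overhead factor are assumptions for illustration only.

```python
# Rough sizing heuristic for multi-GPU inference. FP16 weights need ~2 bytes
# per parameter; the overhead factor and per-GPU memory are assumptions.
import math

def gpus_needed(params_billions: float, gpu_mem_gb: float = 80.0,
                overhead_factor: float = 1.3) -> int:
    weight_gb = params_billions * 2           # FP16: ~2 bytes per parameter
    total_gb = weight_gb * overhead_factor    # allowance for activations / KV cache
    return max(1, math.ceil(total_gb / gpu_mem_gb))

print(gpus_needed(40))    # -> 2 GPUs (80 GB each) for a 40B-parameter model
print(gpus_needed(175))   # -> 6 GPUs for a 175B-parameter model
```

Under these assumptions, models below roughly 40B parameters fit within a four-GPU server, while larger models benefit from an eight-GPU system, which is consistent with the guidance above.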
Large model customization
Many enterprises forgo initial training and elect to use and customize a pretrained model
as the basis for their solution. Using fine-tuning and P-tuning, it is possible to apply
enterprise-specific data to retrain a portion of an existing model or build a better prompt
interface to it. This method requires significantly less compute power than training a
model initially, with the ability to start with a similar configuration to the inference-only
configuration. The key difference is the addition of InfiniBand networking between
compute systems.
Design considerations for large model customization with fine-tuning or P-tuning using
pretrained large models include the following:
Even though this task is relatively less compute-intensive than large model
training, there is a need for a tremendous amount of information exchange (for
example, weights) between GPUs of different nodes. Therefore, InfiniBand is
required for optimized performance and throughput with an eight-way GPU and
an all-to-all NVLink connection. In some cases, when the model sizes are less
than 40B parameters and based on the application latency requirements, the
InfiniBand module can be optional.
P-tuning uses a small trainable model before using the LLM. The small model is
used to encode the text prompt and generate task-specific virtual tokens. Prompt-
tuning and prefix-tuning, which only tune continuous prompts with a frozen
language model, substantially reduce per-task storage and memory usage at
training.
For models less than 40B parameters, you might be able to use a PowerEdge
XE8640 server. For larger models, we recommend the PowerEdge XE9680
server.
Large model training
Large model training is the most compute-demanding workload of the three use cases,
with the largest models requiring data-center-scale clusters with large numbers of GPUs
to train a model in a few months. The minimum configuration for training requires eight
PowerEdge XE9680 servers with eight NVIDIA H100 GPUs each. The largest model
training requires expansion to greater cluster sizes of 16-times, 32-times, or even larger
configurations.
The training model has a considerable memory footprint that does not fit in a
single GPU; therefore, you must split the model across multiple GPUs (N-GPUs).
The combination of model size, parallelism techniques for performance, and the
size of the working dataset requires high communication throughput between
GPUs, thus benefitting from PowerEdge XE9680 servers with eight NVIDIA GPUs
fully connected to each other by NVIDIA NVLink and NVIDIA NVSwitch.
The QM9700 InfiniBand switch has 64 NDR (400 Gb/s) InfiniBand ports.
Therefore, 24 nodes of the PowerEdge XE9680 servers in this cluster fill
the ports on the QM9700 in the InfiniBand module. Additional InfiniBand
modules can be added in a fat-tree network topology.
Four Dell PowerScale F600 Prime storage platforms deliver 8 GB/s write and 40
GB/s read throughput performance with linear scaling.
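As an illustration of why this storage throughput matters during training, the following sketch estimates the size of a full training checkpoint and the time to write it, using the commonly cited figure of roughly 16 bytes per parameter for mixed-precision Adam training state; all values are rough assumptions rather than measured results.

```python
# Rough estimate of checkpoint size and write time during large model training.
# Assumes FP16 weights plus FP32 optimizer/master state (~16 bytes per
# parameter in total); the write bandwidth matches the storage figure above.
params = 175e9                      # 175B-parameter model
bytes_per_param = 16                # weights + optimizer/master copies (assumed)
checkpoint_tb = params * bytes_per_param / 1e12
write_bw_gbs = 8                    # GB/s aggregate write throughput (four F600 nodes)

write_minutes = (checkpoint_tb * 1000 / write_bw_gbs) / 60
print(f"checkpoint ~{checkpoint_tb:.1f} TB, ~{write_minutes:.0f} minutes to write")
# -> roughly 2.8 TB per checkpoint, written in about 6 minutes at 8 GB/s
```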
Summary
The information contained in this section is a high-level overview of the characteristics
and key design considerations of the suggested configurations for inferencing,
customization, and training of large language generative AI models. As mentioned earlier,
further details about each use case will follow this white paper in a series of design guides
for these Dell Validated Designs for AI.
Conclusion
Generative AI advantage
This document has explored the concepts, benefits, use cases, and challenges of
generative AI, and presented a scalable and modular solution architecture designed by
Dell Technologies and NVIDIA.
Project Helix is a unique collaboration between Dell Technologies and NVIDIA that makes
the promise of generative AI real for the enterprise. Together, we deliver a full-stack
solution, built on Dell infrastructure and software, and using the award-winning software
stack and accelerator technology of NVIDIA. Bringing together the deep knowledge and
creativity of NVIDIA with the global customer knowledge and technology expertise of Dell
Technologies, Project Helix:
• Delivers full-stack generative AI solutions built on the best of Dell infrastructure
and software, in combination with the latest NVIDIA accelerators, AI software,
and AI expertise.
• Enables enterprises to use purpose-built generative AI on-premises to solve
specific business challenges.
• Assists enterprises with the entire generative AI life cycle, from infrastructure
provisioning and large model development and training to pretrained model fine-
tuning, multisite model deployment, and large model inferencing.
• Ensures trust, security, and privacy of sensitive and proprietary company data,
as well as compliance with government regulations.
With Project Helix, Dell Technologies and NVIDIA enable organizations to automate
complex processes, improve customer interactions, and unlock new possibilities with
better machine intelligence. Together, we are leading the way in driving the next wave of
innovation in the enterprise AI landscape.
We value your feedback
Dell Technologies and the authors of this document welcome your feedback on this
document. Contact the Dell Technologies Solutions team by email.
For more information about this solution, you can engage with an expert by emailing
[email protected].
References
These materials may provide additional information about the solutions and components
presented here, as well as related offers.
Dell Technologies documentation
The following Dell Technologies documentation and resources provide additional and
relevant information to that contained within this white paper.
• Dell Technologies AI Solutions
• Dell Technologies Info Hub for Artificial Intelligence Solutions
• Dell PowerEdge XE Servers
• Dell PowerEdge Accelerated Servers and Accelerators (GPUs)
• Dell PowerScale Storage
• Dell ECS Enterprise Object Storage
• Dell ObjectScale Storage
• Dell PowerSwitch Z-series Switches
• Dell OpenManage Systems Management
NVIDIA documentation
The following NVIDIA documentation and resources also provide additional and relevant
information: