Strategies for Parallelizing LLMs Masterclass

English | MP4 | AVC 1280×720 | AAC 44 kHz 2ch | 99 lectures (8h 41m) | 5.13 GB

Mastering LLM Parallelism: Scale Large Language Models with DeepSpeed & Multi-GPU Systems

Are you ready to unlock the full potential of large language models (LLMs) and train them at scale?

In this comprehensive course, you’ll dive deep into the world of parallelism strategies, learning how to efficiently train massive LLMs using cutting-edge techniques like data, model, pipeline, and tensor parallelism.

Whether you’re a machine learning engineer, data scientist, or AI enthusiast, this course will equip you with the skills to harness multi-GPU systems and optimize LLM training with DeepSpeed.

What You’ll Learn

  • Foundational Knowledge: Start with the essentials of IT concepts, GPU architecture, deep learning, and LLMs (Sections 3-7). Understand the fundamentals of parallel computing and why parallelism is critical for training large-scale models (Section 8).
  • Types of Parallelism: Explore the core parallelism strategies for LLMs—data, model, pipeline, and tensor parallelism (Sections 9-11). Learn the theory and practical applications of each method to scale your models effectively.
  • Hands-On Implementation: Get hands-on with DeepSpeed, a leading framework for distributed training (a minimal training sketch follows this list). Implement data parallelism on the WikiText dataset and master pipeline parallelism strategies (Sections 12-13). Deploy your models on RunPod, a multi-GPU cloud platform, and see parallelism in action (Section 14).
  • Fault Tolerance & Scalability: Discover strategies to ensure fault tolerance and scalability in distributed LLM training, including advanced checkpointing techniques (Section 15).
  • Advanced Topics & Trends: Stay ahead of the curve with emerging trends and advanced topics in LLM parallelism, preparing you for the future of AI (Section 16).
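
To give a flavor of the hands-on DeepSpeed sections, the sketch below shows the basic data-parallel training loop. It is illustrative only: the toy classifier, synthetic data, file name train.py, and hyperparameters are placeholders standing in for the course's WikiText setup. DeepSpeed splits each global batch across the available GPUs and averages the gradients between them.

# train.py: a minimal DeepSpeed data-parallelism sketch (illustrative, not the course's exact code)
# Launch with:  deepspeed train.py   (one worker process is spawned per visible GPU)
import torch
import torch.nn as nn
import deepspeed
from torch.utils.data import TensorDataset

# A toy classifier and synthetic data stand in for the course's WikiText language-model setup.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))

ds_config = {
    "train_batch_size": 64,                                  # global batch, split across data-parallel workers
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
    "zero_optimization": {"stage": 1},                       # ZeRO stage 1 shards optimizer states to save memory
}

# deepspeed.initialize wraps the model in a data-parallel engine and builds a distributed dataloader.
model_engine, optimizer, loader, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    training_data=dataset,
    config=ds_config,
)

loss_fn = nn.CrossEntropyLoss()
for inputs, labels in loader:
    inputs, labels = inputs.to(model_engine.device), labels.to(model_engine.device)
    loss = loss_fn(model_engine(inputs), labels)
    model_engine.backward(loss)                              # gradients are all-reduced across GPUs here
    model_engine.step()                                      # optimizer step on the synchronized gradients

Because only the data is sharded, every GPU keeps a full copy of the model; the pipeline and tensor parallelism sections cover splitting the model itself once it no longer fits on a single device.
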
Table of Contents

Introduction
1 Introduction & What Is This Course About
2 Course Structure
3 DEMO – What You’ll Build in This Course

Course Source Code and Resources
4 Get Course Slides
5 Get Source Code

Strategies for Parallelizing LLMs – Deep Dive
6 What is Parallelism and Why it Matters
7 Understanding the Single GPU Strategy
8 Understanding the Parallel Strategy and Advantages
9 Parallelism vs Single GPU – Summary

IT Fundamental Concepts
10 IT Fundamentals – Introduction
11 What is a Computer – CPU and RAM Overview
12 Data Storage and File Systems
13 OS File System Structure
14 LAN Introduction
15 What is the Internet
16 Internet Communication Deep Dive
17 Understanding Servers and Clients
18 GPUs – Overview

GPU Architecture for LLM Training Deep Dive
19 GPU Architecture for LLM Training
20 Why this Architecture Excels

Deep and Machine Learning – Deep Dive
21 Machine and Deep Learning Introduction
22 Deep and Machine Learning – Overview and Breakdown
23 Deep Learning Key Aspects
24 Deep Neural Networks – Deep Dive
25 The Single Neuron Computation – Deep Dive
26 Weights
27 Activation Functions – Deep Dive
28 Deep Learning – Summary
29 Machine Learning Introduction – ML vs DL
30 Learning Types and Full ML & DL Analogy Example
31 DL and ML Comparative Capabilities – Summary

Large Language Models – Fundamentals of AI and LLMs
32 Introduction
33 The Transformer Architecture Fundamentals
34 The Self-Attention Mechanism – Analogy
35 The Transformer Architecture Animation
36 The Transformers Library – Deep Dive

Parallel Computing Fundamentals & Parallelism in LLM Training
37 Parallel Computing Introduction – Key Concepts
38 Parallel Computing Fundamentals and Scaling Laws – Deep Dive

Types of Parallelism in LLM Training – Data, Model, and Hybrid Parallelism
39 Types of Parallelism in LLM Training
40 Data Parallelism – How It Works
41 Data Parallelism Advantages for LLM Training
42 Real-world Example – Data Parallelism in GPT-3 Training
43 Model, Tensor, and Layer Parallelism – Deep Dive
44 LLM Relevance and Implementation
45 Model vs Data Parallelism
46 Key Differences Highlighted – Data vs Model Parallelism
47 Data vs Model Parallelism
48 Hybrid Parallelism – Animation
49 Hybrid Parallelism – What is It and Motivation

Types of Parallelism – Pipeline and Tensor Parallelism
50 Pipeline Parallelism Overview
51 Pipeline Parallelism Key Concepts and How it Works – Step by Step
52 Pipeline Bubbles Key Concepts
53 Pipeline Schedules Key Concepts
54 Activation Recomputation – Overview and Introduction
55 Neural Networks, Activations, and Forward and Backward Passes – Full Dive
56 Understanding Activation Recomputation vs Standard Training – Deep Dive
57 Demo – Activation Recomputation Visualization
58 Activation Recomputation vs Standard Approach
59 Benefits of Activation Recomputation and Implementation Strategies
60 Pipeline Parallelism Implementation Frameworks and Key Takeaways

Tensor Parallelism – Deep Dive
61 What is Tensor Parallelism and Why – Benefits
62 Tensor Parallel Pizza Making Analogy
63 Tensors and Partitioning Strategies – Deep Dive
64 Tensor Communication Patterns – Deep Dive
65 Device Mesh Communication Pattern – Deep Dive
66 How Components Work Together in Distributed LLM Training
67 Understanding Tensor Parallelism with LEGO Bricks Animation Demo
68 Putting it All Together – All Strategies in LLM Training

HANDS-ON Strategies for Parallelism – Data Parallelism Deep Dive
69 Strategies for Parallelizing LLMs – Hands-on Introduction
70 PyTorch – LLM Training Library Overview
71 The Transformers Library – Overview
72 NumPy Overview
73 TorchVision and TorchDistributed Overview
74 DeepSpeed and Megatron-LM – Overview
75 Datasets and Why this Toolkit
76 HANDS-ON Data Parallelism – Training a Small Model – MNIST Dataset
77 Testing Pseudo Data Parallelism Trained Model
78 HANDS-ON Data Parallelism – Colab – Full Demo
79 Data Parallelism – Simulated Parallelism on GPU Takeaways

HANDS-ON Data Parallelism with WikiText Dataset & DeepSpeed Memory Optimization
80 Hands-on Data Parallelism – WikiText-2 Dataset
81 DeepSpeed – Full Dive
82 Hands-on Data Parallelism with DeepSpeed Optimization

Running TRUE Parallelism on Multiple GPU Systems – Runpod.io
83 Setup Runpod.io Environment Overview
84 Runpod SSH Setup
85 Setting up Runpod Parallelism in Jupyter Notebook
86 HANDS-ON – Parallelism with IMDB Dataset – Deep Dive – True Parallelism
87 Runpod Cleanup

Fault Tolerance and Scalability & Advanced Checkpointing Strategies – Deep Dive
88 Fault Tolerance Introduction & Types of Failures in Distributed LLM Training
89 Strategies for Fault Tolerance
90 Checkpointing in LLM Training – Animation
91 Basic Checkpointing in LLM Training
92 Incremental Checkpointing in LLM Training
93 Asynchronous Checkpointing in LLM Training
94 Multi-level Checkpointing in LLM Training – Animation
95 Checkpoint Storage Considerations – Deep Dive
96 Implementing a Hybrid Approach – Performance, Failure, Optimizations – Full Dive
97 Checkpoint Storage Strategy – Summary

Advanced Topics and Emerging Trends
98 Advanced Topics and Emerging Trends

Wrap up and Next Steps
99 Course Summary and Next Steps
