
AI's Parallelization Methods in Supercomputers

The course provides an introduction to the key techniques for running and scaling deep learning models on supercomputers, focusing on the efficient training of large models. It includes both theoretical learning and practical exercises, with instructors providing guidance throughout the course.

This course introduces the essentials for running deep learning models on supercomputers and effectively scaling them. It provides foundational skills to enable efficient training of large models. The course is offered in a condensed two-day version and an extended five-day version. In the latter, you will learn how to use various parallelization techniques.

All days cover alternating sequences of theoretical input and hands-on exercises, during which the instructors are available for quick feedback and advice.

Learning goals

By the end of the course, you will be able to …

Short version (2 days)

Day 1: Supercomputer Access Basics

  • Understand what a supercomputer is.
  • Configure SSH keys.
  • Set up VS Code.
  • Use software packages of the supercomputer.
  • Run your first job on the supercomputer.
  • Bonus: Blablador.

Extended version (5 days)

Day 1: Supercomputer Access Basics

  • Understand what a supercomputer is.
  • Configure SSH keys.
  • Set up VS Code.
  • Use software packages of the supercomputer.
  • Run your first job on the supercomputer.
  • Bonus: Blablador.
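To give a flavor of the "run your first job" goal, here is a minimal Slurm batch script of the kind used on most supercomputers. All names in it (job name, module names, the `train.py` script) are placeholders for illustration; the exact partitions, accounts, and module names differ per system, so check your site's documentation.

```shell
#!/bin/bash
#SBATCH --job-name=first-job       # name shown in the queue
#SBATCH --nodes=1                  # one compute node
#SBATCH --gres=gpu:1               # request one GPU (syntax varies per site)
#SBATCH --time=00:10:00            # wall-clock limit
#SBATCH --output=first-job.%j.out  # %j expands to the job ID

# Module names are placeholders; check `module avail` on your system.
module load Python PyTorch

srun python train.py               # srun launches the step on the allocated node
```

Submit it with `sbatch first-job.sh` and inspect the queue with `squeue --me`.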

Day 2: Distributed Data Parallel (DDP)

  • Apply good practices before starting a training run.
  • Know where to store your data and how to load it.
  • Run your first PyTorch code on the supercomputer.
  • Understand what distributed training is.
  • Understand DDP.
  • Transform your code into a distributed one with DDP.
  • Use TensorBoard on the supercomputer.
  • Check GPU usage with llview.
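As a taste of what this day covers, the sketch below turns a toy PyTorch training step into a DDP one. It is a minimal illustration, not the course material: the `gloo` backend and single-process defaults are chosen so it also runs on a CPU-only machine, whereas on a cluster `torchrun` or Slurm would set the rank, world size, and rendezvous address.

```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train_step():
    # Rendezvous settings; on a cluster these usually come from Slurm/torchrun.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = nn.Linear(10, 1)       # toy model; stands in for your network
    ddp_model = DDP(model)         # wraps the model; gradients get all-reduced
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    x, y = torch.randn(8, 10), torch.randn(8, 1)
    loss = nn.functional.mse_loss(ddp_model(x), y)
    loss.backward()                # DDP averages gradients across ranks here
    optimizer.step()

    dist.destroy_process_group()
    return loss.item()

loss_value = train_step()
print(f"step done, loss = {loss_value:.4f}")
```

The only change relative to plain single-GPU code is the process-group setup and the `DDP(...)` wrapper; the training loop itself stays the same.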

Day 3: Tensor Parallelism (TP)

  • Know what TP is.
  • Parallelize your code with TP.
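Tensor parallelism splits individual layers across GPUs. The sketch below illustrates only the column-parallel idea with plain tensor operations on a single machine: each "device" holds a slice of the weight matrix and computes its slice of the output, and the final concatenation stands in for the all-gather that a real multi-GPU setup would perform.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
full = nn.Linear(8, 4, bias=False)       # reference layer, weight shape (4, 8)

# Column-parallel split: each "device" owns half of the output features.
w0, w1 = full.weight.chunk(2, dim=0)     # two shards of shape (2, 8)

x = torch.randn(3, 8)
# Each shard computes its slice of the output independently...
y0 = x @ w0.t()
y1 = x @ w1.t()
# ...and an all-gather (here: a simple concatenation) restores the full output.
y = torch.cat([y0, y1], dim=1)

# The sharded computation matches the unsharded layer exactly.
assert torch.allclose(y, full(x), atol=1e-6)
```

The payoff on real hardware is that no single GPU ever needs to hold the whole weight matrix.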

Day 4: Pipeline Parallelism (PP)

  • Know what PP is.
  • Parallelize your code with PP.
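Pipeline parallelism splits the model depth-wise into stages that live on different GPUs, and splits each batch into microbatches so the stages can work concurrently. The single-process sketch below shows only the microbatching logic; in a real pipeline each stage would sit on its own GPU and activations would be sent between them.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
stage0 = nn.Linear(8, 8)   # would live on GPU 0
stage1 = nn.Linear(8, 2)   # would live on GPU 1

x = torch.randn(4, 8)
microbatches = x.chunk(2)  # split the batch so both stages can stay busy

outputs = []
for mb in microbatches:
    h = stage0(mb)              # stage 0 processes a microbatch...
    outputs.append(stage1(h))   # ...and hands the activation to stage 1
y = torch.cat(outputs)

# Pipelined result equals running the whole batch through both stages at once.
assert torch.allclose(y, stage1(stage0(x)), atol=1e-6)
```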

Day 5: Fully Sharded Data Parallel (FSDP)

  • Understand FSDP.
  • Distribute your code with FSDP.
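The core idea behind FSDP is that each rank permanently stores only a shard of every parameter and all-gathers the full parameter just in time for computation. The sketch below simulates that sharding with plain tensors on one process; it is not the PyTorch FSDP API, only the mechanism it is built on.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(8, 4, bias=False)
flat = layer.weight.detach().flatten()   # FSDP flattens parameters for sharding

world_size = 2
shards = flat.chunk(world_size)          # rank i keeps only shards[i]

# Forward pass on a (simulated) rank: all-gather the shards to rebuild the
# full weight just in time, compute, then the full copy could be freed again.
full_weight = torch.cat(shards).view_as(layer.weight)
x = torch.randn(3, 8)
y = x @ full_weight.t()

# The sharded-then-gathered computation matches the original layer.
assert torch.allclose(y, layer(x), atol=1e-6)
```

This is what lets FSDP train models whose full parameter set would not fit on a single GPU.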

Course dates

Register now:

March 06–07, 2025

July 07–08, 2025

For more information on how to register, please follow the link for the respective course date.

Prerequisites

To participate in this course, you need to know

  • How to code with Python (as taught in the courses “First steps in Python” and “Data processing with Pandas & Data visualization in Matplotlib”) and PyTorch
  • Machine Learning (see the course “Machine Learning 1” or “Introduction to Machine Learning”) and Deep Learning (as taught in the course “Deep Learning”)

Target group

This course addresses anyone who wants to learn how to scale their models (students, researchers, employees, …).

This course is free of charge.

