Name 3 approaches for optimizing run time / resource usage.
Data Parallelism
Pipeline Parallelism
Quantization (possibly in combination with Pruning and Distillation)
What is distributed training, and why is it used?
Distributed training enables training large models across multiple GPUs or TPUs to handle high computational demands.
Why is it needed?
Reduces training time.
Handles large datasets and models that cannot fit on a single GPU.
Improves efficiency by utilizing parallelism.
How does Data Parallelism work in distributed training?
Data Parallelism splits the dataset across multiple GPUs while keeping a copy of the model on each GPU.
Process:
Shard Data → Divide dataset into chunks for different GPUs.
Aggregate Gradients → Each GPU computes gradients and sends them to a central process.
Update Weights → The central process averages the gradients, updates the weights, and shares them back to all GPUs.
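A minimal PyTorch DistributedDataParallel sketch of this loop (assumption: launched with torchrun on a multi-GPU node; the model and data are toy placeholders). Note that DDP averages gradients with an all-reduce among the GPUs rather than through a separate parameter server:

```python
# Minimal DDP sketch (assumption: launched with `torchrun --nproc_per_node=2 train_ddp.py`).
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def main():
    dist.init_process_group("nccl")          # one process per GPU
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(128, 10).cuda(rank)
    model = DDP(model, device_ids=[rank])    # each rank holds a full copy of the model

    # Shard data: DistributedSampler gives each rank a different chunk of the dataset.
    dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))
    loader = DataLoader(dataset, batch_size=32, sampler=DistributedSampler(dataset))

    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = torch.nn.CrossEntropyLoss()

    for x, y in loader:
        x, y = x.cuda(rank), y.cuda(rank)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()                      # gradients are all-reduced (averaged) across ranks here
        optimizer.step()                     # every rank applies the same update

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```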
How does Pipeline Parallelism differ from Data Parallelism?
Pipeline Parallelism splits the model across multiple GPUs instead of splitting data.
How it works:
Each GPU holds part of the model.
Forward pass flows through multiple GPUs sequentially.
Backward pass updates all segments accordingly.
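A minimal sketch of splitting a model across two GPUs (assumption: 2 CUDA devices). This is naive model splitting; real pipeline parallelism additionally divides each batch into micro-batches so both stages stay busy:

```python
# Model split across two GPUs; activations flow GPU 0 -> GPU 1 in the forward pass.
import torch
import torch.nn as nn

class TwoStageModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.stage0 = nn.Sequential(nn.Linear(128, 256), nn.ReLU()).to("cuda:0")
        self.stage1 = nn.Sequential(nn.Linear(256, 10)).to("cuda:1")

    def forward(self, x):
        x = self.stage0(x.to("cuda:0"))   # first model segment on GPU 0
        x = self.stage1(x.to("cuda:1"))   # activations move to GPU 1 for the second segment
        return x

model = TwoStageModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(32, 128)
y = torch.randint(0, 10, (32,), device="cuda:1")

loss = loss_fn(model(x), y)
loss.backward()                           # backward pass flows GPU 1 -> GPU 0
optimizer.step()
```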
What is Quantization, and why is it used in model training and inference?
Quantization reduces computation by storing and processing 4/8-bit integers instead of 16/32-bit floating-point numbers.
Advantages:
Reduces memory usage.
Accelerates inference (less computational cost).
Can be combined with pruning (GPTQ) and distillation (ZeroQuant).
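A post-training dynamic quantization sketch with PyTorch (assumption: int8 weights for Linear layers, CPU inference; the toy model is a placeholder):

```python
# Dynamic quantization: Linear weights stored as int8, activations quantized on the fly.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).eval()

quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(quantized(x).shape)   # same interface, smaller weights and faster CPU matmuls
```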
What is Linear Quantization, and what is the formula?
Linear Quantization maps a large set of continuous values to a smaller set of discrete values.
Formula (standard asymmetric/affine form): q = clamp(round(x / s) + z, q_min, q_max), with scale s = (x_max - x_min) / (q_max - q_min) and zero point z = round(q_min - x_min / s); dequantization: x ≈ s · (q - z).
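A small NumPy sketch of this formula (the helper names are illustrative, not a library API):

```python
# Linear (affine) quantization of a tensor to uint8, following the formula above.
import numpy as np

def linear_quantize(x, q_min=0, q_max=255):
    s = (x.max() - x.min()) / (q_max - q_min)            # scale
    z = round(q_min - x.min() / s)                       # zero point
    q = np.clip(np.round(x / s) + z, q_min, q_max).astype(np.uint8)
    return q, s, z

def dequantize(q, s, z):
    return s * (q.astype(np.float32) - z)

x = np.random.randn(5).astype(np.float32)
q, s, z = linear_quantize(x)
print(x)
print(dequantize(q, s, z))   # close to x, up to a rounding error of about s/2
```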
What is Pruning?
Removing excess model weights to lower the parameter count. Much of this work is done purely for research purposes; different approaches exist for estimating the importance of individual parameters.
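A magnitude-pruning sketch with torch.nn.utils.prune (assumption: unstructured L1 pruning of 30% of a Linear layer's weights; the layer is a placeholder):

```python
# Zero out the 30% smallest-magnitude weights of a Linear layer.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(128, 64)
prune.l1_unstructured(layer, name="weight", amount=0.3)   # mask out 30% of weights by |w|

sparsity = (layer.weight == 0).float().mean().item()
print(f"sparsity: {sparsity:.2f}")

prune.remove(layer, "weight")   # make the pruning permanent (bake the mask into the weight)
```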
What is Distillation?
Train a small model (student) on the outputs of a larger model (teacher).
-> Distillation is closely related to model ensembling: an ensemble of teachers can be distilled into a single student model.
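A sketch of the usual distillation loss (softened-softmax KL term from Hinton et al. plus the hard-label cross-entropy; temperature T and weight alpha are illustrative choices):

```python
# Knowledge-distillation loss: student mimics the teacher's softened output distribution.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL divergence between softened student and teacher distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy on the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)              # from the (frozen) teacher
labels = torch.randint(0, 10, (8,))
print(distillation_loss(student_logits, teacher_logits, labels))
```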
Formula for estimating training time
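A common rule of thumb (assumption: dense transformer, roughly 6 FLOPs per parameter per token): training time ≈ 6 · N_params · N_tokens / (N_GPUs · FLOP/s per GPU · utilization). A quick worked example with illustrative numbers:

```python
# Worked example of the estimate above (illustrative numbers, not measurements):
# a 7B-parameter model trained on 1T tokens with 64 GPUs at 300 TFLOP/s and 40% utilization.
N = 7e9          # parameters
D = 1e12         # training tokens
flops = 6 * N * D

gpus = 64
peak = 300e12    # peak FLOP/s per GPU
mfu = 0.4        # utilization (model FLOPs utilization)

seconds = flops / (gpus * peak * mfu)
print(f"{seconds / 86400:.1f} days")   # roughly 63 days
```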