
Development Infrastructure and Tooling

Frameworks for ML/AI

Meta Frameworks and Model zoos

Distributed Training

There are different scenarios depending on:

  • whether the data fits on a single GPU or not
  • whether the model parameters fit on a single GPU or not

When the model fits on a single GPU

Data parallelism

https://www.reddit.com/r/MachineLearning/comments/hmgr9g/d_pytorch_distributeddataparallel_and_horovod/
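
As a concrete reference, here is a minimal sketch of data parallelism with PyTorch's DistributedDataParallel. The model, dataset, and hyperparameters are placeholders, and it assumes a launch via torchrun with one process per GPU.

```python
# Minimal sketch of data parallelism with PyTorch DistributedDataParallel (DDP).
# Assumes launch via `torchrun --nproc_per_node=<num_gpus> train.py`;
# the model and dataset below are placeholders.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    dist.init_process_group(backend="nccl")          # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(32, 2).cuda(local_rank)  # placeholder model
    model = DDP(model, device_ids=[local_rank])      # gradients are all-reduced across GPUs

    dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 2, (1024,)))
    sampler = DistributedSampler(dataset)            # each rank sees a disjoint shard of the data
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)                     # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()          # DDP synchronizes gradients here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```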

When the model does not fit on a single GPU

Sharded Data Parallelism

What takes up GPU memory?

  • The model parameters that make up the model layers
  • The gradients needed to do back-propagation
  • The optimizer states, which include statistics about the gradients
  • A batch of data for model development
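
A rough back-of-the-envelope calculation makes the list above concrete. It assumes mixed-precision training with the Adam optimizer (roughly the 16-bytes-per-parameter accounting used in the ZeRO paper); the parameter count is purely illustrative.

```python
# Rough GPU memory accounting for mixed-precision training with Adam
# (assumption: fp16 params/gradients plus fp32 master weights, momentum,
# and variance, i.e. ~16 bytes per parameter as in the ZeRO paper).
n_params = 7e9                       # illustrative 7B-parameter model

params_fp16 = 2 * n_params           # 2 bytes per fp16 parameter
grads_fp16 = 2 * n_params            # 2 bytes per fp16 gradient
optimizer_states = 12 * n_params     # fp32 master weights + Adam momentum + variance

total_gb = (params_fp16 + grads_fp16 + optimizer_states) / 1e9
print(f"~{total_gb:.0f} GB before counting activations and the data batch")
# -> ~112 GB, far more than a single GPU holds, which is why sharding helps
```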

Sharded data parallelism shards the model parameters, gradients, and optimizer states across GPUs, which frees up memory and means that larger batch sizes can be used.

Sharded data parallelism is implemented by:

  • Microsoft DeepSpeed
  • Facebook FairScale
  • FairScale CPU offloading, which applies the ZeRO principle on a single GPU
  • PyTorch Fully Sharded Data Parallel (FSDP), sketched below
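
A minimal sketch of sharded data parallelism using PyTorch's FullyShardedDataParallel (FSDP). As with DDP, it assumes one process per GPU (e.g. launched via torchrun), and the model is a placeholder.

```python
# Minimal sketch of sharded data parallelism with PyTorch FSDP.
# Assumes launch via torchrun with one process per GPU; the model is a placeholder.
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Sequential(               # placeholder model
    torch.nn.Linear(1024, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 1024),
).cuda(local_rank)

# Parameters, gradients, and optimizer states are sharded across ranks;
# full parameters are gathered only while a layer is being computed.
model = FSDP(model)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # built after wrapping

x = torch.randn(8, 1024, device=f"cuda:{local_rank}")
loss = model(x).sum()
loss.backward()
optimizer.step()

dist.destroy_process_group()
```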

Pipelined Model Parallelism

This approach places different layers of the model on different GPUs. Done naively, however, only one GPU is active at any given time, because each layer must wait for the output of the previous one; the sketch below illustrates why.
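
A minimal sketch of the naive version of this idea, assuming two GPUs are available; the layer sizes are placeholders. Splitting the batch into micro-batches (as GPipe-style pipelining does) is what keeps more than one GPU busy at a time.

```python
# Minimal sketch of (naive) model parallelism: different layers live on
# different GPUs and activations are passed between them.
# Assumes two GPUs; layer sizes are placeholders.
import torch
import torch.nn as nn

class TwoStageModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.stage0 = nn.Linear(1024, 1024).to("cuda:0")  # first half of the model
        self.stage1 = nn.Linear(1024, 10).to("cuda:1")    # second half of the model

    def forward(self, x):
        h = self.stage0(x.to("cuda:0"))
        # While stage1 works on this activation, stage0 sits idle (and vice versa)
        # unless the batch is split into micro-batches and pipelined.
        return self.stage1(h.to("cuda:1"))

model = TwoStageModel()
out = model(torch.randn(8, 1024))
print(out.shape)  # torch.Size([8, 10]), resident on cuda:1
```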

Tensor Parallelism

This approach splits individual matrix multiplications, which make up the bulk of a layer's computation, across multiple GPUs.
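
A minimal sketch of the idea, assuming two GPUs: a linear layer's weight matrix is split column-wise, each GPU computes its slice of the matrix multiplication, and the partial results are concatenated. Sizes are placeholders.

```python
# Minimal sketch of tensor parallelism: one weight matrix is split column-wise
# across two GPUs, each computes its part of the matmul, and the outputs are
# concatenated. Assumes two GPUs; sizes are placeholders.
import torch

x = torch.randn(8, 1024)                 # one batch of activations
w = torch.randn(1024, 4096)              # full weight matrix of a linear layer

w0 = w[:, :2048].to("cuda:0")            # first half of the output columns
w1 = w[:, 2048:].to("cuda:1")            # second half of the output columns

y0 = x.to("cuda:0") @ w0                 # each GPU computes part of the matmul
y1 = x.to("cuda:1") @ w1
y = torch.cat([y0.cpu(), y1.cpu()], dim=1)

# Same result as the unsplit matmul, up to floating-point accumulation error
assert torch.allclose(y, x @ w, atol=1e-3)
```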

Combine all 3 options

Resources to speed up model training

Compute

GPU benchmarks

Startups that provide GPU access

Comparing the cost of GPU access

FSDL tool

Insight: "the most expensive per hour chips are not the most expensive per experiment!"

Resource Management

End-to-end solutions for cluster/compute management

Experiment and model management

All-in-one solutions