Development Infrastructure and Tooling¶
Frameworks for ML/AI¶
- PyTorch
  - The dominance of PyTorch (as of 2022)
  - PyTorch Lightning
- TensorFlow
- JAX
Meta-Frameworks and Model Zoos¶
Distributed Training¶
There are different scenarios depending on:

- whether the data fits on a single GPU or not
- whether the model parameters fit on a single GPU or not
When the model fits on a single GPU¶
Data parallelism¶
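When a full copy of the model fits on each GPU, data parallelism replicates the model on every device and gives each one a different slice of the batch, averaging gradients across replicas. A minimal sketch using PyTorch's `DistributedDataParallel`, assuming a `torchrun` launch; the model and dataset are placeholders, not from the source:

```python
# Minimal sketch of data parallelism with PyTorch DistributedDataParallel (DDP).
# Assumes it is launched with `torchrun --nproc_per_node=N train.py`.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(128, 10).cuda()      # placeholder model
model = DDP(model, device_ids=[local_rank])  # each rank holds a full replica

dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))
sampler = DistributedSampler(dataset)        # each rank sees a different shard of the data
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(2):
    sampler.set_epoch(epoch)                 # reshuffle shards each epoch
    for x, y in loader:
        x, y = x.cuda(), y.cuda()
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()                      # gradients are all-reduced across ranks here
        optimizer.step()

dist.destroy_process_group()
```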
When the model does not fit on a single GPU¶
Sharded Data Parallelism¶
What takes up GPU memory? (A back-of-the-envelope estimate follows this list.)

- The model parameters that make up the model layers.
- The gradients needed to do back-propagation.
- The optimizer states, which include statistics about the gradients.
- A batch of data for model development.
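A rough sketch of how these items add up, assuming fp32 values, the Adam optimizer (which keeps two extra statistics per parameter), and an illustrative 7B-parameter model; the numbers are assumptions for illustration, not from the source:

```python
# Back-of-the-envelope GPU memory estimate for the items above, assuming
# fp32 parameters and Adam optimizer states. The 7B parameter count is
# purely illustrative.
n_params = 7e9
bytes_per_float = 4

params_gb    = n_params * bytes_per_float / 1e9       # model parameters
grads_gb     = n_params * bytes_per_float / 1e9       # gradients for back-propagation
optimizer_gb = 2 * n_params * bytes_per_float / 1e9   # Adam: first and second moments

total_gb = params_gb + grads_gb + optimizer_gb
print(f"parameters: {params_gb:.0f} GB")
print(f"gradients:  {grads_gb:.0f} GB")
print(f"optimizer:  {optimizer_gb:.0f} GB")
print(f"total before activations and the data batch: {total_gb:.0f} GB")
```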
Sharded data parallelism shards the model parameters, gradients, and optimizer states across GPUs, which means that larger batch sizes can be used. An example is:

- Microsoft ZeRO
  - https://www.microsoft.com/en-us/research/blog/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters/

Sharded data parallelism is implemented by:

- Microsoft DeepSpeed
- Facebook FairScale
  - FairScale CPU offloading: the ZeRO principle on a single GPU
- PyTorch Fully Sharded Data Parallel (FSDP), sketched below
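A minimal sketch of the last option, PyTorch FSDP, assuming a `torchrun` launch; the model is a placeholder:

```python
# Minimal sketch of PyTorch Fully Sharded Data Parallel (FSDP), one of the
# sharded-data-parallel implementations listed above.
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Sequential(           # placeholder model
    torch.nn.Linear(1024, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 1024),
).cuda()

# FSDP shards parameters, gradients, and optimizer states across ranks.
model = FSDP(model)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

x = torch.randn(8, 1024).cuda()
loss = model(x).sum()
loss.backward()
optimizer.step()
dist.destroy_process_group()
```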
Pipelined Model Parallelism¶
This approach puts different layers on different GPUs. Naively, however, only one GPU is active at any given time, because each GPU must wait for the previous one's activations.
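A toy sketch of that naive placement, assuming two GPUs and a hypothetical two-stage model: each stage lives on its own device and activations are copied between them, so GPU 1 idles while GPU 0 computes. Real pipelining schemes split each batch into micro-batches to keep more of the pipeline busy.

```python
# Toy sketch of naive model parallelism: each stage lives on its own GPU and
# activations are copied between devices. Without micro-batching, GPU 1 sits
# idle while GPU 0 computes, and vice versa.
import torch
import torch.nn as nn

class TwoStageModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.stage0 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.stage1 = nn.Linear(4096, 10).to("cuda:1")

    def forward(self, x):
        x = self.stage0(x.to("cuda:0"))   # runs on GPU 0
        x = self.stage1(x.to("cuda:1"))   # runs on GPU 1, only after GPU 0 finishes
        return x

model = TwoStageModel()
out = model(torch.randn(32, 1024))
print(out.device)  # cuda:1
```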
Tensor Parallelism¶
This approach distributes matrix multiplications across multiple GPUs.
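A single-process sketch of the idea (not any particular library's API), assuming two GPUs: one large matrix multiply is split column-wise, each GPU computes a partial output, and the pieces are concatenated. Real systems do this with collective communication across processes.

```python
# Sketch of the idea behind tensor parallelism: a large matrix multiply is
# split column-wise across two GPUs, each computes a partial result, and the
# pieces are gathered back together.
import torch

x = torch.randn(32, 1024)               # activations, replicated on both GPUs
W = torch.randn(1024, 4096)             # full weight matrix

W0, W1 = W.chunk(2, dim=1)              # split the output columns across devices
y0 = x.to("cuda:0") @ W0.to("cuda:0")   # each GPU computes half the output features
y1 = x.to("cuda:1") @ W1.to("cuda:1")

y = torch.cat([y0, y1.to("cuda:0")], dim=1)   # gather the pieces
assert y.shape == (32, 4096)
```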
Combine all 3 options¶
Resources to speed up model training¶
Compute¶
GPU benchmarks¶
Startups that provide GPU access¶
Comparing the cost of GPU access¶
Insight: "the most expensive per-hour chips are not the most expensive per-experiment!"
Resource Management¶
End to end solutions for cluster/compute management¶
- AWS SageMaker
- Anyscale
  - Ray
  - Ray Train
- Grid.ai
  - Uncertain future?
- Determined.ai