
Development Infrastructure and Tooling

Frameworks for ML/AI

Meta Frameworks and Model zoos

Distributed Training

There are different scenarios depending on:

  • whether the data fits on a single GPU or not
  • whether the model parameters fit on a single GPU or not

When the model fits on a single GPU

Data parallelism

https://www.reddit.com/r/MachineLearning/comments/hmgr9g/d_pytorch_distributeddataparallel_and_horovod/
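
As a concrete reference, here is a minimal sketch of data parallelism with PyTorch's DistributedDataParallel. The model, dataset, and hyperparameters are placeholders, and it assumes a launch via torchrun with one process per GPU.

```python
# Minimal sketch of data parallelism with PyTorch DistributedDataParallel (DDP).
# Assumes launch via `torchrun --nproc_per_node=<num_gpus> train.py`;
# the model and dataset below are placeholders.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    dist.init_process_group(backend="nccl")          # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(32, 2).cuda(local_rank)  # placeholder model
    model = DDP(model, device_ids=[local_rank])      # gradients are all-reduced across GPUs

    dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 2, (1024,)))
    sampler = DistributedSampler(dataset)            # each rank sees a disjoint shard of the data
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)                     # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()          # DDP synchronizes gradients here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```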

When the model does not fit on a single GPU

Sharded Data Parallelism

What takes up GPU memory?

  • The model parameters that make up the model layers
  • The gradients needed to do back-propagation
  • The optimizer states, which include statistics about the gradients
  • A batch of data for model development
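
A rough back-of-the-envelope calculation makes the list above concrete. It assumes mixed-precision training with the Adam optimizer (roughly the 16-bytes-per-parameter accounting used in the ZeRO paper); the parameter count is purely illustrative.

```python
# Rough GPU memory accounting for mixed-precision training with Adam
# (assumption: fp16 params/gradients plus fp32 master weights, momentum,
# and variance, i.e. ~16 bytes per parameter as in the ZeRO paper).
n_params = 7e9                       # illustrative 7B-parameter model

params_fp16 = 2 * n_params           # 2 bytes per fp16 parameter
grads_fp16 = 2 * n_params            # 2 bytes per fp16 gradient
optimizer_states = 12 * n_params     # fp32 master weights + Adam momentum + variance

total_gb = (params_fp16 + grads_fp16 + optimizer_states) / 1e9
print(f"~{total_gb:.0f} GB before counting activations and the data batch")
# -> ~112 GB, far more than a single GPU holds, which is why sharding helps
```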

Sharded data parallelism shards the model parameters, gradients, and optimizer states across GPUs, which frees up memory and means that larger batch sizes can be used.

Sharded data parallelism is implemented by:

  • Microsoft DeepSpeed
  • Facebook FairScale
  • FairScale CPU offloading, which applies the ZeRO principle on a single GPU
  • PyTorch Fully Sharded Data Parallel (FSDP), sketched below
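
A minimal sketch of sharded data parallelism using PyTorch's FullyShardedDataParallel (FSDP). As with DDP, it assumes one process per GPU (e.g. launched via torchrun), and the model is a placeholder.

```python
# Minimal sketch of sharded data parallelism with PyTorch FSDP.
# Assumes launch via torchrun with one process per GPU; the model is a placeholder.
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Sequential(               # placeholder model
    torch.nn.Linear(1024, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 1024),
).cuda(local_rank)

# Parameters, gradients, and optimizer states are sharded across ranks;
# full parameters are gathered only while a layer is being computed.
model = FSDP(model)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # built after wrapping

x = torch.randn(8, 1024, device=f"cuda:{local_rank}")
loss = model(x).sum()
loss.backward()
optimizer.step()

dist.destroy_process_group()
```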

Pipelined Model Parallelism

This approach places different layers of the model on different GPUs. Done naively, however, only one GPU is active at any given time, because each layer must wait for the output of the previous one; the sketch below illustrates why.
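
A minimal sketch of the naive version of this idea, assuming two GPUs are available; the layer sizes are placeholders. Splitting the batch into micro-batches (as GPipe-style pipelining does) is what keeps more than one GPU busy at a time.

```python
# Minimal sketch of (naive) model parallelism: different layers live on
# different GPUs and activations are passed between them.
# Assumes two GPUs; layer sizes are placeholders.
import torch
import torch.nn as nn

class TwoStageModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.stage0 = nn.Linear(1024, 1024).to("cuda:0")  # first half of the model
        self.stage1 = nn.Linear(1024, 10).to("cuda:1")    # second half of the model

    def forward(self, x):
        h = self.stage0(x.to("cuda:0"))
        # While stage1 works on this activation, stage0 sits idle (and vice versa)
        # unless the batch is split into micro-batches and pipelined.
        return self.stage1(h.to("cuda:1"))

model = TwoStageModel()
out = model(torch.randn(8, 1024))
print(out.shape)  # torch.Size([8, 10]), resident on cuda:1
```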

Tensor Parallelism

This approach splits individual matrix multiplications, which make up the bulk of a layer's computation, across multiple GPUs.
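
A minimal sketch of the idea, assuming two GPUs: a linear layer's weight matrix is split column-wise, each GPU computes its slice of the matrix multiplication, and the partial results are concatenated. Sizes are placeholders.

```python
# Minimal sketch of tensor parallelism: one weight matrix is split column-wise
# across two GPUs, each computes its part of the matmul, and the outputs are
# concatenated. Assumes two GPUs; sizes are placeholders.
import torch

x = torch.randn(8, 1024)                 # one batch of activations
w = torch.randn(1024, 4096)              # full weight matrix of a linear layer

w0 = w[:, :2048].to("cuda:0")            # first half of the output columns
w1 = w[:, 2048:].to("cuda:1")            # second half of the output columns

y0 = x.to("cuda:0") @ w0                 # each GPU computes part of the matmul
y1 = x.to("cuda:1") @ w1
y = torch.cat([y0.cpu(), y1.cpu()], dim=1)

# Same result as the unsplit matmul, up to floating-point accumulation error
assert torch.allclose(y, x @ w, atol=1e-3)
```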

Combine all 3 options

Resources to speed up model training

Compute

GPU benchmarks

Startups that provide GPU access

Comparing the cost of GPU access

FSDL tool

Insight: "the most expensive per hour chips are not the most expensive per experiment!"

Resource Management

End-to-end solutions for cluster/compute management

Experiment and model management

All-in-one solutions