ml-containers

Repository for building ML images at CoreWeave

Index

See the list of all published images.

Special PyTorch Images:

PyTorch Base Images

CoreWeave provides custom builds of PyTorch, torchvision, and torchaudio, tuned for our platform, in a single container image: ml-containers/torch.

Versions compiled against CUDA 11.8.0, 12.0.1, 12.1.1, and 12.2.2 are available in this repository, with two variants:

  1. base: Tagged as ml-containers/torch:a1b2c3d-base-....
    1. Built from nvidia/cuda:...-base-ubuntu22.04 as a base.
    2. Only includes essentials (CUDA, torch, torchvision, torchaudio), so it has a small image size, making it fast to launch.
  2. nccl: Tagged as ml-containers/torch:a1b2c3d-nccl-....
    1. Built from ghcr.io/coreweave/nccl-tests as a base.
    2. Ultimately inherits from nvidia/cuda:...-cudnn8-devel-ubuntu22.04.
    3. Larger, but includes development libraries and build tools such as nvcc, which are necessary for compiling PyTorch extensions.
    4. These PyTorch builds use component libraries optimized for the CoreWeave cloud; see coreweave/nccl-tests.
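As an illustration of the tag layout described above (the commit hash and suffix here are hypothetical, only the <commit>-<variant>-... pattern is taken from the docs), the variant can be read out of a tag string with ordinary shell tools:

```shell
# Hypothetical tag following the ml-containers/torch pattern above:
# <commit>-<variant>-<rest>. The specific values are made up.
tag="a1b2c3d-nccl-cuda12.2.2"

# The second dash-separated field is the variant ("base" or "nccl").
variant=$(echo "$tag" | cut -d- -f2)
echo "$variant"   # prints "nccl"
```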

Note

Most torch images have both a variant built on Ubuntu 22.04 and a variant built on Ubuntu 20.04.

  • CUDA 11.8.0 is an exception, and is only available on Ubuntu 20.04.
  • Ubuntu 22.04 images use Python 3.10.
  • Ubuntu 20.04 images use Python 3.8.
  • The base distribution is indicated in the container image tag.
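The distribution-to-Python mapping in the note above can be expressed as a small helper (hypothetical, for illustration only):

```shell
# Map an Ubuntu base version to the Python version listed in the note above.
python_for_ubuntu() {
  case "$1" in
    22.04) echo "3.10" ;;
    20.04) echo "3.8" ;;
    *)     echo "unknown" ;;
  esac
}

python_for_ubuntu 22.04   # prints "3.10"
```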

PyTorch Extras

ml-containers/torch-extras extends the ml-containers/torch images with a set of common PyTorch extensions:

  1. DeepSpeed
  2. FlashAttention
  3. NVIDIA Apex

Each is compiled specifically against the custom PyTorch builds in ml-containers/torch.

Both base and nccl editions of ml-containers/torch-extras are available, matching those of ml-containers/torch. The base edition retains a small image size: a multi-stage build keeps the CUDA development libraries required to compile the extensions out of the final image.
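A rough sketch of that multi-stage pattern follows; the image tags, package, and paths here are hypothetical placeholders, not the repository's actual build files:

```dockerfile
# Hypothetical sketch: compile extensions in an nccl-edition stage (which has
# nvcc and the CUDA development libraries), then copy only the installed
# Python packages into the slim base-edition image.
FROM ghcr.io/coreweave/ml-containers/torch:<nccl-tag> AS builder
RUN pip3 install --no-build-isolation deepspeed

FROM ghcr.io/coreweave/ml-containers/torch:<base-tag>
COPY --from=builder /usr/local/lib/python3.10/dist-packages /usr/local/lib/python3.10/dist-packages
```

The final image never contains the build toolchain, only the compiled artifacts copied out of the builder stage.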

PyTorch Nightly

ml-containers/nightly-torch is an experimental, nightly release channel of the PyTorch Base Images in the style of PyTorch's own nightly preview builds, featuring the latest development versions of torch, torchvision, and torchaudio pulled daily from GitHub and compiled from source.

ml-containers/nightly-torch-extras is a version of PyTorch Extras built on top of the ml-containers/nightly-torch container images. These are not nightly versions of the extensions themselves, but rather match the extension versions in the regular PyTorch Extras containers.

⚠ The PyTorch Nightly containers are based on unstable, experimental preview builds of PyTorch, and should be expected to contain bugs and other issues. For more stable containers use the PyTorch Base Images and PyTorch Extras containers.

Organization

This repository contains multiple container image Dockerfiles; each is expected to be in its own folder, along with any other files needed for its build.

CI Builds (Actions)

The current CI builds are set up to run when changes to files in the respective folders are detected, so that only the changed container images are built. The actions are set up with one action per image, each utilizing the reusable base action build.yml. The reusable action accepts several inputs:

  • folder - the folder containing the Dockerfile for the image
  • image-name - the name to use for the image
  • build-args - arguments to pass to the Docker build
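A hypothetical per-image workflow calling the reusable action might look like the following; the trigger, paths, and values are illustrative, not taken from an actual workflow file in this repository:

```yaml
# Hypothetical caller workflow for one image's folder.
on:
  push:
    paths:
      - 'torch/**'   # build only when this image's folder changes

jobs:
  build-torch:
    uses: ./.github/workflows/build.yml
    with:
      folder: torch
      image-name: torch
      build-args: CUDA_VERSION=12.2.2
```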

Images built from the same source can share a single action, since the main reason for having multiple actions is to rebuild only the images that changed. A build matrix can be helpful in these cases: https://docs.github.com/en/actions/using-jobs/using-a-matrix-for-your-jobs
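For example, a matrix over a reusable workflow could fan one action out across variants; the CUDA versions below come from the list earlier in this README, while everything else is illustrative:

```yaml
# Hypothetical build matrix over the reusable build.yml workflow.
jobs:
  build:
    strategy:
      matrix:
        cuda: ["11.8.0", "12.0.1", "12.1.1", "12.2.2"]
    uses: ./.github/workflows/build.yml
    with:
      folder: torch
      image-name: torch
      build-args: CUDA_VERSION=${{ matrix.cuda }}
```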

ml-containers's Issues

feature request: TransformerEngine in torch-extras

I'd like to try out TransformerEngine. The reason is that I profiled our Tensor Parallel layers and found that communication runs sequentially with computation rather than overlapping it, which makes them very slow as the number of TP shards increases. Megatron supports overlapping communication with computation, but it calls out to TransformerEngine to make it happen. I wasn't able to find any other way to overlap unless I used NVIDIA primitives.

I wasn't able to install TE in my own image, because the GitHub Actions run times out.
Installing the dev build worked with this command:
RUN export MAX_JOBS=4 && export NVTE_WITH_USERBUFFERS=1 && export NVTE_FRAMEWORK=pytorch && pip3 install -v git+https://github.com/NVIDIA/TransformerEngine.git@main
4 jobs was a total guess, and I'm not sure that NVTE_FRAMEWORK=pytorch does anything.

There's a catch here: TE 1.4, the latest stable release (published 5 days ago), pins FlashAttention<=2.4.2. The version cap was recently bumped to 2.5.6 (NVIDIA/TransformerEngine@965803c), but that probably won't reach a stable TE release for a while.

So if you add stable TE, FlashAttention gets a version restriction. On the other hand, the nightly torch-extras image had a broken PyTorch both (2/2) times I tried it. It's up to you which image should get TE, and which version of TE.

The image I normally use is torch-extras, NCCL, CUDA 12.2.2 (latest), Ubuntu 22.04.

Scaling up self-hosted GitHub Actions runners

ml-containers uses a self-hosted GitHub Actions runner to build container images through CI. It is currently only capable of handling one job at a time, sequentially. As a consequence, complex builds with many variations such as ml-containers/torch are taking up to 7 hours per commit to finish their CI.

Very heavy commits slow down development, as this makes iteratively fixing bugs in a CI deployment impractical.

Either dedicating more resources to keep runners available, or implementing some form of autoscaling such as with actions-runner-controller, may improve the situation.

Warning from `requests` package when using any PyTorch application

Testcase:

  1. Clone https://github.com/pytorch-labs/gpt-fast
  2. Run this in the gpt-fast folder:
export MODEL_REPO=openlm-research/open_llama_7b
./scripts/prepare.sh $MODEL_REPO
python generate.py --compile --checkpoint_path checkpoints/$MODEL_REPO/model.pth --prompt "Hello, my name is"

Result:

/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (1.26.18) or chardet (3.0.4) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "

I solved this by running pip3 install --upgrade requests, which is the fix Google suggests.

Zero urgency; our image already fixes this, so it's not affecting anything.

[feature-request] Support for JAX container

Concise Description:
I'd like to use JAX for distributed training of LLMs. Also, the new release of Keras supports JAX as a backend alongside TF.

Describe the solution you'd like
I'd like either a separate JAX container or jaxlib included in a TF container since the TF ecosystem (data loading, serving, etc) supports JAX.

Describe alternatives you've considered
I could install JAX on top of the PyT container.

Triton may have been broken for a while in the nightly builds

I receive this error:

[rank7]:     from triton.compiler.compiler import triton_key
[rank7]: torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
[rank7]: ImportError: cannot import name 'triton_key' from 'triton.compiler.compiler' (/usr/local/lib/python3.10/dist-packages/triton/compiler/compiler.py)

Same problem as this one: pytorch/pytorch#123042

It's been present since before Mar 14. Given the solution in that thread, I suspect PyTorch's interaction with Triton may have changed. The non-nightly torch-extras build is fine.

Unfortunately, I have no testcase; running a simple torch.compile works fine. Maybe it'll solve itself when PyTorch 2.3 comes out; I'll bump this issue if it's still present when 2.3 lands in torch-extras.

bad :buildcache tag for extras images

I haven't checked any of the other images, but here, if you actually try to pull the "buildcache" tag, it errors. That tag should either be available or not listed in the repo. Normally I pin my versions, but I'm just trying to pull something with a good PyTorch install to run a terminal session; I understand the reasons not to use :latest or anything equivalent, but in that case it really shouldn't be listed.
