This project forked from stas00/ml-engineering

Machine Learning Engineering Online Book

Home Page: https://stasosphere.com/machine-learning/

License: Creative Commons Attribution Share Alike 4.0 International

An open collection of methodologies to help with successful training of large language models and multi-modal models.

This is technical material suitable for LLM/VLM training engineers and operators. That is, the content here contains lots of scripts and copy-n-paste commands to enable you to quickly address your needs.

This repo is an ongoing brain dump of my experiences training Large Language Models (LLMs) and VLMs; I acquired a lot of the know-how while training the open-source BLOOM-176B model in 2022 and the IDEFICS-80B multi-modal model in 2023. Currently, I'm working on developing/training open-source Retrieval Augmented models at Contextual.AI.

I've been compiling this information mostly for myself so that I could quickly find solutions I have already researched in the past and which have worked, but as usual I'm happy to share these with the wider ML community.

Table of Contents

My apologies if the layout is a bit unstable while I'm writing new chapters and gradually re-organizing the content to be more intuitive.

Part 1. Insights

  1. The AI Battlefield Engineering - What You Need To Know

Part 2. Key Hardware Components

  1. Accelerator - the workhorses of ML - GPU, TPU, IPU, XPU, FPGA, QPU, etc. (WIP)

  2. Network - intra-node and inter-node connectivity, calculating bandwidth requirements

  3. IO - local and distributed disks and filesystems

  4. CPU - cpus, affinities (WIP)

  5. CPU Memory - how much CPU memory is enough - the shortest chapter ever.

Part 3. Performance

  1. Fault Tolerance

  2. Performance

  3. Multi-Node networking

  4. Model parallelism

Part 4. Operating

  1. SLURM

  2. Training hyper-parameters and model initializations

  3. Instabilities

Part 5. Development

  1. Debugging software and hardware failures

  2. And more debugging

  3. Reproducibility

  4. Tensor precision / Data types

  5. HF Transformers notes - making small models, tokenizers, datasets, and other tips

Part 6. Miscellaneous

  1. Resources - LLM/VLM chronicles

Shortcuts

Things that you are likely to need to find quickly and often.

Tools:

Guides:

Gratitude

None of this would have been possible without my being entrusted with the specific LLM/VLM trainings from which I learned this know-how. This is a privilege that only a few enjoy, due to the prohibitive cost of renting huge ML compute clusters. So hopefully the rest of the ML community will vicariously learn from these notes.

Special thanks go to Thom Wolf who proposed that I lead the BLOOM-176B training back when I didn't know anything about large scale training. This was the project that catapulted me into the intense learning process. And, of course, HuggingFace for giving me the opportunity to work full time on BLOOM-176B and later on IDEFICS-80B trainings.

Contributing

If you've found a bug or a typo, or would like to propose an improvement, please don't hesitate to open an Issue or contribute a PR.

License

The content of this site is distributed under Attribution-ShareAlike 4.0 International.

My repositories map

Machine Learning: ML Engineering | ML ways | Porting

Guides: The Art of Debugging

Applications: ipyexperiments

Tools and Cheatsheets: bash | conda | git | jupyter-notebook | make | python | tensorboard | unix

Contributors

stas00, pitmonticone, anindya-saha, thecharlieblake, biogeek, cx0, evelynmitchell
