Topic: distributed-training Goto Github
Something interesting about distributed-training
distributed-training,Easy Parallel Library (EPL) is a general and efficient deep learning framework for distributed model training.
Organization: alibaba
distributed-training,TePDist (TEnsor Program DISTributed) is an HLO-level automatic distributed system for DL models.
Organization: alibaba
distributed-training,Training and serving large-scale neural networks with auto parallelization.
Organization: alpa-projects
Home Page: https://alpa.ai
distributed-training,This is the Docker container based on the open-source framework XGBoost (https://xgboost.readthedocs.io/en/latest/) that allows customers to use their own XGBoost scripts in SageMaker.
Organization: aws
distributed-training,Distributed Deep Learning on AWS Using CloudFormation (CFN), MXNet and TensorFlow
Organization: awslabs
distributed-training,Dynamic training with Apache MXNet reduces cost and time for training deep neural networks by leveraging AWS cloud elasticity and scale. The system reduces training cost and time by dynamically updating the training cluster size during training, with minimal impact on model training accuracy.
Organization: awslabs
distributed-training,A memory balanced and communication efficient FullyConnected layer with CrossEntropyLoss model parallel implementation in PyTorch
User: bindog
distributed-training,A Comprehensive Tutorial on Video Modeling
User: bryanyzhu
Home Page: https://cvpr20-video.mxnet.io
distributed-training,A high performance and generic framework for distributed DNN training
Organization: bytedance
distributed-training,IDDM (industrial, landscape, animated...): supports DDPM, DDIM, PLMS, a web UI, and multi-GPU distributed training. A PyTorch implementation of generative diffusion models with distributed training.
User: chairc
distributed-training,A full pipeline AutoML tool for tabular data
Organization: datacanvasio
Home Page: https://hypergbm.readthedocs.io/
distributed-training,universal visual model trained on LAION-400M
Organization: deepglint
Home Page: https://arxiv.org/pdf/2304.05884.pdf
distributed-training,DeepRec is a high-performance recommendation deep learning framework based on TensorFlow. It is hosted as an incubation project in the LF AI & Data Foundation.
Organization: deeprec-ai
distributed-training,HandyRL is a handy and simple framework based on Python and PyTorch for distributed reinforcement learning that is applicable to your own environments.
Organization: dena
distributed-training,Determined is an open-source machine learning platform that simplifies distributed training, hyperparameter tuning, experiment tracking, and resource management. Works with PyTorch and TensorFlow.
Organization: determined-ai
Home Page: https://determined.ai
distributed-training,How to use Cross-Replica / Synchronized BatchNorm in PyTorch
User: dougsouza
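The repository above explains cross-replica (synchronized) BatchNorm; recent PyTorch ships the same conversion as `nn.SyncBatchNorm`. A minimal sketch, assuming PyTorch is installed (the toy model is illustrative; the synchronized statistics only take effect when the model actually runs under `torch.distributed`):

```python
import torch.nn as nn

# Hypothetical model containing ordinary BatchNorm layers.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1),
    nn.BatchNorm2d(16),
    nn.ReLU(),
)

# Replace every BatchNorm layer with its synchronized equivalent.
# Under torch.distributed, batch statistics are then aggregated
# across all processes instead of being computed per replica.
sync_model = nn.SyncBatchNorm.convert_sync_batchnorm(model)

print(type(sync_model[1]).__name__)  # SyncBatchNorm
```

The conversion is recursive, so it also covers BatchNorm layers nested inside submodules.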
distributed-training,FEDML - The unified and scalable ML library for large-scale distributed training, model serving, and federated learning. FEDML Launch, a cross-cloud scheduler, further enables running any AI jobs on any GPU cloud or on-premise cluster. Built on this library, FEDML Nexus AI (https://fedml.ai) is your generative AI platform at scale.
Organization: fedml-ai
Home Page: https://fedml.ai
distributed-training,Demonstrate throughput of PyTorch FSDP
Organization: foundation-model-stack
Home Page: https://pytorch.org/docs/stable/fsdp.html
distributed-training,A PyTorch tutorial on Class-Incremental Learning | a distributed training template for CIL with fewer than 100 lines of core code.
User: g-u-n
distributed-training,[MLSys 2022] "BNS-GCN: Efficient Full-Graph Training of Graph Convolutional Networks with Partition-Parallelism and Random Boundary Node Sampling" by Cheng Wan, Youjie Li, Ang Li, Nam Sung Kim, Yingyan Lin
Organization: gatech-eic
distributed-training,Learn how to design, develop, deploy and iterate on production-grade ML applications.
User: gokumohandas
Home Page: https://madewithml.com
distributed-training,Official code for ReLoRA from the paper Stack More Layers Differently: High-Rank Training Through Low-Rank Updates
User: guitaricet
Home Page: https://arxiv.org/abs/2307.05695
distributed-training,A Jax-based library for designing and training transformer models from scratch.
User: hmunachi
distributed-training,
User: hongxinxiang
distributed-training,Minimal sharded dataset loaders, decoders, and utils for multi-modal document, image, and text datasets.
Organization: huggingface
distributed-training,PyTorch image models, scripts, pretrained weights -- ResNet, ResNeXT, EfficientNet, NFNet, Vision Transformer (ViT), MobileNet-V3/V2, RegNet, DPN, CSPNet, Swin Transformer, MaxViT, CoAtNet, ConvNeXt, and more
Organization: huggingface
Home Page: https://huggingface.co/docs/timm
distributed-training,Fengshenbang-LM (封神榜大模型) is an open-source large-model ecosystem led by the Cognitive Computing and Natural Language Research Center of IDEA Research Institute, serving as infrastructure for Chinese AIGC and cognitive intelligence.
Organization: idea-ccnl
distributed-training,DLRover: An Automatic Distributed Deep Learning System
Organization: intelligent-machine-learning
distributed-training,Decentralized deep learning in PyTorch. Built to train models on thousands of volunteers across the world.
Organization: learning-at-home
distributed-training,Fast and Adaptive Distributed Machine Learning for TensorFlow, PyTorch and MindSpore.
Organization: lsds
distributed-training,YOLO3D: End-to-end real-time 3D Oriented Object Bounding Box Detection from LiDAR Point Cloud (ECCV 2018)
User: maudzung
Home Page: https://arxiv.org/pdf/1808.02350v1.pdf
distributed-training,Efficient Deep Learning Systems course materials (HSE, YSDA)
User: mryab
distributed-training,This repo covers Kubeflow Environment with LABs: Kubeflow GUI, Jupyter Notebooks on pods, Kubeflow Pipelines, Experiments, KALE, KATIB (AutoML: Hyperparameter Tuning), KFServe (Model Serving), Training Operators (Distributed Training), Projects, etc.
User: omerbsezer
distributed-training,LiBai(李白): A Toolbox for Large-Scale Distributed Parallel Training
Organization: oneflow-inc
Home Page: https://libai.readthedocs.io
distributed-training,PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the core framework of PaddlePaddle (飞桨): high-performance single-machine and distributed training for deep learning and machine learning, with cross-platform deployment)
Organization: paddlepaddle
Home Page: http://www.paddlepaddle.org/
distributed-training,👑 Easy-to-use and powerful NLP and LLM library with 🤗 Awesome model zoo, supporting a wide range of NLP tasks from research to industrial applications, including 🗂Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis, etc.
Organization: paddlepaddle
Home Page: https://paddlenlp.readthedocs.io
distributed-training,Paddle Large Scale Classification Tools; supports ArcFace, CosFace, PartialFC, and Data Parallel + Model Parallel. Models include ResNet, ViT, Swin, DeiT, CaiT, FaceViT, MoCo, MAE, ConvMAE, and CAE.
Organization: paddlepaddle
distributed-training,A PyTorch distributed training framework
User: panjinquan
distributed-training,Resource-adaptive cluster scheduler for deep learning training.
Organization: petuum
Home Page: https://adaptdl.readthedocs.io/
distributed-training,Pinpoint Node.js agent
Organization: pinpoint-apm
distributed-training,TorchX is a universal job launcher for PyTorch applications. TorchX is designed to have fast iteration time for training/research and support for E2E production ML pipelines when you're ready.
Organization: pytorch
Home Page: https://pytorch.org/torchx
distributed-training,Distributed, mixed-precision training with PyTorch
User: richardkxu
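As a companion to the entry above, here is a minimal single-process sketch of distributed, mixed-precision training with stock PyTorch: DDP over the gloo backend plus a bfloat16 autocast region. The model, data, address, and port are all illustrative; a real run would launch one process per GPU with `torchrun` and use the NCCL backend with CUDA autocast.

```python
import os
import torch
import torch.distributed as dist
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process bootstrap for illustration only; the address and port
# are assumptions, and torchrun would normally set these variables.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

# DDP all-reduces gradients across ranks during backward().
model = DDP(torch.nn.Linear(16, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x, y = torch.randn(8, 16), torch.randn(8, 1)
# Mixed precision: ops inside the autocast region run in bfloat16
# where safe (on GPU one would use device_type="cuda" and a GradScaler
# for float16).
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    loss = F.mse_loss(model(x), y)
loss.backward()
opt.step()

dist.destroy_process_group()
```

With `world_size=1` the all-reduce is a no-op, so the script runs on a laptop CPU while keeping the same structure as a multi-node job.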
distributed-training,SkyPilot: Run LLMs, AI, and Batch jobs on any cloud. Get maximum savings, highest GPU availability, and managed execution—all with a simple interface.
Organization: skypilot-org
Home Page: https://skypilot.readthedocs.io
distributed-training,[ICLR 2018] Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training
User: synxlin
Home Page: https://arxiv.org/pdf/1712.01887.pdf
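The Deep Gradient Compression paper above reduces communication by sending only the largest gradient entries. A simplified sketch of that top-k sparsification step in plain PyTorch, omitting the paper's momentum correction and local gradient accumulation (the ratio and tensor are illustrative):

```python
import torch

def topk_compress(grad: torch.Tensor, ratio: float = 0.25):
    """Keep only the largest-magnitude `ratio` fraction of entries,
    returning (indices, values) as the compressed message."""
    flat = grad.flatten()
    k = max(1, int(flat.numel() * ratio))
    _, idx = flat.abs().topk(k)
    return idx, flat[idx]

def topk_decompress(idx, values, shape):
    """Scatter the kept entries back into a dense zero tensor."""
    flat = torch.zeros(shape).flatten()
    flat[idx] = values
    return flat.view(shape)

grad = torch.tensor([[0.1, -2.0], [0.05, 3.0]])
idx, values = topk_compress(grad, ratio=0.5)          # keep 2 of 4 entries
restored = topk_decompress(idx, values, grad.shape)
print(restored)  # only -2.0 and 3.0 survive; the rest are zeroed
```

In the full method, the dropped entries are not discarded but accumulated locally and sent once they grow large enough, which is what lets the paper push compression ratios into the hundreds without hurting accuracy.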
distributed-training,PyTorch distributed training
User: taishan1994
distributed-training,Fast and flexible AutoML with learning guarantees.
Organization: tensorflow
Home Page: https://adanet.readthedocs.io
distributed-training,Library for Fast and Flexible Human Pose Estimation
Organization: tensorlayer
Home Page: https://hyperpose.readthedocs.io
distributed-training,Ternary Gradients to Reduce Communication in Distributed Deep Learning (TensorFlow)
User: wenwei202
distributed-training,PyTorch implementation of over 30 real-time semantic segmentation models, e.g. BiSeNetv1, BiSeNetv2, CGNet, ContextNet, DABNet, DDRNet, EDANet, ENet, ERFNet, ESPNet, ESPNetv2, FastSCNN, ICNet, LEDNet, LinkNet, PP-LiteSeg, SegNet, ShelfNet, STDC, SwiftNet, with support for knowledge distillation, distributed training, etc.
User: zh320
distributed-training,OpenKS - A domain-generalizable knowledge learning and computation engine
Organization: zju-openks