Thanks for this @rakataprime -- here are some thoughts (will look to @chainzero @andy108369 @troian for additional inputs as well):
What is our timeout for large docker container pulls on Akash?
[am] I am not sure if we have a limit right now. @troian or @andy108369 do you guys know?
What is our desired time budget for running the benchmark (i.e. how long should it run)?
[am] I was thinking we can leave it at the default (5min)
What is our desired computational budget for running the benchmark / smallest provider resources to assume that we have?
[am] This one is tricky. Our intention is to figure out the performance we get on different GPU models and we intend to build a decent number of providers in the first phase of the testnet (with the benchmarking happening in the second phase). So the computational budget would be a range of GPU+CPU combos and the goal would be to see which ones fail, which pass and of the ones that pass, what the relative performance is. Context: https://github.com/akash-network/community/blob/main/wg-gpu/GPU-AI-Incentivized-Testnet.md
What models are most important? Older models (e.g. ResNet) are more comparable across platforms, but are newer, application-focused models like a DALL-E 2 or LLaMA inference run more relevant to the end user?
[am] Agreed on older models being fine for the benchmark. The general thinking is that we don't care about the latest and greatest for the benchmarking exercise (reliability and consistency are more important) and we'll have a separate set of tasks for "just deploying" (not benchmarking) models, for which we will attempt to deploy the "latest and greatest". In terms of models, the below list would be great to hit (we're hoping to produce results similar to https://lambdalabs.com/gpu-benchmarks):
We could make the container a lot smaller by running the torchbench install script and downloading the models at run time which would add about a 5-10 minute start delay.
[am] 100% agree on doing this.
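A minimal sketch of what that run-time install entrypoint could look like. The repo URL is torchbench's real home, but the `--models` selection argument and paths are assumptions to verify against the torchbench README:

```python
# Hypothetical entrypoint sketch: clone torchbench and install only the
# benchmark models at container start, trading image size for the
# 5-10 minute startup delay mentioned above.
import subprocess

BENCH_MODELS = ["resnet50", "hf_Bert", "hf_Bert_large", "tacotron2"]

def install_commands(models):
    """Build the shell commands for a run-time torchbench install.
    The --models flag is an assumption; check install.py's actual CLI."""
    return [
        ["git", "clone", "--depth", "1",
         "https://github.com/pytorch/benchmark.git", "/opt/torchbench"],
        ["python", "/opt/torchbench/install.py", "--models", *models],
    ]

def run_install(models=BENCH_MODELS):
    for cmd in install_commands(models):
        subprocess.run(cmd, check=True)
```

Keeping the command construction in a pure function makes the entrypoint easy to test without network access.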
from awesome-akash.
So if we look at the two tiers of GPUs, we still have a lot of variation within those tiers.
Tier 1: H100, A100, V100, P100, A40, A10, P4, K80, T4, 4090, 4080, 3090 Ti, 3090, 3080 Ti, 3080, 3060 Ti.
Tier 2: RTX 2060, 2070, 2080, 2080 Ti, GTX 1030, 1050, 1050 Ti, 1060, 1070, 1070 Ti, 1080, 1080 Ti, 1630, 1650, 1660, 1660 Ti.
For instance, the latest CUDA 11 releases deprecate the K80 generation of cards and earlier.
The lowest-VRAM card in tier 1 is the 3060 Ti with 8 GB of VRAM; the lowest in tier 2 is the 1630 with 4 GB.
We probably wouldn't want to run benchmarks like BERT large on cards without enough VRAM to actually run them. Right now, of the models shared between the Lambda list and torchbench, the only ones we couldn't run on all of tier 1 would be BERT and other LLMs.
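One way to encode that gating in the benchmark entrypoint: skip models whose VRAM needs exceed the card's. The per-model VRAM figures below are illustrative guesses, not measured requirements:

```python
# Rough sketch of gating benchmarks by card VRAM, so e.g. hf_Bert_large
# is skipped on the 8 GB tier-1 cards. The GB thresholds are placeholder
# assumptions, not profiled numbers.
MIN_VRAM_GB = {
    "resnet50": 4,
    "hf_Bert": 8,
    "hf_Bert_large": 16,
    "tacotron2": 8,
}

def runnable_models(card_vram_gb, requirements=MIN_VRAM_GB):
    """Return the benchmark models that fit in the given VRAM."""
    return sorted(m for m, need in requirements.items()
                  if need <= card_vram_gb)
```

With these placeholder thresholds, a 4 GB tier-2 card like the 1630 would run only resnet50, while a 3060 Ti would run everything except hf_Bert_large.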
The other thorny issue is which CUDA/cuDNN version to install on the nodes. I think k8s is still limited to one driver version and one CUDA version per node. Even if you could run multiple CUDA versions, it would hurt distributed training if the pool were highly fragmented, because a deployment would only be able to work with the fraction of nodes matching its CUDA version. If you have to keep the newly deprecated cards, it may be better to pin just that generation to the last supported CUDA version and bump the others to the latest.
Currently the torchbench container uses pytorch 2.0.1-cuda11.7-cudnn8-runtime with Python 3.10.
There are major performance improvements with the latest CUDA 11 and PyTorch 2 for generative AI, especially Stable Diffusion, versus PyTorch 1 and CUDA versions from before Jan 2022.
The relevant torchbench models currently supported from the Lambda Labs list are:
resnet50
hf_Bert
hf_Bert_large
tacotron2
The models not currently included are ssd, gnmt, transformerxl_base, transformerxl_large, and waveglow. We could substitute transformerxl with longformer, and ssd with yolov3. I'm not sure what would be a similar model for gnmt that's already in torchbench.
If that subset of the shared models is sufficient, then I can refactor the container to install on run and update the entrypoint to benchmark only those shared models. Once we have a list of core models and a smaller container, we'll have a better sense of where we stand relative to the 5-minute GPU benchmark goal.
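The proposed mapping from the Lambda Labs suite to torchbench could be pinned down as data so the entrypoint and the docs stay in sync. The torchbench model names here (e.g. `hf_Longformer`, `yolov3`) are assumptions to verify against the current torchbench model list:

```python
# Mapping from the Lambda Labs benchmark suite to torchbench equivalents,
# as proposed above; None marks models with no obvious substitute yet.
LAMBDA_TO_TORCHBENCH = {
    "resnet50": "resnet50",
    "bert": "hf_Bert",
    "bert_large": "hf_Bert_large",
    "tacotron2": "tacotron2",
    "transformerxl_base": "hf_Longformer",   # substitute
    "transformerxl_large": "hf_Longformer",  # substitute
    "ssd": "yolov3",                         # substitute
    "gnmt": None,                            # no torchbench equivalent yet
}

def shared_models(mapping=LAMBDA_TO_TORCHBENCH):
    """The deduplicated set of torchbench models we can actually run."""
    return sorted({v for v in mapping.values() if v is not None})
```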
Thanks for the details @rakataprime - the substitutions of the models you mentioned sound fine.
The CUDA version issue should only arise in the case of a heterogeneous provider (more than one GPU type in the same cluster) where the GPU models require different CUDA versions, right? I think that may be a relatively uncommon case for the testnet (but could be a problem).
Thinking of the logistics of all this, would it be better to just build an SDL (or more than one SDL) that deploys a Jupyter notebook with the correct Python kernel and PyTorch included? At least for the TensorFlow models, the approach I was thinking we could take would be to have people run https://github.com/akash-network/awesome-akash/tree/master/tensorflow-jupyter-mnist and then use that instance to run the models from the list in https://github.com/tensorflow/models/tree/master/official
@anilmurty, if you don't actively try to corral the providers into standardized CUDA versions, it would prevent people from running training jobs like foundation models across multiple providers, because the SDL includes one Docker container for the training job, with a CUDA version dependency. My startup wants to train a foundation model with Akash (lmk if you want to discuss a formal partnership on this), but we would want to train across a huge cluster of GPUs, not just one provider. I think you can have GPU heterogeneity, but you want them on the same CUDA/cuDNN version and preferably with a known minimum VRAM. I think in k8s you can set GPU VRAM requirements with a helm plugin. Are VRAM resource requirements a setting in the SDL right now? I'm not sure if I saw that in the docs.
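Since the SDL apparently has no VRAM field today, a provider-side check is one stopgap. A minimal sketch, assuming `nvidia-smi` is available inside the container; the parsing is split out so it can be tested without a GPU:

```python
# Sketch of a minimum-VRAM check run inside the container before a job
# starts, as a stand-in for a (currently missing) SDL VRAM requirement.
import subprocess

def parse_vram_mib(smi_output):
    """Parse output of:
    nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits
    which prints one MiB value per GPU, one per line."""
    return [int(line.strip()) for line in smi_output.splitlines()
            if line.strip()]

def meets_minimum(min_gb):
    """True if every visible GPU has at least min_gb of VRAM."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return all(mib >= min_gb * 1024 for mib in parse_vram_mib(out))
```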
I don't like the notebooks because they're prone to people executing cells out of order and ending up with non-functional code. You could do a notebook and have people export it as a PDF after the benchmark runs, but the formatting of console-like output usually isn't great. I think we'd be better off writing JSON output somewhere else, like an S3-compatible bucket, IPFS, or an internal database, for aggregation. It might be a lot of data to write on chain, though you could certainly write some of the summary data on chain easily. I don't know if there's an easy Cosmos Python client, though; you may have to go through a Python-Rust bridge to do that easily.
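The S3-compatible route could look roughly like this. The bucket, key, and endpoint are placeholders; boto3's `endpoint_url` parameter is how you point it at non-AWS S3 APIs:

```python
# Sketch of off-chain result aggregation: serialize one benchmark run
# to JSON and push it to an S3-compatible bucket. Names are placeholders.
import json
import time

def build_record(provider, gpu_model, results):
    """Assemble one benchmark run into a JSON document."""
    return json.dumps({
        "provider": provider,
        "gpu_model": gpu_model,
        "timestamp": int(time.time()),
        "results": results,  # e.g. {"resnet50": {"imgs_per_sec": 1200.0}}
    })

def upload(record, bucket="akash-benchmarks", key="run.json"):
    import boto3  # assumption: boto3 is installed in the benchmark image
    s3 = boto3.client("s3", endpoint_url="https://s3.example.com")
    s3.put_object(Bucket=bucket, Key=key, Body=record.encode())
```

A flat JSON document per run keeps later aggregation simple, and only the small summary (not raw logs) would ever need to go on chain.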
hey @rakataprime - sorry for the late reply - somehow missed the notification of this. Would definitely be interested in discussing a partnership with you. I've reached out via discord DM to coordinate.
Re. notebooks - I was looking at them purely for the benchmarking exercise for the testnet and not really for use in production for training or inference.
Do you feel like the Pytorch SDL is usable now? Asking because I was planning to update the instructions to tell people to use either pytorch or tensorflow for the testnet exercise with a preference towards pytorch. Thanks!
@anilmurty, I think someone should test the torchbench SDL on the GPU testnet before we say it's usable. I believe it is currently usable, but we should test that assumption since the GPU testnet is up now. If we want Jupyter notebook usage, we should package a Jupyter notebook in the Docker container, or add a second container/SDL, to make it as easy as possible for people, with clear instructions for those who may not have used Jupyter before. I would also clarify in those instructions how you want them to export the notebook, if you're going to look at 20+ submissions.
Thanks @rakataprime - I'll test this out and confirm https://github.com/akash-network/awesome-akash/blob/e115932a1b8e0536649a2d88f3a614f097ad2c43/torchbench/torchbench_gpu_sdl.yaml (@chainzero - would be great if you did too).
Is this usable for the jupyter notebook? https://github.com/akash-network/awesome-akash/tree/master/jupyter
hey @rakataprime - I just tested it and unfortunately it doesn't work, because we have since added support for specifying some GPU attributes (vendor and model). Here are 3 examples of what the structure looks like: https://docs.akash.network/testnet/example-gpu-sdls
At a minimum, the SDL needs to be updated to include the "vendor" key as shown here https://docs.akash.network/testnet/example-gpu-sdls/specific-gpu-vendor
add:

```yaml
attributes:
  vendor:
    nvidia:
```
It still doesn't return bids (probably because there are no GPU providers on the network that meet the requirements yet), but at least the SDL is valid.
@anilmurty the latest commit adds Jupyter and an example notebook. It still needs to be tested on the testnet. Also, the Jupyter notebook implementation requires users to paste in the auth token from the logs to access it.
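To smooth that over in the instructions, a tiny helper could pull the token out of the deployment logs. This assumes Jupyter's standard `?token=...` URL line appears in the logs:

```python
# Sketch: extract the Jupyter auth token from deployment log text, so
# users don't have to eyeball the logs. Assumes the standard Jupyter
# startup line containing "?token=<hex>".
import re

def extract_token(log_text):
    """Return the first Jupyter auth token found in the logs, or None."""
    m = re.search(r"token=([0-9a-f]+)", log_text)
    return m.group(1) if m else None
```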