@cloustone there's work going on to clean up that tight integration we have, and we should have something out relatively soon.
The thought process is that you can create a PVC, load all the training data into it, and in the manifest file provide a PVC reference id/name, similar to the way you provide S3 details in the manifest; the learner can then mount that PVC instead of the S3 storage and use the data.
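As a rough sketch, such a manifest stanza might look something like this (the `mount_pvc` type and field names here are hypothetical illustrations, not an actual FfDL schema):

```yaml
# Hypothetical manifest fragment: reference a pre-created PVC by name
# instead of S3 connection details. Field names are illustrative only.
data_stores:
  - id: training-data
    type: mount_pvc                # in place of an S3/object-store type
    connection:
      pvc_name: training-data-pvc  # PVC pre-loaded with the training data
    training_data:
      container: /data             # where the learner mounts the volume
```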
from ffdl.
@atinsood thanks for your reply. I just used dynamic external storage with NFS to deploy model training. It seems to work.
@cloustone would love to get more details about how you did this. We'd welcome a PR with a doc describing how to leverage NFS, following the steps you outlined above:
"The following steps are our adaptations for NFS:
- Deploy an external NFS server outside of Kubernetes.
- Add PV declarations in the templates folder.
- Add a PVCs file "/etc/static-volumes/PVCs.yaml" in the LCM Docker environment."
@cloustone "I just used dynamic external storage with NFS to deploy model training. It seems OK."
Curious how you got this going from a technical perspective :)
Thinking more about your initial suggestion: you could also have a configmap with a list of PVCs that you have created beforehand, mount it as a volume in LCM, and then LCM can just pick one PVC and allocate it to the training (basically change https://github.com/IBM/FfDL/blob/master/lcm/service/lcm/learner_deployment_helpers.go#L493 and add the volume mount).
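A minimal sketch of that allocation step, assuming the ConfigMap surfaces a plain list of pre-created PVC names to LCM (the function and the PVC names are hypothetical; the real change would live in the Go deployment helper linked above):

```python
# Sketch (hypothetical): pick an unallocated PVC from a list of
# pre-created PVC names supplied via a ConfigMap mounted into LCM.

def pick_pvc(available, allocated):
    """Return the first PVC name not yet allocated to a training job,
    or None if every PVC is in use."""
    for name in available:
        if name not in allocated:
            return name
    return None

# Example: two PVCs exist, one is already bound to a running training.
free = pick_pvc(["static-volume-1", "static-volume-2"], {"static-volume-1"})
# -> "static-volume-2"
```

The chosen name would then be wired into the learner deployment as a volume mount, and returned to the pool when the training job completes.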
I wonder if you went this route or a different one
@atinsood Yes, the method is almost the same as what you provided.
"Thinking more about your initial suggestion: you could also have a configmap with a list of PVCs that you have created beforehand, mount it as a volume in LCM, and then LCM can just pick one PVC and allocate it to the training."
@cloustone another interesting thing you can try is this: https://ai.intel.com/kubernetes-volume-controller-kvc-data-management-tailored-for-machine-learning-workloads-in-kubernetes/
https://github.com/IntelAI/vck
We have been looking into this as well, but this can help bring data down to your nodes running the GPUs and you'd end up accessing the data as you would access local data on those machines.
This is an interesting approach and should work well if you don't need isolation of training data for every training run.
@atinsood Thanks, we will try this method according to our requirements.
Hello, @FfDL
We deploy FfDL in a private environment in which S3 and Swift are not available; only NFS external storage is supported. For the model definition file, we can use localstack in the current dev environment; for training data, we wish to use NFS.
The following steps are our adaptations for NFS:
- Deploy an external NFS server outside of Kubernetes.
- Add PV declarations in the templates folder.
- Add a PVCs file "/etc/static-volumes/PVCs.yaml" in the LCM Docker environment.
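For reference, a statically provisioned NFS PV and its matching PVC might look like the following (the server address, export path, names, and sizes are all placeholders):

```yaml
# Illustrative NFS PersistentVolume and matching PersistentVolumeClaim;
# server address, path, names, and capacity are placeholders.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: static-volume-1
spec:
  capacity:
    storage: 20Gi
  accessModes:
    - ReadWriteMany
  nfs:
    server: 10.0.0.10            # external NFS server, outside the cluster
    path: /exports/training-data
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: static-volume-1
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""           # bind to the pre-created PV, not a dynamic class
  resources:
    requests:
      storage: 20Gi
```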
We are still validating the above method; however, a new question has come up.
If two models are submitted and both use NFS static external storage at the same mount point, is that a problem? Would you please confirm the above method and question, or provide the right solution to us.
Thanks
@cloustone Can you please tell me in detail how to use NFS? I also want to use NFS but I do not know how. Which files did you change, and how? Thank you very much.
"@cloustone another interesting thing you can try is this: https://ai.intel.com/kubernetes-volume-controller-kvc-data-management-tailored-for-machine-learning-workloads-in-kubernetes/
https://github.com/IntelAI/vck
We have been looking into this as well, but this can help bring data down to your nodes running the GPUs and you'd end up accessing the data as you would access local data on those machines.
This is an interesting approach and should work well if you don't need isolation of training data for every training run."
@atinsood Have you added this method to FfDL? Or do you have documentation on how to use this method in FfDL? Thank you very much.
@Tomcli @fplk did you try the Intel vck approach with FfDL?
@atinsood @Eric-Zhang1990 No, we do not currently have vck integration in FfDL.
@cloustone said:
and you'd end up accessing the data as you would access local data on those machines.
Which I think just implies a host mount, and host mounts are, I believe, enabled in the current FfDL. So you could give that a try.
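For illustration, a host mount is just a `hostPath` volume in the pod spec; a fragment might look like this (the path and names are placeholders, and the data must already be present on the node, e.g. synced there by vck):

```yaml
# Illustrative hostPath volume for a learner pod; path and names are
# placeholders. The node itself must already hold the training data.
volumes:
  - name: local-training-data
    hostPath:
      path: /var/lib/training-data
      type: Directory
containers:
  - name: learner
    volumeMounts:
      - name: local-training-data
        mountPath: /data
```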
@cloustone said:
"Thinking more about your initial suggestion: you could also have a configmap with a list of PVCs that you have created beforehand, mount it as a volume in LCM, and then LCM can just pick one PVC and allocate it to the training."
We do have an internal PR that enables the use of generic PVCs for training and result volumes. I don't think we need a configmap; the idea is that PVC allocation is done by some other process, and then we just point to the training data and result data volumes by name in the manifest.
Perhaps we can go ahead and externalize this in the next few days, at least on a branch, and you could give it a try. Let me see what I can do.
@sboagibm Thank you for your kind reply. You say "then we just point to the training data and result data volumes by name, in the manifest" — can you give me an example of a manifest file using a local path on the host?
I found a file at https://github.com/IBM/FfDL/blob/vck-patch/etc/examples/vck-integration.md. Is that manifest file what you mean? If it is, can I add multiple learners in it?
Thank you very much.
@cloustone @atinsood @sboagibm How can we use NFS to store data and start training jobs? Can you provide more detailed docs for us?
Thanks.