Comments (2)
Need to figure out shared file storage for all the learner pods (required by many distributed learning methods) and a way to store the model results for our users.
So for this shared file storage, would the PVC work satisfy it? Or do you explicitly need NFS under the covers?
from ffdl.
Many distributed learning methods require shared file storage to sync with the other workers. Currently all our workers mount the same input and result bucket, so that requirement is satisfied. However, with VCK, which pulls the data to a HostPath, each K8s node has its own path for the input and result directory. So we need to figure out a shared place to store the result files and any other files that need to be shared among all the workers.
With the PVC work, this could definitely be solved for the NFS use case, because the storage is mounted via a PV. However, for S3 or Pachyderm using VCK we still have the same issue, since VCK technically creates replicas in the HostPath for the files (which can come from multiple sources) that you want to cache.
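To make the NFS case above concrete, here is a minimal sketch of a ReadWriteMany PersistentVolume backed by NFS, plus the PersistentVolumeClaim that all learner pods would mount. All names, the server address, the export path, and the sizes are hypothetical placeholders, not values from FfDL itself:

```yaml
# Hypothetical NFS-backed PV; server and path are placeholders.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: learner-shared-pv
spec:
  capacity:
    storage: 20Gi
  accessModes:
    - ReadWriteMany        # every learner pod can read and write the same volume
  nfs:
    server: 10.0.0.10      # placeholder NFS server address
    path: /exports/results # placeholder export path
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: learner-shared-pvc
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 20Gi
```

Because every learner pod mounts the same claim, result files land on shared storage no matter which node a pod is scheduled on; with VCK's per-node HostPath replicas that property is lost, which is exactly the issue described above.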
Related Issues (20)
- FfDL v0.1.1 model training error HOT 4
- FfDL CLI output is not properly machine parsable
- [Documentation] Update IBM Cloud CLI instructions in /etc/converter/train-deploy-wml.md
- dind-port-forward.sh -> invalid resource name ? HOT 5
- Grafana charts shows no data points HOT 1
- Unable to mount volumes for pod Learner HOT 8
- Learner pod stuck at training step 100 using custom image with TF Object Detection HOT 5
- FfDL/demos/fashion-mnist-adversarial/README.md references internal repository HOT 1
- how to use pytorch and caffe built by ourselves? HOT 2
- kubectl get pods :lcm ContainerCreating,prometheus trainer and trainingdata STATUS CrashLoopBackOff HOT 26
- tiller-deploy is in status CrashLoopBackOff HOT 2
- Confused about manifest.yml HOT 2
- learner pod failed HOT 19
- caffe training speed is very slow HOT 4
- pytorch training issue: insufficient shared memory HOT 2
- distributed training questions HOT 2
- why pytorch distributed training on two servers is slower than training on one server HOT 21
- .travis.yml: The 'sudo' tag is now deprecated in Travis CI
- ssh permission denied when deploying FfDL on public cloud
- fail to install