Comments (14)
If you submit a TonY app to a secured cluster, the submitting machine must be authenticated, which means a keytab or principal must be provided.
I think you could use this machine to submit a Spark app as a test. If that works, the TonY app can also be submitted to the cluster.
from tony.
Thanks for your reply. The cluster is Hadoop 3.2.2 with Kerberos, and the Spark example ran successfully. I tried the mnist-tensorflow example according to the guide, https://github.com/tony-framework/TonY/tree/master/tony-examples/mnist-tensorflow, but it failed. Do I need any other settings or configuration for this task?
from tony.
Please attach the detailed error log, the submit CLI command args, tony.xml, and so on.
from tony.
CLI command:
#!/usr/bin/env bash
# --executes is a relative path inside src/.
# --python_binary_path is a relative path inside venv.zip.
# You can use your own HDFS paths in --task_params.
java -cp `hadoop classpath`:/data/tony-dist/tony-cli-0.5.3-uber.jar com.linkedin.tony.cli.ClusterSubmitter \
  --python_venv=/data/venv/myvenv.zip \
  --src_dir=/data/tony-dist/mnist-tensorflow \
  --executes=mnist_distributed.py \
  --task_params="--steps 1000 --data_dir /user/test/tony/data --working_dir /user/test/tony/model" \
  --conf_file=/data/tony-dist/tony.xml \
  --python_binary_path=venv/bin/python
The error log is below:
AM Container for appattempt_1657011602166_1367_000002 exited with exitCode: 1
Failing this attempt.Diagnostics: [2022-08-03 13:41:09.319]Exception from container-launch.
Container id: container_e94_1657011602166_1367_02_000001
Exit code: 1
Exception message: Launch container failed
Shell output: main : command provided 1
main : run as user is test
main : requested yarn user is test
Getting exit code file...
Creating script paths...
Writing pid file...
Writing to tmp file /data1/yarn/nm/nmPrivate/application_1657011602166_1367/container_e94_1657011602166_1367_02_000001/container_e94_1657011602166_1367_02_000001.pid.tmp
Writing to cgroup task files...
Creating local dirs...
Launching container...
[2022-08-03 13:41:09.321]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of amstderr.log :
Error: A JNI error has occurred, please check your installation and try again
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataOutputStream
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
at java.lang.Class.privateGetMethodRecursive(Class.java:3048)
at java.lang.Class.getMethod0(Class.java:3018)
at java.lang.Class.getMethod(Class.java:1784)
at sun.launcher.LauncherHelper.validateMainClass(LauncherHelper.java:544)
at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:526)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.FSDataOutputStream
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 7 more
from tony.
Is this the same problem as #672?
It looks like the NodeManager machine doesn't have the complete Hadoop environment.
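One way to test this guess on a NodeManager host is to look for the missing class in the jars on the Hadoop classpath. The sketch below is only an illustration: the helper names `expand_classpath` and `jars_containing` are made up for this example and are not part of TonY or Hadoop, and it assumes a ':'-separated classpath such as the output of `hadoop classpath`.

```python
import zipfile
from pathlib import Path

# Class named in the NoClassDefFoundError above, as a jar entry path.
TARGET = "org/apache/hadoop/fs/FSDataOutputStream.class"


def expand_classpath(classpath):
    """Expand a ':'-separated classpath into jar paths.

    Entries ending in '/*' are expanded to the *.jar files in that directory;
    plain directories and other entries are ignored in this sketch.
    """
    jars = []
    for entry in classpath.split(":"):
        if entry.endswith("/*"):
            jars.extend(sorted(Path(entry[:-2]).glob("*.jar")))
        elif entry.endswith(".jar"):
            jars.append(Path(entry))
    return jars


def jars_containing(class_entry, jar_paths):
    """Return the jars (as strings) whose archive contains class_entry."""
    found = []
    for jar in jar_paths:
        path = Path(jar)
        if not path.is_file() or not zipfile.is_zipfile(path):
            continue  # skip missing files and anything that is not a zip/jar
        with zipfile.ZipFile(path) as archive:
            if class_entry in archive.namelist():
                found.append(str(path))
    return found


# Hypothetical usage on the NodeManager host, with the classpath pasted in:
#   hits = jars_containing(TARGET, expand_classpath("<output of hadoop classpath>"))
#   print(hits or "class not found on this classpath")
```

If the result is empty on the NodeManager but not on the submitting machine, that would be consistent with an incomplete Hadoop installation on the NodeManager.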
from tony.
Got it. I have updated the Hadoop environment, and now it reports a Python error, shown below.
The error: ModuleNotFoundError: No module named 'contextlib'
from tony.
You should package your pyenv zip on a Linux machine running the same system as the NM. @tonywang-sh
from tony.
My pyenv was packaged on an Ubuntu 18.04 system with Anaconda, following the guide https://github.com/tony-framework/TonY/tree/master/tony-examples/mnist-tensorflow. Do you have another guide on setting up an environment that matches the NM system for packaging this pyenv zip? Thanks.
from tony.
Conda is also OK. If you want to check whether the env is OK, you could launch it on your local machine.
from tony.
I used Anaconda to package a virtualenv Python and obtained a pyenv zip, but this zip does not work on the worker nodes. Is this the right method?
from tony.
Can this pyenv be used on your local machine? You'd better pre-check it.
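A pre-check could look like the sketch below, run with the packaged interpreter (for example the `venv/bin/python` from the unzipped archive) first locally and then on a worker node. It is a minimal illustration, not something TonY provides: the function names and the default module list are assumptions. It imports only the built-in `sys` module at the top so that it can still report something when parts of the standard library cannot be found.

```python
import sys


def missing_modules(names):
    """Return the module names that this interpreter fails to import."""
    missing = []
    for name in names:
        try:
            __import__(name)
        except ImportError:  # ModuleNotFoundError is a subclass of ImportError
            missing.append(name)
    return missing


def interpreter_report(names=("contextlib", "os", "encodings")):
    """Collect where this interpreter lives and which modules it cannot load."""
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,            # environment root
        "base_prefix": sys.base_prefix,  # installation the environment was built from
        "version": sys.version,
        "missing": missing_modules(names),
    }


# Print the report when the file is run directly with the interpreter under test.
for key, value in interpreter_report().items():
    print(f"{key}: {value}")
```

Comparing the output from the local machine and from a worker node shows whether both resolve the same prefixes and whether `contextlib` is importable in each place.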
from tony.
It worked on the local machine using the "venv/bin/python" command line, but failed on the remote worker node when submitting the task with the TonY script.
from tony.
I guess this is caused by your local machine's env not being consistent with the NodeManager's.
from tony.
If the pyenv is packaged by virtualenv or Anaconda, does this Python environment need to be activated on the worker node, for example with the command 'venv/bin/activate', before the task starts on the worker? I didn't find this "activate" operation in the TonY project.
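On the activation question: an `activate` script mainly adjusts PATH and sets VIRTUAL_ENV, so calling the environment's `bin/python` directly normally uses the environment without activating it. A separate thing worth checking, offered here only as a possible cause and not confirmed anywhere in this thread, is that an environment created with `python -m venv` or virtualenv records the location of its base interpreter in `pyvenv.cfg` and loads the standard library from there. If that path does not exist on the worker, standard-library imports can fail. The helper names below are made up for this sketch, and a conda environment usually has no `pyvenv.cfg` at all.

```python
from pathlib import Path


def read_pyvenv_home(venv_dir):
    """Return the 'home' value from <venv_dir>/pyvenv.cfg, or None if absent."""
    cfg = Path(venv_dir) / "pyvenv.cfg"
    if not cfg.is_file():
        return None  # e.g. a conda environment, which has no pyvenv.cfg
    for line in cfg.read_text().splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip().lower() == "home":
            return value.strip()
    return None


def base_interpreter_present(venv_dir):
    """True if the base interpreter directory recorded by the venv exists here."""
    home = read_pyvenv_home(venv_dir)
    return home is not None and Path(home).is_dir()


# Hypothetical usage on a worker node, after unzipping the archive to ./venv:
#   print(read_pyvenv_home("venv"), base_interpreter_present("venv"))
```

If `base_interpreter_present` returns False on the worker but True on the machine where the zip was built, the environment depends on a Python installation that the worker does not have.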
from tony.