qihoo360 / xlearning

AI on Hadoop

License: Apache License 2.0

Shell 1.02% Java 98.98%

hadoop tensorflow caffe mxnet ai deeplearning machinelearning yarn

xlearning's Introduction

XLearning is a convenient and efficient scheduling platform that combines big data and artificial intelligence and supports a variety of machine learning and deep learning frameworks. XLearning runs on Hadoop YARN and integrates deep learning frameworks such as TensorFlow, MXNet, Caffe, Theano, PyTorch, Keras, and XGBoost. It offers good scalability and compatibility.

Chinese documentation (中文文档)

Architecture

There are three essential components in XLearning:

Client: starts the application and retrieves its state.
ApplicationMaster (AM): handles internal scheduling and lifecycle management, including input data distribution and container management.
Container: the actual executor of the application. It starts the Worker or PS (Parameter Server) process, monitors the process and reports its status to the AM, and saves the output. For TensorFlow applications it can also start the TensorBoard service.

Functions

1 Support Multiple Deep Learning Frameworks

Besides the distributed mode of the TensorFlow and MXNet frameworks, XLearning supports the standalone mode of all deep learning frameworks, such as Caffe, Theano, and PyTorch. XLearning also allows custom framework versions and multiple versions of a framework side by side.

2 Unified Data Management Based On HDFS

XLearning lets you specify the input strategy for the input data (--input) by setting the --input-strategy parameter or the xlearning.input.strategy configuration. XLearning supports three ways to read HDFS input data:

Download: The AM traverses all files under the specified HDFS path and distributes the data to workers file by file. Each worker downloads its files from HDFS to the local disk.
Placeholder: Unlike Download mode, the AM only sends the list of relevant HDFS files to the workers. The process in each worker reads the data from HDFS directly.
InputFormat: XLearning integrates the InputFormat function of MapReduce and lets the user specify any InputFormat implementation for the input data. The AM splits the input data and assigns the splits to different workers. Each worker passes its assigned splits to the execution process through a pipe.
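
The Download strategy can be pictured with a small sketch. The round-robin assignment, the function name assign_files, and the part-N file names are assumptions made for illustration; the README does not say how the AM actually balances files across workers.

```python
from typing import Dict, List


def assign_files(files: List[str], worker_num: int) -> Dict[int, List[str]]:
    """Distribute whole files across workers in round-robin order.

    Hypothetical helper: it only illustrates file-level distribution,
    it is not XLearning's actual assignment code.
    """
    if worker_num <= 0:
        raise ValueError("worker_num must be positive")
    assignment: Dict[int, List[str]] = {i: [] for i in range(worker_num)}
    for position, path in enumerate(sorted(files)):
        assignment[position % worker_num].append(path)
    return assignment


if __name__ == "__main__":
    # Made-up file names under the README's example input path.
    hdfs_files = [
        "/tmp/data/tensorflow/part-0",
        "/tmp/data/tensorflow/part-1",
        "/tmp/data/tensorflow/part-2",
    ]
    for worker, paths in assign_files(hdfs_files, 2).items():
        print(worker, paths)
```

With file-level distribution, a job that has fewer input files than workers leaves some workers with an empty list, which is why the number of input files matters in this mode.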

As with the input strategy, XLearning lets you specify the output strategy for the output data (--output) by setting the --output-strategy parameter or the xlearning.output.strategy configuration. There are two output modes:

Upload: After the program finishes, each worker uploads its local output directory to the specified HDFS path directly. The "Save Model" button on the web interface lets the user upload intermediate results to HDFS during execution.
OutputFormat: XLearning integrates the OutputFormat function of MapReduce and lets the user specify any OutputFormat implementation for saving the results to HDFS.

For more details, see data management.

3 Visualization Display

The application interface can be divided into four parts:

All Containers: displays the container list and corresponding information, including the container host, container role, current state, start time, finish time, and current progress.
View TensorBoard: if the TensorBoard service was started for a TensorFlow application, provides a link to TensorBoard for real-time viewing.
Save Model: if the application has output, the user can upload the intermediate output to the specified HDFS path during execution with the "Save Model" button. After the upload finishes, the list of saved intermediate paths is displayed.
Worker Metrics: displays the resource usage metrics of each worker.

4 Compatible With The Code At Native Frameworks

Apart from the automatic construction of the ClusterSpec in the distributed mode of TensorFlow, programs written for standalone TensorFlow and for other deep learning frameworks can be executed on XLearning directly.
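
To make the ClusterSpec point concrete, here is a minimal sketch of how a distributed script might read a cluster definition prepared by the platform. The environment variable names TF_CLUSTER_DEF, TF_ROLE, and TF_INDEX and the JSON layout are assumptions for illustration, not taken from this README; check the XLearning examples for the names actually used.

```python
import json
import os
from typing import Dict, List, Tuple


def read_cluster(env: Dict[str, str]) -> Tuple[Dict[str, List[str]], str, int]:
    """Parse the cluster layout, this task's role, and its index.

    The variable names below are assumed, not taken from the README.
    """
    cluster = json.loads(env["TF_CLUSTER_DEF"])
    role = env["TF_ROLE"]
    index = int(env["TF_INDEX"])
    if role not in cluster:
        raise KeyError(f"role {role!r} not present in cluster definition")
    return cluster, role, index


if __name__ == "__main__":
    # Fall back to a made-up layout when the variables are not set.
    sample = {
        "TF_CLUSTER_DEF": json.dumps(
            {"ps": ["host1:2222"], "worker": ["host2:2222", "host3:2222"]}
        ),
        "TF_ROLE": "worker",
        "TF_INDEX": "1",
    }
    env = dict(os.environ) if "TF_CLUSTER_DEF" in os.environ else sample
    cluster, role, index = read_cluster(env)
    print(role, index, cluster[role][index])
```

In a TensorFlow 1.x program, the parsed dictionary would typically be passed to tf.train.ClusterSpec, and the role and index to tf.train.Server.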

Compilation & Deployment Instructions

1 Compilation Environment Requirements

JDK >= 1.7
Maven >= 3.3

2 Compilation Method

Run the following command in the root directory of the source code:

mvn package

After compilation, a distribution package named xlearning-1.1-dist.tar.gz is generated under target in the root directory.
Unpacking the distribution package produces the following subdirectories:

bin: scripts for submitting applications
lib: jars for XLearning and dependencies
conf: configuration files
sbin: scripts for history service
data: data and files for examples
examples: XLearning examples

3 Deployment Environment Requirements

CentOS 7.2
Java >= 1.7
Hadoop = 2.6, 2.7, 2.8
[Optional] The runtime dependencies of the deep learning frameworks on the cluster nodes, such as TensorFlow, numpy, and Caffe.

4 XLearning Client Deployment Guide

In the "conf" directory of the unpacked distribution package ("$XLEARNING_HOME"), configure the following files:

xlearning-env.sh: set the environment variables, such as:
- JAVA_HOME
- HADOOP_CONF_DIR
xlearning-site.xml: configure related properties. Note that the properties associated with the history service need to be consistent with those configured when the history service was started. For more details, please see the Configuration part.
log4j.properties: configure the log level

5 Start Method of XLearning History Service [Optional]

Run $XLEARNING_HOME/sbin/start-history-server.sh.

Quick Start

In the XLearning client, use $XLEARNING_HOME/bin/xl-submit to submit the application to the cluster.
Here is a submit example for a TensorFlow application.

1 Upload data to HDFS

Upload the "data" directory under the root of the unpacked distribution package to HDFS:

cd $XLEARNING_HOME  
hadoop fs -put data /tmp/

2 Submit

cd $XLEARNING_HOME/examples/tensorflow
$XLEARNING_HOME/bin/xl-submit \
   --app-type "tensorflow" \
   --app-name "tf-demo" \
   --input /tmp/data/tensorflow#data \
   --output /tmp/tensorflow_model#model \
   --files demo.py,dataDeal.py \
   --launch-cmd "python demo.py --data_path=./data --save_path=./model --log_dir=./eventLog --training_epochs=10" \
   --worker-memory 10G \
   --worker-num 2 \
   --worker-cores 3 \
   --ps-memory 1G \
   --ps-num 1 \
   --ps-cores 2 \
   --queue default

The parameters have the following meanings:

Property Name	Meaning
app-name	application name as "tf-demo"
app-type	application type as "tensorflow"
input	input files; HDFS path "/tmp/data/tensorflow" is mapped to local dir "./data"
output	output files; HDFS path "/tmp/tensorflow_model" is mapped to local dir "./model"
files	application program and required local files, including demo.py and dataDeal.py
launch-cmd	command to execute
worker-memory	amount of memory to use for the worker process is 10GB
worker-num	number of worker containers to use for the application is 2
worker-cores	number of cores to use for the worker process is 3
ps-memory	amount of memory to use for the ps process is 1GB
ps-num	number of ps containers to use for the application is 1
ps-cores	number of cores to use for the ps process is 2
queue	the queue the application is submitted to
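
The input and output values in the table pair an HDFS path with a local directory, separated by #. A minimal sketch of splitting such a value is shown here. split_path_spec is a hypothetical helper, and returning None when no # is present is a choice made for this sketch, not documented XLearning behavior.

```python
from typing import Optional, Tuple


def split_path_spec(spec: str) -> Tuple[str, Optional[str]]:
    """Split "hdfs_path#local_dir" into its two parts.

    Returns None for the local directory when no "#" is present;
    how XLearning treats that case is not covered here.
    """
    hdfs_path, sep, local_dir = spec.partition("#")
    if not hdfs_path:
        raise ValueError("missing HDFS path in spec: %r" % spec)
    return hdfs_path, (local_dir if sep and local_dir else None)


if __name__ == "__main__":
    print(split_path_spec("/tmp/data/tensorflow#data"))
    print(split_path_spec("/tmp/tensorflow_model#model"))
```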

For more details, see the Submit Parameter part.

FAQ

XLearning FAQ

Authors

XLearning is designed, authored, reviewed and tested by the following team on GitHub:

@Yuance Li, @Wen OuYang, @Runying Jia, @YuHan Jia, @Lei Wang

Contact us

Mail: [email protected]
QQ group: 588356340

xlearning's Issues

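Creating worker and ps containers fails after job submission (ln: command not found)

Hadoop version: 3.1.0
XL version: xlearning-gpu-beta

After the XL AM starts, it notifies the NodeManager to run launch_container.sh to create the containers for the worker and ps. Running launch_container.sh produces the following error:

PS: If the job is not submitted through XL, for example a plain MR job (wordcount), the containers are created without any problem.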

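When running the TensorFlow demo, the workers finish training but never exit and stay stuck

A question:
With XLearning 1.1, running the TensorFlow demo, the log shows that all workers have finished training, but only the container with task_index = 0 has its state updated to success. The other containers stay in running and the log shows no further output. Why?

Also, no matter what worker-num is set to, are all workers started on the same machine?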

Demo fails to run

Following the README, an error occurs when running sh run.sh:

Error message:
Error: A JNI error has occurred, please check your installation and try again
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/mapred/JobConf
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
at java.lang.Class.privateGetMethodRecursive(Class.java:3048)
at java.lang.Class.getMethod0(Class.java:3018)
at java.lang.Class.getMethod(Class.java:1784)
at sun.launcher.LauncherHelper.validateMainClass(LauncherHelper.java:544)
at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:526)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.mapred.JobConf
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 7 more

Has anyone run into the same problem, and is there a solution?

ERROR Client: Application run failed!

Hi, I pulled the latest code and first added "--queue default" at the end of "run.sh". The run output is:
17/12/13 06:41:52 INFO Client: Copying /opt/XLearning/target/xlearning-1.0/lib/xlearning-1.0-hadoop2.7.3.jar to remote path hdfs://test-2:8020/tmp/XLearning/staging/application_1511938500942_0009/AppMaster.jar
17/12/13 06:41:52 INFO Client: Building environments for the application master
17/12/13 06:41:52 INFO Client: Copy xlearning files from local filesystem to remote.
17/12/13 06:41:52 INFO Client: Copying demo.py to remote path hdfs://test-2:8020/tmp/XLearning/staging/application_1511938500942_0009/demo.py
17/12/13 06:41:52 INFO Client: Copying dataDeal.py to remote path hdfs://test-2:8020/tmp/XLearning/staging/application_1511938500942_0009/dataDeal.py
17/12/13 06:41:52 INFO Client: Building application master launch command
17/12/13 06:41:52 INFO Client: Application master launch command: ${JAVA_HOME}/bin/java -Xms1024m -Xmx1024m net.qihoo.xlearning.AM.ApplicationMaster 1><LOG_DIR>/stdout 2><LOG_DIR>/stderr
17/12/13 06:41:52 INFO Client: Submitting application to ResourceManager
17/12/13 06:41:53 INFO YarnClientImpl: Submitted application application_1511938500942_0009
17/12/13 06:41:53 INFO Client: Application submitAndMonitor succeed
17/12/13 06:41:53 INFO Client: The url to track the job: http://test-2:8088/proxy/application_1511938500942_0009/
17/12/13 06:41:53 INFO Client: Application report for application_1511938500942_0009 (state: ACCEPTED)
17/12/13 06:41:54 INFO Client: Application report for application_1511938500942_0009 (state: ACCEPTED)
17/12/13 06:41:55 INFO Client: Application report for application_1511938500942_0009 (state: ACCEPTED)
17/12/13 06:41:56 INFO Client: Application report for application_1511938500942_0009 (state: ACCEPTED)
17/12/13 06:41:57 INFO Client: Application report for application_1511938500942_0009 (state: FAILED)
17/12/13 06:41:57 INFO Client: Application has completed with YarnApplicationState=FAILED and FinalApplicationStatus=FAILED
17/12/13 06:41:57 ERROR Client: Application run failed!

I checked the logs under $XLEARNING_HOME/logs. The error is "Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /hadoop/mapreduce/jhs/mr-jhs-state/LOCK: Resource temporarily unavailable", which is an IO error.
More information follows:

17/12/13 06:56:03 INFO MetricsSystemImpl: Stopping JobHistoryServer metrics system...
17/12/13 06:56:03 INFO MetricsSystemImpl: JobHistoryServer metrics system stopped.
17/12/13 06:56:03 INFO MetricsSystemImpl: JobHistoryServer metrics system shutdown complete.
17/12/13 06:56:03 FATAL JobHistoryServer: Error starting JobHistoryServer
org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /hadoop/mapreduce/jhs/mr-jhs-state/LOCK: Resource temporarily unavailable
at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204)
at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
at net.qihoo.xlearning.jobhistory.JobHistoryServer.serviceStart(JobHistoryServer.java:218)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at net.qihoo.xlearning.jobhistory.JobHistoryServer.launchJobHistoryServer(JobHistoryServer.java:250)
at net.qihoo.xlearning.jobhistory.JobHistoryServer.main(JobHistoryServer.java:259)
Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /hadoop/mapreduce/jhs/mr-jhs-state/LOCK: Resource temporarily unavailable
at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
at org.apache.hadoop.mapreduce.v2.hs.HistoryServerLeveldbStateStoreService.startStorage(HistoryServerLeveldbStateStoreService.java:82)
at org.apache.hadoop.mapreduce.v2.hs.HistoryServerStateStoreService.serviceStart(HistoryServerStateStoreService.java:79)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
... 5 more
17/12/13 06:56:03 INFO ExitUtil: Exiting with status -1
17/12/13 06:56:03 INFO JobHistoryServer: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down JobHistoryServer at test-2/172.16.12.46
************************************************************/
Do you have any ideas about this? How can I solve it? Thank you~

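Custom job: several scripts were submitted through --files in xl-submit, but the workers still report that a required script cannot be found

The submit command already includes the conv3d_utils.py file.

But the job fails after submission, and the log of one worker shows that conv3d_utils.py is missing.

The file was clearly submitted, so what is the problem?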

The branch if mode == tf.estimator.ModeKeys.PREDICT: in model_fn() is not supported

model = tf.estimator.Estimator(...)
ps=model.predict(...)

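XLearning seems to skip ps=model.predict(...) without executing it and reports success directly.

After adding a print call to the tf.estimator.ModeKeys.PREDICT branch of model_fn(), the printed content does not appear in the log.
With model.eval(...), the print output inside model_fn() does appear.
Judging from this, model.predict(...) is indeed being ignored.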
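
One possible explanation, not confirmed in this thread: tf.estimator.Estimator.predict yields its results lazily, so model_fn is not called until the returned object is iterated. The sketch below uses a plain Python generator (fake_predict is a stand-in, not TensorFlow code) to show the same effect.

```python
from typing import Iterator, List

calls: List[str] = []


def fake_predict(inputs: List[int]) -> Iterator[int]:
    """Stand-in for a lazy predict(): the body runs only when iterated."""
    calls.append("model_fn called")  # plays the role of print() in model_fn
    for value in inputs:
        yield value * 2


if __name__ == "__main__":
    ps = fake_predict([1, 2, 3])
    print(calls)  # the generator body has not started yet
    results = list(ps)  # iterating is what triggers the body
    print(calls, results)
```

If this is the cause, looping over the result of model.predict(...) or wrapping it in list(...) would make the PREDICT branch run.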

Running the TensorFlow demo in an Anaconda environment

On a CentOS server, Anaconda is configured as the Python environment.
Running run.sh reports "no such file or directory". What should I do?

Error: Could not find or load main class net.qihoo.xlearning.AM.ApplicationMaster

Application application_1541471478754_0001 failed 2 times due to AM Container for appattempt_1541471478754_0001_000002 exited with exitCode: 1
Failing this attempt.Diagnostics: [2018-11-06 10:32:36.870]Exception from container-launch.
Container id: container_1541471478754_0001_02_000001
Exit code: 1
[2018-11-06 10:32:36.878]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
Error: Could not find or load main class net.qihoo.xlearning.AM.ApplicationMaster
[2018-11-06 10:32:36.879]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
Error: Could not find or load main class net.qihoo.xlearning.AM.ApplicationMaster
For more detailed output, check the application tracking page: http://why-System-Product-Name:10086/cluster/app/application_1541471478754_0001 Then click on links to logs of each attempt.
. Failing the application.

Hi all, I get this error when running run.sh of xlearning-gpu-1.3. How can I fix it?
Environment:
ubuntu16
hadoop3.1.1
xlearning-gpu-1.3

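Is there a programming interface for submitting jobs?

The documentation submits jobs with the xl-submit command line. Is there a Python interface that can be used to submit jobs?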
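
No Python API is mentioned in the README. As a workaround sketch, a script can assemble the same xl-submit command line shown in Quick Start and run it with subprocess. build_submit_command is a hypothetical helper, only flags that appear in the README are used, and "/path/to/xlearning" is a placeholder for a real $XLEARNING_HOME.

```python
import os
import subprocess
from typing import Dict, List


def build_submit_command(xlearning_home: str, options: Dict[str, str]) -> List[str]:
    """Turn {"app-type": "tensorflow", ...} into an xl-submit argument list."""
    command = [os.path.join(xlearning_home, "bin", "xl-submit")]
    for name, value in options.items():
        command.extend(["--" + name, str(value)])
    return command


if __name__ == "__main__":
    cmd = build_submit_command(
        os.environ.get("XLEARNING_HOME", "/path/to/xlearning"),
        {
            "app-type": "tensorflow",
            "app-name": "tf-demo",
            "worker-num": "2",
            "ps-num": "1",
            "queue": "default",
        },
    )
    print(" ".join(cmd))
    # Uncomment on a machine with an XLearning client installed:
    # subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with values such as the launch-cmd string.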

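How do I install it on macOS? I would really like to try this framework

Requesting help

Keeps waiting when connecting to the RM

After running the demo under tensorflow (run.sh), the following problem occurs: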
17/12/06 11:55:26 INFO RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
17/12/06 11:55:27 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
17/12/06 11:55:28 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
17/12/06 11:55:29 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
17/12/06 11:55:30 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
17/12/06 11:55:31 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
17/12/06 11:55:32 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
17/12/06 11:55:33 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
17/12/06 11:55:34 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
17/12/06 11:55:35 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
17/12/06 11:55:36 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
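
The address 0.0.0.0:8032 in the log is Hadoop's default ResourceManager address, which usually means the client did not pick up a yarn-site.xml that sets it (for example because HADOOP_CONF_DIR does not point at the cluster configuration). This is a likely cause, not a confirmed diagnosis. The sketch below, with a hypothetical helper name and a made-up host name, reads the relevant properties from yarn-site.xml text so the setting can be checked.

```python
import xml.etree.ElementTree as ET
from typing import Dict

RM_KEYS = ("yarn.resourcemanager.address", "yarn.resourcemanager.hostname")


def resource_manager_settings(yarn_site_xml: str) -> Dict[str, str]:
    """Return the ResourceManager properties found in yarn-site.xml text."""
    root = ET.fromstring(yarn_site_xml)
    found: Dict[str, str] = {}
    for prop in root.findall("property"):
        name = (prop.findtext("name") or "").strip()
        if name in RM_KEYS:
            found[name] = (prop.findtext("value") or "").strip()
    return found


if __name__ == "__main__":
    # Made-up configuration used only to exercise the helper.
    sample = """<configuration>
      <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>rm-host.example.com</value>
      </property>
    </configuration>"""
    print(resource_manager_settings(sample) or "no ResourceManager address configured")
```

An empty result for the file under HADOOP_CONF_DIR would be consistent with the client falling back to 0.0.0.0:8032.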

sofa

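Starting a distributed TensorFlow estimator reports TrainStatus:false

Framework: tensorflow
Environment: GPU cluster, 6 P100 cards
xlearning
The code already runs locally, but it reports an error on XLearning.
I hope someone who understands this can help.

train_and_evaluate in distributed mode

A question: in distributed mode, train_and_evaluate cannot trigger evaluate. The TensorFlow documentation says that an evaluator node must be started and that this node does not belong to the training cluster. How should this be handled in XLearning?

The only stop condition of train_and_evaluate is max_step. Is there a good way to stop early based on the validation set, to prevent overfitting?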
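
A framework-independent way to think about the second question is patience-based early stopping: stop once the validation loss has not improved for a given number of evaluations. The sketch below is plain Python with hypothetical names. It is not a train_and_evaluate or XLearning API, and wiring it into a distributed job is left open.

```python
from typing import Iterable, Optional


def early_stop_step(val_losses: Iterable[float], patience: int = 3,
                    min_delta: float = 0.0) -> Optional[int]:
    """Return the index of the evaluation at which training would stop.

    Stops after `patience` consecutive evaluations without an improvement
    of more than `min_delta`; returns None if that never happens.
    """
    best = float("inf")
    stale = 0
    for step, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return step
    return None


if __name__ == "__main__":
    # Made-up validation losses: the best value is reached at index 1.
    print(early_stop_step([1.0, 0.8, 0.9, 0.85, 0.81], patience=3))
```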

The following error occurs when running the demo?

Application application_1510661908139_1155931 failed 3 times due to AM Container for appattempt_1510661908139_1155931_000003 exited with exitCode: -1000
For more detailed output, check application tracking page:http://xxx:8088/cluster/app/application_1510661908139_1155931Then, click on links to logs of each attempt.
Diagnostics: Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]; Host Details : local host is: "hadoopxxx"; destination host is: "xxx":8020;
java.io.IOException: Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]; Host Details : local host is: "xxx"; destination host is: "xxx":8020;
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
at org.apache.hadoop.ipc.Client.call(Client.java:1479)
at org.apache.hadoop.ipc.Client.call(Client.java:1412)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy11.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:771)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy12.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2108)
at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1305)
at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1301)
at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253)
at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:63)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:361)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:359)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]
at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:687)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:650)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:737)
at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1528)
at org.apache.hadoop.ipc.Client.call(Client.java:1451)

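How to get the directory of the model files last saved by TensorFlow

After the TensorFlow demo finishes, the model is stored under one of the container directories on HDFS. Other than enumerating every directory on HDFS, is there a way to get the model data?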

INFO Client: reporter progress:100.00%

Then it stays stuck at INFO Client: reporter progress:100.00%

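Question about distinguishing standalone and distributed TensorFlow

The FAQ says that for TensorFlow, standalone and distributed modes are distinguished by the number of PS. So if PS=1 and Worker=2, is that standalone or distributed?

Cannot connect to an external database

The application I developed needs to connect to an external database to fetch some information, but the connection fails:
pymysql.err.OperationalError: (2003, "Can't connect to MySQL server on '***' (timed out)")
Without XLearning, running directly on a single machine, the database can be reached.
Does XLearning not allow access to external links during execution?

TF server fails to start because of a port conflict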

reservedSocket.bind(new InetSocketAddress("127.0.0.1", 0));
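When xlcontainer requests a port it uses "127.0.0.1", but many services bind ports on the real IP (for example 192.168.2.2). This causes a problem: if a service is already bound to 192.168.2.2:12345, xlcontainer can still get 12345 as an available port and pass it to TF to start the server, which results in a port-in-use error.
We changed the way xlcontainer obtains an available port so that it requests the port with the real IP. It has been stable in production and we have not run into similar problems since.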
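
The idea behind that change can be sketched as follows. This is an illustration in Python rather than the Java code inside xlcontainer, and reserve_port is a hypothetical helper: it binds to port 0 on a caller-supplied address so the operating system picks a port that is free on that specific address.

```python
import socket
from typing import Tuple


def reserve_port(host: str) -> Tuple[socket.socket, int]:
    """Bind to an OS-chosen free port on `host` and keep the socket open.

    Pass the address the service will really bind to (for example the
    node's real IP) instead of always using 127.0.0.1.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, 0))  # port 0 asks the OS for any free port
    except OSError:
        sock.close()
        raise
    return sock, sock.getsockname()[1]


if __name__ == "__main__":
    # 127.0.0.1 is used here only so the example runs anywhere.
    reserved, port = reserve_port("127.0.0.1")
    print("reserved port", port)
    reserved.close()
```

The socket is returned along with the port so the caller decides when to release the reservation.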

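Does the command line support multiple inputs and outputs?

Can the xl-submit command line support multiple inputs and multiple outputs?

Is Hadoop version 3.1.1 supported?

About distributed TensorFlow performance

Could you share benchmark data for distributed TensorFlow performance under this architecture?

Custom estimator: the job cannot run

I wrote a model_fn following the official custom estimator guide, but it cannot run on XLearning. Has anyone run a custom estimator successfully?

The official example is: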
https://www.tensorflow.org/get_started/custom_estimators

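pom file is missing the dependency com.google.code.gson

Hadoop versions earlier than 2.7 require adding the following dependency:

<dependency>
  <groupId>com.google.code.gson</groupId>
  <artifactId>gson</artifactId>
  <version>2.2.4</version>
</dependency>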

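Does XLearning support secure clusters?

Starting the XLearning HistoryServer on a Kerberos-enabled cluster fails with an error that the keytab file cannot be found. Does XL currently support secure clusters?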

how to use it on mac

How can it be used on a Mac?

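TensorFlow job fails to submit after changing the worker num

For the demo job, I changed the number of workers on the submit command line. Every run with worker num >= 3 fails. I don't know what the problem is, and the DEBUG log does not show what went wrong.

Error when accessing the JobHistoryServer service

Hadoop version: 3.1.0
XL version: xlearning-gpu-beta
The process starts normally, but accessing http://xlhost:19886/jobhistory reports the following error:

TensorBoard uses a random port and cannot be accessed on the intranet

On the company intranet, a host:port can only be accessed after approval. Because TensorBoard uses a random port, it cannot be accessed.

Can XLearning support automatic splitting of TensorFlow input files?

By default, XLearning's TensorFlow jobs require users to split the input files themselves, that is, the number of ps must be less than or equal to the number of input files. As for the input strategy, I previously asked about the Stream input strategy, which requires the input to be standard input. So for file input, does XLearning support splitting files automatically?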

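TensorFlow demo fails

A question: when running the TensorFlow demo, it reports "ImportError: libcublas.so.9.0: cannot open shared object file: No such file or directory".
My environment: Python 3.6 managed by Anaconda, tensorflow-gpu 1.11, CUDA 9.0, Hadoop 2.7.7, master branch.
Running tensorflow-gpu sample code on its own, outside XLearning, does not report the error.
Is there anything else that needs to be configured?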

FATAL ApplicationMaster: Error running ApplicationMaster

Environment:
1.hdfs
`

Started:	Thu Apr 19 16:28:15 +0800 2018
3.1.0, r16b70619a24cdcf5d3b0fcf4b58ca77238ccbe6d
Fri Mar 30 08:00:00 +0800 2018 by centos from branch-3.1.0
CID-ea3f6bd7-9801-4a0d-a80e-e60465bb928f
BP-232525608-14.29.85.83-1522829491235

2.xlearning:
xlearning-gpu-beta
commit c732e13
`

Error message:

18/04/19 16:30:01 FATAL ApplicationMaster: Error running ApplicationMaster
java.lang.RuntimeException: Error while build container local resource
        at net.qihoo.xlearning.AM.ApplicationMaster.buildContainerLocalResource(ApplicationMaster.java:764)
        at net.qihoo.xlearning.AM.ApplicationMaster.run(ApplicationMaster.java:1171)
        at net.qihoo.xlearning.AM.ApplicationMaster.main(ApplicationMaster.java:1525)
Caused by: java.io.FileNotFoundException: File does not exist: hdfs://gpu1:8020/tmp/XLearning/staging/application_1523879759427_0061/AppMaster.jar
        at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1573)
        at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1566)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1581)
        at net.qihoo.xlearning.util.Utilities.createApplicationResource(Utilities.java:121)
        at net.qihoo.xlearning.AM.ApplicationMaster.buildContainerLocalResource(ApplicationMaster.java:677)
        ... 2 more
18/04/19 16:30:01 INFO ApplicationMaster: Deleting the staging file successed.

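How do I shut down the server in distributed TensorFlow?

I create two ps servers and two worker clients, run the computation, and exit. The problem is that after the two worker clients finish and exit, the ps server containers do not exit, because they are still blocked in server.join().

My questions are:

Why does the ps server not exit?
How can the server be shut down after the clients finish computing?

Does this framework support the Chainer deep learning framework?

Chainer is a framework made in Japan, small and easy to use. As far as I know, distributed parallel training needs to transfer gradient information between multiple machines. Does this framework support that?

Does the corresponding deep learning runtime need to be installed on the Hadoop cluster?

For example, to run MXNet, does every machine in the Hadoop cluster need the MXNet runtime installed, or is installing it on the client enough?

Error when running the demo; the log is below. Please take a look, I'm a beginner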

18/03/10 17:16:24 INFO ApplicationMaster: Application appId=1, clustertimestamp=1520673372220, attemptId=1
18/03/10 17:16:24 INFO ApplicationMaster: Application files location: file:/tmp/XLearning/staging/application_1520673372220_0001/demo.py,file:/tmp/XLearning/staging/application_1520673372220_0001/dataDeal.py
18/03/10 17:16:24 INFO ApplicationMaster: Application jar location: file:/tmp/XLearning/staging/application_1520673372220_0001/AppMaster.jar
18/03/10 17:16:24 INFO ApplicationMaster: Application conf location: file:/tmp/XLearning/staging/application_1520673372220_0001/core-site.xml
18/03/10 17:16:24 INFO ApplicationMaster: XLearning exec command: python demo.py --data_path=./data --save_path=./model --log_dir=./eventLog --training_epochs=10
18/03/10 17:16:24 INFO ApplicationMaster: XLearning app type: TENSORFLOW
18/03/10 17:16:24 INFO NMClientAsyncImpl: Upper bound of the thread pool size is 500
18/03/10 17:16:24 INFO ContainerManagementProtocolProxy: yarn.client.max-cached-nodemanagers-proxies : 0
18/03/10 17:16:24 INFO RMProxy: Connecting to ResourceManager at /0.0.0.0:8030
18/03/10 17:16:24 INFO ApplicationMessageService: Starting application message server
18/03/10 17:16:24 INFO CallQueueManager: Using callQueue: class java.util.concurrent.LinkedBlockingQueue queueCapacity: 100 scheduler: class org.apache.hadoop.ipc.DefaultRpcScheduler
18/03/10 17:16:24 INFO Server: Starting Socket Reader #1 for port 35894
18/03/10 17:16:24 INFO Server: IPC Server Responder: starting
18/03/10 17:16:24 INFO Server: IPC Server listener on 35894: starting
18/03/10 17:16:24 INFO ApplicationMessageService: Started application message server at localhost.localdomain/127.0.0.1:35894
18/03/10 17:16:24 INFO ApplicationContainerListener: Starting application web server
18/03/10 17:16:24 INFO log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
18/03/10 17:16:24 INFO AuthenticationFilter: Unable to initialize FileSignerSecretProvider, falling back to use random secrets.
18/03/10 17:16:24 INFO HttpRequestLog: Http request log for http.requests.proxy is not defined
18/03/10 17:16:24 INFO HttpServer2: Added global filter 'safety' (class=org.apache.hadoop.http.HttpServer2$QuotingInputFilter)
18/03/10 17:16:24 INFO HttpServer2: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context proxy
18/03/10 17:16:24 INFO HttpServer2: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context static
18/03/10 17:16:24 INFO HttpServer2: adding path spec: /proxy/*
18/03/10 17:16:25 INFO WebApps: Registered webapp guice modules
18/03/10 17:16:25 INFO HttpServer2: Jetty bound to port 42921
18/03/10 17:16:25 INFO log: jetty-6.1.26
18/03/10 17:16:25 INFO log: Extract jar:file:/home/larry/tools/hadoop-2.8.3/share/hadoop/yarn/hadoop-yarn-common-2.8.3.jar!/webapps/proxy to /tmp/Jetty_0_0_0_0_42921_proxy____yzbxeb/webapp
18/03/10 17:16:25 INFO log: NO JSP Support for /static/xlWebApp, did not find org.apache.jasper.servlet.JspServlet
18/03/10 17:16:25 INFO log: Started [email protected]:42921
18/03/10 17:16:25 INFO ApplicationContainerListener: Web app proxy started at 42921
18/03/10 17:16:25 INFO ApplicationContainerListener: Starting application containers handler server
18/03/10 17:16:25 INFO CallQueueManager: Using callQueue: class java.util.concurrent.LinkedBlockingQueue queueCapacity: 100 scheduler: class org.apache.hadoop.ipc.DefaultRpcScheduler
18/03/10 17:16:25 INFO Server: Starting Socket Reader #1 for port 36827
18/03/10 17:16:25 INFO Server: IPC Server Responder: starting
18/03/10 17:16:25 INFO Server: IPC Server listener on 36827: starting
18/03/10 17:16:25 INFO ApplicationContainerListener: Container timeout monitor thread had started
18/03/10 17:16:25 INFO ApplicationMaster: master tracking url:localhost:42921
18/03/10 17:16:25 INFO ApplicationMaster: history url: 0.0.0.0:19886/jobhistory/job/application_1520673372220_0001
18/03/10 17:16:25 INFO ApplicationMaster: ApplicationMaster Starting ...
18/03/10 17:16:25 INFO Utilities: input path: file:/tmp/data/tensorflow
18/03/10 17:16:25 INFO ApplicationMaster: XLearning application needs 1 worker and 1 ps containers in fact
18/03/10 17:16:25 INFO ApplicationMaster: Create worker container request: Capability[<memory:4096, vCores:1>]Priority[3]
18/03/10 17:16:25 INFO ApplicationMaster: Create ps container request: Capability[<memory:1024, vCores:1>]Priority[3]
18/03/10 17:16:25 INFO ApplicationMaster: Try to allocate 1 ps/server containers
18/03/10 17:16:27 INFO AMRMClientImpl: Received new token for : localhost:40131
18/03/10 17:16:27 INFO RMCallbackHandler: Acquired container container_1520673372220_0001_01_000002 on host localhost , with the resource <memory:1024, vCores:1>
18/03/10 17:16:27 INFO RMCallbackHandler: Current acquired worker container 0 / 1 ps container 1 / 1
18/03/10 17:16:27 INFO ApplicationMaster: Total 1 ps containers has allocated.
18/03/10 17:16:27 INFO ApplicationMaster: Try to allocate 1 worker containers
18/03/10 17:16:29 INFO RMCallbackHandler: Acquired container container_1520673372220_0001_01_000003 on host localhost , with the resource <memory:4096, vCores:1>
18/03/10 17:16:29 INFO RMCallbackHandler: Current acquired worker container 1 / 1 ps container 1 / 1
18/03/10 17:16:29 INFO ApplicationMaster: Total 1 worker containers has allocated.
18/03/10 17:16:29 INFO ApplicationMaster: Initializing container_1520673372220_0001_01_000003 input splits
Exception in thread "main" java.lang.NoSuchMethodError: java.util.concurrent.ConcurrentHashMap.keySet()Ljava/util/concurrent/ConcurrentHashMap$KeySetView;
at net.qihoo.xlearning.AM.ApplicationMaster.allocateInputSplits(ApplicationMaster.java:511)
at net.qihoo.xlearning.AM.ApplicationMaster.run(ApplicationMaster.java:1139)
at net.qihoo.xlearning.AM.ApplicationMaster.main(ApplicationMaster.java:1475)

Error when running the demo

Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.yarn.webapp.WebApps$Builder.build(Lorg/apache/hadoop/yarn/webapp/WebApp;)Lorg/apache/hadoop/yarn/webapp/WebApp;
at net.qihoo.xlearning.AM.ApplicationWebService.start(ApplicationWebService.java:35)
at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
at net.qihoo.xlearning.AM.ApplicationMaster.init(ApplicationMaster.java:217)
at net.qihoo.xlearning.AM.ApplicationMaster.main(ApplicationMaster.java:1245)

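TensorFlow version compatibility and model saving

We are using XLearning to test distributed TensorFlow model training and have run into some problems:

1. What is the highest TensorFlow version that XLearning currently supports? The test scripts provided in the examples fail on version 1.10 but work on version 1.3.

2. Could you show how to save a pb model file? Python code that saves a pb file successfully on a local machine raises an error when the save is run through XLearning.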

IllegalArgumentException

src/main/java/net/qihoo/xlearning/AM/ApplicationMaster.java:823
updateBlacklist.invoke(amrmAsync, blackHosts) throws the exception "java.lang.IllegalArgumentException: wrong number of arguments"

When will the runtime environment support Docker?

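XLearning reports an error at the end when two users run the demo at the same time

The example being run is: xlearning/examples/tensorflow/run.sh
The job fails when it reaches 95%. On the worker node, the error is a directory permission problem:
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=bing.wb, access=WRITE, inode="/tmp/XLearning/eventLog":xxxxxxx:supergroup:drwxr-xr-x
The directory belongs to xxxxxx, a colleague who created it by running the command earlier, and that is why my job now fails.
However, I had already redirected eventLog before running the demo, and this change does not seem to have taken effect.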

[[email protected] /home/bing.wb/xlearning/conf]
$grep -b2  event xlearning-site.xml
1450-    <property>
1465-        <name>xlearning.tf.board.history.dir</name>
1517:        <value>/tmp/bing.wb/XLearning/eventLog</value>
1572-    </property>
1588-    <property>
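
Note that the path in the exception (/tmp/XLearning/eventLog) differs from the value configured above. To double-check which properties in the file mention the event log at all, a small script along these lines can list them. This is only a sketch: it assumes the usual Hadoop-style configuration/property/name/value layout, the XML is inlined for illustration, and both the `find_properties` helper and the `example.unrelated.property` entry are hypothetical, not part of XLearning.

```python
import xml.etree.ElementTree as ET

# Inlined sample mirroring the grep output above. In practice, read the real
# conf/xlearning-site.xml instead. The second property is a made-up filler.
SAMPLE = """<configuration>
    <property>
        <name>xlearning.tf.board.history.dir</name>
        <value>/tmp/bing.wb/XLearning/eventLog</value>
    </property>
    <property>
        <name>example.unrelated.property</name>
        <value>42</value>
    </property>
</configuration>"""


def find_properties(xml_text, keyword):
    """Return {name: value} for properties whose name or value contains keyword."""
    root = ET.fromstring(xml_text)
    needle = keyword.lower()
    matches = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        # Case-insensitive match on either the property name or its value.
        if needle in name.lower() or needle in value.lower():
            matches[name] = value
    return matches


if __name__ == "__main__":
    for name, value in find_properties(SAMPLE, "event").items():
        print("{} = {}".format(name, value))
```

If the only match is the TensorBoard history directory, the directory named in the exception is probably controlled by a different setting, which would have to be confirmed against the XLearning configuration reference.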

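Model files cannot be found after running the TensorFlow demo

After running the official TensorFlow demo, no model files were found under /tmp/tensorflow_model on HDFS. As the screenshot shows, the container directories are all empty. However, both YARN and the shell reported that the run succeeded. I do not know what went wrong.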

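Can data be distributed for training day by day, in order?

Is it possible to distribute data for training in date order, so that the data for 0901 is distributed and trained first, then 0902, then 0903, and so on?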

FATAL Client: Error running Client

Hi, I followed your steps, but when I run the following command it fails:

$XLEARNING_HOME/bin/xl-submit --app-type "tensorflow" --app-name "tf-demo" --input /tmp/data/tensorflow#data --output /tmp/tensorflow_model#model --files demo.py,dataDeal.py --launch-cmd "python demo.py --data_path=./data --save_path=./model --log_dir=./eventLog --training_epochs=10" --worker-memory 2G --worker-num 2 --worker-cores 3 --ps-memory 1G --ps-num 1

The error information is:

17/12/13 02:57:28 INFO Client: Submitting application to ResourceManager
17/12/13 02:57:28 FATAL Client: Error running Client
java.lang.RuntimeException: Application submitAndMonitor failed!
at net.qihoo.xlearning.client.Client.submitAndMonitor(Client.java:594)
at net.qihoo.xlearning.client.Client.main(Client.java:665)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
How can I solve this problem? Can you help me? Thanks~

Can XLearning run on Ubuntu 16.04?

As the title says, I would like to set up XLearning on Ubuntu. Is Ubuntu supported?

Resource request times out after submitting to YARN with the gpu-beta version

The cluster does have resources available.
AM log:
18/04/12 10:53:03 INFO ResourceUtils: Adding resource type - name = yarn.io/gpu, units = , type = COUNTABLE
18/04/12 10:53:04 INFO Utilities: input path: hdfs://gpu1:8020/user/tmp/data/tensorflow
18/04/12 10:53:04 INFO ApplicationMaster: XLearning application needs 2 worker and 1 ps containers in fact
18/04/12 10:53:04 INFO ApplicationMaster: Create worker container request: Capability[<memory:4096, vCores:2, yarn.io/gpu: 2>]Priority[3]AllocationRequestId[0]ExecutionTypeRequest[{Execution Type: GUARANTEED, Enforce Execution Type: false}]Resource Profile[null]
18/04/12 10:53:04 INFO ApplicationMaster: Create ps container request: Capability[<memory:4096, vCores:2>]Priority[3]AllocationRequestId[0]ExecutionTypeRequest[{Execution Type: GUARANTEED, Enforce Execution Type: false}]Resource Profile[null]
18/04/12 10:53:04 INFO ApplicationMaster: Try to allocate 1 ps/server containers
18/04/12 10:53:05 INFO RMCallbackHandler: Acquired container container_1523451270416_0005_01_000002 on host gpu3 , with the resource <memory:4096, vCores:2>
18/04/12 10:53:05 INFO RMCallbackHandler: Current acquired worker container 0 / 2 ps container 1 / 1
18/04/12 10:53:06 INFO ApplicationMaster: Total 1 ps containers has allocated.
18/04/12 10:53:06 INFO ApplicationMaster: Try to allocate 2 worker containers
18/04/12 10:53:07 INFO RMCallbackHandler: Acquired container container_1523451270416_0005_01_000003 on host gpu3 , with the resource <memory:4096, vCores:2, yarn.io/gpu: 2>
18/04/12 10:53:07 INFO RMCallbackHandler: Current acquired worker container 1 / 2 ps container 1 / 1
18/04/12 11:03:08 INFO ApplicationMaster: Container waiting except the allocated expiry time. Maybe the Cluster available resources are not satisfied the user need. Please resubmit !
18/04/12 11:03:08 INFO ApplicationMaster: Unregister Application
18/04/12 11:03:08 INFO AMRMClientImpl: Waiting for application to be successfully unregistered.
18/04/12 11:03:08 INFO ApplicationMaster: Application failed.
ResourceManager log:
clusterResource=<memory:400000, vCores:36, yarn.io/gpu: 8>

run.sh:
--worker-memory 4G \
--worker-num 2 \
--worker-cores 2 \
--worker-gcores 2 \
--ps-memory 4G \
--ps-num 1 \
--ps-cores 2 \
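
As a quick sanity check on the numbers quoted above, the total request can be compared with the clusterResource line. This is only a rough sketch: the figures are copied from the logs and run.sh in this issue, the `total_request` and `fits` helpers are hypothetical, the AM's own container is not counted, and clusterResource is the cluster total rather than what is free on any single node.

```python
# Figures copied from the AM log, the ResourceManager log, and run.sh above.
CLUSTER = {"memory_mb": 400000, "vcores": 36, "gpus": 8}
WORKER = {"memory_mb": 4096, "vcores": 2, "gpus": 2}
PS = {"memory_mb": 4096, "vcores": 2, "gpus": 0}


def total_request(worker, worker_num, ps, ps_num):
    """Sum the per-container requests over all worker and ps containers."""
    return {key: worker[key] * worker_num + ps[key] * ps_num for key in worker}


def fits(request, cluster):
    """True if every requested resource is within the cluster-wide total."""
    return all(request[key] <= cluster[key] for key in request)


if __name__ == "__main__":
    request = total_request(WORKER, worker_num=2, ps=PS, ps_num=1)
    print(request)
    print("fits in cluster total:", fits(request, CLUSTER))
```

The aggregate request (12288 MB, 6 vcores, 4 GPUs) is well within the cluster total, so the timeout is more likely related to per-node availability or scheduler limits. Neither can be determined from the logs shown here.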

Error when integrating Kerberos

For the JobHistoryServer, I added the following to xlearning-site.xml:

xlearning.history.keytab
/var/run/cloudera-scm-agent/process/3001-hive-HIVESERVER2/hive.keytab

xlearning.history.principal
hive/bd129118@MYCDH

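The service starts successfully, but running the demo reports the error below. The Kerberos tickets are valid on every machine in the cluster.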
18/01/18 10:33:26 INFO Client: Application report for application_1516178233465_0044 (state: RUNNING)
18/01/18 10:33:26 WARN UserGroupInformation: PriviledgedActionException as:hive/bd129118@MYCDH (auth:KERBEROS) cause:org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[KERBEROS]
18/01/18 10:33:26 WARN Client: Exception encountered while connecting to the server : org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[KERBEROS]
18/01/18 10:33:26 WARN UserGroupInformation: PriviledgedActionException as:hive/bd129118@MYCDH (auth:KERBEROS) cause:java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[KERBEROS]
18/01/18 10:33:26 WARN Client: Connecting to ResourceManager failed, try again later
java.lang.reflect.UndeclaredThrowableException
at com.sun.proxy.$Proxy21.fetchApplicationMessages(Unknown Source)
at net.qihoo.xlearning.client.Client.waitCompleted(Client.java:682)
at net.qihoo.xlearning.client.Client.submitAndMonitor(Client.java:643)
at net.qihoo.xlearning.client.Client.main(Client.java:711)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: java.io.IOException: Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[KERBEROS]; Host Details : local host is: "bd129118/192.168.129.118"; destination host is: "bd129120":10079;
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
at org.apache.hadoop.ipc.Client.call(Client.java:1476)
at org.apache.hadoop.ipc.Client.call(Client.java:1409)
at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:243)
... 10 more
Caused by: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[KERBEROS]
at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:688)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:651)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:739)
at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:376)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1525)
at org.apache.hadoop.ipc.Client.call(Client.java:1448)
... 12 more
Caused by: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[KERBEROS]
at org.apache.hadoop.security.SaslRpcClient.selectSaslClient(SaslRpcClient.java:172)
at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:396)
at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:561)
at org.apache.hadoop.ipc.Client$Connection.access$1900(Client.java:376)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:731)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:727)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:726)
... 15 more
18/01/18 10:33:27 INFO Client: Application report for application_1516178233465_0044 (state: RUNNING)

TensorFlow demo occasionally fails

18/03/12 20:50:33 INFO XLearningContainer: WARNING:tensorflow:From demo.py:75: init (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
18/03/12 20:50:33 INFO XLearningContainer: Instructions for updating:
18/03/12 20:50:33 INFO XLearningContainer: Please switch to tf.train.MonitoredTrainingSession
18/03/12 20:50:33 INFO XLearningContainer: 2018-03-12 20:50:33.961848: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: OS Error
18/03/12 20:50:33 INFO XLearningContainer: Traceback (most recent call last):
18/03/12 20:50:33 INFO XLearningContainer: File "demo.py", line 173, in <module>
18/03/12 20:50:33 INFO XLearningContainer: tf.app.run(main=main)
18/03/12 20:50:33 INFO XLearningContainer: File "/usr/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 126, in run
18/03/12 20:50:33 INFO XLearningContainer: _sys.exit(main(argv))
18/03/12 20:50:33 INFO XLearningContainer: File "demo.py", line 76, in main
18/03/12 20:50:33 INFO XLearningContainer: with sv.prepare_or_wait_for_session(server.target, config = tf.ConfigProto(gpu_options=gpu_options, allow_soft_placement = True, log_device_placement = True)) as sess:
18/03/12 20:50:33 INFO XLearningContainer: File "/usr/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 726, in prepare_or_wait_for_session
18/03/12 20:50:33 INFO XLearningContainer: init_feed_dict=self._init_feed_dict, init_fn=self._init_fn)
18/03/12 20:50:33 INFO XLearningContainer: File "/usr/lib/python2.7/site-packages/tensorflow/python/training/session_manager.py", line 281, in prepare_session
18/03/12 20:50:33 INFO XLearningContainer: sess.run(init_op, feed_dict=init_feed_dict)
18/03/12 20:50:33 INFO XLearningContainer: File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 905, in run
18/03/12 20:50:33 INFO XLearningContainer: run_metadata_ptr)
18/03/12 20:50:33 INFO XLearningContainer: File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1137, in _run
18/03/12 20:50:33 INFO XLearningContainer: feed_dict_tensor, options, run_metadata)
18/03/12 20:50:33 INFO XLearningContainer: File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1355, in _do_run
18/03/12 20:50:33 INFO XLearningContainer: options, run_metadata)
18/03/12 20:50:33 INFO XLearningContainer: File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1374, in _do_call
18/03/12 20:50:33 INFO XLearningContainer: raise type(e)(node_def, op, message)
18/03/12 20:50:33 INFO XLearningContainer: tensorflow.python.framework.errors_impl.UnavailableError: OS Error

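Progress bar is not displayed

I print report:progress:0.775 to standard error, but the progress still does not show up on the job monitoring page. What should I change to make the progress bar appear?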
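
For reference, emitting a line in the format quoted above from a Python training script could look like the following. It is only a sketch: `format_progress` and `report_progress` are hypothetical helper names, clamping the value to [0, 1] is an assumption, and whether the monitoring page actually parses this line depends on the XLearning version and is not confirmed here.

```python
import sys


def format_progress(fraction):
    """Build a progress line in the format quoted in the question."""
    # Clamping to [0, 1] is this sketch's choice, not a documented rule.
    clamped = min(max(float(fraction), 0.0), 1.0)
    return "report:progress:{}".format(clamped)


def report_progress(fraction, stream=None):
    """Write one progress line to stderr (or the given stream) and flush it."""
    out = sys.stderr if stream is None else stream
    out.write(format_progress(fraction) + "\n")
    out.flush()


if __name__ == "__main__":
    total_steps = 4
    for step in range(1, total_steps + 1):
        # Training work for this step would go here.
        report_progress(step / float(total_steps))
```

Keeping the formatting in its own function makes it easy to compare the exact string with what the container log shows.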