
xlearning's Introduction



XLearning is a convenient and efficient scheduling platform that combines big data and artificial intelligence and supports a variety of machine learning and deep learning frameworks. XLearning runs on Hadoop YARN and integrates frameworks such as TensorFlow, MXNet, Caffe, Theano, PyTorch, Keras, and XGBoost. XLearning offers good scalability and compatibility.

Chinese documentation

Architecture

XLearning has three essential components:

  • Client: starts the application and retrieves its state.
  • ApplicationMaster (AM): handles internal scheduling and lifecycle management, including input data distribution and container management.
  • Container: the actual executor of the application. It starts the Worker or PS (Parameter Server) process, monitors it and reports its status to the AM, and saves the output. For TensorFlow applications, it also starts the TensorBoard service.

Functions

1 Support Multiple Deep Learning Frameworks

Besides the distributed mode of the TensorFlow and MXNet frameworks, XLearning supports the standalone mode of all deep learning frameworks, such as Caffe, Theano, and PyTorch. XLearning also allows custom versions and multiple versions of a framework to be used flexibly.

2 Unified Data Management Based On HDFS

XLearning lets you choose the input strategy for the input data (--input) by setting the --input-strategy parameter or the xlearning.input.strategy configuration property. XLearning supports three ways to read HDFS input data:

  • Download: the AM traverses all files under the specified HDFS path and distributes the data to workers at file granularity. Each worker downloads its files from HDFS to the local file system.
  • Placeholder: unlike Download mode, the AM only sends the relevant HDFS file list to each worker. The worker process reads the data from HDFS directly.
  • InputFormat: XLearning integrates the InputFormat function of MapReduce, so the user can specify any InputFormat implementation for the input data. The AM splits the input data and assigns the splits to the workers. Each worker passes its assigned splits through a pipe to the execution process.
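
As a rough illustration of the file-level distribution used by the Download and Placeholder strategies, the sketch below assigns a list of HDFS file paths to workers round-robin. The round-robin policy, the function name assign_files, and the sample paths are assumptions made for illustration; this README does not say how the AM actually balances files across workers.

```python
from typing import Dict, List


def assign_files(files: List[str], worker_num: int) -> Dict[int, List[str]]:
    """Assign whole files to workers round-robin (hypothetical policy)."""
    if worker_num <= 0:
        raise ValueError("worker_num must be positive")
    assignment: Dict[int, List[str]] = {i: [] for i in range(worker_num)}
    # Sort first so the assignment is deterministic for a given file set.
    for idx, path in enumerate(sorted(files)):
        assignment[idx % worker_num].append(path)
    return assignment


if __name__ == "__main__":
    # Hypothetical file names under the example input path.
    files = [f"/tmp/data/tensorflow/part-{i}" for i in range(5)]
    for worker, paths in assign_files(files, 2).items():
        print(worker, paths)
```

Under this sketch, a Download-mode worker would copy its assigned files to local disk, while a Placeholder-mode worker would receive only the list and read the files from HDFS itself.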

Similarly, XLearning lets you choose the output strategy for the output data (--output) by setting the --output-strategy parameter or the xlearning.output.strategy configuration property. There are two output modes:

  • Upload: after the program finishes, each worker uploads its local output directory to the specified HDFS path directly. The "Save Model" button on the web interface lets the user upload intermediate results to HDFS during execution.
  • OutputFormat: XLearning integrates the OutputFormat function of MapReduce, so the user can specify any OutputFormat implementation for saving the results to HDFS.

For more details, see data management.

3 Visualization Display

The application interface is divided into four parts:

  • All Containers: displays the container list and the corresponding information, including the container host, container role, current container state, start time, finish time, and current progress.
  • View TensorBoard: if the TensorBoard service is enabled for a TensorFlow application, provides a link to TensorBoard for real-time viewing.
  • Save Model: if the application has output, the user can click the "Save Model" button to upload the intermediate output to the specified HDFS path while the application is running. After the upload finishes, the list of saved intermediate paths is displayed.
  • Worker Metrics: displays the resource usage metrics of each worker.
    As shown below:

yarn1

4 Compatible With Native Framework Code

Apart from the automatic construction of the ClusterSpec in distributed-mode TensorFlow, programs written for standalone-mode TensorFlow and for other deep learning frameworks can be executed on XLearning directly.
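
The sketch below shows one way a distributed TensorFlow program might pick up a cluster layout that the platform has constructed for it. It assumes the layout arrives as JSON in an environment variable; the variable names TF_CLUSTER_DEF, TF_ROLE, and TF_INDEX, the function name, and the host names are assumptions not stated in this section, so check them against the XLearning documentation before relying on them.

```python
import json
import os
from typing import Dict, List, Mapping, Tuple


def read_cluster_from_env(
    env: Mapping[str, str] = os.environ,
) -> Tuple[Dict[str, List[str]], str, int]:
    """Read the cluster layout, this task's role, and its index.

    The variable names used here are assumptions for illustration.
    """
    cluster = json.loads(env["TF_CLUSTER_DEF"])  # e.g. {"ps": [...], "worker": [...]}
    role = env["TF_ROLE"]                        # "ps" or "worker"
    index = int(env["TF_INDEX"])                 # position within the role's host list
    return cluster, role, index


if __name__ == "__main__":
    # Fake environment standing in for what a container might receive.
    fake_env = {
        "TF_CLUSTER_DEF": json.dumps(
            {"ps": ["host1:2222"], "worker": ["host2:2222", "host3:2222"]}
        ),
        "TF_ROLE": "worker",
        "TF_INDEX": "1",
    }
    cluster, role, index = read_cluster_from_env(fake_env)
    print(role, index, cluster[role][index])
```

In a TensorFlow 1.x program, the returned dictionary could then be passed to tf.train.ClusterSpec, with the role and index used as the job name and task index of the server.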

Compilation & Deployment Instructions

1 Compilation Environment Requirements

  • JDK >= 1.7
  • Maven >= 3.3

2 Compilation Method

Run the following command in the root directory of the source code:

mvn package

After compilation, a distribution package named xlearning-1.1-dist.tar.gz is generated under target in the root directory.
Unpacking the distribution package produces the following subdirectories under its root directory:

  • bin: scripts for submitting applications
  • lib: jars for XLearning and dependencies
  • conf: configuration files
  • sbin: scripts for history service
  • data: data and files for examples
  • examples: XLearning examples

3 Deployment Environment Requirements

  • CentOS 7.2
  • Java >= 1.7
  • Hadoop = 2.6, 2.7, 2.8
  • [optional] Dependent environment for deep learning frameworks at the cluster nodes, such as TensorFlow, numpy, Caffe.

4 XLearning Client Deployment Guide

In the "conf" directory of the unpacked distribution package ($XLEARNING_HOME), configure the following files:

  • xlearning-env.sh: set the environment variables, such as:

    • JAVA_HOME
    • HADOOP_CONF_DIR
  • xlearning-site.xml: configure the related properties. Note that the properties associated with the history service need to be consistent with those in effect when the history service was started. For more details, see the Configuration part.

  • log4j.properties: configure the log level
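
Before submitting a job, it can help to confirm that the variables set in xlearning-env.sh are visible to the client. The following is a minimal pre-flight sketch, not part of XLearning; the function name unset_vars is hypothetical, and the check covers only the two variables named above.

```python
import os
from typing import List, Mapping

# The two variables this guide says to set in xlearning-env.sh.
REQUIRED_VARS = ("JAVA_HOME", "HADOOP_CONF_DIR")


def unset_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variables that are missing or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = unset_vars()
    if missing:
        print("Not set:", ", ".join(missing))
    else:
        print("Environment looks complete.")
```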

5 Start Method of XLearning History Service [Optional]

  • Run $XLEARNING_HOME/sbin/start-history-server.sh.

Quick Start

Use $XLEARNING_HOME/bin/xl-submit on the XLearning client to submit an application to the cluster.
Here is a submission example for a TensorFlow application.

1 Upload data to HDFS

Upload the "data" directory under the root of the unpacked distribution package to HDFS:

cd $XLEARNING_HOME  
hadoop fs -put data /tmp/ 

2 Submit the application

cd $XLEARNING_HOME/examples/tensorflow
$XLEARNING_HOME/bin/xl-submit \
   --app-type "tensorflow" \
   --app-name "tf-demo" \
   --input /tmp/data/tensorflow#data \
   --output /tmp/tensorflow_model#model \
   --files demo.py,dataDeal.py \
   --launch-cmd "python demo.py --data_path=./data --save_path=./model --log_dir=./eventLog --training_epochs=10" \
   --worker-memory 10G \
   --worker-num 2 \
   --worker-cores 3 \
   --ps-memory 1G \
   --ps-num 1 \
   --ps-cores 2 \
   --queue default

The parameters have the following meanings:

| Property Name | Meaning |
| --- | --- |
| app-name | application name, here "tf-demo" |
| app-type | application type, here "tensorflow" |
| input | input files; the HDFS path "/tmp/data/tensorflow" is mapped to the local directory "./data" |
| output | output files; the HDFS path "/tmp/tensorflow_model" is mapped to the local directory "./model" |
| files | application program and required local files, here demo.py and dataDeal.py |
| launch-cmd | command to execute |
| worker-memory | amount of memory for each worker process, here 10 GB |
| worker-num | number of worker containers for the application, here 2 |
| worker-cores | number of cores for each worker process, here 3 |
| ps-memory | amount of memory for each ps process, here 1 GB |
| ps-num | number of ps containers for the application, here 1 |
| ps-cores | number of cores for each ps process, here 2 |
| queue | the queue that the application is submitted to |

For more details, see the Submit Parameter part.
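
When submissions are scripted, the same command can be assembled from a dictionary of options. The sketch below is only an illustration: the helper build_submit_command and the install path /opt/xlearning are hypothetical, and it uses only option names that appear in the example above.

```python
import shlex
from typing import Dict, List


def build_submit_command(xlearning_home: str, options: Dict[str, str]) -> List[str]:
    """Turn {"app-type": "tensorflow", ...} into an xl-submit argument list."""
    cmd = [f"{xlearning_home}/bin/xl-submit"]
    for name, value in options.items():
        cmd += [f"--{name}", str(value)]
    return cmd


if __name__ == "__main__":
    options = {
        "app-type": "tensorflow",
        "app-name": "tf-demo",
        # "<HDFS path>#<local dir>" maps the HDFS path to a local directory.
        "input": "/tmp/data/tensorflow#data",
        "output": "/tmp/tensorflow_model#model",
        "files": "demo.py,dataDeal.py",
        "launch-cmd": "python demo.py --data_path=./data --save_path=./model "
                      "--log_dir=./eventLog --training_epochs=10",
        "worker-memory": "10G",
        "worker-num": "2",
        "worker-cores": "3",
        "ps-memory": "1G",
        "ps-num": "1",
        "ps-cores": "2",
        "queue": "default",
    }
    # /opt/xlearning is a placeholder for the real $XLEARNING_HOME.
    print(shlex.join(build_submit_command("/opt/xlearning", options)))
```

Building the command as a list keeps the launch command as a single argument, so it can be handed to subprocess.run without extra shell quoting.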

FAQ

XLearning FAQ

Authors

XLearning is designed, authored, reviewed and tested by the following team on GitHub:

@Yuance Li, @Wen OuYang, @Runying Jia, @YuHan Jia, @Lei Wang

Contact us

Mail: [email protected]
QQ group: 588356340

xlearning's People

Contributors

hbrnws, jiarunying, liyuance, wangxingda


xlearning's Issues

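Creating worker and ps containers fails after job submission (ln: command not found)

Hadoop version: 3.1.0
XL version: xlearning-gpu-beta

After the XL AM starts, it tells the NodeManager to run launch_container.sh to create the containers for the workers and ps. Running launch_container.sh produces the following errors: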

1
2
3

P.S.: if I submit a plain MR job (wordcount) instead of submitting through XL, the containers are created without problems.

Demo fails to run

Following the README steps, running sh run.sh fails:

Error message:
Error: A JNI error has occurred, please check your installation and try again
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/mapred/JobConf
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
at java.lang.Class.privateGetMethodRecursive(Class.java:3048)
at java.lang.Class.getMethod0(Class.java:3018)
at java.lang.Class.getMethod(Class.java:1784)
at sun.launcher.LauncherHelper.validateMainClass(LauncherHelper.java:544)
at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:526)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.mapred.JobConf
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 7 more

Has anyone run into the same problem, and is there a solution?

ERROR Client: Application run failed!

Hi, I pulled the latest code and first added "--queue default" at the end of "run.sh". The run information is:
17/12/13 06:41:52 INFO Client: Copying /opt/XLearning/target/xlearning-1.0/lib/xlearning-1.0-hadoop2.7.3.jar to remote path hdfs://test-2:8020/tmp/XLearning/staging/application_1511938500942_0009/AppMaster.jar
17/12/13 06:41:52 INFO Client: Building environments for the application master
17/12/13 06:41:52 INFO Client: Copy xlearning files from local filesystem to remote.
17/12/13 06:41:52 INFO Client: Copying demo.py to remote path hdfs://test-2:8020/tmp/XLearning/staging/application_1511938500942_0009/demo.py
17/12/13 06:41:52 INFO Client: Copying dataDeal.py to remote path hdfs://test-2:8020/tmp/XLearning/staging/application_1511938500942_0009/dataDeal.py
17/12/13 06:41:52 INFO Client: Building application master launch command
17/12/13 06:41:52 INFO Client: Application master launch command: ${JAVA_HOME}/bin/java -Xms1024m -Xmx1024m net.qihoo.xlearning.AM.ApplicationMaster 1><LOG_DIR>/stdout 2><LOG_DIR>/stderr
17/12/13 06:41:52 INFO Client: Submitting application to ResourceManager
17/12/13 06:41:53 INFO YarnClientImpl: Submitted application application_1511938500942_0009
17/12/13 06:41:53 INFO Client: Application submitAndMonitor succeed
17/12/13 06:41:53 INFO Client: The url to track the job: http://test-2:8088/proxy/application_1511938500942_0009/
17/12/13 06:41:53 INFO Client: Application report for application_1511938500942_0009 (state: ACCEPTED)
17/12/13 06:41:54 INFO Client: Application report for application_1511938500942_0009 (state: ACCEPTED)
17/12/13 06:41:55 INFO Client: Application report for application_1511938500942_0009 (state: ACCEPTED)
17/12/13 06:41:56 INFO Client: Application report for application_1511938500942_0009 (state: ACCEPTED)
17/12/13 06:41:57 INFO Client: Application report for application_1511938500942_0009 (state: FAILED)
17/12/13 06:41:57 INFO Client: Application has completed with YarnApplicationState=FAILED and FinalApplicationStatus=FAILED
17/12/13 06:41:57 ERROR Client: Application run failed!

I checked the log files under $XLEARNING_HOME/logs. The error is "Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /hadoop/mapreduce/jhs/mr-jhs-state/LOCK: Resource temporarily unavailable", which is an IO error.
More information follows:

17/12/13 06:56:03 INFO MetricsSystemImpl: Stopping JobHistoryServer metrics system...
17/12/13 06:56:03 INFO MetricsSystemImpl: JobHistoryServer metrics system stopped.
17/12/13 06:56:03 INFO MetricsSystemImpl: JobHistoryServer metrics system shutdown complete.
17/12/13 06:56:03 FATAL JobHistoryServer: Error starting JobHistoryServer
org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /hadoop/mapreduce/jhs/mr-jhs-state/LOCK: Resource temporarily unavailable
at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204)
at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
at net.qihoo.xlearning.jobhistory.JobHistoryServer.serviceStart(JobHistoryServer.java:218)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at net.qihoo.xlearning.jobhistory.JobHistoryServer.launchJobHistoryServer(JobHistoryServer.java:250)
at net.qihoo.xlearning.jobhistory.JobHistoryServer.main(JobHistoryServer.java:259)
Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /hadoop/mapreduce/jhs/mr-jhs-state/LOCK: Resource temporarily unavailable
at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
at org.apache.hadoop.mapreduce.v2.hs.HistoryServerLeveldbStateStoreService.startStorage(HistoryServerLeveldbStateStoreService.java:82)
at org.apache.hadoop.mapreduce.v2.hs.HistoryServerStateStoreService.serviceStart(HistoryServerStateStoreService.java:79)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
... 5 more
17/12/13 06:56:03 INFO ExitUtil: Exiting with status -1
17/12/13 06:56:03 INFO JobHistoryServer: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down JobHistoryServer at test-2/172.16.12.46
************************************************************/
Do you have any ideas about this? How can I solve it? Thank you.

if mode == tf.estimator.ModeKeys.PREDICT: in model_fn() is not supported

model = tf.estimator.Estimator(...)
ps=model.predict(...)

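XLearning seems to skip ps=model.predict(...) without executing it and reports success directly.

I added a print call to the tf.estimator.ModeKeys.PREDICT branch of model_fn(), but the printed content does not appear in the log.
With model.eval(...), the print output inside model_fn() does appear.
Judging from this, model.predict(...) really is being ignored.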

Error: Could not find or load main class net.qihoo.xlearning.AM.ApplicationMaster

Application application_1541471478754_0001 failed 2 times due to AM Container for appattempt_1541471478754_0001_000002 exited with exitCode: 1
Failing this attempt.Diagnostics: [2018-11-06 10:32:36.870]Exception from container-launch.
Container id: container_1541471478754_0001_02_000001
Exit code: 1
[2018-11-06 10:32:36.878]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
Error: Could not find or load main class net.qihoo.xlearning.AM.ApplicationMaster
[2018-11-06 10:32:36.879]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
Error: Could not find or load main class net.qihoo.xlearning.AM.ApplicationMaster
For more detailed output, check the application tracking page: http://why-System-Product-Name:10086/cluster/app/application_1541471478754_0001 Then click on links to logs of each attempt.
. Failing the application.

Hi all, I get this error when running run.sh from xlearning-gpu-1.3. How can I solve it?
Environment:
ubuntu16
hadoop3.1.1
xlearning-gpu-1.3

Stuck waiting when connecting to the RM

After running the demo under tensorflow (run.sh), the following problem appears:
17/12/06 11:55:26 INFO RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
17/12/06 11:55:27 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
17/12/06 11:55:28 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
17/12/06 11:55:29 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
17/12/06 11:55:30 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
17/12/06 11:55:31 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
17/12/06 11:55:32 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
17/12/06 11:55:33 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
17/12/06 11:55:34 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
17/12/06 11:55:35 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
17/12/06 11:55:36 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)

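train_and_evaluate in distributed mode

A question: in distributed mode, train_and_evaluate never triggers evaluate. The TF documentation says that an evaluate node must be started and that this node is not part of the training cluster. How should this be handled in XLearning?

The only stop condition for train_and_evaluate is max_step. Is there a good way to stop early based on a validation set, to prevent overfitting?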

The following error appears when running the demo

Application application_1510661908139_1155931 failed 3 times due to AM Container for appattempt_1510661908139_1155931_000003 exited with exitCode: -1000
For more detailed output, check application tracking page:http://xxx:8088/cluster/app/application_1510661908139_1155931Then, click on links to logs of each attempt.
Diagnostics: Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]; Host Details : local host is: "hadoopxxx"; destination host is: "xxx":8020;
java.io.IOException: Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]; Host Details : local host is: "xxx"; destination host is: "xxx":8020;
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
at org.apache.hadoop.ipc.Client.call(Client.java:1479)
at org.apache.hadoop.ipc.Client.call(Client.java:1412)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy11.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:771)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy12.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2108)
at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1305)
at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1301)
at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253)
at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:63)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:361)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:359)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]
at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:687)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:650)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:737)
at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1528)
at org.apache.hadoop.ipc.Client.call(Client.java:1451)

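Cannot connect to an external database

An application I developed needs to connect to an external database to fetch some information, but it cannot connect.
pymysql.err.OperationalError: (2003, "Can't connect to MySQL server on '***' (timed out)").
Without XLearning, running directly in standalone mode, it can connect to this database.
Does XLearning disallow access to external links during execution?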

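tf server fails to start because of a port conflict

reservedSocket.bind(new InetSocketAddress("127.0.0.1", 0));
When xlcontainer requests a port it uses "127.0.0.1", but in practice many services bind ports on the real IP (for example 192.168.2.2), which causes a problem. For example, if a service has already bound 192.168.2.2:12345, xlcontainer can still get 12345 as an available port and pass it to tf to start the server, which leads to a port-in-use error.
We changed the way xlcontainer obtains an available port so that it requests the port on the real IP. It has been stable in production since then, and we have not hit similar problems again.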
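
The idea behind this fix can be sketched in Python: binding to port 0 asks the operating system for a port that is free on the given address only, so the address passed to bind determines which conflicts are detected. This is an illustration of the technique, not the xlcontainer code (which is Java), and the function name reserve_port is hypothetical.

```python
import socket


def reserve_port(host: str) -> int:
    """Ask the OS for a port that is currently free on `host`.

    Port 0 means "pick any free port". The check applies to `host` only,
    so a port returned for 127.0.0.1 may already be taken on another
    address of the same machine.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


if __name__ == "__main__":
    # Replace 127.0.0.1 with the machine's real IP to probe that address.
    print(reserve_port("127.0.0.1"))
```

The socket is closed before the port is used, so another process can still claim the port in between; the sketch narrows the conflict described above but does not remove that race.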

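Does XLearning support secure clusters?

Starting the XLearning HistoryServer in a Kerberos-enabled cluster fails with an error that the keytab file cannot be found. Does XL support secure clusters at present?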

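Can xlearning support automatic splitting of TensorFlow input files?

By default, xlearning runs TensorFlow jobs with input files that the user has already split, that is, the number of ps must be less than or equal to the number of input files. As for the input strategy, I asked earlier about the Stream input strategy, which requires the input to be standard input. For file input, does xlearning support splitting files automatically?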

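tensorflow demo fails

A question: when running the tensorflow demo, I get the error "ImportError: libcublas.so.9.0: cannot open shared object file: No such file or directory".
My environment: python3.6 managed by anaconda, tensorflow-gpu1.11, cuda9.0, hadoop2.7.7, master branch.
Running tensorflow-gpu sample code on its own outside XLearning gives no error.
Is there anything else that needs to be configured?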

FATAL ApplicationMaster: Error running ApplicationMaster

Environment:
1.hdfs
`

Started: Thu Apr 19 16:28:15 +0800 2018
3.1.0, r16b70619a24cdcf5d3b0fcf4b58ca77238ccbe6d
Fri Mar 30 08:00:00 +0800 2018 by centos from branch-3.1.0
CID-ea3f6bd7-9801-4a0d-a80e-e60465bb928f
BP-232525608-14.29.85.83-1522829491235

2.xlearning:
xlearning-gpu-beta
commit c732e13
`

Error message:

18/04/19 16:30:01 FATAL ApplicationMaster: Error running ApplicationMaster
java.lang.RuntimeException: Error while build container local resource
        at net.qihoo.xlearning.AM.ApplicationMaster.buildContainerLocalResource(ApplicationMaster.java:764)
        at net.qihoo.xlearning.AM.ApplicationMaster.run(ApplicationMaster.java:1171)
        at net.qihoo.xlearning.AM.ApplicationMaster.main(ApplicationMaster.java:1525)
Caused by: java.io.FileNotFoundException: File does not exist: hdfs://gpu1:8020/tmp/XLearning/staging/application_1523879759427_0061/AppMaster.jar
        at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1573)
        at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1566)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1581)
        at net.qihoo.xlearning.util.Utilities.createApplicationResource(Utilities.java:121)
        at net.qihoo.xlearning.AM.ApplicationMaster.buildContainerLocalResource(ApplicationMaster.java:677)
        ... 2 more
18/04/19 16:30:01 INFO ApplicationMaster: Deleting the staging file successed.

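How do I shut down the server in distributed tensorflow?

I create two ps servers and two worker clients, run the computation, and then exit. The problem is that after the two worker clients finish and exit, the ps server containers do not exit, because they are still blocked in server.join().

My questions are:

  1. Why does the ps server not exit?
  2. How can the server be shut down after the clients finish computing?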

Running the demo fails; the log is below. Please take a look, I'm a beginner.

18/03/10 17:16:24 INFO ApplicationMaster: Application appId=1, clustertimestamp=1520673372220, attemptId=1
18/03/10 17:16:24 INFO ApplicationMaster: Application files location: file:/tmp/XLearning/staging/application_1520673372220_0001/demo.py,file:/tmp/XLearning/staging/application_1520673372220_0001/dataDeal.py
18/03/10 17:16:24 INFO ApplicationMaster: Application jar location: file:/tmp/XLearning/staging/application_1520673372220_0001/AppMaster.jar
18/03/10 17:16:24 INFO ApplicationMaster: Application conf location: file:/tmp/XLearning/staging/application_1520673372220_0001/core-site.xml
18/03/10 17:16:24 INFO ApplicationMaster: XLearning exec command: python demo.py --data_path=./data --save_path=./model --log_dir=./eventLog --training_epochs=10
18/03/10 17:16:24 INFO ApplicationMaster: XLearning app type: TENSORFLOW
18/03/10 17:16:24 INFO NMClientAsyncImpl: Upper bound of the thread pool size is 500
18/03/10 17:16:24 INFO ContainerManagementProtocolProxy: yarn.client.max-cached-nodemanagers-proxies : 0
18/03/10 17:16:24 INFO RMProxy: Connecting to ResourceManager at /0.0.0.0:8030
18/03/10 17:16:24 INFO ApplicationMessageService: Starting application message server
18/03/10 17:16:24 INFO CallQueueManager: Using callQueue: class java.util.concurrent.LinkedBlockingQueue queueCapacity: 100 scheduler: class org.apache.hadoop.ipc.DefaultRpcScheduler
18/03/10 17:16:24 INFO Server: Starting Socket Reader #1 for port 35894
18/03/10 17:16:24 INFO Server: IPC Server Responder: starting
18/03/10 17:16:24 INFO Server: IPC Server listener on 35894: starting
18/03/10 17:16:24 INFO ApplicationMessageService: Started application message server at localhost.localdomain/127.0.0.1:35894
18/03/10 17:16:24 INFO ApplicationContainerListener: Starting application web server
18/03/10 17:16:24 INFO log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
18/03/10 17:16:24 INFO AuthenticationFilter: Unable to initialize FileSignerSecretProvider, falling back to use random secrets.
18/03/10 17:16:24 INFO HttpRequestLog: Http request log for http.requests.proxy is not defined
18/03/10 17:16:24 INFO HttpServer2: Added global filter 'safety' (class=org.apache.hadoop.http.HttpServer2$QuotingInputFilter)
18/03/10 17:16:24 INFO HttpServer2: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context proxy
18/03/10 17:16:24 INFO HttpServer2: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context static
18/03/10 17:16:24 INFO HttpServer2: adding path spec: /proxy/*
18/03/10 17:16:25 INFO WebApps: Registered webapp guice modules
18/03/10 17:16:25 INFO HttpServer2: Jetty bound to port 42921
18/03/10 17:16:25 INFO log: jetty-6.1.26
18/03/10 17:16:25 INFO log: Extract jar:file:/home/larry/tools/hadoop-2.8.3/share/hadoop/yarn/hadoop-yarn-common-2.8.3.jar!/webapps/proxy to /tmp/Jetty_0_0_0_0_42921_proxy____yzbxeb/webapp
18/03/10 17:16:25 INFO log: NO JSP Support for /static/xlWebApp, did not find org.apache.jasper.servlet.JspServlet
18/03/10 17:16:25 INFO log: Started [email protected]:42921
18/03/10 17:16:25 INFO ApplicationContainerListener: Web app proxy started at 42921
18/03/10 17:16:25 INFO ApplicationContainerListener: Starting application containers handler server
18/03/10 17:16:25 INFO CallQueueManager: Using callQueue: class java.util.concurrent.LinkedBlockingQueue queueCapacity: 100 scheduler: class org.apache.hadoop.ipc.DefaultRpcScheduler
18/03/10 17:16:25 INFO Server: Starting Socket Reader #1 for port 36827
18/03/10 17:16:25 INFO Server: IPC Server Responder: starting
18/03/10 17:16:25 INFO Server: IPC Server listener on 36827: starting
18/03/10 17:16:25 INFO ApplicationContainerListener: Container timeout monitor thread had started
18/03/10 17:16:25 INFO ApplicationMaster: master tracking url:localhost:42921
18/03/10 17:16:25 INFO ApplicationMaster: history url: 0.0.0.0:19886/jobhistory/job/application_1520673372220_0001
18/03/10 17:16:25 INFO ApplicationMaster: ApplicationMaster Starting ...
18/03/10 17:16:25 INFO Utilities: input path: file:/tmp/data/tensorflow
18/03/10 17:16:25 INFO ApplicationMaster: XLearning application needs 1 worker and 1 ps containers in fact
18/03/10 17:16:25 INFO ApplicationMaster: Create worker container request: Capability[<memory:4096, vCores:1>]Priority[3]
18/03/10 17:16:25 INFO ApplicationMaster: Create ps container request: Capability[<memory:1024, vCores:1>]Priority[3]
18/03/10 17:16:25 INFO ApplicationMaster: Try to allocate 1 ps/server containers
18/03/10 17:16:27 INFO AMRMClientImpl: Received new token for : localhost:40131
18/03/10 17:16:27 INFO RMCallbackHandler: Acquired container container_1520673372220_0001_01_000002 on host localhost , with the resource <memory:1024, vCores:1>
18/03/10 17:16:27 INFO RMCallbackHandler: Current acquired worker container 0 / 1 ps container 1 / 1
18/03/10 17:16:27 INFO ApplicationMaster: Total 1 ps containers has allocated.
18/03/10 17:16:27 INFO ApplicationMaster: Try to allocate 1 worker containers
18/03/10 17:16:29 INFO RMCallbackHandler: Acquired container container_1520673372220_0001_01_000003 on host localhost , with the resource <memory:4096, vCores:1>
18/03/10 17:16:29 INFO RMCallbackHandler: Current acquired worker container 1 / 1 ps container 1 / 1
18/03/10 17:16:29 INFO ApplicationMaster: Total 1 worker containers has allocated.
18/03/10 17:16:29 INFO ApplicationMaster: Initializing container_1520673372220_0001_01_000003 input splits
Exception in thread "main" java.lang.NoSuchMethodError: java.util.concurrent.ConcurrentHashMap.keySet()Ljava/util/concurrent/ConcurrentHashMap$KeySetView;
at net.qihoo.xlearning.AM.ApplicationMaster.allocateInputSplits(ApplicationMaster.java:511)
at net.qihoo.xlearning.AM.ApplicationMaster.run(ApplicationMaster.java:1139)
at net.qihoo.xlearning.AM.ApplicationMaster.main(ApplicationMaster.java:1475)

Error when running the demo

Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.yarn.webapp.WebApps$Builder.build(Lorg/apache/hadoop/yarn/webapp/WebApp;)Lorg/apache/hadoop/yarn/webapp/WebApp;
at net.qihoo.xlearning.AM.ApplicationWebService.start(ApplicationWebService.java:35)
at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
at net.qihoo.xlearning.AM.ApplicationMaster.init(ApplicationMaster.java:217)
at net.qihoo.xlearning.AM.ApplicationMaster.main(ApplicationMaster.java:1245)

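TensorFlow version compatibility and model saving

We are using XLearning to test distributed TensorFlow model training and have run into some problems:

  1. What is the highest TensorFlow version that XLearning currently supports? The test scripts provided in the examples fail on version 1.10 but work on version 1.3.

  2. Could you show how to save a pb model file? Python code that saves a pb file correctly on a local machine raises an error when it is run through XLearning.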

IllegalArgumentException

src/main/java/net/qihoo/xlearning/AM/ApplicationMaster.java:823
updateBlacklist.invoke(amrmAsync, blackHosts) throws the exception "java.lang.IllegalArgumentException: wrong number of arguments"

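XLearning fails at the end when two users run the demo

The example being run is xlearning/examples/tensorflow/run.sh.
The job fails when it reaches 95%. On the worker node, the error is a directory permission problem:
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=bing.wb, access=WRITE, inode="/tmp/XLearning/eventLog":xxxxxxx:supergroup:drwxr-xr-x
The directory belongs to xxxxxx, a colleague who ran the command earlier and created it, so my job now fails.
However, I had already redirected eventLog before running the demo, and that change does not seem to have taken effect.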

[[email protected] /home/bing.wb/xlearning/conf]
$grep -b2  event xlearning-site.xml
1450-    <property>
1465-        <name>xlearning.tf.board.history.dir</name>
1517:        <value>/tmp/bing.wb/XLearning/eventLog</value>
1572-    </property>
1588-    <property>
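
The reasoning behind the AccessControlException above can be sketched in a few lines of Python. This is only an illustration of the basic owner/group/other check, assuming no ACLs and no superuser privileges are involved; the function names `parse_inode_field` and `can_write` are hypothetical helpers and not part of XLearning or Hadoop.

```python
def parse_inode_field(field):
    """Split the 'owner:group:mode' text that follows inode="..." in the error."""
    owner, group, mode = field.split(":")
    return owner, group, mode


def can_write(user, user_groups, owner, group, mode):
    """Return True if the permission string grants `user` write access.

    `mode` is the 10-character string from the message, e.g. "drwxr-xr-x".
    ACLs and superuser rights are deliberately ignored (assumption).
    """
    perms = mode[1:]  # drop the leading file-type character
    if user == owner:
        bits = perms[0:3]   # owner triplet
    elif group in user_groups:
        bits = perms[3:6]   # group triplet
    else:
        bits = perms[6:9]   # "other" triplet
    return bits[1] == "w"


owner, group, mode = parse_inode_field("xxxxxxx:supergroup:drwxr-xr-x")

# The group membership of bing.wb is not stated in the report, so both cases are tried.
print(can_write("bing.wb", [], owner, group, mode))              # False
print(can_write("bing.wb", ["supergroup"], owner, group, mode))  # False
print(can_write("xxxxxxx", [], owner, group, mode))              # True
```

With mode drwxr-xr-x only the owner has the write bit, so any second user writing to the same directory is denied regardless of group membership. Which configuration property actually controls the /tmp/XLearning/eventLog path cannot be determined from the report.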

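Model files cannot be found after running the TensorFlow demo

After running the official TensorFlow demo, no model files are found under /tmp/tensorflow_model on HDFS; as the attached screenshot shows, the container directories are all empty. However, both YARN and the shell report that the run succeeded. I do not know what is going wrong.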

FATAL Client: Error running Client

Hi, I followed your steps, but when I run the following command it fails:

$XLEARNING_HOME/bin/xl-submit --app-type "tensorflow" --app-name "tf-demo" --input /tmp/data/tensorflow#data --output /tmp/tensorflow_model#model --files demo.py,dataDeal.py --launch-cmd "python demo.py --data_path=./data --save_path=./model --log_dir=./eventLog --training_epochs=10" --worker-memory 2G --worker-num 2 --worker-cores 3 --ps-memory 1G --ps-num 1

The error output is:

17/12/13 02:57:28 INFO Client: Submitting application to ResourceManager
17/12/13 02:57:28 FATAL Client: Error running Client
java.lang.RuntimeException: Application submitAndMonitor failed!
at net.qihoo.xlearning.client.Client.submitAndMonitor(Client.java:594)
at net.qihoo.xlearning.client.Client.main(Client.java:665)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
How can I solve this problem? Can you help me? Thanks!

Resource request to YARN times out in the gpu-beta version

The cluster does have resources available.
AM log:
18/04/12 10:53:03 INFO ResourceUtils: Adding resource type - name = yarn.io/gpu, units = , type = COUNTABLE
18/04/12 10:53:04 INFO Utilities: input path: hdfs://gpu1:8020/user/tmp/data/tensorflow
18/04/12 10:53:04 INFO ApplicationMaster: XLearning application needs 2 worker and 1 ps containers in fact
18/04/12 10:53:04 INFO ApplicationMaster: Create worker container request: Capability[<memory:4096, vCores:2, yarn.io/gpu: 2>]Priority[3]AllocationRequestId[0]ExecutionTypeRequest[{Execution Type: GUARANTEED, Enforce Execution Type: false}]Resource Profile[null]
18/04/12 10:53:04 INFO ApplicationMaster: Create ps container request: Capability[<memory:4096, vCores:2>]Priority[3]AllocationRequestId[0]ExecutionTypeRequest[{Execution Type: GUARANTEED, Enforce Execution Type: false}]Resource Profile[null]
18/04/12 10:53:04 INFO ApplicationMaster: Try to allocate 1 ps/server containers
18/04/12 10:53:05 INFO RMCallbackHandler: Acquired container container_1523451270416_0005_01_000002 on host gpu3 , with the resource <memory:4096, vCores:2>
18/04/12 10:53:05 INFO RMCallbackHandler: Current acquired worker container 0 / 2 ps container 1 / 1
18/04/12 10:53:06 INFO ApplicationMaster: Total 1 ps containers has allocated.
18/04/12 10:53:06 INFO ApplicationMaster: Try to allocate 2 worker containers
18/04/12 10:53:07 INFO RMCallbackHandler: Acquired container container_1523451270416_0005_01_000003 on host gpu3 , with the resource <memory:4096, vCores:2, yarn.io/gpu: 2>
18/04/12 10:53:07 INFO RMCallbackHandler: Current acquired worker container 1 / 2 ps container 1 / 1
18/04/12 11:03:08 INFO ApplicationMaster: Container waiting except the allocated expiry time. Maybe the Cluster available resources are not satisfied the user need. Please resubmit !
18/04/12 11:03:08 INFO ApplicationMaster: Unregister Application
18/04/12 11:03:08 INFO AMRMClientImpl: Waiting for application to be successfully unregistered.
18/04/12 11:03:08 INFO ApplicationMaster: Application failed.
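
To see quickly where allocation stalled in an AM log like the one above, the "Current acquired" lines can be extracted with a regular expression. The following is a minimal sketch; the pattern is inferred only from the log lines quoted in this report, and the function name `last_allocation_status` is a hypothetical helper.

```python
import re

# Pattern inferred from the RMCallbackHandler lines quoted in the AM log.
ACQUIRED = re.compile(
    r"Current acquired worker container (\d+) / (\d+) ps container (\d+) / (\d+)"
)


def last_allocation_status(log_text):
    """Return the last reported (worker_got, worker_need, ps_got, ps_need), or None."""
    matches = ACQUIRED.findall(log_text)
    if not matches:
        return None
    return tuple(int(value) for value in matches[-1])


sample = (
    "18/04/12 10:53:05 INFO RMCallbackHandler: Current acquired worker container 0 / 2 ps container 1 / 1\n"
    "18/04/12 10:53:07 INFO RMCallbackHandler: Current acquired worker container 1 / 2 ps container 1 / 1\n"
    "18/04/12 11:03:08 INFO ApplicationMaster: Application failed.\n"
)

worker_got, worker_need, ps_got, ps_need = last_allocation_status(sample)
print(worker_need - worker_got)  # 1 worker container was never acquired
print(ps_need - ps_got)          # 0
```

In this log the ps container and one worker were granted, and the application gave up roughly ten minutes later while still waiting for the second worker.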
ResourceManager log:
clusterResource=<memory:400000, vCores:36, yarn.io/gpu: 8>

run.sh:
--worker-memory 4G \
--worker-num 2 \
--worker-cores 2 \
--worker-gcores 2 \
--ps-memory 4G \
--ps-num 1 \
--ps-cores 2 \
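
As a sanity check, the total request implied by these options can be compared with the clusterResource line from the ResourceManager log. The sketch below uses the per-container values shown in the AM log (4096 MB, 2 vcores, 2 GPUs per worker); the dictionary keys and the function name `total_request` are illustrative assumptions, and the AM's own container is not counted.

```python
def total_request(roles):
    """Sum per-container resources over all containers of every role."""
    totals = {}
    for role in roles:
        for name, amount in role["per_container"].items():
            totals[name] = totals.get(name, 0) + amount * role["count"]
    return totals


# Values taken from run.sh and the container requests in the AM log.
roles = [
    {"name": "worker", "count": 2,
     "per_container": {"memory_mb": 4096, "vcores": 2, "gpu": 2}},
    {"name": "ps", "count": 1,
     "per_container": {"memory_mb": 4096, "vcores": 2}},
]

# clusterResource=<memory:400000, vCores:36, yarn.io/gpu: 8>
cluster = {"memory_mb": 400000, "vcores": 36, "gpu": 8}

requested = total_request(roles)
print(requested)  # {'memory_mb': 12288, 'vcores': 6, 'gpu': 4}

# Resources for which the request exceeds the cluster total.
shortfall = {name: amount - cluster.get(name, 0)
             for name, amount in requested.items()
             if amount > cluster.get(name, 0)}
print(shortfall)  # {}
```

The totals fit within the cluster-wide capacity, but clusterResource is the total rather than what is currently free, and it says nothing about how the GPUs are spread across nodes or queues. This comparison therefore only rules out the request being larger than the whole cluster.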

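Error when integrating Kerberos

For the JobHistoryServer, the following properties were added to xlearning-site.xml:

<property>
    <name>xlearning.history.keytab</name>
    <value>/var/run/cloudera-scm-agent/process/3001-hive-HIVESERVER2/hive.keytab</value>
</property>
<property>
    <name>xlearning.history.principal</name>
    <value>hive/bd129118@MYCDH</value>
</property>

The service starts successfully, but running the demo fails with the error below. The Kerberos tickets on every machine in the cluster are valid.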
18/01/18 10:33:26 INFO Client: Application report for application_1516178233465_0044 (state: RUNNING)
18/01/18 10:33:26 WARN UserGroupInformation: PriviledgedActionException as:hive/bd129118@MYCDH (auth:KERBEROS) cause:org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[KERBEROS]
18/01/18 10:33:26 WARN Client: Exception encountered while connecting to the server : org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[KERBEROS]
18/01/18 10:33:26 WARN UserGroupInformation: PriviledgedActionException as:hive/bd129118@MYCDH (auth:KERBEROS) cause:java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[KERBEROS]
18/01/18 10:33:26 WARN Client: Connecting to ResourceManager failed, try again later
java.lang.reflect.UndeclaredThrowableException
at com.sun.proxy.$Proxy21.fetchApplicationMessages(Unknown Source)
at net.qihoo.xlearning.client.Client.waitCompleted(Client.java:682)
at net.qihoo.xlearning.client.Client.submitAndMonitor(Client.java:643)
at net.qihoo.xlearning.client.Client.main(Client.java:711)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: java.io.IOException: Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[KERBEROS]; Host Details : local host is: "bd129118/192.168.129.118"; destination host is: "bd129120":10079;
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
at org.apache.hadoop.ipc.Client.call(Client.java:1476)
at org.apache.hadoop.ipc.Client.call(Client.java:1409)
at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:243)
... 10 more
Caused by: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[KERBEROS]
at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:688)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:651)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:739)
at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:376)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1525)
at org.apache.hadoop.ipc.Client.call(Client.java:1448)
... 12 more
Caused by: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[KERBEROS]
at org.apache.hadoop.security.SaslRpcClient.selectSaslClient(SaslRpcClient.java:172)
at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:396)
at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:561)
at org.apache.hadoop.ipc.Client$Connection.access$1900(Client.java:376)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:731)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:727)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:726)
... 15 more
18/01/18 10:33:27 INFO Client: Application report for application_1516178233465_0044 (state: RUNNING)

TensorFlow demo occasionally fails

18/03/12 20:50:33 INFO XLearningContainer: WARNING:tensorflow:From demo.py:75: init (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
18/03/12 20:50:33 INFO XLearningContainer: Instructions for updating:
18/03/12 20:50:33 INFO XLearningContainer: Please switch to tf.train.MonitoredTrainingSession
18/03/12 20:50:33 INFO XLearningContainer: 2018-03-12 20:50:33.961848: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: OS Error
18/03/12 20:50:33 INFO XLearningContainer: Traceback (most recent call last):
18/03/12 20:50:33 INFO XLearningContainer: File "demo.py", line 173, in
18/03/12 20:50:33 INFO XLearningContainer: tf.app.run(main=main)
18/03/12 20:50:33 INFO XLearningContainer: File "/usr/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 126, in run
18/03/12 20:50:33 INFO XLearningContainer: _sys.exit(main(argv))
18/03/12 20:50:33 INFO XLearningContainer: File "demo.py", line 76, in main
18/03/12 20:50:33 INFO XLearningContainer: with sv.prepare_or_wait_for_session(server.target, config = tf.ConfigProto(gpu_options=gpu_options, allow_soft_placement = True, log_device_placement = True)) as sess:
18/03/12 20:50:33 INFO XLearningContainer: File "/usr/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 726, in prepare_or_wait_for_session
18/03/12 20:50:33 INFO XLearningContainer: init_feed_dict=self._init_feed_dict, init_fn=self._init_fn)
18/03/12 20:50:33 INFO XLearningContainer: File "/usr/lib/python2.7/site-packages/tensorflow/python/training/session_manager.py", line 281, in prepare_session
18/03/12 20:50:33 INFO XLearningContainer: sess.run(init_op, feed_dict=init_feed_dict)
18/03/12 20:50:33 INFO XLearningContainer: File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 905, in run
18/03/12 20:50:33 INFO XLearningContainer: run_metadata_ptr)
18/03/12 20:50:33 INFO XLearningContainer: File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1137, in _run
18/03/12 20:50:33 INFO XLearningContainer: feed_dict_tensor, options, run_metadata)
18/03/12 20:50:33 INFO XLearningContainer: File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1355, in _do_run
18/03/12 20:50:33 INFO XLearningContainer: options, run_metadata)
18/03/12 20:50:33 INFO XLearningContainer: File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1374, in _do_call
18/03/12 20:50:33 INFO XLearningContainer: raise type(e)(node_def, op, message)
18/03/12 20:50:33 INFO XLearningContainer: tensorflow.python.framework.errors_impl.UnavailableError: OS Error

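Progress bar is not displayed

I print report:progress:0.775 to standard error, but the progress still does not appear on the job monitoring page. What should I change so that the progress bar is displayed?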
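
For reference, a minimal Python sketch of emitting such progress lines is shown below. The line format is taken from the report above; writing one line per update, clamping the value to the range 0 to 1, and flushing immediately are assumptions about what a line-based log reader might expect, not confirmed XLearning requirements. The function name `report_progress` is hypothetical.

```python
import sys


def report_progress(fraction):
    """Write one progress line to standard error and flush it immediately."""
    fraction = min(max(float(fraction), 0.0), 1.0)  # clamp to [0, 1] (assumption)
    sys.stderr.write("report:progress:%.3f\n" % fraction)
    sys.stderr.flush()  # avoid the line sitting in a buffer until exit


# Example: report progress after each of four training steps.
for step in range(1, 5):
    report_progress(step / 4.0)
```

If the lines do reach the container's stderr log but the page still shows nothing, the cause is likely elsewhere and this sketch will not help.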
