qihoo360 / xlearning

18/03/12 20:50:33 INFO XLearningContainer: WARNING:tensorflow:From demo.py:75: init (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
18/03/12 20:50:33 INFO XLearningContainer: Instructions for updating:
18/03/12 20:50:33 INFO XLearningContainer: Please switch to tf.train.MonitoredTrainingSession
18/03/12 20:50:33 INFO XLearningContainer: 2018-03-12 20:50:33.961848: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: OS Error
18/03/12 20:50:33 INFO XLearningContainer: Traceback (most recent call last):
18/03/12 20:50:33 INFO XLearningContainer: File "demo.py", line 173, in <module>
18/03/12 20:50:33 INFO XLearningContainer: tf.app.run(main=main)
18/03/12 20:50:33 INFO XLearningContainer: File "/usr/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 126, in run
18/03/12 20:50:33 INFO XLearningContainer: _sys.exit(main(argv))
18/03/12 20:50:33 INFO XLearningContainer: File "demo.py", line 76, in main
18/03/12 20:50:33 INFO XLearningContainer: with sv.prepare_or_wait_for_session(server.target, config = tf.ConfigProto(gpu_options=gpu_options, allow_soft_placement = True, log_device_placement = True)) as sess:
18/03/12 20:50:33 INFO XLearningContainer: File "/usr/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 726, in prepare_or_wait_for_session
18/03/12 20:50:33 INFO XLearningContainer: init_feed_dict=self._init_feed_dict, init_fn=self._init_fn)
18/03/12 20:50:33 INFO XLearningContainer: File "/usr/lib/python2.7/site-packages/tensorflow/python/training/session_manager.py", line 281, in prepare_session
18/03/12 20:50:33 INFO XLearningContainer: sess.run(init_op, feed_dict=init_feed_dict)
18/03/12 20:50:33 INFO XLearningContainer: File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 905, in run
18/03/12 20:50:33 INFO XLearningContainer: run_metadata_ptr)
18/03/12 20:50:33 INFO XLearningContainer: File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1137, in _run
18/03/12 20:50:33 INFO XLearningContainer: feed_dict_tensor, options, run_metadata)
18/03/12 20:50:33 INFO XLearningContainer: File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1355, in _do_run
18/03/12 20:50:33 INFO XLearningContainer: options, run_metadata)
18/03/12 20:50:33 INFO XLearningContainer: File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1374, in _do_call
18/03/12 20:50:33 INFO XLearningContainer: raise type(e)(node_def, op, message)
18/03/12 20:50:33 INFO XLearningContainer: tensorflow.python.framework.errors_impl.UnavailableError: OS Error

Application application_1510661908139_1155931 failed 3 times due to AM Container for appattempt_1510661908139_1155931_000003 exited with exitCode: -1000
For more detailed output, check application tracking page:http://xxx:8088/cluster/app/application_1510661908139_1155931Then, click on links to logs of each attempt.
Diagnostics: Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]; Host Details : local host is: "hadoopxxx"; destination host is: "xxx":8020;
java.io.IOException: Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]; Host Details : local host is: "xxx"; destination host is: "xxx":8020;
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
at org.apache.hadoop.ipc.Client.call(Client.java:1479)
at org.apache.hadoop.ipc.Client.call(Client.java:1412)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy11.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:771)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy12.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2108)
at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1305)
at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1301)
at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253)
at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:63)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:361)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:359)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]
at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:687)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:650)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:737)
at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1528)
at org.apache.hadoop.ipc.Client.call(Client.java:1451)

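Question about how TensorFlow standalone and distributed modes are distinguished

The FAQ says that for TensorFlow, standalone mode and distributed mode are told apart by the number of PS. If PS=1 and Worker=2, does that count as standalone or distributed?

Does the command line support multiple inputs and outputs?

Can the xl-submit command line take multiple inputs and multiple outputs?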

Demo fails to run

Following the steps in the README, running sh run.sh fails:

Error message:
Error: A JNI error has occurred, please check your installation and try again
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/mapred/JobConf
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
at java.lang.Class.privateGetMethodRecursive(Class.java:3048)
at java.lang.Class.getMethod0(Class.java:3018)
at java.lang.Class.getMethod(Class.java:1784)
at sun.launcher.LauncherHelper.validateMainClass(LauncherHelper.java:544)
at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:526)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.mapred.JobConf
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 7 more

Has anyone run into the same problem, and is there a fix?

Demo run fails; the log is below. Please take a look (I'm new to this)

18/03/10 17:16:24 INFO ApplicationMaster: Application appId=1, clustertimestamp=1520673372220, attemptId=1
18/03/10 17:16:24 INFO ApplicationMaster: Application files location: file:/tmp/XLearning/staging/application_1520673372220_0001/demo.py,file:/tmp/XLearning/staging/application_1520673372220_0001/dataDeal.py
18/03/10 17:16:24 INFO ApplicationMaster: Application jar location: file:/tmp/XLearning/staging/application_1520673372220_0001/AppMaster.jar
18/03/10 17:16:24 INFO ApplicationMaster: Application conf location: file:/tmp/XLearning/staging/application_1520673372220_0001/core-site.xml
18/03/10 17:16:24 INFO ApplicationMaster: XLearning exec command: python demo.py --data_path=./data --save_path=./model --log_dir=./eventLog --training_epochs=10
18/03/10 17:16:24 INFO ApplicationMaster: XLearning app type: TENSORFLOW
18/03/10 17:16:24 INFO NMClientAsyncImpl: Upper bound of the thread pool size is 500
18/03/10 17:16:24 INFO ContainerManagementProtocolProxy: yarn.client.max-cached-nodemanagers-proxies : 0
18/03/10 17:16:24 INFO RMProxy: Connecting to ResourceManager at /0.0.0.0:8030
18/03/10 17:16:24 INFO ApplicationMessageService: Starting application message server
18/03/10 17:16:24 INFO CallQueueManager: Using callQueue: class java.util.concurrent.LinkedBlockingQueue queueCapacity: 100 scheduler: class org.apache.hadoop.ipc.DefaultRpcScheduler
18/03/10 17:16:24 INFO Server: Starting Socket Reader #1 for port 35894
18/03/10 17:16:24 INFO Server: IPC Server Responder: starting
18/03/10 17:16:24 INFO Server: IPC Server listener on 35894: starting
18/03/10 17:16:24 INFO ApplicationMessageService: Started application message server at localhost.localdomain/127.0.0.1:35894
18/03/10 17:16:24 INFO ApplicationContainerListener: Starting application web server
18/03/10 17:16:24 INFO log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
18/03/10 17:16:24 INFO AuthenticationFilter: Unable to initialize FileSignerSecretProvider, falling back to use random secrets.
18/03/10 17:16:24 INFO HttpRequestLog: Http request log for http.requests.proxy is not defined
18/03/10 17:16:24 INFO HttpServer2: Added global filter 'safety' (class=org.apache.hadoop.http.HttpServer2$QuotingInputFilter)
18/03/10 17:16:24 INFO HttpServer2: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context proxy
18/03/10 17:16:24 INFO HttpServer2: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context static
18/03/10 17:16:24 INFO HttpServer2: adding path spec: /proxy/*
18/03/10 17:16:25 INFO WebApps: Registered webapp guice modules
18/03/10 17:16:25 INFO HttpServer2: Jetty bound to port 42921
18/03/10 17:16:25 INFO log: jetty-6.1.26
18/03/10 17:16:25 INFO log: Extract jar:file:/home/larry/tools/hadoop-2.8.3/share/hadoop/yarn/hadoop-yarn-common-2.8.3.jar!/webapps/proxy to /tmp/Jetty_0_0_0_0_42921_proxy____yzbxeb/webapp
18/03/10 17:16:25 INFO log: NO JSP Support for /static/xlWebApp, did not find org.apache.jasper.servlet.JspServlet
18/03/10 17:16:25 INFO log: Started [email protected]:42921
18/03/10 17:16:25 INFO ApplicationContainerListener: Web app proxy started at 42921
18/03/10 17:16:25 INFO ApplicationContainerListener: Starting application containers handler server
18/03/10 17:16:25 INFO CallQueueManager: Using callQueue: class java.util.concurrent.LinkedBlockingQueue queueCapacity: 100 scheduler: class org.apache.hadoop.ipc.DefaultRpcScheduler
18/03/10 17:16:25 INFO Server: Starting Socket Reader #1 for port 36827
18/03/10 17:16:25 INFO Server: IPC Server Responder: starting
18/03/10 17:16:25 INFO Server: IPC Server listener on 36827: starting
18/03/10 17:16:25 INFO ApplicationContainerListener: Container timeout monitor thread had started
18/03/10 17:16:25 INFO ApplicationMaster: master tracking url:localhost:42921
18/03/10 17:16:25 INFO ApplicationMaster: history url: 0.0.0.0:19886/jobhistory/job/application_1520673372220_0001
18/03/10 17:16:25 INFO ApplicationMaster: ApplicationMaster Starting ...
18/03/10 17:16:25 INFO Utilities: input path: file:/tmp/data/tensorflow
18/03/10 17:16:25 INFO ApplicationMaster: XLearning application needs 1 worker and 1 ps containers in fact
18/03/10 17:16:25 INFO ApplicationMaster: Create worker container request: Capability[<memory:4096, vCores:1>]Priority[3]
18/03/10 17:16:25 INFO ApplicationMaster: Create ps container request: Capability[<memory:1024, vCores:1>]Priority[3]
18/03/10 17:16:25 INFO ApplicationMaster: Try to allocate 1 ps/server containers
18/03/10 17:16:27 INFO AMRMClientImpl: Received new token for : localhost:40131
18/03/10 17:16:27 INFO RMCallbackHandler: Acquired container container_1520673372220_0001_01_000002 on host localhost , with the resource <memory:1024, vCores:1>
18/03/10 17:16:27 INFO RMCallbackHandler: Current acquired worker container 0 / 1 ps container 1 / 1
18/03/10 17:16:27 INFO ApplicationMaster: Total 1 ps containers has allocated.
18/03/10 17:16:27 INFO ApplicationMaster: Try to allocate 1 worker containers
18/03/10 17:16:29 INFO RMCallbackHandler: Acquired container container_1520673372220_0001_01_000003 on host localhost , with the resource <memory:4096, vCores:1>
18/03/10 17:16:29 INFO RMCallbackHandler: Current acquired worker container 1 / 1 ps container 1 / 1
18/03/10 17:16:29 INFO ApplicationMaster: Total 1 worker containers has allocated.
18/03/10 17:16:29 INFO ApplicationMaster: Initializing container_1520673372220_0001_01_000003 input splits
Exception in thread "main" java.lang.NoSuchMethodError: java.util.concurrent.ConcurrentHashMap.keySet()Ljava/util/concurrent/ConcurrentHashMap$KeySetView;
at net.qihoo.xlearning.AM.ApplicationMaster.allocateInputSplits(ApplicationMaster.java:511)
at net.qihoo.xlearning.AM.ApplicationMaster.run(ApplicationMaster.java:1139)
at net.qihoo.xlearning.AM.ApplicationMaster.main(ApplicationMaster.java:1475)

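TensorFlow distributed performance

Could you share benchmark data for distributed TensorFlow performance under this architecture?

Is Hadoop 3.1.1 supported?

train_and_evaluate in distributed mode

In distributed mode, train_and_evaluate never triggers evaluation. The TensorFlow documentation says an evaluator node has to be started and that this node is not part of the training cluster. How should this be handled under XLearning?

The only stop condition for train_and_evaluate is max_steps. Is there a good way to stop early based on a validation set, to prevent overfitting?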
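The validation-based stopping rule asked about here can be sketched independently of TensorFlow or XLearning. The following is a minimal, patience-based version in plain Python; `EarlyStopper`, `stopping_index`, `patience`, and `min_delta` are illustrative names and not part of either library. Wiring it into a training loop or an Estimator hook is left out.

```python
class EarlyStopper:
    """Signal a stop once validation loss has not improved for `patience` evaluations."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.stale_evals = 0

    def update(self, val_loss):
        """Record one validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.stale_evals = 0
        else:
            self.stale_evals += 1
        return self.stale_evals >= self.patience


def stopping_index(losses, patience=3):
    """Return the index of the evaluation that triggers the stop, or None."""
    stopper = EarlyStopper(patience=patience)
    for i, loss in enumerate(losses):
        if stopper.update(loss):
            return i
    return None
```

A larger `patience` tolerates noisier validation curves at the cost of training longer after the best point.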

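Running the TensorFlow demo in an Anaconda environment

On a CentOS server, Anaconda is configured as the Python environment.
Running run.sh reports "no such file or directory". What should I do?

Requesting resources from YARN times out in the gpu-beta version

The cluster does have resources available.
AM log: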
18/04/12 10:53:03 INFO ResourceUtils: Adding resource type - name = yarn.io/gpu, units = , type = COUNTABLE
18/04/12 10:53:04 INFO Utilities: input path: hdfs://gpu1:8020/user/tmp/data/tensorflow
18/04/12 10:53:04 INFO ApplicationMaster: XLearning application needs 2 worker and 1 ps containers in fact
18/04/12 10:53:04 INFO ApplicationMaster: Create worker container request: Capability[<memory:4096, vCores:2, yarn.io/gpu: 2>]Priority[3]AllocationRequestId[0]ExecutionTypeRequest[{Execution Type: GUARANTEED, Enforce Execution Type: false}]Resource Profile[null]
18/04/12 10:53:04 INFO ApplicationMaster: Create ps container request: Capability[<memory:4096, vCores:2>]Priority[3]AllocationRequestId[0]ExecutionTypeRequest[{Execution Type: GUARANTEED, Enforce Execution Type: false}]Resource Profile[null]
18/04/12 10:53:04 INFO ApplicationMaster: Try to allocate 1 ps/server containers
18/04/12 10:53:05 INFO RMCallbackHandler: Acquired container container_1523451270416_0005_01_000002 on host gpu3 , with the resource <memory:4096, vCores:2>
18/04/12 10:53:05 INFO RMCallbackHandler: Current acquired worker container 0 / 2 ps container 1 / 1
18/04/12 10:53:06 INFO ApplicationMaster: Total 1 ps containers has allocated.
18/04/12 10:53:06 INFO ApplicationMaster: Try to allocate 2 worker containers
18/04/12 10:53:07 INFO RMCallbackHandler: Acquired container container_1523451270416_0005_01_000003 on host gpu3 , with the resource <memory:4096, vCores:2, yarn.io/gpu: 2>
18/04/12 10:53:07 INFO RMCallbackHandler: Current acquired worker container 1 / 2 ps container 1 / 1
18/04/12 11:03:08 INFO ApplicationMaster: Container waiting except the allocated expiry time. Maybe the Cluster available resources are not satisfied the user need. Please resubmit !
18/04/12 11:03:08 INFO ApplicationMaster: Unregister Application
18/04/12 11:03:08 INFO AMRMClientImpl: Waiting for application to be successfully unregistered.
18/04/12 11:03:08 INFO ApplicationMaster: Application failed.
ResourceManager log:
clusterResource=<memory:400000, vCores:36, yarn.io/gpu: 8>

run.sh:
--worker-memory 4G \
--worker-num 2 \
--worker-cores 2 \
--worker-gcores 2 \
--ps-memory 4G \
--ps-num 1 \
--ps-cores 2 \
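As a quick sanity check on the numbers quoted in this post, the per-container requests from the AM log can be summed and compared with the clusterResource line from the ResourceManager log. This is only an aggregate check under two assumptions: the ApplicationMaster's own container is not counted, and per-node capacity (which the post does not show) is not modelled, so a passing check does not prove that any single node can host a second worker with 2 GPUs.

```python
from collections import Counter

# Per-container requests, copied from the AM log in the post.
WORKER = {"memory": 4096, "vCores": 2, "yarn.io/gpu": 2}
PS = {"memory": 4096, "vCores": 2}
# Cluster total, copied from the ResourceManager log in the post.
CLUSTER = {"memory": 400000, "vCores": 36, "yarn.io/gpu": 8}


def total_demand(worker_num, ps_num):
    """Sum the resources requested by all worker and ps containers."""
    demand = Counter()
    for _ in range(worker_num):
        demand.update(WORKER)
    for _ in range(ps_num):
        demand.update(PS)
    return dict(demand)


def fits(demand, capacity):
    """True if every requested resource is within the given capacity."""
    return all(capacity.get(name, 0) >= amount for name, amount in demand.items())


demand = total_demand(worker_num=2, ps_num=1)
print(demand)
print(fits(demand, CLUSTER))
```

If the aggregate fits but allocation still times out, the next things to compare are what each NodeManager reports individually and what other applications already hold.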

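if mode == tf.estimator.ModeKeys.PREDICT: in model_fn() is not supported

model = tf.estimator.Estimator(...)
ps=model.predict(...)

XLearning seems to skip ps=model.predict(...) without executing it and reports success directly.

I added a print call to the tf.estimator.ModeKeys.PREDICT branch of model_fn(), but the printed output never shows up in the log.
With model.evaluate(...), the print output inside model_fn() does show up.
Judging from these symptoms, model.predict(...) really is being ignored.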
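One possible explanation, which the post does not confirm: `tf.estimator.Estimator.predict` returns a generator, so nothing inside it runs (including `model_fn`) until the result is iterated. The standard-library sketch below imitates only that lazy behaviour; `FakeEstimator`, `model_fn`, and `calls` are hypothetical stand-ins and no TensorFlow is involved.

```python
calls = []  # records when the stand-in model_fn actually runs


def model_fn(mode):
    calls.append(mode)
    print("model_fn called with mode =", mode)


class FakeEstimator:
    """Hypothetical stand-in that mimics only the lazy shape of predict()."""

    def predict(self, inputs):
        # Because this function contains `yield`, its whole body is deferred
        # until the caller starts iterating over the returned generator.
        model_fn("PREDICT")
        for x in inputs:
            yield x * 2


model = FakeEstimator()
ps = model.predict([1, 2, 3])          # returns immediately; model_fn has not run
calls_before_iteration = list(calls)   # still empty at this point
results = list(ps)                     # iterating is what triggers model_fn
```

If this is the cause, consuming the return value (for example `for p in model.predict(...): ...`) should make the print output appear.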

ERROR Client: Application run failed!

Hi, I pulled the latest code and added "--queue default" to the end of run.sh. The run output is:
17/12/13 06:41:52 INFO Client: Copying /opt/XLearning/target/xlearning-1.0/lib/xlearning-1.0-hadoop2.7.3.jar to remote path hdfs://test-2:8020/tmp/XLearning/staging/application_1511938500942_0009/AppMaster.jar
17/12/13 06:41:52 INFO Client: Building environments for the application master
17/12/13 06:41:52 INFO Client: Copy xlearning files from local filesystem to remote.
17/12/13 06:41:52 INFO Client: Copying demo.py to remote path hdfs://test-2:8020/tmp/XLearning/staging/application_1511938500942_0009/demo.py
17/12/13 06:41:52 INFO Client: Copying dataDeal.py to remote path hdfs://test-2:8020/tmp/XLearning/staging/application_1511938500942_0009/dataDeal.py
17/12/13 06:41:52 INFO Client: Building application master launch command
17/12/13 06:41:52 INFO Client: Application master launch command: ${JAVA_HOME}/bin/java -Xms1024m -Xmx1024m net.qihoo.xlearning.AM.ApplicationMaster 1><LOG_DIR>/stdout 2><LOG_DIR>/stderr
17/12/13 06:41:52 INFO Client: Submitting application to ResourceManager
17/12/13 06:41:53 INFO YarnClientImpl: Submitted application application_1511938500942_0009
17/12/13 06:41:53 INFO Client: Application submitAndMonitor succeed
17/12/13 06:41:53 INFO Client: The url to track the job: http://test-2:8088/proxy/application_1511938500942_0009/
17/12/13 06:41:53 INFO Client: Application report for application_1511938500942_0009 (state: ACCEPTED)
17/12/13 06:41:54 INFO Client: Application report for application_1511938500942_0009 (state: ACCEPTED)
17/12/13 06:41:55 INFO Client: Application report for application_1511938500942_0009 (state: ACCEPTED)
17/12/13 06:41:56 INFO Client: Application report for application_1511938500942_0009 (state: ACCEPTED)
17/12/13 06:41:57 INFO Client: Application report for application_1511938500942_0009 (state: FAILED)
17/12/13 06:41:57 INFO Client: Application has completed with YarnApplicationState=FAILED and FinalApplicationStatus=FAILED
17/12/13 06:41:57 ERROR Client: Application run failed!

I checked the log files under $XLEARNING_HOME/logs. The error is "Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /hadoop/mapreduce/jhs/mr-jhs-state/LOCK: Resource temporarily unavailable", which looks like an I/O error.
More information follows:

17/12/13 06:56:03 INFO MetricsSystemImpl: Stopping JobHistoryServer metrics system...
17/12/13 06:56:03 INFO MetricsSystemImpl: JobHistoryServer metrics system stopped.
17/12/13 06:56:03 INFO MetricsSystemImpl: JobHistoryServer metrics system shutdown complete.
17/12/13 06:56:03 FATAL JobHistoryServer: Error starting JobHistoryServer
org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /hadoop/mapreduce/jhs/mr-jhs-state/LOCK: Resource temporarily unavailable
at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204)
at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
at net.qihoo.xlearning.jobhistory.JobHistoryServer.serviceStart(JobHistoryServer.java:218)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at net.qihoo.xlearning.jobhistory.JobHistoryServer.launchJobHistoryServer(JobHistoryServer.java:250)
at net.qihoo.xlearning.jobhistory.JobHistoryServer.main(JobHistoryServer.java:259)
Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /hadoop/mapreduce/jhs/mr-jhs-state/LOCK: Resource temporarily unavailable
at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
at org.apache.hadoop.mapreduce.v2.hs.HistoryServerLeveldbStateStoreService.startStorage(HistoryServerLeveldbStateStoreService.java:82)
at org.apache.hadoop.mapreduce.v2.hs.HistoryServerStateStoreService.serviceStart(HistoryServerStateStoreService.java:79)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
... 5 more
17/12/13 06:56:03 INFO ExitUtil: Exiting with status -1
17/12/13 06:56:03 INFO JobHistoryServer: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down JobHistoryServer at test-2/172.16.12.46
************************************************************/
Do you have any ideas about this? How can I solve it? Thank you.

Stuck waiting when connecting to the RM

After running run.sh for the TensorFlow demo, the following problem appears:
17/12/06 11:55:26 INFO RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
17/12/06 11:55:27 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
17/12/06 11:55:28 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
17/12/06 11:55:29 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
17/12/06 11:55:30 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
17/12/06 11:55:31 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
17/12/06 11:55:32 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
17/12/06 11:55:33 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
17/12/06 11:55:34 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
17/12/06 11:55:35 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
17/12/06 11:55:36 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
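The client in this log is dialing 0.0.0.0:8032. In Hadoop's yarn-default.xml, `yarn.resourcemanager.hostname` defaults to 0.0.0.0 and `yarn.resourcemanager.address` to port 8032 on that host, so this pattern usually suggests the client did not pick up a yarn-site.xml that names the real ResourceManager. That diagnosis is an assumption; the post does not show the configuration. The sketch below reads the two properties from a yarn-site.xml string. The sample host `rm-host.example` is made up, and `${...}` variable substitution is not handled.

```python
import xml.etree.ElementTree as ET


def read_hadoop_properties(xml_text):
    """Parse a Hadoop *-site.xml document into a {name: value} dict."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def resourcemanager_address(props):
    """Resolve the client-facing RM address, falling back to the Hadoop defaults."""
    host = props.get("yarn.resourcemanager.hostname", "0.0.0.0")
    return props.get("yarn.resourcemanager.address", host + ":8032")


# Hypothetical configuration used only for illustration.
SAMPLE = """<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>rm-host.example</value>
  </property>
</configuration>"""

print(resourcemanager_address(read_hadoop_properties(SAMPLE)))
print(resourcemanager_address({}))
```

Running the same function on the yarn-site.xml that the submitting shell actually sees (via HADOOP_CONF_DIR) shows whether the fallback address is in effect.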

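Can XLearning split TensorFlow input files automatically?

By default, XLearning runs TensorFlow jobs on the assumption that the user has already split the input into files, i.e. the number of ps must be less than or equal to the number of input files. As for input strategies, I asked earlier about the Stream strategy, which requires the input to come from standard input. For file input, does XLearning support splitting files automatically?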
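To illustrate why whole-file distribution needs at least as many files as consumers, here is a small round-robin assignment in plain Python. It is not XLearning's actual allocation code, which the post does not show; `assign_files` and the `part-*` names are hypothetical.

```python
def assign_files(files, num_containers):
    """Round-robin whole files across containers; files are never split."""
    assignment = [[] for _ in range(num_containers)]
    for i, path in enumerate(sorted(files)):
        assignment[i % num_containers].append(path)
    return assignment


print(assign_files(["part-0", "part-1", "part-2"], 2))
print(assign_files(["part-0"], 2))
```

With one file and two containers, the second container receives an empty list, which is the situation the "files >= containers" requirement avoids.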

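Does the Hadoop cluster need the corresponding deep learning runtime installed?

For example, to run MXNet, does every machine in the Hadoop cluster need the MXNet runtime installed, or is installing it on the client enough?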

FATAL Client: Error running Client

Hi, I followed your steps. When I run the command
$XLEARNING_HOME/bin/xl-submit --app-type "tensorflow" --app-name "tf-demo" --input /tmp/data/tensorflow#data --output /tmp/tensorflow_model#model --files demo.py,dataDeal.py --launch-cmd "python demo.py --data_path=./data --save_path=./model --log_dir=./eventLog --training_epochs=10" --worker-memory 2G --worker-num 2 --worker-cores 3 --ps-memory 1G --ps-num 1
it fails with the following error:

17/12/13 02:57:28 INFO Client: Submitting application to ResourceManager
17/12/13 02:57:28 FATAL Client: Error running Client
java.lang.RuntimeException: Application submitAndMonitor failed!
at net.qihoo.xlearning.client.Client.submitAndMonitor(Client.java:594)
at net.qihoo.xlearning.client.Client.main(Client.java:665)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
How can I solve this problem? Can you help me? Thanks.

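TensorFlow demo fails

When I run the TensorFlow demo, it fails with "ImportError: libcublas.so.9.0: cannot open shared object file: No such file or directory".
My environment: Python 3.6 managed by Anaconda, tensorflow-gpu 1.11, CUDA 9.0, Hadoop 2.7.7, XLearning master branch.
Running standalone tensorflow-gpu sample code outside XLearning does not produce the error.
Is there anything else I need to configure?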
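Since the same code works outside XLearning, one thing worth checking is whether the environment inside the YARN container differs from the login shell, for example `LD_LIBRARY_PATH`. That is a guess; the post does not confirm the cause. The diagnostic below could be placed at the top of the training script to print what the container actually sees. It only scans `LD_LIBRARY_PATH` and does not consult the dynamic loader's cache; `find_shared_library` is an illustrative helper name.

```python
import os


def find_shared_library(name, search_path=None):
    """Return the paths under the given search path (default: LD_LIBRARY_PATH) that contain `name`."""
    if search_path is None:
        search_path = os.environ.get("LD_LIBRARY_PATH", "")
    hits = []
    for directory in search_path.split(os.pathsep):
        candidate = os.path.join(directory, name)
        if directory and os.path.isfile(candidate):
            hits.append(candidate)
    return hits


print("LD_LIBRARY_PATH =", os.environ.get("LD_LIBRARY_PATH", "<unset>"))
print("libcublas.so.9.0 found at:", find_shared_library("libcublas.so.9.0"))
```

Comparing this output between a local run and the container log shows whether the CUDA library directory is missing from the container's environment.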

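Creating worker and ps containers fails after job submission (ln: command not found)

Hadoop version: 3.1.0
XL version: xlearning-gpu-beta

After the XL AM starts, it tells the NodeManager to run launch_container.sh to create the containers for the workers and ps. Running launch_container.sh produces the following error:

P.S. If the job is not submitted through XL and I just submit an MR job (wordcount) instead, the containers are created without problems.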

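Error when integrating Kerberos

For the JobHistoryServer, I added the following to xlearning-site.xml:

<property>
<name>xlearning.history.keytab</name>
<value>/var/run/cloudera-scm-agent/process/3001-hive-HIVESERVER2/hive.keytab</value>
</property>

<property>
<name>xlearning.history.principal</name>
<value>hive/bd129118@MYCDH</value>
</property>

The service starts successfully, but running the demo fails with the error below. Kerberos tickets are valid on every machine in the cluster.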
18/01/18 10:33:26 INFO Client: Application report for application_1516178233465_0044 (state: RUNNING)
18/01/18 10:33:26 WARN UserGroupInformation: PriviledgedActionException as:hive/bd129118@MYCDH (auth:KERBEROS) cause:org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[KERBEROS]
18/01/18 10:33:26 WARN Client: Exception encountered while connecting to the server : org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[KERBEROS]
18/01/18 10:33:26 WARN UserGroupInformation: PriviledgedActionException as:hive/bd129118@MYCDH (auth:KERBEROS) cause:java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[KERBEROS]
18/01/18 10:33:26 WARN Client: Connecting to ResourceManager failed, try again later
java.lang.reflect.UndeclaredThrowableException
at com.sun.proxy.$Proxy21.fetchApplicationMessages(Unknown Source)
at net.qihoo.xlearning.client.Client.waitCompleted(Client.java:682)
at net.qihoo.xlearning.client.Client.submitAndMonitor(Client.java:643)
at net.qihoo.xlearning.client.Client.main(Client.java:711)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: java.io.IOException: Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[KERBEROS]; Host Details : local host is: "bd129118/192.168.129.118"; destination host is: "bd129120":10079;
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
at org.apache.hadoop.ipc.Client.call(Client.java:1476)
at org.apache.hadoop.ipc.Client.call(Client.java:1409)
at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:243)
... 10 more
Caused by: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[KERBEROS]
at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:688)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:651)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:739)
at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:376)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1525)
at org.apache.hadoop.ipc.Client.call(Client.java:1448)
... 12 more
Caused by: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[KERBEROS]
at org.apache.hadoop.security.SaslRpcClient.selectSaslClient(SaslRpcClient.java:172)
at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:396)
at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:561)
at org.apache.hadoop.ipc.Client$Connection.access$1900(Client.java:376)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:731)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:727)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:726)
... 15 more
18/01/18 10:33:27 INFO Client: Application report for application_1516178233465_0044 (state: RUNNING)

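Cannot connect to an external database

My application needs to connect to an external database to fetch some information, but the connection fails.
pymysql.err.OperationalError: (2003, "Can't connect to MySQL server on '***' (timed out)")
Without XLearning, running directly on a single machine, the connection to this database works.
Does XLearning forbid access to external hosts during execution?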
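A timeout, as opposed to a refused connection or an authentication error, often points to network reachability from the node where the container runs rather than to the framework itself; the post does not establish the cause. A plain TCP probe run from inside the launched script can separate the two cases. `can_connect` is an illustrative helper, and the host and port in the usage comment are placeholders.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and name-resolution failures.
        return False


# Example usage inside the training script (placeholder host and port):
# print(socket.gethostname(), can_connect("db.example.internal", 3306))
```

If the probe fails only on cluster nodes, firewall rules or routing between those nodes and the database are the next thing to look at.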

How to use it on Mac

How can I use it on a Mac?

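Error when accessing the JobHistoryServer service

Hadoop version: 3.1.0
XL version: xlearning-gpu-beta
The process starts normally, but accessing http://xlhost:19886/jobhistory returns the following error:

When will the runtime environment support Docker?

How do I shut down the server in distributed TensorFlow?

I create two ps servers and two worker clients, run the computation, and exit. The problem is that after the two worker clients finish and exit, the ps server containers do not exit, because they are still blocked in server.join().

My questions are:

Why doesn't the ps server exit?
How can I shut down the server after the clients finish computing?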
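In TensorFlow 1.x, `server.join()` is documented as blocking forever, so a ps process that calls it has no built-in way to exit. A commonly used workaround is to replace `join()` with a wait on a shared queue: each worker enqueues a token when it finishes, and the ps exits after receiving one token per worker. The sketch below shows only that control flow using the standard library; threads and `queue.Queue` stand in for separate processes and a cluster-shared queue, and all names are hypothetical. Whether XLearning offers its own mechanism for this is not stated in the post.

```python
import queue
import threading

NUM_WORKERS = 2
done_queue = queue.Queue()   # stands in for a queue shared across the cluster
events = []                  # ordered trace of what happened


def worker(task_index):
    events.append("worker %d finished training" % task_index)
    done_queue.put(task_index)           # signal completion to the ps


def parameter_server(num_workers):
    # Instead of blocking forever, wait for exactly one token per worker.
    finished = sorted(done_queue.get() for _ in range(num_workers))
    events.append("ps exiting")
    return finished


ps_result = []
ps_thread = threading.Thread(
    target=lambda: ps_result.append(parameter_server(NUM_WORKERS)))
ps_thread.start()

workers = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_WORKERS)]
for t in workers:
    t.start()
for t in workers:
    t.join()
ps_thread.join(timeout=5)
```

The design choice is that the ps decides when to exit based on explicit completion signals, rather than relying on the workers' processes disappearing.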

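TF server fails to start because of port conflicts

reservedSocket.bind(new InetSocketAddress("127.0.0.1", 0));
When xlcontainer requests a port it binds on "127.0.0.1", but many services bind their ports on the real IP (for example 192.168.2.2). This causes a problem: if a service has already bound 192.168.2.2:12345, xlcontainer can still get 12345 back as a free port and pass it to TF to start its server, which then fails because the port is in use.
We changed the free-port lookup in xlcontainer to request the port on the real IP. It has been stable in production since, and we have not hit the problem again.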
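The idea behind the fix described above can be sketched in Python (the original is Java, and the actual patch is not shown in the post): ask the OS for an ephemeral port on the address the server will really use, because a port reserved on one address says nothing about the same port number on another address. `reserve_port` and `is_port_free` are illustrative names.

```python
import socket


def reserve_port(bind_ip):
    """Bind to port 0 on bind_ip; return (socket, port chosen by the OS).

    Keep the socket open for as long as the reservation should hold.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((bind_ip, 0))
    return sock, sock.getsockname()[1]


def is_port_free(bind_ip, port):
    """Probe whether (bind_ip, port) can be bound right now."""
    probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        probe.bind((bind_ip, port))
        return True
    except OSError:
        return False
    finally:
        probe.close()
```

In the scenario from the post, `reserve_port` would be called with the node's real IP (192.168.2.2 in the example) instead of "127.0.0.1", so a port already held by another service on that IP is never handed out.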

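POM file is missing the com.google.code.gson dependency

With Hadoop versions earlier than 2.7, the following dependency needs to be added:

<dependency>
<groupId>com.google.code.gson</groupId>
<artifactId>gson</artifactId>
<version>2.2.4</version>
</dependency>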

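How do I install on macOS? I'd really like to try this framework

Please help.

Is there a programming interface for submitting jobs?

The documentation submits jobs through the xl-submit command line. Is there a Python interface that can be used to submit jobs?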
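The posts here do not mention a native Python API. As a stopgap, the xl-submit command line can be wrapped with `subprocess`. The sketch below is such a wrapper, not an XLearning interface: `build_xl_submit_argv` and `submit` are hypothetical helpers, and the only flags used are ones that appear in commands quoted elsewhere on this page.

```python
import os
import subprocess


def build_xl_submit_argv(xlearning_home, options):
    """Turn {"flag-name": value} pairs into an argv list for bin/xl-submit."""
    argv = [os.path.join(xlearning_home, "bin", "xl-submit")]
    for flag, value in options.items():
        argv += ["--" + flag, str(value)]
    return argv


def submit(xlearning_home, options):
    """Run xl-submit and return its exit code (requires an XLearning install)."""
    return subprocess.run(build_xl_submit_argv(xlearning_home, options)).returncode


# Options taken from a command quoted on this page.
DEMO_OPTIONS = {
    "app-type": "tensorflow",
    "app-name": "tf-demo",
    "input": "/tmp/data/tensorflow#data",
    "output": "/tmp/tensorflow_model#model",
    "files": "demo.py,dataDeal.py",
    "launch-cmd": "python demo.py --data_path=./data --save_path=./model "
                  "--log_dir=./eventLog --training_epochs=10",
    "worker-memory": "2G",
    "worker-num": 2,
    "worker-cores": 3,
    "ps-memory": "1G",
    "ps-num": 1,
}

argv = build_xl_submit_argv("/opt/XLearning", DEMO_OPTIONS)
```

Passing the arguments as a list avoids shell quoting problems with `--launch-cmd`, which itself contains spaces.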

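Does XLearning support secure clusters?

When I try to start the XLearning HistoryServer on a Kerberos-enabled cluster, it fails with an error that the keytab file cannot be found. Does XL currently support secure clusters?

Error: Could not find or load main class net.qihoo.xlearning.AM.ApplicationMaster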

Application application_1541471478754_0001 failed 2 times due to AM Container for appattempt_1541471478754_0001_000002 exited with exitCode: 1
Failing this attempt.Diagnostics: [2018-11-06 10:32:36.870]Exception from container-launch.
Container id: container_1541471478754_0001_02_000001
Exit code: 1
[2018-11-06 10:32:36.878]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
Error: Could not find or load main class net.qihoo.xlearning.AM.ApplicationMaster
[2018-11-06 10:32:36.879]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
Error: Could not find or load main class net.qihoo.xlearning.AM.ApplicationMaster
For more detailed output, check the application tracking page: http://why-System-Product-Name:10086/cluster/app/application_1541471478754_0001 Then click on links to logs of each attempt.
. Failing the application.

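Hi all, I get this error when running run.sh from xlearning-gpu-1.3. How can I fix it?
Environment:
Ubuntu 16
Hadoop 3.1.1
xlearning-gpu-1.3

Custom job: several scripts were submitted via --files in xl-submit, but the workers still report that a required script is missing

As the screenshot shows, the submit command already includes conv3d_utils.py.

But the job fails after submission, and the log of one worker shows that conv3d_utils.py is missing.

It was clearly submitted, so what is going wrong?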

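TensorFlow version compatibility and model saving

We are using XLearning to test distributed TensorFlow model training and have run into some problems:

1. What is the highest TensorFlow version that XLearning currently supports? The test scripts provided in the examples fail on 1.10, while 1.3 works.

2. Could you show how to save a pb model file? Python code that saves a pb file works on a local machine, but it raises an error when saving under XLearning.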

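TensorFlow job fails to submit when the worker num is changed

For the demo job, I changed the number of workers on the submit command line. Every run with a worker num >= 3 fails. I don't know what the problem is, and the DEBUG log does not show what the error is either.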

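Model files cannot be found after running the tensorflow demo

After running the official tensorflow demo, no model files were found in hdfs /tmp/tensorflow_model. As the screenshot below shows, the container directories are all empty. Yet both yarn and the shell reported that the run succeeded. I don't know what is going on.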

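Can data be distributed for training one day at a time, in order?

Is it possible to distribute data for training day by day, in order: once 0901 has been distributed and trained, move on to 0902, then 0903, and so on?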

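tensorflow distributed estimator reports TrainStatus:false at startup

Framework: tensorflow
Environment: GPU cluster with 6 P100 cards
xlearning
The code already runs locally, but it fails on xlearning.
I hope someone who understands this can help.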

FATAL ApplicationMaster: Error running ApplicationMaster

Environment:
1. hdfs

Started: Thu Apr 19 16:28:15 +0800 2018
Version: 3.1.0, r16b70619a24cdcf5d3b0fcf4b58ca77238ccbe6d
Compiled: Fri Mar 30 08:00:00 +0800 2018 by centos from branch-3.1.0
Cluster ID: CID-ea3f6bd7-9801-4a0d-a80e-e60465bb928f
Block Pool ID: BP-232525608-14.29.85.83-1522829491235

2. xlearning:
xlearning-gpu-beta
commit c732e13

Error message:

18/04/19 16:30:01 FATAL ApplicationMaster: Error running ApplicationMaster
java.lang.RuntimeException: Error while build container local resource
        at net.qihoo.xlearning.AM.ApplicationMaster.buildContainerLocalResource(ApplicationMaster.java:764)
        at net.qihoo.xlearning.AM.ApplicationMaster.run(ApplicationMaster.java:1171)
        at net.qihoo.xlearning.AM.ApplicationMaster.main(ApplicationMaster.java:1525)
Caused by: java.io.FileNotFoundException: File does not exist: hdfs://gpu1:8020/tmp/XLearning/staging/application_1523879759427_0061/AppMaster.jar
        at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1573)
        at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1566)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1581)
        at net.qihoo.xlearning.util.Utilities.createApplicationResource(Utilities.java:121)
        at net.qihoo.xlearning.AM.ApplicationMaster.buildContainerLocalResource(ApplicationMaster.java:677)
        ... 2 more
18/04/19 16:30:01 INFO ApplicationMaster: Deleting the staging file successed.

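x-learning reports an error at the end when two people run the demo at the same time

The example being run is: xlearning/examples/tensorflow/run.sh
The job fails when it reaches 95%. On the worker node, the error is a directory permission problem:
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=bing.wb, access=WRITE, inode="/tmp/XLearning/eventLog":xxxxxxx:supergroup:drwxr-xr-x
The directory belongs to xxxxxx, a colleague who ran the command earlier and created it, which makes my current job fail.
However, before running the demo I had already redirected eventLog, and so far that change does not seem to have taken effect.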

[[email protected] /home/bing.wb/xlearning/conf]
$grep -b2  event xlearning-site.xml
1450-    <property>
1465-        <name>xlearning.tf.board.history.dir</name>
1517:        <value>/tmp/bing.wb/XLearning/eventLog</value>
1572-    </property>
1588-    <property>

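When running the TensorFlow demo, the workers finish training but do not exit; they stay stuck.

A question:
With XLearning 1.1 running the TensorFlow demo, the log shows that all workers have finished training, but only the container with task_index = 0 has its status updated to success. The other containers stay in running, and their logs show no output at all. Why?

Also, no matter what worker-num is set to, are all the workers started on a single machine?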

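Progress bar is not displayed

I print report:progress:0.775 to standard error, but the progress still does not show up on the job monitoring page. What should I change to make the progress bar display?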
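
For reference, a minimal sketch of emitting a line in the format quoted above. The helper name report_progress is hypothetical, and the assumptions that the value is a fraction between 0 and 1 and that each report sits on its own line are inferred from the 0.775 example rather than confirmed in this issue. The explicit flush is only a precaution, not a confirmed fix for the missing progress bar.

```python
import sys


def report_progress(fraction, stream=None):
    """Write one 'report:progress:<float>' line to stderr (or a given stream)."""
    stream = sys.stderr if stream is None else stream
    # Clamp to [0.0, 1.0]; treating progress as a fraction is an assumption
    # based on the 0.775 example quoted in the issue.
    fraction = min(max(float(fraction), 0.0), 1.0)
    stream.write("report:progress:%s\n" % fraction)
    stream.flush()


if __name__ == "__main__":
    total_steps = 4
    for step in range(1, total_steps + 1):
        # ... one training step would run here ...
        report_progress(step / float(total_steps))
```

Passing the stream as a parameter makes the output easy to verify locally before submitting the job.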

INFO Client: reporter progress:100.00%

After that, the job stays stuck at INFO Client: reporter progress:100.00%

IllegalArgumentException

src/main/java/net/qihoo/xlearning/AM/ApplicationMaster.java:823
updateBlacklist.invoke(amrmAsync, blackHosts) throws the exception "java.lang.IllegalArgumentException: wrong number of arguments"

Error when running the demo

Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.yarn.webapp.WebApps$Builder.build(Lorg/apache/hadoop/yarn/webapp/WebApp;)Lorg/apache/hadoop/yarn/webapp/WebApp;
at net.qihoo.xlearning.AM.ApplicationWebService.start(ApplicationWebService.java:35)
at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
at net.qihoo.xlearning.AM.ApplicationMaster.init(ApplicationMaster.java:217)
at net.qihoo.xlearning.AM.ApplicationMaster.main(ApplicationMaster.java:1245)
