xiaomi / minos

Minos is beyond a hadoop deployment system.

License: Apache License 2.0


minos's Introduction

What is Minos

Minos is a distributed deployment and monitoring system. It was initially developed and used at Xiaomi to deploy and manage the company's Hadoop, HBase and ZooKeeper clusters. Minos can be easily extended to support other systems, of which HDFS, YARN and Impala are already supported in the current release.

Components

The Minos system contains the following four components:

Client

This is the command line client tool used to deploy and manage processes of various systems. You can use this client to perform various deployment tasks, e.g. installing, (re)starting, or stopping a service. Currently the client supports ZooKeeper, HDFS, HBase, YARN and Impala, and it can be extended to support other systems. You can refer to Using Client below to learn how to use it.

Owl

This is the dashboard system that displays the status of all processes, where users can get an overview of all the clusters managed by Minos. It collects data from servers through the JMX interface, and organizes pages by cluster, job and task, corresponding to the definitions in the cluster configuration. It also provides utilities such as a health alerter, an HDFS quota updater and a quota reporter. You can refer to Installing Owl to learn how to install and use it.

Supervisor

This is the process management and monitoring system. Supervisor is an open source project, a client/server system that allows its users to monitor and control a number of processes on a UNIX-like operating system.

Based on supervisor-3.0b1, we extended Supervisor to support Minos: we implemented an RPC interface under the deployment directory so that our deploy client can invoke the services supplied by supervisord.

When deploying a Hadoop cluster for the first time, you need to set up supervisord on every production machine. This only needs to be done once. You can refer to Installing Supervisor to learn how to install and use it.

Tank

This is a simple package management Django app server for our deployment tool. When setting up a cluster for the first time, you should set up a tank server first. This also needs to be done only once. You can refer to Installing Tank to learn how to install and use it.

Setting Up Minos on CentOS/Ubuntu

Prerequisites

Install Python

Make sure Python 2.7 or later is installed, from http://www.python.org.

Install JDK

Make sure that the Oracle Java Development Kit 6 is installed (not OpenJDK) from http://www.oracle.com/technetwork/java/javase/downloads/index.html, and that JAVA_HOME is set in your environment.
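A minimal sketch of setting JAVA_HOME, assuming a hypothetical install path (adjust to wherever your JDK actually lives):

```shell
# Example only: /opt/jdk1.6.0_45 is a hypothetical install path; use yours.
export JAVA_HOME=/opt/jdk1.6.0_45
export PATH="$JAVA_HOME/bin:$PATH"

# Sanity check that the variable is set before running Minos commands:
[ -n "$JAVA_HOME" ] && echo "JAVA_HOME=$JAVA_HOME"
```

Adding the export lines to your shell profile makes the setting persistent across sessions.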

Building Minos

Clone the Minos repository

To use Minos, just check out the code on your production machine:

git clone https://github.com/XiaoMi/minos.git

Build the virtual environment

Each component of Minos runs in its own virtual environment, so build the virtual environment before using Minos:

cd minos
./build.sh build

Note: If you only use the Client component on the current machine, this step is enough; you can then refer to Using Client to learn how to deploy and manage a cluster. If you want to use the current machine as a Tank server, refer to Installing Tank to learn how to do that. Similarly, if you want to use the current machine as an Owl server or a Supervisor server, refer to Installing Owl and Installing Supervisor respectively.

Installing Tank

Start Tank

cd minos
./build.sh start tank --tank_ip ${your_local_ip} --tank_port ${port_tank_will_listen}

Note: If you do not specify tank_ip and tank_port, the Tank server will listen on 0.0.0.0, port 8000.

Stop Tank

./build.sh stop tank

Installing Supervisor

Prerequisites

Make sure you have installed Tank on one of the production machines.

Start Supervisor

cd minos
./build.sh start supervisor --tank_ip ${tank_server_ip} --tank_port ${tank_server_port}

When starting supervisor for the first time, the tank_ip and tank_port must be specified.

After starting supervisor on the destination machine, you can access the web interface of the supervisord. For example, if supervisord listens on port 9001, and the serving machine's IP address is 192.168.1.11, you can access the following URL to view the processes managed by supervisord:

http://192.168.1.11:9001/

Stop Supervisor

./build.sh stop supervisor

Monitor Processes

We use Superlance to monitor processes. Superlance is a package of plug-in utilities for monitoring and controlling processes that run under supervisor.

We integrate superlance-0.7 into our supervisor system and use its crashmail tool to monitor all processes. When a process exits unexpectedly, crashmail will send an alert email to a configurable mailing list.

We configure crashmail as an auto-started process, so it starts working automatically when the supervisor is started. The following config example, taken from minos/build/template/supervisord.conf.tmpl, shows how to configure crashmail:

[eventlistener:crashmailbatch-monitor]
command=python superlance/crashmailbatch.py \
        --toEmail="[email protected]" \
        --fromEmail="[email protected]" \
        --password="123456" \
        --smtpHost="mail.example.com" \
        --tickEvent=TICK_5 \
        --interval=0.5
events=PROCESS_STATE,TICK_5
buffer_size=100
stdout_logfile=crashmailbatch.stdout
stderr_logfile=crashmailbatch.stderr
autostart=true

Note: Related configuration values such as the server port or username are set in minos/build/template/supervisord.conf.tmpl; if you don't want to use the defaults, change them there.

Using Client

Prerequisites

Make sure you have installed Tank and Supervisor on your production machines.

A Simple Tutorial

Here we would like to show you how to use the client in a simple tutorial. In this tutorial we will use Minos to deploy an HDFS service, which itself requires the deployment of a ZooKeeper service.

The following are some conventions we will use in this tutorial:

  • Cluster type: we define three types of clusters: tst for testing, prc for offline processing, and srv for online serving.
  • ZooKeeper cluster name: we define the ZooKeeper cluster name using the IDC short name and the cluster type. For example, dptst is used to name a testing cluster at IDC dp.
  • Other service cluster names: we define other service cluster names using the corresponding ZooKeeper cluster name and the name of the business the service is intended to serve. For example, dptst-example is the name of a testing cluster used to do example tests.
  • Configuration file names: all the services have a corresponding configuration file, named ${service}-${cluster}.cfg. For example, the dptst ZooKeeper service's configuration file is named zookeeper-dptst.cfg, and the dptst example HDFS service's configuration file is named hdfs-dptst-example.cfg.
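The ${service}-${cluster}.cfg naming convention can be spelled out as a one-liner:

```shell
# Derive a config file name from the ${service}-${cluster}.cfg convention.
service=hdfs
cluster=dptst-example
config_file="${service}-${cluster}.cfg"
echo "$config_file"   # prints: hdfs-dptst-example.cfg
```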

Configuring deploy.cfg

There is a configuration file named deploy.cfg under the root directory of minos. You should first edit this file to set up the deployment environment. Make sure that all service packages are prepared and configured in deploy.cfg.
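The real section and key names are documented by the comments inside deploy.cfg itself; purely as a hypothetical sketch of the shape of such a file (these key names are illustrative, not the actual ones):

```ini
; Hypothetical sketch only -- read the comments in deploy.cfg for the
; real section and key names used by your Minos version.
[tank]
server_ip = 192.168.1.10    ; the machine running your Tank server
server_port = 8000          ; the port Tank listens on
```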

Configuring ZooKeeper

As mentioned in the cluster naming conventions, we will set up a testing ZooKeeper cluster at the dp IDC, and the corresponding configuration file for the cluster will be named as zookeeper-dptst.cfg.

You can edit zookeeper-dptst.cfg under the config/conf/zookeeper directory to configure the cluster. The file is well commented and self-explanatory, so we will not explain it further here.

Setting up a ZooKeeper Cluster

To set up a ZooKeeper cluster, just do the following two steps:

  • Install a ZooKeeper package to the tank server:

      cd minos/client
      ./deploy install zookeeper dptst
    
  • Bootstrap the cluster; this is needed only once, when the cluster is set up for the first time:

      ./deploy bootstrap zookeeper dptst
    

Here are some handy ways to manage the cluster:

  • Show the status of the ZooKeeper service:

      ./deploy show zookeeper dptst
    
  • Start/Stop/Restart the ZooKeeper cluster:

      ./deploy stop zookeeper dptst
      ./deploy start zookeeper dptst
      ./deploy restart zookeeper dptst
    
  • Clean up the ZooKeeper cluster:

      ./deploy cleanup zookeeper dptst
    
  • Rolling update the ZooKeeper cluster:

      ./deploy rolling_update zookeeper dptst
    

Configuring HDFS

Now it is time to configure the HDFS system. Here we set up a testing HDFS cluster named dptst-example, whose configuration file will be named as hdfs-dptst-example.cfg, as explained in the naming conventions.

You can edit hdfs-dptst-example.cfg under the config/conf/hdfs directory to configure the cluster. The file is well commented and self-explanatory, so we will not explain it further here.

Setting Up an HDFS Cluster

Setting up and managing an HDFS cluster is similar to setting up and managing a ZooKeeper cluster. The only difference is the cluster name, dptst-example, which implies that the corresponding ZooKeeper cluster is dptst:

./deploy install hdfs dptst-example
./deploy bootstrap hdfs dptst-example
./deploy show hdfs dptst-example
./deploy stop hdfs dptst-example
./deploy start hdfs dptst-example
./deploy restart hdfs dptst-example
./deploy rolling_update hdfs dptst-example --job=datanode
./deploy cleanup hdfs dptst-example
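The rule that a service cluster name implies its ZooKeeper cluster (dptst-example implies dptst) is simply the prefix before the first dash:

```shell
# Recover the ZooKeeper cluster name from a service cluster name by taking
# the prefix before the first '-':
cluster=dptst-example
zk_cluster="${cluster%%-*}"
echo "$zk_cluster"   # prints: dptst
```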

Shell

The client tool also supports a very handy command named shell. You can use this command to manage the files on HDFS, tables on HBase, jobs on YARN, etc. Here are some examples about how to use the shell command to perform several different HDFS operations:

./deploy shell hdfs dptst-example dfs -ls /
./deploy shell hdfs dptst-example dfs -mkdir /test
./deploy shell hdfs dptst-example dfs -rm -R /test

You can run ./deploy --help to see the detailed help messages.

Installing Owl

Owl must be installed on the same machine where you use the Client component, since they share the same set of cluster configuration files.

Prerequisites

Install Gnuplot

Gnuplot is required by OpenTSDB; you can install it with the following command.

CentOS: sudo yum install gnuplot
Ubuntu: sudo apt-get install gnuplot

Install MySQL

Ubuntu:
sudo apt-get install mysql-server
sudo apt-get install mysql-client

CentOS:
yum install mysql-server mysql mysql-devel

Configuration

Configure the clusters you want to monitor with Owl in minos/config/owl/collector.cfg. The following example shows how to modify the configuration.

[collector]
# service names (space separated)
service = hdfs hbase

[hdfs]
# cluster names (space separated)
clusters=dptst-example
# job names (space separated)
jobs=journalnode namenode datanode
# URL for the collector, usually a JMX URL
metric_url=/jmx?qry=Hadoop:*

Note: Some other configuration values, such as the OpenTSDB port, are set in minos/build/minos_config.py. You can change the default ports to avoid port conflicts.
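A quick sanity check that every service listed in the collector section has a section of its own — a sketch, not part of Minos:

```shell
# Sketch (not part of Minos): verify that each service named in the
# [collector] section of collector.cfg has a matching [section].
# We build a sample file here; point cfg= at your real collector.cfg instead.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
[collector]
service = hdfs hbase

[hdfs]
clusters=dptst-example

[hbase]
clusters=dptst-example
EOF

services=$(sed -n 's/^service *= *//p' "$cfg")
for s in $services; do
  if grep -q "^\[$s\]" "$cfg"; then
    echo "ok: [$s] section present"
  else
    echo "missing: [$s] section"
  fi
done
```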

Start Owl

cd minos
./build.sh start owl --owl_ip ${your_local_ip} --owl_port ${port_owl_monitor_will_listen}

After starting Owl, you can access the web interface of the Owl. For example, if Owl listens on port 8088, and the machine's IP address is 192.168.1.11, you can access the following URL to view the Owl web interface:

http://192.168.1.11:8088/

Stop Owl

./build.sh stop owl

FAQ

  1. When installing MySQL-python, you may get an error of _mysql.c:44:23: error: my_config.h: No such file or directory (CentOS) or EnvironmentError: mysql_config not found (Ubuntu). Since mysql_config is part of mysql-devel, installing mysql-devel allows MySQL-python to be installed, so you may need to install it:

     ubuntu: sudo apt-get install libmysqlclient-dev
     centos: sudo yum install mysql-devel
    
  2. When installing twisted, you may get an error of CompressionError: bz2 module is not available, and the compile output shows:

     Python build finished, but the necessary bits to build these modules were not found:
     _sqlite3           _tkinter           bsddb185
     bz2                dbm                dl

     Then you may need to install the bz2 and sqlite3 development packages:

     sudo apt-get install libbz2-dev
     sudo apt-get install libsqlite3-dev

  3. When setting up stand-alone HBase on Ubuntu, you may fail to start it because of the /etc/hosts file. You can refer to http://hbase.apache.org/book/quickstart.html#ftn.d2907e114 to fix the problem.

  4. When using the Minos client to install a service package, if you get an error of socket.error: [Errno 101] Network is unreachable, check your tank server configuration in the deploy.cfg file; you might have missed it.
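On the /etc/hosts problem above: the usual culprit on Ubuntu is the default 127.0.1.1 line, which maps the machine's hostname to a loopback address. A typical fix (the hostname and address here are placeholders) looks like:

```text
# /etc/hosts -- example only; substitute your real hostname and LAN address
127.0.0.1    localhost
#127.0.1.1   myhost          <- comment out or remove this Ubuntu default
192.168.1.11 myhost          # map the hostname to the machine's real address
```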

Note: See Minos Wiki for more advanced features.

minos's People

Contributors

atupal, ilovehao, lshmouse, lzj0111, renozhang, suyannone, wuzesheng, yxac


minos's Issues

Question about the Owl configuration file

When configuring the collector.cfg file: in the hdfs section there is metric_url=/jmx?qry=Hadoop: — is the "Hadoop" in "Hadoop:" the name of the Hadoop cluster?
In the hbase section there is metric_url=/jmx?qry=hadoop: — what does "hadoop" mean there?
The configuration file content is as follows:
[collector]
services=hdfs hbase yarn
period=10
[hdfs]
clusters=dptst-example
jobs=journalnode namenode datanode
metric_url=/jmx?qry=Hadoop:

[hbase]
clusters=dptst-example
jobs=master regionserver
metric_url=/jmx?qry=hadoop:*
[yarn]
clusters=dptst-example
jobs=resourcemanager nodemanager historyserver proxyserver
metric_url=/jmx?qry=Hadoop:*

Got this error the first time supervisor was started — is it a problem with the download source?

[hadoop@jl-master minos-master]$ ./build.sh start supervisor --tank_ip 172.16.8.1 --tank_port 8000
2014-11-24 11:48:58 Building supervisor
2014-11-24 11:48:58 Check and install prerequisite python libraries
2014-11-24 11:48:58 Installing elementtree
Downloading/unpacking elementtree>=1.2.6-20050316
Could not find any downloads that satisfy the requirement elementtree>=1.2.6-20050316
Some externally hosted files were ignored (use --allow-external elementtree to allow).
Cleaning up...
No distributions at all found for elementtree>=1.2.6-20050316
Storing debug log for failure in /data/hadoop/.pip/pip.log
2014-11-24 11:48:59 Command '['/data/hadoop/z.zeng/minos-master/build/env/bin/pip', 'install', 'elementtree>=1.2.6-20050316']' returned non-zero exit status 1

elementtree installation error when initializing Supervisor

System environment:
CentOS 6.5 64-bit
Python 2.7.8

Error output:

[root@AY1407221519105745fcZ minos]# ./build.sh start supervisor --tank_ip 127.0.0.1 --tank_port 8000
2014-10-14 11:41:16 Building supervisor
2014-10-14 11:41:16 Check and install prerequisite python libraries
2014-10-14 11:41:16 Installing elementtree
Downloading/unpacking elementtree>=1.2.6-20050316
  Could not find any downloads that satisfy the requirement elementtree>=1.2.6-20050316
  Some externally hosted files were ignored (use --allow-external elementtree to allow).
Cleaning up...
No distributions at all found for elementtree>=1.2.6-20050316
Storing debug log for failure in /root/.pip/pip.log
2014-10-14 11:41:17 Command '['/root/minos/build/env/bin/pip', 'install', 'elementtree>=1.2.6-20050316']' returned non-zero exit status 1

Question:
On inspection, elementtree already exists in Python 2.7.8. Can the step that installs the elementtree module be removed, and which file needs to be modified to do so?

Question about Owl monitoring

My Owl installation and startup are OK, but when I open the monitoring page nothing is detected — no HDFS or other monitoring metrics. What else do I need to configure?
The content of my /data/hadoop/z.zeng/minos-master/config/owl/collector.cfg is as follows:

# collector config

[collector]
services=hdfs hbase yarn impala

# Period to fetch/report metrics, in seconds.
period=10

[hdfs]
clusters=dptst-example
jobs=journalnode namenode datanode

# The jmx output of each bean is as following:
# {
#   "name" : "hadoop:service=RegionServer,name=RegionServerDynamicStatistics",
#   "modelerType" : "org.apache.hadoop.hbase.regionserver.metrics.RegionServerDynamicStatistics",
#   "tbl.YCSBTest.cf.test.blockCacheNumCached" : 0,
#   "tbl.YCSBTest.cf.test.compactionBlockReadCacheHitCnt" : 0,
#   ...
# Some metrics/values are from hadoop/hbase and some are from the java runtime
# environment; we specify a filter on the jmx url to get hadoop/hbase metrics.
metric_url=/jmx?qry=Hadoop:*
#metric_url=http://sx-master:50070/jmx?qry=Hadoop:*

[hbase]
clusters=dptst-example
jobs=master regionserver
metric_url=/jmx?qry=hadoop:*

[yarn]
clusters=dptst-example
jobs=resourcemanager nodemanager historyserver proxyserver
metric_url=/jmx?qry=Hadoop:*

[impala]
clusters=dptst-example
jobs=statestored impalad
metric_url=/
need_analyze=false

Question about parameters in the hdfs-dptst-example.cfg configuration file

File path: minos/config/conf/hdfs/hdfs-dptst-example.cfg
[journalnode]
base_port=12100
host.0=10.38.11.59
host.1=10.38.11.134
host.2=10.38.11.135
[namenode]
base_port=12200
host.0=10.38.11.59
host.1=10.38.11.134
[zkfc]
base_port=12300
[datanode]
base_port=12400
host.0=10.38.11.134
host.1=10.38.11.135
Should the base_port values here be the same as the RPC port numbers of journalnode, namenode, etc. in the Hadoop configuration files, or can they be defined arbitrarily?

How to use minos SHELL?

Minos supports a shell command; users can use it to operate on different clusters very conveniently, for example:

./deploy.py shell hdfs ${your-cluster-name} dfs -ls /
./deploy.py shell hdfs ${your-cluster-name} dfsadmin -refreshNetworkTopology
./deploy.py shell hbase ${your-cluster-name} shell

Note that before using this feature of Minos, you should apply the patch from this issue to your hadoop-common: https://issues.apache.org/jira/browse/HADOOP-9223

bug in get_short_user_name

When security (Kerberos) is off and no Kerberos-related commands such as klist are available on the Minos machine, the following code raises a ValueError. Details are noted in the comments:

def get_short_user_name(args, cluster=None, jobs=None, current_job="", host_id=0):
  if not getattr(args, "short_user_name", None):
    # no ret-value by get_short_user_name_full()
    args.short_user_name = get_short_user_name_full()[1]
  return args.short_user_name

def get_short_user_name_full():
  try:
    cmd = ['klist']
    output = subprocess.check_output(cmd, shell=True, stderr=subprocess.STDOUT,)

    centos_line_prefix = 'Default principal:'
    macos_line_prefix = 'Principal:'
    for line in output.split('\n'):
      if (line.strip().startswith(centos_line_prefix) or
          line.strip().startswith(macos_line_prefix)):
        # the program never reaches here when the klist command cannot be found
        return True, line.split(':')[1].split('@')[0].strip()
    # the program falls through to here, so nothing is returned,
    # which causes the error in the caller
  except subprocess.CalledProcessError, e:
    return False, getpass.getuser()

bug-fixed code:

def get_short_user_name_full():
  succ = False
  user = getpass.getuser()

  try:
    cmd = ['klist']
    output = subprocess.check_output(cmd, shell=True, stderr=subprocess.STDOUT,)

    centos_line_prefix = 'Default principal:'
    macos_line_prefix = 'Principal:'
    for line in output.split('\n'):
      if (line.strip().startswith(centos_line_prefix) or
          line.strip().startswith(macos_line_prefix)):
        succ = True
        user = line.split(':')[1].split('@')[0].strip()
        break

  except subprocess.CalledProcessError, e:
    Log.print_critical('get short user name failed' + str(e))

  return succ, user

The default config value for the resourcemanager scheduler in yarn-common.cfg is incompatible with some versions of hadoop

Currently, the following lines in yarn-common.cfg specify the scheduler:

# use CapacityScheduler
yarn.resourcemanager.scheduler.class=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler

# use DominantResourceCalculator to support cpu scheduling
yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.server.resourcemanager.resource.DominantResourceCalculator

The scheduler and resource-calculator classes live in different packages in different Hadoop versions. In my test, this config is incompatible with Apache hadoop-2.2.0, so it would be better to delete these two entries.

Error when accessing the YARN monitoring page

When accessing the YARN page, it shows:
Server Error (500)

The error message seen in owl/debug.log:
ERROR 2014-03-13 10:41:07,630 base 24854 140053866374912 Internal Server Error: /monitor/job/3/
Traceback (most recent call last):
File "/home/hadoop/minos/build/env/lib/python2.7/site-packages/django/core/handlers/base.py", line 114, in get_response
response = wrapped_callback(request, _callback_args, *_callback_kwargs)
File "/home/hadoop/minos/owl/monitor/views.py", line 404, in show_job
tsdb_metrics = metric_helper.make_metrics_query_for_job(endpoints, job, tasks)
File "/home/hadoop/minos/owl/monitor/metric_helper.py", line 132, in make_metrics_query_for_job
task_view_config = job_metrics_view_config(job)
File "/home/hadoop/minos/owl/monitor/metric_helper.py", line 77, in job_metrics_view_config
return metric_view_config.JOB_METRICS_VIEW_CONFIG[service][job]
KeyError: 'yarn'

How can this error be resolved?

Setting up a ZooKeeper cluster: bootstrapping the cluster fails

Your password is: 123456, you should store this in a safe place, because this is the verification code used to do cleanup
Bootstrapping task 0 of zookeeper on 192.38.11.59(0)
Bootstrap task 0 of zookeeper on 192.38.11.59(0) fail: No package found on package server of zookeeper
Bootstrap task 0 of zookeeper on 192.38.11.59(0) fail: 2
Starting task 0 of zookeeper on 192.38.11.59(0)
Start task 0 of zookeeper on 192.38.11.59(0) fail: You should bootstrap the job first

HDFS deployed, but the shell command output looks wrong

I ran ./deploy shell hdfs dptst-ir dfs -ls / as root; the result, which looks quite wrong, is below:

[root@master client]# ./deploy shell hdfs dptst-ir dfs -ls /
14/10/16 15:20:14 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 22 items
-rw-r--r--   1 root root          0 2014-10-15 16:07 /.autofsck
-rw-r--r--   1 root root          0 2014-10-13 11:50 /.autorelabel
dr-xr-xr-x   - root root       4096 2014-10-14 04:14 /bin
dr-xr-xr-x   - root root       4096 2014-10-13 15:21 /boot
drwxr-xr-x   - root root       3380 2014-10-15 16:07 /dev
drwxr-xr-x   - root root       4096 2014-10-15 16:07 /etc
drwxr-xr-x   - root root       4096 2014-10-15 17:19 /home
dr-xr-xr-x   - root root       4096 2014-06-10 10:14 /lib
dr-xr-xr-x   - root root      12288 2014-10-14 04:14 /lib64
drwx------   - root root      16384 2014-06-10 10:09 /lost+found
drwxr-xr-x   - root root       4096 2011-09-23 19:50 /media
drwxr-xr-x   - root root       4096 2011-09-23 19:50 /mnt
drwxr-xr-x   - root root       4096 2014-06-10 10:14 /opt
dr-xr-xr-x   - root root          0 2014-10-16 00:07 /proc
dr-xr-x---   - root root       4096 2014-10-16 15:18 /root
dr-xr-xr-x   - root root      12288 2014-10-14 04:14 /sbin
drwxr-xr-x   - root root       4096 2014-06-10 10:10 /selinux
drwxr-xr-x   - root root       4096 2011-09-23 19:50 /srv
drwxr-xr-x   - root root          0 2014-10-16 00:07 /sys
drwxrwxrwt   - root root       4096 2014-10-16 15:20 /tmp
drwxr-xr-x   - root root       4096 2014-06-10 10:10 /usr
drwxr-xr-x   - root root       4096 2014-06-10 10:14 /var

useless var in start.sh.tmpl

The following line is useless:
%service_env
It appears verbatim in the final start.sh without being substituted with any value.

yarn deployed by minos cannot run mapreduce

Trying to run wordcount from mapreduce-example-xx.jar, I got the following error:

2014-01-03,11:25:47,793 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Application application_1388719502388_0001 failed 2 times due to
AM Container for appattempt_1388719502388_0001_000002 exited with exitCode: 1 due to: Exception from container-launch:
org.apache.hadoop.util.Shell$ExitCodeException:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
at org.apache.hadoop.util.Shell.run(Shell.java:379)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)

@wuzesheng
I guess it is related to some unset env var; maybe it is the %service_env I mentioned in issue 8 that causes the error.

Can you give me some advice on setting this variable?

HY

Xiaomi fans send their congratulations.

About base_port in cfg file

What exactly does base_port refer to?
Taking hdfs-*.cfg as an example, the namenode's base_port defaults to 12200. Is it OK to just use this value as-is, or does it need to be consistent with the Hadoop configuration?

I only installed Owl, configured following the Client instructions. Owl now starts normally, but no data is displayed.


supervisor deploy error

First, I ran this:
./deploy_supervisor.py
Traceback (most recent call last):
File "./deploy_supervisor.py", line 102, in
supervisor_config = '%s/supervisord.conf' % deploy_utils.get_config_dir()
File "./../client/deploy_utils.py", line 260, in get_config_dir
return get_deploy_config().get_config_dir()
File "./../client/deploy_utils.py", line 88, in get_deploy_config
return deploy_config.get_deploy_config()
File "./../client/deploy_config.py", line 122, in get_deploy_config
config_file = '%s/%s' % (os.path.dirname(file), DEPLOY_CONFIG)
NameError: global name 'DEPLOY_CONFIG' is not defined

Then I checked ../client/deploy_config.py and added the MINOS_CONFIG_FILE env var via export MINOS_CONFIG_FILE=/usr/local/minos/config/deploy_supervisor.cfg

Then I ran the deploy command again. It printed:
./deploy_supervisor.py
Traceback (most recent call last):
File "./deploy_supervisor.py", line 102, in
supervisor_config = '%s/supervisord.conf' % deploy_utils.get_config_dir()
File "./../client/deploy_utils.py", line 260, in get_config_dir
return get_deploy_config().get_config_dir()
File "./../client/deploy_config.py", line 42, in get_config_dir
'default', 'config_dir'))
File "/usr/local/lib/python2.7/ConfigParser.py", line 607, in get
raise NoSectionError(section)
ConfigParser.NoSectionError: No section: 'default'

please tell me how to fix it ^_^
my gmail: [email protected]
qq:2562131239

Error when starting Owl

I get the following error when starting Owl.
The content of the minos/config/owl/collector.cfg configuration file is as follows:
service=hdfs
period=10
[hdfs]
clusters=dptst-example
jobs=journalnode namenode datanode
metric_url=/jmx?qry=Hadoop:*
need_analyze=false
The error output is as follows:
[root@namenode minos]# ./build.sh start owl --owl_ip 10.38.11.59 --owl_port 8089
2014-05-15 18:35:03 Building owl
2014-05-15 18:35:03 Check and install prerequisite python libraries
Traceback (most recent call last):
File "/usr/local/test/minos/owl/manage.py", line 28, in
execute_from_command_line(sys.argv)
File "/usr/local/test/minos/build/env/lib/python2.7/site-packages/django/core/management/init.py", line 427, in execute_from_command_line
utility.execute()
File "/usr/local/test/minos/build/env/lib/python2.7/site-packages/django/core/management/init.py", line 419, in execute
self.fetch_command(subcommand).run_from_argv(self.argv)
File "/usr/local/test/minos/build/env/lib/python2.7/site-packages/django/core/management/base.py", line 288, in run_from_argv
self.execute(_args, *_options.dict)
File "/usr/local/test/minos/build/env/lib/python2.7/site-packages/django/core/management/base.py", line 329, in execute
saved_locale = translation.get_language()
File "/usr/local/test/minos/build/env/lib/python2.7/site-packages/django/utils/translation/init.py", line 172, in get_language
return _trans.get_language()
File "/usr/local/test/minos/build/env/lib/python2.7/site-packages/django/utils/translation/init.py", line 55, in getattr
if settings.USE_I18N:
File "/usr/local/test/minos/build/env/lib/python2.7/site-packages/django/conf/init.py", line 46, in getattr
self._setup(name)
File "/usr/local/test/minos/build/env/lib/python2.7/site-packages/django/conf/init.py", line 42, in _setup
self._wrapped = Settings(settings_module)
File "/usr/local/test/minos/build/env/lib/python2.7/site-packages/django/conf/init.py", line 110, in init
"Please fix your settings." % setting)
django.core.exceptions.ImproperlyConfigured: The TEMPLATE_DIRS setting must be a tuple. Please fix your settings.
2014-05-15 18:35:03 Command '['/usr/local/test/minos/build/env/bin/python', '/usr/local/test/minos/owl/manage.py', 'syncdb']' returned non-zero exit status 1

Met a problem when using Owl to monitor YARN

When I click a YARN task id on the Owl web page, the page composed of opentsdb monitoring views fails to open; instead it reports: "A server error occurred. Please contact the administrator."

Looking at the serve.log log, I found the following:
[02/Jan/2014 15:49:15] "GET /monitor/task/225 HTTP/1.1" 301 0
Traceback (most recent call last):
File "/usr/local/lib/python2.7/wsgiref/handlers.py", line 85, in run
self.result = application(self.environ, self.start_response)
File "/usr/local/lib/python2.7/site-packages/django/contrib/staticfiles/handlers.py", line 67, in call
return self.application(environ, start_response)
File "/usr/local/lib/python2.7/site-packages/django/core/handlers/wsgi.py", line 209, in call
response = self.get_response(request)
File "/usr/local/lib/python2.7/site-packages/django/core/handlers/base.py", line 200, in get_response
response = self.handle_uncaught_exception(request, resolver, sys.exc_info())
File "/usr/local/lib/python2.7/site-packages/django/core/handlers/base.py", line 230, in handle_uncaught_exception
'request': request
File "/usr/local/lib/python2.7/logging/init.py", line 1154, in error
self._log(ERROR, msg, args, **kwargs)
File "/usr/local/lib/python2.7/logging/init.py", line 1246, in _log
self.handle(record)
File "/usr/local/lib/python2.7/logging/init.py", line 1256, in handle
self.callHandlers(record)
File "/usr/local/lib/python2.7/logging/init.py", line 1293, in callHandlers
hdlr.handle(record)
File "/usr/local/lib/python2.7/logging/init.py", line 740, in handle
self.emit(record)
File "/usr/local/lib/python2.7/site-packages/django/utils/log.py", line 106, in emit
connection=self.connection())
File "/usr/local/lib/python2.7/site-packages/django/core/mail/init.py", line 98, in mail_admins
mail.send(fail_silently=fail_silently)
File "/usr/local/lib/python2.7/site-packages/django/core/mail/message.py", line 284, in send
return self.get_connection(fail_silently).send_messages([self])
File "/usr/local/lib/python2.7/site-packages/django/core/mail/backends/smtp.py", line 92, in send_messages
new_conn_created = self.open()
File "/usr/local/lib/python2.7/site-packages/django/core/mail/backends/smtp.py", line 51, in open
self.connection = connection_class(self.host, self.port, **connection_params)
File "/usr/local/lib/python2.7/smtplib.py", line 239, in __init__
(code, msg) = self.connect(host, port)
File "/usr/local/lib/python2.7/smtplib.py", line 295, in connect
self.sock = self._get_socket(host, port, self.timeout)
File "/usr/local/lib/python2.7/smtplib.py", line 273, in _get_socket
return socket.create_connection((port, host), timeout)
File "/usr/local/lib/python2.7/socket.py", line 567, in create_connection
raise error, msg
error: [Errno 111] Connection refused

What could be the cause of this problem?
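For what it's worth, the "Connection refused" at the bottom of the trace is the error-mail path, not the original failure: with DEBUG = False, Django tries to email the real traceback to ADMINS through its SMTP backend, and it is that SMTP connection that fails. A minimal settings sketch for surfacing the underlying error (the host values below are illustrative assumptions, not Owl's actual configuration):

```python
# Sketch of Django settings changes; the values below are examples.

DEBUG = True  # temporarily render the real traceback in the browser

# Alternatively, keep DEBUG = False but point Django at an SMTP relay
# that is actually reachable, so the admin error mail can be sent:
EMAIL_HOST = "smtp.example.com"  # hypothetical relay
EMAIL_PORT = 25
ADMINS = [("owl-admin", "owl-admin@example.com")]
```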

Build error

Running ./build.sh build fails with the errors below. How can this be resolved?

hadoop@ubuntu:~/minos$ ./build.sh build
Creating virtual environment at /home/hadoop/minos/build/env
New python executable in /home/hadoop/minos/build/env/bin/python
Installing setuptools...................................
Complete output from command /home/hadoop/minos/build/env/bin/python -c "#!python
"""Bootstrap setuptoo...

" /home/hadoop/minos/build/virtu...7.egg:
Traceback (most recent call last):
File "<string>", line 278, in <module>
File "<string>", line 239, in main
File "/home/hadoop/minos/build/virtual_bootstrap/virtualenv_support/setuptools-0.6c11-py2.7.egg/setuptools/command/easy_install.py", line 1712, in main
File "/home/hadoop/minos/build/virtual_bootstrap/virtualenv_support/setuptools-0.6c11-py2.7.egg/setuptools/command/easy_install.py", line 1700, in with_ei_usage
File "/home/hadoop/minos/build/virtual_bootstrap/virtualenv_support/setuptools-0.6c11-py2.7.egg/setuptools/command/easy_install.py", line 1716, in
File "/usr/lib/python2.7/distutils/core.py", line 152, in setup
dist.run_commands()
File "/usr/lib/python2.7/distutils/dist.py", line 953, in run_commands
self.run_command(cmd)
File "/usr/lib/python2.7/distutils/dist.py", line 971, in run_command
cmd_obj.ensure_finalized()
File "/usr/lib/python2.7/distutils/cmd.py", line 109, in ensure_finalized
self.finalize_options()
File "/home/hadoop/minos/build/virtual_bootstrap/virtualenv_support/setuptools-0.6c11-py2.7.egg/setuptools/command/easy_install.py", line 125, in finalize_options
File "/home/hadoop/minos/build/virtual_bootstrap/virtualenv_support/setuptools-0.6c11-py2.7.egg/setuptools/command/easy_install.py", line 1121, in _expand
File "/usr/lib/python2.7/distutils/cmd.py", line 312, in get_finalized_command
cmd_obj.ensure_finalized()
File "/usr/lib/python2.7/distutils/cmd.py", line 109, in ensure_finalized
self.finalize_options()
File "/home/hadoop/minos/build/virtual_bootstrap/virtualenv_support/setuptools-0.6c11-py2.7.egg/setuptools/command/install.py", line 32, in finalize_options
File "/usr/lib/python2.7/distutils/command/install.py", line 321, in finalize_options
(prefix, exec_prefix) = get_config_vars('prefix', 'exec_prefix')
File "/home/hadoop/minos/build/env/lib/python2.7/distutils/__init__.py", line 78, in sysconfig_get_config_vars
real_vars = old_get_config_vars(*args)
File "/usr/lib/python2.7/distutils/sysconfig.py", line 495, in get_config_vars
func()
File "/usr/lib/python2.7/distutils/sysconfig.py", line 439, in _init_posix
from _sysconfigdata import build_time_vars
File "/usr/lib/python2.7/_sysconfigdata.py", line 6, in <module>
from _sysconfigdata_nd import *

ImportError: No module named _sysconfigdata_nd

...Installing setuptools...done.
Traceback (most recent call last):
File "/home/hadoop/minos/build/virtual_bootstrap/virtual_bootstrap.py", line 1482, in <module>
main()
File "/home/hadoop/minos/build/virtual_bootstrap/virtual_bootstrap.py", line 525, in main
use_distribute=options.use_distribute)
File "/home/hadoop/minos/build/virtual_bootstrap/virtual_bootstrap.py", line 615, in create_environment
install_setuptools(py_executable, unzip=unzip_setuptools)
File "/home/hadoop/minos/build/virtual_bootstrap/virtual_bootstrap.py", line 357, in install_setuptools
_install_req(py_executable, unzip)
File "/home/hadoop/minos/build/virtual_bootstrap/virtual_bootstrap.py", line 333, in _install_req
cwd=cwd)
File "/home/hadoop/minos/build/virtual_bootstrap/virtual_bootstrap.py", line 586, in call_subprocess
% (cmd_desc, proc.returncode))
OSError: Command /home/hadoop/minos/build/env/bin/python -c "#!python
"""Bootstrap setuptoo...

" /home/hadoop/minos/build/virtu...7.egg failed with error code 1
/home/hadoop/minos/build/env ready
2014-03-06 18:07:05 Check and install prerequisite python libraries
2014-03-06 18:07:05 Installing configobj
2014-03-06 18:07:05 [Errno 2] No such file or directory

subprocess.call interrupted by a signal may cause deployment operations to fail

In deployment/rpcinterface.py, commands are executed by calling the subprocess.call() function. Sometimes the bootstrap operation fails on different hosts with no discernible pattern. The supervisord.log shows that the operation most often fails while executing the following command:
tar zxf xxx.tar.gz -C root_dir
I checked that the package (xxx.tar.gz) did exist, and executing the command manually on the target machine worked without error. I am using Python 2.6.
After some searching, I found that others have encountered a similar problem:
https://mail.python.org/pipermail/pythonmac-sig/2006-September/018095.html
I don't know exactly why, but it seems some signal from supervisord causes the Popen.wait() call to exit early. I replaced all subprocess.call calls with os.system, and so far it works well.

Tank startup error

I just cloned the code from git, on 64-bit CentOS 6.4,
with Python 2.7.6 and Django 1.6.1 installed. Running
./start_tank.sh
then produces the errors below. I am not very familiar with Python or Django; how can this be fixed?

[hadoop@master11 tank]$ tail tank.log
CommandError: ":8000" is not a valid port number or address:port pair.

Error log:
Traceback (most recent call last):
File "./manage.py", line 10, in <module>
execute_from_command_line(sys.argv)
File "/usr/local/python276/lib/python2.7/site-packages/django/core/management/__init__.py", line 399, in execute_from_command_line
utility.execute()
File "/usr/local/python276/lib/python2.7/site-packages/django/core/management/__init__.py", line 392, in execute
self.fetch_command(subcommand).run_from_argv(self.argv)
File "/usr/local/python276/lib/python2.7/site-packages/django/core/management/base.py", line 242, in run_from_argv
self.execute(*args, **options.__dict__)
File "/usr/local/python276/lib/python2.7/site-packages/django/core/management/base.py", line 284, in execute
self.validate()
... ...

File "/usr/local/python276/lib/python2.7/site-packages/django/db/backends/sqlite3/base.py", line 35, in <module>
raise ImproperlyConfigured("Error loading either pysqlite2 or sqlite3 modules (tried in that order): %s" % exc)
django.core.exceptions.ImproperlyConfigured: Error loading either pysqlite2 or sqlite3 modules (tried in that order): No module named _sqlite3
./start_tank.sh: line 10: host: command not found

Owl monitoring page cannot display metric images

After starting Owl I can see the DataNode and NameNode information, but the metric images do not display (the images are broken). Does this require additional configuration?
My owl/server.log:
eNode:GetListingOps'}]), ('Rpc', [{'query': [u'&m=sum:ReceivedBytes{host=sx-master-50001,group=NameNode}&o=&yformat=%25.0s%25c byte(s)'], 'title': 'NameNode:ReceivedBytes'}, {'query': [u'&m=sum:SentBytes{host=sx-master-50001,group=NameNode}&o=&yformat=%25.0s%25c byte(s)'], 'title': 'NameNode:SentBytes'}, {'query': [u'&m=sum:RpcQueueTimeNumOps{host=sx-master-50001,group=NameNode}&o=&yformat=%25.0s%25c op(s)'], 'title': 'NameNode:RpcQueueTimeNumOps'}, {'query': [u'&m=sum:RpcQueueTimeAvgTime{host=sx-master-50001,group=NameNode}&o=&yformat=%25.0s%25c ms(s)'], 'title': 'NameNode:RpcQueueTimeAvgTime'}])]
[('Overall', [{'query': [u'&m=sum:BlockCapacity{host=sx-master-50001,group=NameNode}&o=&yformat=%25.0s%25c block(s)'], 'title': 'NameNode:BlockCapacity'}, {'query': [u'&m=sum:BlocksTotal{host=sx-master-50001,group=[26/Nov/2014 00:20:20] "GET /monitor/job/2/ HTTP/1.1" 200 15887
[26/Nov/2014 00:20:21] "GET /static/bootstrap/css/bootstrap.css HTTP/1.1" 304 0
[26/Nov/2014 00:20:21] "GET /static/bootstrap/css/bootstrap-responsive.css HTTP/1.1" 304 0
[26/Nov/2014 00:20:21] "GET /static/jquery/css/jquery-ui-1.9.2.custom.min.css HTTP/1.1" 304 0
[26/Nov/2014 00:20:21] "GET /static/bootstrap/js/bootstrap.js HTTP/1.1" 304 0
[26/Nov/2014 00:20:21] "GET /static/highcharts/highcharts.js HTTP/1.1" 304 0
[26/Nov/2014 00:20:21] "GET /static/jquery/js/jquery-v1.8.3.js HTTP/1.1" 304 0
[26/Nov/2014 00:20:21] "GET /static/jquery/js/jquery-ui-1.9.2.custom.min.js HTTP/1.1" 304 0
[26/Nov/2014 00:20:21] "GET /static/jquery/js/jquery-ui-timepicker-addon.js HTTP/1.1" 304 0
[26/Nov/2014 00:20:22] "GET /favicon.ico HTTP/1.1" 404 0
[26/Nov/2014 00:20:44] "GET /favicon.ico HTTP/1.1" 404 0

Tank cannot start

./start_tank.sh
OperationalError: unable to open database file

ls
backup.py backup.sh manage.py package_server README.md start_tank.sh static tank templates

It seems the sqlite/ directory is missing under tank/; please upload it.

Unsubstituted variable in start.sh.tmpl

The following line is useless:
%service_env
It appears verbatim in the final start.sh without being substituted with a value.

On the way monitoring data is written to HBase

Currently Minos first writes monitoring data to files and then loads it into HBase via the OpenTSDB command line. In my opinion this is not flexible enough, since it requires OpenTSDB and Minos to be deployed on the same node. Instead, the collected monitoring data could be sent record by record as "put xxx" commands over TCP, letting OpenTSDB write it into HBase. That approach only needs OpenTSDB's IP and port, and does not require OpenTSDB and Minos to share a node. Could this approach be supported, or could both approaches be offered for users to choose from?
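The proposal above can be sketched as follows; the metric names, host, and the default port 4242 (OpenTSDB's usual telnet-style listener) are assumptions for illustration:

```python
import socket
import time

def format_put(metric, value, tags, timestamp=None):
    # One line of OpenTSDB's telnet-style write protocol:
    #   put <metric> <unix_timestamp> <value> tag1=v1 tag2=v2
    ts = int(timestamp if timestamp is not None else time.time())
    tag_str = " ".join("%s=%s" % (k, v) for k, v in sorted(tags.items()))
    return "put %s %d %s %s" % (metric, ts, value, tag_str)

def send_puts(lines, host, port=4242):
    # Ship the formatted lines to a (possibly remote) OpenTSDB
    # instance over TCP; only its host and port need to be known.
    sock = socket.create_connection((host, port), timeout=5)
    try:
        sock.sendall(("\n".join(lines) + "\n").encode("ascii"))
    finally:
        sock.close()
```

For example, `send_puts([format_put("hdfs.namenode.files_total", 1024, {"host": "dn1", "cluster": "dptst"})], "tsdb-host")` would write one data point without any shared filesystem between Minos and OpenTSDB.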

Clarifying the roles of the Minos components

Suppose my cluster has two nodes, 10.38.11.59 and 10.38.11.8 (59 and 8 for short below).
Tank: responsible for package management (I installed it on node 59); it only needs to be installed on one node.
Supervisor: responsible for managing and monitoring the child processes on every node; it must be installed on all nodes to be monitored (here 59 and 8), and supervisord must be started on each of them.
Client: the command line tool and the entry point for cluster management. I run all operations on node 59. When configuring ZooKeeper I only ran ./deploy install zookeeper dptst and ./deploy bootstrap zookeeper dptst on node 59, after which a ZooKeeper instance was started automatically on node 59. (I don't understand this part: I had already deployed ZooKeeper manually on node 59, so why did running the commands start another one?) Also, when managing the cluster with the client, I never configured the ports of the managed components in the configuration files, so how does the client identify them?
I would appreciate a clear explanation. Thanks.

Does Minos support monitoring an existing cluster?

My company already has a cluster in production, but monitoring tools are lacking. Can we use Minos to monitor our existing cluster, rather than deploying a brand-new cluster with it?

Setting Up HDFS Cluster,execute "./deploy bootstrap hdfs dptst-example" error

2014-05-14 18:21:24 Your password is: zxcvbn, you should store this in a safe place, because this is the verification code used to do cleanup
Traceback (most recent call last):
File "/usr/local/test/minos/client/deploy.py", line 284, in <module>
main()
File "/usr/local/test/minos/client/deploy.py", line 281, in main
return args.handler(args)
File "/usr/local/test/minos/client/deploy.py", line 229, in process_command_bootstrap
return deploy_tool.bootstrap(args)
File "/usr/local/test/minos/client/deploy_hdfs.py", line 238, in bootstrap
bootstrap_job(args, hosts[host_id].ip, job_name, host_id, instance_id, first, cleanup_token)
File "/usr/local/test/minos/client/deploy_hdfs.py", line 201, in bootstrap_job
args.hdfs_config.parse_generated_config_files(args, job_name, host_id, instance_id)
File "/usr/local/test/minos/client/service_config.py", line 665, in parse_generated_config_files
args, self.cluster, self.jobs, current_job, host_id, instance_id))
File "/usr/local/test/minos/client/service_config.py", line 652, in parse_generated_files
file_dict[key] = ServiceConfig.parse_item(args, cluster, jobs, current_job, host_id, instance_id, value)
File "/usr/local/test/minos/client/service_config.py", line 596, in parse_item
new_item.append(callback(args, cluster, jobs, current_job, host_id, instance_id, reg_expr[iter]))
File "/usr/local/test/minos/client/service_config.py", line 218, in get_job_task_attribute
host_id, instance_id = parse_task_number(task_id, jobs[job_name].hosts)
File "/usr/local/test/minos/client/service_config.py", line 36, in parse_task_number
raise ValueError(str(task_id) + ' is not a valid task of cluster, please check your config')
ValueError: 1 is not a valid task of cluster, please check your config

add HADOOP_YARN_HOME for %service_env in deploy_yarn.py

Cloudera CDH4 is based on Apache Hadoop 2.0.0, while the latest release is Apache Hadoop 2.2.0, and there are big differences between the two versions. In my tests with Minos + Apache Hadoop 2.2.0, MapReduce jobs always failed with errors like "cannot find the main class xxxx, ClassNotDefined Exception".

After some debugging, it turned out the problem was caused by %service_env.

In Hadoop 2.0.0, the following environment variables are used:
HADOOP_HDFS_HOME
HADOOP_COMMON_HOME
YARN_HOME

But in Hadoop 2.2.0, the following variables are used:
HADOOP_HDFS_HOME
HADOOP_COMMON_HOME
HADOOP_YARN_HOME
HADOOP_MAPRED_HOME

So adding HADOOP_YARN_HOME and HADOOP_MAPRED_HOME to service_env makes it work for both CDH4 and Apache Hadoop 2.2.0.
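As a sketch, the resulting environment for a 2.2.0 deployment would look like the following (the install prefix is an assumption; adjust it to your own layout):

```shell
# Hypothetical install prefix; substitute your actual Hadoop 2.2.0 path.
export HADOOP_PREFIX=/opt/hadoop-2.2.0
export HADOOP_COMMON_HOME="$HADOOP_PREFIX"
export HADOOP_HDFS_HOME="$HADOOP_PREFIX"
export HADOOP_YARN_HOME="$HADOOP_PREFIX"    # new name in 2.2.0
export HADOOP_MAPRED_HOME="$HADOOP_PREFIX"  # needed for MapReduce jobs
export YARN_HOME="$HADOOP_PREFIX"           # kept for 2.0.0/CDH4 compatibility
```

Exporting both the old and new names keeps one template usable across both Hadoop generations.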

Questions about how Minos works

After several days of installing and configuring Minos, I still do not understand how Supervisor identifies the Hadoop and HBase deployments in my cluster. I looked at configuration files such as hdfs-dptst-example.cfg and supervisord.conf, and I only found where each node's IP is configured, with no other information. How does Minos identify the clusters that need to be monitored?

Question about managing ZooKeeper with the client

I successfully started supervisord on nodes 10.38.11.59 and 10.38.11.8. Running ./deploy install zookeeper dptst and ./deploy bootstrap zookeeper dptst on node 59 both succeeded; details:
2014-05-16 09:37:35 Bootstrap task 0 of zookeeper on 10.38.11.59(0) success
2014-05-16 09:37:35 Start task 0 of zookeeper on 10.38.11.59(0) success
2014-05-16 09:37:35 Bootstrap task 1 of zookeeper on 10.38.11.8(0) success
2014-05-16 09:37:36 Start task 1 of zookeeper on 10.38.11.8(0) success

But when I then ran ./deploy show zookeeper dptst, the task on node 8 failed.
Details:
2014-05-16 09:37:49 Task 0 of zookeeper on 10.38.11.59(0) is RUNNING
2014-05-16 09:37:49 Showing task 1 of zookeeper on 10.38.11.8(0)
2014-05-16 09:37:49 Task 1 of zookeeper on 10.38.11.8(0) is FATAL

Bootstrapping ZKFC hangs forever

Folks may find that bootstrapping ZKFC after a cleanup hangs forever. To fix this, apply the following diff to your ZKFC:

diff --git a/hadoop/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/ZKFailoverController.java b/hadoop/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/ZKFailoverController.java
--- a/hadoop/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/ZKFailoverController.java
+++ b/hadoop/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/ZKFailoverController.java
@@ -195,6 +195,8 @@
           }   
         }   
         return formatZK(force, interactive);
+      } else if ("-clearZK".equals(args[0])) {
+        return clearZK();
       } else {
         badArg(args[0]);
       }   
@@ -272,6 +274,19 @@
     return 0;
   }   

+  private int clearZK()
+      throws IOException, InterruptedException {
+    if (elector.parentZNodeExists()) {
+      try {
+        elector.clearParentZNode();
+      } catch (IOException e) {
+        LOG.error("Unable to clear zk parent znode", e); 
+        return 1;
+      }   
+    }   
+    return 0;
+  }
+   
   private boolean confirmFormat() {
     String parentZnode = getParentZnode();
     System.err.println(

ERRO pool crashmailbatch-monitor event buffer overflowed, discarding event 364

I deployed supervisord on two nodes. Node 59 runs normally and its log output looks fine, but the supervisor on node 8 always shows fatal status.
Below is the supervisor status shown when visiting 10.38.11.8:9001:
State Description Name
fatal Exited too quickly (process log may have details) crashmailbatch-monitor
fatal Exited too quickly (process log may have details) processexit-monitor
fatal Exited too quickly (process log may have details) zookeeper--dptst--zookeeper
Part of the log output:
2014-05-16 13:17:07,290 INFO spawned: 'zookeeper--dptst--zookeeper' with pid 22291
2014-05-16 13:17:07,290 INFO supervisord wrote sub-process pidfile
2014-05-16 13:17:07,299 ERRO pool crashmailbatch-monitor event buffer overflowed, discarding event 364
2014-05-16 13:17:07,299 INFO exited: zookeeper--dptst--zookeeper (exit status 126; not expected)
2014-05-16 13:17:07,299 INFO supervisord wrote sub-process pidfile
2014-05-16 13:17:08,300 ERRO pool crashmailbatch-monitor event buffer overflowed, discarding event 365
2014-05-16 13:17:08,304 INFO spawned: 'zookeeper--dptst--zookeeper' with pid 22295
2014-05-16 13:17:08,305 INFO supervisord wrote sub-process pidfile
2014-05-16 13:17:08,314 ERRO pool crashmailbatch-monitor event buffer overflowed, discarding event 366
2014-05-16 13:17:08,314 INFO exited: zookeeper--dptst--zookeeper (exit status 126; not expected)
2014-05-16 13:17:08,315 INFO supervisord wrote sub-process pidfile
2014-05-16 13:17:10,315 ERRO pool crashmailbatch-monitor event buffer overflowed, discarding event 367
2014-05-16 13:17:10,319 INFO spawned: 'zookeeper--dptst--zookeeper' with pid 22299
2014-05-16 13:17:10,319 INFO supervisord wrote sub-process pidfile
2014-05-16 13:17:10,319 ERRO pool crashmailbatch-monitor event buffer overflowed, discarding event 368
2014-05-16 13:17:10,329 ERRO pool crashmailbatch-monitor event buffer overflowed, discarding event 369
2014-05-16 13:17:10,329 INFO exited: zookeeper--dptst--zookeeper (exit status 126; not expected)
2014-05-16 13:17:10,329 INFO supervisord wrote sub-process pidfile
2014-05-16 13:17:13,330 ERRO pool crashmailbatch-monitor event buffer overflowed, discarding event 370
2014-05-16 13:17:13,332 INFO spawned: 'zookeeper--dptst--zookeeper' with pid 22303
2014-05-16 13:17:13,332 INFO supervisord wrote sub-process pidfile
2014-05-16 13:17:13,343 ERRO pool crashmailbatch-monitor event buffer overflowed, discarding event 371
2014-05-16 13:17:13,343 INFO exited: zookeeper--dptst--zookeeper (exit status 126; not expected)
2014-05-16 13:17:13,344 INFO supervisord wrote sub-process pidfile
2014-05-16 13:17:14,344 ERRO pool crashmailbatch-monitor event buffer overflowed, discarding event 372
2014-05-16 13:17:14,344 INFO gave up: zookeeper--dptst--zookeeper entered FATAL state, too many start retries too quickly
2014-05-16 13:17:15,344 ERRO pool crashmailbatch-monitor event buffer overflowed, discarding event 373

Errors when starting Tank; Tank cannot be accessed

innosql@db-43:~/raolh/bigdata/minos$ ./build.sh start tank
2015-12-29 14:27:17 Building tank server
2015-12-29 14:27:17 Check and install prerequisite python libraries
2015-12-29 14:27:17 Installing django
Downloading/unpacking django>=1.5.5
Downloading Django-1.9-py2.py3-none-any.whl (6.6MB): 6.6MB downloaded
Installing collected packages: django
Compiling /mnt/ddb/1/innosql/minos/build/env/build/django/django/conf/app_template/apps.py ...
SyntaxError: ('invalid syntax', ('/mnt/ddb/1/innosql/minos/build/env/build/django/django/conf/app_template/apps.py', 4, 7, 'class {{ camel_case_app_name }}Config(AppConfig):\n'))

Compiling /mnt/ddb/1/innosql/minos/build/env/build/django/django/conf/app_template/models.py ...
SyntaxError: ('invalid syntax', ('/mnt/ddb/1/innosql/minos/build/env/build/django/django/conf/app_template/models.py', 1, 26, '{{ unicode_literals }}from django.db import models\n'))

Successfully installed django
Cleaning up...
2015-12-29 14:27:24 The component tank is built successfully
2015-12-29 14:27:24 Starting Tank server
Unknown command: 'syncdb'
Type 'manage.py help' for usage.
2015-12-29 14:27:26 Start Tank server success

Although a successful start is reported, Tank cannot be accessed via ip:port, and the log contains the following error:

Traceback (most recent call last):
File "/usr/lib/python2.7/wsgiref/handlers.py", line 85, in run
self.result = application(self.environ, self.start_response)
File "/home/innosql/raolh/bigdata/minos/build/env/local/lib/python2.7/site-packages/django/core/handlers/wsgi.py", line 158, in __call__
self.load_middleware()
File "/home/innosql/raolh/bigdata/minos/build/env/local/lib/python2.7/site-packages/django/core/handlers/base.py", line 51, in load_middleware
mw_class = import_string(middleware_path)
File "/home/innosql/raolh/bigdata/minos/build/env/local/lib/python2.7/site-packages/django/utils/module_loading.py", line 20, in import_string
module = import_module(module_path)
File "/usr/lib/python2.7/importlib/__init__.py", line 37, in import_module
__import__(name)
File "/home/innosql/raolh/bigdata/minos/build/env/local/lib/python2.7/site-packages/django/contrib/auth/middleware.py", line 3, in <module>
from django.contrib.auth.backends import RemoteUserBackend
File "/home/innosql/raolh/bigdata/minos/build/env/local/lib/python2.7/site-packages/django/contrib/auth/backends.py", line 4, in <module>
from django.contrib.auth.models import Permission
File "/home/innosql/raolh/bigdata/minos/build/env/local/lib/python2.7/site-packages/django/contrib/auth/models.py", line 6, in <module>
from django.contrib.contenttypes.models import ContentType
File "/home/innosql/raolh/bigdata/minos/build/env/local/lib/python2.7/site-packages/django/contrib/contenttypes/models.py", line 159, in <module>
class ContentType(models.Model):
File "/home/innosql/raolh/bigdata/minos/build/env/local/lib/python2.7/site-packages/django/db/models/base.py", line 103, in __new__
"application was loaded. " % (module, name))
RuntimeError: Model class django.contrib.contenttypes.models.ContentType doesn't declare an explicit app_label and either isn't in an application in INSTALLED_APPS or else was imported before its application was loaded.
[29/Dec/2015 15:28:39] "GET /favicon.ico HTTP/1.1" 500 59

Is this problem related to the Django version that was downloaded? How can it be fixed?
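The log above actually points at the cause: the build pulled in Django 1.9, whose project templates contain `{{ }}` placeholders (hence the harmless compile-time SyntaxErrors) and which removed the syncdb management command, so Tank's database is never initialized. A hedged workaround is to pin Django to a pre-1.7 release that still ships syncdb (the exact bound is an assumption; Minos dates from the Django 1.5/1.6 era):

```
# requirements pin; the version bound is an assumption
django>=1.5.5,<1.7
```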
