Chef BACH

Overview

This is a set of Chef cookbooks to bring up Hadoop and Kafka clusters. These cookbooks also provide a number of supporting services - such as DNS, metrics, and monitoring; see below for a partial list.

Hadoop

Each Hadoop head node role is specific to a Hadoop component. The roles are intended to be layered in a highly-available manner: multiple BCPC-Hadoop-Head-* machines will correctly build a MySQL, ZooKeeper, HDFS JournalNode, etc. cluster and deploy the named component as well. Further, for components which support HA, the intention is that one can simply add the role to multiple machines and the right thing will be done to support HA (except in the case of HDFS).

To set up HDFS HA, follow this model from your Bootstrap VM:

  • Install the cluster once with a non-HA HDFS:
    • with a BCPC-Hadoop-Head-Namenode-NoHA role
    • with the following node variable [:bcpc][:hadoop][:hdfs][:HA] = false
    • ensure at least three machines are installed with BCPC-Hadoop-Head roles
    • ensure at least one machine is a datanode
    • run cluster-assign-roles.sh <Environment> Hadoop successfully
  • Re-configure the cluster with an HA HDFS:
    • change the BCPC-Hadoop-Head-Namenode-NoHA machine's role to BCPC-Hadoop-Head-Namenode
    • set the following node variable [:bcpc][:hadoop][:hdfs][:HA] = true on all nodes (e.g. in the environment; see the sketch below)
    • run cluster-assign-roles.sh <Environment> Hadoop successfully
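For illustration, here is a hedged sketch of the environment-file fragment for the second pass. The attribute path comes from the steps above; whether it belongs under default_attributes or override_attributes depends on how your environment file is organized:

      "override_attributes": {
        "bcpc": {
          "hadoop": {
            "hdfs": {
              "HA": true
            }
          }
        }
      }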

Setup

These recipes are currently intended for building a BACH cluster on top of Ubuntu 14.04 servers using Chef 11. When setting this up in VMs, be sure to add a few dedicated disks (for HDFS DataNodes) aside from the boot volume.

You should look at the various settings in cookbooks/bcpc/attributes/default.rb and tweak accordingly for your setup (by adding them to an environment file).

Cluster Bootstrap

The provided scripts set up a Chef server via Vagrant.

Once the Chef server is set up, you can bootstrap any number of nodes to get them registered with the Chef server for your environment - see the next section for enrolling the nodes.

Make a cluster

To build a new BACH cluster, you have to start by building the head nodes first. (This assumes that you have already completed the bootstrap process and have a Chef server available.) Since the recipes automatically generate all passwords and keys for the new cluster, the nodes must temporarily become admins in the Chef server so that the recipes can write the generated info to a data bag. The data bag will be called configs and the data bag item will have the same name as the environment (Test-Laptop in this example). You only need to leave the node as an admin for the first chef-client run. You can also manually create the data bag and item (as per the example in data_bags/configs/Example.json) and upload it yourself if you'd rather not bother with the whole admin thing for the first run.

To assign machines a role, one can update the cluster.txt file and ensure all necessary information is provided as per cluster-readme.txt.

Using the script tests/automated_install.sh, one can run through the expected "happy-path" install for a single machine running (by default) four Vagrant VMs. This simple install supports changing only DNS, proxy, and VM resource settings. (This is the basis of our automated build tests.)

Note: To run more than one test cluster at a time with VirtualBox, one may export BACH_CLUSTER_PREFIX to set the desired cluster name prefix. This namespaces the cluster's virtual machines so that they do not collide on the hypervisor, resulting in names following the convention:

      ${BACH_CLUSTER_PREFIX}-bcpc-bootstrap
      ${BACH_CLUSTER_PREFIX}-bcpc-vm1
      ${BACH_CLUSTER_PREFIX}-bcpc-vm2
      ${BACH_CLUSTER_PREFIX}-bcpc-vm3

Lacking a $BACH_CLUSTER_PREFIX, tests/automated_install.sh will not assign a cluster prefix to the cluster hosts or bootstrap. One also needs to ensure the management, float, and storage network ranges differ between clusters (in the environment and cluster.txt) -- update them to be unique. Further, each cluster's repository needs to be in a different parent directory (to keep the cluster directories from colliding).
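For example, a hedged sketch of bringing up one of several side-by-side clusters (the prefix value is illustrative):

      export BACH_CLUSTER_PREFIX=clusterA
      tests/automated_install.sh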

Note: For man-in-the-middle proxy or local repository users, ensure local SSL certificate authority certificates are located on your hypervisor at /usr/local/share/ca-certificates; this will populate your bootstrap with the necessary certificates. Further, to bypass the proxy for specific hosts, one can set $additional_no_proxy to a comma-separated list of hosts or *-wildcard domains. (This is specifically useful for local APT, Maven, or Ruby repositories.)
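For example (the certificate file and hostnames here are illustrative, not values from this repository), one might prepare the hypervisor like so before running the install:

      sudo cp corporate-ca.crt /usr/local/share/ca-certificates/
      export additional_no_proxy="apt.example.com,*.maven.example.com"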

Other Deployment Flavors

In addition to the "happy-path" integration test using automated_install.sh, there are ways to deploy to bare-metal hosts. For those using test-kitchen, there are various test-kitchen suites one can run as well.

(Figure: flow chart of BACH deployment flavors -- Vagrant, Vagrant bootstrap, and bare-metal -- giving a view of the various full-cluster deployment types.)

Using a BACH cluster

Once the nodes are configured and bootstrapped, BACH services will be accessible via the floating IP. (For the Test-Laptop environment, it is 10.0.100.5.)

For example, you can go to https://10.0.100.5:8888 for the Graphite web interface. To find the automatically-generated service credentials, look in the data bag for your environment.

vagrant@bootstrap:~$ knife data bag show configs Test-Laptop | grep mysql-root-password
mysql-root-password:       abcdefgh

For example, to check on HDFS:

vagrant@bcpc-vm1:~$ HADOOP_USER_NAME=hdfs hdfs dfsadmin -report
Configured Capacity: 40781217792 (37.98 GB)
Present Capacity: 40114298221 (37.36 GB)
DFS Remaining: 39727463789 (37.00 GB)
DFS Used: 386834432 (368.91 MB)
DFS Used%: 0.96%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0

-------------------------------------------------
Live datanodes (1):

Name: 10.0.100.13:50010 (bcpc-vm3.bcpc.example.com)
Hostname: bcpc-vm3.bcpc.example.com
Decommission Status : Normal
Configured Capacity: 40781217792 (37.98 GB)
DFS Used: 386834432 (368.91 MB)
Non DFS Used: 666919571 (636.02 MB)
DFS Remaining: 39727463789 (37.00 GB)
DFS Used%: 0.95%
DFS Remaining%: 97.42%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 12
Last contact: Fri Aug 14 21:08:23 EDT 2015

Chef-BACH Philosophies

The philosophy behind BACH cluster operation is that no single machine is special and all services are multi-master or have sufficiently fast failover to prevent failure in application data paths and availability. Commits to the codebase should be deployable without requiring path dependence on the previous repository state. For example, a machine should be able to be PXE-booted fresh into a particular version of the code, while an existing machine should be able to simply run Chef to upgrade into a particular Chef-BACH version. Unhealthy machines should always be able to be torn down and reinstalled from scratch without disruption. Any Chef-BACH version which requires manual interaction is considered BREAKING (as a GitHub tag) and should be avoided as much as possible; our mantra is that all operations are handled automatically. All services should be secured and kerberized as appropriate, and testing should be done with both kerberized and non-kerberized VM clusters.

BACH Services

BACH currently relies upon a number of open-source packages.

Thanks to all of these communities for producing this software!

Contributing

See our contributing document for more.

Contributors

aespinosa, amithkanand, amscanne, bijugs, caiush, cbaenziger, chrislongo, dbahir, drraywang, dspadea, ekund, ericvw, http-418, indexzero, jerenkrantz, jmh045000, kamidzi, kiiranh, kpfleming, leochen4891, macmaster, mihalis68, pchandra, pu239ppy, rrichardson, sjain1991, snukavarapu, vijayk, vineshcpaul, vt0r


chef-bach's Issues

Unable to drop table in Hive

Dropping a table in Hive doesn't work and throws a SQL JDO exception. Initial investigation shows that the prepared statement doesn't have the actual value, causing the statement not to execute. More details on this soon...

Codify HDFS Clean-Up and Space Management

We know we want to expire YARN logs and temporary data in /tmp automatically; what else should we expire? Do we want to periodically dump the FSImage to find this data?

Hannibal Pulls in Maven and Breaks Internet Disconnected Clusters

Today, if one runs a bootstrap node copying only the bins directory, the run will bomb out with:

[2015-02-17T19:57:35+00:00] ERROR: ark[maven] (maven::default line 28) had an error: Chef::Exceptions::MultipleFailures: Multiple failures occurred:
* Errno::ETIMEDOUT occurred in chef run: remote_file[/var/chef/cache/maven-3.1.1.tar.gz] (/var/chef/cache/cookbooks/ark/providers/default.rb line 45) had an error: Errno::ETIMEDOUT: Error connecting to http://apache.mirrors.tds.net/maven/maven-3/3.1.1/binaries/apache-maven-3.1.1-bin.tar.gz - Connection timed out - connect(2)

This is because today we have a Hannibal-Build role which is not guarded on having a pre-built Hannibal.

Ideally, this would be idempotent for a bootstrap node on the Internet (and building Hannibal) or off the Internet (with a pre-built Hannibal).

Updating JAVA_HOME property in config file doesn't restart services

There are two bugs:

  1. I updated the node["bcpc"]["hadoop"]["java"] attribute to the Oracle Java home and re-cheffed the nodes (on a VM cluster using cluster-assign-roles). The Hadoop services (the HBase master, for example) were not restarted. I killed the Java processes on the nodes and then re-cheffed; only then were the services started with the updated JAVA_HOME property.
  2. cookbooks/bcpc-hadoop/templates/default/hb_hbase-env.sh.erb explicitly exports JAVA_HOME instead of using the node["bcpc"]["hadoop"]["java"] attribute (line 108).

Hive Services have incorrect start order

In the current implementation, hive-server is created and started before the Hive metastore. This causes HiveServer to not come up, as it requires the Hive metastore to be up and running. We need to move the hive-server service to the bcpc-hadoop::hive_metastore recipe, after the hive-metastore service. Also, we do not need the bash resource, as starting HiveServer is taken care of by the hive-server init script.

Move attributes defined in environment file to default_attribute section

Currently many important attributes are defined in the environment file under the override_attributes section. These should be moved to the default_attributes section, as they are default values for the cluster (a sketch of the change follows the documentation quote below).

As per the Chef documentation:

At the beginning of a chef-client run, all default, override, and automatic attributes are reset. The chef-client rebuilds them using data collected by Ohai at the beginning of the chef-client run and by attributes that are defined in cookbooks, roles, and environments. Normal attributes are never reset. All attributes are then merged and applied to the node according to attribute precedence. At the conclusion of the chef-client run, all default, override, and automatic attributes disappear, leaving only a collection of normal attributes that will persist until the next chef-client run.
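For illustration, a hedged sketch of the proposed change in an environment file; the attribute shown is a placeholder, not one of the cluster's real attributes:

    -  "override_attributes": {
    -    "bcpc": { "cluster_domain": "bcpc.example.com" }
    -  }
    +  "default_attributes": {
    +    "bcpc": { "cluster_domain": "bcpc.example.com" }
    +  }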

Add dfs.balance.bandwidthPerSec parameter to hdfs-site.xml

We have experienced issues with respect to how HDFS data is stored on different nodes. We have seen some nodes consuming more disk space, with usage going up to 100%, while other nodes stay at very low disk usage of around 10%. To re-balance the data across all the nodes we can use the HDFS balancer, and to use that we must specify the network bandwidth the balancer utility may consume. I recommend adding the dfs.balance.bandwidthPerSec parameter to hdfs-site.xml to control the network bandwidth used while re-balancing the cluster.

get_req_node_attributes fails a chef run with unhelpful "undefined method `[]' for nil:NilClass"

If one searches a node for an attribute it does not have, instead of properly erroring out and reporting that the node object was incomplete, today we see the following failure:

================================================================================
Recipe Compile Error in /var/chef/cache/cookbooks/kafka-bcpc/recipes/setattr.rb
================================================================================


NoMethodError
-------------
undefined method `[]' for nil:NilClass


Cookbook Trace:
---------------
  /var/chef/cache/cookbooks/bcpc-hadoop/libraries/utils.rb:239:in `block (3 levels) in get_req_node_attributes'
  /var/chef/cache/cookbooks/bcpc-hadoop/libraries/utils.rb:239:in `each'
  /var/chef/cache/cookbooks/bcpc-hadoop/libraries/utils.rb:239:in `reduce'
  /var/chef/cache/cookbooks/bcpc-hadoop/libraries/utils.rb:239:in `block (2 levels) in get_req_node_attributes'
  /var/chef/cache/cookbooks/bcpc-hadoop/libraries/utils.rb:238:in `each'
  /var/chef/cache/cookbooks/bcpc-hadoop/libraries/utils.rb:238:in `block in get_req_node_attributes'
  /var/chef/cache/cookbooks/bcpc-hadoop/libraries/utils.rb:236:in `each'
  /var/chef/cache/cookbooks/bcpc-hadoop/libraries/utils.rb:236:in `get_req_node_attributes'
  /var/chef/cache/cookbooks/kafka-bcpc/recipes/setattr.rb:16:in `from_file'


Relevant File Content:
----------------------
/var/chef/cache/cookbooks/bcpc-hadoop/libraries/utils.rb:

232:  # [ { :node_number => "val", :hostname => "nameval" }, ...]
233:  #
234:  def get_req_node_attributes(node_objects,srch_keys)
235:    result = Array.new
236:    node_objects.each do |obj|
237:      temp = Hash.new
238:      srch_keys.each do |name, key|
239>>       val = key.split('.').reduce(obj) {|memo, key| memo[key]}
240:        temp[name] = val
241:      end
242:      result.push(temp)
243:    end
244:    return result
245:  end
246:
247:  #
248:  # Restarting of hadoop processes need to be controlled in a way that all the nodes

Namenode's name directory is defined twice

The current implementation defines the namenode's name directory twice in hdp_hdfs-site.xml.erb: once using the deprecated property dfs.name.dir and once using the new property dfs.namenode.name.dir. The two parameters point to different directory locations, which causes confusion when debugging any HDFS-related issue. Also, since the new property overrides the deprecated one, keeping the deprecated property is unnecessary. Below is the output from Unix:

$ ls -l /disk/*/dfs/
/disk/0/dfs/:
total 0
drwx------ 2 hdfs hdfs  6 Jan  7 20:31 namedir
drwxr-xr-x 3 hdfs hdfs 20 Jan  8 21:27 nn

The deprecated property should be removed from the configuration file.

Add oozie-client to worker role/node

Currently, oozie-client is only installed on the node running the oozie-server process, leaving worker/gateway nodes without oozie-client. With this approach, users are not able to access the Oozie server from any worker/gateway node and are forced to log on to the node running the oozie-server process. To fix this we need to add oozie-client installation to the worker/gateway nodes.

Graphite Handler Prevents node Object Saving

With the introduction of the Chef handler for Chef metrics in Chef-BACH #44 ("Changes to collect stats from Chef client runs"), we have an issue when standing up a pure Zookeeper/Kafka cluster.

If the metrics stack is not included somewhere, the handler will cause Chef to end with:

Running handlers:
[2015-02-04T15:43:51-05:00] ERROR: Running exception handlers
  - GraphiteReporting
Running handlers complete

[2015-02-04T15:43:51-05:00] ERROR: Exception handlers complete

Unfortunately, this will prevent the node from being saved back to the Chef Server, and that is required for:

One can work around this in two ways:
Add the monitoring stack to the BCPC-Kafka-Head-Zookeeper role:

      "recipe[bcpc::networking]",
+    "recipe[bcpc::mysql]",
+    "recipe[bcpc::keepalived]",
+    "recipe[bcpc::haproxy]",
+    "recipe[bcpc::graphite]",
      "recipe[bcpc-hadoop::disks]",

Remove the Graphite Chef handler from the Basic role:

-     "recipe[bcpc::graphite_handler]",

Or perhaps we can look at extending the Graphite Chef handler to not return a failure if it cannot update Graphite, controlled by a node attribute.

Chef test vms in parallel in automated_install.sh

Currently, in tests/automated_install.sh lines 79-82, there is a for loop which chefs (cluster-assign-roles) each Hadoop VM serially. This does not set up the services on the VMs correctly.
Replace the for loop with a single cluster-assign-roles statement without specifying a VM.
This should set /etc/hosts correctly on all the nodes without having to chef them in a particular order.

Missing links to compression libraries

The HDP installation of the LZO compression library installs libraries under the /usr/lib/hadoop/lib/native/Linux-amd64-64 directory. This causes MapReduce jobs to throw an error:

2014-11-10 13:55:32,159 ERROR [main] com.hadoop.compression.lzo.GPLNativeCodeLoader: Could not load native gpl library java.lang.UnsatisfiedLinkError: no gplcompression in java.library.path at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1886)

The fix for this issue is to create links under /usr/lib/hadoop/lib/native that point to the LZO libraries installed under /usr/lib/hadoop/lib/native/Linux-amd64-64.
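A minimal sketch of that fix, assuming the library files match libgplcompression* (the glob is an assumption based on the error above, not verified against the package contents):

      # Link the platform-specific LZO libraries into the directory
      # already on java.library.path (the glob is an assumption):
      for lib in /usr/lib/hadoop/lib/native/Linux-amd64-64/libgplcompression*; do
        ln -sf "$lib" /usr/lib/hadoop/lib/native/
      done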

Incorrect github branch name in bootstrap.rb recipe

After moving to chef-bach, we are on the master branch instead of the hadoop_hortonworks branch. bcpc::bootstrap creates a cron entry that pulls the hadoop_hortonworks branch, which is no longer correct and will cause the local repository to not refresh with the latest changes. The branch name should be master instead of hadoop_hortonworks.

Need mapred user on namenode for mapreduce usage

On a test cluster, one currently sees the following errors on the namenode when trying to run YARN jobs:
      2015-01-16 23:57:32,456 WARN org.apache.hadoop.security.UserGroupInformation: No groups available for user mapred
      2015-01-16 23:57:32,457 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:mapred (auth:SIMPLE) cause:org.apache.hadoop.security.AccessControlException: Permission denied: user=mapred, access=READ_EXECUTE, inode="/user/history/done_intermediate/ubuntu":ubuntu:mapred:drwxrwx---

This is due to the default in our clusters of resolving a user's group on the namenode.

Vagrant baremetal does not regenerate admin.pem

Today, there are two issues which can bite one while setting up clusters from a generic bootstrap node image (e.g. using the Vagrant baremetal approach).

If one mis-specifies the node[:chef_client][:server_url] attribute (e.g. to another Chef server), one will get a failure on provisioning at best, or will reconfigure an innocent Chef server at worst. There should be an interlock between node[:bcpc][:bootstrap][:server] and node[:chef_client][:server_url] to ensure they are the same.

The risk of reconfiguring an innocent Chef server stems from not re-generating /etc/chef-server/admin.pem and the associated key on the Chef server as well.

Provide Parquet for Pig

A Parquet wrapper is provided in Pig by PIG-3445, but to actually use ParquetStorer and ParquetLoader one needs to add parquet-pig-bundle.jar. Let's add this to our cluster builds by default.

"java.io.IOException: The region's reads are disabled"

We are seeing the following exception in the log file: "java.io.IOException: The region's reads are disabled".

To fix this we need to change hbase-site.xml and define the ZooKeeper port as an independent property instead of combining it with the hostname in the form hostname:port under the hbase.zookeeper.quorum property. The new property is hbase.zookeeper.property.clientPort.
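A hedged sketch of the resulting hbase-site.xml fragment (the hostnames and port value are illustrative):

      <property>
        <name>hbase.zookeeper.quorum</name>
        <value>bcpc-vm1,bcpc-vm2,bcpc-vm3</value>
      </property>
      <property>
        <name>hbase.zookeeper.property.clientPort</name>
        <value>2181</value>
      </property>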

Axioms of Chef-BACH Architecture

We should have an architecture plan for our sprawling codebase. This would help us understand where components should go and what future development should look like.

Code Placement

Today

We have code in the following cookbooks today:

Proposal for the future

One direction we can move to looks like:

  • bcpc -> bach_os
    • Remove all OpenStack components
  • bcpc-hadoop -> hadoop
    • Eventually this would be wrapper cookbooks around a core of community Hadoop ecosystem component cookbooks
    • Move graphite_to_zabbix to bach_os as it is not Hadoop related
  • kafka-bcpc -> bach_kafka (or kafka_wrapper?)
  • bcpc_jmxtrans -> bach_jmxtrans (or jmxtrans_wrapper?)

Creation of bach_bootstrap

The architecture envisioned above introduces a bach_bootstrap cookbook. It would consist of the following code today:

  • bcpc::bootstrap*
  • bcpc-hadoop::{java,maven}_config wrappers
  • build_bins.sh would move to a recipe(s)

Axioms guiding cookbook development:

General axioms guiding Chef-BACH

  • Code should be runnable on a periodic basis (e.g. every five minutes) without ill effect
  • Ideally code can be upgraded in place; upgrades which cannot happen in place should be achievable via a reinstall of the machine to get an identical system with the new features
    • For a cluster node this is achieved via HA, data replication, etc. and a PXE re-install
    • For a bootstrap node how do we achieve this?

Axioms guiding bach_bootstrap:

  • Recipes can be run Internet-connected once
    • All necessary binaries from the Internet would be stored in what is today chef-bcpc/bins
  • Recipes run with a pre-stashed artifact in chef-bcpc/bins should no longer try to talk to the Internet

Axioms guiding bach_os:

  • Machines do not talk to the Internet
  • bach_os may set up an OS for Hadoop, Kafka, or unaffiliated types of systems (e.g. Solr machines)

Axioms guiding hadoop:

  • Systems serving an HTTP interface running on a "head" node should be active-active
  • Services deployed on the cluster should be built elsewhere and only the service deployed in hadoop (e.g. hannibal_build.rb run on bootstrap versus hannibal_deploy.rb run on the Hadoop cluster)

Fuzzy Areas:

  • Though not strictly following the Role Cookbooks and Wrapper Cookbooks patterns, it is good to not rely on roles or overload the environment to an unnecessary degree -- but where to draw the line?
  • If a wrapper cookbook is just setting attributes, can they be reasonably set as attributes on the bcpc-hadoop cookbook? (Or should they explicitly be wrapper cookbooks?)
  • Where do we put generic cluster operations library code (e.g. get_nodes_for or wait_until_ready)?
    • Do we make this all go in a bach_cluster_ops cookbook?
    • Or do we bundle this code as a gem?

Setup Kafka-Zookeeper and Kafka-Server role on the same node.

In the current implementation we set up a Kafka cluster using separate sets of nodes for Kafka ZooKeeper and Kafka Server. We have seen issues with the Kafka server registering with ZooKeeper when both roles (BCPC-Kafka-Head-Zookeeper and BCPC-Kafka-Head-Server) are assigned to the same node. For some reason, the Kafka server seems to require a proper ZooKeeper quorum up and running before registering its broker with ZooKeeper. This needs to be further investigated and fixed so that we can assign both roles to the same node.

Incorrect get_all_nodes is called

Currently we have get_all_nodes methods defined in two separate cookbooks (bcpc and bcpc-hadoop). The difference between the two methods is that the bcpc method returns all nodes irrespective of the role assigned to a node, whereas the bcpc-hadoop method returns all nodes that have any BCPC* role assigned to them. During the Chef compilation phase, the get_all_nodes method defined in the bcpc-hadoop cookbook overwrites the definition of get_all_nodes in the bcpc cookbook, causing the method to return only BCPC-specific nodes.
Consider a scenario where new nodes are added to the Chef server that are not BCPC-specific but perform some other activity for the cluster. In such cases the bcpc::networking recipe will fail to generate a correct /etc/hosts file, as the get_all_nodes method only returns BCPC-specific nodes.

As of today, the get_all_nodes method in the bcpc-hadoop cookbook is not used/called by any recipe.

$ grep -R "get_all_nodes" .
./cookbooks/bcpc-centos/recipes/networking.rb:  variables( :servers => get_all_nodes )
./cookbooks/bcpc-hadoop/libraries/utils.rb:def get_all_nodes
./cookbooks/bcpc/recipes/ceph-work.rb:        storage_ips = get_all_nodes.collect{|x| x['bcpc']['storage']['ip']}
./cookbooks/bcpc/recipes/networking.rb:    variables( :servers => get_all_nodes )
./cookbooks/bcpc/recipes/powerdns.rb:get_all_nodes.each do |server|
./cookbooks/bcpc/recipes/nova-head.rb:        all_hosts = get_all_nodes.collect{|x| x['hostname']}
./cookbooks/bcpc/recipes/mysql.rb:  variables( :max_connections => [get_nodes_for('mysql','bcpc').length*50+get_all_nodes.length*5, 200].max,
./cookbooks/bcpc/libraries/utils.rb:def get_all_nodes

I suggest that we should do one of the following:

  • Rename the method in the bcpc-hadoop cookbook to get_all_bcpc_nodes and update any reference to it (at this time, none)
  • Remove the method

Namenode disk replacement not automatically handled

When a namenode has a disk replaced, the namenode daemon appears to need a bounce to start repopulating the disk.

Once a disk is recovered (e.g. its mount point goes from returning EIO to working again, possibly needing a fix for opscode/chef issue #2680), it appears the namenode will not recognize the disk as recovered and will not start repopulating it automatically. Our namenode_standby recipe has a check to ensure all namenode disks have a current directory. However, we have two issues:

  • Bootstrapping the namenode again requires it to be shut down (so we need a check to ensure the namenode is stopped before running the bootstrap)
  • Bootstrapping a namenode requires the edits to be reasonably caught up; rolling the edits can do this (so we need a dfsadmin -rollEdits; see the sketch below)
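A hedged sketch of the manual recovery sequence described above; the service name and exact command ordering are assumptions, not what these cookbooks necessarily deploy:

      service hadoop-hdfs-namenode stop                 # namenode must be down before bootstrapping
      HADOOP_USER_NAME=hdfs hdfs dfsadmin -rollEdits    # roll the edits via the active namenode
      sudo -u hdfs hdfs namenode -bootstrapStandby      # repopulate the recovered name directories
      service hadoop-hdfs-namenode start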

Zookeeper connections are precious and other ZK administrivia

If one is running a large cluster and does not scale node[:bcpc][:hadoop][:zookeeper][:maxClientCnxns], then they will have the joys of ZooKeeper unavailability and of log entries spewing like:

2014-12-16 17:11:38,921 [myid:12] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@193] - Too many connections from /1.2.3.4 - max is 500

Ideally, we can create monitoring in Zabbix from the JMX metrics provided under org.apache.ZooKeeperService -> ReplicatedServer_id<nodeNumber> -> replica.<nodeNumber> -> Follower:

  • Attributes ->
    • PendingRevalidationCount
    • AvgRequestLatency
    • MaxRequestLatency
    • MinRequestLatency
    • NumAliveConnections
    • OutstandingRequests
    • PacketsReceived
    • PacketsSent
  • Connections ->
    • client IP ->
      • connection ptr ->
        • OutstandingRequest
        • PacketsReceived
        • PacketsSent
        • MinRequestLatency
        • MaxRequestLatency
        • AvgRequestLatency
        • LastLatency
    • ...
  • InMemoryDataTree ->
    • Attributes ->
      • NodeCount
      • WatchCount

The question is how to see rejected connections, which I'm not seeing here. Regardless, I think a lot of useful cluster monitoring can be done here.

Provide non-Java Pig UDF support

Pig supports Python UDFs, but one needs Jython on the system (similarly, one needs Rhino for JavaScript, JRuby for Ruby, or Groovy-all for Groovy). Let's find a way to help folks who need custom logic avoid having to write and compile Java UDFs.

Vagrant Baremetal DHCP Client Overwrites Default Route

For the baremetal bootstrap node, it would be nice to reach the machine via its off-host-routable bridged interface; the alternative, the VirtualBox NAT'd interface, is not reachable off-host by default.

However, dhclient-script is rather aggressive in always overwriting the host's default route when it RENEWs the lease. Luckily, dhclient can be told to allow overrides with the following added to /etc/dhcp/dhclient.conf:

      supersede routers _router IP_;

Template hv_hive-server2.erb has incorrect command to start hive-server2

In the template hv_hive-server2.erb, the hiveserver2 command is used to start the hive-server2 process. The start-stop-daemon passes parameters to hiveserver2 that are not required, as hiveserver2 doesn't accept any parameters. This causes the script to fail, and the hive-server2 process is not started at all.

One of the following approaches can be used to fix the issue:

  1. Replace the hiveserver2 command with the hive command, OR
  2. Remove the parameters for the Hive server from the start-stop-daemon command line

Incompatible MySQL connector installed by Hive-Metastore recipe

The current bcpc-hadoop::hive-metastore recipe installs the libmysql-java package, which in turn installs mysql-connector-java-5.1.16.jar; this is not compatible with the Percona 5.6 MySQL we use as of today. The old version (5.1.16) of the MySQL connector should be replaced with a newer version (5.1.34 or later, if available).

Add topology script

We should have a way to provide a cluster topology script via a configurable template for HDFS rack awareness.
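As a hedged illustration (not code from these cookbooks), HDFS rack awareness is driven by an executable named in the net.topology.script.file.name property of core-site.xml; it receives datanode IPs or hostnames as arguments and must print one rack path per argument. A Chef template could render the host-to-rack map in a sketch like this:

      #!/bin/bash
      # Hypothetical rack-topology script; the host-to-rack map is
      # illustrative and would be rendered from node attributes by a template.
      declare -A RACK=(
        [10.0.100.11]="/rack1"
        [10.0.100.12]="/rack1"
        [10.0.100.13]="/rack2"
      )
      # HDFS passes one or more addresses; print one rack path per argument.
      for host in "$@"; do
        echo "${RACK[$host]:-/default-rack}"
      done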

Install "HiveServer" provided by HDP

In the current implementation we only install the Hive-Metastore package from the HDP repository, which creates the /etc/init.d/hive-metastore script and starts the metastore process. For Hive-Server, the startup script /etc/init.d/hive-server2 is created through a template resource to start the service. In order to stay in sync with changes implemented by Hortonworks, the hive-server package should be installed from the HDP repository.

Past History

This codebase started life as a branch on the Chef-BCPC repo, and as such there is significant development tracked on that repository. The following issues and pull requests were tagged for the Hadoop work:

Issues

  • bloomberg/chef-bcpc#83 - On a mac, sed -i cmd file should be sed -i -e cmd file Hadoop
  • bloomberg/chef-bcpc#84 - Mac bootstrap VM create warning Hadoop
  • bloomberg/chef-bcpc#85 - Warning message: ERROR: RuntimeError: Please set EDITOR environment variable Hadoop
  • bloomberg/chef-bcpc#96 - Pip Breaks Behind MITM Proxies Hadoop upstream
  • bloomberg/chef-bcpc#126 - automated_install.sh script error on Mac Hadoop
  • bloomberg/chef-bcpc#127 - automated_install.sh script error on Mac due to difference in "nc" syntax Hadoop
  • bloomberg/chef-bcpc#128 - Percona is Breaking Build Hadoop
  • bloomberg/chef-bcpc#129 - Error in hive-site.xml format causing issue with using Hive Hadoop
  • bloomberg/chef-bcpc#130 - Cookbook naming causing issues with LWRP bug Hadoop upstream
  • bloomberg/chef-bcpc#133 - Bringing up hbase-master on VM2 before VM3 is ready Hadoop
  • bloomberg/chef-bcpc#136 - Graphite tables are not created Hadoop
  • bloomberg/chef-bcpc#144 - Sending data to Graphite on vip Hadoop
  • bloomberg/chef-bcpc#153 - Port values of graphite not persisted as node attributes Hadoop
  • bloomberg/chef-bcpc#162 - Searching node attributes in Chef-Server Hadoop
  • bloomberg/chef-bcpc#163 - Set up ntp orphan mode and manycast enhancement Hadoop OpenStack
  • bloomberg/chef-bcpc#176 - Cookbook attribute definition inconsistency Hadoop
  • bloomberg/chef-bcpc#178 - Kafka broker meta data port Hadoop
  • bloomberg/chef-bcpc#183 - Hadoop config file changes and service restarts Hadoop
  • bloomberg/chef-bcpc#186 - Error in Zabbix web interface Hadoop
  • bloomberg/chef-bcpc#187 - Zabbix server - agent connection failure Hadoop
  • bloomberg/chef-bcpc#189 - Change in HDP 2.0 public repo Hadoop
  • bloomberg/chef-bcpc#194 - Provide Rolling Configuration Updates enhancement Hadoop
  • bloomberg/chef-bcpc#199 - oozie recipe fails in stop-oozie-for-war-setup Hadoop
  • bloomberg/chef-bcpc#207 - Hadoop configuration properties Hadoop
  • bloomberg/chef-bcpc#208 - Enable remote monitoring of Kafka JMX Hadoop
  • bloomberg/chef-bcpc#210 - Hardcoded Graphite web url port Hadoop
  • bloomberg/chef-bcpc#215 - Undefined variable being used bug Hadoop
  • bloomberg/chef-bcpc#217 - Error in cluster installation due to mysql error bug Hadoop OpenStack
  • bloomberg/chef-bcpc#218 - Missing "name" tags from metadata.rb bug Hadoop
  • bloomberg/chef-bcpc#222 - Zookeeper quorum is failing to start Hadoop
  • bloomberg/chef-bcpc#224 - Data node start up failure. Incorrect URI for directory location Hadoop
  • bloomberg/chef-bcpc#227 - Make Zookeeper and independent component Hadoop
  • bloomberg/chef-bcpc#228 - hbase thrift service creates conflicts with hbase master Hadoop
  • bloomberg/chef-bcpc#234 - HBase JMX port collision Hadoop
  • bloomberg/chef-bcpc#236 - HBase Master advertises wrong hostname in VM setup Hadoop upstream
  • bloomberg/chef-bcpc#245 - Split bcpc-hadoop::configs recipe enhancement Hadoop
  • bloomberg/chef-bcpc#267 - knife role from file roles/*.rb generates error Hadoop question
  • bloomberg/chef-bcpc#276 - cluster-assign-role doesn't work as expected for kafka install when single node is passed bug Hadoop
  • bloomberg/chef-bcpc#277 - Make setting up disks a dynamic process enhancement Hadoop
  • bloomberg/chef-bcpc#295 - Update zk_formatted to use znode_exists enhancement Hadoop
  • bloomberg/chef-bcpc#298 - Dynamically compute values for parameters enhancement Hadoop
  • bloomberg/chef-bcpc#299 - Disable zabbix agent process Hadoop
  • bloomberg/chef-bcpc#303 - Networking is restarted needlessly bug Hadoop
  • bloomberg/chef-bcpc#304 - HBase bits to heart-beat bidirectionally enhancement Hadoop
  • bloomberg/chef-bcpc#307 - Update get_mysql_nodes to use get_nodes_for enhancement Hadoop
  • bloomberg/chef-bcpc#309 - Improvement: Use attributes for port numbers Hadoop
  • bloomberg/chef-bcpc#310 - Improvement: Remove duplicate library functions Hadoop
  • bloomberg/chef-bcpc#311 - Improvement: Present JMX data on Graphite in a user friendly way Hadoop
  • bloomberg/chef-bcpc#313 - Cleanup Recipes for multiple components enhancement Hadoop
  • bloomberg/chef-bcpc#316 - broke get_zk_nodes in kafka-bcpc bug Hadoop #331 - PR
  • bloomberg/chef-bcpc#318 - qjournal nodes are independent of ZK nodes bug Hadoop
  • bloomberg/chef-bcpc#322 - Enhancement: Avoid using global and instance variables in library functions Hadoop
  • bloomberg/chef-bcpc#323 - Enhancement: Create library namespaces to avoid name collisions Hadoop
  • bloomberg/chef-bcpc#324 - Abort cluster creation if any of the downloads in the vbox_create fails Hadoop
  • bloomberg/chef-bcpc#326 - Need to change the PXE Boot ROM download URL details Hadoop OpenStack
  • bloomberg/chef-bcpc#329 - cluster-assign-roles.sh should save run_list and node object bug Hadoop
  • bloomberg/chef-bcpc#330 - Enhancement: Static port and recipe name in HDFSDir provider Hadoop
  • bloomberg/chef-bcpc#334 - Enhancement: Decouple the test cluster creation components and process enhancement Hadoop
  • bloomberg/chef-bcpc#335 - Enhancement: Add test cases to Hadoop Chef BCPC enhancement Hadoop
  • bloomberg/chef-bcpc#338 - Hive installation on Hadoop worker node incorrect Hadoop
  • bloomberg/chef-bcpc#339 - Duplicate values in BCPC-Hadoop roles enhancement Hadoop
  • bloomberg/chef-bcpc#340 - Hadoop test cluster creation doesn't install hadoop components on worker nodes Hadoop
  • bloomberg/chef-bcpc#342 - DDLs issued through HIVE shell is failing Hadoop
  • bloomberg/chef-bcpc#343 - zabbixapi 2.4.1 gem doesn't support zabbix 2.2.2 Hadoop
  • bloomberg/chef-bcpc#345 - Enhancement: Define and implement support for multiple versions of BCPC components Hadoop
  • bloomberg/chef-bcpc#346 - Need to handle ZK connection failures gracefully in "znode_exists?" lib function Hadoop
  • bloomberg/chef-bcpc#347 - Enhancement: enable graceful_shutdown of Kafka servers Hadoop
  • bloomberg/chef-bcpc#349 - Enhancement: Implement option to skip hadoop service restart coordination Hadoop
  • bloomberg/chef-bcpc#350 - Enhancement: Improve chef run performance by saving only the required attributes to server Hadoop

Pull Requests

  • bloomberg/chef-bcpc#135 - Install History Server Hadoop
  • bloomberg/chef-bcpc#154 - Fixed the URL for percona-xtrabackup_2.1.9-744-1.precise_amd64.deb Hadoop
  • bloomberg/chef-bcpc#157 - Run through fixes Hadoop
  • bloomberg/chef-bcpc#159 - Changed the order of hive and mapred roles for bcpc-vm2 Hadoop
  • bloomberg/chef-bcpc#161 - Changed the order of hive and mapred roles for bcpc-vm2 Hadoop
  • bloomberg/chef-bcpc#164 - Added code to create roles using ruby (*.rb) files. Hadoop
  • bloomberg/chef-bcpc#165 - Changed the order of hive and mapred roles for bcpc-vm2 Hadoop
  • bloomberg/chef-bcpc#170 - Additions/Modifications for Kafka installation Hadoop
  • bloomberg/chef-bcpc#175 - Changes to use JMXTrans for JMX stats collection Hadoop
  • bloomberg/chef-bcpc#177 - jmxtrans wrapper cookbook Hadoop
  • bloomberg/chef-bcpc#180 - Generalize jmxtrans component for use with chef bcpc Hadoop
  • bloomberg/chef-bcpc#181 - Fix for issue #178. Kafka broker meta data port has been added. Hadoop
  • bloomberg/chef-bcpc#182 - Changes to make sure Kafka servers starts and registers with ZK Hadoop
  • bloomberg/chef-bcpc#184 - Added a new method to check znode existence enhancement Hadoop
  • bloomberg/chef-bcpc#185 - Enable remote monitoring of Zookeeper jmx Hadoop
  • bloomberg/chef-bcpc#190 - Changes for zabbix upgrade and HDP repo URL changes Hadoop
  • bloomberg/chef-bcpc#192 - Changes to support Kafka 0.8.1.1 release enhancement Hadoop
  • bloomberg/chef-bcpc#193 - Changes to kafka roles to support 0.8.1.1 release enhancement Hadoop
  • bloomberg/chef-bcpc#196 - Added library method to search and return ZK nodes for Kafka installatio... enhancement Hadoop
  • bloomberg/chef-bcpc#197 - A new recipe is added to extend 'service[kafka]' resource Hadoop
  • bloomberg/chef-bcpc#198 - Added kafka_install method Hadoop
  • bloomberg/chef-bcpc#200 - Fix for issue #199 Hadoop
  • bloomberg/chef-bcpc#201 - Changes to Kafka roles to support 0.8.1.1 release Hadoop
  • bloomberg/chef-bcpc#204 - Miscellaneous Bug Fixes bug Hadoop
  • bloomberg/chef-bcpc#205 - Added if condition to return node object from memory Hadoop
  • bloomberg/chef-bcpc#211 - Fix for issue 210 Hadoop
  • bloomberg/chef-bcpc#213 - Changes to send data from Graphite to Zabbix enhancement Hadoop
  • bloomberg/chef-bcpc#216 - Fix for issue #215 Hadoop
  • bloomberg/chef-bcpc#219 - Added name tag to metadata.rb for ChefSpec to work bug Hadoop
  • bloomberg/chef-bcpc#220 - Fix for issue number 208 Hadoop
  • bloomberg/chef-bcpc#221 - Fix for issue 217 bug Hadoop OpenStack
  • bloomberg/chef-bcpc#223 - Fix for issue 222 Hadoop
  • bloomberg/chef-bcpc#225 - Fix for issue #224 Hadoop
  • bloomberg/chef-bcpc#226 - Changes to make zookeeper installation work with Kafka cluster. Hadoop
  • bloomberg/chef-bcpc#229 - Fix for issue 228 Hadoop
  • bloomberg/chef-bcpc#232 - Changes to roles/recipes for ZK setup for Kafka cluster Hadoop
  • bloomberg/chef-bcpc#235 - Fix for issue 234 Hadoop
  • bloomberg/chef-bcpc#237 - Hbase powerdns support enhancement Hadoop
  • bloomberg/chef-bcpc#238 - Collect HBase region server JMX stats enhancement Hadoop
  • bloomberg/chef-bcpc#241 - Changes to fix issue #240 bug Hadoop
  • bloomberg/chef-bcpc#242 - Fix for issue #239 Hadoop
  • bloomberg/chef-bcpc#244 - Changes to fix issue #243 Hadoop
  • bloomberg/chef-bcpc#246 - Split conifgs recipe to create separate configuration recipe for each component enhancement Hadoop
  • bloomberg/chef-bcpc#258 - Changes to collect jmx stats from hbase regionserver enhancement Hadoop
  • bloomberg/chef-bcpc#260 - Changes to call shell script when a Zabbix triggers occurs enhancement Hadoop
  • bloomberg/chef-bcpc#271 - Fix for issue 266 Hadoop
  • bloomberg/chef-bcpc#272 - Changes to define dependencies for Zabbix triggers enhancement Hadoop
  • bloomberg/chef-bcpc#273 - Small Fixes enhancement Hadoop
  • bloomberg/chef-bcpc#275 - Fix for issue 274 Hadoop
  • bloomberg/chef-bcpc#279 - Added zookeeper implementation recipe enhancement Hadoop
  • bloomberg/chef-bcpc#283 - Enable JMXStats to collect Zookeeper JMX data enhancement Hadoop
  • bloomberg/chef-bcpc#285 - Allow multiple users to admin bootstrap node enhancement Hadoop
  • bloomberg/chef-bcpc#287 - Fix for issue #286 enhancement Hadoop
  • bloomberg/chef-bcpc#288 - Changes to restart jmxtrans if any of the dependent processes is restarted bug Hadoop
  • bloomberg/chef-bcpc#292 - Added check for setting up correct environment variables Hadoop
  • bloomberg/chef-bcpc#297 - Fix for issue #296 Hadoop
  • bloomberg/chef-bcpc#301 - Changes to Zookeeper start-up script bug Hadoop upstream
  • bloomberg/chef-bcpc#306 - Fix for issue 305 enhancement Hadoop
  • bloomberg/chef-bcpc#314 - fix_issue_307 Hadoop
  • bloomberg/chef-bcpc#316 - Fix for req #308 & #307 bug Hadoop
  • bloomberg/chef-bcpc#320 - Fix for issue #315 and issue #317 Hadoop
  • bloomberg/chef-bcpc#325 - Fix for issue 324 Hadoop
  • bloomberg/chef-bcpc#328 - Configuration Cleanups Hadoop
  • bloomberg/chef-bcpc#332 - Modify kafka-bcpc::setattr to mock format introduced by PR #316 bug Hadoop
  • bloomberg/chef-bcpc#333 - Changes to perform rolling restart of HDFS datanode Hadoop
  • bloomberg/chef-bcpc#336 - Fix for issue #294 and other clean-ups related to ZK installation Hadoop
  • bloomberg/chef-bcpc#337 - Changes to allow Kafka set-up to be able to use existing Hadoop Zookeeper quorum Hadoop
  • bloomberg/chef-bcpc#341 - Fix for issue #339 Hadoop
  • bloomberg/chef-bcpc#344 - Change to use version 2.2.1 of zabbixapi gem Hadoop
  • bloomberg/chef-bcpc#348 - Fix for issue #340 bug Hadoop
  • bloomberg/chef-bcpc#351 - Changes to allow users to skip datanode restart coordination logic

While the above work will be wrapped up in place, future work will be tracked on this repository.

Dynamically configure bootstrap machine/host

In our current implementation, the bootstrap machine is created with values that are statically defined in the Vagrantfile or in the bootstrap machine's box image itself. Some of these values are:

  • Chef Environment Name
  • IP Address for the bootstrap machine
  • Host name for the bootstrap machine

Hard-coding these values makes it challenging when there is a requirement to build multiple bootstrap machines while working with multiple clusters.

One idea to automate this is to create a generic bootstrap box image file that has no configuration in it, and then use Vagrant to provision the bootstrap machine with the correct values defined in an environment file. The Vagrantfile should be able to do the following:

  • Parse an environment file that has information about the cluster to be built
  • Create and configure network interface on the bootstrap host
  • Upload correct environment to bootstrap host
  • Create the correct environment within chef-server running on the bootstrap host
  • Remove any stale bootstrap VM from the chef-server running on the bootstrap host
  • Create and configure bootstrap node with correct values

Hive Beeline doesn't use scratch directory defined in hive-site.xml

Currently we use Hive 0.12, and the behavior differs between the Hive CLI and Hive Beeline in regards to how the Hive scratch directory is created and used. The Hive CLI picks up the value defined by the hive.exec.scratchdir parameter in hive-site.xml, whereas Beeline uses a static value of /tmp/hive-hive. The static nature of Beeline causes issues, as /tmp/hive-hive is created automatically and is owned by whichever user executed the very first query, preventing write access for all other Beeline users. I filed a case with Hortonworks, and per them this behavior is fixed in Hive 0.14.
To fix this issue in Hive 0.12:

  • Create a top level directory /tmp/hive-hive
  • Change the owner to hive:hive and make it world read/writable. Users submitting their queries through Beeline will still get a subdirectory under the top-level directory, owned by them, in the form hive_yyyy_mm_dd_hh_mm_ss_999_999999999. These subdirectories are removed automatically, and multiple users won't be able to view each other's temporary output. (A sketch of these steps follows below.)
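A hedged sketch of the workaround above, run as the HDFS superuser (the mode shown is an assumption matching "world read/writable"):

      HADOOP_USER_NAME=hdfs hdfs dfs -mkdir -p /tmp/hive-hive
      HADOOP_USER_NAME=hdfs hdfs dfs -chown hive:hive /tmp/hive-hive
      HADOOP_USER_NAME=hdfs hdfs dfs -chmod 777 /tmp/hive-hive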

Note: Beeline uses HiveServer2, and HiveServer2 has a bug as per HIVE-6847.

Restarting jmxtrans takes a long time and is done often

It seems that we restart jmxtrans quite often when updating configuration files. Instead of restarting jmxtrans once if any service restarted, we seem to restart it once per service. I suspect we need to set the service resource name to be the same at bcpc_jmxtrans::default.rb#L74-L79, as opposed to service-specific.

We could use a log resource to report which services we are bouncing jmxtrans for, to still give that insight. And we could create a block of subscribes and guard conditions for a single service resource.

I also question whether we are restarting jmxtrans in cases where we should not need to.

encoding issue when running auto-install test vms script on Mac OS

This line in automated_install.sh errors when run on Mac OS:

[[ -n "$PROXY" ]] && $SEDINPLACE "s#\(\"bootstrap\": {\)#\1\n\"proxy\" : \"http://$PROXY\",\n#" environments/${ENVIRONMENT}.json

It executed and translated to "bootstrap:n"proxy" : "http://proxy....",n in the Test-Laptop JSON file, which results in an error. This is because Mac OS ships the BSD version of sed, which differs from GNU sed.
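A minimal sketch of a portability shim (not the script's actual fix): select the in-place sed invocation by platform, since BSD sed requires a backup-suffix argument after -i and treats \n in a replacement as a literal n:

      if [[ "$(uname -s)" = "Darwin" ]]; then
        # BSD sed: a (possibly empty) backup suffix is mandatory after -i
        SEDINPLACE=(sed -i '' -e)
      else
        SEDINPLACE=(sed -i -e)
      fi
      # When inserting multi-line text, pass a real newline (e.g. via $'\n')
      # rather than relying on \n in the replacement. Placeholder expression:
      "${SEDINPLACE[@]}" "s/example/sample/" "environments/${ENVIRONMENT}.json"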

Chef Handler to Gather Metrics

Since we heavily rely on Chef to push configuration changes and to keep our systems in "policy", a metrics framework to gather that information is certainly needed. There is the Chef Handler framework to provide this information from a Chef run. Further, there is an interesting gem to send that data to Graphite. And of course, someone made a cookbook to install and configure everything.

Now for the question: how does it all actually work out?

Cluster Installation Fails while starting MySQL Service

The MySQL database service doesn't come up while building a new cluster. Initial investigation shows that there is a change in the version of the Percona software now being pulled from the upstream repository. It looks like the new version requires some new configuration. Further investigation is needed to figure out which configuration is now required to support the latest version of Percona.

bcpc_jmxtrans::default returns invalid date

On a number of physical clusters we are seeing:

  * service[restart jmxtrans on dependent service manual restart] action restart
================================================================================
Error executing action `restart` on resource 'service[restart jmxtrans on dependent service manual restart]'
================================================================================


ArgumentError
-------------
invalid date


Cookbook Trace:
---------------
/var/chef/cache/cookbooks/bcpc_jmxtrans/libraries/utils.rb:50:in `parse'
/var/chef/cache/cookbooks/bcpc_jmxtrans/libraries/utils.rb:50:in `block (2 levels) in process_require_restart?'
/var/chef/cache/cookbooks/bcpc_jmxtrans/libraries/utils.rb:48:in `each'
/var/chef/cache/cookbooks/bcpc_jmxtrans/libraries/utils.rb:48:in `block in process_require_restart?'
/var/chef/cache/cookbooks/bcpc_jmxtrans/libraries/utils.rb:44:in `each'
/var/chef/cache/cookbooks/bcpc_jmxtrans/libraries/utils.rb:44:in `process_require_restart?'
/var/chef/cache/cookbooks/bcpc_jmxtrans/recipes/default.rb:88:in `block (2 levels) in from_file'

Hive Services (Metastore, HiveServer2) don't come up after system restart

The Hive metastore and hiveserver2 services won't come up if the machine on which they are installed is restarted. The reason is that the init scripts for both services fail to create the /var/run/hive directory before attempting to start the service, so the processes cannot create their pid files under /var/run/hive. In the current implementation, the hive-server2 init script is created using this template, and the hive-metastore init script is created when the hive-metastore package is installed from the HDP 2.0 repository. To resolve this issue we need to do the following:

  • Add directory creation logic to hive-server2 template
  • Create a new template for hive-metastore init script and add directory creation logic to it.
  • Remove installation of hive-metastore package from bcpc-hadoop::hive_metastore

hive-site.xml is missing hive.exec.scratchdir parameter

In the current implementation, hive-site.xml is missing the hive.exec.scratchdir parameter. When a query is executed over ODBC/JDBC against a hiveserver2 process, temporary results are stored in the directory specified by the hive.exec.scratchdir parameter. Without this parameter in the configuration file, all temporary output is redirected to /tmp/hive-{process owner} on HDFS, and the user executing the query gets a permission-denied error. Specifying hive.exec.scratchdir and pointing it to "/tmp" on HDFS will fix the issue.
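A hedged sketch of the corresponding hive-site.xml fragment (the value follows the "/tmp" suggestion above):

      <property>
        <name>hive.exec.scratchdir</name>
        <value>/tmp</value>
      </property>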

Add external Json SerDe jar file to Hive Installation

The Hive installation comes with a default SerDe jar that includes a JsonSerDe class for working with JSON data. However, during testing I found that the default class doesn't work when querying the data and throws an "Unable to de-serialize" exception. As per the Hive wiki, we can use an external SerDe jar to work with JSON data.

Past History

This codebase started life as a branch on the repo Chef-BCPC and as such there is significant development tracked on that repository. Particularly tagged for the Hadoop work has been:

Issues

  • bloomberg/chef-bcpc#83 - On a mac, sed -i cmd file should be sed -i -e cmd file Hadoop
  • bloomberg/chef-bcpc#84 - Mac bootstrap VM create warning Hadoop
  • bloomberg/chef-bcpc#85 - Warning message: ERROR: RuntimeError: Please set EDITOR environment variable Hadoop
  • bloomberg/chef-bcpc#96 - Pip Breaks Behind MITM Proxies Hadoop upstream
  • bloomberg/chef-bcpc#126 - automated_install.sh script error on Mac Hadoop
  • bloomberg/chef-bcpc#127 - automated_install.sh script error on Mac due to difference in "nc" syntax Hadoop
  • bloomberg/chef-bcpc#128 - Percona is Breaking Build Hadoop
  • bloomberg/chef-bcpc#129 - Error in hive-site.xml format causing issue with using Hive Hadoop
  • bloomberg/chef-bcpc#130 - Cookbook naming causing issues with LWRP bug Hadoop upstream
  • bloomberg/chef-bcpc#133 - Bringing up hbase-master on VM2 before VM3 is ready Hadoop
  • bloomberg/chef-bcpc#136 - Graphite tables are not created Hadoop
  • bloomberg/chef-bcpc#144 - Sending data to Graphite on vip Hadoop
  • bloomberg/chef-bcpc#153 - Port values of graphite not persisted as node attributes Hadoop
  • bloomberg/chef-bcpc#162 - Searching node attributes in Chef-Server Hadoop
  • bloomberg/chef-bcpc#163 - Set up ntp orphan mode and manycast enhancement Hadoop OpenStack
  • bloomberg/chef-bcpc#176 - Cookbook attribute definition inconsistency Hadoop
  • bloomberg/chef-bcpc#178 - Kafka broker meta data port Hadoop
  • bloomberg/chef-bcpc#183 - Hadoop config file changes and service restarts Hadoop
  • bloomberg/chef-bcpc#186 - Error in Zabbix web interface Hadoop
  • bloomberg/chef-bcpc#187 - Zabbix server - agent connection failure Hadoop
  • bloomberg/chef-bcpc#189 - Change in HDP 2.0 public repo Hadoop
  • bloomberg/chef-bcpc#194 - Provide Rolling Configuration Updates enhancement Hadoop
  • bloomberg/chef-bcpc#199 - oozie recipe fails in stop-oozie-for-war-setup Hadoop
  • bloomberg/chef-bcpc#207 - Hadoop configuration properties Hadoop
  • bloomberg/chef-bcpc#208 - Enable remote monitoring of Kafka JMX Hadoop
  • bloomberg/chef-bcpc#210 - Hardcoded Graphite web url port Hadoop
  • bloomberg/chef-bcpc#215 - Undefined variable being used bug Hadoop
  • bloomberg/chef-bcpc#217 - Error in cluster installation due to mysql error bug Hadoop OpenStack
  • bloomberg/chef-bcpc#218 - Missing "name" tags from metadata.rb bug Hadoop
  • bloomberg/chef-bcpc#222 - Zookeeper quorum is failing to start Hadoop
  • bloomberg/chef-bcpc#224 - Data node start up failure. Incorrect URI for directory location Hadoop
  • bloomberg/chef-bcpc#227 - Make Zookeeper and independent component Hadoop
  • bloomberg/chef-bcpc#228 - hbase thrift service creates conflicts with hbase master Hadoop
  • bloomberg/chef-bcpc#234 - HBase JMX port collision Hadoop
  • bloomberg/chef-bcpc#236 - HBase Master advertises wrong hostname in VM setup Hadoop upstream
  • bloomberg/chef-bcpc#245 - Split bcpc-hadoop::configs recipe enhancement Hadoop
  • bloomberg/chef-bcpc#267 - knife role from file roles/*.rb generates error Hadoop question
  • bloomberg/chef-bcpc#276 - cluster-assign-role doesn't work as expected for kafka install when single node is passed bug Hadoop
  • bloomberg/chef-bcpc#277 - Make setting up disks a dynamic process enhancement Hadoop
  • bloomberg/chef-bcpc#295 - Update zk_formatted to use znode_exists enhancement Hadoop
  • bloomberg/chef-bcpc#298 - Dynamically compute values for parameters enhancement Hadoop
  • bloomberg/chef-bcpc#299 - Disable zabbix agent process Hadoop
  • bloomberg/chef-bcpc#303 - Networking is restarted needlessly bug Hadoop
  • bloomberg/chef-bcpc#304 - HBase bits to heart-beat bidirectionally enhancement Hadoop
  • bloomberg/chef-bcpc#307 - Update get_mysql_nodes to use get_nodes_for enhancement Hadoop
  • bloomberg/chef-bcpc#309 - Improvement: Use attributes for port numbers Hadoop
  • bloomberg/chef-bcpc#310 - Improvement: Remove duplicate library functions Hadoop
  • bloomberg/chef-bcpc#311 - Improvement: Present JMX data on Graphite in a user friendly way Hadoop
  • bloomberg/chef-bcpc#313 - Cleanup Recipes for multiple components enhancement Hadoop
  • bloomberg/chef-bcpc#316 - broke get_zk_nodes in kafka-bcpc bug Hadoop #331 - PR
  • bloomberg/chef-bcpc#318 - qjournal nodes are independent of ZK nodes bug Hadoop
  • bloomberg/chef-bcpc#322 - Enhancement: Avoid using global and instance variables in library functions Hadoop
  • bloomberg/chef-bcpc#323 - Enhancement: Create library namespaces to avoid name collisions Hadoop
  • bloomberg/chef-bcpc#324 - Abort cluster creation if any of the downloads in the vbox_create fails Hadoop
  • bloomberg/chef-bcpc#326 - Need to change the PXE Boot ROM download URL details Hadoop OpenStack
  • bloomberg/chef-bcpc#329 - cluster-assign-roles.sh should save run_list and node object bug Hadoop
  • bloomberg/chef-bcpc#330 - Enhancement: Static port and recipe name in HDFSDir provider Hadoop
  • bloomberg/chef-bcpc#334 - Enhancement: Decouple the test cluster creation components and process enhancement Hadoop
  • bloomberg/chef-bcpc#335 - Enhancement: Add test cases to Hadoop Chef BCPC enhancement Hadoop
  • bloomberg/chef-bcpc#338 - Hive installation on Hadoop worker node incorrect Hadoop
  • bloomberg/chef-bcpc#339 - Duplicate values in BCPC-Hadoop roles enhancement Hadoop
  • bloomberg/chef-bcpc#340 - Hadoop test cluster creation doesn't install hadoop components on worker nodes Hadoop
  • bloomberg/chef-bcpc#342 - DDLs issued through HIVE shell is failing Hadoop
  • bloomberg/chef-bcpc#343 - zabbixapi 2.4.1 gem doesn't support zabbix 2.2.2 Hadoop
  • bloomberg/chef-bcpc#345 - Enhancement: Define and implement support for multiple versions of BCPC components Hadoop
  • bloomberg/chef-bcpc#346 - Need to handle ZK connection failures gracefully in "znode_exists?" lib function Hadoop
  • bloomberg/chef-bcpc#347 - Enhancement: enable graceful_shutdown of Kafka servers Hadoop
  • bloomberg/chef-bcpc#349 - Enhancement: Implement option to skip hadoop service restart coordination Hadoop
  • bloomberg/chef-bcpc#350 - Enhancement: Improve chef run performance by saving only the required attributes to server Hadoop

Pull Requests

  • bloomberg/chef-bcpc#135 - Install History Server Hadoop
  • bloomberg/chef-bcpc#154 - Fixed the URL for percona-xtrabackup_2.1.9-744-1.precise_amd64.deb Hadoop
  • bloomberg/chef-bcpc#157 - Run through fixes Hadoop
  • bloomberg/chef-bcpc#159 - Changed the order of hive and mapred roles for bcpc-vm2 Hadoop
  • bloomberg/chef-bcpc#161 - Changed the order of hive and mapred roles for bcpc-vm2 Hadoop
  • bloomberg/chef-bcpc#164 - Added code to create roles using ruby (*.rb) files. Hadoop
  • bloomberg/chef-bcpc#165 - Changed the order of hive and mapred roles for bcpc-vm2 Hadoop
  • bloomberg/chef-bcpc#170 - Additions/Modifications for Kafka installation Hadoop
  • bloomberg/chef-bcpc#175 - Changes to use JMXTrans for JMX stats collection Hadoop
  • bloomberg/chef-bcpc#177 - jmxtrans wrapper cookbook Hadoop
  • bloomberg/chef-bcpc#180 - Generalize jmxtrans component for use with chef bcpc Hadoop
  • bloomberg/chef-bcpc#181 - Fix for issue #178. Kafka broker meta data port has been added. Hadoop
  • bloomberg/chef-bcpc#182 - Changes to make sure Kafka servers starts and registers with ZK Hadoop
  • bloomberg/chef-bcpc#184 - Added a new method to check znode existence enhancement Hadoop
  • bloomberg/chef-bcpc#185 - Enable remote monitoring of Zookeeper jmx Hadoop
  • bloomberg/chef-bcpc#190 - Changes for zabbix upgrade and HDP repo URL changes Hadoop
  • bloomberg/chef-bcpc#192 - Changes to support Kafka 0.8.1.1 release enhancement Hadoop
  • bloomberg/chef-bcpc#193 - Changes to kafka roles to support 0.8.1.1 release enhancement Hadoop
  • bloomberg/chef-bcpc#196 - Added library method to search and return ZK nodes for Kafka installatio... enhancement Hadoop
  • bloomberg/chef-bcpc#197 - A new recipe is added to extend 'service[kafka]' resource Hadoop
  • bloomberg/chef-bcpc#198 - Added kafka_install method Hadoop
  • bloomberg/chef-bcpc#200 - Fix for issue #199 Hadoop
  • bloomberg/chef-bcpc#201 - Changes to Kafka roles to support 0.8.1.1 release Hadoop
  • bloomberg/chef-bcpc#204 - Miscellaneous Bug Fixes bug Hadoop
  • bloomberg/chef-bcpc#205 - Added if condition to return node object from memory Hadoop
  • bloomberg/chef-bcpc#211 - Fix for issue 210 Hadoop
  • bloomberg/chef-bcpc#213 - Changes to send data from Graphite to Zabbix enhancement Hadoop
  • bloomberg/chef-bcpc#216 - Fix for issue #215 Hadoop
  • bloomberg/chef-bcpc#219 - Added name tag to metadata.rb for ChefSpec to work bug Hadoop
  • bloomberg/chef-bcpc#220 - Fix for issue number 208 Hadoop
  • bloomberg/chef-bcpc#221 - Fix for issue 217 bug Hadoop OpenStack
  • bloomberg/chef-bcpc#223 - Fix for issue 222 Hadoop
  • bloomberg/chef-bcpc#225 - Fix for issue #224 Hadoop
  • bloomberg/chef-bcpc#226 - Changes to make zookeeper installation work with Kafka cluster. Hadoop
  • bloomberg/chef-bcpc#229 - Fix for issue 228 Hadoop
  • bloomberg/chef-bcpc#232 - Changes to roles/recipes for ZK setup for Kafka cluster Hadoop
  • bloomberg/chef-bcpc#235 - Fix for issue 234 Hadoop
  • bloomberg/chef-bcpc#237 - Hbase powerdns support enhancement Hadoop
  • bloomberg/chef-bcpc#238 - Collect HBase region server JMX stats enhancement Hadoop
  • bloomberg/chef-bcpc#241 - Changes to fix issue #240 bug Hadoop
  • bloomberg/chef-bcpc#242 - Fix for issue #239 Hadoop
  • bloomberg/chef-bcpc#244 - Changes to fix issue #243 Hadoop
  • bloomberg/chef-bcpc#246 - Split configs recipe to create separate configuration recipe for each component enhancement Hadoop
  • bloomberg/chef-bcpc#258 - Changes to collect jmx stats from hbase regionserver enhancement Hadoop
  • bloomberg/chef-bcpc#260 - Changes to call shell script when a Zabbix triggers occurs enhancement Hadoop
  • bloomberg/chef-bcpc#271 - Fix for issue 266 Hadoop
  • bloomberg/chef-bcpc#272 - Changes to define dependencies for Zabbix triggers enhancement Hadoop
  • bloomberg/chef-bcpc#273 - Small Fixes enhancement Hadoop
  • bloomberg/chef-bcpc#275 - Fix for issue 274 Hadoop
  • bloomberg/chef-bcpc#279 - Added zookeeper implementation recipe enhancement Hadoop
  • bloomberg/chef-bcpc#283 - Enable JMXStats to collect Zookeeper JMX data enhancement Hadoop
  • bloomberg/chef-bcpc#285 - Allow multiple users to admin bootstrap node enhancement Hadoop
  • bloomberg/chef-bcpc#287 - Fix for issue #286 enhancement Hadoop
  • bloomberg/chef-bcpc#288 - Changes to restart jmxtrans if any of the dependent processes is restarted bug Hadoop
  • bloomberg/chef-bcpc#292 - Added check for setting up correct environment variables Hadoop
  • bloomberg/chef-bcpc#297 - Fix for issue #296 Hadoop
  • bloomberg/chef-bcpc#301 - Changes to Zookeeper start-up script bug Hadoop upstream
  • bloomberg/chef-bcpc#306 - Fix for issue 305 enhancement Hadoop
  • bloomberg/chef-bcpc#314 - fix_issue_307 Hadoop
  • bloomberg/chef-bcpc#316 - Fix for req #308 & #307 bug Hadoop
  • bloomberg/chef-bcpc#320 - Fix for issue #315 and issue #317 Hadoop
  • bloomberg/chef-bcpc#325 - Fix for issue 324 Hadoop
  • bloomberg/chef-bcpc#328 - Configuration Cleanups Hadoop
  • bloomberg/chef-bcpc#332 - Modify kafka-bcpc::setattr to mock format introduced by PR #316 bug Hadoop
  • bloomberg/chef-bcpc#333 - Changes to perform rolling restart of HDFS datanode Hadoop
  • bloomberg/chef-bcpc#336 - Fix for issue #294 and other clean-ups related to ZK installation Hadoop
  • bloomberg/chef-bcpc#337 - Changes to allow Kafka set-up to be able to use existing Hadoop Zookeeper quorum Hadoop
  • bloomberg/chef-bcpc#341 - Fix for issue #339 Hadoop
  • bloomberg/chef-bcpc#344 - Change to use version 2.2.1 of zabbixapi gem Hadoop
  • bloomberg/chef-bcpc#348 - Fix for issue #340 bug Hadoop

While the above work will be wrapped up in place, future work will be tracked in this repository.

Generic bootstrap does not regenerate UFW rules to allow TFTP access

If one sets up a VM bootstrap image and then uses Vagrant.baremetal to re-provision, the UFW rules are not initially updated. The rules from ufw.rb are not updated on the re-provision but seem to update on subsequent runs. Disabling the ufw service and re-chefing the bootstrap seems to add the new IP for the TFTP rule (though the old IP is left in as well).

If TFTP boot is failing, check iptables -L for the TFTP entry under the ufw-user-input chain and ensure the IP address listed is for the correct machine.
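A hedged sketch of an idempotent guard that could keep the rule current on re-provision; the attribute path node['bcpc']['bootstrap']['server'] is a hypothetical name for illustration, not necessarily what ufw.rb actually uses:

bootstrap_ip = node['bcpc']['bootstrap']['server'] # hypothetical attribute path

execute 'ufw-allow-tftp' do
  # Re-add the TFTP (udp/69) rule whenever iptables does not already
  # allow it for the current bootstrap IP.
  command "ufw allow proto udp from any to #{bootstrap_ip} port 69"
  not_if "iptables -L ufw-user-input -n | grep 'dpt:69' | grep -q '#{bootstrap_ip}'"
end

A guard like this would also leave stale rules for old IPs in place; a fuller fix would delete any TFTP rule whose destination no longer matches the bootstrap.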

Chef is grep'ing for HBase services

It appears our hbase-master service resource falls back to walking the process list for the regular expression /hbase-master/, which means that should a user be running a process like:

user1 14724 13749  0 Dec16 pts/4    00:00:00 tail -f hbase-hbase-master-cluster1-r1n8.log

Then the Chef run will frustratingly not start the service should it be down; running Chef with debug logging yields:

[2014-12-17T17:57:52-05:00] INFO: Processing service[hbase-master] action enable (bcpc-hadoop::hbase_master line 68)
[2014-12-17T17:57:52-05:00] DEBUG: Platform ubuntu version 12.04 found
[2014-12-17T17:57:52-05:00] DEBUG: service[hbase-master] falling back to process table inspection
[2014-12-17T17:57:52-05:00] DEBUG: service[hbase-master] attempting to match 'hbase-master' (/hbase-master/) against process list
[2014-12-17T17:57:52-05:00] DEBUG: service[hbase-master] running: true
[2014-12-17T17:57:52-05:00] DEBUG: service[hbase-master] runlevel 0, action stop, priority 20
[2014-12-17T17:57:52-05:00] DEBUG: service[hbase-master] runlevel 1, action stop, priority 20
[2014-12-17T17:57:52-05:00] DEBUG: service[hbase-master] runlevel 2, action start, priority 20
[2014-12-17T17:57:52-05:00] DEBUG: service[hbase-master] runlevel 3, action start, priority 20
[2014-12-17T17:57:52-05:00] DEBUG: service[hbase-master] runlevel 4, action start, priority 20
[2014-12-17T17:57:52-05:00] DEBUG: service[hbase-master] runlevel 5, action start, priority 20
[2014-12-17T17:57:52-05:00] DEBUG: service[hbase-master] runlevel 6, action stop, priority 20
[2014-12-17T17:57:52-05:00] DEBUG: service[hbase-master] already enabled - nothing to do
[2014-12-17T17:57:52-05:00] INFO: Processing service[hbase-master] action start (bcpc-hadoop::hbase_master line 68)
[2014-12-17T17:57:52-05:00] DEBUG: Platform ubuntu version 12.04 found
[2014-12-17T17:57:52-05:00] DEBUG: service[hbase-master] falling back to process table inspection
[2014-12-17T17:57:52-05:00] DEBUG: service[hbase-master] attempting to match 'hbase-master' (/hbase-master/) against process list
[2014-12-17T17:57:52-05:00] DEBUG: service[hbase-master] running: true
[2014-12-17T17:57:52-05:00] DEBUG: service[hbase-master] runlevel 0, action stop, priority 20
[2014-12-17T17:57:52-05:00] DEBUG: service[hbase-master] runlevel 1, action stop, priority 20
[2014-12-17T17:57:52-05:00] DEBUG: service[hbase-master] runlevel 2, action start, priority 20
[2014-12-17T17:57:52-05:00] DEBUG: service[hbase-master] runlevel 3, action start, priority 20
[2014-12-17T17:57:52-05:00] DEBUG: service[hbase-master] runlevel 4, action start, priority 20
[2014-12-17T17:57:52-05:00] DEBUG: service[hbase-master] runlevel 5, action start, priority 20
[2014-12-17T17:57:52-05:00] DEBUG: service[hbase-master] runlevel 6, action stop, priority 20
[2014-12-17T17:57:52-05:00] DEBUG: service[hbase-master] already running - nothing to do

We need to set supports :status => true (or supply an explicit status_command) so that Chef queries the init script rather than grepping the process table; perhaps we can even set that as the default for our cookbook?
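A minimal sketch of the fix, assuming the packaged init script ships a working status verb; the resource name mirrors bcpc-hadoop::hbase_master, and the anchored pattern is an illustrative assumption:

service 'hbase-master' do
  # Tell Chef the init script has a real status verb so it never
  # falls back to grepping the process table.
  supports :status => true, :restart => true
  # Alternatively, anchor the pattern to the actual JVM command line so a
  # stray "tail -f hbase-hbase-master-...log" can no longer match.
  pattern 'org\.apache\.hadoop\.hbase\.master\.HMaster'
  action [:enable, :start]
end

With supports :status => true, the debug log above would show Chef shelling out to the init script's status action instead of "falling back to process table inspection".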

Disable IPv6 for Hadoop Clusters at OS level

We used to have code for disabling IPv6 at the OS level; it was removed in a commit on the assumption that the Java property java.net.preferIPv4Stack=true would take care of using only IPv4. While building a cluster, we found that the Java property did not behave as expected and caused networking issues while running MapReduce jobs. The issue was resolved by disabling IPv6 at the OS level again.
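A minimal sketch of re-disabling IPv6 from a recipe, using the standard Linux sysctl keys; the drop-in file name is an illustrative choice, not part of the existing cookbooks:

ipv6_keys = %w(
  net.ipv6.conf.all.disable_ipv6
  net.ipv6.conf.default.disable_ipv6
  net.ipv6.conf.lo.disable_ipv6
)

# Persist the settings across reboots via a sysctl drop-in file.
file '/etc/sysctl.d/99-disable-ipv6.conf' do
  content ipv6_keys.map { |k| "#{k} = 1" }.join("\n") + "\n"
  mode '0644'
  notifies :run, 'execute[apply-ipv6-sysctl]', :immediately
end

# Apply the settings to the running kernel.
execute 'apply-ipv6-sysctl' do
  command 'sysctl -p /etc/sysctl.d/99-disable-ipv6.conf'
  action :nothing
end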
