cdsw_install's Issues
Ensure ulimit is set to 1048576
Saw this in the cdsw init output:
WARNING: Cloudera Data Science Workbench recommends that all users have a max-open-files limit set to 1048576.
It is currently set to [32768] as per 'ulimit -n'
I think I need to add lines thus:
cat >>/etc/security/limits.conf <<EOF
* soft nofile 1048576
* hard nofile 1048576
EOF
And also set both hard and soft limits in the current shell (new logins will pick up limits.conf):
ulimit -n 1048576
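A quick check that both changes took effect (note that limits.conf is only read at login, so a fresh session is needed for the persistent setting):
ulimit -Sn; ulimit -Hn                   # soft and hard limits in the current shell
grep nofile /etc/security/limits.conf    # confirm the persisted entries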
Azure SECRETS not well documented
I was asked what the SECRETS file should contain for Azure - need to make sure I've documented that properly.
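Until that documentation exists, here is a purely hypothetical sketch of what the file might carry, assuming Director authenticates with an Azure service principal; the variable names and format are guesses, not what this project actually reads:
# Hypothetical Azure SECRETS content - every name and value here is a placeholder
cat > SECRETS <<'EOF'
SUBSCRIPTION_ID=00000000-0000-0000-0000-000000000000
TENANT_ID=00000000-0000-0000-0000-000000000000
CLIENT_ID=00000000-0000-0000-0000-000000000000
CLIENT_SECRET=replace-me
EOF
chmod 600 SECRETS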
Need to create the hdfs cdsw directory
The cdsw workshop expects the /user/cdsw directory to have been created in hdfs - currently this is not done, nor is it documented anywhere ...
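Something along these lines should cover it, assuming the workshop user is literally cdsw and the commands run on a host with an HDFS gateway role:
sudo -u hdfs hdfs dfs -mkdir -p /user/cdsw
sudo -u hdfs hdfs dfs -chown cdsw:cdsw /user/cdsw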
clean up cdsw disks - single disk only needed for 1.0+
migrate to CDH 5.12/Director 2.5
Update the documentation to make it easier for people to understand this project
Few people have just cloned this project and got it to work. I believe that's because the README is too dense - they just gloss over it and never get to grips with what they've got to do.
I propose dramatically shortening the README and replacing it with a wiki page that describes the steps etc. in a more easily consumed fashion.
link to how to make a Director instance using the cloud-lab scripts
Update build structure to ensure stability and manageability
I think I should be much more prescriptive in the build to make it easier for people to use.
In particular I should fix the exact versions of CDH and other parcels, and then tag these releases. That way one can check out a specific tag and know that it worked with that specific release of CDH, CDSW, Anaconda, Spark etc. etc.
Upgrade to CDH 5.13.1, 2.6.1
Port parcel build to google
Change the conf files to property files and note in the readme that these are to be changed
Remove reverse lookup from AWS bind server
There might be no need to provide reverse lookup capabilities using the local bind server in AWS - simply handling the cdsw.cdh-cluster.internal domain should be sufficient, provided that's a CNAME to the internal machines.
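As a rough sketch of the forward-only records that would be needed (the zone file path and target hostname are assumptions; CDSW also wants a wildcard entry so its per-session subdomains resolve):
cat >> /etc/named/cdh-cluster.internal.zone <<'EOF'
cdsw      IN CNAME  gateway.cdh-cluster.internal.
*.cdsw    IN CNAME  gateway.cdh-cluster.internal.
EOF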
Add parcel install for AWS
Update to cdsw 1.0.1
Modify repo to use baseurl of 1.0.1
[cloudera-cdsw]
# Packages for Cloudera's Distribution for data science workbench, Version 1, on RedHat or CentOS 7 x86_64
name=Cloudera's Distribution for cdsw, Version 1
# old baseurl
# baseurl=https://archive.cloudera.com/cdsw/1/redhat/7/x86_64/cdsw/1/
baseurl=https://archive.cloudera.com/cdsw/1/redhat/7/x86_64/cdsw/1.0.1/
gpgkey=https://archive.cloudera.com/cdsw/1/redhat/7/x86_64/cdsw/RPM-GPG-KEY-cloudera
gpgcheck=1
Probably something like:
sed -i 's|/1/$|/1.0.1/|' ....
bootstrapScript is deprecated. Please switch to bootstrapScripts
rpcbind needs to be stopped and started
creating multiple CDSW Clusters for different environments DEV / STG / PRD
Dear Toby,
I would like to investigate with you the possibility of adding an ENVIRONMENT variable for deployment.
Indeed, it is quite common to deploy DEV / STG / PRD environments, which differ in only three parameters: instance prefix (common.conf & aws/instance.conf), instance type (aws/instance.conf) and instance count (common.conf).
My proposal, in multiple steps:
1°) Would it be possible to create a new instance type for CM in aws/instance.conf? Then I would be able to centralise the instance prefix and type in the same file.
2°) Would it be possible to add an environment variable in provider.properties? For example: ENVIRONMENT=DEV, and update the prefix with --cdsw-${name}, so that we do not modify the prefix under aws/instance.conf, but only the instance types.
3°) A way to parameterise the number of instances per environment ... any suggestions? (One possible approach is sketched below.)
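Purely as an illustration, a small wrapper script could map ENVIRONMENT onto per-environment values and substitute them into a template conf before invoking Director. The template name, placeholder tokens, counts and instance types below are all assumptions, not anything the repo provides today:
#!/usr/bin/env bash
# Illustrative only: derive per-environment settings from a single ENVIRONMENT value
# and substitute them into a template before handing the result to Director.
set -euo pipefail
ENVIRONMENT="${1:?usage: $0 DEV|STG|PRD}"
case "${ENVIRONMENT}" in
  DEV) WORKER_COUNT=2; WORKER_TYPE=m4.xlarge ;;
  STG) WORKER_COUNT=3; WORKER_TYPE=m4.2xlarge ;;
  PRD) WORKER_COUNT=5; WORKER_TYPE=m4.4xlarge ;;
  *)   echo "unknown environment: ${ENVIRONMENT}" >&2; exit 1 ;;
esac
sed -e "s/@ENVIRONMENT@/${ENVIRONMENT}/g" \
    -e "s/@WORKER_COUNT@/${WORKER_COUNT}/g" \
    -e "s/@WORKER_TYPE@/${WORKER_TYPE}/g" \
    aws.conf.template > "aws-${ENVIRONMENT}.conf"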
BR
OWNER is duplicated in 2 different .properties files
Hi Toby,
I updated the .properties files in the AWS directory and noticed that the OWNER global variable is duplicated in provider.properties:
grep OWNER aws/*
aws/instances.conf: owner: ${OWNER}
aws/owner_tag.properties:OWNER=aheib
aws/provider.properties:OWNER=aheib
BR
Consolidate the multiple provider specific files into a smaller number.
At least one set of users found it confusing to have multiple .properties and .conf files for each cloud provider.
Perhaps it would be easier to focus on providing all of the user settable information in a single .properties file.
Of course it would be nice to have a set of sensible defaults, and only have the user need to worry about specific values they must change. Maybe just divide the properties file into two sections, mandatory and default.
The goal is to make this as low-touch as possible so that a user can create a simple cluster with little effort; minimizing configuration to the essentials is therefore important.
Another thing that I want to do is to make it clear which bits of the system are not going to be put into git. A GITIGNORED directory might be the way to go, and then put the SECRETS and the ssh private keys in there.
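A minimal sketch of how that could look, assuming the secrets currently live at the repo root (the paths are assumptions):
mkdir -p GITIGNORED
printf 'GITIGNORED/\n' >> .gitignore
mv SECRETS GITIGNORED/                      # the SECRETS file referenced elsewhere in this repo
mv *.pem GITIGNORED/ 2>/dev/null || true    # any ssh private keys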
cdsw can fail to mount nfs
Sometimes cdsw fails to mount nfs properly.
The solution is to ensure that the rpcbind service is started:
systemctl start rpcbind
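To stop this recurring after a reboot it is probably worth enabling the service as well (assuming systemd, as on the RHEL/CentOS 7 hosts this project targets):
systemctl enable rpcbind
systemctl start rpcbind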
GCP - make the workers stoppable
With the current release the workers have Local SSD Scratch Disks and they cannot be stopped. This means that we have to delete and then recreate a cluster, which is painful.
gcp/instances.conf contains:
worker : ${common-instance-properties} {
type: n1-highmem-4
instanceNamePrefix: worker-${name}
dataDiskCount:1
dataDiskType: Standard
}
I think the disk type needs to be something like pd-standard ...
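A quick way to check what kind of disks a worker actually got (the instance name and zone are placeholders); SCRATCH disks block stopping the instance, PERSISTENT ones do not:
gcloud compute instances describe worker-example-1 \
    --zone us-central1-a \
    --format="value(disks[].type)"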
Ensure public IPs only available when explicitly requested
Add support for Azure
Replace XIP.IO with NIP.IO
New Anaconda version
Hi Toby,
Just ran into an issue on the deployment.
There has been some version incompatibility during our latest deployment (with Andre Molinaar).
Found errors in cluster configuration:
- ErrorInfo{code=NO_PARCEL_FOUND_WITH_VERSION, properties={availableProductVersions=4.3.1, clusterProduct=Anaconda, clusterProductVersion=4.2}, causes=[]}
The workaround is to modify common.conf, replacing Anaconda: 4.2 with Anaconda: 4.3.1.
Not sure if it is the best approach.
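For what it's worth, the workaround boils down to a one-liner like this (the exact key format in common.conf is assumed):
sed -i 's/Anaconda: *4\.2/Anaconda: 4.3.1/' common.conf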
BR
CDH Packages not visible on AWS with 5.12
#fc45bce
When I use an AMI that doesn't have CDH prepackaged the system fails and the following error is seen in the director server log:
Waiting for product versions to be visible in CM: {CDH=5.12.0-1.cdh5.12.0.p0.29}
I've tried this with the following AMIs in West-2 region:
- ami-a3fa16c3 - rhel 72 ami taken from the faster-bootstrap system
- ami-5dd3743d - community rhel 72
- ami-e2167182 - community rhel 73
I have no idea why that specific CDH product version is being searched for - it certainly isn't referenced in the URLs that I gave.
Update README to indicate that aws.conf creates the cdsw user
Use csds rather than downloading the Spark 2 material directly
In the cloudera-manager section use this:
csds: [
"http://archive.cloudera.com/spark2/csd/SPARK2_ON_YARN-2.1.0.cloudera1.jar"
]
rather than loading everything via a bootstrap script.
install of cdsw failed because data2 wasn't found
Filippo had an issue where his disks were mounted as data1 and data2, not data0 and data1 - he's using Director 2.5. Maybe that's the problem?
To fix the issue we had to remove data2 from /etc/fstab and edit /etc/cdsw/config/cdsw.conf to set up the docker block devices.
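Roughly what the manual fix amounted to, as a sketch; the device name and the cdsw.conf variable are from this one host and may differ elsewhere:
sed -i '\|/data2|d' /etc/fstab     # drop the stale data2 mount
vi /etc/cdsw/config/cdsw.conf      # repoint DOCKER_BLOCK_DEVICES (assumed variable name)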
Add support for BIND as an XIP replacement
Clock offset problem
Test of the host clock's offset from its NTP server.
Bad : The host's NTP service could not be located or did not respond to a request for the clock offset.
Actions
Change Host Clock Offset Thresholds for all hosts
Change Host Clock Offset Thresholds for this host
Advice
This is a host health test that checks if the host's system clock appears to be out-of-sync with its NTP server(s). The test uses the 'ntpdc -np' (if ntpd is running) or 'chronyc sources' (if chronyd is running) command to check that the host is synchronized to an NTP peer and that the absolute value of the host's clock offset from that peer is not too large. If the command fails, NTP is not synchronized to a server, or the host's NTP daemon is not running or cannot be contacted, the test returns "Bad" health.
The 'ntpdc -np' or 'chronyc sources' output contains a row for each of the host's NTP servers. The row starting with a '*' (if ntpdc) or '^*' (if chronyc) contains the peer to which the host is currently synchronized. No row starting with a '*' or '^*' indicates that the host is not currently synchronized. Communication errors, and an offset between the peer and the host time that is too large, are examples of conditions that can lead to a host being unsynchronized.
Make sure that UDP port 123 is open on any firewall that is in use. Check the system log for ntpd or chronyd messages related to configuration errors. If running ntpd, use 'ntpdc -c iostat' to verify that packets are sent and received between the different peers. More information about the conditions of each peer can be found by running the command 'ntpq -c as'. The output of this command includes the association ID that can be used in combination with 'ntpq -c "rv <association ID>"' to get more information about the status of each peer. Use the command 'ntpq -c pe' to return a summary of all peers and the reason they are not in use. If running chronyd, use 'chronyc activity' to check how many NTP sources are online/offline. More information about the conditions of each peer can be found by running the command 'chronyc sourcestats'. To check chrony tracking, issue the command 'chronyc tracking'.
If NTP is not in use on the host, disable this check for the host, using the configuration options shown below. Cloudera recommends using NTP for time synchronization of Hadoop clusters.
A failure of this health test can indicate a problem with the host's NTP service or configuration.
This test can be configured using the Host Clock Offset Thresholds host configuration setting.
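For a first pass on an affected host, the commands mentioned in the advice above boil down to:
systemctl status ntpd chronyd              # see which daemon (if either) is running
chronyc tracking && chronyc sources -v     # if chronyd is in use
ntpq -c pe                                 # if ntpd is in use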
Allow for multiple cdsw clusters using the same DNS server
The fqdn for a cdsw node is currently fixed. Need to allow for different clusters to have different cdsw names so that they can coexist
ensure selinux is off before running cdsw init
It seems that simply doing a setenforce 0 might be insufficient; that only puts selinux into PERMISSIVE mode, and it might be necessary to reboot to get selinux to DISABLED.
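A sketch of the usual RHEL/CentOS 7 way to get from Permissive to fully Disabled (requires a reboot):
setenforce 0 || true                                          # immediate, but only Permissive
sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config  # persists across boots
reboot
# afterwards, 'getenforce' should print Disabled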
replace the embedded private key with a file reference
I believe Director 2.4 allows one to simply refer to a local private key file rather than embed the private key directly. I'll have to try this out.
Change hostname for gateway from ec2.* to cdsw.*
update to cdh 5.14
Include TLS
Delete my gcp secret key from this repo
separate the scripting from the conf file
In Director 2.4 it's possible to simply refer to local script files rather than putting all the text into the conf file. This would greatly simplify the conf files and allow for easier maintenance.
cdsw disk space needs to be sufficient to meet all tests
minimum docker block device size is 500G
Minimum application block device size is 500G
Minimum root volume size is 100G
refactor the files to share the common parts
Update README to reflect the need to add an alias for rhel7
Add SECRET file to handle secrets
Use Relative Path includes
HOCON allows for a relative path include, as per https://github.com/typesafehub/config/blob/master/HOCON.md#include-semantics-file-formats-and-extensions:
if the included file is a relative path, then it should be located relative to the directory containing the including file. The current working directory of the process parsing a file must NOT be used when interpreting included paths.
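For illustration only (the file names and keys are assumptions): a provider conf that pulls in a sibling file via a relative include, regardless of the directory the director client is invoked from:
cat > aws/cluster.conf <<'EOF'
include "instances.conf"              # resolved next to aws/cluster.conf, not the CWD

master: ${common-instance-properties}
EOF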
Allow for different domain names
The domain name is fixed at cdh-cluster.internal - might want to consider changing this and allowing the user to specify the name.
Kerberos add principle command error in README.md
Tried running sudo kadmin.local addprinc cdsw -pw Cloudera1
Getting an error:
[centos@ip-10-0-0-33 ~]$ sudo kadmin.local addprinc cdsw -pw Cloudera1
usage: add_principal [options] principal
options are:
[-randkey|-nokey] [-x db_princ_args]* [-expire expdate] [-pwexpire pwexpdate] [-maxlife maxtixlife]
[-kvno kvno] [-policy policy] [-clearpolicy]
[-pw password] [-maxrenewlife maxrenewlife]
[-e keysaltlist]
[{+|-}attribute]
attributes are:
allow_postdated allow_forwardable allow_tgs_req allow_renewable
allow_proxiable allow_dup_skey allow_tix requires_preauth
requires_hwauth needchange allow_svr password_changing_service
ok_as_delegate ok_to_auth_as_delegate no_auth_data_required
where,
[-x db_princ_args]* - any number of database specific arguments.
Look at each database documentation for supported arguments
I ran sudo kadmin.local addprinc -pw Cloudera1 cdsw (options before the principal name) and that worked.
update GCP Director instructions to be less version sensitive and include restart
The SSH_KEYFILE setting is unused - delete it
investigate use of additional EBS volumes and remove if necessary
Add instructions on how to build the AMI