fgci-org / fgci-ansible
:microscope: Collection of the Finnish Grid and Cloud Infrastructure Ansible playbooks
License: MIT License
Reported by Ulf
If the mail server for the adminMailAddr e-mail address has an IPv6 address and IPv6 is kind of disabled (the default), e-mails to adminMailAddr will not work.
To access the iDRAC virtual console, the admin node needs javaws.
When I ran the prep_local.yml playbook, it did not rename the example files to yml files, but I could do that manually. The other issue was that tar was in /bin/tar, so I had to create a symlink at /usr/bin/tar for the script (this way I did get the backup in fgci-ansible/backup).
The user is using Ubuntu.
For PXE boot to work, dnsmasq needs to be installed on the install node.
A role has to be built to handle this.
Afterwards, /etc/resolv.conf on the nodes has to point to the install node.
Figure out a way to provide the same munge.key file from both ansible-pull and the admin's workstation.
https://github.com/willshersystems/ansible-sshd
PermitRootLogin without-password
The first time it fails or hangs, and one must press CTRL+C and start the provisioning.yml playbook again.
VM configurations need to be changed so that the VMs survive a hypervisor reboot.
Right now the kickstart sets up an ansible-pull cron job that runs every 15 minutes.
Then ansible-pull itself changes that to every 120 minutes.
This means that on a reinstall there is quite a long downtime before the node is configured.
Possible solution:
Probably some more safety checks can be added here to make sure all the requirements are set somewhat sanely before continuing.
If the ext_ip_addr variable is not set on the host, we'll get
IPADDR={#ext_ip_addr#}
in ifcfg-eth*.
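A pre-flight check along the following lines could catch the missing variable before any template is rendered. This is only a minimal Python sketch of the idea: the function name `missing_vars` and the sample host dictionaries are hypothetical, and in the playbooks themselves the same check would more likely be expressed as an assertion task.

```python
def missing_vars(hostvars, required):
    """Return the required variable names that are unset or empty for one host."""
    return [name for name in required if hostvars.get(name) in (None, "")]


# Hypothetical host variables; only the name ext_ip_addr comes from the report above.
REQUIRED = ["ext_ip_addr"]
good_host = {"ext_ip_addr": "192.0.2.10"}
bad_host = {}

print(missing_vars(good_host, REQUIRED))  # []
print(missing_vars(bad_host, REQUIRED))   # ['ext_ip_addr']
```

A non-empty result would be the signal to stop before writing ifcfg-eth* files.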
It would be quite useful to have some module usage statistics. The first part of this is to write to syslog whenever a module is loaded.
The lua script we use on Taito is available here:
/homeappl/appl_taito/opt/lmod/apps/SitePackage.lua
Make a role that sends an annotation to cassini indicating that a configuration change was made with ansible. See https://github.com/CSC-IT-Center-for-Science/fgci-ansible/blob/master/tools/grafana_annotation.sh and https://github.com/CSC-IT-Center-for-Science/ansible-role-elasticsearch-demo-event/blob/master/tasks/main.yml for examples of how it can be done.
Generate SSH host keys and install them on compute nodes.
Currently they must be deleted from all the right places or there will be errors on node reinstall.
The dns role sets up /etc/resolv.conf so that the install node queries itself first. This is OK only after dnsmasq has been installed and configured.
Need to run the dns role later.
It would be nice to get it installed by KS
A MOTD updater would be nice.
Some feature requests:
https://github.com/CSC-IT-Center-for-Science/fgci-ansible/blob/master/tools/reinstall_node.yml
Kickstart files matching kickstart_profile must be placed on the install node.
boot.py from ansible-role-pxe_bootstrap will try to fetch those from there.
Example:
http://1.1.1.2/ks/FGCI-compute-node
When many interfaces are up, provisioning becomes slow because of DHCP timeouts.
Means setting
adminremove_passwords: True
https://github.com/CSC-IT-Center-for-Science/ansible-role-users/blob/master/defaults/main.yml#L8
The PXE boot process requires that a file named after the node being installed exists on install-node:/var/www/provision/reinstall/
This needs to be ensured for the first installation, and possibly a playbook could be run whenever a node needs a reinstall.
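The marker-file step could be sketched as follows. This is an illustrative Python version, not the role's actual implementation: the helper name `mark_for_reinstall` is hypothetical, and the demo writes to a temporary directory rather than /var/www/provision/reinstall/.

```python
from pathlib import Path
import tempfile


def mark_for_reinstall(reinstall_dir, node_names):
    """Create one empty marker file per node name and return the created paths."""
    directory = Path(reinstall_dir)
    directory.mkdir(parents=True, exist_ok=True)
    created = []
    for name in node_names:
        marker = directory / name  # the file name is the bare node name
        marker.touch()
        created.append(marker)
    return created


# Demo in a temporary directory; on the install node the target would be
# the reinstall directory named above.
with tempfile.TemporaryDirectory() as tmp:
    paths = mark_for_reinstall(tmp, ["io1", "io2"])
    print(sorted(p.name for p in paths))
```

Passing the node names as a list, one entry per node, keeps each marker named after exactly one host.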
The fgci repo's EL7 build of cvmfs-repofiles-fgi does not contain /etc/cvmfs/default.local
This is the content of the file in the FGI EL6 rpm:
CVMFS_CACHE_BASE=/var/cache/cvmfs2
CVMFS_REPOSITORIES=fgi
CVMFS_DEFAULT_DOMAIN=csc.fi
CVMFS_HTTP_PROXY=DIRECT
CVMFS_CACHE_DIR=/var/cache/cvmfs2
CVMFS_QUOTA_LIMIT=20480
Enable httpd.service with systemctl through Ansible.
We should build an "acceptance test" role that assesses the performance of InfiniBand and disks/RAID.
Interconnect tests
Connectivity tests on each InfiniBand network fabric installed (connected to a switch) on a cluster.
The FDR InfiniBand interconnect tests will consist of bandwidth tests and latency tests on all ports. These will be tested with ib_write_bw and ib_send_lat, and to pass, the bandwidth must be better than 5000 MB/s and the latency less than 2 microseconds. These criteria must be met by all point-to-point links between any two nodes.
The tests will be performed with the RHEL/CentOS package perftest-2.2-1.el6.x86_64 or equivalent.
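The pass criteria for the interconnect tests could be checked roughly like this. It is a sketch only: the function names and the sample measurements are made up for illustration, and parsing the real ib_write_bw and ib_send_lat output is left out.

```python
MIN_BANDWIDTH_MB_S = 5000.0  # bandwidth must be better than this
MAX_LATENCY_US = 2.0         # latency must be less than this


def link_passes(bandwidth_mb_s, latency_us):
    """Apply both acceptance thresholds to one point-to-point link."""
    return bandwidth_mb_s > MIN_BANDWIDTH_MB_S and latency_us < MAX_LATENCY_US


def failing_links(results):
    """Return the node pairs whose measurements miss either threshold.

    results maps (node_a, node_b) to (bandwidth in MB/s, latency in microseconds).
    """
    return sorted(
        pair for pair, (bw, lat) in results.items() if not link_passes(bw, lat)
    )


# Made-up measurements for three links, purely illustrative.
sample = {
    ("io1", "io2"): (5600.0, 1.2),
    ("io1", "io3"): (4800.0, 1.1),
    ("io2", "io3"): (5700.0, 2.4),
}
print(failing_links(sample))
```

Since the criteria must hold for every pair of nodes, the role would pass only when the returned list is empty.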
Disk tests
All disk drives must work without errors, and all RAID systems must work without errors.
Disk drives will be tested with smartctl, hdparm, and dd to check bandwidth. The output of dmesg and/or the system logs will be checked for seek errors.
Related document: https://confluence.csc.fi/download/attachments/51890567/Liite_3_Hyvaksymismenettely.pdf?api=v2
Add an option to enable IPv6:
initially these are required:
Since commit https://github.com/CSC-IT-Center-for-Science/fgci-ansible/blob/b200e101cab6faae9d1f8f17b6b9f36aa8bebe16/tests/Dockerfile or so we run the travis-ci test inside a CentOS7 Docker container. Have not been able to get mariadb to start at all inside this container.
Right now it's:
Could we perhaps use ansible-pull on the compute nodes?
Use {{ ansible_hostname }} instead of {{ inventory_hostname }}
Only have
There now, should have namserver2 too.
Reported by Ulf
Make the ansible-pull command in ansible-pull-script.sh use a configurable branch.
TASK: [ansible-role-yum | install software that do not need extra configuration] ***
failed: [io-install.fgci.csc.fi] => (item=git,nfs-utils,bash-completion,wget,pdsh) => {"changed": false, "failed": true, "item": "git,nfs-utils,bash-completion,wget,pdsh", "rc": 0, "results": []}
msg: No Package matching 'pdsh' found available, installed or updated
FATAL: all hosts have already failed -- aborting
[root@io-install ~]# ls -la /var/www/provision/reinstall/
total 8
drwxr-xr-x 2 apache apache 4096 Dec 9 15:55 .
drwxr-xr-x 4 root root 4096 Nov 26 14:06 ..
-rw-r--r-- 1 root root 0 Dec 9 15:55 io1,gpu
-rw-r--r-- 1 root root 0 Dec 9 15:55 io2,gpu
-rw-r--r-- 1 root root 0 Dec 9 15:55 io3,gpu
-rw-r--r-- 1 root root 0 Dec 9 15:55 io4,gpu
Blocked by authors of these roles?
The roles:
For delivering IP addresses to iDRACs
check other nodes as well
The kickstart_partitions variable in group_vars/login,grid,compute/ is used by ansible-role-pxe_config, and it is not ideal.
On our test compute nodes it creates a /home partition, which is not used because they get /home from NFS.
Skip creating a /home and create a /tmp perhaps?
To make sure the RPMs are cached at the proxy server before continuing.
TASK: [ansible-role-fgci-install | synchronize group_vars/ to /var/www/html/group_vars/ - for ansible-pull - secrets and firewalls are not needed] ***
failed: [io-install.fgci.csc.fi -> 127.0.0.1] => {"cmd": "rsync --delay-updates -F --compress --recursive --rsh 'ssh -S none -o StrictHostKeyChecking=no' --exclude=secrets.yml --exclude=trusted_networks.yml --out-format='<>%i %n%L' "/home/lalves/code/fgci-ansible/group_vars/" "[email protected]:/var/www/html/group_vars"", "failed": true, "rc": 127}
msg: X11 forwarding request failed on channel 0
bash: rsync: command not found
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: remote command not found (code 127) at io.c(226) [sender=3.1.1]
FATAL: all hosts have already failed -- aborting
A new role, or a task in an existing role, should copy the x509 certificate to the grid node.
install node
So it seems that eth1 doesn't come up as it should
Reported by Ulf