fgci-org / fgci-ansible

:microscope: Collection of the Finnish Grid and Cloud Infrastructure Ansible playbooks

License: MIT License

Shell 53.88% Python 4.89% Dockerfile 11.37% Jinja 29.86%
ansible ansible-roles slurm grid wlcg cscfi hpc provisioning cluster

fgci-ansible's People

Contributors

a1ve5, enitinggous, gehock, jabl, mantti, martbhell, mhakala, tiggi, villes1, waffle-iron

fgci-ansible's Issues

log module usage to syslog

It would be quite useful to have some module usage statistics. The first step is to write to syslog whenever a module is loaded.

The lua script we use on Taito is available here:
/homeappl/appl_taito/opt/lmod/apps/SitePackage.lua

install playbook - fgci repo is needed to install pdsh

TASK: [ansible-role-yum | install software that do not need extra configuration] ***
failed: [io-install.fgci.csc.fi] => (item=git,nfs-utils,bash-completion,wget,pdsh) => {"changed": false, "failed": true, "item": "git,nfs-utils,bash-completion,wget,pdsh", "rc": 0, "results": []}
msg: No Package matching 'pdsh' found available, installed or updated

FATAL: all hosts have already failed -- aborting

motd updater

A MOTD updater would be nice.

Some feature requests:

  • pull in data from RSS feed
  • different MOTD on different machines in the cluster
  • idid / lastdone commands to easily update motd with a small notification?

Improve partitioning

The kickstart_partitions variable in group_vars/login,grid,compute/ is used by ansible-role-pxe_config, and the layout it produces is not ideal.

On our test compute nodes it creates a /home partition which is not used because they get /home from NFS.

Perhaps skip creating /home and create a /tmp partition instead?

prep_local.yml playbook malfunctioning

When I ran the prep_local.yml playbook, it did not rename the example files to yml files, although I could do that manually. The other problem was that tar is in /bin/tar on my system, so I had to create a symlink at /usr/bin/tar for the script; with that in place I got the backup in fgci-ansible/backup.
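
The two manual workarounds described above could look roughly like the following. This is only a sketch, not what prep_local.yml does: the function names are made up, and the `.yml.example` suffix is an assumption about how the example files are named. It copies instead of renaming so the examples stay in place, and it never overwrites an existing yml file.

```shell
#!/bin/sh
# Hypothetical helper: create name.yml from every name.yml.example in a
# directory. The ".yml.example" suffix is an assumption, not taken from the repo.
activate_example_files() {
    dir=$1
    for src in "$dir"/*.yml.example; do
        [ -e "$src" ] || continue            # the glob matched nothing
        dest=${src%.example}                 # strip the trailing ".example"
        [ -e "$dest" ] || cp -p "$src" "$dest"   # keep files that already exist
    done
}

# Hypothetical helper: print the path of tar as found in PATH, so a script
# does not have to hard-code /usr/bin/tar or /bin/tar.
locate_tar() {
    command -v tar
}
```

Looking tar up through PATH avoids the need for a symlink on systems such as Ubuntu where it lives in /bin.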

The user is running Ubuntu.

Pre-Generate ssh host keys

Generate SSH host keys and install them on compute nodes.
Currently the old keys must be removed from every place they are stored, or there will be errors when a node is reinstalled.
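
One way to pre-generate the keys is to create them once per node on the admin side and reuse them on every reinstall. The sketch below is an assumption about how that could be done, not the project's implementation: the function name, the directory layout and the choice of a 4096-bit RSA key are all illustrative. How the keys are then installed on the nodes is left out.

```shell
#!/bin/sh
# Hypothetical helper: generate one RSA host key pair per node under
# <outdir>/<node>/, skipping nodes that already have a key so that a
# reinstall keeps presenting the same host key.
pregenerate_host_keys() {
    outdir=$1
    shift
    for node in "$@"; do
        mkdir -p "$outdir/$node" || return 1
        key="$outdir/$node/ssh_host_rsa_key"
        if [ -e "$key" ]; then
            continue                          # never overwrite an existing key
        fi
        # -q quiet, -N '' empty passphrase, -C comment, -f output file
        ssh-keygen -q -t rsa -b 4096 -N '' -C "root@$node" -f "$key" || return 1
    done
}
```

For example, `pregenerate_host_keys ./hostkeys io1 io2` would leave `./hostkeys/io1/ssh_host_rsa_key` and its `.pub` file in place for later runs.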

Where to source /etc/cvmfs/default.local ?

The fgci repo's EL7 build of cvmfs-repofiles-fgi does not contain /etc/cvmfs/default.local

These are the contents of the file in the FGI EL6 rpm:

CVMFS_CACHE_BASE=/var/cache/cvmfs2
CVMFS_REPOSITORIES=fgi
CVMFS_DEFAULT_DOMAIN=csc.fi
CVMFS_HTTP_PROXY=DIRECT
CVMFS_CACHE_DIR=/var/cache/cvmfs2
CVMFS_QUOTA_LIMIT=20480 

dnsmasq on install node is still missing

For PXE boot to work, dnsmasq needs to be installed on the install node.
A role has to be built to handle this.
Afterwards, /etc/resolv.conf on the nodes has to point to the install node.

Make a "Performance test" Playbook

We should build an "acceptance test" role that assesses the performance of InfiniBand and disks/RAID.

Interconnect tests
Connectivity tests on each InfiniBand network fabric installed (connected to a switch) on
a cluster.
The FDR InfiniBand interconnect tests will consist of bandwidth tests and latency tests
on all ports. These will be tested with ib_write_bw and ib_send_lat; to pass, the
bandwidth must be better than 5000 MB/s and the latency less than 2 microseconds. These
criteria must be met by all point-to-point links between any two nodes.
The tests will be performed with RHEL/CentOS package perftest-2.2-1.el6.x86_64 or
equivalent.
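
The pass criteria above can be expressed as a small shell check. This is only a sketch: the function name ib_link_passes is made up, and it assumes the bandwidth (in MB/s) and latency (in microseconds) have already been extracted from the ib_write_bw and ib_send_lat output, which is not shown here.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if one point-to-point link meets the
# acceptance criteria (bandwidth > 5000 MB/s and latency < 2 microseconds).
# Usage: ib_link_passes <bandwidth_MBps> <latency_usec>
ib_link_passes() {
    awk -v bw="$1" -v lat="$2" 'BEGIN {
        num = "^[0-9]+([.][0-9]+)?$"
        # Reject anything that is not a plain non-negative number.
        if (bw !~ num || lat !~ num) exit 2
        # awk handles the floating-point comparison that sh cannot do itself.
        exit !(bw + 0 > 5000 && lat + 0 < 2)
    }'
}
```

A role could call this once per node pair and fail the play on the first link that returns a non-zero status.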

Disk tests
All Disk drives must work without errors, all RAID systems must work without errors.
Disk drives will be tested with smartctl, hdparm, dd to check bandwidth. Output of
dmesg and/or system logs will be checked for seek errors.

Related document: https://confluence.csc.fi/download/attachments/51890567/Liite_3_Hyvaksymismenettely.pdf?api=v2

rsync is needed in install

TASK: [ansible-role-fgci-install | synchronize group_vars/ to /var/www/html/group_vars/ - for ansible-pull - secrets and firewalls are not needed] ***
failed: [io-install.fgci.csc.fi -> 127.0.0.1] => {"cmd": "rsync --delay-updates -F --compress --recursive --rsh 'ssh -S none -o StrictHostKeyChecking=no' --exclude=secrets.yml --exclude=trusted_networks.yml --out-format='<>%i %n%L' "/home/lalves/code/fgci-ansible/group_vars/" "[email protected]:/var/www/html/group_vars"", "failed": true, "rc": 127}
msg: X11 forwarding request failed on channel 0
bash: rsync: command not found
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: remote command not found (code 127) at io.c(226) [sender=3.1.1]

FATAL: all hosts have already failed -- aborting

Speed up node reinstall

Right now the kickstart sets up an ansible-pull cronjob that runs every 15 minutes.
The ansible-pull run itself then changes that to every 120 minutes.
This means that on a reinstall there is quite a long downtime before the node is configured.

Possible solution:

  • make kickstart run the ansible-pull-script.sh in /etc/rc.local
  • have ansible-pull remove it from there and add it as a cronjob
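
The two steps of the possible solution could be sketched as follows. This is an illustration under assumptions, not the project's code: the function names, the script location and the cron file name are hypothetical, and only /etc/rc.local, ansible-pull-script.sh and the 120 minute interval come from the text above.

```shell
#!/bin/sh
# Hypothetical defaults; all three can be overridden before calling the functions.
RC_LOCAL="${RC_LOCAL:-/etc/rc.local}"
PULL_SCRIPT="${PULL_SCRIPT:-/usr/local/bin/ansible-pull-script.sh}"
CRON_FILE="${CRON_FILE:-/etc/cron.d/ansible-pull}"

# Kickstart side: make the first boot run the pull script immediately.
# Appends the script path to rc.local only if it is not already there.
enable_first_boot_pull() {
    grep -qxF "$PULL_SCRIPT" "$RC_LOCAL" 2>/dev/null || echo "$PULL_SCRIPT" >> "$RC_LOCAL"
}

# ansible-pull side: remove the rc.local entry and install a cron job
# that runs the script every two hours (120 minutes).
switch_to_cron() {
    tmp=$(mktemp) || return 1
    grep -vxF "$PULL_SCRIPT" "$RC_LOCAL" > "$tmp" || true   # grep exits 1 if nothing is left
    cat "$tmp" > "$RC_LOCAL"                                # keeps rc.local's mode and owner
    rm -f "$tmp"
    printf '0 */2 * * * root %s\n' "$PULL_SCRIPT" > "$CRON_FILE"
}
```

With this split the node is configured on its first boot instead of waiting for the first cron interval, and later runs fall back to the usual schedule.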

PXE boot node reinstall flag file

The PXE boot process requires that a file named after the node being installed exists in install-node:/var/www/provision/reinstall/

This needs to be ensured for the first installation, and possibly a playbook could be run whenever a node needs a reinstall.

reinstall files weren't properly created

[root@io-install ~]# ls -la /var/www/provision/reinstall/
total 8
drwxr-xr-x 2 apache apache 4096 Dec 9 15:55 .
drwxr-xr-x 4 root root 4096 Nov 26 14:06 ..
-rw-r--r-- 1 root root 0 Dec 9 15:55 io1,gpu
-rw-r--r-- 1 root root 0 Dec 9 15:55 io2,gpu
-rw-r--r-- 1 root root 0 Dec 9 15:55 io3,gpu
-rw-r--r-- 1 root root 0 Dec 9 15:55 io4,gpu
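
The listing above shows flag files named like "io1,gpu" where a file named after the node alone is expected. A minimal sketch of creating the flags by hand follows. It assumes the intended names are the bare node names (io1 to io4) and that everything from the first comma onward is not part of the node name; the function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: create one empty reinstall flag file per node.
# Usage: create_reinstall_flags <dir> <node>...
create_reinstall_flags() {
    dir=$1
    shift
    mkdir -p "$dir" || return 1
    for entry in "$@"; do
        node=${entry%%,*}          # assumption: drop anything after the first comma
        [ -n "$node" ] || continue # skip empty names
        : > "$dir/$node"           # create (or truncate) the empty flag file
    done
}
```

On the install node this might be called as `create_reinstall_flags /var/www/provision/reinstall io1 io2 io3 io4`.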

munge.key

Figure out a way to provide the same munge.key file from both ansible-pull and the admin's workstation.

Enable IPv6

Add an option to enable IPv6:

Initially these are required:

  • update the network_interfaces role to also configure IPv6
  • enable firewalls on compute nodes
