clusterinthecloud / docs

A tutorial to set up a running compute cluster on cloud resources

Home Page: https://cluster-in-the-cloud.readthedocs.io/en/latest/

License: Other

Makefile 11.75% Batchfile 12.85% Python 75.40%
cluster-in-the-cloud

Contributors

christopheredsall, milliams, willfurnass

docs's Issues

Ansible unreachable error in slurm playbook.

I'm attempting to set up a small cluster-in-the-cloud on my OCI tenancy. The only changes I've made to the original are to use a couple of VM.GPU2.1 nodes as the compute nodes. It all appeared to be going well, until the finish script failed at:

rc = subprocess.call(['ansible-playbook', '--inventory=/home/opc/hosts', '--extra-vars=@/home/opc/users.yml', 'finalise.yml'], cwd='/home/opc/slurm-ansible-playbook')

with the following output:

[opc@mgmt ~]$ ./finish

PLAY [finalise] *********************************************************************************************************************************************************************************************************************************************************************************************************

TASK [Gathering Facts] **************************************************************************************************************************************************************************************************************************************************************************************************
fatal: [mgmt]: UNREACHABLE! => {"changed": false, "msg": "SSH Error: data could not be sent to remote host \"mgmt\". Make sure this host can be reached over ssh", "unreachable": true}
fatal: [compute002]: UNREACHABLE! => {"changed": false, "msg": "SSH Error: data could not be sent to remote host \"compute002\". Make sure this host can be reached over ssh", "unreachable": true}
fatal: [compute001]: UNREACHABLE! => {"changed": false, "msg": "SSH Error: data could not be sent to remote host \"compute001\". Make sure this host can be reached over ssh", "unreachable": true}
        to retry, use: --limit @/home/opc/slurm-ansible-playbook/finalise.retry

PLAY RECAP **************************************************************************************************************************************************************************************************************************************************************************************************************
compute001                 : ok=0    changed=0    unreachable=1    failed=0   
compute002                 : ok=0    changed=0    unreachable=1    failed=0   
mgmt                       : ok=0    changed=0    unreachable=1    failed=0   


#############################################
Error: Ansible run did not complete correctly
#############################################

I'm able to ssh into the instances manually, so I'm not sure what's going wrong here. Any help would be much appreciated!
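
When triaging a run like the one above, it can help to pull the affected hosts out of the PLAY RECAP block programmatically and then test SSH to each of them as the same user that runs the playbook. The following is a minimal sketch, assuming the recap lines keep the `host : key=value ...` layout shown in the log; the function names `parse_recap` and `unreachable_hosts` are hypothetical helpers, not part of CitC or Ansible.

```python
import re

# Matches one PLAY RECAP line, for example:
# "compute001 : ok=0 changed=0 unreachable=1 failed=0"
RECAP_LINE = re.compile(r"^\s*(?P<host>\S+)\s*:\s*(?P<counters>(?:\w+=\d+\s*)+)$")


def parse_recap(output):
    """Return {host: {counter: value}} for every recap line found in output."""
    hosts = {}
    for line in output.splitlines():
        match = RECAP_LINE.match(line)
        if match:
            pairs = (item.split("=") for item in match.group("counters").split())
            hosts[match.group("host")] = {key: int(value) for key, value in pairs}
    return hosts


def unreachable_hosts(output):
    """Return the sorted hosts whose recap line reports unreachable > 0."""
    recap = parse_recap(output)
    return sorted(host for host, c in recap.items() if c.get("unreachable", 0) > 0)
```

Feeding the captured output of ./finish to `unreachable_hosts` gives the list of hosts to test by hand, which narrows down whether the problem is specific to some nodes or affects every host in the inventory.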

Ansible play fails after new version of CitC

Hi,

I just updated to the newest version of CitC and tried to start a cluster. When the Ansible playbook is run, I get the following error:

Starting Ansible Pull at 2019-04-15 06:12:47
/usr/bin/ansible-pull --url=https://github.com/ACRC/slurm-ansible-playbook.git --checkout=master --inventory=/root/hosts management.yml
[WARNING]: Could not match supplied host pattern, ignoring: mgmt
[WARNING]: Your git version is too old to fully support the depth argument.
Falling back to full checkouts.
mgmt.subnetad1.clustervcn.oraclevcn.com | CHANGED => {
"after": "ece81babcd538714aa1a61a57c7eee75a51a79c6",
"before": null,
"changed": true
}
[WARNING]: Could not match supplied host pattern, ignoring: mgmt

PLAY [finisher script] *********************************************************
...
TASK [slurm : install Slurm] ***************************************************
Monday 15 April 2019 06:14:55 +0000 (0:00:00.221) 0:02:04.207 **********
failed: [mgmt.subnetad1.clustervcn.oraclevcn.com] (item=slurmctld) => changed=false
  item: slurmctld
  msg: No package matching 'slurm-slurmctld-17.11*' found available, installed or updated
  rc: 126
  results:
  - No package matching 'slurm-slurmctld-17.11*' found available, installed or updated
failed: [mgmt.subnetad1.clustervcn.oraclevcn.com] (item=slurmdbd) => changed=false
  item: slurmdbd
  msg: No package matching 'slurm-slurmdbd-17.11*' found available, installed or updated
  rc: 126
  results:
  - No package matching 'slurm-slurmdbd-17.11*' found available, installed or updated
....
PLAY RECAP *********************************************************************
mgmt.subnetad1.clustervcn.oraclevcn.com : ok=37 changed=35 unreachable=0 failed=1

Sincerely,
Joe
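
The error message says that no package name matched the glob 'slurm-slurmctld-17.11*'. One possible cause, which the log alone does not confirm, is that the package repository no longer carries the 17.11 series. The sketch below only illustrates how such a version-pinned glob stops matching once newer package names replace the old ones; the package names in `available` are hypothetical, and `matching_packages` is an illustrative helper rather than anything Ansible or yum provides.

```python
from fnmatch import fnmatch


def matching_packages(pattern, available):
    """Return the package names from `available` that match the glob `pattern`."""
    return [name for name in available if fnmatch(name, pattern)]


# Hypothetical repository contents: only a newer Slurm series is published.
available = [
    "slurm-slurmctld-18.08.7-1.el7",
    "slurm-slurmdbd-18.08.7-1.el7",
]

# The version-pinned glob from the log finds nothing in this hypothetical list.
print(matching_packages("slurm-slurmctld-17.11*", available))

# A glob without the version pin still matches the newer package.
print(matching_packages("slurm-slurmctld-*", available))
```

On the management node itself, listing the Slurm packages that the configured repositories actually offer would show whether this is what happened.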

Expand "Configuring node images" to outline how to set up compute nodes for compilation

Since it is recommended to compile applications on a compute node (rather than on the management node), I think it would be helpful to give a couple of general pointers on how to edit /home/citc/compute_image_extra.sh so that the node images are ready for general compilation. For example, uncomment

sudo yum -y groupinstall "Development Tools"

and install Python 3 and make it the default for the unversioned python command

sudo yum -y install python3
sudo alternatives --set python /usr/bin/python3
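
The "uncomment" step can also be scripted, which is convenient when the same change has to be applied repeatedly. This is a minimal sketch, assuming the line is present in the script and commented out with a leading '#'; `uncomment` is a hypothetical helper, and the sample text is made up for illustration rather than copied from the real compute_image_extra.sh.

```python
import re


def uncomment(script_text, command):
    """Strip the leading '#' from every line that consists only of `command`."""
    pattern = re.compile(
        r"^([ \t]*)#[ \t]*(" + re.escape(command) + r")[ \t]*$",
        re.MULTILINE,
    )
    # Keep the original indentation (group 1) and the command itself (group 2).
    return pattern.sub(r"\1\2", script_text)


# Made-up script text standing in for the real file contents.
sample = (
    "#!/bin/bash\n"
    "\n"
    '# sudo yum -y groupinstall "Development Tools"\n'
    "\n"
    "echo done\n"
)

print(uncomment(sample, 'sudo yum -y groupinstall "Development Tools"'))
```

Only lines that match the command exactly are touched, so the shebang and other comments are left alone. To apply it to the real script, read /home/citc/compute_image_extra.sh, pass its text through `uncomment`, and write the result back.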

Public keys need adding in the variables file

$ vim terraform.tfvars
There's a few variables which can be changed in here, but by default you should not have to change anything.

I found the above passage from the docs confusing, because I did have to change that file: I had to add the public keys I want to use to log in to the management machine.

Without adding them, I couldn't log in to the mgmt node.

Move docs to follow Diátaxis

Following https://diataxis.fr, we could reorganise the docs by splitting them into tutorials, how-tos and explanation. I don't think there's much of a place for "reference" in CitC, though.

The reorganisation could look like:

  • Tutorials
    • Installing a cluster
    • Running your first job
    • Deleting a cluster
  • How-tos
    • Setting up ARM nodes on AWS
    • Setting up GPU nodes
    • Installing and managing software
    • Updating CitC
  • Explanation
    • How the elastic scaling works
    • Why nodes are managed by name in the way they are

etc.
