acto-cloudlab's People

Contributors

tylergu, whentojump

acto-cloudlab's Issues

Startup would sometimes exit with error code 2

TL;DR

We suspect it's an issue with CloudLab.

An easy workaround for the time being is to rerun the startup manually:

sudo su - geniuser
bash /local/repository/scripts/cloudlab_startup_run_by_geniuser.sh
exit

Details

Behavior

The startup occasionally fails: the "Startup" column eventually shows Exited (2) instead of Finished.

Possible causes

One of our captured logs says:

TASK [Install python packages using pip] ***************************************
fatal: [127.0.0.1]: FAILED! => {"changed": false, "cmd": ["/usr/bin/python3", "-m", "pip.__main__", "install", "-r", "/users/alice/workdir/acto/requirements.txt"], "msg": "stdout: Collecting deepdiff~=6.3.0\n\n:stderr:  WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f0ca2a1d130>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')': /packages/fe/b3/81bb598d24f1a48eaceb32243a91016385c0599196a59eaff6cd29299334/deepdiff-6.3.1-py3-none-any.whl\n WARNING: Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f0ca2a1d0d0>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')': /packages/fe/b3/81bb598d24f1a48eaceb32243a91016385c0599196a59eaff6cd29299334/deepdiff-6.3.1-py3-none-any.whl\n WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f0ca2a1d1f0>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')': /packages/fe/b3/81bb598d24f1a48eaceb32243a91016385c0599196a59eaff6cd29299334/deepdiff-6.3.1-py3-none-any.whl\n WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f0ca2a1d3a0>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')': /packages/fe/b3/81bb598d24f1a48eaceb32243a91016385c0599196a59eaff6cd29299334/deepdiff-6.3.1-py3-none-any.whl\n WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 
'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f0ca2a1d5e0>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')': /packages/fe/b3/81bb598d24f1a48eaceb32243a91016385c0599196a59eaff6cd29299334/deepdiff-6.3.1-py3-none-any.whl\nERROR: Could not install packages due to an EnvironmentError: HTTPSConnectionPool(host='files.pythonhosted.org', port=443): Max retries exceeded with url: /packages/fe/b3/81bb598d24f1a48eaceb32243a91016385c0599196a59eaff6cd29299334/deepdiff-6.3.1-py3-none-any.whl (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f0ca2a1d7c0>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution'))\n\n"}

which suggests it's possibly a DNS problem.

The default resolv.conf is

nameserver 130.127.132.51
search clemson.cloudlab.us
nameserver 155.98.60.2

whose entries are within the Clemson cluster.
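To narrow down whether one particular name server is at fault, something like the following could be run on an affected node. This is only a sketch: `list_nameservers` and `check_nameservers` are hypothetical helper names, and it assumes `awk` and `nslookup` are available on the machine.

```shell
# Hypothetical helper: print the nameserver addresses found in a
# resolv.conf-style file (defaults to /etc/resolv.conf).
list_nameservers() {
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# Hypothetical helper: ask each configured server to resolve a host.
# nslookup accepts the server to query as an optional second argument.
check_nameservers() {
    local host="$1" conf="${2:-/etc/resolv.conf}" ns
    for ns in $(list_nameservers "$conf"); do
        if nslookup "$host" "$ns" >/dev/null 2>&1; then
            echo "$ns: ok"
        else
            echo "$ns: FAILED"
        fi
    done
}

# Example (the host is the one pip failed to resolve in the log above):
#   check_nameservers files.pythonhosted.org
```

Running this right after a failed startup would show whether only one of the two servers is failing to answer, which the pip log alone cannot tell us.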

...

To-dos

Things yet to understand:

  • The exact reason. Is it truly DNS?
  • Does it have anything to do with machine type, say c8220?
  • Does it always occur during startup? What about in the middle of an experiment?
  • ...

What makes this hard is that the problem is occasional and unpredictable, and thus hard to reproduce.

Long-term solutions:

  • Retry. Implemented in #3
  • If we can confirm that DNS is the actual cause, we may also add some public name servers to resolv.conf.
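For illustration, the retry idea can be sketched as a small shell wrapper. This is not necessarily how #3 implements it; the `retry` function, its argument order, and the attempt and delay values in the example are all assumptions.

```shell
# Hypothetical helper: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
retry() {
    local max="$1" delay="$2" attempt=1
    shift 2
    until "$@"; do
        if [ "$attempt" -ge "$max" ]; then
            echo "retry: giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "retry: attempt $attempt failed, sleeping ${delay}s" >&2
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Example (5 attempts and a 30-second pause are arbitrary choices):
#   retry 5 30 bash /local/repository/scripts/cloudlab_startup_run_by_geniuser.sh
```

A fixed delay keeps the sketch short; if the failures really are transient DNS outages, a longer or growing delay may give the resolver more time to recover.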

Ansible is always checking out the `sosp-ae` branch of Acto

(Continued from here: xlab-uiuc/acto#247)

Currently we always check out the sosp-ae branch in the Ansible script:

version: sosp-ae

However, this repo (acto-cloudlab) should be usable for more than AE, so the pinned branch becomes a minor problem when we want to run code on main or other branches. Possible solutions:

  1. Since we have to log in to the CloudLab machines anyway, and the Acto repo is an actual clone, we can check out the desired branch manually.
  2. Alternatively, we could change this value on a per-branch basis, but that looks... awkward. I don't know if there are cleverer ways.
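As a rough sketch of option 1, the manual checkout could be wrapped so that the branch comes from an environment variable and falls back to the currently pinned one. The `ACTO_BRANCH` variable and both function names are hypothetical, and the repo path in the example is taken from the log in the other issue, so it may differ per user.

```shell
# Hypothetical helper: pick the Acto branch from ACTO_BRANCH, falling back
# to the branch the Ansible script currently pins.
acto_branch() {
    printf '%s\n' "${ACTO_BRANCH:-sosp-ae}"
}

# Hypothetical helper: fetch and check out that branch in an existing clone.
checkout_acto() {
    local repo="$1"
    git -C "$repo" fetch origin &&
        git -C "$repo" checkout "$(acto_branch)"
}

# Example:
#   ACTO_BRANCH=main checkout_acto /users/alice/workdir/acto
```

This leaves the Ansible script untouched, so AE users still get sosp-ae by default.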

Hardwired `$PATH` can be problematic in corner cases

line: export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin

For example, other software (besides Acto and its dependencies) may modify $PATH before this line, or even before .bashrc runs, and this line discards all of those changes.

I'm not sure what the intention was here, so I'm opening an issue first instead of changing it directly.
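One possible alternative, assuming the intention was only to make sure these directories are on $PATH, is to append the missing ones instead of overwriting the variable. The `path_append` helper below is hypothetical and not part of the repo.

```shell
# Hypothetical helper: append a directory to PATH only if it is not
# already present, so earlier modifications by other software survive.
path_append() {
    case ":$PATH:" in
        *":$1:"*) ;;                          # already present, keep PATH as is
        *) PATH="${PATH:+$PATH:}$1" ;;
    esac
}

# Same directories as the hardwired line, in the same order.
for dir in /usr/local/sbin /usr/local/bin /usr/sbin /usr/bin /sbin /bin \
           /usr/games /usr/local/games /snap/bin; do
    path_append "$dir"
done
export PATH
```

Note that this preserves whatever ordering already exists, so it would not help if the original line was meant to force a specific lookup order.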
