2i2c-org / infrastructure

Infrastructure for configuring and deploying our community JupyterHubs.

Home Page: https://infrastructure.2i2c.org

License: BSD 3-Clause "New" or "Revised" License

Languages: Python 52.73%, HCL 28.63%, Dockerfile 0.25%, Jupyter Notebook 1.20%, Smarty 0.06%, Jsonnet 17.12%

infrastructure's Introduction

Infrastructure for deployments

This repository contains deployment infrastructure and documentation for a federation of JupyterHubs that 2i2c manages for various communities.

See the infrastructure documentation for more information.

Building the documentation

The documentation is built with the Sphinx documentation engine.

Automatically with nox

The easiest way to build the documentation in this repository is to use nox, an automation tool for quickly building environments and running commands within them. This ensures that your environment has all the dependencies needed to build the documentation.

To do so, follow these steps:

  1. Install nox

    $ pip install nox
  2. Build the documentation:

    $ nox -s docs

This should create a local environment in a .nox folder, build the documentation (as specified in the noxfile.py configuration), and place the output in docs/_build/dirhtml.

To build live documentation that updates when you update local files, run the following command:

$ nox -s docs -- live

Manually with conda

If you wish to manually build the documentation, you can use conda to do so.

  1. Create a conda environment to build the documentation.

    conda env create -f docs/environment.yml -n infrastructure-docs
  2. Activate the new environment:

    conda activate infrastructure-docs
  3. Build the documentation from the docs/ folder:

    make html

This will generate the HTML for the documentation in the docs/_build/dirhtml folder. You may preview the documentation by opening any of the .html files inside.

Build the documentation with a live server

You can optionally build the documentation with a live server to automatically preview the changes as you build the docs. To use this, run make live instead of make html.

Check for broken links

You can check for broken links in our documentation with the Sphinx linkcheck builder. This will build the documentation and test every link to make sure that it resolves properly. We use a GitHub Action to check this in our CI/CD, so this generally shouldn't be needed unless you want to manually test something. To check our documentation for broken links, run the following command from the docs/ folder:

make linkcheck

This will build the documentation, reporting broken links as it goes. It will output a summary of all links in a file at docs/_build/linkcheck/output.txt.

infrastructure's People

Contributors

2i2c-token-generator-bot[bot], abkfenris, aidea775, batpad, benlee0423, betolink, choldgraf, colliand, consideratio, damianavila, dependabot[bot], emiliom, ericvd-ucb, freitagb, georgianaelena, ianabc, j08lue, jbusecke, jmunroe, jnywong, maxrjones, pnasrat, pre-commit-ci[bot], rabernat, ranchodeluxe, sean-morris, sgibson91, slesaad, yarikoptic, yuvipanda


infrastructure's Issues

Document our PDF generation options

A lot of workflows require downloading notebooks as PDF. There are
quite a few ways to do this, and we should provide clear guidance
on what they are.

Berkeley's DataHub issue has more details.

I want us to provide:

  1. Regular, LaTeX-based nbconvert PDF generation. The default.
  2. betatim's notebook-as-pdf,
    which uses Chrome for more HTML-native PDF generation.
  3. Filtered PDF output, via something like otter-grader export.
    Very useful when you are grading only some parts of a notebook.
    Depends on (1).

Provide an easy way to switch between RStudio, Notebook & Lab

Description

Currently, if you want to switch between JupyterLab / RStudio / Notebook,
you have to edit the URL. This is not very user-friendly.

Instead, it should be easy for folks to switch between these interfaces via the GUI.

Benefit / value

This would be useful for anyone who doesn't already know the right URL pattern to use, or who isn't comfortable hand-editing URLs in general. This is probably "almost everybody", because clicking buttons is much easier than remembering what URL to use.

This would also be a useful way for others to discover what other interfaces are available on a hub.

Implementation details

Here are a few places where this work could happen:

Classic Notebook

We need a notebook extension that does the following:

  1. Shows other UI options more visibly (currently they're under 'New ->').
    A bit more in-your-face in the tree view
  2. Adds an easy way to 'open current notebook in JupyterLab'

JupyterLab

The launcher already provides an easy way to get to RStudio, so we don't
actually need to do anything there for that. But...

  1. Add an entry for classic notebook in the launcher, so people can switch
    back if needed
  2. Add an easy way to 'open current notebook in classic notebook'

RStudio

We don't have a lot of expertise in modifying & extending RStudio. Similar
switching functionality would be great, and might be provided by writing
RStudio extensions.

Home page

We should allow users to choose the interface they want to use from the
home page - similar to what we do in the UToronto Hub.

How are environments specified?

How shall people specify the environments for each hub? I guess we'll ask people to tell us which packages they need installed, and we'll build a Docker image for them using repo2docker (and manually specify the image in the hubs.yaml file?)

In addition, what happens when somebody installs libraries within their session? Do they persist over time?

Basically, I think we should add a section to the usage docs about how to customize the environment, and figure out the workflow for both the user and the administrator.
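As a rough sketch of that workflow, a community might hand us a conda environment.yml like the one below (contents purely illustrative), and we would build the user image from it with repo2docker:

```yaml
# Hypothetical environment.yml a community could send us; repo2docker
# would build the user image from a file like this.
name: example-hub-env
channels:
  - conda-forge
dependencies:
  - python=3.9
  - numpy
  - pandas
  - pip:
      - nbgitpuller
```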

Provide way to repeatably deploy new clusters

All clusters we support must have a few things in common:

  1. Kubernetes clusters with multiple node pools, autoscaling, etc.
  2. An NFS server (in-cluster, ideally) providing home directories
  3. Prometheus & Grafana
  4. An ingress controller + cert-manager, for HTTPS & wildcard DNS support

This should be as uniform as possible across the various places we deploy.

We should build terraform scripts and helm charts to automate this properly.

FS Mounting error when logging in on a new hub

I just created demo.cloudbank.2i2c.cloud and tried to log in, and ran into this error when starting up the server. It seems related to mounting the filesystem:

2020-10-22T18:49:55Z [Warning] MountVolume.SetUp failed for volume "demo-home-nfs" : mount failed: exit status 1 Mounting command: systemd-run Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/716b4d9d-dddd-4cf4-a791-ab83ca26a832/volumes/kubernetes.io~nfs/demo-home-nfs --scope -- /home/kubernetes/containerized_mounter/mounter mount -t nfs -o noatime,soft,vers=4.2 nfs-server-01:/export/home-01/homes/demo /var/lib/kubelet/pods/716b4d9d-dddd-4cf4-a791-ab83ca26a832/volumes/kubernetes.io~nfs/demo-home-nfs Output: Running scope as unit: run-r064b93d0547b4c2284a02ac58f794e2f.scope Mount failed: mount failed: exit status 32 Mounting command: chroot Mounting arguments: [/home/kubernetes/containerized_mounter/rootfs mount -t nfs -o noatime,soft,vers=4.2 nfs-server-01:/export/home-01/homes/demo /var/lib/kubelet/pods/716b4d9d-dddd-4cf4-a791-ab83ca26a832/volumes/kubernetes.io~nfs/demo-home-nfs] Output: mount.nfs: mounting nfs-server-01:/export/home-01/homes/demo failed, reason given by server: No such file or directory

To fix this

Until we get an auto-fix, here are the steps to fix this for a new hub:

  1. Log in to the NFS Server (called nfs-server-01).

    If you have gcloud installed, you can do this with gcloud compute ssh nfs-server-01 after logging in to gcloud

  2. Create a directory named /export/home-01/homes/<hub-name>:

    sudo mkdir /export/home-01/homes/<hub-name>

  3. Make it owned by the 'ubuntu' user:

    sudo chown ubuntu:ubuntu /export/home-01/homes/<hub-name>

Allow per-hub user image customization

For the pilot hubs to be 'low touch', we need to not be responsible for specifying the user image. There should be a baseline image we maintain, with things we want to provide as 'features' to all the hubs - nbgitpuller, jupyter-tree-download, etc. We want users to be able to customize a few things on top:

  1. Install new apt packages (apt.txt)
  2. Install conda / pip packages (environment.yml or requirements.txt)
  3. Install R packages (install.R)
  4. Run arbitrary commands post installation (postBuild)

We could use repo2docker for this, but it has a few disadvantages:

  1. The images it builds are quite big
  2. We can't really use newer R versions yet
  3. We want to make sure that some packages (nbgitpuller, etc) are always installed.

The approach I prefer instead is what I helped do for the Pangeo images - you have a base image that we maintain, with specific customizations applied on top via ONBUILD. See https://github.com/pangeo-data/pangeo-docker-images/blob/master/base-image/Dockerfile#L68 for how that would work.

To recap, the system we want would:

  1. Give us control of the base image
  2. Allow hub admins to do specific additional customizations on top

We might also just allow them to use repo2docker, but not to begin with.

Move CloudBank institutions to the new cloudbank cluster

Once #35 happens, we'll probably want to deploy hubs that are funded by CloudBank to <hubname>.cloudbank.2i2c.cloud (instead of .pilot.2i2c.cloud). This includes

  • spelman
  • ccsf
  • elcamino

I think right now none of these have actual users, so we can probably just delete them and move them over straightaway. Or, we could use it as an opportunity to document how a hub can be moved between projects.

Describe our resource limits properly

Each hub restricts individual users from using more than a specified amount of
compute resources. The primary resources we care about, in order of 'caring', are:

  1. Memory. This is inflexible - if you go over the amount of RAM available,
    your kernel dies. This is the most important resource to understand for our
    use cases - almost all educational hubs are memory bound, rather than CPU or
    storage bound.

  2. CPU. A more flexible resource, since CPU availability is decided dynamically
    by the Linux kernel. It can give a user 1 full CPU for a minute, but only 0.01
    for the next 5 minutes, without any issues (other than a slowdown). This isn't
    possible with memory. We want to make sure that users have as much CPU as they
    need, but it's not usually an issue - especially because cloud providers won't
    let you get a lot of memory without enough CPU.

  3. Storage. Home directory storage persists across user sessions. Only code
    should be kept in home directories, ideally from git repositories. Many
    repos can be several hundred megabytes, and users can also accidentally write
    code that fills up the entire storage. We should ideally restrict users to
    something like a max of 10G. This doesn't require us to provision 10G for
    each user - we can easily overprovision this, since a large number of users
    would need to exceed 10G at the same time to cause issues.

It's important for instructors to know these limits. Memory limit is a prime
driver in designing courses. These limits also define many other hosted
notebook providers - Colab's claim to fame is a free GPU, for example.
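As a hedged sketch, these priorities roughly map onto Zero to JupyterHub's singleuser values like so (the numbers are illustrative, not our agreed defaults):

```yaml
singleuser:
  memory:
    guarantee: 512M   # reserved on the node for each user
    limit: 2G         # the kernel is killed above this
  cpu:
    guarantee: 0.05   # CPU can be heavily overcommitted
    limit: 2
  storage:
    capacity: 10Gi    # requested size of each home directory volume
```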

Set limits & resource requests on all core pods

Description

All 'core' pods should have CPU & memory requests and limits set.

Benefit

Setting explicit resources makes sure that critical processes / pods don't run out of resources and die, as has happened in cases like #526. This helps prevent major disruptions to the service.

Tasks to complete

  • Make sure all components from support/ chart have requests and limits set
  • Make sure all hub components have requests and limits set
  • Make sure all dask-gateway components have requests and limits set
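For reference, a minimal sketch of what an explicit requests/limits block looks like for any one of these components in helm values (numbers are placeholders, not tuned values):

```yaml
resources:
  requests:
    cpu: 10m        # used by the scheduler to reserve capacity
    memory: 128Mi
  limits:
    cpu: "1"        # throttled above this
    memory: 512Mi   # OOM-killed above this
```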

Rename this repository?

I feel like low-touch-hubs is a bit underwhelming as a name. It might come across as "we don't want to put effort into these hubs", which isn't quite true; what we want is more like "we want to put a lot of careful effort into it once, now, so we don't have to put effort into it over time later".

Since we will likely direct people to this repository for the sake of "transparent infrastructure", I wonder if we should rename this to something more like "auto-hubs" or "pilot-hubs"?

or @yuvipanda feel free to tell me I am over-thinking this haha

Add grafana dashboards for resource usage

In a recent incident, one of the nodes ran out of memory and CPU. This wasn't immediately obvious from the Grafana dashboards. We should think about the resources we need to track across the nodes, and have a dashboard for the particularly important ones.

Support deploying to multiple clusters

Pilot hubs might be paid for in different ways:

  1. 2i2c puts up the money
  2. CloudBank gives a grant for that institution
  3. ???

This repo should be able to run across cloud providers & projects. Currently,
it only works for one project on one cloud provider.

To begin with, we should add a clusters directory, inside which we can have
a YAML file for each cluster. This would specify:

  1. Name of cluster
  2. Cloud provider & how to authenticate to it
  3. Overrides we might have for helm on that cluster (hostnames, etc)

This should let CI (and eventually, the web application) deploy to multiple
clusters as the need arises.
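A hypothetical sketch of one such per-cluster file (the field names are not a final schema, just an illustration of the three items above):

```yaml
# clusters/cloudbank.yaml (hypothetical)
name: cloudbank
provider: gcp
auth:
  project: example-gcp-project                 # placeholder
  credentials_file: secrets/cloudbank-key.json # placeholder path
helm_overrides:
  domain: cloudbank.2i2c.cloud
```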

Add 2i2c staff as admins to all hubs

Summary

Currently, we must manually add 2i2c engineers as "admin users" for all of the hubs that we deploy. For example, here we manually list these users:

https://github.com/2i2c-org/pilot-hubs/blob/master/config/hubs/2i2c.cluster.yaml#L93

Instead of this, we should have a single list of 2i2c Engineers and automatically add this list as admins on all of our hubs.

Value

This would de-duplicate information across our hub configuration, and reduce the amount of toil that is needed to add new admins to hubs.

Tasks to complete

  • Patch the deploy script the "hacky way" to do this, even though we know this would be best done in upstream improvements
  • Open an issue in JupyterHub to ask about best practices around this topic (potentially just pointing to @GeorgianaElena's issue here if that covers all we need to do: jupyterhub/jupyterhub#3525)
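A hedged sketch of the "hacky way": keep one shared list in config, and have the deploy script append it to each hub's admin users before rendering the helm values (key names are hypothetical):

```yaml
# Hypothetical shared config, merged into every hub by the deploy script
staff:
  admin_users:
    - staff-engineer-1
    - staff-engineer-2
```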

Discuss how to handle overcommit ratios for Memory

Our default overcommit ratio for RAM is:

  • Guaranteed: 512MB
  • Limit: 2G

This has caused some issues on hubs where a lot of users were using a lot of memory all at once. We should discuss our rationale behind the current ratios, as well as a decision-tree for when and why to change them.

How to delete hubs?

I noticed that when I deleted my test hub from the hubs.yml, it didn't show up in the GHA logs at all, which I guess makes sense, but it makes me wonder if/how we handle deleting hubs so that the helm namespaces don't become cluttered with a bunch of cruft over time.

Provide better error on failed login

We require admins to add each user they want to permit into the admin
panel. When users who haven't been expressly added try to log in, they get
a standard 403 Forbidden error.

This page should be a little more informational: it should explain why they
couldn't log in, and ask them to contact the admins to get added. To begin
with, I don't think we need to expose who the admins are, but we can add
that optionally later if needed.

Document whether / how user data is backed up

Some folks have asked questions like "should I tell my students to create backups of their work in case their data gets lost?" How should I respond in those cases? Does our NFS server have backups set up? If not, should we recommend people download .zips of their files occasionally? Have we ever had an issue where we lost user storage data?

Host a folder of HTML files on JupyterHubs via a simple HTTP proxy

I have been thinking more about our conversations around creating custom pages for things like generating nbgitpuller links / environments / etc. It gave me an idea and I'm curious what @yuvipanda thinks about it:

What if we made it really easy for hub administrators to host a static website at some path on their hub (maybe the default is myhub.org/docs)? Then, that could be a single extension point that hub administrators use for building out their hub-specific documentation. Rather than us finding ways to keep extending the JupyterHub-specific interfaces, we could just add a docs option to each JupyterHub toolbar.

It would also let us do things like turn the nbgitpuller link generator into a Sphinx extension (if the docs were hosted with Sphinx) so that administrators could include a link generator on their hub simply by putting a page like this in their docs:

at myhub.pilot.2i2c.cloud/docs/link-generator:

# My page title
Link generator will be embedded below.

```{nbgitpuller-generator}
:hub: myhub.org
```

For current plan, see #58 (comment)

Create a shared folder for each hub

In Slack, @yuvipanda mentioned wanting to create a shared folder for each hub. This would be read/write for hub administrators, and read-only for hub non-admins. It would be used as a kind of "distribution folder" so admins can put stuff there that all users have access to.

We may also want to explore ways to let non-admins deposit things in a shared folder (e.g. for collaboration or sharing work). @yuvipanda mentioned that we might be able to use Jitsi for this.
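One hedged sketch of the read-only half, assuming our NFS home volume layout: mount a shared subdirectory of the home volume into every user's server as read-only via Zero to JupyterHub's extraVolumeMounts (admins would get a second, writable mount, e.g. through a pre-spawn hook). The volume and subPath names below are assumptions:

```yaml
singleuser:
  storage:
    extraVolumeMounts:
      - name: home            # assumes the NFS home volume is named 'home'
        mountPath: /home/jovyan/shared
        subPath: _shared      # hypothetical shared directory on the NFS share
        readOnly: true
```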

Create logs of user activity?

I'm interested in cross-referencing our costs with the usage patterns over time for our hubs, so we can get an idea of what this is costing us per hub / per user. @yuvipanda do we have a way of dumping that information in a reliable way so that we can run some simple analytics on it, assuming we can also get the Google Cloud billing info?

Deploy grafana for the pilot hubs

Grafana is useful for getting a quick view of what's happening on the cluster, and for keeping track of its usage. We should deploy one for these hubs!

Establish a home directory retention policy

We don't want to keep users' data around forever. We should establish a policy for
when we'll clear out users' home directories, and how to do so.

This is Berkeley's.
We probably want to adopt something less expansive, and more reliant on users
downloading their own data.

Run NFS servers in-cluster

Description

We currently run a separate, hand-rolled VM for NFS.
Instead we should run an in-cluster NFS server - one per cluster
most likely (for overprovisioning reasons).

I'm slightly concerned here, since the NFS server node going down
means all the hubs are out. But that's also true for the proxy,
nginx-ingress & other pods, so probably something we should be ok
with.

Benefit

Our current setup (separate VMs for NFS) is a single point of failure, not repeatably built, and a bit icky.
It also runs a VM full-time, without much resource utilization.

This change would make it easier to set up a cluster and go, and would make our whole setup a lot more
repeatable.

This will also let us add features we wanted for a while:

  1. Per-user storage quotas, probably with XFS quotas
  2. Automated snapshots, with VolumeSnapshots

Implementation details

We should watch out for accidental deletion - maybe make sure
the PV isn't deleted when the PVC is?
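A minimal sketch of one such guard: a PersistentVolume with a Retain reclaim policy is kept (rather than wiped) even if its PVC is deleted. The server address and export path below are placeholders:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: home-nfs
spec:
  capacity:
    storage: 1Gi                   # required field; NFS doesn't enforce it
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: nfs-ganesha.support.svc.cluster.local   # placeholder service name
    path: /export/home
```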

I'd like to use nfs-ganesha
for this, so I don't have to run a privileged container for nfs-kernel-server.
It seems to get wide enough use.

Tasks to complete

Add some way for people to securely write to GitHub

Many users may wish to push back up to GitHub. We should figure out a way to do this securely and safely on the hubs. Some things to consider:

  • User identity. Do we want users to have an identity when they do this, or use generic credentials that work for anyone?
  • Permissions-per-hub. How do we split up the permissions for each hub?
  • UI/UX - do users just use the command line as they normally would, or do we add some kind of extension/UI that allows them to push back up to GitHub?

How to integrate our documentation into the hubs home page template?

The U Toronto home page has a lot of useful FAQ-like information that @yuvipanda put together for that hub specifically. I feel like it would be easier to update and scale that information if we had a centralized source. Our pilot hubs docs have been serving as the documentation for all hubs. I wonder how we can incorporate that information into the home page so that we don't have to keep it hard-coded as HTML inside of the hub template file.

One thought is we could just iframe a new page like 2i2c.org/pilot/get_started that has all of that information on it. Curious what @yuvipanda thinks.

Create separate core and user pod pools

In a recent incident, user pods required enough resources that core pods were prevented from doing their job. We should separate these pods out into two separate node pools.
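A hedged sketch of how that could look using Zero to JupyterHub's built-in scheduling options, assuming the pools are labelled with hub.jupyter.org/node-purpose:

```yaml
scheduling:
  corePods:
    nodeAffinity:
      matchNodePurpose: require   # core pods only on node-purpose=core nodes
  userPods:
    nodeAffinity:
      matchNodePurpose: require   # user pods only on node-purpose=user nodes
```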

Add a template for the login page

I think that we could use a slightly customized login page to highlight that 2i2c is the organization providing the hub. Nothing as fancy as the U. Toronto Hub, but I wonder if we could use two fields in the hubs.yaml file that would be interpolated into a login template we use to build hubs:

- long_name: A longer name used for displaying in other pages
- logo: A path/URL to a logo

Then we could have a login page template that included bits like:

{logo}
This is a 2i2c Pilot Hub provided for {long_name}.
Click the button below to log in.
For more information, see 2i2c.org/pilot
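For concreteness, a hypothetical sketch of how the two fields might sit in hubs.yaml (structure is illustrative, not the current schema):

```yaml
hubs:
  - name: spelman
    domain: spelman.cloudbank.2i2c.cloud
    long_name: Spelman College
    logo: https://example.org/spelman-logo.png   # placeholder URL
```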

Create documentation for how to gain kubectl access to the hubs

A user had questions about their hub, and I wanted to try and investigate, but realized I have no idea how to access the Kubernetes deployments, etc. So this means that right now @yuvipanda is the only one who can do any introspection of the hubs, which seems sub-optimal :-)

Perhaps @yuvipanda can document the steps needed for someone new to gain access to the underlying hub infrastructure? This will also be needed so that @jamesgspercy can start looping in on support and operation.

Define a private channel for support

Now that a few folks have started using these hubs, we've had support requests come in that suggest we need at least two different communications channels:

  1. Support requests that are generic and can be public
  2. Requests that identify individuals or accounts, or cannot be public for other reasons

For 1, we can suggest GitHub issues, but for 2 we should have an option for people to discuss more privately. What should we do for this?

My first thought was to create a [email protected] address that I, @yuvipanda, and @jamesgspercy have access to, which could be a single account for private support questions. If a question can be asked in public, then we'd ask people to post it publicly instead.

Longer term I am not sure of the best approach.

What do folks think?

Define the pattern of user interaction

This looks great - I like the pattern of a single YAML file that configures / deploys hubs.

I am curious about the anticipated user trajectory here. Will we expect them to make PRs to the YAML file? Or will we expect to do this automatically via a little GUI or something like this?

Document how to get helm access to the 2i2c Pilot Hubs

There are a few spots in the docs that say "Use Helm to do XXX" (I assume kubectl applies as well). However, we haven't documented how 2i2c folks should connect their credentials, etc., to a 2i2c Hub. We should document this process.

Setup authentication for Grafana

We have two Grafanas - grafana.cloudbank.2i2c.cloud and grafana.pilot.2i2c.cloud.

Ideally, we'd only have one Grafana - but then we'd have to do some network-foo to expose the other Prometheus. We don't actually want to do that.

Instead, we should just set up GitHub authentication for both Grafanas.
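A hedged sketch of what that might look like via the Grafana helm chart's grafana.ini values (client id/secret are placeholders to be stored as secrets, and the exact nesting depends on how our support chart wraps Grafana):

```yaml
grafana:
  grafana.ini:
    auth.github:
      enabled: true
      allow_sign_up: true
      client_id: <github-oauth-app-client-id>       # placeholder
      client_secret: <github-oauth-app-client-secret> # placeholder
      scopes: user:email,read:org
      allowed_organizations: 2i2c-org
```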

Prepare Pilot Hubs for Spring 2021

We expect a lot more usage of our hubs - both CloudBank-backed & pilot - over Spring 2021. This issue lists various things we need to do to prepare for that.

Things to accomplish

  • Support pathway for reporting & fixing technical issues quickly
  • Monitoring & alerting pathway for us to know when things are bad without users needing to report them
  • Reliable way to add more hubs quickly, in case more hubs are needed
  • Proper backup for user home directories - we don't want to lose them!
  • Smoothness & UX touchups, so we don't just have an 'acceptable' experience but a good one
  • Define processes & contact points with DSEP

This is a tracking issue.

Demo Hub

I have set up a hub on CloudBank that has really only been used to demonstrate the technology stack.

Could this hub be useful as a demo (showroom) hub that potential clients could be directed to in order to test out a live hub environment?

Does such a service already exist?

Host the documentation

It took me a while to realize that this repo had docs, because I couldn't see a link for them anywhere. Where should we put documentation like this? A few ideas:

  • docs.2i2c.cloud/low-touch-hubs
  • readthedocs.io
  • docs/2i2c.org/low-touch-hubs

Any thoughts?

Remove NoVNC from the launchers

I noticed that there's still a "launch desktop" launcher:

[screenshot of the "launch desktop" launcher entry]

but it doesn't seem to work (and I believe @yuvipanda said this is expected not to work). I looked in the environment files and couldn't figure out where it was being installed, so I'm opening an issue suggesting that we probably remove it.
