governance's Issues

RFD - Managing Nebari dependencies

Status: Open for comments 💬
Author(s): @iameskild
Date Created: 2022-11-28
Date Last updated: 2023-03-15
Decision deadline: ---

Managing Nebari dependencies

Summary

Pin all the things

Let me start by stating that Nebari is not your typical Python package. For Python packages that are intended to be installed alongside other packages, pinning all of your dependencies will likely cause version conflicts and result in failed environment builds.

Nebari, on the other hand, is a package that is responsible for managing your infrastructure, and the last thing you want is for the packages that Nebari relies on to introduce breaking changes. This has happened twice in a single week (the week of 2023-01-16; see issue 1622 and issue 1623).

As part of this RFD, I propose pinning all packages that Nebari requires or uses. This includes the following:

  • Python package dependencies set in pyproject.toml and making sure the package can be built on conda-forge
  • Maximum acceptable Kubernetes version
  • Terraform provider versions
    • Already pinned (see the next header for a proposal to centralize these as much as possible)
  • Docker image tags (used by Nebari services not the images in nebari-docker-images repo)
  • Helm chart release versions

Set pinned dependencies used by Terraform in constants.py

In Nebari, the Python code is used primarily to pass the input variables to the Terraform scripts. As such, I propose that any of the pinned versions - be they pinned Terraform providers, image/tags combinations, etc. - used by Terraform be set somewhere in the Python code and then passed to Terraform.

As an example, I recently did this with the version of Traefik we use:

https://github.com/nebari-dev/nebari/blob/bd777e6448b5e2d6339bc3d9ef35672163ae1945/nebari/constants.py#L4

Which is then used as input for this Terraform variable:

https://github.com/nebari-dev/nebari/blob/bd777e6448b5e2d6339bc3d9ef35672163ae1945/nebari/template/stages/04-kubernetes-ingress/variables.tf#L19-L25

https://github.com/nebari-dev/nebari/blob/bd777e6448b5e2d6339bc3d9ef35672163ae1945/nebari/template/stages/04-kubernetes-ingress/modules/kubernetes/ingress/main.tf#L215
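
A minimal sketch of that pattern, where the constant and the input-variable helper below are illustrative placeholders rather than the actual Nebari code:

# constants.py -- single place where pinned versions live
DEFAULT_TRAEFIK_IMAGE_TAG = "v2.9.1"  # illustrative value, not the real pin


# input_vars.py (hypothetical) -- hand the pin to the 04-kubernetes-ingress stage
def kubernetes_ingress_input_vars(config):
    """Build the input variables passed to the Terraform ingress stage."""
    return {
        "traefik-image": {
            "image": "traefik",
            "tag": DEFAULT_TRAEFIK_IMAGE_TAG,
        },
    }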

Regularly review and upgrade dependencies

Once packages start getting pinned, it's important to regularly review and upgrade these dependencies in order to keep up to date with upstream changes. We have already discussed the importance of testing these dependencies, and I believe we should continue with that work (see issue 1339).

As part of this RFD, I propose we review, test and upgrade our dependencies once per quarter as part of our release process.

Although we may not need to update each dependency for every release, we might want to consider updating dependencies in a staggered fashion.

  • For release X: update all Python dependencies in the pyproject.toml and ensure that the package is buildable on conda-forge.
  • For release X+1: update the maximum Kubernetes version and any Helm charts
  • For release X+2: update Terraform provider versions
  • ... and repeat

We don't necessarily need to make the update process this rigid but the idea is to update a few things at a time and ensure that nothing breaks. And if things do break, fix them promptly to avoid running into situations where we are forced to make last-minute fixes.

User benefit

In my opinion, there are a few benefits to this approach:

  • Increased platform stability; running Nebari version X will work the day it is released and two years from now.
  • Instead of having pinned versions scattered through the Terraform scripts, we can centralize their location. This makes it easier to quickly check what version of what is being used.
  • This can be the start of dependency tracking. With a centralized location for all pinned dependencies, we can more easily write a script that updates and tests these dependencies.

Design Proposal

The design proposal is fairly straightforward, simply move the pinned version of the Terraform provider or image-tag used to the constants.py. This would likely require an additional input variable (as demonstrated by the Traefik example above).

User impact

We can be sure that when we perform our release testing and cut a release, that version will be stable from then on out. This is currently NOT the case.

What to do with the other Nebari repos?

This RFD is mostly concerned with the main nebari package and doesn't really cover how we should handle:

  • nebari-docker-images
  • nebari-jupyterhub-theme

I think these are less of a concern for us. The nebari-jupyterhub-theme is included in the Nebari JupyterHub Docker image, and once the images are built they don't change, so there is little chance that users will be negatively affected by dependency updates. The only exception would be if users pull the image tag main, which is updated with every new merge into nebari-docker-images - this does not follow best practices and we will continue to advise against it.

Unresolved questions

I still need to test if this is possible for the pinned version of a particular Terraform provider used, such as:
https://github.com/nebari-dev/nebari/blob/bd777e6448b5e2d6339bc3d9ef35672163ae1945/nebari/template/stages/04-kubernetes-ingress/versions.tf#L1-L13

  • Tried this recently and from what I can tell this is not possible (at least not without relying on some kind of templating). Therefore, Terraform provider versions will need to be set directly in their respective required_providers block (usually in the versions.tf file).
    • This might be possible with a tool like tfupdate.

RFD - Upgrade Integration tests - WIP

Status: Draft 🚧 / Open for comments 💬 / Accepted ✅ / Implemented 🚀 / Obsolete 🗃 / Rejected ⛔️
Author(s): @viniciusdc
Date Created: 02-02-2023
Date Last updated: --
Decision deadline: --

Summary

Currently, our integration tests are responsible for deploying a target version of Nebari (generally based on main/develop) to test stability and confirm that the code is deployable in all cloud providers. These tests can be divided into three categories: "Deploy", "User-Interaction," and "Teardown".

The user interaction is executed by using Cypress to mimic the steps a user would take to use the basic functionalities of Nebari.

(Workflow diagram: Deploy → User-Interaction → Teardown)

The general gist of the workflow can be seen in the diagram above. Some providers, like GCP, have yet another intermediate job right after the deployment, where a small change is made to the nebari-config.yaml to assert that the inner actions (those that come with Nebari) work as expected.

While the above does help when testing and asserting everything "looks" OK, we still need to double-check every release by doing yet another independent deployment to carefully test all features/services and ensure everything works as expected. This is extra work that takes time to complete (remember that a new deployment on each cloud provider takes around 15~20 min, plus any additional checks).

That said, there are still a lot of functionalities that are part of the daily use of Nebari which we need to remember to test, and making sure all of that works in all providers by hand would become impractical.

Design proposal

Below is what we could do to enhance our current testing suite. These changes are divided into three major updates:

Stabilizing/backend test

Refactor the "deploy" phase of the workflow so instead of executing the full deployment in serial (aka. just run nebari deploy), we could instead deploy each stage of nebari in parts, and this would give us the freedom to do more testing around each new artifact/resource added in each stage. This can now be easily done due to the recent additions of a Nebari dev command in the CLI. A way to achieve this would be adding an extra dev flag to the neabari deploy command to stop at certain checkpoints (which in this case, are the beginning of a new stage)

  • CI runs nebari deploy -c .... --stop-at 1. This would be responsible for deploying Nebari up to the first stage (generating the corresponding Terraform state files for state tracking). The CI would then execute a specialized test suite (could be pytest, Python scripts...) to assert that:
    • The cloud resources created are indeed present in the cloud infrastructure (can be done using the cloud provider CLI tools)
    • The Kubernetes-related resources exist as expected (extra kubectl command checks)
    • All available endpoints exist, running appropriate functions against each API (in the case of extensions/services like Argo, etc.)
  • After the above tests are complete, execute nebari deploy -c .... --stop-at 2, which would refresh the previous resources and create the new ones, then stop and run tests accordingly....
    • ...
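
A rough sketch of what one of these per-stage checks could look like, assuming the proposed --stop-at flag and a plain pytest suite; the namespace name and the stage at which it exists are placeholders:

import json
import subprocess


def kubectl_get(kind, name):
    """Return the parsed JSON for a Kubernetes object, failing loudly if it is missing."""
    out = subprocess.run(
        ["kubectl", "get", kind, name, "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)


def test_namespace_exists_after_stage():
    # Run after `nebari deploy -c nebari-config.yaml --stop-at <n>` has finished.
    namespace = kubectl_get("namespace", "dev")  # "dev" is a placeholder
    assert namespace["status"]["phase"] == "Active"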

End-to-End testing (User experience)

Now that the infrastructure exists and is working as planned, we can mimic user interaction by running a bigger testing suite with Cypress (we could also migrate to another tool for easier maintenance). Those tests would then be responsible for checking that the Jupyter-related services work, as well as Dask and any extra services like Argo, kbatch, VS Code, Dashboards, conda-store...

Teardown

Once all of this completes, we can then move on to destroying all the components. Right now there are no extra changes to this step, but some additions that would be beneficial are:

  • Develop cloud-specific scripts for removing lingering resources in case nebari destroy fails
  • Save information about the error (why it failed) as artifacts, such as the status of the cluster, roles, etc., that could help us identify why some resources linger after destruction and how we could reduce that (or at least catalog those cases in the docs)

User benefit

The user, in this case, would be the maintainers and developers of Nebari, who would be able to trust the integration tests more and retrieve more information from each run. This reduces a lot of the time spent testing all features and increases confidence that all services and resources were tested and validated before release.

Alternatives or approaches considered (if any)

Best practices

User impact

Unresolved questions

RFD - Allow users to customize and use their own images

Status: Open for comments 💬
Author(s): @iameskild
Date Created: 2022-11-22
Date Last updated: --
Decision deadline: --

Allow users to customize and use their own images

Summary

At present, this repo builds and pushes standard images for JupyterHub, JupyterLab and Dask-Workers. These images are the default used by all Nebari deployments.

However, many users have expressed an interest in adding custom packages (conda, apt or otherwise) to their default JupyterLab image, and doing so at the moment is not really feasible (at least not without a decent amount of extra legwork). To accommodate users, we have often simply resorted to adding their preferred packages to these default images. This solution is not scalable.

User benefit

By giving Nebari users the ability to customize these images, we greatly open up what is possible for them. This will give users further control over what packages get installed and how they use and interact with their Nebari cluster.

I have already heard from a decent number of users that this would be a much-appreciated feature.

Design Proposal

Ultimately, we want to allow users to add whatever packages (and possibly other configuration changes) they want to their JupyterHub, JupyterLab, and Dask-Worker images. We also want to make this process as simple and straightforward as possible.

Users should NOT need to know:

  • how to write a Dockerfile
  • how to use docker or build images
  • how to push or pull from a registry

In the nebari code base we already have a way of generating gitops and nebari-linter workflows for GitHub-Actions and GitLab-CI (for clusters that leverage ci_cd redeployments). We currently do this by building up these workflows from basic pydantic classes that were modeled off of the JSON schema for GitHub-Actions workflows and GitLab-CI pipelines respectively.

Why not do the same thing for building and pushing docker images?

With some additional work, we can render a build-push workflow (or pipeline) that can do just that. This proposed build-push workflow would look something like:

  1. Using the existing default Nebari docker image as a base image, add user specific packages.
    • users might add/remove packages in an environment.yaml, apt.txt, etc. that resides in an images folder in their repo
  2. Use the docker/build-push-action (or similar for GitLab-CI) to build and push images to GHCR (or similar for GitLab-CI)
  • This new workflow would live in the same repo as the deployment, so there is no need to manage multiple repos.

As I currently see it, this would require:

  • an added section to the nebari-config.yaml (perhaps under the ci_cd section) that can be used as a trigger to render this new workflow file
  • a way to render this new build-push workflow file
    • as mentioned above, this can be completed in a similar manner to how we render gitops or nebari-linter (see the sketch after this list)
  • a Dockerfile template for each image (JHub, JLab, Dask)
    • that pulls a base image from quay.io/nebari
  • a template folder (images) that contains an environment.yaml, apt.txt, etc.
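
To make the rendering idea concrete, here is a sketch of what pydantic-based models for the build-push job could look like; the class and field names are hypothetical, not the existing nebari classes:

from typing import Dict, List, Optional

import yaml
from pydantic import BaseModel, Field


class Step(BaseModel):
    name: str
    uses: Optional[str] = None
    with_: Optional[Dict[str, str]] = Field(default=None, alias="with")


class Job(BaseModel):
    runs_on: str = Field(alias="runs-on")
    steps: List[Step]


def render_build_push_workflow(image_tag: str) -> str:
    """Render a minimal GitHub Actions workflow that builds and pushes a user image."""
    job = Job(
        **{"runs-on": "ubuntu-latest"},
        steps=[
            Step(name="Checkout", uses="actions/checkout@v3"),
            Step(
                name="Build and push",
                uses="docker/build-push-action@v4",
                **{"with": {"context": "./images", "push": "true", "tags": image_tag}},
            ),
        ],
    )
    workflow = {
        "name": "build-push",
        "on": "push",
        "jobs": {"build-push": job.dict(by_alias=True, exclude_none=True)},
    }
    return yaml.safe_dump(workflow, sort_keys=False)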

Alternatives or approaches considered (if any)

Best practices

User impact

No user impact unless they decide to use this feature.

Unresolved questions / other considerations

There are a few other enhancements that we could make:

  • some users may want their images pushed to private registries
  • allow users to add additional Dockerfile stanzas for even more customization

[DOC] - Decision-making process

Preliminary Checks

Summary

Create guidelines on how we make decisions as a team, including:

  • How/when to open RFD issues, and what should the deadline be?
  • What are the expectations for discussions?
  • How to build consensus?
  • When is an RFD accepted?

Steps to Resolve this Issue

TBD

RFD - Bitnami retention policy considerations

Status: Draft 🚧
Author(s): @viniciusdc
Date Created: 08-12-2022
Date Last updated: 08-12-2022
Decision deadline: NA

Title

Considerations around Bitnami retention policies

Summary

Just a note regarding using Bitnami as the repo source for Helm charts: as happened in the past with MinIO, they have a 6-month retention policy for their repo index, which means that old versions are dropped from the main index after that period. That is, our deployments are bound to break in the future if a pinned chart version is no longer found by Helm.

User benefit

  • Right now, we are bound to have broken deployments of old Nebari versions in the future; as an example, v0.4.0 and v0.4.1 are still (at this date) broken because their source code points to a MinIO chart version that no longer exists in the main index.yaml (fixed in v0.4.2).

Design Proposal

Alternatives or approaches considered (if any)

There are some ways to mitigate this problem, each with its own pros and cons:

  • Every six months, we update our chart versions or validate them somehow, as originally proposed here (see the sketch after this list)
  • We pin the repository source for each service to the last available hash of the index, the same as we did for MinIO
  • Increase the release cadence to have monthly releases...
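
For the validation option, a small script along these lines could run on a schedule and flag pinned chart versions that have dropped out of the published index; the pins dict below is a placeholder, not the actual Nebari pin list:

import requests
import yaml

# Placeholder pins; the real values live in the Helm/Terraform definitions.
PINNED_CHARTS = {"minio": "6.7.4"}


def check_bitnami_pins(index_url="https://charts.bitnami.com/bitnami/index.yaml"):
    """Print a warning for every pinned chart version missing from the Bitnami index."""
    index = yaml.safe_load(requests.get(index_url, timeout=30).text)
    for chart, pinned in PINNED_CHARTS.items():
        available = {entry["version"] for entry in index["entries"].get(chart, [])}
        if pinned not in available:
            print(f"{chart}=={pinned} is no longer available in the Bitnami index")


if __name__ == "__main__":
    check_bitnami_pins()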

Best practices

User impact

Unresolved questions

[WIP] Update "Team" on GitHub

Update the team structure and verify permissions for each team:

Note: The following plan is tentative to get us started, and will be updated after further discussion.

  • Rename "Maintainers" to "Core"
  • Rename "Emeritus maintainers" to "Emeritus core"
  • Create a new "Owners" team (name TBD) -- folks who have owner role and can add/remove team members. Alternatively, we can give "Core" team these rights?
  • Verify permissions for each team:
    • "Triage" team should be able to add/remove issue and pr labels across all repos in the org, ensure they can transfer issues across repos
    • "Design" team should have merge rights only on nebari-dev/design
    • "Documentation" team should have merge rights only on nebari-dev/nebari-doc
    • "Contributors" team should have merge rights across all repos in the org
    • "Core" team can update things at the org level in addition to "Contributor" privileges (?) Alternatively, this team can be for recognizing decision makers.
    • "Emeritus core" members should have no special access, this team is for recognizing past contributions.

References:

cc @trallard

Bi-weekly community meeting for Nebari

Context

As we adopt a community-first approach to Nebari development, it would be nice to open our team syncs (which are currently internal to Quansight) to everyone.

Proposal:

  • Timing: Every other Tuesday, 3:30-4pm GMT
  • Notes, options:
    • Open an issue (with a dedicated label) against this repo?
    • Open google/notion/hackmd doc?

Value and/or benefit

...

Anything else?

No response

[ENH] - Adopt Code of conduct and enforcement procedures

We want to foster an inclusive, supportive and safe environment for all our community members. We need to adopt a CoC covering the following:

  • 1. Explicit: acceptable and unacceptable behaviour
  • 2. Scope: where is this applicable
  • 3. Enforcement
  • 4. Reporting
  • 5. Social rules
  • 6. Other items that might not fit or are borderline CoC
  • 7. CoC response protocol

[DOC] Revisit CoC

Since we are moving to a more community-driven project, we should revisit the CoC

[DOC] - Write team compass

Items to add:

  • Reviewer guidelines
  • Onboarding/offboarding new team members @costrouc
  • Release guidelines #2
  • Development workflow - git
  • Contribution guidelines
  • Code of conduct #5
  • Triaging flow
  • Inclusivity statement #6
  • GH guidelines and conventions
  • Roadmap
  • Styleguide
  • Making changes to live infra

RFD - Make `nebari` internals aggressively private

Status: Open for comments 💬
Author(s): @pmeier
Date Created: 07-04-2023
Date Last updated: 07-04-2023
Decision deadline: xx

Make nebari internals aggressively private

Summary

Currently, all internals of nebari are public, and with that comes a set of expectations from users, the main one being backwards compatibility (BC). Although there is no such thing as truly private functionality in Python, it is the canonical understanding of the community that a leading underscore in a module / function / class name implies privacy and thus no BC guarantees.

AFAIK, nebari does not have an API. Thus, I propose to "prefix" everything with a leading underscore to avoid needing to keep BC for that.

User benefit

This proposal brings no benefit to the user, but rather to the developers. As explained above, having a fully public API brings BC guarantees with it; at least, that is what users expect. With those expectations in place, it can be really hard to refactor or change internals later on, even though we never intended them to be public.

Design Proposal

The canonical understanding for privacy in Python is that it is implied by a leading underscore somewhere in the "path". For example

  • _foo
  • foo._bar
  • _foo.bar
  • foo._Bar.baz
  • foo.Bar._baz
  • _foo.Bar.baz

are all considered private. This gives us multiple options to approach this:

  1. Make all endpoints private: prefix every function / method / class with an underscore. This is fairly tedious and also somewhat impacts readability.
  2. Make all namespaces under the main nebari package private, e.g. nebari._schema rather than nebari.schema. Since we aren't exposing anything from the main namespace this would effectively make everything private.
  3. Inject an intermediate private namespace into the package, i.e. create nebari._internal and move everything under that. This is what pip does.
  4. Rename the main package to _nebari, but still provide the script under the nebari name. This makes it a little awkward to invoke it through Python, i.e. python -m _nebari. If this is something we want to support, we can also create a coexisting nebari package that does nothing else but import the CLI functionality from _nebari. This is what pytest does (see the sketch after this list).

These options are listed in increasing order of my preference.
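
For option 4, the public nebari package could be a thin shim along these lines; this is a minimal sketch and the _nebari.cli module path is an assumption:

# nebari/__init__.py -- public shim; everything real lives in the private _nebari package
from _nebari.cli import main  # assumed location of the CLI entry point

__all__ = ["main"]


# nebari/__main__.py -- keeps `python -m nebari` working
from _nebari.cli import main

if __name__ == "__main__":
    main()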

Alternatives or approaches considered (if any)

Instead of fixing our code to be private, we could also put a disclaimer in the documentation stating that we consider all internals private and thus make no BC guarantees. However, we need to be honest with ourselves here: although this would suffice from a governance standpoint, we would be making it easy for users to shoot themselves in the foot. And that is rarely a good thing.

User impact

If we want to adopt this proposal or something similar, we need to do it sooner rather than later. Since this change is BC breaking for everyone who is already importing our internals, we should do it while the user base is fairly small and thus even fewer people (hopefully none) are doing something we don't want for the future.

Depending on how much disruption we anticipate, we could also go through a deprecation cycle with the prompt to get in touch with us in case a user depends on our internals. Maybe there is actually a use case for a public API?

RFD - Ways to Limit Argo Workflows Permissions - Mounting Volumes [Draft]

Status: Draft 🚧
Author(s): Adam-D-Lewis
Date Created: 03-31-2023
Date Last updated: 03-31-2023
Decision deadline: ?

In Argo Workflows, users with permission to use Argo Workflows can mount any other user's home directory. This is not acceptable. I discuss some options to limit this behavior below. Some options include:

  1. Use a Kubernetes operator to limit what subPaths can be mounted by particular pods (or put users in their own namespaces, then limit which subPaths can be mounted in that namespace with a CRD and an operator)
    • The problem with this is that we could only kill the Workflow after it's created, potentially allowing something bad to happen in the meantime (delete all users' files, etc.)
  2. Limit users to running particular Argo Workflow templates
  3. Argo Workflows has plugins which could allow us to crash any workflows with the wrong volumes mounted.
    • We'd have to use this together with restricting users to templates, which has the same disadvantages as above.
  4. Create Nebargo, a FastAPI server that all users submit workflows to. It examines the workflow to see if the user is mounting volumes they shouldn't and forwards the request to argo-server or not accordingly.
    • this limits what tools you can use - no hera, no argo CLI :(
  5. AdmissionController
  6. Pod Security Admission/Pod Security Standards
    1. https://kubernetes.io/docs/concepts/security/pod-security-policy/
    2. Might work, but I'm not sure it's flexible enough
  7. Limit users to their own namespace
    • Because PVs are cluster-wide, I don't think this would help with preventing users from mounting volumes to pods that they shouldn't.

I think the AdmissionController is the best way forward at the moment (a rough sketch follows).
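
To sketch what that could look like, a validating admission webhook could reject workflow pods that mount a home-directory subPath belonging to a different user. A minimal FastAPI-based sketch, where the creator label and the home/ subPath convention are assumptions:

from fastapi import FastAPI, Request

app = FastAPI()


@app.post("/validate")
async def validate(request: Request):
    """Deny pods that mount another user's home subPath (AdmissionReview v1)."""
    review = await request.json()
    pod = review["request"]["object"]
    # Assumption: the submitting user is recorded on the pod as a label.
    user = pod["metadata"].get("labels", {}).get("workflows.argoproj.io/creator", "")
    allowed = True
    for container in pod["spec"].get("containers", []):
        for mount in container.get("volumeMounts", []):
            sub_path = mount.get("subPath", "")
            if sub_path.startswith("home/") and sub_path != f"home/{user}":
                allowed = False
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": {"uid": review["request"]["uid"], "allowed": allowed},
    }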

RFD - Vault for Deployment and Dynamic User Secrets

Status: Open for comments 💬
Author(s): costrouc
Date Created: 14-02-2022
Date Last updated: 14-02-2022
Decision deadline: 14-02-2022

Vault for Deployment and Dynamic User Secrets

Summary

I have spent around 2 days familiarizing myself with Vault and trying it out using HashiCorp's managed Vault deployment. It has the features that we would need to allow:

  • storing deployment secrets
  • allowing users/groups to create their own secrets which they can then update/delete and share with other services all with a rich permissions model

User benefit

There are two kinds of users I have in mind for this proposal: end users, e.g. regular users/developers on Nebari, and DevOps/IT sysadmins managing the deployment of Nebari. This proposal would satisfy both.

Design Proposal

Implementation

  • would be nice to have monitoring (prometheus/grafana) in a separate stage since vault can export metrics to prometheus (optional)
  • 2 new stages before 07-kubernetes-services:
    • vault deployment via helm
    • configure vault
  • after this is done, we store all the secrets created during the deployment in Vault using Kubernetes authentication via a service account created during the deployment.

Notice that the user does not have to store/remember any secrets!

How would we configure vault:

  • auth provider:
    • kubernetes auth using kubernetes service accounts (for deployments and services)
    • oidc auth using keycloak (for users)
  • policies would be:
    • created for users/groups to have paths, e.g. users/<username>/* and group/<group>/*, to write arbitrary secrets
    • created for services so that a given service only has access to its specific secrets
  • how to mount the secrets: vault has several options (the sidecar looks the most promising since it allows for dynamically updating secrets) https://www.hashicorp.com/blog/injecting-vault-secrets-into-kubernetes-pods-via-a-sidecar
# patch-basic-annotations.yaml
spec:
  template:
    metadata:
      annotations:
        vault.hashicorp.com/agent-inject: "true"
        vault.hashicorp.com/agent-inject-secret-helloworld: "secrets/helloworld"
        vault.hashicorp.com/role: "myapp"

Kubernetes service accounts are at the heart of this. We would assign identities to users/services by attaching a service account, e.g. <namespace>/service-<service-name> or <namespace>/user-<username>.
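
On the user side, interacting with those per-user paths could look roughly like this with the hvac client; the address, token and mount point are placeholders, and in practice the token would come from the Kubernetes or OIDC auth methods above:

import hvac

# Placeholder address/token; in-cluster this would come from the kubernetes/oidc auth method.
client = hvac.Client(url="https://vault.example.com", token="s.placeholder")

# Write a secret under the user's own path, e.g. users/<username>/my-db
client.secrets.kv.v2.create_or_update_secret(
    path="users/alice/my-db",
    secret={"username": "alice", "password": "hunter2"},
    mount_point="secret",
)

# Read it back
result = client.secrets.kv.v2.read_secret_version(path="users/alice/my-db", mount_point="secret")
print(result["data"]["data"]["username"])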

Alternatives or approaches considered (if any)

There is currently a proposal for using SOPS for secret management #29.

  • SOPS to me has several downsides
    • yet another place to store the secrets
    • managing private keys to encrypt/decrypt secrets
    • support for multiple public/private keys?
    • does not address the regular user use case of wanting dynamic secrets on the cluster
    • additional things to manage in the nebari github repo
    • dangers of committing secrets to the repo unencrypted
  • Advantages to me:
    • significantly simpler to deploy

Best practices

User impact

Unresolved questions

Do we use separate namespaces for users?

[DOC] - Remove `Releases.md`

Preliminary Checks

Summary

We have a new release process documented here: https://www.nebari.dev/docs/community/maintainers/release-process-branching-strategy

We can have the documentation page (at nebari.dev) as the source of truth and move undocumented details from the governance repo to the official docs. :)

Steps to Resolve this Issue

...

RFD - Support gitops staging/development/production deployments

Status: Open for comments 💬
Author(s): @costrouc
Date Created: 12/08/2023
Date Last updated: 12/08/2023
Decision deadline: 22/08/2023

Title

Support gitops staging/development/production deployments

Summary

This RFD is constructed from issue nebari-dev/nebari#924. We need the ability to easily deploy several Nebari clusters to represent dev/staging/production etc. within a gitops model. Whatever solution we adopt should be backwards compatible and easy to adopt.

User benefit

There are several benefits:

  • testing changes before forcing them on users
  • cost savings, since it might be possible to deploy on the same kubernetes cluster
  • for larger enterprise customers this is a must-have

Design Proposal

I propose using folders for the different nebari deployments. The current folder structure is:

.github/workflows/nebari-ops.yaml
stages/...
nebari-config.yaml

For backwards compatibility we keep this structure and add new namespaced ones based on the filename extension.

For example, nebari-config.dev.yaml would imply the following files are written:

.github/workflows/nebari-ops.dev.yaml
dev/stages/...

The GitHub/GitLab workflows will be templated to watch and trigger only on updates to the specific files for that environment (a sketch of the mapping follows below). This approach is independent of git branching.
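
A sketch of how the filename-extension namespacing could be resolved during render; the function name and return shape are illustrative:

import pathlib


def resolve_deployment_paths(config_filename: str) -> dict:
    """Map nebari-config[.<env>].yaml to the workflow file and stages directory it owns."""
    parts = pathlib.Path(config_filename).name.split(".")
    # "nebari-config.yaml" -> default namespace; "nebari-config.dev.yaml" -> "dev"
    env = parts[1] if len(parts) == 3 else None
    if env is None:
        return {"workflow": ".github/workflows/nebari-ops.yaml", "stages": "stages"}
    return {"workflow": f".github/workflows/nebari-ops.{env}.yaml", "stages": f"{env}/stages"}


print(resolve_deployment_paths("nebari-config.dev.yaml"))
# {'workflow': '.github/workflows/nebari-ops.dev.yaml', 'stages': 'dev/stages'}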

Alternatives or approaches considered (if any)

  • Separate branches for production/dev/staging. This approach is, in my mind, the strongest contender, but I strongly oppose it. Oftentimes dev/prod/staging intentionally have different configuration, e.g. dev would have smaller node groups. Thus dev -> staging -> prod is not always how changes flow. It is also hard to compare production vs. dev side by side without diffs.
  • Separate repository per deployment. This is possible as-is, but in our experience so far it is not easy to manage.

Best practices

This would provide an easy way for users to have different deployments on the same git repository.

User impact

This change would not affect any existing nebari deployments as far as I am aware and would be backwards compatible.

Unresolved questions

GitLab doesn't support multiple files for CI; it wants a single entrypoint, .gitlab-ci.yml. Pipelines would allow us to do this, but then the separate stages would all have to write to the same gitlab-ci.yml file. This is solvable.

RFD - Include SOPS for secret management

Status: Open for comments 💬
Author(s): @iameskild
Date Created: 2023-01-15
Date Last updated: 2023-02-06
Decision deadline: 2023-02-13

Summary

See relevant discussion:

Design Proposal

SOPS is a command-line tool for encrypting and decrypting secrets on your local machine.

In the context of Nebari, SOPS can potentially solve the following high-level issues:

  • allow Nebari administrators to manage sensitive secrets
    • this includes the ability to store these secrets in git as part of a GitOps workflow
  • create (shared) kubernetes secrets that can be mounted to JupyterLab pods and other kubernetes resources
    • this requires some additional work but should be worth the effort

Workflow

Starting point: a Nebari admin has a new secret that some of their users may need (such as credentials for an external data source). They have the appropriate cloud credentials available.

  1. Generate a KMS key (or PGP key) - only needs to be performed once
  2. Encrypts the secret locally
  3. Add the encrypted secret to the Nebari infrastructure folder
  4. Redeploy Nebari in order to create Kubernetes secrets and associate those secrets with resources that need them

Handling secrets locally

Items 1 and 2 from the workflow outlined above can be performed directly using the cloud provider CLI (aws kms create-key) and the SOPS CLI (sops --encrypt <file>).

To make it easier for Nebari admins, I propose we add a new CLI command, nebari secret, to handle items 1 and 2. This might look something like:

# requires cloud credentials
nebari secret create-kms-key -c nebari-config.yaml --name <kms-name>  
  • This command would call the cloud provider API and generate the necessary KMS. In the process, this command could also generate the .sops.yaml configuration file to store the KMS and creation_rules.
  • It looks like SOPS doesn't have support for a DO KMS (or DO doesn't have a KMS product?) and will likely need to rely on another method, such as PGP / age keys.
  • Local deployments should also rely on PGP / age keys.
# encrypt secrets stored as a file
nebari secret encrypt --name <secret-name> --file <path/to/file>
# or from a literal string
nebari secret encrypt --name <secret-name> --literal <tOkeN>

# a decrypt command can be included as well
nebari secret decrypt --name <secret-name>
  • The encrypt command encrypts the secret and stores the encrypted secret in the designated location in the repo (./secrets.yaml).
  • The decrypt command decrypts the secret and prints it to stdout.
  • Anyone running these commands on their local machine must have a cloud user that can use that KMS key (a sketch of the encrypt command follows).
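
A hypothetical sketch of the encrypt command as a thin wrapper over the sops CLI; the command layout, option names and the secrets.yaml location are placeholders:

import pathlib
import subprocess

import typer  # assuming the command is added to a typer-based CLI

app = typer.Typer()


@app.command()
def encrypt(name: str, file: pathlib.Path, output: pathlib.Path = pathlib.Path("secrets.yaml")):
    """Encrypt a secret file with sops (keys come from the .sops.yaml creation_rules)."""
    encrypted = subprocess.run(
        ["sops", "--encrypt", str(file)],
        check=True, capture_output=True, text=True,
    ).stdout
    output.write_text(encrypted)
    typer.echo(f"Stored encrypted secret '{name}' in {output}")


if __name__ == "__main__":
    app()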

Include these secrets in the Nebari cluster

Items 3 and 4 from the workflow outlined above refer to how these secrets get included in the Nebari cluster so that they can be used by those who need them.

There exists this SOPS terraform provider, which can decrypt these encrypted secrets during the deployment. To grab and use these secrets, we can create a secrets module in stage/07-kubernetes-services that returns the output (i.e. the secret), which can be used to create kubernetes_secret resources as such:

  1. Read/decrypt the data from the secrets.yaml:
data "sops_file" "secrets" {
  source_file = "/path/to/secrets.yaml"
}

output "my-password" {
  value     = data.sops_file.secrets.data["password"]
  sensitive = true
}
  2. Consume the above output to create a Kubernetes secret (in the parent module):
resource "kubernetes_secret" "k8s-secret" {
	metadata {
		name = "sops-demo-secret"
	}
	data = {
		username = module.sops.my-password
	}
}

At this point, the kubernetes secrets exist (encoded, NOT encrypted) on the Nebari cluster.

Including the secrets in the user's environment

Including secrets in KubeSpawner's c.extra_pod_config (in 03-profiles.py) will allow us to mount those secrets into the user's JupyterLab pod, thereby making them usable by users.

c.extra_pod_config = {
    # as environment variables
    "containers": [
        {"env": []}
    ],
    # to pull images from private registries
    "image_pull_secret": {},
    # as mounted files
    "volumes": [
        {"secret": {}}
    ]
}

How these secrets are configured on the pod (as a file, env var, etc.), and which Keycloak groups have access to these secrets (if we want to add some basic "role-based access"), can be configured in the nebari-config.yaml.

Something like this:

secrets:
- name: <my-secret>
  type: file
  keycloak_group_access:
  - admin
- name: <my-second_secret>
  type: image_pull_secret
  ...

To accomplish this, we will need to add another callable that is used in c.kube_spawner_overrides in 03-profiles.py:render_profiles (a rough sketch follows).
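
A rough sketch of what that callable could produce, following the config shape above; the function name and the group check are placeholders:

def render_secret_pod_config(secrets_config, user_groups):
    """Turn nebari-config `secrets` entries into extra_pod_config fragments for users
    whose Keycloak groups grant access (illustrative only)."""
    volumes, image_pull_secrets = [], []
    for secret in secrets_config:
        allowed_groups = secret.get("keycloak_group_access", [])
        if allowed_groups and not set(allowed_groups) & set(user_groups):
            continue  # user is not in any group that may see this secret
        if secret["type"] == "file":
            volumes.append({"name": secret["name"], "secret": {"secretName": secret["name"]}})
        elif secret["type"] == "image_pull_secret":
            image_pull_secrets.append({"name": secret["name"]})
    return {"volumes": volumes, "image_pull_secrets": image_pull_secrets}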

Alternatives or approaches considered (if any)

There are many specifics that can be modified, such as how users are granted access or how the secrets are consumed by the deployment.

As for a different usage of SOPS, I can think of one more. That would be to create the kubernetes secret from the encrypted file directly and then have the users decrypt the secret in their JupyterLab pod. This would eliminate the need for the sops-terraform-provider above.

It might be possible to create tiered secret files that are then associated with Keycloak groups. This would introduce multiple KMS keys.

The question that then becomes hard to answer is how to safely and conveniently distribute the KMS key to those who need to access the secrets.

Best practices

User impact

Users gain access to secrets they may need in order to access job-specific resources.

Unresolved questions

Given that SOPS is a GitOps tool, it's important to ensure that admins don't accidentally commit plain-text secret files to their repos. Adding stricter filters to the .gitignore will help a little, but there's always a chance for mistakes.

RFD - Move Nebari infrastructure code from HCL to python using terraformpy

Status: Open for comments 💬
Author(s): @viniciusdc
Date Created: 13-03-2023
Date Last updated: 13-03-2023
Decision deadline: --/--/--

Summary

Nebari heavily depends on Terraform to handle all of our IaC needs. While HCL (the .tf files) is a great language for describing infrastructure, it is not the best language for writing code where multiple ecosystems are involved. We see cases where adding a simple new feature requires us to re-write the same piece of code multiple times in HCL (e.g. the variables that are used across different modules).

Our main code that handles most of the execution of the Terraform binaries is already written in Python (a subprocess is responsible for running terraform plan and terraform apply), and almost all of our interaction with the already-deployed cluster during testing is also done in Python. Due to the complexity of our ecosystem, having to write a lot of HCL code to handle our edge cases is not only time consuming but also error prone. In this RFD I would like to suggest moving our infrastructure code to Python using terraformpy to make it easier to maintain and extend.

Benefit

There are multiple benefits to this change:

  • Easier to maintain and extend the codebase, as we can use the full power of Python to write the code
  • Easier to test the code, as functions would be easier to import and test
  • Python would grant us more flexibility when adding new features, as we would be able to point to a Terraform resource as an object and then call its methods to make the required changes (no need for extra variables and outputs to move data around)
  • Parsing our code base would be easier.
    • As a quick example of how that would benefit us: right now all Helm charts are deployed via the helm provider, which is wonderful from the deploy perspective... though linting the files and keeping track of version updates is really complex, as it requires inspecting all files in the repo tree and using some regex to identify the charts. If we move to Python we can import the helm provider and then call its methods to get the list of charts and their versions (or save them in a list to be exported somewhere else), which would make it easier to keep track of their versions and also to update them in other tests (e.g., the upgrade test -- eval broken Bitnami charts --)
  • Easier to get people onboarded to the project, as they would not need to learn HCL to contribute.

Drawbacks

  • We would need to rewrite our entire codebase in Python using the terraformpy library
  • It requires some time to get used to the new syntax
  • We would need to re-think how we call each stage of the deployment (though I think this migration would not be as terrible as it sounds)
  • The terraformpy library has had few updates in the last two years, but it is still maintained.

Approaches considered (if any)

Right now, to add a simple new variable, we need to do something like this:

# in the variables.tf file in the main.tf root directory
variable "my_var" {
  type = string
  default = "my_value"
}
-----------------------
# in the main.tf file in the main.tf root directory
module "my_module" {
  source = "./my_module"
  my_var = var.my_var
}
-----------------------
# in the variables.tf file in the my_module directory
variable "my_var" {
  type = string
}

And we also need to make sure we are passing it over in input_vars.py. This is a lot of code to write for a simple variable that we need to pass to a module (imagine when we need to pass outputs to different stages).

With Python we would instead have a function that receives the vars as input and passes them over to the correct module under the hood. This would make the code much easier to maintain and extend. For example:

from terraformpy import Module
from .vars import my_var


def pass_vars_to_module(my_var):
    Module(
        source="./my_module",
        my_var=my_var,
    )

That's it. Of course, this example is very simple and does not take into consideration the full complexity of the codebase, but I think it is a good starting point to see how we could simplify the code.

User impact

  • The user would not see any changes in the way they usually interact with the project, though this would be a breaking change for the project itself, as we would need to rewrite the entire codebase in Python.
  • Our CI tests would be more reliable as we could test the code in a more isolated way.

Unresolved questions

[DOC] - Analytics with Plausible

Preliminary Checks

Summary

We can document how to access Plausible analytics for Nebari.

Steps to Resolve this Issue

  1. Shall we make the analytics public?
  2. If yes, we can share a link to the public site in the readme (and/or documentation)
  3. If not, we can share a way for community members to gain access to the analytics. Example: open an issue/discussion or send an email to <> to gain access to analytics.

RFD - Extension Mechanism for Nebari

Status: Accepted ✅
Author(s): @costrouc
Date Created: 03-28-2023
Date Last updated: 03-28-2023
Decision deadline: 04-15-2023

Title

Extension Mechanism for Nebari

Summary

Over the past 3 years we have consistently run into the issue that extending and customizing Nebari is a hard task. Several approaches have been added:

  • the addition of stages to the nebari deployment, which made it easier to isolate pieces and was groundwork for moving toward an easier extension mechanism
  • usage of the terraform_overrides and helm_overrides keywords to allow for arbitrary overrides of helm values
  • helm_extensions in stage 8, which allow for the addition of arbitrary helm charts
  • tf_extensions, which integrate oauth2 and ingress to deploy a single docker image

Despite these features we still have needs from users that we are not addressing. Additionally, when we want to add a new service, it typically has to be added directly to the core of Nebari. We want to solve this by making extensions first class in Nebari.

User benefit

I see quite a few benefits from this proposal:

  • easier to extend Nebari making it easier to split development of nebari into smaller teams e.g. core Nebari team, feature-x team
  • easier customization of stages since the extension mechanism will solidify the interfaces between stages
  • easier adoption of new ways to deploy stages. Personally excited about this feature since it could make adoption of terraformpy easier.
  • ad-hoc client customizations will be significantly easier
  • ways to have proprietary additions to nebari that do not require deep customization

Design Proposal

Overall I propose we adopt pluggy. Pluggy has been adopted by many major projects, including datasette, conda, (TODO list more). Pluggy would allow us to expose a plugin interface and "install" extensions via setuptools entry points, making extension installation as easy as pip install ...

Usage from a high level user standpoint

pip install nebari
pip install nebari-ext-clearml
pip install nebari-ext-helm
pip install nebari-ext-cost

Once a user installs the extensions we can view the installed extensions via:

$ nebari extensions list
Name                       Description
---------------------------------------------------------------------------
nebari-ext-clearml "ClearML integration into nebari"
nebari-ext-helm     "Helm extensions"
....

Plugin Interfaces

Within nebari we will expose several plugins:

Subcommands

A plugin interface for arbitrary additional typer commands. All commands will be passed the nebari config along with all command-line arguments specified by the user. Conda has a similar approach with typer for their system.

nebari cost

Stages

import pathlib
import typing

# `schema` refers to nebari's configuration schema module


class Stage:
    name: str
    description: str
    priority: str    # defaults to value of name

    def validate(self, config: schema.NebariConfig):
        """Perform additional validation of the nebari configuration specific to this stage"""

    def render(self, config: schema.NebariConfig) -> typing.Union[typing.Dict[str, bytes], pathlib.Path]:
        """Given a configuration, render a set of files

        Returns
        -------
        typing.Union[typing.Dict[str, bytes], pathlib.Path]
            Either a directory of files to copy over or a dictionary mapping keys to file bytes
        """
        ...

    def deploy(self, directory: pathlib.Path, stages: typing.Dict[str, typing.Any]) -> typing.Any:
        """Deploy all resources within the stage"""
        ...

    def destroy(self, directory: pathlib.Path):
        """Destroy all resources within the stage"""
        ...

Nebari will use pluggy within its core and separate each stage into a pluggy Stage. Each stage will keep its original name. (A rough sketch of the pluggy wiring follows.)
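
For reference, the pluggy wiring could look roughly like this; the hook name, entry-point group and plugin class are illustrative, not a settled design:

import pluggy

hookspec = pluggy.HookspecMarker("nebari")
hookimpl = pluggy.HookimplMarker("nebari")


class NebariSpecs:
    """Hook specifications exposed by nebari core (illustrative)."""

    @hookspec
    def nebari_stage(self):
        """Return the stages contributed by this plugin."""


# In an extension package such as nebari-ext-clearml:
class ClearMLPlugin:
    @hookimpl
    def nebari_stage(self):
        # would return Stage instances as sketched above; a name stands in here
        return ["08-clearml"]


# In nebari core:
pm = pluggy.PluginManager("nebari")
pm.add_hookspecs(NebariSpecs)
pm.register(ClearMLPlugin())              # or, for pip-installed extensions:
pm.load_setuptools_entrypoints("nebari")  # discovers plugins via entry points
stages = [stage for result in pm.hook.nebari_stage() for stage in result]
print(stages)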

Alternatives or approaches considered (if any)

As far as plugin/extension systems go I am only aware of two major ones within the python ecosystem:

  • pluggy
  • traitlets :: I have used traitlets on several projects and do not feel it is a good fit here because:
    • traitlets is extremely invasive to the codebase; it has opinions on class structure/class creation
    • it exposes a CLI
    • it has an opinionated way to perform customization

Best practices

This will encourage the practice of extending nebari via extensions instead of direct PRs to the core.

User impact

It is possible to make this transition seamless to the user without changing behavior.

Unresolved questions

I feel confident in this approach since I have seen other projects use pluggy successfully for similar work.

RFD - User Friendly Method for Jupyter users to run an Argo Workflow [Draft]

Status: Draft 🚧
Author(s): Adam-D-Lewis
Date Created: 02-03-2023
Date Last updated: 02-03-2023
Decision deadline: ?

This is very much a Draft but I welcome feedback already if you want.

User Friendly Method for Jupyter users to run an Argo Workflow (Draft)

Summary

The current method of running Argo Workflows from within JupyterLab is not particularly user friendly. We'd like to have a beginner-friendly way of running simple Argo Workflows, even if this method has limitations making it inappropriate for more complex/large workflows.

User benefit

Many users have asked for ways to run/schedule workflows. This would fill many of those needs.

Design Proposal

  1. Users would need to create a conda environment (or we add a new default base environment, argo_workflows) that has the python, python-kubernetes, argo-workflows, and hera-workflows packages.
  2. We pass some needed pod spec fields (image, container, initContainers, volumes, securityContext) into the pod as environment variables. We do this via a KubeSpawner traitlet.
  3. Enable --auth-mode=client on Argo Workflows in addition to --auth-mode=sso. Then, when users log in, KubeSpawner should map them to a service account consistent with their Argo permissions, and set auto_mount_service_token to True in KubeSpawner as well. An example according to ChatGPT is below, though I don't know if it's hallucinating. The details around authentication via Jupyter vs Keycloak are still a bit hazy to me.
from kubespawner import KubeSpawner
import json

class MySpawner(KubeSpawner):
    def pre_spawn_start(self, user, spawner_options):
        # Get the JWT token from the authentication server
        token = self.user_options.get('token', {}).get('id_token', '')

        # Decode the JWT token to obtain the OIDC claims
        decoded_token = json.loads(self.api.jwt.decode(token)['payload'])

        # Extract the OIDC groups from the claims
        groups = decoded_token.get('groups', [])

        # Modify the notebook server configuration based on the OIDC groups
        if 'group1' in groups:
            self.user_options['profile'] = 'group1_profile'

        # Call the parent pre_spawn_start method to perform any additional modifications
        super().pre_spawn_start(user, spawner_options)
  4. Users with permissions can then submit Argo workflows, since /var/run/secrets/kubernetes.io/serviceaccount/token holds the token needed to submit workflows.
  5. Write a new library (nebari_workflows) with usage like:
import nebari_workflows as wf
from nebari_workflows.hera import Task, Workflow, set_global_host, set_global_token, set_global_verify_ssl, GlobalConfig, get_global_verify_ssl

# maybe make a widget like the dask cluster one
wf.settings(
  conda_environment='',  # uses same as user submitting it by default
  instance_type='',  # uses same as user submitting it by default
)

with Workflow("two-tasks") as w:  # this uses a service with the global token and host
    Task("a", p, [{"m": "hello"}], node_selectors={"beta.kubernetes.io/instance-type": "n1-standard-4"})
    Task("b", p, [{"m": "hello"}], node_selectors={"beta.kubernetes.io/instance-type": "n1-standard-8"})

wf.submit(w)

Alternatives or approaches considered (if any)

Here

Best practices

User impact

Unresolved questions

Here's what I've done so far

  1. Created a conda environment that has the python, python-kubernetes, argo-workflows, and hera-workflows packages.
  2. Added a role (get-pod permissions) and role binding to the default service account in dev
  3. Changed the instance type profile to automount credentials for all users so they get the get-pod permissions
  4. Copied the image, container, initContainers, volumes, securityContext (in 2 places), resources, and HOME env var from the pod spec and put them in an Argo workflow (think jinja to insert them in the right places)
  5. Copied the ARGO_TOKEN and other env vars from the Argo Server UI and sourced them in a JupyterLab terminal.
  6. Ran a short script using the argo_workflows Python API to submit the workflow. It has access to the user conda environments (conda run -n myEnv) and all of the user and shared directories.
    1. the process started in / instead of at HOME, not sure why yet
    2. I ran ["conda", "run", "-n", "nebari-git-dask", "python", "/home/ad/dask_version.py"]
    3. I read and wrote a file to the user's home directory successfully

So deviations from that are still untested.
