aurae-runtime / aurae

Distributed systems runtime daemon written in Rust.

Home Page: https://aurae.io

License: Apache License 2.0

Makefile 3.15% Rust 87.37% Dockerfile 0.37% Shell 8.33% TypeScript 0.17% Nix 0.17% RenderScript 0.44%

daemon distributed-systems linux multitenancy rust system-programming


aurae's Issues

aurae/api/v0/runtime.proto: use repeated fields for stringly-typed fields containing multiple items?

I was looking over https://github.com/aurae-runtime/aurae/blob/main/api/v0/runtime.proto, and two instances of this caught my attention.

/// A comma-separated list of CPU IDs where the task in the control group
/// can run. Dashes between numbers indicate ranges.

Would it be prudent to represent Cell.cpu_cpus and Cell.cpu_mems as repeated fields so that some of the stringly-typed input can be retired in favor of a structured data schema?

So the fields from Cell would look something like this:

repeated string cpu_cpus = 2;
...
repeated string cpu_mems = 4;

For the expression of ranges, I might consider using the oneof capability. A sketch:

message CPUSpec {
  message Range {
    string start = 1;
    string end = 2;
  }
  oneof spec {
    string id = 1;
    Range range = 2;
  }
}

repeated CPUSpec cpu_cpus = 2;
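
To illustrate what the structured form buys, here is a minimal Rust sketch that parses the current comma-and-dash string into typed entries. The names `CpuSpec` and `parse_cpu_list` are hypothetical, and using `u32` for IDs (rather than the `string` fields in the proto sketch above) is an assumption made for this example only.

```rust
/// One entry in a cpuset-style list: a single ID or an inclusive range.
/// (Hypothetical type; mirrors the `oneof` in the proto sketch.)
#[derive(Debug, PartialEq, Eq)]
enum CpuSpec {
    Id(u32),
    Range { start: u32, end: u32 },
}

/// Parse a string such as "0-3,5" into structured entries.
fn parse_cpu_list(input: &str) -> Result<Vec<CpuSpec>, String> {
    let mut specs = Vec::new();
    for part in input.split(',').map(str::trim).filter(|p| !p.is_empty()) {
        let spec = match part.split_once('-') {
            // "a-b" is an inclusive range
            Some((start, end)) => {
                let start = start
                    .trim()
                    .parse::<u32>()
                    .map_err(|e| format!("bad range start in '{}': {}", part, e))?;
                let end = end
                    .trim()
                    .parse::<u32>()
                    .map_err(|e| format!("bad range end in '{}': {}", part, e))?;
                if start > end {
                    return Err(format!("range '{}' is reversed", part));
                }
                CpuSpec::Range { start, end }
            }
            // anything else must be a single ID
            None => CpuSpec::Id(
                part.parse::<u32>()
                    .map_err(|e| format!("bad id '{}': {}", part, e))?,
            ),
        };
        specs.push(spec);
    }
    Ok(specs)
}

fn main() {
    let specs = parse_cpu_list("0-3,5").expect("valid list");
    println!("{:?}", specs);
}
```

With repeated structured fields, this parsing and its error cases would move out of every consumer and into the schema itself.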

Destroy branches on merge

I believe this is a github setting we need to investigate

Guarantee Structured JSON Output

Right now auraescript errors and can produce many types of outputs.

Would it be possible to guarantee (or strongly encourage) auraescript users to always have their output in the form of valid JSON?

If we can instill patterns/best practices that make it such that all auraescript output to stderr and stdout is valid json we can begin logging and querying the data at scale later.
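
As a rough sketch of the idea, the Rust below writes one JSON object per line to stdout and stderr using only the standard library. The field names (`stream`, `level`, `message`) and the helpers `json_escape` and `json_line` are assumptions for illustration, not an existing auraescript convention; a real implementation would more likely use a JSON library.

```rust
/// Escape a string so it can sit inside a JSON string literal.
fn json_escape(raw: &str) -> String {
    let mut out = String::with_capacity(raw.len());
    for c in raw.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // remaining control characters use the \u00XX form
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Build a single-line JSON object (hypothetical field names).
fn json_line(stream: &str, level: &str, message: &str) -> String {
    format!(
        "{{\"stream\":\"{}\",\"level\":\"{}\",\"message\":\"{}\"}}",
        json_escape(stream),
        json_escape(level),
        json_escape(message)
    )
}

fn main() {
    println!("{}", json_line("stdout", "info", "cell \"demo\" allocated"));
    eprintln!("{}", json_line("stderr", "error", "cell not found"));
}
```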

Aurae as a Hypervisor (Kubernetes Namespaces)

Similar to #21

Does each Namespace get a VM?

More importantly should each namespace get a VM?

How do we start to experiment with isolation primitives for Kubernetes? Even though Aurae will not have Kubernetes awareness, it should consider the scope of Kubernetes similar to the scope of Aurae. We just aim to standardize the components according to the #20 principle of least awareness.

Disable tracing in gRPC code

Right now running auraed -v creates entirely too much TRACE output and is unusable. We need a way to use -v without flooding the screen with gRPC traces.

Documentation Convention

We need to follow the Rust documentation standard and begin auto-generating our documentation in the aurae.io repository.

More: https://doc.rust-lang.org/rust-by-example/meta/doc.html

Website edit button links to 404

For example, https://aurae.io/ links to https://github.com/aurae-runtime/aurae/edit/master/docs/index.md. That should be main instead of master.

This looks simple to fix from the mkdocs docs, so I’ll send a PR.

Auraed panics when `free` is called on a cell that does not exist

Steps to reproduce:

Start auraed
Run the following typescript:

await cells.free(<runtime.FreeCellRequest>{
    cellName: "non-existent-cell"
});

Observe auraed logs:

16:25:22 [INFO] Starting Aurae Daemon Runtime...
16:25:22 [INFO] Register Server SSL Identity
16:25:22 [INFO] Validating SSL Identity and Root Certificate Authority (CA)
16:25:22 [INFO] User Access Socket Created: /var/run/aurae/aurae.sock
16:25:31 [INFO] CellService: free() cell_name="sleeper-cell"
thread 'tokio-runtime-worker' panicked at 'find cell_name in cgroup_table', auraed/src/runtime/mod.rs:122:14
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
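
One way to avoid the panic is to return an error when the lookup fails instead of calling `expect` on it. The sketch below assumes a simple name-to-handle table; `CellTable`, `CellError`, and the `u32` placeholder handle are hypothetical and do not reflect the actual code in auraed/src/runtime/mod.rs.

```rust
use std::collections::HashMap;
use std::fmt;

#[derive(Debug, PartialEq, Eq)]
enum CellError {
    NotFound(String),
}

impl fmt::Display for CellError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            CellError::NotFound(name) => write!(f, "cell '{}' not found", name),
        }
    }
}

/// Hypothetical stand-in for the daemon's cgroup table.
#[derive(Default)]
struct CellTable {
    cells: HashMap<String, u32>, // cell name -> placeholder handle
}

impl CellTable {
    fn allocate(&mut self, name: &str) {
        self.cells.insert(name.to_string(), 0);
    }

    /// Remove a cell, reporting a missing name as an error instead of panicking.
    fn free(&mut self, name: &str) -> Result<(), CellError> {
        self.cells
            .remove(name)
            .map(|_| ())
            .ok_or_else(|| CellError::NotFound(name.to_string()))
    }
}

fn main() {
    let mut table = CellTable::default();
    table.allocate("sleeper-cell");
    println!("{:?}", table.free("sleeper-cell"));
    match table.free("non-existent-cell") {
        Ok(()) => println!("freed"),
        Err(e) => println!("error: {}", e),
    }
}
```

The service layer could then translate the error into a gRPC "not found" status rather than taking down the worker thread.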

Create and use build container (speed up the builds)

Right now we have several build jobs that are executed during a pull request and a merge to main

Can we please leverage #78 and have the builds use a container instead of apt-get installing the dependencies each iteration?

Auraescript case-sensitivity issue

It seems like differences in casing for the generated code are causing property/field values to be lost.

JS/TS use camelCase for properties while Rust uses snake_case for fields.

@krisnova confirmed this as an issue on stream. A possible solution is to relax the casing requirements with another macro that tags all the fields with #[serde(alias = "camelCaseFieldName")] as Deno uses serde for serializing/deserializing.

*I actually have a macro for this, that probably needs a little adapting before I can contribute it. I'm just a little time constrained before the break, so wanted to leave this issue as a note for future me. For now, if anyone is trying stuff out, just use hardtoread field names so the casing will be the same throughout.
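
To make the mismatch concrete, here is a small Rust sketch of the name conversion such a macro would have to perform when generating the alias for each field. `snake_to_camel` is a hypothetical helper written for this example, not the macro mentioned above.

```rust
/// Convert a snake_case field name to the camelCase form JS/TS would send.
fn snake_to_camel(field: &str) -> String {
    let mut out = String::with_capacity(field.len());
    let mut upper_next = false;
    for c in field.chars() {
        if c == '_' {
            // drop the underscore and capitalize whatever follows
            upper_next = true;
        } else if upper_next {
            out.extend(c.to_uppercase());
            upper_next = false;
        } else {
            out.push(c);
        }
    }
    out
}

fn main() {
    for field in ["cell_name", "cpu_cpus", "name"] {
        println!("{} -> {}", field, snake_to_camel(field));
    }
}
```

A field like `name` is identical in both conventions, which is why single-word field names sidestep the problem for now.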

Custom Language Support: VSCode Extensions

Similar to #66, can we please implement a VS Code language extension for AuraeScript?

It will need to read from the lib.rs source of truth for the language.

Aurae Official Blog

Can we host an official project blog on the website? I'd like to move some of my articles over to the project to use as needed.

Also as we identify new paradigms such as Aurae Cells and Aurae Pods we likely will need to do some story telling.

Can't use numbers greater than i32/u32 in auraescript

This is due to a bug in pbjson (which we depend on). There is currently an open PR to fix the issue in that repo:

influxdata/pbjson#87

Update/Move community docs on website

Right now the community repository is the main source of truth for our community.

We should migrate/copy/move the documentation over to the website such that it makes it easy for newcomers to the project to understand how to get started with us.

What are the things new folks to the project would love to see that they didn't see before? Or maybe something that was surprising? Or something that wasn't discovered until later?

Cells should be immutable

We should not be able to change a cell after it has been created. To change anything about a cell, you must destroy it and allocate a new one.

Speed up compiles

Is there any way we can make optimizations in the Aurae compile time? How many dependencies are we building that we are no longer using? How about caching and optimization?

Structured Output

We should be able to structure the output of aurae scripts such that it can be queried.

Sticking with the "No YAML touches our project" mentality, I propose we identify a clever way to ensure that all output from the Aurae executable is structured as valid JSON.

Aurae as a Hypervisor (Kubernetes Nodes)

As we bring the ability to schedule VMs and containers online, we need to start exploring patterns for leveraging the new isolation boundaries.

Does each VM get a Kubelet?

Or more importantly should each VM get a Kubelet? Do we want to be able to use Kubernetes taints/tolerations to be able to schedule to VMs running on a single machine?

What about the ability for the Aurae project to support a "Kubelet VM Pattern" that makes it easy to lump all of the Kubernetes "goop" together into a single image (similar to minikube) that can then be used to schedule containers within the VM?

Need a better error output when npm is missing

Auraescript needs npm installed to build. The current output if it is missing is (thanks @moto-timo):

error: failed to run custom build command for `auraescript v0.1.0 (/home/ttorling/Projects/aurae/auraescript)`

Caused by:
  process didn't exit successfully: `/home/ttorling/Projects/aurae/target/debug/build/auraescript-614525272d6885a2/build-script-build` (exit status: 1)
  --- stdout
  cargo:rerun-if-changed="src"
  cargo:rerun-if-changed=""/home/ttorling/Projects/aurae/auraescript/lib""

  --- stderr
  Error: No such file or directory (os error 2)
warning: build failed, waiting for other jobs to finish...
make: *** [Makefile:63: auraescript] Error 101

This is currently unsuccessfully attempted in the auraescript/build.rs file. It would be best to make the fix there.

Custom Language Support: IntelliJ (Clion)

Can we identify a way to leverage the lib.rs file and its corresponding definitions/macros/documentation to implement an IntelliJ custom language?

Plumb stdout and stderr through CellService

I believe we will need to develop a logging cache that serves as an in-memory caching layer (Note: in the future it will need to also persist to disk!) for both stdout and stderr streams from an executable within a cell.

We will need to be able to hook into a stream at any moment and have some basic guarantees about the data and how we retrieve it.
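
A minimal sketch of such a cache, assuming a bounded in-memory ring buffer where every line gets a sequence number so a reader can attach at any point and ask for everything from a given sequence onward. `LogCache`, `Stream`, and the eviction policy are assumptions for illustration; disk persistence is left out.

```rust
use std::collections::VecDeque;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Stream {
    Stdout,
    Stderr,
}

/// Bounded in-memory cache of log lines (hypothetical design).
struct LogCache {
    capacity: usize,
    next_seq: u64,
    lines: VecDeque<(u64, Stream, String)>,
}

impl LogCache {
    fn new(capacity: usize) -> Self {
        LogCache {
            capacity: capacity.max(1), // always keep at least one line
            next_seq: 0,
            lines: VecDeque::new(),
        }
    }

    /// Append a line, evicting the oldest one when the cache is full.
    fn push(&mut self, stream: Stream, line: &str) {
        if self.lines.len() == self.capacity {
            self.lines.pop_front();
        }
        self.lines.push_back((self.next_seq, stream, line.to_string()));
        self.next_seq += 1;
    }

    /// All cached lines with a sequence number of at least `from`.
    fn since(&self, from: u64) -> Vec<&(u64, Stream, String)> {
        self.lines.iter().filter(|entry| entry.0 >= from).collect()
    }
}

fn main() {
    let mut cache = LogCache::new(2);
    cache.push(Stream::Stdout, "starting");
    cache.push(Stream::Stderr, "warning: slow disk");
    cache.push(Stream::Stdout, "ready");
    for entry in cache.since(0) {
        println!("{:?}", entry);
    }
}
```

Sequence numbers give one basic guarantee: a reader can detect that lines were evicted if the first sequence number it receives is higher than the one it asked for.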

Setns vs Clone(3)

All Aurae cells should set up the namespaces by scheduling a process immediately. We believe this will be the nested auraed.

Once the new namespaces have been "cloned" we can track their IDs.

All executables should use the setns() system call rather than calling clone(3) themselves. We should be entering already existing namespaces so that all executables in a cell share the same namespaces and namespace IDs.

Static Config (NixOS Pattern)

What if we established a pattern such that a single configuration file/source was all that was needed to guarantee that an Aurae instance with defined workloads was able to start?

What if aurae.ts was all that was needed to remotely provision an entire node with associated workloads?

User mode networking

We should be able to add user mode networking and network devices to an unprivileged container and VM

See my thread 👀 https://twitter.com/krisnova/status/1582353843110965248?s=20&t=y32VYPsVtNP1FOitiQ7h3w

Options for scripting against aurae

I'd like to propose using bash (with an aurae cli similar to kubectl or buildah) or Deno to script against aurae instead of moving forward with the Rhai implementation. I am not heavily invested in bash or Deno per se, but I am not sure that Rhai's unique properties serve the project better than other more popular approaches to scripting against API's. In short, why is Rhai better than python, javascript, bash, etc? I tried to focus my thinking on who aurae users probably will be and how that theoretical persona would want to interact with the system. In addition, I am working on a system with some of the same goals as aurae currently, and I'm sure that project also biases my opinions about aurae.

Soooo, from the readme,

AuraeScript follows a similar client paradigm to Kubernetes kubectl command. However, unlike Kubernetes this is not a command line tool like kubectl. AuraeScript is a fully supported programing language complete with a systems standard library. The Aurae runtime projects supports many clients, and the easiest client to get started building with is AuraeScript.

Does it make sense that the easiest client to get started building with is a scripting language users probably aren't familiar with? As a sysadmin, I would much rather reach for bash to get started hacking. I think buildah serves as a great model to emulate (tool intro here). buildah replaces dockerfiles the same way auraescript is attempting to replace yaml for infra configuration. It should be simple to spin up an instance of aurae and configure it with a simple bash script or interactively from the terminal. Bash is the most intuitive answer for this. Moving past simplicity, power users can get pretty far with bash, awk, and a well maintained gnu-style cli such as kubectl or buildah before needing to reach for a beefier (and crucially, more complex to integrate and maintain) scripting language such as python, javascript, perl, or lua.

And speaking of lua, an even simpler question to ask than above is why Rhai instead of lua? I've fiddled with it a bit to mess with my neovim configuration, but I am far from an expert. From my outsider's perspective, lua seems like an active success story for small, embed-able languages.

If the scripting language is important to provide a platform for the aurae standard library, then Deno becomes a very compelling option. Deno markets itself as a v8 runtime that is secure, hackable, and embed-able.

The Deno global object (Deno.spawn, Deno.test, Deno.exit) can be modified to include runtime specific functionality. You can ship aurae users a full featured runtime & compiler to build their aurae scripts with.
JavaScript and TypeScript are very popular with developers these days, and might help make maintainable infrastructure code understandable to more web developers.
Deno takes a lot of design inspiration from Go. In fact the standard library is loosely modeled after the Go stdlib. Using Deno gives us a lot of the upsides of using Go without some of the more prickly downsides (see bottom)

A Go Problemo

side note: when building docker we were struggling to provide flexibly replaceable extensions as Go had no support for dynamic loading of libraries. As Docker wanted to ship with "batteries included but changeable", we ended up with those CNI / CSI constructs, where core functionality was pushed to external processes with more or less nice interfaces.

POSIX Signal Handler and Proxy/Bus

Now that auraed is launching nested versions of itself, we will need to proxy POSIX signals through auraed

Implement a POSIX compliant signal handler in Rust in Auraed and provide documentation on which signals map to what behavior.

SIGKILL should terminate (kill) the process
SIGHUP should reload the config from disk and reopen logfiles
SIGINT should "interrupt" the process and begin to "die nice" ensuring any cleanup logic can be done

Use SIGINT instead of SIGKILL to "free" a nested auraed after the signal handler has been implemented.
Proxy all signals to nested executables for them to manage independently.

For example sending a SIGHUP to a nested auraed should proxy a SIGHUP to all of the nested executables!
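
The mapping above could be documented as a small table in code. The Rust sketch below is only that table; it does not install real handlers, which would need a signal-handling crate or libc. The `Signal` and `Action` enums are hypothetical. One known constraint is encoded: SIGKILL cannot be caught by a process, so a handler never sees it and cannot proxy it.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Signal {
    Hup,
    Int,
    Kill,
}

#[derive(Debug, PartialEq, Eq)]
enum Action {
    ReloadConfigAndReopenLogs,
    GracefulShutdown,
    ImmediateTermination,
}

/// Documented behavior for each signal (hypothetical mapping from the list above).
fn action_for(signal: Signal) -> Action {
    match signal {
        Signal::Hup => Action::ReloadConfigAndReopenLogs,
        Signal::Int => Action::GracefulShutdown,
        Signal::Kill => Action::ImmediateTermination,
    }
}

/// SIGKILL is delivered by the kernel and cannot be intercepted,
/// so only the other signals can be proxied from inside a handler.
fn is_catchable(signal: Signal) -> bool {
    signal != Signal::Kill
}

fn main() {
    for signal in [Signal::Hup, Signal::Int, Signal::Kill] {
        println!(
            "{:?} -> {:?} (catchable: {})",
            signal,
            action_for(signal),
            is_catchable(signal)
        );
    }
}
```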

Aurae needs a container registry

The project needs a place to store containers, which could implicitly mean the project has an authz/secrets problem as well.

Can we please identify a place for the project to store containers? Can we also work with the other maintainers to make sure folks and the build systems have access to the container registry as needed?

better error messages

For errors that lead to actionable resolutions by the user, we should include a link to some documentation about the error (hosted on aurae.io?) and a link to file a new issue here in the repo.

I think we can get this pretty easily by extending the From<CellsServiceError> for Status response translation to include known docs and links to file a new issue using constant strings.
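
A rough sketch of that translation, using only the standard library. The `Status` struct here is a stand-in for the gRPC status type, the error variants are invented for the example, and the docs URL path is hypothetical; only the repository URL comes from this project.

```rust
/// Standard GitHub "new issue" page for this repository.
const NEW_ISSUE_URL: &str = "https://github.com/aurae-runtime/aurae/issues/new";
/// Hypothetical location for error documentation on the website.
const DOCS_BASE_URL: &str = "https://aurae.io/errors";

/// Hypothetical error variants; the real enum may differ.
#[derive(Debug)]
enum CellsServiceError {
    CellNotFound { cell_name: String },
    Internal(String),
}

/// Stand-in for the gRPC status type.
#[derive(Debug, PartialEq, Eq)]
struct Status {
    code: &'static str,
    message: String,
}

impl From<CellsServiceError> for Status {
    fn from(err: CellsServiceError) -> Self {
        match err {
            // actionable by the user: point at documentation
            CellsServiceError::CellNotFound { cell_name } => Status {
                code: "NOT_FOUND",
                message: format!(
                    "cell '{}' not found. Docs: {}/cell-not-found",
                    cell_name, DOCS_BASE_URL
                ),
            },
            // not actionable by the user: point at the issue tracker
            CellsServiceError::Internal(detail) => Status {
                code: "INTERNAL",
                message: format!(
                    "internal error: {}. Please file an issue: {}",
                    detail, NEW_ISSUE_URL
                ),
            },
        }
    }
}

fn main() {
    let status: Status = CellsServiceError::CellNotFound {
        cell_name: "demo".to_string(),
    }
    .into();
    println!("{}: {}", status.code, status.message);
}
```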

Unwrap

As we are still in the sandbox phase of building Aurae we are using unwrap statements in the code. We should replace these with safer and more idiomatic systems in Rust.

Additionally we should build a linting system that prevents code like this from entering the project.

Ensure all files include license headers

For legal reasons we need to include the license in every file that we consider "Source Code".

Should we have a CI/CD check for this or any easy make target we can run to check/append files as needed?

Aurae 0.1.0 - Umbrella Tracking Issue

There are a few things we would like to accomplish with this milestone, but the overall aim to be captured here is that it signifies the project has reached a maturity where it can commit to being ready to accept contributions at any time, regardless of one's prior exposure to Aurae. Consequently, this milestone is less one of functionality, but more a state of readiness and cleanliness.

Some of the concrete deliverables of the milestone are:

Adopting a trunk-based workflow
Keeping the main branch green
Leaning on feature-flags and prioritizing collective incremental progress
The current "v0" API directory for the standard library will be moved into a folder named v0

Some of the sociotechnical goals of the milestone are:

People can contribute to Aurae with as little friction as possible regardless of prior exposure to the project
People should feel safe, comfortable, and able to contribute to Aurae in whatever capacity and ability they can offer
Aurae continues to facilitate incremental and collaborative work together
We have a solid and viable path towards achieving sustainability of the project

This umbrella issue will be associated with a milestone and will be updated with links to issues and to reflect the current state of progress towards the milestone

Generate AuraeScript docs from lib.rs

The /auraescript/src/lib.rs file serves as the source of truth for objects, functions, types, and aliases for the AuraeScript programming language.

We need to establish a convention that will generate meaningful documentation directly from the source code.

For example we expose the about() function in AuraeScript which is defined here as:

pub fn register_stdlib(mut engine: Engine) -> Engine {
    engine
        // about function
        //
        // Reserved function name to share information about the current
        // client interpreter.
        .register_fn("about", about)
}

How do we define a RustDoc-like convention that will generate additional documentation specifically for AuraeScript?
How do we ensure that each specific piece of functionality that is registered with the rhai engine has documentation?
How do we render the documentation on aurae.io?
How do we map the documentation of AuraeScript to the global version of Aurae? Do we leverage the v0 style API/stdlib convention or do something else? What about semantic versioning in the Cargo.toml?

StreamLogger errors when there are 0 gRPC observers

If there is no receiver, StreamLogger and LogChannel error, which prints "Failed to log message...". I think these errors can be safely ignored as we have multiple loggers registered and StreamLogger is only used for the observe API.
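
The behavior can be reproduced with a standard-library channel: sending fails once every receiver is gone, and the sender can simply drop the message instead of reporting an error. This is a sketch under the assumption that the logger forwards over a channel; the actual StreamLogger may use a different channel type, and the struct below is a hypothetical stand-in.

```rust
use std::sync::mpsc::{channel, Sender};

/// Hypothetical stand-in for a logger that forwards lines to observers.
struct StreamLogger {
    sender: Sender<String>,
}

impl StreamLogger {
    /// Forward a message; returns false (without printing anything)
    /// when there is no observer left to receive it.
    fn log(&self, message: &str) -> bool {
        self.sender.send(message.to_string()).is_ok()
    }
}

fn main() {
    let (sender, receiver) = channel();
    let logger = StreamLogger { sender };

    println!("delivered: {}", logger.log("with observer"));
    println!("received: {:?}", receiver.recv());

    // Dropping the receiver models "0 observers".
    drop(receiver);
    println!("delivered: {}", logger.log("without observer"));
}
```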

Relationship with DNS

One of the big problems with TLS/mTLS and Kubernetes is that in order to set up a cluster, you need to touch DNS records.

This is both a good thing and a bad thing. SAN and hostname material is often embedded and relied on in various TLS scenarios.

How do we as a first principle call out some elegance around our relationship with DNS without pissing the internet off?

Introduce `--ritz` flag as an alternative to `--verbose`

I just want to be able to say "puttin on the ritz" while debugging production.

Is it easy to add an alias for a flag?
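
In principle an alias is just a second spelling that maps to the same setting. The sketch below hand-rolls the check with the standard library; `is_verbose` is a hypothetical helper, and if the project parses arguments with a library, that library's own alias support would be the place to add it.

```rust
/// True if any argument asks for verbose output, under any accepted spelling.
fn is_verbose<I, S>(args: I) -> bool
where
    I: IntoIterator<Item = S>,
    S: AsRef<str>,
{
    args.into_iter()
        .any(|arg| matches!(arg.as_ref(), "-v" | "--verbose" | "--ritz"))
}

fn main() {
    // Skip the program name, then look at the remaining arguments.
    let verbose = is_verbose(std::env::args().skip(1));
    println!("verbose: {}", verbose);
}
```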

Break the build on compile warnings

Can we please break the build and fail a pull request if code creates a Rust warning?

We are trying to keep a clean set of code for the project, and we would like to leverage the Makefile commands to ensure that no warnings are generated.

This includes

Documentation
Linting
Unused (dead) code
etc

Monitoring (more 404s) on the website

How do we know if our site is down?

Right now aurae.io is offline and is returning a 404.

Can we alert our discord and let the maintainers know when the site is broken?

IPv6 by default.

IPv6 is the ~~future~~ present.

IPv6 is now! The time has arrived! We are officially here 🎉

Can we please adopt IPv6 support for the networking subsystem by default? Additionally, can we go a step further and adopt IPv6 for all of our documentation and code defaults moving forward? We should offer IPv4 documentation and code as a secondary example to the default IPv6 content.

More better 🎉

Here at Aurae we create a lo loopback device listening on localhost ::1 (or the IPv4 equivalent 127.0.0.1).

use std::net::TcpListener;

fn main() {

    let listener = TcpListener::bind("[::1]:8080").unwrap();

    for stream in listener.incoming() {
        let stream = stream.unwrap();

        println!("Connection established!");
    }
}

Less better 😞

Here at Aurae we create a lo loopback device listening on localhost 127.0.0.1.

use std::net::TcpListener;

fn main() {

    let listener = TcpListener::bind("127.0.0.1:8080").unwrap();

    for stream in listener.incoming() {
        let stream = stream.unwrap();

        println!("Connection established!");
    }
}
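
An "IPv6 first, IPv4 second" default can also be expressed directly in code. The sketch below relies on `TcpListener::bind` accepting a slice of addresses and trying them in order; `bind_loopback` is a hypothetical helper name, and port 0 is used in `main` so the operating system picks a free port.

```rust
use std::net::{Ipv4Addr, Ipv6Addr, SocketAddr, TcpListener};

/// Bind to the IPv6 loopback, falling back to IPv4 only if that fails.
fn bind_loopback(port: u16) -> std::io::Result<TcpListener> {
    let candidates = [
        SocketAddr::from((Ipv6Addr::LOCALHOST, port)), // [::1], preferred
        SocketAddr::from((Ipv4Addr::LOCALHOST, port)), // 127.0.0.1, fallback
    ];
    // bind() attempts each address in order until one succeeds.
    TcpListener::bind(&candidates[..])
}

fn main() -> std::io::Result<()> {
    let listener = bind_loopback(0)?;
    println!("listening on {}", listener.local_addr()?);
    Ok(())
}
```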

Speed up Auraescript compiles

Auraescript takes too long to compile. I suspect it has to do with the build.rs doing too much each time, specifically redundant npm calls.

Commitment to beautiful logs

We want to have beautiful logs in the Aurae project. This is a perpetually open issue that will be ever-relevant to newcomers to the project.

At any time new contributors to the project are welcome to audit our current log lines.

In the code base you will see statements such as the following:

info!("an ugly log");
warn!("a bad log");
debug!("something debug");
trace!("trace something");

At any time in the course of the project's development it is safe and encouraged for users to audit our log lines.

Add a "how to write a good log line" to the aurae.io website
Add examples of "when to use what level" of logging in aurae
Add a rubric to help understand when to use what
Answer what and when to use a complete sentence

Official Language Name

We need to admit that we are building a Turing complete language, and we intrinsically will adopt all of the major exciting problems of managing a popular programming language.

The first exciting problem we get to tackle is picking out a name which will inevitably be highly criticized by strange tech enthusiasts with opinions.

Some options for us to peruse.

Auraelang
Auraescript
Aurae Shell (ash?) (aurepl?)
Infralang
Systemscript
Platlang

Issues to maybe close?

Scanning over the open GH issues, I think there are some that we can possibly close. So naturally, I'm opening another issue to document the ones that may be ready to close.

Probably outdated due to Deno being JS/TS?

#67
#66
#6 -> Does JSON.stringify cover this?

ts-proto (Proto -> Typescript generator) includes the docs from the proto files for us:

Warnings are breaking the builds, but we may not have completed activating the lints we want:

Do we still have this issue?

Other:

Introduce namespaces to cells

Should we consider isolating an Aurae cell at the namespace level as well? If so what are the sane defaults we should assume for every Aurae cell? Which namespaces should we consider, and what do we do given the various kernels and their support and awareness of each namespace?

GitHub actions seems to remove CNAME from pages settings and 404s

Upon merging a PR to main (or committing directly to the branch because I am a horrible person) a GitHub action is kicked off to update the static site using GitHub pages. Shortly after the event, the website aurae.io will return a 404.

Looking in the GitHub Pages settings for the repository, the domain name value is unexpectedly missing.

Setting the value back to aurae.io fixes the 404 and the site is now updated with the most recent changes from the main branch.

Scheduler Resource

https://blog.acolyer.org/2016/04/26/the-linux-scheduler-a-decade-of-wasted-cores/

Transfer to PSL (Public Stewardship License) for The Nivenly Foundation

The PSL releases the project to the public domain, and also includes the concept of a "steward" of the project.

Can we please change the license from Apache 2.0 to the PSL 1.0 and call out the Nivenly Foundation as our official steward?

We will need to update the CLA as well as have existing contributors to Aurae agree to the new terms.

Clustering

Curious if there are any thoughts about clustering right now. The docs make references to Kubernetes, but from what I can tell the current code is purely focused on running AuraeScript against a local instance of aurae.

Is the current idea to fully offload this to an external system, such as Kubernetes, or have a cluster aware layer that is accessible from AuraeScript?

I ask because this very closely mirrors ideas I've been playing with and would like to contribute.

Possible to destroy cgroups manually and Aurae still has them in cache

Right now it is possible to use the rmdir command in the /sys/fs/cgroup directory to manually destroy cgroups.

In the event that we manually destroy a cgroup that was started with auraed the internal cache still believes the cgroup exists. This is problematic when we try to re-create the cgroup again with auraed.
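
One possible mitigation is to treat the filesystem as the source of truth and evict cache entries whose directory has disappeared. The sketch below assumes the cache maps a cgroup name to its directory path; `CgroupCache` and `contains_live` are hypothetical names, and a temporary directory stands in for /sys/fs/cgroup so the example can run unprivileged.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};

/// Hypothetical cache of cgroups created by the daemon.
#[derive(Default)]
struct CgroupCache {
    entries: HashMap<String, PathBuf>,
}

impl CgroupCache {
    fn insert(&mut self, name: &str, path: &Path) {
        self.entries.insert(name.to_string(), path.to_path_buf());
    }

    /// True only if the cgroup is cached AND its directory still exists.
    /// A cached entry whose directory is gone is evicted.
    fn contains_live(&mut self, name: &str) -> bool {
        let alive = self.entries.get(name).map(|path| path.is_dir());
        match alive {
            Some(true) => true,
            Some(false) => {
                self.entries.remove(name);
                false
            }
            None => false,
        }
    }
}

fn main() -> std::io::Result<()> {
    // Temporary directory standing in for a cgroup directory.
    let dir = std::env::temp_dir().join(format!("aurae-sketch-{}", std::process::id()));
    std::fs::create_dir_all(&dir)?;

    let mut cache = CgroupCache::default();
    cache.insert("demo", &dir);
    println!("live before rmdir: {}", cache.contains_live("demo"));

    std::fs::remove_dir(&dir)?; // simulate a manual rmdir
    println!("live after rmdir: {}", cache.contains_live("demo"));
    Ok(())
}
```

After the eviction, re-creating the cgroup through the daemon would no longer collide with a stale entry.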

Principle of Least Awareness

Imagine the principle of least privilege but for systems awareness.

As we traverse up the stack, a system should only have awareness of the systems that sit "below" itself.

For example the Kubernetes Kubelet has awareness of the control plane. The control plane also has awareness of the Kubelet. This interdependence model is to be avoided with Aurae.

In other words auraed should never have Kubernetes awareness. The Kubernetes control plane might potentially choose to leverage Aurae as a Kubelet/Systemd alternative, however the interaction between these systems will likely need to be patched in a generic way to make them work.

For example, Kubernetes might want to schedule a "Pod"; however, Aurae should have no awareness of Kubernetes "pods". Aurae will just run containers. If Kubernetes chooses to group containers with shared networking, storage, and metadata and refer to that as a "pod", so be it.

This same pattern is reflected at the kernel level as well.

The kernel should never have awareness of Aurae, however Aurae will have kernel awareness. The pattern flows upwards with each system having awareness of the systems below itself, but never above.

The principle of least awareness.

Kernel Probes and Modules (New Subsystem)

Do we want to provide a secure way of installing, managing, upgrading, and authenticating kernel modules and eBPF probes?

Think DKMS but for more than just kernel modules, and with an authz gRPC API to do the dirty work. We could also authenticate 3rd party binary blobs as well as provide attestations they are what we want them to be.

Also worth considering BCC and BPFtrace for existing BPF work we could easily give a story to.