tinkerbell / osie

An in-memory installation environment for bare metal.

Home Page: https://tinkerbell.org

License: Apache License 2.0

Makefile 3.32% Dockerfile 3.85% Shell 80.20% Python 11.67% Nix 0.22% SaltStack 0.20% Go 0.03% Jinja 0.50%
tinkerbell

osie's Introduction

Tinkerbell

License

Tinkerbell is licensed under the Apache License, Version 2.0. See LICENSE for the full license text. Some of the projects used by the Tinkerbell project may be governed by a different license; please refer to each project's specific license.

Tinkerbell is part of the CNCF Projects.

Community

The Tinkerbell community meets bi-weekly on Tuesday. The meeting details can be found here.

Community Resources:

What's Powering Tinkerbell?

The Tinkerbell stack consists of several microservices, and a gRPC API:

Tink

Tink is the short-hand name for the tink-server and tink-worker. tink-worker and tink-server communicate over gRPC, and are responsible for processing workflows. The CLI is the user-interactive piece for creating workflows and their building blocks: templates and hardware data.

Smee

Smee is Tinkerbell's DHCP server. It handles DHCP requests, hands out IPs, and serves up iPXE. It uses the Tinkerbell client to pull and push hardware data. It only responds to a predefined set of MAC addresses so it can be deployed in an existing network without interfering with existing DHCP infrastructure.

Hegel

Hegel is the metadata service used by Tinkerbell and OSIE. It collects data from both and transforms it into a JSON format to be consumed as metadata.

OSIE

OSIE is Tinkerbell's default in-memory installation environment for bare metal. It installs operating systems and handles deprovisioning.

Hook

Hook is the newly introduced alternative to OSIE. It's the next iteration of the in-memory installation environment to handle operating system installation and deprovisioning.

PBnJ

PBnJ is an optional microservice that can communicate with baseboard management controllers (BMCs) to control power and boot settings.

Building

Use make help. The most interesting targets are make all (or just make) and make images. make all builds all the binaries for your host OS and CPU to enable running directly. make images will build all the binaries for Linux/x86_64 and build docker images with them.

Configuring OpenTelemetry

Rather than adding a bunch of command line options or a config file, OpenTelemetry is configured via environment variables. The most relevant ones are listed below; for others, see https://github.com/equinix-labs/otel-init-go

Currently this is just for tracing; metrics support still needs to be discussed with the community.

Env Variable                   Required  Default
OTEL_EXPORTER_OTLP_ENDPOINT    n         localhost
OTEL_EXPORTER_OTLP_INSECURE    n         false
OTEL_LOG_LEVEL                 n         info

To work with a local opentelemetry-collector, try the following. For examples of how to set up the collector to relay to various services, take a look at otel-cli.

export OTEL_EXPORTER_OTLP_ENDPOINT=localhost:4317
export OTEL_EXPORTER_OTLP_INSECURE=true
./cmd/tink-server/tink-server <stuff>

Website

For complete documentation, please visit the Tinkerbell project hosted at tinkerbell.org.

osie's People

Contributors

andy-v-h, dependabot[bot], detiber, dlaube, dustinmiller, gauravgahlot, grahamc, invidian, jacobweinstock, joelrebel, maxpeal, mergify[bot], mikemrm, mmlb, mrmrcoleman, nathangoulding, nshalman, parauliya, rainleander, raj-dharwadkar, scott4000, scottgarman, sfunkhouser, splaspood, thebsdbox, tobert, tstromberg

osie's Issues

Send the worker logs to the boots syslog server?

#51 dropped support for forwarding the logs to a central place but left the worker without any way to troubleshoot the actions (and since the containers are deleted, we cannot use docker logs).

We should have a way to see the actions/worker logs somewhere, maybe forward them to the boots syslog server?

This is somewhat related to #48, but saving the logs to a local file is not enough because when the worker reboots we lose the logs.

CI is broken

github.com/mholt/caddy used to redirect to caddyserver/caddy but now is 404'ing. This is breaking our CI. Going to update to something stable.

osie-runner triggering SIGSEGV and being terminated

We've seen some machines where preinstallation has completed and osie-runner is sitting connected to hegel over grpc where it suddenly terminates with exit code 139:

localhost:~# docker ps -a
CONTAINER ID        IMAGE                COMMAND                  CREATED             STATUS                     PORTS               NAMES
bcc4cb56c565        osie-runner:x86_64   "/entrypoint.sh pyth…"   5 days ago          Exited (139) 7 hours ago                       quizzical_varahamihira
localhost:~# docker logs quizzical_varahamihira --tail 3
}
2021-08-04T10:40:45.952178Z [info     ] no handler for state           [runner] state=provisionable
2021-08-04T10:40:45.952249Z [info     ] about to monitor               [runner]

Expected Behaviour

osie-runner keeps running indefinitely until it receives new information from hegel

Current Behaviour

osie-runner exits unexpectedly with exit code 139
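
For readers unfamiliar with the convention behind the title: an exit status above 128 usually means the process was terminated by signal number (status - 128), so 139 corresponds to signal 11, which is SIGSEGV on Linux. A minimal sketch of that arithmetic follows; the helper name signal_from_exit_code is hypothetical and not part of OSIE.

```shell
# Decode an exit status: values above 128 conventionally mean
# "terminated by signal (status - 128)".
# signal_from_exit_code is a hypothetical helper name.
signal_from_exit_code() {
    status=$1
    if [ "$status" -gt 128 ]; then
        echo $((status - 128))
    else
        echo 0    # 0 here means "not terminated by a signal"
    fi
}

sig=$(signal_from_exit_code 139)
echo "exit code 139 -> signal $sig"
# "kill -l $sig" prints the signal's name on the local system.
```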

Possible Solution

Update to the latest grpc in hopes that some bug in the core library has been fixed.

Steps to Reproduce (for bugs)

Context

Your Environment

Equinix Metal Production

osie.sh based installs fail to boot

Installations done through osie.sh (legacy stuff) fail to boot because osie.sh uses Ubuntu's grub, not the grub embedded in the image. Ubuntu's grub has been updated to mitigate the BootHole round 2 issues. Ubuntu's grub has a hard-coded path to grub.cfg which is not where the default grub-install puts it, and thus we fail to boot on powerup.

Expected Behaviour

Legacy installations have grub properly installed, able to find the configuration file and thus boot from disk.

Current Behaviour

Grub can't find its configuration file, and hangs the boot.

Possible Solution

Going to re-work grub installation to install via chroot, using the image's grub.
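
A rough sketch of what a chroot-based install could look like is below. It is not the actual change made in osie.sh: the function names, the /mnt/target mount point and the /dev/sda disk are assumptions for illustration, and the image is assumed to ship a grub-install binary. With DRY_RUN=1 the commands are only printed.

```shell
# run: execute a command, or just print it when DRY_RUN=1 (hypothetical helper).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# install_grub_in_chroot TARGET DISK (hypothetical helper):
# bind-mount the pseudo filesystems into the installed root, then run the
# image's own grub-install from inside the chroot.
install_grub_in_chroot() {
    target=$1    # mount point of the installed root filesystem (assumed)
    disk=$2      # boot disk to install grub on (assumed)
    for fs in dev proc sys; do
        run mount --bind "/$fs" "$target/$fs"
    done
    run chroot "$target" grub-install "$disk"
    grub_status=$?
    for fs in sys proc dev; do
        run umount "$target/$fs"
    done
    return "$grub_status"
}

# Example: DRY_RUN=1 install_grub_in_chroot /mnt/target /dev/sda
```

Because grub-install runs inside the chroot, the grub binary and its expected grub.cfg location both come from the image rather than from the installation environment.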

Tinkerbell Uniform Standards: Maintained Repository

Our repositories should be the example from which adjacent, competing, projects look for inspiration.

Each repository should not look entirely different from other repositories in the ecosystem, having a different layout, a different testing model, or a different logging model, for example, without reason or recommendation from the subject matter experts from the community.

We should share our improvements with each ecosystem while seeking and respecting the feedback of these communities.

Whether or not strict guidelines have been provided for the project type, our repositories should ensure that the same components are offered across the board. How these components are provided may vary, based on the conventions of the project type. GitHub provides general guidance on this which they have integrated into their user experience.

Expected Behaviour

We believe this repository is Maintained and therefore needs the following files updated:

If you feel the repository should be experimental or end of life or that you'll need assistance to update these files, please let us know by filing an issue with https://github.com/packethost/standards.

Current Behaviour

n/a

Possible Solution

n/a

Steps to Reproduce (for bugs)

n/a

Context

Packet maintains a number of public repositories that help customers to run various workloads on Packet. These repositories are in various states of completeness and quality, and being public, developers often find them and start using them. This creates problems:

  • Developers using low-quality repositories may infer that Packet generally provides a low quality experience.
  • Many of our repositories are put online with no formal communication with, or training for, customer success. This leads to a below average support experience when things do go wrong.
  • We spend a huge amount of time supporting users through various channels when with better upfront planning, documentation and testing much of this support work could be eliminated.

To that end, we propose three tiers of repositories: Private, Experimental, and Maintained.

As a resource and example of a maintained repository, we've created https://github.com/packethost/standards. This is also where you can file any requests for assistance or modification of scope.

Your Environment

https://github.com/tinkerbell/

Docker images fail to download with 'no space left on device' in OSIE

When trying out the sandbox and using a different workflow than hello world as in the example (I tried https://github.com/alexellis/tinkerbell-ubuntu) the docker images fail to download. I'm posting this here and not in sandbox, because I assume this is rather an issue with the default OSIE (?)

Expected Behaviour

OSIE environment is able to download necessary Docker images

Current Behaviour

download of Docker images fails due to 'no space left on device'

Possible Solution

Steps to Reproduce (for bugs)

  1. Download and run sandbox
  2. Create workflow for https://github.com/alexellis/tinkerbell-ubuntu
  3. run the workflow against your hardware
  4. tink worker fails with "no space left on device"

Context

I noticed the Ubuntu installation fails during the first step (wiping the disk), so I checked the docker logs of tink-worker on the machine I'm trying to set up. In the logs it says "no space left on device" when trying to extract the downloaded Docker image. I checked the volumes and the only full volume was /dev/loop0 (/.modloop) with 200MB.
What am I missing?
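
The volume check described above can be scripted. The sketch below filters POSIX-format df output for mount points at or above a usage threshold; the function name full_filesystems and the 95% threshold are illustrative choices, not something OSIE provides.

```shell
# full_filesystems LIMIT: read "df -P" output on stdin and print the mount
# point of every filesystem whose usage is at or above LIMIT percent.
# (Hypothetical helper name.)
full_filesystems() {
    awk -v limit="$1" 'NR > 1 {
        use = $5
        sub(/%/, "", use)              # "100%" -> "100"
        if (use + 0 >= limit) print $6
    }'
}

# List filesystems that are at least 95% full on this machine.
df -P | full_filesystems 95
```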

Your Environment

Using sandbox docker-compose setup on Proxmox VMs with Ubuntu 18.04 LTS as OS

Debug aarch64 vm based power-on/phone-home tests

The qemu-system-aarch64 user-mode emulation based boot/phone-home tests from #227 don't work. There's no serial output, so it's hard to debug at the moment. Is the test giving up too early? Is something done wrong with qemu? Maybe the aarch64 UEFI can't boot from virtio disks (:thinking:). Need to try the VNC graphical output; maybe there's data there...

Expected Behaviour

test_boot_and_phone_home works for aarch64 VMs

Current Behaviour

test_boot_and_phone_home always fails

workflow-helper script should not block TTY from spawning

It seems that right now, if the workflow-helper script hits an infinite loop trying to log in to the Docker registry, there is no way to debug it, and no logs are produced on the TTY either.

If you look at the serial console output, the login prompt does not show up until the workflow-helper script finishes execution.
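
One possible mitigation for the looping part of this problem is to bound the retries and log every failed attempt. The sketch below is a generic helper, not code from workflow-helper; the name retry_with_limit and the attempt count and delay in the example are assumptions. It does not by itself make the TTY spawn earlier.

```shell
# retry_with_limit MAX DELAY CMD...: run CMD up to MAX times, sleeping DELAY
# seconds between attempts and logging each failure to stderr.
# Returns 0 as soon as CMD succeeds, 1 once MAX attempts have failed.
retry_with_limit() {
    max=$1
    delay=$2
    shift 2
    attempt=1
    while ! "$@"; do
        echo "attempt $attempt/$max failed: $*" >&2
        if [ "$attempt" -ge "$max" ]; then
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
    return 0
}

# Example (variable names are illustrative):
# retry_with_limit 5 10 docker login -u "$registry_username" -p "$registry_password" "$docker_registry"
```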

Fetching complete repo for single image fileset is inefficient

During server provisioning we see extra ~1 minute delays whilst ~1GB of git data is downloaded from https://images.packet.net/packethost/packet-images.git

As this repo grows, the slowdown gets worse (this is all before git checkout, and thus excludes the final large image file download via LFS/caching).

Contributing to this:
a) This repo has a lot of stuff beyond the final images for servers booting (idk, maybe build tools?)
b) It fetches all branches (could be single ref)
c) It fetches all history (could be shallow)

Historically fetch by commit (uploadpack.allowReachableSHA1InWant) was not well supported - it is now (including GitHub, I believe), and a shallow single commit fetch is much quicker. (Deploy script could always try direct commit fetch, and fall back to all branches if git service doesn't support it).

I'm not sure of the exact OSIE script running at the moment, but I'm assuming it's close to:

git -C $assetdir fetch origin

Example (Run from Packet SYD2)

gituri=https://github.com/packethost/packet-images.git
image_tag=82dfba29f7aa462651c2e96521ed24bcad726330

#Existing fetch-all
time git -C $assetdir fetch origin
#Receiving objects: 100% (91877/91877), 889.05 MiB | 19.89 MiB/s, done.
#real 0m51.687s
#user 0m23.772s
#sys 0m5.012s

#imageid fetch
time git -C $assetdir fetch --depth 1 origin "${image_tag}"
# remote: Total 9 (delta 0), reused 6 (delta 0), pack-reused 0
#real 0m2.982s
#user 0m0.080s
#sys 0m0.012s
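
The "try a direct commit fetch, fall back to fetching everything" idea can be sketched as a small function. This is an illustration under the assumption that the deploy script has an asset directory with an origin remote already configured; fetch_image_commit is a hypothetical name, not the actual OSIE code.

```shell
# fetch_image_commit ASSETDIR COMMIT: try a shallow fetch of a single commit
# first; if the git server rejects fetching by SHA, fall back to fetching
# all refs as before. (Hypothetical helper name.)
fetch_image_commit() {
    assetdir=$1
    image_tag=$2
    if git -C "$assetdir" fetch --depth 1 origin "$image_tag"; then
        return 0
    fi
    echo "shallow fetch of $image_tag failed, fetching all refs" >&2
    git -C "$assetdir" fetch origin
}

# Example:
# fetch_image_commit "$assetdir" 82dfba29f7aa462651c2e96521ed24bcad726330
```

Servers that reject the single-commit request only cost one extra round trip before the existing behaviour takes over.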

Ticket reference NYDE-2114-IUHD

Write a detailed document about the connections and requirements for writing your own in-memory OS

It would be nice to have a detailed document about how to build your own in-memory operating system (initrd) and what it has to run (for example, docker and tink-worker).

I am asking for mainly two reasons:

  1. Osie should be only one possible implementation; companies can have their own OS, and it would be nice to document what they need in order to write their own
  2. a lot of the pull requests I see landing these days in Osie (#102, #101, #98, #93) look related to "Packet needs."

My secret hope is to get a minimum "Osie" implementation as part of the Tinkerbell organization, leaving more specialized ones to the end-user; ideally, Osie, as we know it today, can belong to PacketHost.

Something cool @thebsdbox did https://github.com/plunder-app/BOOTy about this topic

Probably this is related to tinkerbell/tink#136
And in some way to #2

workflow-helper.sh doesn't execute 2nd docker run

Hey there,

It looks like there is some sort of race condition in workflow-helper.sh. When workflow-helper.sh is executed as part of init.rd the following command from the mentioned script is not executed:

https://github.com/tinkerbell/osie/blob/master/installer/workflow-helper.sh#L69-L81

docker run --privileged -ti \
	-e "container_uuid=$id" \
	-e "WORKER_ID=$worker_id" \
	-e "DOCKER_REGISTRY=$docker_registry" \
	-e "TINKERBELL_GRPC_AUTHORITY=$grpc_authority" \
	-e "TINKERBELL_CERT_URL=$grpc_cert_url" \
	-e "REGISTRY_USERNAME=$registry_username" \
	-e "REGISTRY_PASSWORD=$registry_password" \
	-v /worker:/worker \
	-v /var/run/docker.sock:/var/run/docker.sock \
	--log-driver=fluentd -t \
	--net host \
	"$docker_registry/tink-worker:latest"

Re-running workflow-helper.sh manually succeeds and executes without problems.

Replace osie-build-env with shell.nix

osie-build-env is currently used to set up both a developer/build environment and an environment suitable for running the osie tests. This is done using docker. Instead of (ab)using docker for both scenarios, we should use nix and nix-shell. With nix-shell we get better support for pinning/choosing tool versions and parity with other tinkerbell/ services dev setups.

The test environment should be completely re-done using the actual services (either with docker-compose.yml or as drone services). This was much harder back in the day pre-cacher, but should be pretty easy now.

conditional syntax not supported in future alpines

The syntax ?// from installer/osie-installer.sh is not supported in jq on Alpine 3.11, and results in the error:

+ jq -S '. + {"password_hash":"'"$pwhash"'", "state": (.state?//"'"$state"'")}' <"$metadata" >"$metadata.tmp"
jq: error: syntax error, unexpected ?// (Unix shell quoting issues?) at <top-level>, line 1:
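
A possible rewrite is sketched below. My understanding is that newer jq releases tokenize ?// as a single operator (the destructuring alternative), so the same intent needs either whitespace (.state? // ...) or, as here, the plain // alternative operator. The sketch also passes the shell values with --arg instead of splicing them into the filter string. The function name merge_metadata is hypothetical; the field names come from the command above.

```shell
# merge_metadata PWHASH STATE: read metadata JSON on stdin and write it to
# stdout with password_hash set and state defaulted to STATE when the
# existing value is missing or null. (Hypothetical helper name.)
merge_metadata() {
    jq -S --arg pwhash "$1" --arg state "$2" \
        '. + {password_hash: $pwhash, state: (.state // $state)}'
}

# Example, mirroring the original command:
# merge_metadata "$pwhash" "$state" <"$metadata" >"$metadata.tmp"
```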

Unable to create ZFS file system - unsupported by kernel

ZFS file system creation is not supported in the current Alpine version.

Expected Behaviour

ZFS support is needed. It works with kernel 5.4.72; the current kernel version is 5.4.52.

Current Behaviour

Not able to create a ZFS file system on disk instead of ext4.

Possible Solution

Upgrade the Alpine OS to kernel version 5.4.72 so that zfs-lts can be installed.

Steps to Reproduce (for bugs)

Context

Your Environment

  • Operating System and version (e.g. Linux, Windows, MacOS):
    Linux
  • How are you running Tinkerbell? Using Vagrant & VirtualBox, Vagrant & Libvirt, on Packet using Terraform, or give details:
    Vagrant and Libvirt
  • Link to your project or a code example to reproduce issue:

reduce osie size

We should remove unnecessary packages, perhaps package x86_64/aarch64 architectures independently, and generally work to slim down OSIE substantially.

Containerd package gets corrupted sometimes

Expected Behaviour

Provisioning OSIE works every time.

Current Behaviour

Currently, booting a worker with OSIE sometimes ends with no workflow executed. Upon investigation, I found that the cached containerd package in OSIE is corrupted and apk add is not able to install it, which results in no Docker and no workflows being executed.

@gianarb says it's only happening sometimes.

Steps to Reproduce (for bugs)

  1. Provision sandbox repository several times checking if workflow runs each time.

Context

Discovered while working on tinkerbell/cluster-api-provider-tinkerbell#17

Your Environment

Used OSIE version https://tinkerbell-oss.s3.amazonaws.com/osie-uploads/osie-v0-n=404,c=c35a5f8,b=master.tar.gz

Running in sandbox environment using tinkerbell/playground@be40a7b.

OSIE image should use multi stage builds

We build a few packages from source; some of them may go away with #128, but some are likely to still be needed. Using multi-stage builds would allow making better use of layer caching and would likely help with the final image size (#2).

Update OSIE container image to latest Ubuntu LTS

OSIE uses Xenial (16.04) as its base image, yet we build/install newer packages from source. Updating to 20.04 might avoid the need to do so, which should shorten the time taken to build the container images.

18.04 was tried once, pre-open-sourcing and didn't work out for some reason I can't recall :/.

Change CI away from drone.packet.net

drone.packet.net is going to be shut down in the not very distant future. It's running old code (v0.8, EoL for many years now), is being "managed" by a team of one (me) as fires pop up, and the hardware it's running on is old/crufty and about to be retired.

There are 2 options I'm considering

  1. Transition to GH Actions using our tinkerbell org self-hosted runners to build and test osie (we need the ability to run VMs for the tests).
  2. Transition to Buildkite using a bare metal runner hosted by Equinix.

I'm leaning towards option 2 as a PoC for transitioning all the tinkerbell repos over to Buildkite instead of the self-hosted GHA. I've been planning a PoC that I could use as an example for a formal proposal and was going to do it with boots, but the drone.packet.net shutdown presents a good opportunity.

Uniform Standards: Experimental Repository

Hello!

We believe this repository is Experimental and therefore needs the following files updated:

If you feel the repository should be maintained or end of life or that you'll need assistance to create these files, please let us know by filing an issue with https://github.com/packethost/standards.

Packet maintains a number of public repositories that help customers to run various workloads on Packet. These repositories are in various states of completeness and quality, and being public, developers often find them and start using them. This creates problems:

  • Developers using low-quality repositories may infer that Packet generally provides a low quality experience.
  • Many of our repositories are put online with no formal communication with, or training for, customer success. This leads to a below average support experience when things do go wrong.
  • We spend a huge amount of time supporting users through various channels when with better upfront planning, documentation and testing much of this support work could be eliminated.

To that end, we propose three tiers of repositories: Private, Experimental, and Maintained.

As a resource and example of a maintained repository, we've created https://github.com/packethost/standards. This is also where you can file any requests for assistance or modification of scope.

The Goal

Our repositories should be the example from which adjacent, competing, projects look for inspiration.

Each repository should not look entirely different from other repositories in the ecosystem, having a different layout, a different testing model, or a different logging model, for example, without reason or recommendation from the subject matter experts from the community.

We should share our improvements with each ecosystem while seeking and respecting the feedback of these communities.

Whether or not strict guidelines have been provided for the project type, our repositories should ensure that the same components are offered across the board. How these components are provided may vary, based on the conventions of the project type. GitHub provides general guidance on this which they have integrated into their user experience.
