cneira / firecracker-task-driver

nomad task driver that uses firecracker to start micro-vms

License: Apache License 2.0

HCL 0.09% Go 99.21% Shell 0.70%
firecracker-microvms firecracker nomad task-driver vmlinux kernel-image bootdisk cni firecracker-task-driver rootfs ext4 linux-kernel microvm

firecracker-task-driver's Issues

How to register a service with Consul

Hi cneira,

The driver does not support registering a service with Consul.

job "neverwinter" {
    datacenters = ["dc1"]
    type        = "service"

    group "nwn-group" {
        network {
            mode = "cni/microvms"
        }

        service {
            name = "nwn-service"
            port = 22
            address_mode = "alloc"
            check {
                type = "tcp"
                interval = "10s"
                timeout = "2s"
                address_mode = "alloc"
            }
        }

        task "nwn-server" {
            driver = "firecracker-task-driver"
            config {
                Vcpus = 1
                KernelImage = "/home/cneira/Development/vmlinuxs/vmlinux"
                BootDisk = "/home/cneira/Development/rootfs/ubuntu/18.04/nwnrootfs.ext4"
                Disks = [ "/home/cneira/Development/disks/disk0.ext4:rw" ]
                Mem = 1000
                Network = "microvms"
            }
        }
    }
}

I modified some code, but it does not work correctly.
support-cni-service.txt

Note: rename support-cni-service.txt to support-cni-service.patch.

I think I should get the IP address assigned by the group -> network section, then set up taskConfigSpec.Nic.

Could you give me some advice?

Thx!
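
One way to read the idea above is that the driver needs the address that CNI assigned before it can report anything to Nomad. The sketch below only shows that first step: pulling the first IP out of a CNI 0.4.0 result document. The `cniResult` type, the `firstIP` helper and the sample address are hypothetical names and values made up for illustration; how the driver would then hand the address to Nomad is not shown and would have to follow Nomad's driver plugin API.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net"
)

// cniResult mirrors only the fields of a CNI 0.4.0 result that this sketch needs.
// It is a hypothetical helper type, not part of the driver.
type cniResult struct {
	IPs []struct {
		Address string `json:"address"` // CIDR notation, e.g. "192.168.60.2/24"
		Gateway string `json:"gateway"`
	} `json:"ips"`
}

// firstIP returns the first address in a CNI result, without its prefix length.
func firstIP(raw []byte) (string, error) {
	var res cniResult
	if err := json.Unmarshal(raw, &res); err != nil {
		return "", fmt.Errorf("decoding CNI result: %w", err)
	}
	if len(res.IPs) == 0 {
		return "", fmt.Errorf("CNI result contains no addresses")
	}
	ip, _, err := net.ParseCIDR(res.IPs[0].Address)
	if err != nil {
		return "", fmt.Errorf("parsing %q: %w", res.IPs[0].Address, err)
	}
	return ip.String(), nil
}

func main() {
	// Hypothetical result; the address is invented for this example.
	raw := []byte(`{"cniVersion":"0.4.0","ips":[{"version":"4","address":"192.168.60.2/24","gateway":"192.168.60.1"}]}`)
	ip, err := firstIP(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(ip)
}
```

Keeping the parsing in a small pure function makes it easy to test without a running CNI setup.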

Rootfs links not accessible

Got "access denied" when downloading the rootfs images, probably because of the S3 bucket configuration. Please fix the links if possible. Thanks!

Request: Propagate Firecracker Task Driver errors to Nomad UI

I have a task that fails to start with the following, not very useful, message:

rpc error: code = Unknown desc = task with ID "8ee3098b-7420-cb04-2892-fedaa3c730ba/tenant-plugin/339ec6bd" failed


However, going to the Nomad Agent logs I get the following, much more intelligible errors:
failure when invoking CNI: failed to load CNI configuration from dir "/etc/cni/conf.d" for network "default": no net configurations found in /etc/cni/conf.d

    2022-04-04T13:23:32.274-0400 [INFO]  client.driver_mgr.firecracker-task-driver: starting firecracker task: driver=firecracker-task-driver driver_cfg="{KernelImage: BootOptions: BootDisk: Disks:[] Network:default Nic:{Ip: Gateway: Interface: Nameservers:[]} Vcpus:1 Cputype: Mem:128 Firecracker:/usr/bin/firecracker Log: DisableHt:false}" @module=firecracker-task-driver timestamp=2022-04-04T13:23:32.274-0400
    2022-04-04T13:23:32.274-0400 [INFO]  client.driver_mgr.firecracker-task-driver: Starting firecracker: driver=firecracker-task-driver driver_initialize_container="&{/usr/bin/firecracker /tmp/NomadClient1700322499/3aee425c-e789-5c1c-e029-d552efbf942c/tenant-plugin/vmlinux  console=ttyS0 reboot=k panic=1 pci=off nomodules /tmp/NomadClient1700322499/3aee425c-e789-5c1c-e029-d552efbf942c/tenant-plugin/rootfs.ext4  [] default {   []} []    false 1  300    false false [] <nil> 0xc384c0}+" @module=firecracker-task-driver timestamp=2022-04-04T13:23:32.274-0400
    2022-04-04T13:23:32.275-0400 [INFO]  client.driver_mgr.firecracker-task-driver: Error starting firecracker vm: driver=firecracker-task-driver @module=firecracker-task-driver driver_cfg="Failed to start machine: failure when invoking CNI: failed to load CNI configuration from dir \"/etc/cni/conf.d\" for network \"default\": no net configurations found in /etc/cni/conf.d" timestamp=2022-04-04T13:23:32.275-0400
    2022-04-04T13:23:32.275-0400 [ERROR] client.alloc_runner.task_runner: running driver failed: alloc_id=3aee425c-e789-5c1c-e029-d552efbf942c task=tenant-plugin error="rpc error: code = Unknown desc = task with ID \"3aee425c-e789-5c1c-e029-d552efbf942c/tenant-plugin/0e1713e6\" failed"
    2022-04-04T13:23:32.275-0400 [INFO]  client.alloc_runner.task_runner: not restarting task: alloc_id=3aee425c-e789-5c1c-e029-d552efbf942c task=tenant-plugin reason="Error was unrecoverable"

I was wondering if it'd be possible to propagate that error up to the UI? Thanks!
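
Since the message Nomad reports appears to be the error string returned by the driver, one possible approach is to wrap the underlying cause into that error instead of dropping it. The sketch below illustrates the wrapping only. `startVM`, `startTask` and `errNoCNIConfig` are hypothetical stand-ins, not the driver's actual functions, and whether the Nomad UI would display the full text is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoCNIConfig stands in for the failure reported by the CNI layer (hypothetical).
var errNoCNIConfig = errors.New("no net configurations found in /etc/cni/conf.d")

// startVM is a hypothetical stand-in for the routine that boots the micro-VM.
func startVM() error {
	return fmt.Errorf("failure when invoking CNI: %w", errNoCNIConfig)
}

// startTask is a hypothetical stand-in for the driver's task start entry point.
// It keeps the generic message but appends the cause with %w, so the caller
// sees both and can still inspect the chain with errors.Is.
func startTask(taskID string) error {
	if err := startVM(); err != nil {
		return fmt.Errorf("task with ID %q failed: %w", taskID, err)
	}
	return nil
}

func main() {
	err := startTask("example-task")
	fmt.Println(err)
	fmt.Println("caused by missing CNI config:", errors.Is(err, errNoCNIConfig))
}
```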

veth interface

Hi Neira,

When Nomad runs a job that creates a micro VM, it creates a veth interface, but when we stop the job, that veth is not removed.
So, after running some jobs, you end up with many veth interfaces on the host machine. Is this a bug, or do we have to do something in the job?

225: vethdc6cd6b7@if6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:e2:74:60:24:28 brd ff:ff:ff:ff:ff:ff link-netnsid 1
    inet 192.168.127.1/32 scope global vethdc6cd6b7
       valid_lft forever preferred_lft forever
    inet6 fe80::e2:74ff:fe60:2428/64 scope link
       valid_lft forever preferred_lft forever
230: vethc53f5bdf@if6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 72:3a:7f:35:a1:e1 brd ff:ff:ff:ff:ff:ff link-netnsid 2
    inet 192.168.127.1/32 scope global vethc53f5bdf
       valid_lft forever preferred_lft forever
    inet6 fe80::703a:7fff:fe35:a1e1/64 scope link
       valid_lft forever preferred_lft forever
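
To check whether interfaces really pile up, it may help to list the host's veth devices before starting a job and again after stopping it. The sketch below is a minimal way to do that with the Go standard library. The `vethNames` helper is a hypothetical name, the "veth" prefix match is an assumption based on the output above, and the program only lists interfaces; it does not decide which ones are stale or delete anything.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// vethNames returns the interface names that look like host-side veth devices.
// Matching on the "veth" prefix is an assumption taken from the output above.
func vethNames(names []string) []string {
	var out []string
	for _, n := range names {
		if strings.HasPrefix(n, "veth") {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	ifaces, err := net.Interfaces()
	if err != nil {
		panic(err)
	}
	names := make([]string, 0, len(ifaces))
	for _, i := range ifaces {
		names = append(names, i.Name)
	}
	veths := vethNames(names)
	fmt.Printf("%d veth interface(s) on this host\n", len(veths))
	for _, n := range veths {
		fmt.Println(n)
	}
}
```

Running it before and after stopping a job and comparing the two lists shows which devices were left behind.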

Jailer

How to operate this driver using Firecracker with Jailer?

Supporting snapshot, pause/restore

Having the ability to snapshot and stop a running service would be great. Firecracker supports this.

https://github.com/firecracker-microvm/firecracker/blob/3388fa94c2ceeb2269a6fc9479b6f2798604c4e7/docs/snapshotting/snapshot-support.md

It would allow massive over-provisioning of RAM if you run very heavy instances that don't get a lot of traffic, all without writing complicated code (just keep all your state in RAM).

Here's how Codesandbox uses it to fork a running VM in under 2 seconds: https://codesandbox.io/blog/how-we-clone-a-running-vm-in-2-seconds.
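
For reference, a rough sketch of what pausing a VM and requesting a full snapshot over the Firecracker API socket could look like. The endpoint paths and JSON field names follow my reading of the snapshot document linked above and should be checked against the Firecracker version in use. The socket path, file paths and helper names (`snapshotBody`, `unixClient`, `send`) are hypothetical, and none of this is part of the driver today.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net"
	"net/http"
)

// snapshotRequest is the assumed body of PUT /snapshot/create.
type snapshotRequest struct {
	SnapshotType string `json:"snapshot_type"`
	SnapshotPath string `json:"snapshot_path"`
	MemFilePath  string `json:"mem_file_path"`
}

// snapshotBody builds the JSON body for a full snapshot request.
func snapshotBody(snapPath, memPath string) ([]byte, error) {
	return json.Marshal(snapshotRequest{
		SnapshotType: "Full",
		SnapshotPath: snapPath,
		MemFilePath:  memPath,
	})
}

// unixClient returns an HTTP client that always dials the given Unix socket.
func unixClient(socket string) *http.Client {
	return &http.Client{Transport: &http.Transport{
		DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
			var d net.Dialer
			return d.DialContext(ctx, "unix", socket)
		},
	}}
}

// send issues one JSON request and treats any status of 300 or above as an error.
func send(c *http.Client, method, path string, body []byte) error {
	req, err := http.NewRequest(method, "http://localhost"+path, bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := c.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("%s %s: %s", method, path, resp.Status)
	}
	return nil
}

func main() {
	c := unixClient("/tmp/firecracker.socket") // hypothetical socket path
	// The VM has to be paused before a snapshot is taken.
	if err := send(c, http.MethodPatch, "/vm", []byte(`{"state":"Paused"}`)); err != nil {
		fmt.Println("pause failed:", err)
		return
	}
	body, err := snapshotBody("./snapshot_file", "./mem_file") // hypothetical paths
	if err != nil {
		fmt.Println("encoding failed:", err)
		return
	}
	if err := send(c, http.MethodPut, "/snapshot/create", body); err != nil {
		fmt.Println("snapshot failed:", err)
	}
}
```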

Bug: cannot install firecracker-task-driver

Hello, I can't install firecracker-task-driver.

  • go version go1.11 linux/amd64
  • command - go get github.com/cneira/firecracker-task-driver
  • error - package crypto/ed25519: unrecognized import path "crypto/ed25519" (import path does not begin with hostname)
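
The package crypto/ed25519 joined the Go standard library in Go 1.13, so this error is consistent with the toolchain (go1.11) being older than what one of the dependencies needs. The sketch below checks a version string against that threshold. `hasEd25519` is a hypothetical helper written for this example, and it deliberately treats unparsable version strings (such as development builds) as unsupported.

```go
package main

import (
	"fmt"
	"runtime"
)

// hasEd25519 reports whether a Go version string such as "go1.11" or
// "go1.13.4" is at least 1.13, the release that added crypto/ed25519
// to the standard library. Unparsable strings yield false.
func hasEd25519(version string) bool {
	var major, minor int
	if _, err := fmt.Sscanf(version, "go%d.%d", &major, &minor); err != nil {
		return false
	}
	return major > 1 || (major == 1 && minor >= 13)
}

func main() {
	v := runtime.Version()
	fmt.Printf("%s provides crypto/ed25519: %v\n", v, hasEd25519(v))
}
```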

Deadlock when stopping jobs

Nomad Versions: 1.0.3 and head
Firecracker: v0.22.4

How to reproduce:

  • test01-dc1.nomad
job "test01-dc1" {
  datacenters = ["dc1"]
  type        = "service"

  group "test01-dc1" {
    restart {
      attempts = 0
      mode     = "fail"
    }

    task "firecracker" {
      artifact {
        source = ".../vmlinux-5.4.0-rc5.tar.gz"
        destination = "."
      }
      artifact {
        source = ".../centos-7-x86_64_rootfs.tar.gz"
        destination = "."
      }
      driver = "firecracker-task-driver"
      config {
        Vcpus = 2
        Mem = 128
        Network = "test01-dc1"
      }
    }
  }
}
  • /etc/cni/conf.d/test01-dc1.confdefault
{
  "name": "test01-dc1",
  "cniVersion": "0.4.0",
  "plugins": [
    {
      "type": "macvlan",
      "master": "br0",
      "ipam": {
         "type": "static",
         "addresses": [
            {
                "address": "192.168.0.30/24",
                "gateway": "192.168.0.30"
            }
         ]
      }
    },
    {
      "type": "firewall"
    },
    {
      "type": "portmap",
      "capabilities": {
        "portMappings": true
      }
    },
    {
      "type": "tc-redirect-tap"
    }
  ]
}

Steps:
1 - Start the job
2 - Wait for it to initialize
3 - Stop the job
4 - List the job; it is shown as dead
5 - Check the allocation; the VM is still running

Add support for address_mode = "alloc"

@cneira Thanks for your update.

The driver still does not support address_mode = "alloc".

cni conf: /etc/cni/conf.d/firecracker.conflist

{
  "name": "firecracker",
  "cniVersion": "0.4.0",
  "plugins": [
    {
      "type": "ptp",
      "ipMasq": true,
      "ipam": {
        "type": "host-local",
        "subnet": "192.168.60.0/24",
        "resolvConf": "/etc/resolv.conf"
      }
    },
    {
      "type": "tc-redirect-tap"
    }
  ]
}

job config

job "hello" {
    datacenters = ["dc1"]
    type = "service"

    group "sshd" {
        network {
            # mode = "cni/mynet"
            port "ssh" {
                to = 22
            }
        }
        service {
            name = "sshd"
            port = "ssh"
            address_mode = "alloc"
            check {
                type = "tcp"
                interval = "10s"
                timeout = "2s"
                address_mode = "alloc"
            }
        }

        task "sshd" {
            driver = "firecracker-task-driver"

            config {
                KernelImage = "/home/ox0spy/projects/nomad/study/firecracker/vmlinux.bin"
                BootDisk = "/home/ox0spy/projects/nomad/study/firecracker/rootfs.ext4"
                Firecracker = "/usr/local/bin/firecracker"
                Vcpus       = 1
                Mem         = 128
                Network     = "firecracker"
            }
        }
    }
}

docs for address_mode in service block: https://www.nomadproject.io/docs/job-specification/service#address_mode

run job

nomad status <alloc-id> returned the following error message:

Setup Failure  failed to setup alloc: pre-run hook "group_services" failed: unable to get address for service "sshd": cannot use address_mode="alloc": no allocation network status reported

Originally posted by @ox0spy in #9 (comment)

firecracker-task-driver err="rpc error: code = Unimplemented desc = unknown service plugin.GRPCStdio"

Nomad: 1.1.2

Logs

    2021-09-14T23:49:13.352+0200 [DEBUG] agent.plugin_loader: starting plugin: plugin_dir=/opt/nomad/plugins path=/opt/nomad/plugins/firecracker-task-driver args=[/opt/nomad/plugins/firecracker-task-driver]
    2021-09-14T23:49:13.353+0200 [DEBUG] agent.plugin_loader: plugin started: plugin_dir=/opt/nomad/plugins path=/opt/nomad/plugins/firecracker-task-driver pid=1765320
    2021-09-14T23:49:13.353+0200 [DEBUG] agent.plugin_loader: waiting for RPC address: plugin_dir=/opt/nomad/plugins path=/opt/nomad/plugins/firecracker-task-driver
    2021-09-14T23:49:13.512+0200 [DEBUG] agent.plugin_loader: using plugin: plugin_dir=/opt/nomad/plugins version=2
    2021-09-14T23:49:13.512+0200 [DEBUG] agent.plugin_loader.firecracker-task-driver: plugin address: plugin_dir=/opt/nomad/plugins network=unix address=/tmp/plugin021821091 timestamp=2021-09-14T23:49:13.510+0200
    2021-09-14T23:49:13.522+0200 [DEBUG] agent.plugin_loader.stdio: received EOF, stopping recv loop: plugin_dir=/opt/nomad/plugins err="rpc error: code = Unimplemented desc = unknown service plugin.GRPCStdio"
    2021-09-14T23:49:13.533+0200 [DEBUG] agent.plugin_loader: plugin process exited: plugin_dir=/opt/nomad/plugins path=/opt/nomad/plugins/firecracker-task-driver pid=1765320
    2021-09-14T23:49:13.538+0200 [DEBUG] agent.plugin_loader: plugin exited: plugin_dir=/opt/nomad/plugins
    2021-09-14T23:49:13.656+0200 [DEBUG] agent.plugin_loader.docker: using client connection initialized from environment: plugin_dir=/opt/nomad/plugins
    2021-09-14T23:49:13.656+0200 [DEBUG] agent.plugin_loader.docker: using client connection initialized from environment: plugin_dir=/opt/nomad/plugins

Bug: veth is not released when the MicroVM restarts

I am working on a project where I deploy micro VMs using Nomad. The driver works fine in general, but there is an issue: when a VM restarts, or when we update the job with a new rootfs, the VM fails to start. When I dug deeper, I found that the driver is unable to assign an IP to the new VM because the IP range is exhausted. Further troubleshooting showed that many Firecracker VMs had been created: with each restart, more VMs are provisioned uncontrollably until the whole IP range is exhausted. Kindly refer to the screenshots to support my case. I am still trying to understand this behavior of the driver. I would expect it to update the rootfs, restart the VM with the new rootfs, and assign the same IP or a new one, so why is it creating VMs in the background? I would really appreciate help here.

image

image

Readme improvements

I'm trying to follow the readme, but I'm running into lots of issues understanding it. If I make it through, I'll try to make a PR with some clarity changes, but I wanted to note some issues that I was wondering about upfront.

  1. In the container network config section, I tried figuring out how to install both repos, but couldn't make it work.
  2. It would be nice to know what is the "minimum to install" from the requirements, vs "nice to have" (if there are any that aren't required, I haven't made it through).
  3. I tried glancing over the rootfs and image section, but don't really understand why it's needed. This might just be my lack of Firecracker understanding though.

I'm also wondering, why do all the task driver options start with an uppercase letter? Makes them quite unpleasant to have in a Nomad file while other options are lowercase afaik.

Anyways, we'll see how far I get, but some insight might be nice :)

Request for examples

What would be really helpful is if there were examples of using this driver.

Examples I would find particularly useful:

  1. Connecting to another task within a group where one of those tasks is a Firecracker VM (does one just talk to localhost?)
  2. Placing artifact data into the task
  3. Working with environment variables, or noting that it's not possible to do so
