ekho's Introduction

Extending Kernel Hardware Offload

Introduction

NIC and DPU vendors (and users) have only a limited number of APIs for leveraging the hardware offload capabilities offered by the kernel. Today these kernel APIs include TC, Netfilter, Switchdev and fdb_notifier.

DPU/NIC vendors that wish to enable and/or use hardware offload capabilities need to implement one or more of these APIs as part of their device driver and then follow a lengthy process to upstream their driver to the kernel. Vendors (and users) are also limited by what these APIs have to offer. If their hardware supports more capabilities than the APIs expose, they must either find a way to extend the kernel API (which may or may not be accepted upstream), or find another way to leverage these capabilities, which typically means maintaining an out-of-tree driver alongside any additional software needed for NIC/DPU management.

The goal of this POC is to explore an alternative method that can work alongside the Kernel APIs to enable users to take full advantage of the Hardware capabilities offered by the NICs or DPUs.

Purpose of the Proof of Concept (POC)

The purpose of the POC is to explore whether hardware offload enablement can be decoupled from kernel driver development. There are multiple mechanisms to evaluate, each of which could contribute to an overall solution for configuring and using NIC/DPU hardware offloads with minimal driver development.

  • Can netlink listeners be used to enable an offload path entirely in user space?
  • Can eBPF be used to extend the offload functionality of kernel drivers?
  • Can eBPF be used to provide a notification path to user space for kernel notifiers?
  • Can eBPF be used to enable new types of kernel offload beyond the current scope of what the kernel offload APIs support?

If offload processing can be moved outside of the kernel then it would be possible to use vendor libraries from user space to do the hardware programming.

Overview of the Technical Solution

The following document provides an overview of the offload APIs the kernel uses to talk to drivers and the ways that BPF could be used to hook into those APIs from user space.

Mirroring kernel networking state

The primary mechanism we want to use for mirroring the kernel networking state is to listen to existing netlink notifications to learn the networking state and keep in sync with the kernel. Netlink will provide an accurate view of the configured state and should provide notifications about derived state. We expect there to be gaps in what netlink notifications provide so we will need to consider:

  • Pushing patches upstream to fill the netlink gaps (slow and not guaranteed to be accepted)
  • Using BPF kprobes to provide an additional notification stream
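As a rough illustration of the netlink listening side, the sketch below walks a buffer of netlink messages as it would be returned by a receive call on a netlink socket and names the rtnetlink message types it recognizes. The header layout and the RTM_* values come from the kernel UAPI headers; the function names (iter_messages, describe) are hypothetical and not part of this project.

```python
import struct

# struct nlmsghdr: u32 len, u16 type, u16 flags, u32 seq, u32 pid
NLMSG_HDR = struct.Struct("=IHHII")
NLMSG_ALIGNTO = 4

# A few rtnetlink message types from <linux/rtnetlink.h>
RTM_NAMES = {
    16: "RTM_NEWLINK", 17: "RTM_DELLINK",
    20: "RTM_NEWADDR", 21: "RTM_DELADDR",
    24: "RTM_NEWROUTE", 25: "RTM_DELROUTE",
    28: "RTM_NEWNEIGH", 29: "RTM_DELNEIGH",
}


def nlmsg_align(length):
    """Round a message length up to the netlink alignment boundary."""
    return (length + NLMSG_ALIGNTO - 1) & ~(NLMSG_ALIGNTO - 1)


def iter_messages(buf):
    """Yield (type, flags, payload) for each complete message in buf."""
    offset = 0
    while offset + NLMSG_HDR.size <= len(buf):
        length, msg_type, flags, _seq, _pid = NLMSG_HDR.unpack_from(buf, offset)
        if length < NLMSG_HDR.size or offset + length > len(buf):
            break  # truncated or malformed message: stop walking
        payload = buf[offset + NLMSG_HDR.size:offset + length]
        yield msg_type, flags, payload
        offset += nlmsg_align(length)


def describe(buf):
    """Return a readable name for every message type found in buf."""
    return [RTM_NAMES.get(t, f"type {t}") for t, _, _ in iter_messages(buf)]
```

A mirroring agent would feed each payload into a per-family decoder and update its copy of the networking state; that part is omitted here.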

A significant gap when using netlink to mirror state is that stats need to flow in the opposite direction. Stats are normally collected by the kernel, but stats for hardware offloaded flows are typically collected by drivers on demand when a netlink stats request is received. It will be necessary to hook into the request for stats and populate the values using an out-of-band mechanism. We envisage using a BPF "firmware" program to populate the stats structures from data collected into BPF maps by a user space process. We need to explore the following approaches:

  • Using BPF struct_ops as a way to attach a BPF program to the FLOW_CLS_STATS command in a netdev (implies UAPI changes, which are discouraged)
  • Using a combination of BPF kprobes and kfuncs to achieve the same thing (worse UX)
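The intended data flow can be modelled in plain Python, independent of which attachment mechanism is chosen. In this toy sketch a dictionary stands in for the BPF map, agent_update plays the user space process that writes hardware counters, and on_stats_request plays the hook that runs when a stats request arrives. All names are hypothetical, and reporting deltas since the previous request is an assumption of the sketch rather than a documented requirement.

```python
from dataclasses import dataclass


@dataclass
class FlowStats:
    packets: int = 0
    bytes: int = 0


class StatsBridge:
    """Toy stand-in for a BPF map shared by the agent and the stats hook."""

    def __init__(self):
        self.hw_totals = {}  # flow cookie -> running totals read from hardware
        self.reported = {}   # flow cookie -> totals already handed to the kernel

    def agent_update(self, cookie, packets, bytes_):
        """User space side: store the latest hardware totals for a flow."""
        self.hw_totals[cookie] = FlowStats(packets, bytes_)

    def on_stats_request(self, cookie):
        """Hook side: return what changed since the previous request."""
        total = self.hw_totals.get(cookie, FlowStats())
        last = self.reported.get(cookie, FlowStats())
        delta = FlowStats(total.packets - last.packets, total.bytes - last.bytes)
        self.reported[cookie] = FlowStats(total.packets, total.bytes)
        return delta
```

The point of the model is the direction of flow: the agent writes at its own pace, and the hook only reads what is already in the map, so a stats request never has to wait on user space.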

Another known gap with netlink notifications is that the kernel also uses notifiers to inform drivers about offloadable state. These notifiers include the switchdev notifier which is used to push MAC and VLAN offloads to drivers and the FIB notifier which is used to push L3 forwarding rules offloads to drivers. The proof of concept bpf-notifier uses BPF kprobes that attach to the notifiers to send the notifications to user space via BPF ring buffers.
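On the user space side, each ring buffer record has to be decoded back into a notification. The sketch below assumes a hypothetical fixed record layout for an FDB (MAC/VLAN) event; the field order, the event codes and the function name are illustrative only and do not describe the actual bpf-notifier record format.

```python
import struct

# Hypothetical record layout: u32 event, u32 ifindex, u16 vid, 6-byte MAC address
FDB_EVENT = struct.Struct("=IIH6s")

# Illustrative event codes, not the kernel's switchdev enum values
EVENT_NAMES = {1: "fdb_add", 2: "fdb_del"}


def decode_fdb_event(record):
    """Decode one ring buffer record into a dictionary."""
    event, ifindex, vid, mac = FDB_EVENT.unpack_from(record)
    return {
        "event": EVENT_NAMES.get(event, f"unknown({event})"),
        "ifindex": ifindex,
        "vid": vid,
        "mac": ":".join(f"{b:02x}" for b in mac),
    }
```

A real consumer would receive these records through a ring buffer polling callback and pass the decoded event to the hardware programming layer.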

Extending offloads with BPF "firmware"

The previously discussed offload methods are passive: they either listen to netlink notifications entirely in user space or use kprobes to eavesdrop on the notifiers and pass the notifications on to user space.

A more active approach to extending hardware offload would be to provide explicit hooks where BPF programs implement the hardware offload APIs on behalf of a driver. The offload mechanism could be one of:

  • Passing the offload request to user space where it can be fulfilled asynchronously.
  • Using kfuncs to directly implement the offload within the driver (security / sandbox limitations are unknown)

This method of using BPF firmware for drivers is attractive because it decouples offload feature development from the upstream kernel development and release life cycle. It is also an explicit integration with the kernel offload APIs that gives the kernel visibility of success or failure.

Enabling new types of offload

TODO

Development Phases

Phase 1 – Netlink + Stats offload

In phase 1 we propose to investigate a minimum viable product:

  • Netlink notifications for mirroring networking state to the DPU hardware.
  • Implement BPF struct_ops for the flow offload API and use it to integrate hardware stats collection into a driver.

The proof of concept implementation can be done in the netdevsim driver which exists to test driver APIs without requiring any hardware.

Phase 2 – Implement offloads with BPF firmware

In phase 2 we propose to investigate using BPF firmware to implement hardware offloads from within a driver:

  • Extend the struct_ops from phase 1 to add hardware offload operations
  • Investigate using BPF kfuncs to implement the hardware control part of the offload mechanism.

Phase 3 – Enabling new types of offload

In phase 3 we propose to investigate enabling entirely new types of offload without needing to upstream new APIs into the kernel.

ekho's People

Contributors

donaldh, maryamtahhan

ekho's Issues

Extend YNL to support explicit model for rtnetlink message ids

rtnetlink doesn't seem to fit into unified or directional message enumeration models. It seems like an 'explicit' model would be useful, to require the schema author to specify the message ids directly.

The goal of this task is to extend YNL to support this explicit model and support commands and notifications simultaneously from the same netlink-raw spec.
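To see why the ids need to be stated explicitly, consider rtnetlink links: an RTM_GETLINK request uses id 18, its reply is an RTM_NEWLINK message with id 16, and the same id 16 also arrives as an unsolicited notification. The sketch below models an explicit table in Python; the table shape and function names are hypothetical and are not the YNL schema itself.

```python
# Explicit per-operation message ids (values from <linux/rtnetlink.h>)
OPS = {
    "newlink": {"request": 16},
    "dellink": {"request": 17},
    "getlink": {"request": 18, "reply": 16},
    "getroute": {"request": 26, "reply": 24},
}

# Ids that may arrive unsolicited on a multicast group
NOTIFICATIONS = {16: "newlink", 17: "dellink", 24: "newroute", 25: "delroute"}


def request_id(op):
    """Message id to send for an operation."""
    return OPS[op]["request"]


def classify(msg_type, pending_op=None):
    """Classify a received message as a reply to pending_op or a notification."""
    if pending_op is not None and OPS[pending_op].get("reply") == msg_type:
        return ("reply", pending_op)
    if msg_type in NOTIFICATIONS:
        return ("notification", NOTIFICATIONS[msg_type])
    return ("unknown", msg_type)
```

Matching on a pending operation is a simplification; a real client would also check the sequence number and port id to tell replies from notifications.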

DOD: Patches submitted to the Kernel

OVS-datapath YNL support

Define an OVS datapath yaml specification for YNL and upstream to the kernel

DOD: patches submitted to the kernel

Extend YNL to behave like a netlink agent

Right now YNL just runs to completion when it executes with a single spec as the argument.

The goal of this task is to extend it to run like an agent that's capable of monitoring multiple multicast groups at once.
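For the classic rtnetlink groups, subscribing to several groups at once amounts to binding one socket with a combined group bitmask. The following is a minimal sketch using only the Python standard library; it is Linux only, the short group names are made up for the example, and it does not cover generic netlink families, whose group ids are resolved at run time.

```python
import socket

# Legacy rtnetlink multicast group bits from <linux/rtnetlink.h>
RTMGRP = {
    "link": 0x1,
    "neigh": 0x4,
    "ipv4-ifaddr": 0x10,
    "ipv4-route": 0x40,
}


def group_mask(names):
    """Combine the named groups into a single bind() bitmask."""
    mask = 0
    for name in names:
        mask |= RTMGRP[name]
    return mask


def open_monitor(names):
    """Open a NETLINK_ROUTE socket subscribed to all named groups (Linux only)."""
    sock = socket.socket(socket.AF_NETLINK, socket.SOCK_RAW, socket.NETLINK_ROUTE)
    sock.bind((0, group_mask(names)))  # pid 0 lets the kernel assign a port id
    return sock
```

An agent loop would then call recv() on the returned socket and dispatch each message by type instead of exiting after the first response.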

DOD: patches pushed to Redhat-et repo

POC: Kernel notifier framework user space notifications - upstream Kernel changes

Implementing the POC required the following kernel changes:

  1. Allowing kernel/module access to BPF maps by path, similar to the access already allowed for programs.
  2. Exposing the following ringbuf functionality previously available only to eBPF callers via the helper interface (helper function names are used for clarity; kernel-side additions are prefixed with _ for POC purposes):
    a. bpf_ringbuf_reserve()
    b. bpf_ringbuf_commit()
    c. bpf_ringbuf_discard()
    d. bpf_ringbuf_output() - this can be implemented using a, b and c, so not essential. Added in order to not duplicate code.
    e. bpf_ringbuf_query() - not used in the POC
    f. bpf_user_ringbuf_drain() - should probably be dropped in future versions as it is very eBPF helper specific
  3. Adding the following new kernel functionality
    a. bpf_ringbuf_fetch_next() - fetch next record. The functionality is similar to bpf_user_ringbuf_drain, but without the callback semantics
    b. bpf_ringbuf_has_data() - may be dropped in the future - intended for wait conditions
    c. void ringbuf_wait_for_data() - not functional yet due to lack of wake up triggers
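The semantics of these operations can be illustrated with a toy in-memory model. This is not the kernel implementation and none of the names below are kernel symbols; it only mirrors the behaviour described above: a reserved record is invisible to the consumer until it is committed, a discarded record is skipped, output is built from reserve plus commit, and fetch_next returns records in reservation order without a callback.

```python
class ToyRingbuf:
    BUSY, READY, DISCARDED = range(3)

    def __init__(self, capacity):
        self.capacity = capacity
        self.records = []  # [state, bytearray] pairs in reservation order
        self.used = 0

    def reserve(self, size):
        """Reserve space for a record; returns None when the buffer is full."""
        if self.used + size > self.capacity:
            return None
        rec = [self.BUSY, bytearray(size)]
        self.records.append(rec)
        self.used += size
        return rec

    def commit(self, rec):
        rec[0] = self.READY

    def discard(self, rec):
        rec[0] = self.DISCARDED

    def output(self, payload):
        """Copy payload into the buffer using reserve and commit."""
        rec = self.reserve(len(payload))
        if rec is None:
            return False
        rec[1][:] = payload
        self.commit(rec)
        return True

    def _drop_discarded(self):
        while self.records and self.records[0][0] == self.DISCARDED:
            self.used -= len(self.records.pop(0)[1])

    def has_data(self):
        """True when the oldest live record has been committed."""
        self._drop_discarded()
        return bool(self.records) and self.records[0][0] == self.READY

    def fetch_next(self):
        """Return the next committed record, or None if the head is not ready."""
        if not self.has_data():
            return None
        _, data = self.records.pop(0)
        self.used -= len(data)
        return bytes(data)
```

Note that a record still in the reserved state blocks delivery of later records, which keeps the consumer's view in reservation order.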

The goal of this task is to create a patchset for the changes to the kernel and submit it to the BPF mailing list for review.

Testing the netlink agent on OCP cluster

The goal of this task is to create a set of test cases and a test report for testing the Netlink agent on OCP.

DOD: Test specification and results documented and pushed to EKHO repo

Extending offloads with BPF - Adding a new type of struct_ops for offloads

A significant gap when using netlink to mirror state is that stats need to flow from the driver to user space via the kernel. Stats are normally collected by the kernel, but stats for hardware offloaded flows are typically collected by drivers on demand when a netlink stats request is received. It will be necessary to hook into the request for stats and populate the values using an out-of-band mechanism.

The first step in enabling such a mechanism involves adding a new type of struct_ops for offloads. The goal of this task is to enable this support in the Kernel.

DOD - patches pushed to EKHO repo

Extend YNL to support create/replace/excl flags for netlink-raw

The goal of this task is to extend YNL to support adding user-provided flags to operations to expose the CREATE, REPLACE, EXCL, APPEND semantics.

This could be achieved either with a --flags argument or maybe with new verbs in addition to do and dump.
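As a sketch of the flags option, user-provided names could be mapped onto the NLM_F_* bits from <linux/netlink.h> and OR-ed into the request flags. The constant values are the kernel's; the lowercase flag names, the function name and the choice to always include NLM_F_ACK are assumptions made for the example.

```python
# Flag bits from <linux/netlink.h>
NLM_F_REQUEST = 0x001
NLM_F_ACK = 0x004
NLM_F_REPLACE = 0x100
NLM_F_EXCL = 0x200
NLM_F_CREATE = 0x400
NLM_F_APPEND = 0x800

USER_FLAGS = {
    "replace": NLM_F_REPLACE,
    "excl": NLM_F_EXCL,
    "create": NLM_F_CREATE,
    "append": NLM_F_APPEND,
}


def build_flags(names):
    """Build nlmsg_flags for a 'do' request from user-provided flag names."""
    flags = NLM_F_REQUEST | NLM_F_ACK
    for name in names:
        try:
            flags |= USER_FLAGS[name.lower()]
        except KeyError:
            raise ValueError(f"unknown netlink flag: {name}") from None
    return flags
```

For comparison, iproute2 expresses its verbs with the same bits: "add" is CREATE with EXCL, "replace" is CREATE with REPLACE, "append" is CREATE with APPEND, and "change" is REPLACE alone.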

POC: Extending Kernel notifier framework to send messages to user space

Presently, all Kernel Hardware Offload L2 and L3 notifications have no user space visibility and no means of accessing/controlling them via netlink. They are intended to be used by device drivers. An offload capable device driver registers a listener, receives these notifications, processes the information and informs the kernel if it performs any offloads. At that point, the kernel will mark any offloaded L2 and L3 forwarding entries. Expiration timers and stats for them will no longer be available.

The goal of this task is to enable a user space notification mechanism for these events, so that the kernel sends notifications to a user space application, which can also respond to them.

DOD: patches pushed to a github repo

OVS Flow YNL support

Define an OVS flow yaml specification for YNL and upstream to the kernel

DOD: patches submitted to the kernel

Extend YNL to support netlink-raw families

YNL currently supports generic netlink only. A number of core changes are required in order to support netlink-raw, as the message format differs from generic netlink.

The goal of this task is to extend YNL to support raw netlink and to push the resulting patches to the kernel
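The format difference can be shown by packing one message of each kind. In generic netlink the nlmsg_type field carries the family id and the command sits in a 4-byte genlmsghdr; in raw rtnetlink the nlmsg_type is the operation itself and is followed by a family-specific fixed header such as ifinfomsg. The struct layouts follow the kernel UAPI headers; the helper names and the family id used in any example call are hypothetical.

```python
import struct

NLMSGHDR = struct.Struct("=IHHII")    # len, type, flags, seq, pid
GENLMSGHDR = struct.Struct("=BBH")    # cmd, version, reserved
IFINFOMSG = struct.Struct("=BxHiII")  # family, pad, type, index, flags, change


def genl_message(family_id, cmd, version, flags, seq, attrs=b""):
    """Generic netlink: family id in nlmsg_type, command in genlmsghdr."""
    body = GENLMSGHDR.pack(cmd, version, 0) + attrs
    return NLMSGHDR.pack(NLMSGHDR.size + len(body), family_id, flags, seq, 0) + body


def rtnl_link_message(msg_type, flags, seq, ifindex=0, attrs=b""):
    """Raw rtnetlink: operation in nlmsg_type, followed by struct ifinfomsg."""
    body = IFINFOMSG.pack(0, 0, ifindex, 0, 0) + attrs  # family 0 is AF_UNSPEC
    return NLMSGHDR.pack(NLMSGHDR.size + len(body), msg_type, flags, seq, 0) + body
```

Because the fixed header differs per raw family, a netlink-raw spec has to describe it, whereas generic netlink families all share the same genlmsghdr.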

OVS vport YNL support

Define an OVS vport yaml specification for YNL and upstream to the kernel

DOD: patches submitted to the kernel

Extend YNL to support rtnetlink notifications

YNL does not yet support rtnetlink (netlink-raw) notifications because it currently doesn't support defining 'event' properties on a 'do' operation. The goal of this task is to enable this support in YNL.

DOD: Patches submitted to the Kernel

Extend YNL to be configured via a json configuration file

Define a YNL configuration json schema and extend YNL to be configurable via a json file rather than the command line. This will be the first step towards configuring YNL as a netlink agent. Right now it takes just one spec via a command line argument, which means it can only process a single spec at a time.
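One possible shape for such a configuration is sketched below, purely as an assumption to make the idea concrete: a list of families, each naming a spec file and the multicast groups to subscribe to. The keys, file names, group names and loader function are all hypothetical; no schema has been defined yet.

```python
import json

# Hypothetical configuration: one entry per netlink family the agent handles
EXAMPLE_CONFIG = """
{
  "families": [
    {"spec": "rt_link.yaml", "subscribe": ["rtnlgrp-link"]},
    {"spec": "ovs_flow.yaml", "subscribe": []}
  ]
}
"""


def load_config(text):
    """Parse the configuration and return a list of (spec, groups) pairs."""
    cfg = json.loads(text)
    families = cfg.get("families")
    if not isinstance(families, list) or not families:
        raise ValueError("config must contain a non-empty 'families' list")
    result = []
    for entry in families:
        if "spec" not in entry:
            raise ValueError("each family entry needs a 'spec' path")
        result.append((entry["spec"], list(entry.get("subscribe", []))))
    return result
```

Loading several specs from one file is what allows a single process to handle more than one family, which the single command line argument cannot express.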

Test the Kernel Notifier POC on OCP cluster

The goal of this task is to create a set of test cases and a test report for testing the Kernel Notifier POC on OCP.

DOD: Test specification and results documented and pushed to EKHO repo
