intel / intel-data-center-gpu-driver-for-openshift

Intel Data Center GPU Drivers for Red Hat OpenShift Container Platform

Home Page: https://catalog.redhat.com/software/containers/intel/intel-data-center-gpu-driver-container/6495ee55c8b2461e35fb8264

License: Apache License 2.0

Dockerfile 100.00%
cse drivers i915 kmm pmt rhocp intel-gpu gpu

intel-data-center-gpu-driver-for-openshift's Introduction

Intel® Data Center GPU Driver for OpenShift*

Overview

The Intel Data Center GPU Driver for OpenShift project focuses on the development, packaging, certification, and release of Intel® Data Center GPU driver container images for the Red Hat OpenShift Container Platform (RHOCP). This project allows users to leverage the pre-built driver container image to facilitate provisioning of Intel Data Center GPU cards on an OpenShift cluster. Furthermore, users can utilize the Intel Data Center GPU driver dockerfile provided by this project as a reference for constructing their own driver container images on-premises. Intel Data Center GPU driver container images for OpenShift are certified and published on the Red Hat Container Catalog.

The Intel Data Center GPU driver container image is built from the Intel GPU Repository. It includes:

Install Intel Data Center GPU Driver on RHOCP

We recommend users use the Kernel Module Management (KMM) operator to install and manage the Intel Data Center GPU driver on RHOCP. The KMM operator can be used to deploy all the necessary driver components as well as the firmware from within the driver container image.

To install Intel Data Center GPU drivers on OpenShift using the KMM operator, please follow the pre-build mode instructions from the Intel Technology Enabling for OpenShift project.

For users who prefer to create customized driver container images, the on-premises build mode is available as an option. This mode enables users to build and deploy their own container images on their OpenShift cluster.

Upgrade Intel Data Center GPU Driver with RHOCP

Upgrading the Intel Data Center GPU drivers is supported in two scenarios:

  • Driver Upgrade Scenario: This scenario applies when there is a new release from the Intel GPU Driver Repository. After evaluation, a corresponding Intel Data Center GPU driver container image is built, certified, and published on the Red Hat Container Catalog. Users can then use the KMM Operator to upgrade the Intel Data Center GPU driver on RHOCP.
    Note: The seamless upgrade feature is still under development in the Kernel Module Management project.
  • Kernel Upgrade Scenario: To ensure compatibility with each new Red Hat CoreOS (RHCOS) kernel used by RHOCP, the Intel GPU driver container images are rebuilt against the corresponding kernel version. The rebuilt image is certified and then published on the Red Hat Container Catalog. The KMM Operator can be used to deploy the driver container image matching the new RHCOS kernel version when upgrading the RHOCP cluster.

Support

If you encounter any issues or have questions regarding the Intel Data Center GPU Driver with RHOCP, we recommend seeking support through the following channels:

Commercial Support from Red Hat

This project facilitates provisioning pre-built Intel Data Center GPU drivers with Intel Technology Enabling for OpenShift project, which are then certified and published in the Red Hat Container Catalog. Commercial RHOCP release support is outlined in the Red Hat OpenShift Container Platform Life Cycle Policy and Intel collaborates with Red Hat to address specific requirements from our users.

Open-Source Community Support

Intel Data Center GPU Drivers for OpenShift is run as an open-source project on GitHub. Project GitHub issues related to pre-built and on-premises driver builds can be used as the primary support interface for users to submit feature requests and report issues to the community. Please provide detailed information about your issue and steps to reproduce it, if possible.

Contribute

See CONTRIBUTING for more information.

License

Distributed under the open source license. See LICENSE for more information.

Security

To report a potential security vulnerability, please refer to the security.md file.

Code of Conduct

Intel has adopted the Contributor Covenant as the Code of Conduct for all of its open source projects. See the CODE_OF_CONDUCT file.

intel-data-center-gpu-driver-for-openshift's People

Contributors

chaitanya1731, hershpa, mregmi, rdower, umartinxu

intel-data-center-gpu-driver-for-openshift's Issues

v2.1.0 Release Checklist

Includes OOT Intel GPU driver release for RHOCP 4.14.0 and above
Checklist:

  • Update dockerfile with the latest driver release - I915_23WW51.5_682.48_23.6.42_230425.56. Release Notes here
  • Build and validate driver container for 4.14.0 OCP using KMM
  • Driver signing
  • Update release notes, root readme (if required), release readme table
  • Preflight certification scanning and publish image on RH catalog
  • Cut release on github

How to handle scenario where out-of-tree driver depends on in-tree driver(s)

Summary

The Intel GPU i915 driver has a large dependency tree. It depends on several in-tree drivers present in /lib/modules/(KERNELRELEASE)/. When an out-of-tree driver depends on in-tree driver(s), the user needs to copy the in-tree drivers into /opt/lib/modules/(KERNELRELEASE)/ rather than using the in-tree drivers directly on the host. This is because everything is done under /opt, including modprobe -d /opt (done explicitly by KMM) and depmod -b /opt (done in the dockerfile).

Goal

The goal is to determine whether we can avoid copying in-tree drivers. Only out-of-tree drivers should be part of the driver container image.

Possible Solution

This is a complex problem for which a solution may not be feasible. If we continue to use /opt to avoid tainting the default /lib directory, in-tree drivers would have to be copied from the host with the help of a modules.dep file (this file would contain the dependency list and would be generated after placing the out-of-tree driver modules in /lib). From a KMM perspective, KMM would receive a driver container with exclusively out-of-tree drivers and then use the supplied modules.dep to copy over the in-tree drivers needed to load the out-of-tree driver successfully. This solution could involve host mounts and symlinks.
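The modules.dep lookup described above could be sketched roughly as follows. This is only an illustration, not KMM behaviour: the helper name list_module_deps and the sample file content are hypothetical, and the sketch assumes the usual text layout written by depmod, where each line reads "module path: dependency paths".

```shell
#!/bin/sh
# Hypothetical helper: print the dependencies that a modules.dep file records
# for one module, one path per line. Assumes the depmod text layout
# "path/to/module.ko: path/to/dep1.ko path/to/dep2.ko".
list_module_deps() {
    dep_file=$1
    module_path=$2
    awk -v mod="$module_path:" \
        '$1 == mod { for (i = 2; i <= NF; i++) print $i }' "$dep_file"
}

# Illustrative modules.dep content, made up for this sketch.
sample_dep=$(mktemp)
cat > "$sample_dep" <<'EOF'
extra/i915.ko: extra/intel_vsec.ko kernel/drivers/gpu/drm/drm.ko.xz
extra/intel_vsec.ko:
EOF

# Lists the modules that would have to be present before i915 can load.
list_module_deps "$sample_dep" extra/i915.ko
rm -f "$sample_dep"
```

Entries that start with kernel/ in such a listing would be the in-tree candidates for copying or linking from the host; how that copy would actually be performed is left open here, as it is in the issue.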

Build Argument KERNEL_FULL_VERSION not available in dockerfile

Summary:
The build argument $KERNEL_FULL_VERSION was assumed to be populated automatically by KMM in the dockerfile. It turns out that $KERNEL_FULL_VERSION is not populated in the dockerfile, although it is available in the KMM Module. Currently, ${KERNEL_VERSION} is the only default build argument that KMM populates automatically in the dockerfile. An issue has been created on KMM upstream to ensure that KMM provides $KERNEL_FULL_VERSION as a default build argument, to avoid user confusion and maintain consistency.

Impact:
Analysis shows that $KERNEL_FULL_VERSION is an empty string when the variable is echoed in the dockerfile. Because the variable is empty, the COPY instruction effectively becomes COPY --from=builder /lib/modules/ /opt/lib/modules/.
This copies everything under /lib/modules in the builder image into the final image.

We also noticed this in the build log: [Warning] one or more build args were not consumed: [KERNEL_VERSION].

Short-term solution:
Add this in the 2nd stage of the dockerfile:
ARG KERNEL_VERSION
ARG KERNEL_FULL_VERSION=${KERNEL_VERSION}
Use ${KERNEL_FULL_VERSION} as needed in the 2nd stage.
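
The effect of the empty variable, and of falling back to KERNEL_VERSION, can be illustrated in plain shell. This is a sketch under assumptions: the helper names modules_dir and resolve_kernel_version are hypothetical, and the path template /lib/modules/<version>/ is inferred from the COPY line quoted above rather than taken from the dockerfile itself.

```shell
#!/bin/sh
# Hypothetical helper mirroring the path the COPY instruction is assumed to
# build: /lib/modules/<full kernel version>/
modules_dir() {
    printf '/lib/modules/%s/\n' "$1"
}

# Fallback in the spirit of the short-term workaround: use the full version
# when it is set and non-empty, otherwise fall back to KERNEL_VERSION.
resolve_kernel_version() {
    full=$1
    base=$2
    printf '%s\n' "${full:-$base}"
}

KERNEL_VERSION=4.18.0-372.46.1.el8_6.x86_64   # example value used on this page
KERNEL_FULL_VERSION=                           # not populated, as in the report

# Without the fallback the version component is empty, so the path collapses
# to the whole modules tree.
modules_dir "$KERNEL_FULL_VERSION"

# With the fallback the path points at a single kernel's module directory.
modules_dir "$(resolve_kernel_version "$KERNEL_FULL_VERSION" "$KERNEL_VERSION")"
```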

Long-term solution:
KMM will add and automatically populate $KERNEL_FULL_VERSION as a default build argument in the dockerfile. Users can then use $KERNEL_FULL_VERSION as a default build argument in both the dockerfile and the KMM Module. This should be available in a later KMM release, most likely KMM 1.1 or later.

Once this is available, we will make the following change:
Before:
ARG KERNEL_VERSION
ARG KERNEL_FULL_VERSION=${KERNEL_VERSION}

After:
ARG KERNEL_FULL_VERSION

Improve Driver Container Image Size

Summary:

The dockerfile copies unnecessary files and directories to the final UBI minimal based driver container image. The goal is to copy only the necessary ko files and firmware binaries to keep the driver container image size as small as possible.

Solution:

Initial analysis of current driver container image shows we can safely copy all *.ko and *.ko.xz files under /lib/modules/4.18.0-372.46.1.el8_6.x86_64/. The following ko files would be copied. Note: *.ko.xz files are not shown below since there are several. The xz extension indicates it is a compressed ko file so it takes less space than a traditional ko file.

[Screenshot: list of ko files to be copied]

For firmware binaries, the proposed solution is to copy only dg2* firmware binaries since those are the only binaries that are needed for Intel Data Center GPU Flex Series. We will continue to copy the copyright license in /firmware/i915/license/ and the firmware binaries in /firmware/i915/.
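
A rough sketch of that selection, assuming a staging directory that mirrors the lib/modules/<kernel version>/ and firmware/i915/ layout mentioned above. The helper name list_needed_files, the staging tree, and the sample file names are hypothetical, and -maxdepth is assumed to be available (GNU and BSD find both provide it).

```shell
#!/bin/sh
# Hypothetical helper: list only the files the proposal keeps in the final
# image, given a root that contains lib/modules/<kver>/ and firmware/i915/.
list_needed_files() {
    root=$1
    kver=$2
    # Kernel modules, plain or xz-compressed.
    find "$root/lib/modules/$kver" -type f \( -name '*.ko' -o -name '*.ko.xz' \)
    # DG2 firmware binaries only, directly under firmware/i915/.
    find "$root/firmware/i915" -maxdepth 1 -type f -name 'dg2*'
    # License files shipped alongside the firmware.
    find "$root/firmware/i915/license" -type f
}

# Build a throwaway tree with made-up file names to show what gets selected.
demo=$(mktemp -d)
kver=4.18.0-372.46.1.el8_6.x86_64
mkdir -p "$demo/lib/modules/$kver/extra" "$demo/firmware/i915/license"
touch "$demo/lib/modules/$kver/extra/i915.ko" \
      "$demo/lib/modules/$kver/README.txt" \
      "$demo/firmware/i915/dg2_guc_70.9.1.bin" \
      "$demo/firmware/i915/other_fw.bin" \
      "$demo/firmware/i915/license/LICENSE"

# README.txt and other_fw.bin are left out of the listing.
list_needed_files "$demo" "$kver" | sed "s|^$demo||" | LC_ALL=C sort
rm -rf "$demo"
```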

Task Checklist:

  • 1. Make above changes in dockerfile
  • 2. Re-build and deploy driver container using updated dockerfile and KMM
  • 3. Test updated driver container to ensure no regression issues.
  • 4. Review and compare updated compressed driver container image size. Current compressed size is 256 MiB.
  • 5. Submit PR to update dockerfile after reviewing results.

Red Hat UBI Base Image Security/CVE Vulnerability for OOT driver container Image

Summary

How to handle Red Hat UBI Base Image Security/CVE Vulnerability for OOT driver container Image.

Detail

Following the suggestion from the KMM Operator project, the Red Hat UBI-minimal base image is used to package the Intel Data Center GPU driver container image.
During the Red Hat certification process, a CVE vulnerability was found in this base image. The vulnerability comes from the curl package and is tracked as CVE-2023-23916.
To resolve this vulnerability and pass Red Hat certification, we have to rebuild the image on a new UBI-minimal base image that includes the latest curl package with the CVE fix.

Analysis

However, this CVE raises the following potential problems, which deserve our attention.

  • Is the UBI-minimal base image good and safe enough for the OOT driver container image?
    The safest image is one that includes only the necessary packages. Every unnecessary package potentially brings vulnerability risk to the image. In this case, the Intel Data Center GPU driver container dockerfile clearly does not use the curl package at all. So is it possible to have an OOT driver-specific base image that includes only the packages needed for OOT driver container images? Or a minimal base image on which OOT driver container users install just the packages they need? To insmod the OOT driver module, the container has to run with elevated privileges, so any potential vulnerability might be very dangerous for the whole cluster.

  • Do all published OOT driver container images need to be updated, re-certified, and re-published?
    If a vulnerability is found in the base image, do we really need to rebuild the image, go through the certification process, and publish a new version of the driver container image? Certifying and publishing would be a huge effort across all published and certified images. Imagine that all of the published OOT driver container images are based on this vulnerable base image and need this effort.
    In addition, upgrading the driver container image is not an easy task; the related feature is still under development in the KMM project.
    Is there some way to relieve people of this effort?

Possible solutions

  • 1. Work out a base image tailored to the OOT driver container image
  • 2. Optimize the Red Hat certification and publishing process and technology to reduce this effort

Conclusion

According to the April 19 KMM upstream meeting, in the future it may even be possible to package only the kernel modules, without a base image. That might be the best solution to this issue.

Kernel ABI stability in OCP Minor Version may reduce rebuild efforts of driver container

Summary:

The Kernel Application Binary Interface (kABI) is a set of in-kernel symbols used by drivers and other kernel modules. Currently, the general idea is to rebuild and test the Intel GPU driver container image whenever the kernel version associated with a particular OCP z stream changes. This is the safest approach. Unfortunately, it requires continuous rebuild and test efforts that can be facilitated by automation but still carries a non-zero cost. It may be possible to reduce rebuild efforts based on the theory that no rebuild is required if the kernel ABI does not change across all z streams in a particular OCP minor version X.Y.

Potential Idea:

Assuming that the driver is using the list of stable symbols for which Red Hat guarantees ABI compatibility, consider the following.

Based on RHEL KB,

The kernel-abi-stablelists packages contain reference files, /lib/modules/kabi-/kabi_stablelist_, listing interfaces provided by the kernel that are considered to be stable by Red Hat engineering. Such interfaces are safe for long-term use by third-party loadable device drivers, as well as for other purposes.
With Red Hat Enterprise Linux 7 and 8, the stablelist is valid for the particular major release. This means that once a symbol has been introduced into kABI for a particular major release, it will not be removed, nor will its meaning be changed during that kernel major release complete life cycle.
With Red Hat Enterprise Linux 9, each minor release will have a unique stablelist that is valid throughout the minor release lifecycle. For more information on this, please refer to the following knowledgebase article;

Red Hat Enterprise Linux 9 kABI Policy
Red Hat recommends recompiling kernel modules against every minor release of Red Hat Enterprise Linux.

Based on this other KB, an OCP minor version always uses a certain minor RHEL version.

| RHCOS/OCP Version | RHEL Version |
|-------------------|--------------|
| 4.11              | RHEL 8.6     |
| 4.12              | RHEL 8.6     |
| 4.13              | RHEL 9.2     |

Tentative Conclusion:

It would be reasonable to conclude that for OCP 4.12 based on RHEL8.6, only 1 driver container is required to support all OCP 4.12.z versions as long as the kernel ABI stays the same. Similarly, all z streams for OCP 4.13 based on RHEL9.2 would require a single driver container image.

Goal:

The goal is to understand the pros, cons and the potential risk of this approach. Theoretically, it is possible to use the same driver container with different kernel version as long as the kernel ABI remains stable. It is important to note that

in very rare and special circumstances, a symbol in a kABI stablelist needs to be changed. For example, Red Hat could introduce kABI breakage when a critical security issue cannot be resolved without breaking kABI. Red Hat will inform the partners if such a situation should occur.

In general, even if rebuilds are avoided, it is reasonable to retest the existing driver container when the kernel version changes using automation to ensure compatibility and functionality.
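
As a rough illustration of the precondition discussed in this issue, the RHEL minor release can be read from the kernel release string and compared between two kernels. This sketch assumes the el<major>_<minor> naming seen in the kernel version quoted on this page; the helper names rhel_stream and same_rhel_stream and the second and third version strings are hypothetical. Sharing a RHEL minor release is only a hint that one driver container might be reusable; it does not prove kABI compatibility, so retesting is still needed.

```shell
#!/bin/sh
# Hypothetical helper: extract "<major>.<minor>" from a kernel release such
# as 4.18.0-372.46.1.el8_6.x86_64. Prints nothing if the pattern is absent.
rhel_stream() {
    printf '%s\n' "$1" |
        sed -n 's/.*\.el\([0-9][0-9]*\)_\([0-9][0-9]*\)\..*/\1.\2/p'
}

# Succeeds only when both kernels carry the same, non-empty RHEL stream.
same_rhel_stream() {
    a=$(rhel_stream "$1")
    b=$(rhel_stream "$2")
    [ -n "$a" ] && [ "$a" = "$b" ]
}

built_for=4.18.0-372.46.1.el8_6.x86_64      # kernel the image was built for
for node_kernel in 4.18.0-372.49.1.el8_6.x86_64 5.14.0-284.30.1.el9_2.x86_64; do
    if same_rhel_stream "$built_for" "$node_kernel"; then
        echo "$node_kernel: same RHEL stream, candidate for retest without rebuild"
    else
        echo "$node_kernel: different RHEL stream, rebuild required"
    fi
done
```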

Challenges/Observations of loading/unloading out of tree RHEL 9.2 based GPU drivers on RHEL 9.2 based OCP 4.13

Summary

There are new RHEL 9.2 based GPU drivers to provision Intel GPU Flex and Max Series. Good news: the new drivers are no longer incompatible with the ast driver. On RHEL 8.6 based OCP 4.12, the ast driver needed to be unloaded or blacklisted (via a machine config, which triggers a reboot) prior to loading the out-of-tree GPU drivers.

Challenges:

The in-tree i915 and intel_vsec drivers have to be unloaded prior to loading the out-of-tree drivers. KMM can only unload one in-tree driver as of now, but we now have a use case for unloading more than one in-tree driver. Potential short-term solution: unload intel_vsec outside of KMM, most likely using a machine config.

Once the out-of-tree drivers are loaded, unloading them is difficult because they always appear to be in use by a GUI subcomponent, i.e. the framebuffer. The exact root cause has not been determined, but once the out-of-tree drivers are loaded, the GPU is actively used by a component in the system that prevents the drivers from being unloaded. Finding the root cause needs more exploration due to the complexity. The lsof command was used to determine what was using the driver, but it did not provide any additional information.

Details:

2 components have changed:

  1. New GPU drivers/FW for RHEL 9.2
  2. New kernel for RHEL 9.2

KMM has a feature, available in version 1.1.1, that can be used to unload one in-tree driver.
We can use this feature to unload the in-tree i915, but we cannot unload more than one kmod. We now have a use case to unload more than one in-tree driver: i915 and intel_vsec for now, and potentially cse in the future.

3 Main Drivers for GPU: i915, intel_vsec (this is a prerequisite for i915), CSE (MEI)

Out of tree drivers behavior: Loading i915 driver will load the intel_vsec driver. Unloading i915 will unload intel_vsec.
In-tree driver behavior: Loading i915 does not load intel_vsec. Unloading i915 does not unload intel_vsec.

RHEL 9.2 based OCP 4.13 has a new kernel based on the 5.14.z upstream kernel. This is a huge jump from RHEL 8.6 based OCP 4.12, which used the 4.18.z upstream kernel.

Initial smoke test analysis and Observed Impact:

There are in-tree i915 and intel_vsec drivers in RHEL 9.2 (not loaded by default; they are only loaded by the kernel when it detects the GPU card via its PCI device ID). These two in-tree drivers do not support Intel GPU Flex or Max Series. The in-tree i915 driver provides display support functionality for Intel Client Arc GPUs. As a result, customers will notice the following message in dmesg:

sh-5.1# dmesg | grep graphics
[   12.385679] i915 0000:33:00.0: Your graphics device 56c0 is not properly supported by the driver in this

[  478.732896] i915 0000:33:00.0: Your graphics device 56c0 is not properly supported by the driver in this

Intel® Data Center GPU Flex 170 -> PCI ID is 56c0.

Observation 1:

If the in-tree intel_vsec driver is not unloaded prior to loading the out-of-tree i915 driver, unknown symbol errors are observed in dmesg.

[ 3238.466900] compat: loading out-of-tree module taints kernel.
[ 3238.466931] compat: module verification failed: signature and/or required key missing - tainting kernel
[ 3238.468361] COMPAT BACKPORTED INIT
[ 3238.468362] Loading modules backported from I915-23.6.37
[ 3238.468363] Backport generated by backports.git I915_23.6.37_PSB_230425.49
[ 3239.444973] i915: Unknown symbol intel_vsec_register (err -2)
[ 3271.091366] i915: Unknown symbol intel_vsec_register (err -2)
[ 3317.364301] i915: Unknown symbol intel_vsec_register (err -2)
[ 3376.362727] i915: Unknown symbol intel_vsec_register (err -2)

When we unload the in-tree intel_vsec driver and do nothing else different, the above issue is not observed.
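
To spot this condition quickly, the unresolved symbols can be pulled out of dmesg-style text. The filter below is a minimal sketch: the helper name missing_symbols is hypothetical, and it assumes the "Unknown symbol <name> (err <n>)" wording shown in the log above. The sample input is copied from that log.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log text on stdin and print each distinct
# symbol reported as "Unknown symbol <name> (err <n>)".
missing_symbols() {
    sed -n 's/.*: Unknown symbol \([A-Za-z0-9_]*\) (err [-0-9]*).*/\1/p' | sort -u
}

# Sample input taken from the dmesg excerpt above.
missing_symbols <<'EOF'
[ 3238.468361] COMPAT BACKPORTED INIT
[ 3239.444973] i915: Unknown symbol intel_vsec_register (err -2)
[ 3271.091366] i915: Unknown symbol intel_vsec_register (err -2)
EOF
```

On a node, the same filter could be fed from the live log with dmesg | missing_symbols.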

Observation 2:

When you delete the KMM Module CR, KMM attempts to unload the out-of-tree i915 driver via a PreStop hook, but it does not reload the in-tree i915 driver. This is by KMM design. Essentially, the kernel is tainted. In this case the cleanup fails: KMM is unable to unload the out-of-tree i915 driver because the module is reported as in use.

We are also unable to manually unload the out of tree i915 or intel_vsec driver.

sh-5.1# modprobe -rv intel_vsec
modprobe: FATAL: Module intel_vsec is in use.
sh-5.1# modprobe -rv i915      
modprobe: FATAL: Module i915 is in use.

lsmod output after the out-of-tree drivers are loaded. Keep an eye on the use counts in the 3rd column.

sh-5.1# lsmod | grep i915
i915                 3977216  4
intel_vsec             20480  1 i915
intel_gtt              24576  1 i915
compat                 24576  2 intel_vsec,i915
video                  61440  1 i915
drm_display_helper    172032  2 compat,i915
cec                    61440  2 drm_display_helper,i915
i2c_algo_bit           16384  2 ast,i915
drm_kms_helper        192512  5 ast,drm_display_helper,i915
drm                   581632  7 drm_kms_helper,compat,ast,drm_shmem_helper,drm_display_helper,i915

sh-5.1# lsmod | grep intel_vsec
intel_vsec             20480  1 i915
compat                 24576  2 intel_vsec,i915

It has been noted to document a dependency list diagram for out of tree GPU drivers as a future exercise.
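
Until such a diagram exists, the use count of a module can be read from lsmod-style text with a small filter. This is a sketch only: the helper name module_use_count is hypothetical, and it assumes the three leading columns (module name, size, use count) seen in the output above. A non-zero count is consistent with the "Module ... is in use" errors from modprobe -r.

```shell
#!/bin/sh
# Hypothetical helper: read lsmod-style text on stdin and print the use
# count (3rd column) of the module named in $1.
module_use_count() {
    awk -v mod="$1" '$1 == mod { print $3 }'
}

# Sample input copied from the lsmod output above.
lsmod_sample='i915                 3977216  4
intel_vsec             20480  1 i915
compat                 24576  2 intel_vsec,i915'

for m in i915 intel_vsec; do
    count=$(printf '%s\n' "$lsmod_sample" | module_use_count "$m")
    echo "$m use count: $count"
done
```

On a node, the same filter could be fed from the live module list with lsmod | module_use_count i915.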

MEI warnings observed in dmesg after i915 driver loaded

Summary:

MEI warnings were observed in dmesg when out of tree i915 driver is loaded on RHEL 9.2 based OCP 4.13.10. Note, MEI is the original name of the driver. The out of tree driver equivalent is called CSE. Refer to dmesg output below. The goal of this issue is to understand if these warnings are expected, what is the meaning behind the warnings, and what is the root cause and potential solution.

Theory:

This behavior may be expected as we are not unloading the in-tree CSE (aka MEI) driver. As a result, the out of tree MEI driver is never loaded and thus the in-tree MEI is potentially trying to use out of tree FW to initialize. This is an incompatibility.

Solution:

A potential solution is to unload the in-tree MEI driver and see whether the out-of-tree MEI driver then loads successfully and the warnings go away. This solution needs to be tested. This is another potential use case for KMM to unload more than one in-tree driver.

dmesg output:

[ 3567.884499] intel_vsec 0000:38:00.0: enabling device (0140 -> 0142)
[ 3567.885148] intel_vsec 0000:3d:00.0: enabling device (0140 -> 0142)
[ 3569.550055] [drm] I915 BACKPORTED INIT 
[ 3569.551787] i915 0000:37:00.0: [drm] GT count: 1, enabled: 1
[ 3569.551829] clipped [mem 0x000a0000-0x000bffff] to [mem 0x00100000-0x000bffff] for e820 entry [mem 0x0009f000-0x000fffff]
[ 3569.551840] clipped [mem 0x000c8000-0x000cffff] to [mem 0x00100000-0x000cffff] for e820 entry [mem 0x0009f000-0x000fffff]
[ 3569.553764] i915 0000:37:00.0: [drm] Using Transparent Hugepages
[ 3569.556825] i915 0000:37:00.0: [drm] Local memory IO size: 0x000000013cc00000
[ 3569.556827] i915 0000:37:00.0: [drm] Local memory available: 0x000000013cc00000
[ 3569.562817] i915 0000:37:00.0: [drm] GT0: GuC firmware i915/dg2_guc_70.9.1.bin version 70.9.1
[ 3569.562821] i915 0000:37:00.0: [drm] GT0: HuC firmware i915/dg2_huc_7.10.3_gsc.bin version 7.10.3
[ 3569.576980] i915 0000:37:00.0: [drm] GT0: GUC: submission enabled
[ 3569.576983] i915 0000:37:00.0: [drm] GT0: GUC: SLPC enabled
[ 3569.577276] i915 0000:37:00.0: [drm] GT0: GUC: RC enabled
[ 3569.590333] i915 0000:37:00.0: GT0: local0 bcs'0.0 clear bandwidth:74358 MB/s
[ 3569.591180] [drm] Initialized i915 1.6.0 20201103 for 0000:37:00.0 on minor 1
[ 3569.623597] i915 0000:3c:00.0: [drm] GT count: 1, enabled: 1
[ 3569.623600] mei_gsc i915.mei-gscfi.14080: FW not ready: resetting: dev_state = 2 pxp = 0
[ 3569.623617] clipped [mem 0x000a0000-0x000bffff] to [mem 0x00100000-0x000bffff] for e820 entry [mem 0x0009f000-0x000fffff]
[ 3569.623621] clipped [mem 0x000c8000-0x000cffff] to [mem 0x00100000-0x000cffff] for e820 entry [mem 0x0009f000-0x000fffff]
[ 3569.623642] mei_gsc i915.mei-gscfi.14080: unexpected reset: dev_state = ENABLED fw status = 00000345 84670000 00000000 00000000 E0020002 00000000
[ 3569.624324] mei_gsc i915.mei-gsc.14080: FW not ready: resetting: dev_state = 2 pxp = 2
[ 3569.624349] mei_gsc i915.mei-gsc.14080: unexpected reset: dev_state = ENABLED fw status = 00000345 84670000 00000000 00000000 E0020002 00000000

v2.0.0 Checklist

Includes OOT Intel GPU driver release for RHOCP 4.13.10
Checklist:

  • Update dockerfile with the latest driver release - I915_23WW39.5_682.38_23.6.37_230425.49
  • Build and validate driver container for 4.13.10 OCP using KMM
  • Driver signing
  • Update release notes, root readme (if required), release readme table
  • Preflight certification scanning and publish image on RH catalog
  • Cut release on github
