The build-planning from rapidsai

Update RAPIDS to mark all CUDA kernels with internal linkage

Downstream consumers of static built versions of RAPIDS C++ projects have encountered runtime issues due to multiple instances of the same kernel existing in different DSOs.

To resolve this issue we need to ensure that all CUDA kernels in all RAPIDS libraries need to be have internal linkage ( static for projects using whole compilation, __attribute__((visibility("hidden"))) for header libraries / separable compilation ).

My tasks

Beta Give feedback

Mark libcudf kernels with internal linkage rapidsai/cudf#14734
Mark libraft kernels with internal linkage rapidsai/raft#1898
Mark libcuml kernels with internal linkage
Mark libcugraph kernels with internal linkage rapidsai/cugraph#4098
Mark cucollections kernels with internal linkage NVIDIA/cuCollections#422
Mark libcugraph_ops kernels with internal linkage https://github.com/rapidsai/cugraph-ops/pull/586
Mark libcuml_mg kernels with internal linkage
Mark libcuspatial kernels with internal linkage
Options

Generalize GHA `select` statements (to avoid hard-coding versions)

As there is only need of Dask-CUDA build, we filter out one architecture, Python version, and CUDA version in GHA. So as to only build the package once

https://github.com/rapidsai/dask-cuda/blob/1eecb1b2ac79ae9aaff9c26d0a3c93dd57f859f3/.github/workflows/build.yaml#L69-L70

However this currently hard-codes the versions of each in the selection logic, which means it can go stale as new versions are added and old ones dropped. Potentially resulting in the build being lost altogether (maybe even silently)

To avoid pinning to a specific version, @ajschmidt8 made several suggestions in this thread: rapidsai/dask-cuda#1294 (comment)

Filing this to track for follow-up

Include common debug information in logs for raising issues

To ensure we have all the information in logs when an error occurs so that we can easily share that information in upstream bug reports, it would be good to make sure we include common config and info commands at the top of our logs (aligned with places we might raise issues like Pip, Conda, NumPy, Pandas, Dask, etc.). That way it is easy to extract this information when filling out upstream issue templates.

Update cuDF to always explicitly specify RMM pool size and avoid rmm::detail usage

Add support for CUDA 12.2

This is a meta-issue for tracking support for CUDA 12.2 builds. I'll open separate sub-issues for pip and conda because there is different work that needs to be done for each.

Tasks

Beta Give feedback

Adding URL checking to CI

We may want to add some kind of URL checking to CI of RAPIDS projects to confirm links are valid

Recently we had a couple projects that linked to an image that was moved. So now they have missing image icons popping up

Having some kind of link check would help catch this when making these kinds of changes and allow us to put some kind of migration steps in place

This could make sense on RAPIDS projects themselves (maybe as part of doc builds). It could also make sense to have on any shared assets used by multiple projects

Support dynamic linking between RAPIDS wheels

Currently RAPIDS wheels adhere strictly to the manylinux policy. While the glibc/kernel ABI restrictions are not particularly onerous, the requirement that binary wheels be essentially self-contained and only depend on a small set of external shared libraries is problematic. To adhere to this restriction, RAPIDS wheels statically link (or in rare cases, bundle) all of their external library dependencies, leading to severe binary bloat. The biggest problem with this behavior is that the current sizes prohibit us from publishing our wheels on PyPI. Beyond that come the usual more infrastructural problems: longer CI times due to extra compilation, larger binaries making wheel download and installation slower, etc. The focus of this issue is to define a better solution than static linking for this problem that still adheres to the manylinux spec in spirit while reducing binary sizes. This issue will not address the usage of CUDA math library dynamic library wheels; that will be discussed separately.

Proposed Solution

RAPIDS should start publishing its C++ libraries as standalone wheels that can be pip installed independently from the Python(/Cython) wheels.These wheels should

Be py3 wheels (independent of Python version, except in rare cases like ucxx where we actually use the Python C API in the C++ library) that are built once per arch/CUDA major version
Continue to statically link to the CUDA runtime and math libraries
Contain a complete C++ dev library including CMake files, headers, and transitive dependencies. IOW these wheels should be suitable both for use both during compilation and at runtime.
Leverage scikit-build-core's entry point support to automate exposing their CMake to other packages building against them.

A key question to address is how to encode binary dependencies between wheels. One option is for each wheel to embed RPATHs pointing to the expected relative path to library dependencies in other wheels. This could be accomplished with some CMake to extract library locations from targets and then construct relative paths during the build based on the assumption that the packages are installed into a standard site-packages layout. However, since this approach is fragile and has generally been frowned upon by the Python community in the past, I suggest that we instead exploit dynamic loading to load the library on import of a package. This choice would make packages sensitive to import order (C++ wheels would need to be imported before any other extension module that links to them) but I think that's a reasonable price to pay since it only matters when depending on a C++ wheel. This solution also lets us handle the logic in Python, making it far easier to configure and control. Moreover, it will make the solution fairly composable when an extension module depends on a C++ wheel that depends on yet another C++ wheel.

Once these wheels exist, we should rewrite the existing Python packages to require the corresponding C++ wheels. The current approach of "find C++ if exists, build otherwise" can be scrapped in favor of always requiring that the C++ CMake package be found. Consumers will have the choice of installing the C++ library (e.g. from conda), building it from source, or installing the C++ wheel. The C++ wheel will become a hard dependency in pyproject.toml, so it will automatically be installed when building. In conda environments the pyproject dependencies are ignored, so the new wheels will not be installed, and similarly in devcontainer builds where requirements are generated dynamically from dependencies.yaml. Ultimately a pylibraft->libraft dependency will behave nearly identically to a raft-dask->pylibraft dependency from the perspective of dependency management.

Notes

Since the Python wheels will be dynamically linking to the C++ libraries, these wheels should be a lot closer to what we need in devcontainer/DLFW/PB2/etc builds. As a result we may be able to actually start using them there.

PRs in flight

Beta Give feedback

Merge the miniforge-cuda repo into ci-imgs

Currently we build a base image with CUDA and conda installed in the miniforge-cuda repo, then use that as a base for the conda images in the ci-imgs repo. Due to the tight coupling between these images, we trigger a rebuild of the ci-imgs whenever a PR is merged to miniforge-cuda. Given this tight coupling, I think we should consider merging these repos. Combining the repositories will make it easier to teach new build engineers about the pieces we have, and make it easier to maintain since it will reduce the amount of distinct processes required if there are changes that need to be coordinated between the repositories.

Consider enabling overlinking checks for conda builds in RAPIDS

Enabling overlinking checks will help us capture issues with things like rpaths or libraries that we're implicitly/unexpectedly linking to.

Update RAPIDS repos for RMM pool and detail API improvements

rapidsai/rmm#1417 and a followon PR makes a number of improvements and changes to help consumers of RMM avoid accessing RMM's detail namespace.

initial pool size in pool_memory_resource is required to be provided (no longer optional)
rmm::detail::available_device_memory() (moving to `rmm::available_device_memory())
Move alignment utility functions out of detail namespace.
Add utility to calculate an aligned percentage of free device memory in bytes. This way, existing instantiations of pool_memory_resource<Upstream>(upstream) can be replaced by pool_memory_resource<Upstream>(upstream, rmm::percent_of_free_device_memory(50), which matches previous behavior.

All RAPIDS repos that use any of the above will need to be updated. rapidsai/rmm#1417 only adds functionality. A follow on PR(s) will deprecate / remove functionality after RAPIDS repos are updated. Based on a search, the following repo issues cover the required changes.

Tasks

Beta Give feedback

Add support for Python 3.11

RAPIDS currently supports Python 3.9 and 3.10. We would like to add support for Python 3.11. This issue documents the steps needed.

Each section should be fully completed before moving to the next section.

CI images

Add Python 3.11 builds of miniforge-cuda. rapidsai/miniforge-cuda#55
Add Python 3.11 CI images (ci-conda, ci-wheel, citestwheel) rapidsai/ci-imgs#96

CI workflows

Branch Strategy:

Create a branch on shared-workflows called python-3.11
Add Python 3.11 to the build matrix on the python-3.11 branch: https://github.com/rapidsai/shared-workflows/blob/e7ebbae5854727b897b65213cf51ff8b965f53c1/.github/workflows/conda-python-build.yaml#L56-L63
Add Python 3.11 to the test matrix on the python-3.11 branch: https://github.com/rapidsai/shared-workflows/blob/e7ebbae5854727b897b65213cf51ff8b965f53c1/.github/workflows/conda-python-tests.yaml#L68-L82
- When adjusting the test matrix, be aware of total GPU resource consumption. Build jobs are CPU only but test jobs require GPUs. We want to keep our GPU consumption roughly the same (don't double the test matrix size), even if it gets a bit sparser in its coverage. I would recommend changing a portion of the Python 3.10 jobs to use Python 3.11. Try to mix coverage across architectures, OS versions, Python versions, and CUDA versions, and be sure we cover configurations like "latest of everything (OS/Python/CUDA)" and "oldest of everything."

Experimental Strategy:

@ajschmidt8 proposed that we could use GitHub Actions' continue-on-error to add Python 3.11 jobs that are allowed to fail, for all repos. Once all repos are passing, we could require the job to pass. This could be cleaner and less total work than our normal approach of the branch strategy above.

RAPIDS repositories

I copied a list of repositories from https://github.com/rapidsai/workflows/blob/dfd73ad47d977c57ee27b2349216f85946be757b/.github/workflows/nightly-pipeline.yaml#L27-L44 and sorted it roughly in dependency-tree order.

For each repo,

~~Update .github/workflows/*.yaml to point to the python-3.11 branch of shared-workflows~~
Update dependencies.yaml to add support for Python 3.11.
Review any pyproject.toml files for necessary changes (classifiers, etc.)
Update docs (README, etc) that reference a single Python version to point to the latest (3.11).
Once CI passes, merge the PR.

Most of this is easy to automate with rapids-reviser, and I've made an attempt at it here: https://github.com/rapidsai/rapids-reviser/pull/11. We still need to manually review the PRs for missing pieces.

Repos:

(checklist moved to #3 (comment))

Once all repos are migrated to the python-3.11 branch, the migration is complete. We merge python-3.11 into branch-24.02 on shared-workflows and then open follow-up PRs to each repo to reset the branches to branch-24.02. This "reset" is simple and should be automated with rapids-reviser.

Post-migration

Update the ci-imgs repo's latest configuration to use Python 3.11. https://github.com/rapidsai/ci-imgs/blob/main/latest.yaml
- The latest image is frequently used by CI jobs for building docs and testing notebooks. Be aware that issues may arise in those jobs.
- PR: rapidsai/ci-imgs#125
Update the docker repo's matrix.yaml and matrix-test.yaml
- PR: rapidsai/docker#635
(added by @jameslamb 2/27) update the build matrix in pypi-wheel-scripts so that Python 3.11 wheels are checked: https://github.com/rapidsai/pypi-wheel-scripts/blob/fa1e8744c8ec961a5b5e38ae172ae9c8c51b4280/release/check-wheels.sh#L41-L45
- PR: https://github.com/rapidsai/pypi-wheel-scripts/pull/8

Update RAPIDS to use `cuda::mr::async_resource_ref`

This is a cross-RAPIDS tracking issue for the epic described in this document. Please refer to the document for background and details.

Tasks

Beta Give feedback

Introduce cuda::mr::{async_}resource_ref to RMM and use in place of device_memory_resource pointers. Introduce conversions between rmm::cuda_stream_view and cuda::stream_ref. Done in rapidsai/rmm#1095 (23.12)
Require explicit initial pool size in pool_memory_resource. This is because it is difficult to automatically size the pool when the memory is not device memory. Done in rapidsai/rmm#1417
Deprecate and remove get_mem_info methods from memory resources. rmm#1388

4 of 4

cpp feature request tech debt
Deprecate and remove supports_streams() from memory resources rmm#1389

3 of 3

cpp feature request
[FEA] Change semantics of RMM memory resource equality comparison rmm#1402

feature request
[FEA] Add a convenience alias in RMM for cuda::mr::async_resource_ref<cuda::mr::device_accessible> rmm#1442

feature request
[FEA] Refactor RMM in terms of cuda::mr::memory_resource rmm#1443

10 of 14

feature request
Refactor RMM tests to use cuda::resource_ref instead of device_memory_resource pointers. rmm#1444

feature request
[FEA] Dev container support for building and testing cuDF JNI cudf#15549

cuDF (Java) feature request
Replace device_memory_resource* with device_async_resource_ref cudf#15498

improvement
Replace device_memory_resource* with device_async_resource_ref raft#2261

improvement
Replace device_memory_resource* with device_async_resource_ref cugraph#4333

improvement
Replace device_memory_resource* with device_async_resource_ref cuml#5839

improvement
Replace device_memory_resource* with device_async_resource_ref cuspatial#1371

improvement
Convert any other RAPIDS repos to use rmm::mr::device_async_resource_ref instead of device_memory_resource pointers
Refactor RMM resources to implement cuda::memory_resource concepts and deprecate and remove device_memory_resource and host_memory_resource base classes. rmm#1445

8 of 14

Epic feature request
Options

Build and ship C++ RAPIDS binaries

Currently the only way to access RAPIDS C++ libraries is via conda. There is no easy way to install RAPIDS C++ libraries in any other context. With #33, that will change to an extent since it will be possible to pip install the binaries. However, that is a very nontraditional approach for providing native libraries, and should not be our primary avenue for producing said binaries. However, the changes in #33 are also essentially a PoC that we can build the C++ libraries in our wheel containers and produce binaries that are portable, at least up to the requirements of the Python manylinux standard.

Once #33 is completed, we should take this one step further and build the C++ libraries standalone. We should be able to extract most of what we need directly from the C++ wheel building scripts since all native dependencies are already preinstalled into the wheel images and we know that these images produce fairly portable binaries. We can use CPack to produce native packages for whatever targets that we care about here. Then, the C++ wheel builds can be modified to simply include the entire contents of the CPacked package into the wheel install.

Getting this working will probably require some significant experimentation. Some notes:

I don't think a simple install(IMPORTED_RUNTIME_ARTIFACTS) of the CPacked library into the wheel will be sufficient because that will only install the library. What we want is to copy over everything required to make this a valid CMake packages, such as the CMake config files, as well as all the headers and anything else needed to compile against this package, in addition to the compiled libraries. However, we also may not want every single thing that is contained in the library; the CPacked library may include e.g. test binaries that we do not want to include in the installation.
I don't know enough about how scikit-build-core configures CMake on the back end to know if we can safely reuse its build directory for the C++ build, otherwise we could simply run CMake commands directly in there and specify install components to split up the build. I doubt this will "just work" out of the box, though, and I don't know if we will ever be able to expect this to be a supported mode of operation for scikit-build-core even if we could make it work.
One possible option would be for the C++ and C++ wheel builds to be done in the same CI job so that the C++ build happens in a build directory that is still available when doing the wheel build. That might allow us some more flexibility in doing the two step wheel build to pull everything more granularly from the raw C++ CMake build.

This may require some level of work on the scikit-build-core side to support.

Add support for CUDA 12.2 wheels

We would like to start publishing wheels that support versions of CUDA newer than CUDA 12.0. Currently, this requires that we:

Update CI runners to use R535 drivers (CUDA 12.2, Long Term Support Branch). @ajschmidt8 will take care of this.
Update the ci-imgs repo to build images with the desired CUDA versions. This was partially completed in rapidsai/ci-imgs#53, but the ci-wheel and citestwheel images are currently excluded from the latest CUDA version. Moreover, that initial change was for 12.1, and at this point we might as well use 12.2.
Modify the wheel shared-workflows to use the new images in building wheels. ~~These jobs should be set up to continue on error~~ (comment explaining why this is crossed out)
Follow up with different RAPIDS developer teams to address any issues with builds that arise in CUDA 12.x builds. This job will mostly involve coordinating a response from the teams; the assignee of this issue is not responsible for actually fixing said builds.

Once we have 12.2 wheels and conda packaging has caught up, we may wish to revisit the CTK version and go to 12.3. For initial work we will start with 12.2, though.

Move creation of temporary build file env.yaml outside the current directory in build scripts

Move the creation of the env.yaml file in build scripts to a temporary directory instead of the current directory.
Reference change here made here to the cudf repo: rapidsai/cudf#14476 to make this change to other RAPIDS repos with similar build scripts.

During the CI builds, a env.yaml file is created for use with a mamba call later in the script. This is a problem when trying to reproduce CI locally when the launched docker container tries to create this file locally as root. If the user does not have root access (or equivalent permissions), the file is not created and the process aborts.

The same error occurs when creating the test_results directory when the RAPIDS_TEST_DIR is not set. In this case the script tries to create the test_results in the current working directory resulting in the same failure.

Also mentioned here rapidsai/cudf#14476 (comment) was possibly using https://github.com/rapidsai/rapids-reviser to help update the appropriate RAPIDS repositories.

Evaluate replacing conda-build with rattler-build

RAPIDS currently builds conda packages in CI using conda-build. The rattler-build tool is a newer alternative. It is written in Rust, and should be faster than conda-build (I haven't seen any official benchmarks yet, though). It only supports a limited subset of the meta.yaml recipe format, but that subset is designed to still enable all the same features, just with a more limited syntax (see CEPS 13 and 14). conda-build overhead is nontrivial (I've never benchmarked it, but I know it can stretch into multiple minutes beyond the environment solve when doing local CI reproductions), and reducing that would be quite valuable for us in improving our CI turnaround. Moreover, switching to the more restricted syntax described in the above CEPs would be beneficial because it would convert our conda recipes into pure YAML rather than the extended YAML currently used by meta.yaml. That change is important because the YAML extensions currently in our recipe make it impossible to parse or write with standard YAML parsers, which is a big reason why we have struggled to do things like support meta.yaml files in rapids-dependency-file-generator.

We should do a PoC of replacing conda-build with rattler-build in one repo (preferably something reasonably complex like cudf or cugraph) to see what it would take to make this transition, and how much we would benefit.

Build Python packages using the limited API

Python has a limited API that is guaranteed to be stable across minor releases. Any code using the Python C API that limits itself to using code in the limited API is guaranteed to also compile on future minor versions of Python within the same major family. More importantly, all symbols in the current (and some historical) version of the limited API are part of Python's stable ABI, which also does not change between Python minor versions and allows extensions compiled against one Python version to continue working on future versions of Python.

Currently RAPIDS builds a single wheel per Python version. If we were to compile using the Python stable ABI, we would be able to instead build a single wheel that works for all Python versions that we support. There would be a number of benefits here:

Reduced build time: This benefit is largely reduced by #33, since if we build the C++ components as standalone wheels they are already Python-independent (except when we actually use the Python C API in our own C libraries; the only example that I'm currently aware of in RAPIDS is ucxx). The Python components alone are generally small and easy to build. We'll still benefit, but the benefits will be much smaller.
Reduced testing time: Currently we run test across a number of Python versions for our packages on every PR. We often struggle with what versions need to be tested each time. If we were to only build a single wheel that runs on all Python versions, it would be much easier to justify a consistent strategy of always testing e.g. the earliest and latest Python versions. We may still want to test more broadly in nightlies, but really the only failure mode here is if a patch release is made for a Python version that is neither the earliest nor the latest, and that patch release contains breaking changes. That is certainly possible (e.g. the recent dask failure that forced us to make a last-minute patch), but it's infrequent enough that we don't need to be testing regularly.
Wider support matrix: Since we'll have a single binary that works for all Python versions, maintaining the full support matrix will be a lot easier and we won't feel as much pressure to drop earlier versions in order to support newer ones.
Day 0 support: Our wheels will work for new Python versions as soon as they're released. Of course, if there are breaking changes then we'll have to address those, but in the average case where things do work users won't be stuck waiting on us.
Better installation experience: Having a wheel that automatically works across Python versions will reduce the frequency of issues that are raised around our pip installs.

Here are the tasks (some ours, some external) that need to be accomplished to make this possible:

Making Cython compatible with the limited API: Cython has preliminary support for the limited API. However, this support is still experimental, and most code still won't compile. I have been making improvements to Cython itself to fix this, and I now have a local development branch of Cython where I can compile most of RAPIDS (with additional changes to RAPIDS libraries). We won't be able to move forward with releasing production abi3 wheels until this support in Cython is released. This is going to be the biggest bottleneck for us.
nanobind support for the limited API: nanobind can already produce abi3 wheels when compiled with Python 3.12 or later. Right now we use nanobind in pylibcugraphops, and nowhere else.
Removing C API usage in our code: RAPIDS makes very minimal direct usage of the Python C API. The predominant use case that I see is creating memoryviews in order to access some buffers directly. We can fix this by constructing buffers directly. The other thing we'll want to do is remove usage of the NumPy C API, which has no promise of supporting the limited API AFAIK. That will be addressed in #41. Other use cases can be addressed incrementally.
Intermediate vs. long-term: If Cython support for the limited API ends up being released before RAPIDS drops support for Python 3.10, we may be in an intermediate state where we still need to build a version-specific wheel for 3.10 while building an abi3 wheel for 3.11+ (and 3.12+ for pylibcugraphops due to nanobind). If that is the case, it shouldn't cause much difficulty since it'll just involve adding a tiny bit of logic on top of our existing GH workflows.

At this stage, it is not yet clear whether the tradeoffs required will be worthwhile, or at what point the ecosystem's support for the limited API will be reliable enough for us to use in production. However, it shouldn't be too much work to get us to the point of at least being able to experiment with limited API builds, so we can start answering questions around performance and complexity fairly soon. I expect that we can pretty easily remove explicit reliance on any APIs that are not part of the stable ABI, at which point this really becomes a question of the level of support our binding tools provide and if/when we're comfortable with those.

Use CUDA wheels to avoid statically linking CUDA components in our wheels

In order to achieve manylinux compliance, RAPIDS wheels currently statically link all components of the CTK that they consume. This leads to heavily bloated binaries, especially when the effect is compounded across many packages. Since NVIDIA now publishes wheels containing the CUDA libraries and these libraries have been stress tested by the wheels for various deep learning frameworks (e.g. pytorch now depends on the CUDA wheels), RAPIDS should now do the same to reduce our wheel sizes. This work is a companion to #33 that should probably be tackled afterwards since #33 will reduce the scope of these changes to just the resulting C++ wheels, a meaningful reduction since multiple RAPIDS repos produce multiple wheels. While the goals of this are aligned with #33 and the approach is similar, there are some notable differences because of the way the CUDA wheels are structured. In particular, they are not really designed to be compiled against, only run against. They do generally seem to contain both includes and libraries, which is helpful, but they do not contain any CMake or other packaging metadata, nor do they contain the multiple symlinked copies of libraries (e.g. linker name->soname->library name). The latter is a fundamental limitation of wheels not supporting symlinks, but could cause issues for library discovery using standardized solutions like CMake's FindCUDAToolkit or pkg-config that rely on a specific version of those files existing (AFAICT only the SONAME is present). We should stage work on this in a way that minimizes conflicts with #31 and #33, both of which should facilitate this change. I propose the following, but all of it is open for discussion:

Test dynamically linking in a build, then manually installing the CUDA wheels at runtime - Our first attempt should simply be to verify that we are able to interchange the libraries as expected. To achieve this, we will want to do the following:
1. Pick a repo. raft is probably the best choice here since it's the main entry point for a lot of math libraries in RAPIDS (cuml, cugraph, and cuopt all use it that way) and because it only depends on rmm as a header-only library so there's minimal conflict with the ongoing work to introduce wheel interdependencies.
2. Turn off static linking in the build, then configure auditwheel to exclude the relevant CUDA math libraries from inclusion. This can be done with the --exclude flag. The resulting wheel should be inspected to verify that all CUDA math libraries have been excluded from the build. Note that (at least for now) we want to continue statically linking the CUDA runtime. This change will likely require some CMake work to decouple static linking of cudart from the static linking of other CUDA libraries.
3. We will then want to try installing these wheels into a new environment without the necessary CUDA libraries installed. This could be done using a container with a different CUDA version, or on a machine with CUDA drivers installed but relying on e.g. conda for installing the CUDA runtime and libraries. Attempting to import the wheel should give a linker error.
4. We will then want to try installing the CUDA wheels and verify that we can make things work. The easiest choice at this stage will probably be to just set LD_LIBRARY_PATH
Build against the CUDA wheels: In the long run, we would like to be able to build against the CUDA wheels to ensure that we see a consistent build and runtime layout of CUDA files. At present, this likely to be challenging due to some of the layout issues mentioned above. Concretely, I think that we will achieve the most benefit here if we attempt to make things work with the current layout, but do so in a way that makes it manifestly clear why the current layout is difficult to work with. We can then have a more productive discussion with the CUDA wheels teams about changes that we'd like to see (I have already started some of those discussions, but I think it'll be a lot easier to make headway when we have something concrete to discuss). With that in mind, I would suggest that at this stage we focus on writing custom CMake find modules for the CUDA libraries that work when we're building wheels. This will allow us to determine what shortcomings there are with the existing layouts.
Layer on top of the C++ wheels - In the long run, RAPIDS Python packages should never need to deal directly with the CUDA wheels. Instead, they should be getting all their CUDA dependencies transitively via the C++ wheels. To achieve this goal, once we've reached this point with these wheels we should rework the above changes on top of the ongoing work to create separate C++ wheels.
Figure out a suitable CUDA library loading strategy - The easiest way to make our wheels work with CUDA wheels at runtime is by setting the RPATHs to do a relative lookup of the libraries in the CUDA packages. Ideally I think we would want to push for the CUDA packages to instead manage the libraries via dynamic loading (the way I've set up the RAPIDS C++ wheels) to insulate consumers from the file layout of the wheel, the use of layered environments, etc, but that's probably not going to be an option in the near to intermediate term. Therefore, our options will likely be to set the RPATHs of our binaries directly, or to load the libraries in Python ourselves. The latter is a bit more flexible in that it would allow the potential for coexistence with system-installed CUDA libraries if desired, so for the purpose of e.g. DLFW containers we may still want to go that route. This would be the stage where we try to figure out in general the degree to which we want to support system vs pip-installed CUDA libraries when using our pip wheels.
Roll out to all the libraries - Once we reach this point, we can make analogous changes to other RAPIDS packages.

Drop CentOS 7 support in RAPIDS 24.06

We plan to drop CentOS 7 (which uses glibc 2.17) in RAPIDS 24.06. https://docs.rapids.ai/notices/rsn0037/

At this time, the new minimum glibc supported by RAPIDS will become 2.28 (used by Rocky 8), because that is the oldest glibc of any operating system we currently support.

This issue documents some of the work items needed to complete the drop in platform support.

Early 24.06

Before 24.06 release

Update various release scripts (especially wheel related ones which key on the manylinux version)
Update shared-actions to use actions with Node 20 (to avoid deprecation warnings)
- rapidsai/shared-actions#9
Update installation guide
Update rapids.ai homepage
Drop centos7 and ubuntu18.04 from the https://github.com/rapidsai/ci-imgs matrix
- Not required for release. Wait until 24.04 has shipped (and simmered for a bit, in case hotfixes are needed).

Feel free to edit this checklist with more items.

References:

NumPy 2.0 support

NumPy 2.0 is coming out soon ( numpy/numpy#24300 ). NumPy 2.0.0rc1 packages for conda & wheels came out 2 weeks back ( numpy/numpy#24300 (comment) )

Ecosystem support for NumPy 2.0 is being tracked in issue: numpy/numpy#26191

Also conda-forge is discussing how to support NumPy 2.0: conda-forge/conda-forge.github.io#1997

When building against NumPy 2.0, it is possible with default settings to build packages that are compatible with NumPy 1 & 2. Where NumPy will target the oldest NumPy version that was built for that Python version being targeted

From a RAPIDS perspective, we will need to identify our dependencies using NumPy and track when they have been upgraded to support NumPy 2

Pin to NumPy <2

NumPy 2 is expected to be released in the near future. For the RAPIDS 24.04 release, we will pin to numpy>=1.23,<2.0a0. This issue tracks the work needed to add an upper bound to affected RAPIDS repositories.

cc: @jakirkham

Automate C++ include grouping and ordering using .clang-format

We want to add rules to RAPIDS repos clang-format files to automate the C++ include grouping and ordering to ensure consistency and make it easier to write scripts that insert includes into C++ files. Such scripts would not have to worry about placing the includes in the right place because clang-format will fix up any ordering or grouping problems after running the script.

Full discussion in the cuDF PR.

Tasks

Beta Give feedback

Automate C++ include file grouping and ordering using clang-format rmm#1477

improvement
Automate C++ include file grouping and ordering using clang-format cudf#15103

improvement
Automate C++ include file grouping and ordering using clang-format raft#2193

improvement
Automate C++ include file grouping and ordering using clang-format cugraph#4185

improvement
Automate C++ include file grouping and ordering using clang-format cuml#5779

improvement
Automate C++ include file grouping and ordering using clang-format cuspatial#1345

improvement
Others?
Options

Migrate all Python builds from scikit-build to scikit-build-core

RAPIDS libraries are generally built with CMake. To facilitate better integration of the C++ builds with Python builds, we switched from using pure setuptools builds to using scikit-build. This change was crucial to enabling wheels by providing a single standard entrypoint (all the usual Python pip [install|wheel|etc] machinery) for building a Python package while also compiling the required C++ components. However, scikit-build's approach to enabling this is fundamentally limited because it relies on plugging into setuptools directly in ways that setuptools only marginally supports. The result is a tool that works most of the time, but has various sharp edges (e.g. incomplete support for MANIFEST.in, broken installations in certain cases, etc) and limitations (an inability to support true editable installations, mixed support for pyproject.toml/setup.py, etc).

The solution is to switch to the newer scikit-build-core builder, a modern standards-based builder that offers the same class of functionality as scikit-build (integrating a Python build with CMake) in a more reliable manner. Doing so will allow us to completely removed deprecated components of our build systems (various uses of setup.py), get rid of workarounds for scikit-build (e.g. the languages we must specify at the CMake level), and get full support for critical features like editable installs.

PRs Contributing To This Effort

Beta Give feedback

Switch to scikit-build-core raft#2051

CMake breaking ci improvement python
Switch to scikit-build-core kvikio#325

breaking improvement
Switch to scikit-build-core cuspatial#1313

Python breaking ci cmake conda improvement
Switch to scikit-build-core cuml#5693

CMake Cython / Python breaking ci conda improvement
Switch to scikit-build-core cugraph#4053

CMake breaking ci conda improvement python
Switch to scikit-build-core cudf#13531

CMake breaking ci conda cuDF (Python) improvement libcudf
Switch to scikit-build-core rmm#1287

0 - Blocked CMake Python breaking conda improvement
Switch to scikit-build-core ucxx#146

breaking improvement
Options

Moving from `pynvml` to `nvidia-ml-py`

Opening this issue to track moving from pynvml to nvidia-ml-py. There has been past discussion and issues about this. Moving this to build-planning to improve visibility.

Have compiled a list of issues on RAPIDS projects that would need to be updated to complete the move. Largely expect this to be pretty simple string replacement.

In a few cases pynvml.smi is used, which does not have an equivalent in nvidia-ml-py. If we don't need pynvml.smi in the places it is used, we could simply drop those code paths. If we do need it for some reason, we may need to think more about what a reasonable replacement would be.

Issues

Convert deprecation warnings to errors on CI

To make it easier to catch and fix deprecations in RAPIDS projects, it is worth considering converting deprecation warnings to errors on CI. That way deprecations fail loudly and we are able to catch and address them quickly. Alternatively we can use that opportunity to tighten our dependencies and flag the deprecation for follow up when we are ready.

Consider consolidating conda solves in CI tests

Currently (using cuML as an example here), the conda test environment initialization for most CI jobs looks something like creating the test environment:

rapids-dependency-file-generator \
  --output conda \
  --file_key test_python \
  --matrix "cuda=${RAPIDS_CUDA_VERSION%.*};arch=$(arch);py=${RAPIDS_PY_VERSION}" | tee env.yaml

rapids-mamba-retry env create --force -f env.yaml -n test

And then downloading and installing build artifacts from previous jobs on top of this environment:

CPP_CHANNEL=$(rapids-download-conda-from-s3 cpp)
PYTHON_CHANNEL=$(rapids-download-conda-from-s3 python)

...

rapids-mamba-retry install \
  --channel "${CPP_CHANNEL}" \
  --channel "${PYTHON_CHANNEL}" \
  libcuml cuml

In addition to forcing us to eat the cost of a second conda environment solve, in many cases this can cause some pretty drastic changes to the environment which can be blocking - for example, consider this cuML run which fails because conda is unable to solve a downgrade from Arrow 15.0.0 (build 5) to 14.0.1.

Our current workaround for this is to manually add pinnings to the testing dependencies initially solved such that the artifact installation can be solved, but this can introduce a lot of burden in needing to:

identify what packages/changes are blocking artifact installation
open PR(s) modifying the impacted repos
follow up on each impacted repo to potentially remove the pinning later on

Would it be possible to consolidate some (or all) of these conda environment solves by instead:

downloading the conda artifacts before creating the environment
updating the dependencies.yaml (or, if this isn't possible, the generated conda environment file) to include the desired packages, making sure to explicitly specify source channel to ensure we're picking up the build artifacts
creating the environment with this patched file

In my mind, the main blocker I could see to this working would be if rapids-download-conda-from-s3 requires some conda packages contained in the testing environment to work.

Statically link gtest/gbench/nvbench into all C++ test/benchmark executables

Currently we typically dynamically link gtest/gbench/nvbench into our tests/benchmarks. This is unnecessary and it makes these executables harder to consume for people who want to run our tests/benchmarks. We should change this default across RAPIDS. The first step will be updating rapids-cmake to support using gtest statically. Once that is done, we can start rolling out the changes across RAPIDS.

Reduce amount of hard-coding of RAPIDS version

Many (most?) projects have an update-version.sh script that uses sed expressions to replace the RAPIDS version in the repository's files. Many of these hard-coded usages of the version can be replaced with smarter dynamic reading of the VERSION file, and the remaining usages that must be hard-coded can be updated by a centralized hook in https://github.com/rapidsai/pre-commit-hooks that reads a configuration file from the repo.

In this issue, I propose two RAPIDS-wide changes:

Intelligently read from VERSION wherever possible. rapidsai/cudf#14867 serves as an example for updating CMake code to read from VERSION. Other scripts in other languages can do something similar.
Add a hook to https://github.com/rapidsai/pre-commit-hooks that searches for version strings and warns you about hard-coding version numbers. In some cases (README files, etc.), hard-coding is unavoidable, and the hook can be configured to instead update these files when the version changes.

Implement a cuda-version wheel metapackage

Currently we have a cuda-version metapackage in conda that can be used in install commands to select the appropriate version of other packages. Ideally such a package would facilitate something like the following:

pip install cudf cuml cuda-version=12.2

with the appropriate cudf/cuml packages being pulled for us. The major caveat with pip vs conda is that we do not have a single package across CUDA versions for a given RAPIDS package, but instead separate packages by major version. That is because wheels do not offer any way of encoding arbitrary extra tag information to distinguish wheels that are identical according to the standard information that wheels support (platform, Python version).

In lieu of the latter being feasible, we would at minimum like to support a package that would serve to constrain our environment and prevent inconsistent installations, e.g.

pip install cudf-cu12
pip install cuml-cu11  # We would want this to fail because we have cudf-cu12

That is a more tractable problem for which the solution should be achieved by adding the appropriate cuda-version constraint to each *-cu* package.

Add support for Python 3.12

Overview

Python 3.12 was released in October 2023. This issue tracks the work to add support for Python 3.12 to RAPIDS.

In #3, RAPIDS added support for Python 3.11, which was released in RAPIDS 24.04. The work to add Python 3.11 was heavily automated, and that could be done again for Python 3.12 to reduce the load on RAPIDS maintainers.

When should we drop Python 3.9?

Typically RAPIDS has kept the matrix of supported Python minor versions to 2 or 3 versions at a time. When adding Python 3.12, we should probably drop Python 3.9 as well.

SPEC 0 recommended dropping support for Python 3.9 in 2023Q4. Meanwhile, NEP 29 recommended dropping support for Python 3.9 as of Apr 05, 2024. Both of these deadlines have passed and several large Python libraries are now moving towards dropping Python 3.9, so it is probably reasonable to drop Python 3.9 around the same time that we add Python 3.12.

Tasks

Each section should be fully completed before moving to the next section.

CI images

Add Python 3.12 builds of miniforge-cuda. rapidsai/miniforge-cuda#55
Add Python 3.12 CI images (ci-conda, ci-wheel, citestwheel) rapidsai/ci-imgs#137

CI workflows

Branch Strategy:

Create a branch on shared-workflows called python-3.12
Add Python 3.12 to the build matrix on the python-3.12 branch
Add Python 3.12 to the test matrix on the python-3.12 branch
- When adjusting the test matrix, be aware of total GPU resource consumption. Build jobs are CPU only but test jobs require GPUs. We want to keep our GPU consumption roughly the same (don't double the test matrix size), even if it gets a bit sparser in its coverage. We have some rough guidelines for how to decide on the matrix entries to include.

RAPIDS repositories

First, create a checklist for tracking repository migration like the one we used for Python 3.11: #3 (comment)

For each repo,

Update .github/workflows/*.yaml to point to the python-3.12 branch of shared-workflows
Update dependencies.yaml to add support for Python 3.12.
Review any pyproject.toml files for necessary changes (classifiers, etc.)
Update docs (README, etc) that reference a single Python version to point to the latest (3.12).
Once CI passes, merge the PR.

Most of this is easy to automate with rapids-reviser, and we can copy from this previous migrator for Python 3.11: https://github.com/rapidsai/rapids-reviser/pull/11. We still need to manually review the PRs for missing pieces.

Once all repos are migrated to the python-3.12 branch, the migration is complete. We merge python-3.12 into the development branch on shared-workflows and then open follow-up PRs to each repo to reset the branches to that development branch. This "reset" is simple and should be automated with rapids-reviser.

Post-migration

Update the ci-imgs repo's latest configuration to use Python 3.12. https://github.com/rapidsai/ci-imgs/blob/main/latest.yaml
- The latest image is frequently used by CI jobs for building docs and testing notebooks. Be aware that issues may arise in those jobs.
Update the docker repo's matrix.yaml and matrix-test.yaml
Update the build matrix in pypi-wheel-scripts so that Python 3.12 wheels are checked: https://github.com/rapidsai/pypi-wheel-scripts/blob/fa1e8744c8ec961a5b5e38ae172ae9c8c51b4280/release/check-wheels.sh#L41-L45

ensure `update-versions.sh` scripts account for dependencies with `-cu{CUDA_MAJOR}` suffixes

Overview

RAPIDS projects are released on the same cadence, mostly using the same versioning scheme, as described in https://docs.rapids.ai/releases/process/.

Given that, the projects tend to have dependencies on other RAPIDS projects from the same release. For example, like this in a pyproject.toml:

[project]
# ...
dependencies = [
    "cudf==24.2.*",
    # ...
    "dask-cuda==24.2.*",
    "dask-cudf==24.2.*",
    # ...
    "pylibraft==24.2.*",
    "raft-dask==24.2.*",
    "rapids-dask-dependency==24.2.*",
    "rmm==24.2.*"
]

(cuml code link).

When cutting a new release, shell scripts in each repo (by convention, ci/release/update-version.sh) are used to update all such versions to the newest RAPIDS release.

As of this writing, some of those scripts don't account for projects whose names have a -cu{CUDA_MAJOR} suffix in the name, like this:

cudf-cu12==24.2.*

As a result, some dependencies may be missed when beginning a new release cycle.

That should be fixed.

Approach

review the discussion in rapidsai/cudf#14825 and rapidsai/cuml#5726
make similar changes on the branch-24.04 branches of all other RAPIDS repos

Merge the gha-tools repo into ci-imgs

This task is a follow-up to #48. Many of the reasons are the same. The gha-tools repo defines a number of bash scripts that we use in our CI scripts throughout RAPIDS. These tools are automatically installed in the images: https://github.com/rapidsai/ci-imgs/blob/main/ci-conda.Dockerfile#L103. Some of these tools rely on environment variables set in the images, and some developments between the repos must be coordinated. For example, the addition of rapids-configure-sccache involved the simultaneous removal of these variables from CI images to verify that the tool was actually setting the right variables. Like with miniforge-cuda, PR merges in gha-tools trigger a release, which then triggers a rebuild of ci-imgs in order to embed the latest version of the tools. There is a lot of extra process here that we could elide by simply moving the tools into the ci-imgs repository. We could also make it easier to test changes; if the two are in the same repository, then a change to a tool would automatically trigger an image build with the new tools and we would only need to add a parameter to our shared workflows to enable rerunning a workflow from another repo (e.g. cudf) with the latest images.

Ensure `ctest` is called with `--no-tests=error`.

Following up from rapidsai/rapids-cmake#534 and rapidsai/cuml#5753.

We should ensure that ctest is called with the flag --no-tests=error. This would help prevent false positives in RAPIDS C++ test suites. @KyleFromNVIDIA also proposed adding a pre-commit hook to enforce this.

Deploy new copyright update script to all RAPIDS repos

A new copyright update script was added to https://github.com/rapidsai/pre-commit-hooks, and v0.0.1 was tagged. This check has been deployed in cudf and cuspatial and seems to be working very well. We should deploy this to the rest of RAPIDS too.

Properly support building pure Python wheels

Most RAPIDS wheels contain extension modules. However, after #33 we will have a number of pure C++ wheels that contain no Python code at all. We also have a handful of pure Python packages, namely dask_cudf and the wheels in the cugraph repo aside from cugraph and pylibcugraph. Those packages are handled in a somewhat specialized manner in the wheels workflows in order to produce pure Python wheels, but we do not handle this correctly for conda packages, where we still produce a package per minor version of Python. We should address this issue more holistically.

There are two parts to this request:

Updating our shared workflows to support building pure wheels. The most important thing to do here is to create a new workflows based on wheels-build and conda-python-build that only use a single version. We already do this manually in a few places (especially in the new jobs added in addressing #33), so the simplest solution I see here is creating workflows that wrap those preexisting workflows but pass in a matrix filter containing a max_by(py_ver). The other thing that we may want to do here is forward along any other information specific to pure wheel builds. One example is the need to specify the RAPIDS_PY_WHEEL_PURE variable for various gha-tools to work correctly. We could set that appropriately to the environment of all jobs using this shared workflow.
Updating conda recipes to properly produce packages without a Python ABI dependence. This will require that we remove the Python component of the build string, that we specify the packages as noarch:python, and that we ensure that the Python dependency becomes a >=min_version instead of pinning to a specific version (this should automatically be handled if the package is built as noarch:python).

Add support for CUDA 12.2 conda packages

We would like to start publishing conda packages that support versions of CUDA newer than CUDA 12.0. At the moment, this is blocked on efforts to get the CTK on conda-forge updated to a sufficiently new version. As of this writing, we are currently updating the conda-forge CTK to 12.1.1. Our plan is to continue the cf update process, and whatever the latest version of the CTK is that's available via cf on Jan 8, 2024, we will use that version for building RAPIDS 24.02 packages.

Assuming that #7 is completed before this, the main tasks will be to:

Modify the conda shared-workflows to use the new images in building conda packages These jobs should be set up to continue on error
Follow up with different RAPIDS developer teams to address any issues with builds that arise in CUDA 12.x builds. This job will mostly involve coordinating a response from the teams; the assignee of this issue is not responsible for actually fixing said builds.

Step 2 above will likely involve making updates to dependency files in various RAPIDS repos.

This issue will be filled out more and updated once the conda-forge updates are completed and the version finalized.

Remove usage of the NumPy C API

RAPIDS currently makes use of the NumPy C API in a handful of places, generally in Cython code. The NumPy C API is generally quite good and has remained stable, making it easy to work with. However, it does introduce additional build and packaging complexity that would be nice to avoid. With minimal changes to RAPIDS code, we should be able to remove numpy as a build dependency entirely, which may simplify our builds and also saves us from needing to rebuild packages at all when numpy 2.0 is released. If we were getting a lot of value out of the C API the calculus might be different, but in practice our usage of it is very minimal and can generally be avoided. I propose that we expend a little bit of development effort to stop relying on the NumPy C API altogether. This will help us on two fronts: 1) we'll more easily be able to support multiple major versions of NumPy (see #38) since we only have to worry about Python compatibility, not C compatibility; and more importantly 2) we won't have to worry about NumPy C APIs when considering if we can use the Python limited API to produce a single package across Python versions (will open a separate issue for that next). The latter is the more important piece here, since as of this writing the numpy C API is not compatible with the Python limited API based on the author's current experimentation.

The changes required basically boil down to two things:

cudf/cuspatial: cudf and cuspatial both use the C API transitively only, via pyarrow. There is no direct usage of the C API. Therefore, the cudf/cuspatial piece of this issue will be addressed when rapidsai/cudf#15193 is completed.
ucxx: ucxx uses the C API to expose host buffers to other APIs. This usage should be possible to remove by directly implementing the buffer protocol on a custom object. It will require a bit of extra work, but should be easy to maintain going forward.

Align conda and wheel building workflows

Historically our conda and wheel GHA workflow scripts have looked fairly different for a number of reasons. However, with #33 many of the fundamental distinctions will no longer exist because wheels will also have separate build steps for C++ and Python builds. As a result, we should invest in aligning our workflows as much as possible so as to reduce maintenance costs going forward. Some changes that we ought to make:

We should parallelize test jobs across different Python packages. Wheels generally already do this, while all conda tests typically occur within a single job (in some repos there is a partial split, usually based on criteria specific to each repo e.g. cuml dask tests or cudf tests that aren't part of the cudf package).
We should standardize handling of pure Python packages. Wheels already do this, while conda packages do not. See #43 for a more detailed writeup.
We should automatically append the CUDA version to the artifacts produced by the gha-tools for uploading packages. We already do this for conda, but not for wheels. As part of #33 we will be rearchitecting the wheel pipelines to have one job for building the C++ wheel and one for all the Python wheels, matching the conda packages more closely (this becomes feasible after #33 because the Python builds will be very fast if they don't have to rebuild the C++). Once that is the case, we can also upload/download all the wheels in the same manner that we do for conda packages (a single tarball for all wheels instead of one per wheel), so we can get rid of support for RAPIDS_PY_WHEEL_NAME. In the PRs for #33 we're currently abusing RAPIDS_PY_WHEEL_NAME to handle the CUDA version, so we need to start adding it for wheels before we can get rid of that variable.
rapids-download-conda-from-s3 automates choosing the output directory, while rapids-download-wheels-from-s3 requires that the caller specify it. We should update the wheel tool to automate that too.
Update conda jobs to include conda in the name. Currently wheels jobs are e.g. build_wheel_*, whereas conda is just build_cpp.sh etc. That is an artifact of a time when conda was our only produced artifact.
The rapids-wheels-anaconda tool will need to be modified to support upload of cpp wheels.

I will update this list as more ideas come to mind.

Lint for (and replace) `http://` with `https://` (where possible)

Wherever possible, it would be ideal to use https://. Currently we have several cases where http:// is still used

To address this, think there are a couple things that would be helpful:

Linting tool to flag http:// usage
Verify whether https:// works for the URL in question
Automating replacement http:// with https:// (where the replaced URL works)

Related to issue ( #9 )

Explore ways to maximize coverage while minimizing cost of the CI test matrix

There are various potential improvements we could make to our CI matrix to improve test coverage of critical components while reducing the overall load. Some possible improvements that have been suggested at various times include:

Reducing the frequency of testing multiple Python versions to nightly: We very rarely encounter cases where our CI tests uncover a bug that is only present in one Python version. While it is possible, we don't generally use cutting edge language or standard library features that require a very new version of Python. We could choose to only test the oldest supported Python version in PR CI and only run the other tests in nightlies.
Don't test arm builds all the time: Architecture-specific bugs are similarly rare and we probably don't need to test both x86 and arm on every PR and in CI.
Better testing of CUDA drivers: Currently we only test the oldest and newest supported drivers between both PR and nightly CI. However, for use cases like JIT-compiled kernels we may, at minimum, want to test the oldest and newest supported drivers for each major version of CUDA that we support.
Reduce duplication between wheel and conda testing: We currently run a matrix of both wheel and conda tests. We may be able to use a subset of those as long as we get coverage of both wheels and conda (and coverage of all the other pieces -- Python version, CUDA verison, etc -- between the wheel and conda tests).

Update package metadata to allow safe coexistence of nightly and release conda channels

Currently RAPIDS conda packages pin other RAPIDS packages in recipes using version constraints that are effectively of the form YY.MM.*. In conda, the trailing .* allows nightlies. That makes using the rapidsai and rapidsai-nightly channels in the same conda install/env creation command potentially problematic, and could lead to situations where user install commands result in environments that are technically invalid. This is especially likely to be problematic for rapids-dask-dependency given the high rate of dask changes and the fact that we track the main branch until just before releases, causing potential problems around release time.

With pip packages, our use of nightly packages is in some sense more controlled. pip will only use nightlies if passed the --pre flag on the command line or if the version constraint explicitly includes dev versions, e.g. via a constraint of the form >=YY.MM.00a0. To accommodate this, we set the versions of our packages inside our package build scripts, directly modifying pyproject.toml before invoking pip wheel. Our final releases do not have constraints specified in this way. That behavior affords users some degree of protection. Although users could still break things by manually specifying --pre, the default behavior is safe and it's fair to say users are on their own if they use --pre with release channels. Therefore, mixing the nightly and stable pip indexes is in this sense relatively safer than mixing the nightly and stable conda channels.

We should consider rewriting dependencies in our conda packages to specify constraints in a way that only allows nightly packages to be installed when building nightlies. This could easily be accomplished by using an environment variable that is read in the meta.yaml recipe, by parsing the VERSION file to determine whether the current version corresponds to a nightly build, or by any number of other similar strategies.

Consider statically linking the CUDA runtime

Currently RAPIDS libraries support static linkage to cudart via a CMake flag CUDA_STATIC_RUNTIME. This flag is leveraged by wheel builds and by the Spark-RAPIDS JNI (specifically for cudf), but it is not the default. We would like to consider changing that. Using static libcudart has a few advantages:

It aligns with the nvcc default.
It aligns with recommendations from the CUDA programming guide. This lists multiple advantages including
- It allows multiple CUDA runtimes to exist in the same process space. This is particularly valuable for RAPIDS in the context of packages like wheels where users could install wheels built against different CUDA runtimes, but can also be useful in other contexts.
- Reduces the likelihood of compatibility issues
Since it is nvcc's default behavior, that default has translated to conda-forge
It overcomes some known performance issues with using the dynamic libcudart.so
It would allow newer features to be used on user systems with older runtimes (newer drivers may still be required).

Given that cudart is small, the typical size concerns around static linking aren't concerning. However, the CUDA libraries (such as the math libraries like cuBLAS) are large, so we don't typically want to statically link those. Furthermore, static linking has the potential to open us up to issues around weak linking and CUDA kernels in the case of header-only libraries (i.e. anything using thrust, or raft). Therefore, before we can move to building statically by default, we should ensure that our libraries are safe to build that way by marking all kernels as static.

Migrate Conda recipes to `{{ stdlib('c') }}`

The sysroot* syntax used currently by RAPIDS recipes is getting phased out

Sample syntax as seen currently in librmm:

recipe/meta.yaml:

    - sysroot_{{ target_platform }} {{ sysroot_version }}

recipe/conda_build_config.yaml:

sysroot_version:
  - "2.17"

The recommendation is to move to {{ stdlib('c') }} ( conda-forge/conda-forge.github.io#2102 )

Changes would look something like this

recipe/meta.yaml:

-    - sysroot_{{ target_platform }} {{ sysroot_version }}
+    - {{ stdlib('c') }}

recipe/conda_build_config.yaml:

-sysroot_version:
+c_stdlib_version:
  - "2.17"

Raising this issue to track making these changes in RAPIDS

Require UCX 1.14.0+

Recently we dropped libnuma from RAPIDS Docker images: rapidsai/miniforge-cuda#22 (comment)

The libnuma dependency is added to conda-forge's ucx packages starting in 1.14.0: conda-forge/ucx-split-feedstock#111

To ensure libnuma is available, we should make sure that ucx is 1.14.0 or newer. Some potential places to update

Investigate using micromamba instead of installing miniforge

We currently use miniforge as our minimal conda installation in our CI images. However, we may be able to switch to something even more lightweight, micromamba. This switch would allow us to shrink our images and also simplify our image builds since we currently take whatever base Python version miniforge installed and then upgrade/downgrade depending on the needs of our particular image.

Ensure cached packages installed in CI test phase

Recently we ran into an issue on a project (cuCIM) where older packages of the project (libcucim & cucim from 23.12) were installed instead of the most recent packages from the PR (24.02.00a*). This made the installation look successful. However old issues that had been fixed in the development branch (branch-24.02) were not getting picked up

This was ultimately caused by a solver issue. However we were not able to ascertain that until we pinned the packages installed in the test phase to the latest version. Then the solver issue became clear and we could work to resolve that

Think we should take a closer look at this issue and come up with a way to guarantee the cached packages are picked up as opposed to some other packages. Attempted to do this more directly by using the <channel>::<package> syntax, but this didn't work well with file based channels. Maybe there is a better way to do this

Roll out H100 Testing

We are preparing to add H100 testing to RAPIDS in rapidsai/shared-workflows#194.

So far we have the following test PRs opened:

We should identify any other major repositories that need to be tested with H100s before the shared-workflows PR above is merged.

Enable running all pytests in `arm64` builds

For arm64 jobs, we currently only run smoke tests that don't test the full coverage on arm, the following are all the PRs that will enable running full suite of pytests on arm64 jobs:

Update RAPIDS Python packages to use the new rapids-build-backend

rapids-build-backend is a wrapper around standard backends like scikit-build-core and setuptools that handles some of the standard issues that we face for RAPIDS packages (CUDA versioning, alpha versions, etc). Substituting it into existing RAPIDS Python packages should be fairly painless, but will require some careful testing to verify that nothing is broken.

The backend need not be updated in lockstep across all of RAPIDS for this to work, and merging PRs in any order should be generally safe. However, since there are cases like unified devcontainers where underlying build commands may need an update, it would be best to try and test at least a couple of core packages together to verify that everything works as expected.

Split RAPIDS C++ conda libraries into standardized components

Currently most RAPIDS C++ libraries produce a single lib*.so that represents the complete output of the C++ library. There are additional conda packages produced for things like tests, benchmarks, and examples, but the core libraries are contained in a single conda package. While this has historically been fine, we are now seeing increased usage of RAPIDS libraries as dependencies of other libraries, both internally (e.g. cugraph-ops and cumlprims_mg are primarily consumed as dependencies of cugraph and cuml, respectively) and externally (raft is being increasingly used by vector dbs etc). Moreover, we are seeing the potential for static RAPIDS libraries.

Our current package structure is not well suited to handle all of these uses. The lack of separation between runtime and build-time packages means that build-time dependencies are often propagated unnecessarily, bloating runtime environments and making conda solves more complex than they need to be. Additionally, not having standardized package delineations puts a greater onus on downstream developers to know which packages to include in what parts of the recipes, which in turn often leads to misconfigured recipes down the line that cause additional issues. As our conda environments become more and more complex, having packages configured correctly is critical to reducing the number of issues we run into. Some packages (especially raft) have started to address some of these concerns piecemeal to fix specific use cases, but I think now would be a good time for us to consider adopting a more holistic strategy here.

To better address these diverse use cases, we should migrate all RAPIDS packages to offer a more standardized set of packages.
The most common case will be RAPIDS libraries that produce two different conda packages:

${lib}: The base package would only contain the shared library, basically the minimal runtime requirement for any other package that depends on this library. For example, libcuml would have a runtime dependency on libraft because it needs libraft functions at runtime.
${lib}-dev: *-dev packages should include everything required to build against the library. ${lib}-dev should include a runtime dependency on ${lib} so that the library can be linked to. It should also include a runtime dependency on anything required to build against the package since this package will only be installed with the intent of building against the library. In addition, the dev package should include header files required to compile code that uses the package as well as any packaging files like CMake config files (for now we don't produce e.g. pkgconfig files, but such things would also go in this library if we did). The ${lib}-dev package should include a run export of ${lib}, which ensures that any package that builds against ${lib}-dev will automatically have ${lib} added to its list of runtime dependencies. Typically Python packages will consume the dev version of the C++ package.

For most libraries, the above two will be sufficient. In cases where RAPIDS libraries also want to offer a static component, we will also want to produce

${lib}-static: static packages will contain the static library. If a static package exists for a given library, then the corresponding dev package should include a run_constrained specification section so that the dev package and the static package require installing consistent versions.

Header-only libraries

Some RAPIDS libraries are header-only (rmm) or offer a header-only component (raft). This introduces an additional layer of complexity. I do not know if there is a standard for this, so please comment if there is one that we should follow. If not, I would propose the following layout:

${lib}-headers: This package should exist only for packages that support header-only usage. This package should include all header files and have a runtime dependency for every other package that is required to build against these headers. It should also include CMake config files so that the headers can be found by CMake. If a headers package exists, the dev package should depend on the headers package. In most cases, ${lib}-dev will likely just be a metapackage that pulls in ${lib} and ${lib}-headers. There may need to be some additional CMake files to stitch together the headers with the runtime libs. The header package should not include a run export on the corresponding lib since the presumption is that this package should only be pulled for header-only usage.

Additional considerations

raft currently produces an additional package libraft-headers-only. The purpose of this package is to allow consumers of the raft headers to include and use a limited subset of raft that does not require CUDA math libraries. I do not think that this is a standard use case that we'll need to support more generally. However, if we were to support this kind of usage, I would probably argue for modifying the package so that libraft-headers-only only contained the headers that are actually consumable without CUDA math libraries. Currently I believe that it includes all headers, so it is the user's responsibility to only use the headers that don't require CUDA libs (or to manually install CUDA libs).

rapidsai / build-planning Goto Github PK

build-planning's People

Contributors

Watchers

Forkers

build-planning's Issues

My tasks

Tasks

Proposed Solution

Notes

PRs in flight

Tasks

CI images

CI workflows

RAPIDS repositories

Post-migration

Tasks

Early 24.06

Before 24.06 release

Tasks

PRs Contributing To This Effort

Issues

Overview

When should we drop Python 3.9?

Tasks

CI images

CI workflows

RAPIDS repositories

Post-migration

Overview

Approach

Header-only libraries

Additional considerations

Recommend Projects

Recommend Topics

Recommend Org