rapidsai / build-planning

Tracking for RAPIDS-wide build tasks

Home Page: https://github.com/rapidsai
Downstream consumers of statically built versions of RAPIDS C++ projects have encountered runtime issues due to multiple instances of the same kernel existing in different DSOs. To resolve this issue, we need to ensure that all CUDA kernels in all RAPIDS libraries have internal linkage: static for projects using whole compilation, __attribute__((visibility("hidden"))) for header libraries / separable compilation.
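A rough way to spot-check a built library for this problem is to look for kernel launch stubs that are still externally visible. This is a heuristic sketch only; the library name is an example, and it assumes an nvcc-built DSO whose host-side stubs carry the usual __device_stub__ prefix:

# List exported (T) or weak (W) kernel launch stubs; kernels that have been
# given internal linkage should not appear here.
nm --defined-only libcudf.so | grep -E ' [TW] .*__device_stub__' \
  || echo "no externally visible kernel stubs"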
Since only one Dask-CUDA build is needed, we filter the GHA matrix down to a single architecture, Python version, and CUDA version, so that the package is built only once. However, this currently hard-codes the version of each in the selection logic, which means it can go stale as new versions are added and old ones dropped, potentially resulting in the build being lost altogether (maybe even silently).
To avoid pinning to a specific version, @ajschmidt8 made several suggestions in this thread: rapidsai/dask-cuda#1294 (comment)
Filing this to track for follow-up
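One possible direction (a sketch only; the matrix field names are assumptions, and this does not reproduce the suggestions from the linked thread verbatim) is to select the newest combination dynamically instead of naming versions:

# Pick the single newest CUDA/Python combination from the matrix rather than
# hard-coding versions. Naive string comparison is only safe while version
# components keep equal widths.
jq -c '{include: [.include | max_by([.CUDA_VER, .PY_VER])]}' matrix.json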
To ensure we have all the information in logs when an error occurs, so that we can easily share that information in upstream bug reports, we should include common config and info commands at the top of our logs (aligned with the projects where we might raise issues, like pip, conda, NumPy, pandas, Dask, etc.). That way it is easy to extract this information when filling out upstream issue templates.
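As a sketch, the kind of diagnostics worth echoing at the top of every CI log might look like:

# Environment/config dumps that upstream issue templates commonly ask for.
conda info
conda config --show-sources
python -m pip list
python -m pip debug --verbose
nvidia-smi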
This is a meta-issue for tracking support for CUDA 12.2 builds. I'll open separate sub-issues for pip and conda because there is different work that needs to be done for each.
We may want to add some kind of URL checking to the CI of RAPIDS projects to confirm links are valid. Recently a couple of projects linked to an image that was moved, so now they have missing image icons popping up. Having some kind of link check would help catch this when making these kinds of changes and allow us to put migration steps in place. This could make sense on RAPIDS projects themselves (maybe as part of doc builds). It could also make sense on any shared assets used by multiple projects.
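A minimal sketch of such a check (a dedicated tool like lychee would also work; the docs/ path is an assumption):

# Extract URLs from the docs and flag any that fail to respond.
grep -rhoE 'https?://[^ )>"'\'']+' docs/ | sort -u | while read -r url; do
  curl -fsSIL -o /dev/null "$url" || echo "BROKEN: $url"
done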
Currently RAPIDS wheels adhere strictly to the manylinux policy. While the glibc/kernel ABI restrictions are not particularly onerous, the requirement that binary wheels be essentially self-contained and only depend on a small set of external shared libraries is problematic. To adhere to this restriction, RAPIDS wheels statically link (or in rare cases, bundle) all of their external library dependencies, leading to severe binary bloat. The biggest problem with this behavior is that the current sizes prohibit us from publishing our wheels on PyPI. Beyond that come the usual more infrastructural problems: longer CI times due to extra compilation, larger binaries making wheel download and installation slower, etc. The focus of this issue is to define a better solution than static linking for this problem that still adheres to the manylinux spec in spirit while reducing binary sizes. This issue will not address the usage of CUDA math library dynamic library wheels; that will be discussed separately.
RAPIDS should start publishing its C++ libraries as standalone wheels that can be pip installed independently of the Python (/Cython) wheels.
A key question to address is how to encode binary dependencies between wheels. One option is for each wheel to embed RPATHs pointing to the expected relative path to library dependencies in other wheels. This could be accomplished with some CMake to extract library locations from targets and then construct relative paths during the build based on the assumption that the packages are installed into a standard site-packages layout. However, since this approach is fragile and has generally been frowned upon by the Python community in the past, I suggest that we instead exploit dynamic loading to load the library on import of a package. This choice would make packages sensitive to import order (C++ wheels would need to be imported before any other extension module that links to them) but I think that's a reasonable price to pay since it only matters when depending on a C++ wheel. This solution also lets us handle the logic in Python, making it far easier to configure and control. Moreover, it will make the solution fairly composable when an extension module depends on a C++ wheel that depends on yet another C++ wheel.
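For concreteness, a minimal sketch of the load-on-import approach (the package layout and names here are hypothetical):

# libexample/__init__.py -- Python shim shipped in a hypothetical C++ wheel.
import ctypes
import os

# Loading with RTLD_GLOBAL publishes the library's symbols so that extension
# modules imported afterwards can resolve them without their own RPATHs.
_lib = ctypes.CDLL(
    os.path.join(os.path.dirname(__file__), "lib64", "libexample.so"),
    mode=ctypes.RTLD_GLOBAL,
)

An extension module that links against libexample would then just need import libexample to run before it is imported.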
Once these wheels exist, we should rewrite the existing Python packages to require the corresponding C++ wheels. The current approach of "find C++ if exists, build otherwise" can be scrapped in favor of always requiring that the C++ CMake package be found. Consumers will have the choice of installing the C++ library (e.g. from conda), building it from source, or installing the C++ wheel. The C++ wheel will become a hard dependency in pyproject.toml, so it will automatically be installed when building. In conda environments the pyproject dependencies are ignored, so the new wheels will not be installed, and similarly in devcontainer builds where requirements are generated dynamically from dependencies.yaml. Ultimately a pylibraft->libraft dependency will behave nearly identically to a raft-dask->pylibraft dependency from the perspective of dependency management.
Currently we build a base image with CUDA and conda installed in the miniforge-cuda repo, then use that as a base for the conda images in the ci-imgs repo. Due to the tight coupling between these images, we trigger a rebuild of the ci-imgs whenever a PR is merged to miniforge-cuda. Given this tight coupling, I think we should consider merging these repos. Combining the repositories will make it easier to teach new build engineers about the pieces we have, and make it easier to maintain since it will reduce the amount of distinct processes required if there are changes that need to be coordinated between the repositories.
Enabling overlinking checks will help us capture issues with things like rpaths or libraries that we're implicitly/unexpectedly linking to.
rapidsai/rmm#1417 and a follow-on PR make a number of improvements and changes to help consumers of RMM avoid accessing RMM's detail namespace. In particular, the initial pool size for pool_memory_resource is now required to be provided (it is no longer optional, so the default no longer relies on the detail namespace). A pool_memory_resource<Upstream>(upstream) can be replaced by pool_memory_resource<Upstream>(upstream, rmm::percent_of_free_device_memory(50)), which matches the previous behavior. All RAPIDS repos that use any of the above will need to be updated. rapidsai/rmm#1417 only adds functionality; follow-on PR(s) will deprecate / remove functionality after RAPIDS repos are updated. Based on a search, the following repo issues cover the required changes.
RAPIDS currently supports Python 3.9 and 3.10. We would like to add support for Python 3.11. This issue documents the steps needed.
Each section should be fully completed before moving to the next section.
Update the CI images (ci-conda, ci-wheel, citestwheel): rapidsai/ci-imgs#96

Branch Strategy:

Create a branch of shared-workflows called python-3.11, and add Python 3.11 to the conda build matrix on the python-3.11 branch (https://github.com/rapidsai/shared-workflows/blob/e7ebbae5854727b897b65213cf51ff8b965f53c1/.github/workflows/conda-python-build.yaml#L56-L63) and to the conda test matrix on the python-3.11 branch (https://github.com/rapidsai/shared-workflows/blob/e7ebbae5854727b897b65213cf51ff8b965f53c1/.github/workflows/conda-python-tests.yaml#L68-L82).
Experimental Strategy:

Use continue-on-error to add Python 3.11 jobs that are allowed to fail, for all repos (see the sketch below). Once all repos are passing, we could require the job to pass. This could be cleaner and less total work than our normal approach of the branch strategy above. I copied a list of repositories from https://github.com/rapidsai/workflows/blob/dfd73ad47d977c57ee27b2349216f85946be757b/.github/workflows/nightly-pipeline.yaml#L27-L44 and sorted it roughly in dependency-tree order.
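A hedged sketch of what the experimental strategy could look like in a workflow (job and matrix names are assumptions):

# Allow only the Python 3.11 matrix entries to fail while they stabilize.
jobs:
  conda-python-tests:
    strategy:
      matrix:
        py: ["3.9", "3.10", "3.11"]
    continue-on-error: ${{ matrix.py == '3.11' }}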
For each repo:
- Update .github/workflows/*.yaml to point to the python-3.11 branch of shared-workflows
- Update dependencies.yaml to add support for Python 3.11
- Check pyproject.toml files for necessary changes (classifiers, etc.)

Most of this is easy to automate with rapids-reviser, and I've made an attempt at it here: https://github.com/rapidsai/rapids-reviser/pull/11. We still need to manually review the PRs for missing pieces.
Repos:
(checklist moved to #3 (comment))
Once all repos are migrated to the python-3.11 branch, the migration is complete. We merge python-3.11 into branch-24.02 on shared-workflows and then open follow-up PRs to each repo to reset the branches to branch-24.02. This "reset" is simple and should be automated with rapids-reviser.
Then:
- Update the ci-imgs repo's latest configuration to use Python 3.11: https://github.com/rapidsai/ci-imgs/blob/main/latest.yaml. The latest image is frequently used by CI jobs for building docs and testing notebooks, so be aware that issues may arise in those jobs.
- Update the docker repo's matrix.yaml and matrix-test.yaml
- Update pypi-wheel-scripts so that Python 3.11 wheels are checked: https://github.com/rapidsai/pypi-wheel-scripts/blob/fa1e8744c8ec961a5b5e38ae172ae9c8c51b4280/release/check-wheels.sh#L41-L45
This is a cross-RAPIDS tracking issue for the epic described in this document. Please refer to the document for background and details.
Currently the only way to access RAPIDS C++ libraries is via conda. There is no easy way to install RAPIDS C++ libraries in any other context. With #33, that will change to an extent, since it will be possible to pip install the binaries. However, that is a very nontraditional approach for providing native libraries, and should not be our primary avenue for producing said binaries. That said, the changes in #33 are also essentially a PoC that we can build the C++ libraries in our wheel containers and produce binaries that are portable, at least up to the requirements of the Python manylinux standard.
Once #33 is completed, we should take this one step further and build the C++ libraries standalone. We should be able to extract most of what we need directly from the C++ wheel building scripts since all native dependencies are already preinstalled into the wheel images and we know that these images produce fairly portable binaries. We can use CPack to produce native packages for whatever targets that we care about here. Then, the C++ wheel builds can be modified to simply include the entire contents of the CPacked package into the wheel install.
Getting this working will probably require some significant experimentation. Some notes: a simple install(IMPORTED_RUNTIME_ARTIFACTS) of the CPacked library into the wheel will not be sufficient, because that will only install the library itself. What we want is to copy over everything required to make this a valid CMake package, such as the CMake config files, as well as all the headers and anything else needed to compile against the package, in addition to the compiled libraries. However, we also may not want every single thing that is contained in the library; the CPacked package may include e.g. test binaries that we do not want to include in the installation. This may require some level of work on the scikit-build-core side to support.
We would like to start publishing wheels that support versions of CUDA newer than CUDA 12.0. Currently, this requires that we:
Once we have 12.2 wheels and conda packaging has caught up, we may wish to revisit the CTK version and go to 12.3. For initial work we will start with 12.2, though.
Move the creation of the env.yaml file in build scripts to a temporary directory instead of the current directory. A reference change was made here to the cudf repo: rapidsai/cudf#14476; make this change to other RAPIDS repos with similar build scripts. During CI builds, an env.yaml file is created for use with a mamba call later in the script. This is a problem when trying to reproduce CI locally, because the launched docker container tries to create this file locally as root. If the user does not have root access (or equivalent permissions), the file is not created and the process aborts.
The same error occurs when creating the test_results directory when RAPIDS_TEST_DIR is not set. In this case the script tries to create test_results in the current working directory, resulting in the same failure.
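A minimal sketch of the fix, mirroring the shape of the cudf change (the generator invocation here is the one our CI scripts already use):

# Write generated files to a temp dir so local CI reproduction doesn't need
# write access to the mounted working directory.
ENV_YAML_DIR="$(mktemp -d)"
rapids-dependency-file-generator \
  --output conda \
  --file_key test_python \
  --matrix "cuda=${RAPIDS_CUDA_VERSION%.*};arch=$(arch);py=${RAPIDS_PY_VERSION}" \
  | tee "${ENV_YAML_DIR}/env.yaml"
rapids-mamba-retry env create --force -f "${ENV_YAML_DIR}/env.yaml" -n test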
Also mentioned here rapidsai/cudf#14476 (comment) was possibly using https://github.com/rapidsai/rapids-reviser to help update the appropriate RAPIDS repositories.
RAPIDS currently builds conda packages in CI using conda-build. The rattler-build tool is a newer alternative. It is written in Rust and should be faster than conda-build (I haven't seen any official benchmarks yet, though). It only supports a limited subset of the meta.yaml recipe format, but that subset is designed to still enable all the same features, just with a more limited syntax (see CEPs 13 and 14). conda-build overhead is nontrivial (I've never benchmarked it, but I know it can stretch into multiple minutes beyond the environment solve when doing local CI reproductions), and reducing that would be quite valuable for improving our CI turnaround. Moreover, switching to the more restricted syntax described in the above CEPs would be beneficial because it would convert our conda recipes into pure YAML rather than the extended YAML currently used by meta.yaml. That change is important because the YAML extensions currently in our recipes make them impossible to parse or write with standard YAML parsers, which is a big reason why we have struggled to do things like support meta.yaml files in rapids-dependency-file-generator.
We should do a PoC of replacing conda-build with rattler-build in one repo (preferably something reasonably complex like cudf or cugraph) to see what it would take to make this transition, and how much we would benefit.
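For flavor, a hedged sketch of what the CEP 13/14 recipe format looks like (the package contents are illustrative, and the exact helper functions available should be checked against the rattler-build docs):

# recipe.yaml -- pure YAML, parseable by any standard YAML parser.
schema_version: 1

package:
  name: example
  version: ${{ env.get("RAPIDS_PACKAGE_VERSION") }}

requirements:
  build:
    - ${{ compiler("cxx") }}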
Python has a limited API that is guaranteed to be stable across minor releases. Any code using the Python C API that limits itself to using code in the limited API is guaranteed to also compile on future minor versions of Python within the same major family. More importantly, all symbols in the current (and some historical) version of the limited API are part of Python's stable ABI, which also does not change between Python minor versions and allows extensions compiled against one Python version to continue working on future versions of Python.
Currently RAPIDS builds a single wheel per Python version. If we were to compile using the Python stable ABI, we would be able to instead build a single wheel that works for all Python versions that we support. There would be a number of benefits here:
Here are the tasks (some ours, some external) that need to be accomplished to make this possible:
At this stage, it is not yet clear whether the tradeoffs required will be worthwhile, or at what point the ecosystem's support for the limited API will be reliable enough for us to use in production. However, it shouldn't be too much work to get us to the point of at least being able to experiment with limited API builds, so we can start answering questions around performance and complexity fairly soon. I expect that we can pretty easily remove explicit reliance on any APIs that are not part of the stable ABI, at which point this really becomes a question of the level of support our binding tools provide and if/when we're comfortable with those.
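To make the experiment concrete, a hedged sketch of what opting in could look like with scikit-build-core (treat the exact configuration as an assumption to verify):

# pyproject.toml -- request an abi3 wheel tagged against the oldest supported
# CPython, so one wheel covers that version and everything newer.
[tool.scikit-build]
wheel.py-api = "cp39"

The extension sources themselves must also restrict to the limited API (e.g. compiling with Py_LIMITED_API defined), which is the part our binding tools have to support.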
In order to achieve manylinux compliance, RAPIDS wheels currently statically link all components of the CTK that they consume. This leads to heavily bloated binaries, especially when the effect is compounded across many packages. Since NVIDIA now publishes wheels containing the CUDA libraries and these libraries have been stress tested by the wheels for various deep learning frameworks (e.g. pytorch now depends on the CUDA wheels), RAPIDS should now do the same to reduce our wheel sizes. This work is a companion to #33 that should probably be tackled afterwards since #33 will reduce the scope of these changes to just the resulting C++ wheels, a meaningful reduction since multiple RAPIDS repos produce multiple wheels. While the goals of this are aligned with #33 and the approach is similar, there are some notable differences because of the way the CUDA wheels are structured. In particular, they are not really designed to be compiled against, only run against. They do generally seem to contain both includes and libraries, which is helpful, but they do not contain any CMake or other packaging metadata, nor do they contain the multiple symlinked copies of libraries (e.g. linker name->soname->library name). The latter is a fundamental limitation of wheels not supporting symlinks, but could cause issues for library discovery using standardized solutions like CMake's FindCUDAToolkit or pkg-config that rely on a specific version of those files existing (AFAICT only the SONAME is present). We should stage work on this in a way that minimizes conflicts with #31 and #33, both of which should facilitate this change. I propose the following, but all of it is open for discussion:
First, stop statically linking the CUDA math libraries, instead linking them dynamically and excluding them from the wheel repair step via auditwheel's --exclude flag. The resulting wheel should be inspected to verify that all CUDA math libraries have been excluded from the build. Note that (at least for now) we want to continue statically linking the CUDA runtime. This change will likely require some CMake work to decouple static linking of cudart from the static linking of other CUDA libraries. The excluded libraries then need to be discoverable at runtime from the CUDA wheels themselves, rather than relying on users setting LD_LIBRARY_PATH.
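A hedged sketch of the exclusion step (the sonames are examples; the full set depends on the library being built):

# Repair the wheel without bundling CUDA math libraries, so they resolve from
# the NVIDIA-published wheels at runtime.
auditwheel repair -w wheelhouse/ \
  --exclude libcublas.so.12 \
  --exclude libcusolver.so.11 \
  --exclude libcusparse.so.12 \
  dist/*.whl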
We plan to drop CentOS 7 (which uses glibc 2.17) in RAPIDS 24.06. https://docs.rapids.ai/notices/rsn0037/
At this time, the new minimum glibc supported by RAPIDS will become 2.28 (used by Rocky 8), because that is the oldest glibc of any operating system we currently support.
This issue documents some of the work items needed to complete the drop in platform support.
- NEW_MANYLINUX_MATRIX: after dropping CentOS 7, RAPIDS will not build manylinux_2_17 tags; RAPIDS will only produce wheels that support manylinux_2_28 tags, but nothing older (see the build-2_28-wheels option: rapidsai/shared-workflows#195)
- Merge shared-workflows PRs from renovate, including (but not limited to) updating shared-actions to use actions with Node 20 (to avoid deprecation warnings)
- Remove centos7 and ubuntu18.04 from the https://github.com/rapidsai/ci-imgs matrix
Feel free to edit this checklist with more items.
References:
NumPy 2.0 is coming out soon ( numpy/numpy#24300 ). NumPy 2.0.0rc1 packages for conda & wheels came out 2 weeks back ( numpy/numpy#24300 (comment) )
Ecosystem support for NumPy 2.0 is being tracked in issue: numpy/numpy#26191
Also conda-forge is discussing how to support NumPy 2.0: conda-forge/conda-forge.github.io#1997
When building against NumPy 2.0 with default settings, it is possible to build packages that are compatible with both NumPy 1 and 2: the build targets the oldest NumPy ABI available for the Python version being targeted. From a RAPIDS perspective, we will need to identify our dependencies that use NumPy and track when they have been upgraded to support NumPy 2.
NumPy 2 is expected to be released in the near future. For the RAPIDS 24.04 release, we will pin to numpy>=1.23,<2.0a0. This issue tracks the work needed to add an upper bound to affected RAPIDS repositories.
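For reference, the pin as it would appear in an affected project's pyproject.toml:

[project]
dependencies = [
    "numpy>=1.23,<2.0a0",
]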
cc: @jakirkham
We want to add rules to RAPIDS repos clang-format files to automate the C++ include grouping and ordering to ensure consistency and make it easier to write scripts that insert includes into C++ files. Such scripts would not have to worry about placing the includes in the right place because clang-format will fix up any ordering or grouping problems after running the script.
Full discussion in the cuDF PR.
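A hedged sketch of the kind of .clang-format rules in question (the grouping regexes here are illustrative, not an agreed RAPIDS convention):

# Regroup includes into fixed categories regardless of where a script inserts them.
IncludeBlocks: Regroup
IncludeCategories:
  - Regex: '^"'                  # project-local headers
    Priority: 1
  - Regex: '^<(cudf|rmm|raft)/'  # other RAPIDS libraries
    Priority: 2
  - Regex: '^<'                  # system and third-party headers
    Priority: 3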
RAPIDS libraries are generally built with CMake. To facilitate better integration of the C++ builds with Python builds, we switched from using pure setuptools builds to using scikit-build. This change was crucial to enabling wheels by providing a single standard entrypoint (all the usual Python pip [install|wheel|etc]
machinery) for building a Python package while also compiling the required C++ components. However, scikit-build's approach to enabling this is fundamentally limited because it relies on plugging into setuptools directly in ways that setuptools only marginally supports. The result is a tool that works most of the time, but has various sharp edges (e.g. incomplete support for MANIFEST.in, broken installations in certain cases, etc) and limitations (an inability to support true editable installations, mixed support for pyproject.toml/setup.py, etc).
The solution is to switch to the newer scikit-build-core builder, a modern standards-based builder that offers the same class of functionality as scikit-build (integrating a Python build with CMake) in a more reliable manner. Doing so will allow us to completely remove deprecated components of our build systems (various uses of setup.py), get rid of workarounds for scikit-build (e.g. the languages we must specify at the CMake level), and get full support for critical features like editable installs.
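As a sketch, the switch in a package's pyproject.toml is roughly the following (exact requirements and pins to be verified per package):

# Replace the setuptools/scikit-build entrypoint with scikit-build-core.
[build-system]
requires = ["scikit-build-core", "cython"]
build-backend = "scikit_build_core.build"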
Opening this issue to track moving from pynvml to nvidia-ml-py. There has been past discussion and issues about this. Moving this to build-planning to improve visibility. I have compiled a list of issues on RAPIDS projects that would need to be updated to complete the move. I largely expect this to be pretty simple string replacement.
In a few cases pynvml.smi is used, which does not have an equivalent in nvidia-ml-py. If we don't need pynvml.smi in the places it is used, we could simply drop those code paths. If we do need it for some reason, we may need to think more about what a reasonable replacement would be.
To make it easier to catch and fix deprecations in RAPIDS projects, it is worth considering converting deprecation warnings to errors on CI. That way deprecations fail loudly and we are able to catch and address them quickly. Alternatively we can use that opportunity to tighten our dependencies and flag the deprecation for follow up when we are ready.
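A minimal sketch of what this could look like with pytest (the filter entries are illustrative):

# pyproject.toml -- promote deprecation warnings to errors in CI, with an
# escape hatch for warnings we've triaged and flagged for follow-up.
[tool.pytest.ini_options]
filterwarnings = [
    "error::DeprecationWarning",
    "error::FutureWarning",
    # "ignore:<known message>:DeprecationWarning",  # triaged, tracked elsewhere
]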
Currently (using cuML as an example here), the conda test environment initialization for most CI jobs looks something like the following. First, we create the test environment:
rapids-dependency-file-generator \
--output conda \
--file_key test_python \
--matrix "cuda=${RAPIDS_CUDA_VERSION%.*};arch=$(arch);py=${RAPIDS_PY_VERSION}" | tee env.yaml
rapids-mamba-retry env create --force -f env.yaml -n test
And then downloading and installing build artifacts from previous jobs on top of this environment:
CPP_CHANNEL=$(rapids-download-conda-from-s3 cpp)
PYTHON_CHANNEL=$(rapids-download-conda-from-s3 python)
...
rapids-mamba-retry install \
--channel "${CPP_CHANNEL}" \
--channel "${PYTHON_CHANNEL}" \
libcuml cuml
In addition to forcing us to eat the cost of a second conda environment solve, in many cases this can cause some pretty drastic changes to the environment which can be blocking - for example, consider this cuML run which fails because conda is unable to solve a downgrade from Arrow 15.0.0 (build 5) to 14.0.1.
Our current workaround for this is to manually add pinnings to the testing dependencies initially solved such that the artifact installation can be solved, but this can introduce a lot of burden in needing to:
Would it be possible to consolidate some (or all) of these conda environment solves by instead:
In my mind, the main blocker I could see to this working would be if rapids-download-conda-from-s3 requires some conda packages contained in the testing environment in order to work.
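A hedged sketch of the consolidated flow; the --prepend-channels flag is an assumption about how the generator could be taught to inject the artifact channels ahead of the single solve:

# Download artifact channels first, then solve once with them included.
CPP_CHANNEL=$(rapids-download-conda-from-s3 cpp)
PYTHON_CHANNEL=$(rapids-download-conda-from-s3 python)

rapids-dependency-file-generator \
  --output conda \
  --file_key test_python \
  --matrix "cuda=${RAPIDS_CUDA_VERSION%.*};arch=$(arch);py=${RAPIDS_PY_VERSION}" \
  --prepend-channels "${CPP_CHANNEL};${PYTHON_CHANNEL}" | tee env.yaml

rapids-mamba-retry env create --force -f env.yaml -n test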
Currently we typically dynamically link gtest/gbench/nvbench into our tests/benchmarks. This is unnecessary and it makes these executables harder to consume for people who want to run our tests/benchmarks. We should change this default across RAPIDS. The first step will be updating rapids-cmake to support using gtest statically. Once that is done, we can start rolling out the changes across RAPIDS.
Many (most?) projects have an update-version.sh script that uses sed expressions to replace the RAPIDS version in the repository's files. Many of these hard-coded usages of the version can be replaced with smarter dynamic reading of the VERSION file, and the remaining usages that must be hard-coded can be updated by a centralized hook in https://github.com/rapidsai/pre-commit-hooks that reads a configuration file from the repo.
In this issue, I propose two RAPIDS-wide changes:
- Read the version dynamically from VERSION wherever possible. rapidsai/cudf#14867 serves as an example for updating CMake code to read from VERSION; other scripts in other languages can do something similar (see the sketch below).
- Centralize the remaining hard-coded updates in a shared pre-commit hook, as described above.
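A minimal sketch of the first change in shell (the file layout is assumed to match the usual RAPIDS repo conventions):

# Derive versions from the VERSION file instead of hard-coding them.
RAPIDS_FULL_VERSION="$(tr -d '[:space:]' < VERSION)"   # e.g. 24.02.00
RAPIDS_SHORT_VERSION="${RAPIDS_FULL_VERSION%.*}"       # e.g. 24.02
echo "building RAPIDS ${RAPIDS_SHORT_VERSION}"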
Currently we have a cuda-version metapackage in conda that can be used in install commands to select the appropriate version of other packages. Ideally such a package would facilitate something like the following:
pip install cudf cuml cuda-version=12.2
with the appropriate cudf/cuml packages being pulled for us. The major caveat with pip vs conda is that we do not have a single package across CUDA versions for a given RAPIDS package, but instead separate packages by major version. That is because wheels do not offer any way of encoding arbitrary extra tag information to distinguish wheels that are identical according to the standard information that wheels support (platform, Python version).
In lieu of the latter being feasible, we would at minimum like to support a package that would serve to constrain our environment and prevent inconsistent installations, e.g.
pip install cudf-cu12
pip install cuml-cu11 # We would want this to fail because we have cudf-cu12
That is a more tractable problem, for which the solution should be achieved by adding the appropriate cuda-version constraint to each *-cu* package.
Python 3.12 was released in October 2023. This issue tracks the work to add support for Python 3.12 to RAPIDS.
In #3, RAPIDS added support for Python 3.11, which was released in RAPIDS 24.04. The work to add Python 3.11 was heavily automated, and that could be done again for Python 3.12 to reduce the load on RAPIDS maintainers.
Typically RAPIDS has kept the matrix of supported Python minor versions to 2 or 3 versions at a time. When adding Python 3.12, we should probably drop Python 3.9 as well.
SPEC 0 recommended dropping support for Python 3.9 in 2023Q4. Meanwhile, NEP 29 recommended dropping support for Python 3.9 as of Apr 05, 2024. Both of these deadlines have passed and several large Python libraries are now moving towards dropping Python 3.9, so it is probably reasonable to drop Python 3.9 around the same time that we add Python 3.12.
Each section should be fully completed before moving to the next section.
Update the CI images (ci-conda, ci-wheel, citestwheel): rapidsai/ci-imgs#137

Branch Strategy:

Create a branch of shared-workflows called python-3.12, and add Python 3.12 to the build and test matrices on the python-3.12 branch.
First, create a checklist for tracking repository migration like the one we used for Python 3.11: #3 (comment)
For each repo:
- Update .github/workflows/*.yaml to point to the python-3.12 branch of shared-workflows
- Update dependencies.yaml to add support for Python 3.12
- Check pyproject.toml files for necessary changes (classifiers, etc.)

Most of this is easy to automate with rapids-reviser, and we can copy from this previous migrator for Python 3.11: https://github.com/rapidsai/rapids-reviser/pull/11. We still need to manually review the PRs for missing pieces.
Once all repos are migrated to the python-3.12 branch, the migration is complete. We merge python-3.12 into the development branch on shared-workflows and then open follow-up PRs to each repo to reset the branches to that development branch. This "reset" is simple and should be automated with rapids-reviser.
Then:
- Update the ci-imgs repo's latest configuration to use Python 3.12: https://github.com/rapidsai/ci-imgs/blob/main/latest.yaml. The latest image is frequently used by CI jobs for building docs and testing notebooks, so be aware that issues may arise in those jobs.
- Update the docker repo's matrix.yaml and matrix-test.yaml
- Update pypi-wheel-scripts so that Python 3.12 wheels are checked: https://github.com/rapidsai/pypi-wheel-scripts/blob/fa1e8744c8ec961a5b5e38ae172ae9c8c51b4280/release/check-wheels.sh#L41-L45

RAPIDS projects are released on the same cadence, mostly using the same versioning scheme, as described in https://docs.rapids.ai/releases/process/.
Given that, the projects tend to have dependencies on other RAPIDS projects from the same release, expressed like this in a pyproject.toml:
[project]
# ...
dependencies = [
"cudf==24.2.*",
# ...
"dask-cuda==24.2.*",
"dask-cudf==24.2.*",
# ...
"pylibraft==24.2.*",
"raft-dask==24.2.*",
"rapids-dask-dependency==24.2.*",
"rmm==24.2.*"
]
When cutting a new release, shell scripts in each repo (by convention, ci/release/update-version.sh) are used to update all such versions to the newest RAPIDS release.
As of this writing, some of those scripts don't account for projects whose names have a -cu{CUDA_MAJOR} suffix, like this:
cudf-cu12==24.2.*
As a result, some dependencies may be missed when beginning a new release cycle.
That should be fixed.
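A hedged sketch of a fix in update-version.sh (the package list and version handling are illustrative):

# Match an optional -cu${CUDA_MAJOR} suffix when bumping RAPIDS dependencies.
NEXT_SHORT_TAG="24.4"
for dep in cudf dask-cudf pylibraft raft-dask rmm; do
  sed -i -E "s/${dep}(-cu[0-9]+)?==[0-9]+\.[0-9]+\.\*/${dep}\1==${NEXT_SHORT_TAG}.*/g" pyproject.toml
done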
This fix should then be rolled out to the branch-24.04 branches of all other RAPIDS repos.

This task is a follow-up to #48. Many of the reasons are the same. The gha-tools repo defines a number of bash scripts that we use in our CI scripts throughout RAPIDS. These tools are automatically installed in the images: https://github.com/rapidsai/ci-imgs/blob/main/ci-conda.Dockerfile#L103. Some of these tools rely on environment variables set in the images, and some developments between the repos must be coordinated. For example, the addition of rapids-configure-sccache involved the simultaneous removal of these variables from CI images to verify that the tool was actually setting the right variables. Like with miniforge-cuda, PR merges in gha-tools trigger a release, which then triggers a rebuild of ci-imgs in order to embed the latest version of the tools. There is a lot of extra process here that we could elide by simply moving the tools into the ci-imgs repository. We could also make it easier to test changes; if the two are in the same repository, then a change to a tool would automatically trigger an image build with the new tools, and we would only need to add a parameter to our shared workflows to enable rerunning a workflow from another repo (e.g. cudf) with the latest images.
Following up from rapidsai/rapids-cmake#534 and rapidsai/cuml#5753.
We should ensure that ctest is called with the flag --no-tests=error. This would help prevent false positives in RAPIDS C++ test suites: without it, a misconfigured suite that discovers zero tests still passes. @KyleFromNVIDIA also proposed adding a pre-commit hook to enforce this.
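For reference, the invocation is just:

# Fail, rather than pass, when zero tests are discovered.
ctest --test-dir build --output-on-failure --no-tests=error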
A new copyright update script was added to https://github.com/rapidsai/pre-commit-hooks, and v0.0.1 was tagged. This check has been deployed in cudf and cuspatial and seems to be working very well. We should deploy it to the rest of RAPIDS too.
Most RAPIDS wheels contain extension modules. However, after #33 we will have a number of pure C++ wheels that contain no Python code at all. We also have a handful of pure Python packages, namely dask_cudf and the wheels in the cugraph repo aside from cugraph and pylibcugraph. Those packages are handled in a somewhat specialized manner in the wheels workflows in order to produce pure Python wheels, but we do not handle this correctly for conda packages, where we still produce a package per minor version of Python. We should address this issue more holistically.
There are two parts to this request:
- Add shared workflows for pure packages, i.e. versions of wheels-build and conda-python-build that only use a single Python version. We already do this manually in a few places (especially in the new jobs added in addressing #33), so the simplest solution I see here is creating workflows that wrap those preexisting workflows but pass in a matrix filter containing a max_by(py_ver). The other thing that we may want to do here is forward along any other information specific to pure wheel builds. One example is the need to specify the RAPIDS_PY_WHEEL_PURE variable for various gha-tools to work correctly. We could set that appropriately in the environment of all jobs using this shared workflow.
- Ensure that pure Python conda packages are built as noarch:python, and that the Python dependency becomes a >=min_version instead of pinning to a specific version (this should automatically be handled if the package is built as noarch:python).
).We would like to start publishing conda packages that support versions of CUDA newer than CUDA 12.0. At the moment, this is blocked on efforts to get the CTK on conda-forge updated to a sufficiently new version. As of this writing, we are currently updating the conda-forge CTK to 12.1.1. Our plan is to continue the cf update process, and whatever the latest version of the CTK is that's available via cf on Jan 8, 2024, we will use that version for building RAPIDS 24.02 packages.
Assuming that #7 is completed before this, the main tasks will be to:
Step 2 above will likely involve making updates to dependency files in various RAPIDS repos.
This issue will be filled out more and updated once the conda-forge updates are completed and the version finalized.
RAPIDS currently makes use of the NumPy C API in a handful of places, generally in Cython code. The NumPy C API is generally quite good and has remained stable, making it easy to work with. However, it does introduce additional build and packaging complexity that would be nice to avoid. With minimal changes to RAPIDS code, we should be able to remove numpy as a build dependency entirely, which may simplify our builds and also saves us from needing to rebuild packages at all when numpy 2.0 is released. If we were getting a lot of value out of the C API the calculus might be different, but in practice our usage of it is very minimal and can generally be avoided. I propose that we expend a little bit of development effort to stop relying on the NumPy C API altogether. This will help us on two fronts: 1) we'll more easily be able to support multiple major versions of NumPy (see #38) since we only have to worry about Python compatibility, not C compatibility; and more importantly 2) we won't have to worry about NumPy C APIs when considering if we can use the Python limited API to produce a single package across Python versions (will open a separate issue for that next). The latter is the more important piece here, since as of this writing the numpy C API is not compatible with the Python limited API based on the author's current experimentation.
The changes required basically boil down to two things:
Historically our conda and wheel GHA workflow scripts have looked fairly different for a number of reasons. However, with #33 many of the fundamental distinctions will no longer exist because wheels will also have separate build steps for C++ and Python builds. As a result, we should invest in aligning our workflows as much as possible so as to reduce maintenance costs going forward. Some changes that we ought to make:
- RAPIDS_PY_WHEEL_NAME: in the PRs for #33 we're currently abusing RAPIDS_PY_WHEEL_NAME to handle the CUDA version, so we need to start adding it for wheels before we can get rid of that variable.
- rapids-download-conda-from-s3 automates choosing the output directory, while rapids-download-wheels-from-s3 requires that the caller specify it. We should update the wheel tool to automate that too.
- Wheel scripts are named build_wheel_*, whereas the conda ones are just build_cpp.sh etc. That is an artifact of a time when conda was our only produced artifact.
- The rapids-wheels-anaconda tool will need to be modified to support upload of cpp wheels.

I will update this list as more ideas come to mind.
Wherever possible, it would be ideal to use https://. Currently we have several cases where http:// is still used.

To address this, I think there are a couple of things that would be helpful:
- Audit our repos for http:// usage
- Check whether https:// works for the URL in question
- Replace http:// with https:// (where the replaced URL works), as sketched below

Related to issue ( #9 )
There are various potential improvements we could make to our CI matrix to improve test coverage of critical components while reducing the overall load. Some possible improvements that have been suggested at various times include:
Currently RAPIDS conda packages pin other RAPIDS packages in recipes using version constraints that are effectively of the form YY.MM.*. In conda, the trailing .* allows nightlies. That makes using the rapidsai and rapidsai-nightly channels in the same conda install/env creation command potentially problematic, and could lead to situations where user install commands result in environments that are technically invalid. This is especially likely to be problematic for rapids-dask-dependency given the high rate of dask changes and the fact that we track the main branch until just before releases, causing potential problems around release time.
With pip packages, our use of nightly packages is in some sense more controlled. pip will only use nightlies if passed the --pre flag on the command line or if the version constraint explicitly includes dev versions, e.g. via a constraint of the form >=YY.MM.00a0. To accommodate this, we set the versions of our packages inside our package build scripts, directly modifying pyproject.toml before invoking pip wheel. Our final releases do not have constraints specified in this way. That behavior affords users some degree of protection. Although users could still break things by manually specifying --pre, the default behavior is safe and it's fair to say users are on their own if they use --pre with release channels. Therefore, mixing the nightly and stable pip indexes is in this sense relatively safer than mixing the nightly and stable conda channels.
We should consider rewriting dependencies in our conda packages to specify constraints in a way that only allows nightly packages to be installed when building nightlies. This could easily be accomplished by using an environment variable that is read in the meta.yaml recipe, by parsing the VERSION file to determine whether the current version corresponds to a nightly build, or by any number of other similar strategies.
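A hedged sketch of the environment-variable approach in a meta.yaml (the variable name and pinnings are illustrative):

# Only admit alpha (nightly) versions when this build is itself a nightly.
{% set is_nightly = environ.get("RAPIDS_NIGHTLY", "0") == "1" %}
requirements:
  run:
    - rmm {{ "24.2.*" if is_nightly else ">=24.2.0,<24.3.0a0" }}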
Currently RAPIDS libraries support static linkage to cudart via a CMake flag, CUDA_STATIC_RUNTIME. This flag is leveraged by wheel builds and by the Spark-RAPIDS JNI (specifically for cudf), but it is not the default. We would like to consider changing that. Using static libcudart has a few advantages:
Given that cudart is small, the typical size concerns around static linking aren't concerning. However, the CUDA libraries (such as the math libraries like cuBLAS) are large, so we don't typically want to statically link those. Furthermore, static linking has the potential to open us up to issues around weak linking and CUDA kernels in the case of header-only libraries (i.e. anything using thrust, or raft). Therefore, before we can move to building statically by default, we should ensure that our libraries are safe to build that way by marking all kernels as static.
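For context, opting in today is just a configure-time switch (the build directory layout is an assumption):

# Existing opt-in used by wheel builds; the proposal is to flip the default.
cmake -S cpp -B cpp/build -DCUDA_STATIC_RUNTIME=ON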
The sysroot* syntax used currently by RAPIDS recipes is getting phased out.

Sample syntax as seen currently in librmm:
recipe/meta.yaml
:
- sysroot_{{ target_platform }} {{ sysroot_version }}
recipe/conda_build_config.yaml
:
sysroot_version:
- "2.17"
The recommendation is to move to {{ stdlib('c') }} (conda-forge/conda-forge.github.io#2102). Changes would look something like this:
recipe/meta.yaml
:
- - sysroot_{{ target_platform }} {{ sysroot_version }}
+ - {{ stdlib('c') }}
recipe/conda_build_config.yaml
:
-sysroot_version:
+c_stdlib_version:
- "2.17"
Raising this issue to track making these changes in RAPIDS
Recently we dropped libnuma from RAPIDS Docker images: rapidsai/miniforge-cuda#22 (comment). The libnuma dependency is added to conda-forge's ucx packages starting in 1.14.0: conda-forge/ucx-split-feedstock#111. To ensure libnuma is available, we should make sure that ucx is 1.14.0 or newer. Some potential places to update:
We currently use miniforge as our minimal conda installation in our CI images. However, we may be able to switch to something even more lightweight, micromamba. This switch would allow us to shrink our images and also simplify our image builds since we currently take whatever base Python version miniforge installed and then upgrade/downgrade depending on the needs of our particular image.
Recently we ran into an issue on a project (cuCIM) where older packages of the project (libcucim & cucim from 23.12) were installed instead of the most recent packages from the PR (24.02.00a*). This made the installation look successful, but old issues that had been fixed in the development branch (branch-24.02) were not getting picked up.
This was ultimately caused by a solver issue. However, we were not able to ascertain that until we pinned the packages installed in the test phase to the latest version. Then the solver issue became clear and we could work to resolve it.
I think we should take a closer look at this issue and come up with a way to guarantee that the cached packages are picked up, as opposed to some other packages. We attempted to do this more directly by using the <channel>::<package> syntax, but this didn't work well with file-based channels. Maybe there is a better way to do this.
We are preparing to add H100 testing to RAPIDS in rapidsai/shared-workflows#194.
So far we have the following test PRs opened:
We should identify any other major repositories that need to be tested with H100s before the shared-workflows PR above is merged.
For arm64 jobs, we currently only run smoke tests, which don't exercise the full test coverage on arm. The following are all the PRs that will enable running the full suite of pytests on arm64 jobs:
rapids-build-backend is a wrapper around standard backends like scikit-build-core and setuptools that handles some of the standard issues that we face for RAPIDS packages (CUDA versioning, alpha versions, etc.). Substituting it into existing RAPIDS Python packages should be fairly painless, but will require some careful testing to verify that nothing is broken.
The backend need not be updated in lockstep across all of RAPIDS for this to work, and merging PRs in any order should be generally safe. However, since there are cases like unified devcontainers where underlying build commands may need an update, it would be best to try and test at least a couple of core packages together to verify that everything works as expected.
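A hedged sketch of the substitution in a package's pyproject.toml (key names should be verified against the rapids-build-backend README):

# The wrapper becomes the declared backend and delegates to the real one.
[build-system]
requires = ["rapids-build-backend", "scikit-build-core"]
build-backend = "rapids_build_backend.build"

[tool.rapids-build-backend]
build-backend = "scikit_build_core.build"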
Currently most RAPIDS C++ libraries produce a single lib*.so that represents the complete output of the C++ library. There are additional conda packages produced for things like tests, benchmarks, and examples, but the core libraries are contained in a single conda package. While this has historically been fine, we are now seeing increased usage of RAPIDS libraries as dependencies of other libraries, both internally (e.g. cugraph-ops and cumlprims_mg are primarily consumed as dependencies of cugraph and cuml, respectively) and externally (raft is being increasingly used by vector dbs, etc.). Moreover, we are seeing the potential for static RAPIDS libraries.
Our current package structure is not well suited to handle all of these uses. The lack of separation between runtime and build-time packages means that build-time dependencies are often propagated unnecessarily, bloating runtime environments and making conda solves more complex than they need to be. Additionally, not having standardized package delineations puts a greater onus on downstream developers to know which packages to include in what parts of the recipes, which in turn often leads to misconfigured recipes down the line that cause additional issues. As our conda environments become more and more complex, having packages configured correctly is critical to reducing the number of issues we run into. Some packages (especially raft) have started to address some of these concerns piecemeal to fix specific use cases, but I think now would be a good time for us to consider adopting a more holistic strategy here.
To better address these diverse use cases, we should migrate all RAPIDS packages to offer a more standardized set of packages.
The most common case will be RAPIDS libraries that produce two different conda packages:
- ${lib}: The base package would only contain the shared library, basically the minimal runtime requirement for any other package that depends on this library. For example, libcuml would have a runtime dependency on libraft because it needs libraft functions at runtime.
- ${lib}-dev: *-dev packages should include everything required to build against the library. ${lib}-dev should include a runtime dependency on ${lib} so that the library can be linked to. It should also include a runtime dependency on anything required to build against the package, since this package will only be installed with the intent of building against the library. In addition, the dev package should include the header files required to compile code that uses the package, as well as any packaging files like CMake config files (for now we don't produce e.g. pkgconfig files, but such things would also go in this package if we did). The ${lib}-dev package should include a run export of ${lib}, which ensures that any package that builds against ${lib}-dev will automatically have ${lib} added to its list of runtime dependencies. Typically Python packages will consume the dev version of the C++ package.

For most libraries, the above two will be sufficient. In cases where RAPIDS libraries also want to offer a static component, we will also want to produce:
- ${lib}-static: Static packages will contain the static library. If a static package exists for a given library, then the corresponding dev package should include a run_constrained specification so that the dev package and the static package require installing consistent versions.

Some RAPIDS libraries are header-only (rmm) or offer a header-only component (raft). This introduces an additional layer of complexity. I do not know if there is a standard for this, so please comment if there is one that we should follow. If not, I would propose the following layout:
- ${lib}-headers: This package should exist only for packages that support header-only usage. It should include all header files and have a runtime dependency on every other package that is required to build against these headers. It should also include CMake config files so that the headers can be found by CMake. If a headers package exists, the dev package should depend on the headers package. In most cases, ${lib}-dev will likely just be a metapackage that pulls in ${lib} and ${lib}-headers. There may need to be some additional CMake files to stitch together the headers with the runtime libs. The headers package should not include a run export on the corresponding lib, since the presumption is that this package should only be pulled in for header-only usage.

raft currently produces an additional package, libraft-headers-only. The purpose of this package is to allow consumers of the raft headers to include and use a limited subset of raft that does not require the CUDA math libraries. I do not think that this is a standard use case that we'll need to support more generally. However, if we were to support this kind of usage, I would probably argue for modifying the package so that libraft-headers-only only contained the headers that are actually consumable without CUDA math libraries. Currently I believe that it includes all headers, so it is the user's responsibility to only use the headers that don't require CUDA libs (or to manually install CUDA libs).
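A hedged sketch of what the split could look like as conda recipe outputs (a hypothetical libexample library; globs and pins are illustrative):

# meta.yaml outputs: runtime library and a dev package that run-exports it.
outputs:
  - name: libexample
    files:
      - lib/libexample.so*
  - name: libexample-dev
    build:
      run_exports:
        - {{ pin_subpackage("libexample", max_pin="x.x") }}
    requirements:
      run:
        - {{ pin_subpackage("libexample", exact=True) }}
    files:
      - include/example/
      - lib/cmake/example/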