enfiskutensykkel / ssd-gpu-dma Goto Github PK

Build userspace NVMe drivers and storage applications with CUDA support

License: BSD 2-Clause "Simplified" License

Makefile 0.20% C 51.20% Cuda 14.20% CMake 3.42% C++ 30.99%

ssd nvme pcie gpudirect-rdma gpu nvm-express disk dax driver cluster-computing

ssd-gpu-dma's Introduction

libnvm: An API for building userspace NVMe drivers and storage applications

This library is a userspace API implemented in C for writing custom NVM Express (NVMe) drivers and high-performance storage applications. The API provides simple semantics and functions which a userspace program can use to control or manage one or more NVMe disk controllers.

The API is in essence similar to SPDK, in that it moves driver code to userspace and relies on hardware polling rather than being interrupt driven. By mapping userspace memory directly, libnvm eliminates the cost of context switching into kernel space and enables zero-copy access from userspace. This greatly reduces the latency of IO operations compared to accessing storage devices through normal file system abstractions provided by the Linux kernel.

libnvm is able to provide a simple low-level block-interface with extremely low latency in the IO path. With minimal driver support, it is possible to set up arbitrary memory mappings to device memory, enabling direct IO between NVMe storage devices and other PCIe devices (PCIe peer-to-peer).

As NVMe is designed in a way that reflects the inherent parallelism in modern computing architectures, we are able to provide a lock-less interface to the disk which can be shared by multiple computing instances. libnvm can be linked with CUDA programs, enabling high-performance storage access directly from your CUDA kernels. This is achieved by placing IO queues and data buffers directly in GPU memory, eliminating the need to involve the CPU in the IO path entirely.

A huge benefit of the parallel design of NVMe combined with the possibility of using arbitrary memory addresses for buffers and queues also means that a disk can be shared concurrently by multiple computing instances. By setting up mappings using a PCIe Non-Transparent Bridge (PCIe NTB), it is possible for multiple PCIe root complexes to share a disk concurrently. The API can be linked with applications using the SISCI SmartIO API from Dolphin Interconnect Solutions, allowing the user to create powerful custom configurations of remote and local devices and NVMe disks in a PCIe cluster. In other words, it enables concurrent low-latency access to NVMe disks from multiple machines in the cluster.

Note for researchers

This library and SmartIO are a part of my PhD dissertation, and a description of it can be found in: Markussen et al. "SmartIO: Zero-overhead Device Sharing through PCIe Networking" ACM Transactions on Computer Systems DOI: https://dl.acm.org/doi/abs/10.1145/3462545

If you use this project in your research, I would appreciate a citation for this publication:

@article{Markussen2021,
author = {Markussen, Jonas and Kristiansen, Lars Bj\o{}rlykke and Halvorsen, P\r{a}l and Kielland-Gyrud, Halvor and Stensland, H\r{a}kon Kvale and Griwodz, Carsten},
title = {SmartIO: Zero-Overhead Device Sharing through PCIe Networking},
year = {2021},
issue_date = {May 2020},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {38},
number = {1–2},
issn = {0734-2071},
url = {https://doi.org/10.1145/3462545},
doi = {10.1145/3462545},
abstract = {The large variety of compute-heavy and data-driven applications accelerate the need for a distributed I/O solution that enables cost-effective scaling of resources between networked hosts. For example, in a cluster system, different machines may have various devices available at different times, but moving workloads to remote units over the network is often costly and introduces large overheads compared to accessing local resources. To facilitate I/O disaggregation and device sharing among hosts connected using Peripheral Component Interconnect Express (PCIe) non-transparent bridges, we present SmartIO. NVMes, GPUs, network adapters, or any other standard PCIe device may be borrowed and accessed directly, as if they were local to the remote machines. We provide capabilities beyond existing disaggregation solutions by combining traditional I/O with distributed shared-memory functionality, allowing devices to become part of the same global address space as cluster applications. Software is entirely removed from the data path, and simultaneous sharing of a device among application processes running on remote hosts is enabled. Our experimental results show that I/O devices can be shared with remote hosts, achieving native PCIe performance. Thus, compared to existing device distribution mechanisms, SmartIO provides more efficient, low-cost resource sharing, increasing the overall system performance.},
journal = {ACM Transactions on Computer Systems},
month = {jul},
articleno = {2},
numpages = {78},
keywords = {Resource sharing, composable infrastructure, I/O disaggregation, NTB, cluster architecture, distributed I/O, NVMe, Device Lending, PCIe, GPU}
}

Quick start

You need a PCIe-attached or M.2 NVMe disk (not the system disk!). If the disk contains any data, you should back this up before proceeding. It is also highly recommended that you read the NVMe specification first, which can be found at the following URL: http://nvmexpress.org/resources/specifications/

Prerequisites and requirements

Please make sure that the following is installed on your system:

A relatively new Linux kernel
CMake 3.1 or newer.
GCC version 5.4.0 or newer. Compiler must support GNU extensions for C99 and linking with POSIX threads is required.

The above is sufficient for building the userspace library and most of the example programs.

For using libnvm with your CUDA programs, you need the following:

An Nvidia GPU capable of GPUDirect RDMA and GPUDirect Async This means either a Quadro or Tesla workstation model using the Kepler architecture or newer.
An architecture that supports PCIe peer-to-peer, for example the Intel Xeon family of processors. This is strictly required if you are using SmartIO or plan on using RDMA.
The FindCUDA package for CMake.
GCC version 5.4.0 or newer. Compiler must be able to compile C++11 and POSIX threads.
CUDA 8.0 or newer with CUDA development toolkit.
Kernel module symbols and headers for your Nvidia driver.

For linking with SISCI API, you additionally need the Dolphin 5.5.0 software base (or newer) with CUDA support and SmartIO enabled.

Disable IOMMU

If you are using CUDA or implementing support for your own custom devices, you need to explicitly disable IOMMU as IOMMU support for peer-to-peer on Linux is a bit flaky at the moment. If you are not relying on peer-to-peer, I would in fact recommend you leaving the IOMMU on for protecting memory from rogue writes.

To check if the IOMMU is on, you can do the following:

$ cat /proc/cmdline | grep iommu

If either iommu=on or intel_iommu=on is found by grep, the IOMMU is enabled.

You can disable it by removing iommu=on and intel_iommu=on from the CMDLINE variable in /etc/default/grub and then reconfiguring GRUB. The next time you reboot, the IOMMU will be disabled.

As soon as peer-to-peer IOMMU support is improved in the Linux API and the Nvidia driver supports it, I will add it to the kernel module.

Using CUDA without SmartIO

If you are going to use CUDA, you also need to locate the kernel module directory and manually run make. Locations will vary on different distros and based on installation type, but on Ubuntu the driver source can be usually found in /usr/src/nvidia-<major>-<major>.<minor> if you install CUDA through the .deb. package.

The CMake configuration is supposed to autodetect the location of CUDA, and the Nvidia driver by looking for a file called Module.symvers in known directories. Make sure that this file is generated. It is also possible to point CMake to the correct location of the driver by specifying the NVIDIA define

Make sure that the output from CMake contains both Using NVIDIA driver found in ... and Configuring kernel module with CUDA.

Building the project

From the project root directory, do the following:

$ mkdir -p build; cd build
$ cmake .. -DCMAKE_BUILD_TYPE=Release # use =Debug for debug build
$ make libnvm                         # builds library
$ make examples                       # builds example programs

The CMake configuration is supposed to autodetect the location of CUDA, Nvidia driver and SISCI library. CUDA is located by the FindCUDA package for CMake, while the location of both the Nvidia driver and SISCI can be manually set by overriding the NVIDIA and DIS defines for CMake (cmake .. -DNVIDIA=/usr/src/... -DDIS=/opt/DIS/`).

After this, you should also compile the libnvm helper kernel module unless you are using SISCI SmartIO. Assuming that you are still standing in the build directory, do the following:

$ cd module; make # only required if not using SISCI SmartIO

If you have disabled the IOMMU, you can run the identify example to verify that your build is working. Find out your disk's PCI BDF by using lspci. In our example, assume that it is 05:00.0.

First unbind the default nvme driver from the disk:

$ echo -n "0000:05:00.0" > /sys/bus/pci/devices/0000\:05\:00.0/driver/unbind

Then run the identify sample (standing in the build directory). It should look something like this:

$ make libnvm && make identify
$ ./bin/nvm-identify-userspace --ctrl=05:00.0
Resetting controller and setting up admin queues...
------------- Controller information -------------
PCI Vendor ID           : 86 80
PCI Subsystem Vendor ID : 86 80
NVM Express version     : 1.2.0
Controller page size    : 4096
Max queue entries       : 256
Serial Number           : BTPY74400DQ5256D
Model Number            : INTEL SSDPEKKW256G7
Firmware revision       :  PSF121C
Max data transfer size  : 131072
Max outstanding commands: 0
Max number of namespaces: 1
--------------------------------------------------

If you are using SISCI SmartIO, you need to use the SmartIO utility program to configure the disk for device sharing.

$ /opt/DIS/sbin/smartio_tool add 05:00.0
$ /opt/DIS/sbin/smartio_tool available 05:00.0
$
$ # Find out the local node identifier
$ /opt/DIS/sbin/dis_config -gn
Card 1 - NodeId:  8
$
$ # Connect to the local node
$ /opt/DIS/sbin/smartio_tool connect 8
$
$ # Find out the device identifier
$ /opt/DIS/sbin/smartio_tool list
80000: Non-Volatile memory controller Intel Corporation Device f1a5 [available]
$
$ # Build library and identify example
$ make libnvm && make identify-smartio
$
$ ./bin/nvm-identify --ctrl=0x80000  # use the device id
Resetting controller and setting up admin queues...
------------- Controller information -------------
PCI Vendor ID           : 86 80
PCI Subsystem Vendor ID : 86 80
NVM Express version     : 1.2.0
Controller page size    : 4096
Max queue entries       : 256
Serial Number           : BTPY74400DQ5256D
Model Number            : INTEL SSDPEKKW256G7
Firmware revision       :  PSF121C
Max data transfer size  : 131072
Max outstanding commands: 0
Max number of namespaces: 1
Current number of CQs   : 8
Current number of SQs   : 8
--------------------------------------------------

Using the libnvm helper kernel module

If you are not using SISCI SmartIO, you must use the project's kernel module in order to map GPU memory for the NVMe disk. Currently the only version of Linux tested is Linux 4.11.0. Other versions may work, but you probably have to change the call to get_user_pages() as well as any calls to the DMA API.

Repeating the requirements from the section above, you should make sure that you use a processor that supports PCIe peer-to-peer, and that you have a GPU with GPUDirect support. Remember to disable the IOMMU. If you are not using CUDA (or any other third-party stuff), it is recommended that you leave the IOMMU on.

Loading and unloading the driver is done as follows:

$ cd build/module
$ make
$ make load     # will insert the kernel module
$ make unload   # unloads the kernel module

You want to unload the default nvme driver for the NVMe disk, and bind the helper driver to it:

$ echo -n "0000:05:00.0" > /sys/bus/pci/devices/0000\:05\:00.0/driver/unbind
$ echo -n "0000:05:00.0" > /sys/bus/pci/drivers/libnvm\ helper/bind

After doing this, the file /dev/libnvm0 should show up, representing the disk's BAR0.

All CMake build settings

Settings can be passed to CMake using the -Dsetting=value flag. Here is a comprehensive list of settings that can be overridden.

Setting	Default	Explanation
`CMAKE_BUILD_TYPE`	`Debug`	Set to `Release` to make a release build
`DIS`	`/opt/DIS`	Override the Dolphin installation path
`NVIDIA`		Override path to Nvidia driver
`nvidia_archs`	30;50;60;61;70	Specify compute modes and SMs
`no_smartio`	`false`	Don't build API with SmartIO support
`no_module`	`false`	Don't build kernel module
`no_cuda`	`false`	Don't build API with CUDA support
`no_smartio_samples`	`false`	Don't build SmartIO samples
`no_smartio_benchmarks`	`false`	Don't build SmartIO benchmarks

Non-Volatile Memory Express (NVMe)

NVMe is a software specification for disk controllers (drives) that provides storage on non-volatile media, for example flash memory or Intel's 3D XPoint.

The specification is designed in a way that reflects the parallelism in modern CPU architectures: a controller can support up to 2^16 - 1 IO queues with up to 64K outstanding commands per queue. It does not require any register reads in the command or completion path, and it requires a maximum of a 32-bit register write in the command submission path to a dedicated register.

The specification assumes an underlying bus interface that conforms to PCIe.

NVM Namespaces

A namespace is a quantity of non-volatile memory that may be formatted into logical blocks. A NVMe controller may support multiple namespaces. Many controllers may attach the same namespace. In many ways, a namespace can be regarded as an abstraction of traditional disk partitions.

Queue pairs and doorbells

NVMe is based on a paired submission and completiong queue mechanism. The software will enqueue commands on the submission queue (SQ), and completions are posted by the controller to the associated completion queue (CQ). Multiple SQs may use the same CQ, and queues are allocated in system memory. In other words, there are an N:M mapping of SQs and CQs.

Typically the number of command queues are based on the number of CPU cores. For example, on a four core processor, there may be a queue pair per core to avoid locking and ensure that commands are local to the appropriate processors' cache.

A SQ is a ring buffer with a fixed slot size that software uses to submit commands for execution by the controller. After the command structure is updated in memory, the software updates the appropriate SQ tail doorbell register with the number of commands to execute. The controller fetches the SQ entries in order from the SQ, but may execute them in an arbitrary order. Each entry in the SQ is a command. Commands are 64 bytes in size.

An admin submission queue (ASQ) and completion queue (ACQ) exists for the purpose of controller management and control. There is a dedicated command set for admin commands.

Physical Region Pages and Scatter-Gather Lists

Nvidia GPUDirect

Programs intended for running on GPUs or other computing accelerators that support Remote DMA (RDMA), can use this library to enable direct disk access from the accelerators. Currently, the library supports setting up mappings for GPUDirect-capable Nvidia GPUs.

PCIe NTBs and Dolphin SmartIO

Now run the latency benchmark with the specified controller and for 1000 blocks:

$ ./bin/nvm-latency-bench --ctrl=0x80000 --blocks=1000 --pattern=sequential
Resetting controller...
Queue #01 remote qd=32 blocks=1000 offset=0 pattern=sequential (4 commands)
Creating buffer (125 pages)...
Running benchmark...
Queue #01 total-blocks=1000000 count=1000 min=531.366 avg=534.049 max=541.388
	0.99:        540.287
	0.97:        539.424
	0.95:        538.568
	0.90:        535.031
	0.75:        534.377
	0.50:        534.046
	0.25:        533.030
	0.05:        532.025
	0.01:        531.859
OK!

You can also compare this with the performance of the disk locally:

$ ./bin/nvm-latency-bench --ctrl=0x80000 --blocks=1000 --pattern=sequential
Resetting controller...
Queue #01 remote qd=32 blocks=1000 offset=0 pattern=sequential (4 commands)
Creating buffer (125 pages)...
Running benchmark...
Queue #01 total-blocks=1000000 count=1000 min=536.117 avg=541.190 max=549.240
	0.99:        543.080
	0.97:        542.053
	0.95:        541.825
	0.90:        541.677
	0.75:        541.507
	0.50:        541.346
	0.25:        541.152
	0.05:        539.600
	0.01:        539.351
OK!

Note that in this configuration, reads actually have lower latency for the remote run than for the local run.

API overview

Scope and limitations of `libnvm`

Types

nvm_ctrl_t: This is the controller reference type. Holds basic information about a controller and a memory map of its doorbell registers.
nvm_dma_t: DMA descriptor. This is a convenience type for describing memory regions that are mapped for a controller.
nvm_queue_t: Queue descriptor. Used to keep state about I/O queues. Note that the same type is used to represent submission queues (SQs) and completion queues (CQs).
nvm_cmd_t: Definition of an NVM IO command (SQ entry).
nvm_cpl_t: Definition of an NVM IO completion (CQ entry).
nvm_aq_ref: This is a reference to the controller's admin queue-pair. Used for RPC-like calls to the process that "owns" the admin queue-pair.

Header files

nvm_types.h contains type definitions for the most commonly used types. The most interesting types are:
nvm_ctrl.h contains functions for creating and releasing a controller reference. It also contains functions for resetting a controller.
nvm_dma.h has helper functions for creating DMA buffer descriptors aligned to controller pages. It also has functions for creating mappings to memory for the controller.
nvm_aq.h contains the necessary functions for setting up an admin queue-pair and creating a reference to this.
nvm_rpc.h contains functions for binding an admin queue-pair reference to the actual (remote) admin queue-pair.
nvm_queue.h consists of "header-only" functions for enqueuing and submitting I/O commands as well as polling for completions.
nvm_cmd.h contains helper functions for building NVM IO commands.
nvm_admin.h consists of a series of convenience functions for common admin commands, such as reserving IO queues and retrieving controller and namespace information.
nvm_util.h is a bunch of convenience macros.
nvm_error.h deals with packing and unpacking error information. Also contains a function similar to strerror() to retrieve a human readable error description.

Kernel module

Typical mode of operation

Please refer to section 7 of the NVM Express specification.

ssd-gpu-dma's People

Contributors

Stargazers

Watchers

Forkers

scape1989 dolphinics johnjohnsp1 allenyllee zhaojp-frank wskwak patstew cooldavid xunliu lazerhawk sashakot francisxuguoq debojyoti-majumder joeywang1977 jasoncwik muthuramanecs03g wty715 lineagech xzman briandbl potatogim kevinwu2017 wangwenjie123 laike23 hnu-researches hansonchar kkndyu extremeviscent zhouhaocomeon1 moonsam setlion7 eric2947 matt-x86 xuehuayulin angletzlh freemanofchina qq502233945 gaowayne clicx kqingcan keshavaspanda yiluoe

ssd-gpu-dma's Issues

nvm-cuda-bench infintiely waiting for IO completion

So I am running the cuda benchmark with different combinations of number of pages, chunks, and threads, and for some of the combinations of the options the GPU thread waiting for IO completion keeps waiting forever on the first chunk and this seems very random in terms of which combos work and which don't. Below are some for the simple combos I tried.

Pages	Chunks	Threads	Status
1	1	1	Works
1	1	2	Works
1	1	4	Works
1	1	8	Hangs
1	2	1	Works
1	2	2	Works
1	2	4	Hangs
1	2	8	Works
1	4	1	Works
1	4	2	Hangs
1	4	4	Works
1	4	8	Works
1	8	1	Hangs
1	8	2	Works
1	8	4	Works
1	8	8	Works
2	1	1	Works
2	1	2	Works
2	1	4	Hangs
2	1	8	Works
2	2	1	Works
2	2	2	Hangs
2	2	4	Works
2	2	8	Works
2	4	1	Hangs
2	4	2	Works
2	4	4	Works
2	4	8	Works
2	8	1	Works
2	8	2	Works
2	8	4	Works
2	8	8	Works
1	1	5	Hangs
1	2	5	Hangs
2	1	5	Hangs
2	2	5	Works

I know the last 4 configs is using 5 threads which isn't a power of 2, but I don't see a problem with the provided code running with any thread count as long as the __syncthreads can synchronize the threads in the thread block and there are enough entries in the NVMe queues for each thread. I have changed the settings file to allow this.

So I am not understanding why certain configs hang and why others don't.
The only changes I have done to the code in your repo is change the settings file for the cuda benchmark to remove the restriction of threads being a power of 2 and removing the +1 from the computation fo max entries from the MQES register.

nvm-cuda-bench failed as "an illegal memory access was encountered"

Intel(R) Xeon(R) Silver 4314 CPU @ 2.40GHz
Supermicro X12DPi-N6
NVIDIA RTX A2000
Samsung 980 Pro nvme
Ubuntu 20.04.5 / 5.4.0-135-generic / cuda_12.0.0_525.60.13_linux

$ cmake .. -DCMAKE_BUILD_TYPE=Debug -Dnvidia_archs="86"
$ make identify module cuda-benchmark
$ sudo rmmod nvme
$ sudo make -C module load

$ deviceQuery
......
Device 0: "NVIDIA RTX A2000"
CUDA Driver Version / Runtime Version 12.0 / 11.8
CUDA Capability Major/Minor version number: 8.6
......

$ sudo ./bin/nvm-identify --ctrl=/dev/libnvm0
Resetting controller and setting up admin queues...
------------- Controller information -------------
PCI Vendor ID : 4d 14
PCI Subsystem Vendor ID : 4d 14
NVM Express version : 1.3.0
Controller page size : 4096
Max queue entries : 16384
Serial Number : S5GXNG0N905360M
Model Number : Samsung SSD 980 PRO 1TB
Firmware revision : 5B2QGXA7
Max data transfer size : 524288
Max outstanding commands: 256
Max number of namespaces: 1
Current number of CQs : 129
Current number of SQs : 129

When run
$ sudo ./bin/nvm-cuda-bench --ctrl=/dev/libnvm0
CUDA device : 0 NVIDIA RTX A2000 (0000:98:00.0)
Controller page size : 4096 B
Namespace block size : 512 B
Number of threads : 32
Chunks per thread : 32
Pages per chunk : 1
Total number of pages : 1024
Total number of blocks: 8192
Double buffering : no
Unexpected error: Unexpected CUDA error: an illegal memory access was encountered

$ dmesg
[ 484.710982] NVRM: Xid (PCI:0000:98:00): 13, pid='', name=, Graphics SM Warp Exception on (GPC 2, TPC 0, SM 0): Out Of Range Address
[ 484.710999] NVRM: Xid (PCI:0000:98:00): 13, pid='', name=, Graphics SM Global Exception on (GPC 2, TPC 0, SM 0): Multiple Warp Errors
[ 484.711014] NVRM: Xid (PCI:0000:98:00): 13, pid='', name=, Graphics Exception: ESR 0x514730=0x201000e 0x514734=0x24 0x514728=0xc81eb60 0x51472c=0x1174
[ 484.711584] NVRM: Xid (PCI:0000:98:00): 43, pid=2037, name=nvm-cuda-bench, Ch 00000008

And if run under compute-sanitizer
$ sudo /usr/local/cuda/bin/compute-sanitizer ./bin/nvm-cuda-bench --ctrl=/dev/libnvm0
========= COMPUTE-SANITIZER
CUDA device : 0 NVIDIA RTX A2000 (0000:98:00.0)
Controller page size : 4096 B
Namespace block size : 512 B
Number of threads : 32
Chunks per thread : 32
Pages per chunk : 1
Total number of pages : 1024
Total number of blocks: 8192
Double buffering : no
========= Invalid __local__ write of size 16 bytes
========= at 0x3e0 in readSingleBuffered(QueuePair *, unsigned long, void *, void *, unsigned long, unsigned long, unsigned long *, CmdTime *)
========= by thread (0,0,0) in block (0,0,0)
========= Address 0xfffcd0 is out of bounds
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame: [0x302a52]
========= in /lib/x86_64-linux-gnu/libcuda.so.1
========= Host Frame:__cudart798 [0x30e0b]
========= in /home/pc10/ssd-gpu-dma/build/./bin/nvm-cuda-bench
========= Host Frame:cudaLaunchKernel [0x8cd0b]
========= in /home/pc10/ssd-gpu-dma/build/./bin/nvm-cuda-bench
========= Host Frame:/usr/local/cuda/include/cuda_runtime.h:216:cudaError cudaLaunchKernel(char const*, dim3, dim3, void**, unsigned long, CUstream_st*) [0x1fd21]
========= in /home/pc10/ssd-gpu-dma/build/./bin/nvm-cuda-bench
========= Host Frame:/tmp/tmpxft_00002ae3_00000000-6_main.cudafe1.stub.c:1:__device_stub__Z18readSingleBufferedP9QueuePairmPvS1_mmPmP7CmdTime(QueuePair*, unsigned long, void*, void*, unsigned long, unsigned long, unsigned long*, CmdTime*) [0x1fab2]
========= in /home/pc10/ssd-gpu-dma/build/./bin/nvm-cuda-bench
========= Host Frame:/home/pc10/ssd-gpu-dma/benchmarks/cuda/main.cu:306:readSingleBuffered(QueuePair*, unsigned long, void*, void*, unsigned long, unsigned long, unsigned long*, CmdTime*) [0x1fb2a]
========= in /home/pc10/ssd-gpu-dma/build/./bin/nvm-cuda-bench
========= Host Frame:/home/pc10/ssd-gpu-dma/benchmarks/cuda/main.cu:450:launchNvmKernel(Controller const&, std::shared_ptr, Settings const&, cudaDeviceProp const&) [0x1dd7f]
========= in /home/pc10/ssd-gpu-dma/build/./bin/nvm-cuda-bench
========= Host Frame:/home/pc10/ssd-gpu-dma/benchmarks/cuda/main.cu:698:main [0x1ee3a]
========= in /home/pc10/ssd-gpu-dma/build/./bin/nvm-cuda-bench
========= Host Frame:../csu/libc-start.c:342:__libc_start_main [0x24083]
========= in /lib/x86_64-linux-gnu/libc.so.6
========= Host Frame:_start [0x1bf8e]
========= in /home/pc10/ssd-gpu-dma/build/./bin/nvm-cuda-bench
<...... Same trace from GPU thread 1 to 31 .......>

========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaDeviceSynchronize.
<...... host backtrace omitted ......>
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaEventDestroy.
<...... host backtrace omitted ......>
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaEventDestroy.
<...... host backtrace omitted ......>
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaFree.
<...... host backtrace omitted ......>
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaFree.
<...... host backtrace omitted ......>
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaFree.
<...... host backtrace omitted ......>
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaFree.
<...... host backtrace omitted ......>
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaHostUnregister.
<...... host backtrace omitted ......>
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaFree.
<...... host backtrace omitted ......>
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaFreeHost.
<...... host backtrace omitted ......>

Unexpected error: Unexpected CUDA error: unspecified launch failure
========= Target application returned an error
========= ERROR SUMMARY: 42 errors

Unexpected error: Unexpected CUDA error: an illegal memory access was encountered

Hi,

I use ubuntu 18.04 with V100 GPU.

I ran the benchmark 'nvm-cuda-bench -c /dev/libnvm0', but got the error message "Unexpected error: Unexpected CUDA error: an illegal memory access was encountered".
And, the dmesg also shows the following also messages,

[263169.171738] Adding controller device: 88:00.0
[263169.172098] Character device /dev/libnvm0 created (504.0)
[263169.172185] libnvm helper loaded
[263209.858820] Mapping for address 7ff72da00000 not found
[263255.876777] NVRM: Xid (PCI:0000:1b:00): 13, Graphics SM Warp Exception on (GPC 2, TPC 0, SM 0): Out Of Range Address
[263255.876795] NVRM: Xid (PCI:0000:1b:00): 13, Graphics Exception: ESR 0x514730=0x201000e 0x514734=0x20 0x514728=0x4c1eb72 0x51472c=0x174
[263255.877633] NVRM: Xid (PCI:0000:1b:00): 43, Ch 00000030

It seems the GPU cannot access the device registers of NVME, is that true? And, do you know how to solve it?

Does --verify option works?

Hi, I'm a student interested in GPU data processing.

I appreciate your great effort for implementing RDMA using GPU Direct. It is very useful for studying that research area.

By the way, while trying to run the project, I have met some problems.

The current case is:
Trying to send data from SSD to GPU directly, without using smartio
Below is my make command line
cmake .. -DCMAKE_BUILD_TYPE=Release -Dno_smartio=true -Dno_smartio_samples=true -Dno_smartio_benchmarks=true

And I tried to run nvm-latency-benchmark program with below options

./nvm-latency-bench --input test.in --verify --ctrl=/dev/libnvm0 --blocks 1000 --count 1 --iterations=1000 --queue 'no=1,depth=1'
./nvm-latency-bench --input test.in --verify --ctrl=/dev/libnvm0 --blocks 1000 --count 1 --iterations=1000 --queue 'no=1,depth=1 --gpu 0'

When I run the program, both shows
Verifying buffers... FAIL
Unexpected runtime error: Memory buffer differ from file content

Can you please give some advice?

Thank you.

Floating Point Exception

I am trying to run the example nvm-identify but I get the following output:
Resetting controller and setting up admin queues...
Floating point exception

The dmesg output is this:
[May24 16:40] traps: nvm-identify[3179] trap divide error ip:7f6d2f98a434 sp:7ffd9a74e3b0 error:0 in libnvm.so[7f6d2f985000+9000]
I am not sure what is going on. Any help would be appreciated.

Issue when using the cuda example/benchmark

I have been successful in run the nvm-latency-bench without GPU. The output of that is as follows:

./bin/nvm-latency-bench --ctrl=/dev/libnvm0 --blocks=1000  --queue="no=128,location=local" --bw

Resetting controller... DONE
Preparing queues... DONE
Preparing buffers and transfer lists... DONE
Running bandwidth benchmark (reading, sequential, 1000 iterations)... DONE
Calculating percentiles...
Queue #128 read percentiles (1000 samples)
            bandwidth,       adj iops,    cmd latency,    prp latency
  max:       2118.074,     517108.001,         67.191,          2.150
 0.99:       2107.156,     514442.488,         65.464,          2.095
 0.97:       2102.182,     513227.943,         64.984,          2.079
 0.95:       2097.901,     512182.780,         64.776,          2.073
 0.90:       2093.795,     511180.541,         64.481,          2.063
 0.75:       2084.105,     508814.706,         63.536,          2.033
 0.50:       2070.331,     505451.803,         61.828,          1.978
 0.25:       2014.709,     491872.302,         61.419,          1.965
 0.10:       1985.443,     484727.223,         61.136,          1.956
 0.05:       1976.456,     482533.263,         61.015,          1.952
 0.01:       1957.190,     477829.660,         60.771,          1.945
  min:       1905.024,     465093.782,         60.432,          1.934
End percentiles
OK!

But when I try to run it with a GPU or run the nvm-cuda-bench binary, I get an error saying the following: "Unexpected error: Failed to map device memory: Invalid argument"

./bin/nvm-cuda-bench --ctrl=/dev/libnvm0

CUDA device           : 0 Tesla V100-PCIE-16GB (0000:07:00.0)
Controller page size  : 4096 B
Namespace block size  : 512 B
Number of threads     : 32
Chunks per thread     : 32
Pages per chunk       : 1
Total number of pages : 1024
Total number of blocks: 8192
Double buffering      : no
Unexpected error: Failed to map device memory: Invalid argument

Make kernel TCP version of RPC

Investigate DAX support

Kernel module doesn't clean up resources if program crashes or doesn't release handles

Investigate P2P support in the Linux DMA API for latest kernel release

It appears that some efforts have been made recently to make IOMMU and PCIe peer-to-peer (P2P) work on x86. Need to test if the latest kernel properly supports dma_map_resource() and/or if nvidia_p2p_map_pages() returns valid IO addresses.

Incorrect use of DMA API

I think the kernel module is technically using the DMA API incorrectly. dma_map_page and dma_map_single are supposed to be for 'streaming DMA' where you write to the buffer in userspace, then dma_map it, then hand it over to the device. Otherwise data in the CPU cache can be missed. I think dma_alloc_coherent is the correct thing to use in cases like this where the buffer contents are changed repeatedly. I suspect that x86_64 doesn't have a problem because the cache is coherent anyway, but it's a problem for me on 32-bit ARM. So it probably won't affect 99% of users, but I thought I'd report it in case anyone else has problems.
I've worked around it by using another memory allocation system I have on my platform.

ROCm support ?

How hard would it be to add ROCm support ?

Merge into master

Does CQ and SQ memory need to be contiguous

Can I create separate DMA regions for the CQ memory and SQ memory? Is it supported? Or do the SQ pages have to be right after the CQ pages?
I want to create separate memory regions to make sure I am not reading/writing in the wrong memory location.
When you create the queue and prp list memory region in benchmarks/cuda/queue.c, why do you make the alignment (1UL << 16) [65536] and not controller page size?

Sperating SQ, CQ, and PRP List Memories

So in ./benchmarks/cuda/queue.cu, I am trying to use separate allocations and DMA regions for the SQ, CQ, and PRP List. by doing something like the following:

__host__ DmaPtr prepareQueuePair(QueuePair& qp, const Controller& ctrl, const Settings& settings, uint16_t id)
{
    printf("Creating QP %u\n", (unsigned int) id);
    //size_t queueMemSize = 1024 * sizeof(nvm_cmd_t) + 1024 * sizeof(nvm_cpl_t);
    size_t sq_size = 1024 * sizeof(nvm_cmd_t);
    size_t cq_size = 1024 * sizeof(nvm_cpl_t);
    
    size_t prpListSize = ctrl.info.page_size * settings.numThreads * (settings.doubleBuffered + 1);

   
    auto sq_mem = createDma(ctrl.ctrl, NVM_PAGE_ALIGN(sq_size, 1UL << 16), settings.cudaDevice, settings.adapter, settings.segmentId);
    auto cq_mem = createDma(ctrl.ctrl, NVM_PAGE_ALIGN(cq_size, 1UL << 16), settings.cudaDevice, settings.adapter, settings.segmentId);
    auto prp_list_mem = createDma(ctrl.ctrl, NVM_PAGE_ALIGN(prpListSize, 1UL << 16), settings.cudaDevice, settings.adapter, settings.segmentId);
    // Set members
    qp.pageSize = ctrl.info.page_size;
    qp.blockSize = ctrl.ns.lba_data_size;
    qp.nvmNamespace = ctrl.ns.ns_id;
    qp.pagesPerChunk = settings.numPages;
    qp.doubleBuffered = settings.doubleBuffered;
    
    qp.prpList = NVM_DMA_OFFSET(prp_list_mem, 0);
    qp.prpListIoAddr = prp_list_mem->ioaddrs[0];
    
    // Create completion queue
    int status = nvm_admin_cq_create(ctrl.aq_ref, &qp.cq, id, cq_mem->vaddr, cq_mem->ioaddrs[0]);
    if (!nvm_ok(status))
    {
        throw error(string("Failed to create completion queue: ") + nvm_strerror(status));
    }
    printf("CQ MAX_ENTRIES: %u\n", (unsigned int) qp.cq.max_entries);
    // Get a valid device pointer for CQ doorbell
    void* devicePtr = nullptr;
    cudaError_t err = cudaHostGetDevicePointer(&devicePtr, (void*) qp.cq.db, 0);
    if (err != cudaSuccess)
    {
        throw error(string("Failed to get device pointer") + cudaGetErrorString(err));
    }
    qp.cq.db = (volatile uint32_t*) devicePtr;

    // Create submission queue
    status = nvm_admin_sq_create(ctrl.aq_ref, &qp.sq, &qp.cq, id, NVM_DMA_OFFSET(sq_mem, 0), sq_mem->ioaddrs[0]);
    if (!nvm_ok(status))
    {
        throw error(string("Failed to create submission queue: ") + nvm_strerror(status));
    }
    printf("SQ MAX_ENTRIES: %u\n", (unsigned int) qp.sq.max_entries);
    // Get a valid device pointer for SQ doorbell
    err = cudaHostGetDevicePointer(&devicePtr, (void*) qp.sq.db, 0);
    if (err != cudaSuccess)
    {
        throw error(string("Failed to get device pointer") + cudaGetErrorString(err));
    }
    qp.sq.db = (volatile uint32_t*) devicePtr;

    return NULL;
}

All of these allocations seem to be fine.
However, when the GPU threads try to write to the Submission queue entry in prepareChunk with *cmd = local; I get threads accessing illegal memory addresses when they try to write the last 4 bytes of the 64 byte command entry. Am I doing something stupid? I have already tested 1024 entries in the command and completion queue using the original code so I know that part is fine. I just want to separate the memories for the 2 queues just so I avoid any errors.

Change benchmark statistics to per command for nvm-latency-bench

Build and Binding the helper driver

Hi,

Trying to run the CUDA benchmark. Successfully build project with CUDA support. No IOMMU support

root@labuser-pc:/home/labuser# cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-4.15.0-29-generic root=UUID=1d724a7d-a2bf-4b8d-b79f-4419bbedd509 ro quiet splash vt.handoff=7

But when try to load the helper driver: can't do the second step. With first step I can see unbind happening.

$ echo -n "0000:05:00.0" > /sys/bus/pci/devices/0000\:05\:00.0/driver/unbind
$ echo -n "0000:05:00.0" > /sys/bus/pci/drivers/disnvm/bind

Don't see "disnvm". Here is the output of the drivers folder. Trying these on Ubuntu 16.04. Need help to setup the driver

root@labuser-pc:/home/labuser# /sys/bus/pci/drivers/
agpgart-intel/    imsttfb/          nvidia-nvswitch/  skx_uncore/
agpgart-via/      ioatdma/          nvme/             snd_hda_intel/
ahci/             iosf_mbi_pci/     ohci-pci/         uhci_hcd/
asiliantfb/       ipmi_si/          parport_pc/       virtio-pci/
ata_generic/      libnvm helper/    pata_sis/         xen-platform-pci/
ata_piix/         lpc_ich/          pcieport/         xhci_hcd/
ehci-pci/         mei_me/           serial/           
i40e/             nvidia/           shpchp/

Regards,
MJay

Cmake output saying 'Configuring kernel module without CUDA'

Hi;

I have a Jetson Xavier AGX kit board and I plugged into the M.2 key M an NVMe SSD. Now, I'm trying to install your libnm on my Xavier and I show the following message in CMake output:

-- Found CUDA: /usr/local/cuda-10.0 (found suitable version "10.0", minimum required is "8.0")
-- Using NVIDIA driver found in
-- Configuring kernel module without CUDA
-- Configuring done
-- Generating done
-- Build files have been written to: /home/ganapathi/Downloads/ssd-gpu-dma-master/build

How can I force Cmake to build with CUDA?

Thank

Can not find "nvm-latency-bench" in build/bin

Hi,

I want to run the latency benchmark with the specified controller and for 1000 blocks as your instruction, but I found out that there is no "nvm-latency-bench" in build/bin to run the command
$ ./bin/nvm-latency-bench --ctrl=0x80000 --blocks=1000 --pattern=sequential

I dont know what happened with "make libray"????

Write benchmark program that memory maps a file and uses cudaMemcpy to copy into GPU memory

nvm-identify run error

Hi:
I have build libnvm with:
cmake .. -DNVIDIA=/usr/src/nvidia-440.33.01
make identify
then run nvm-identify binary as follows:
1) when unbind device :
echo "0000:19:00.0" > /sys/bus/pci/devices/0000:19:00.0/driver/unbind
then run cmd with: ./bin/nvm-identify --ctrl=/dev/nvme0n1
with error: Failed to open descriptor: No such file or directory
2) when bind device:
run cmd with: ./bin/nvm-identify --ctrl=/dev/nvme0n1
with error: [map_memory] Page mapping kernel request failed: Inappropriate ioctl for device
what need i to do to fix this problem?

Issue with multiple queues for latency benchmark

For the latency benchmark, there is an issue with using multiple queues causing the consumer threads to hang forever waiting for completions. This happens when using an Intel Optane 900P PCIe disk.

Invalid NSID

I am running the cuda benchmark from your codebase, with the following output for the controller and command line configuration:

Controller page size  : 4096 B
Namespace block size  : 4096 B
Number of threads     : 1
Chunks per thread     : 1
Pages per chunk       : 5
Total number of pages : 5
Total number of blocks: 5
Double buffering      : no

The problem is the thread never finishes polling for the first chunk. So I exit out, reload the regular nvme driver and check the device's error log.
When I check the device's error log, I see the following entry for each time I try to run the benchmark:

sqid         : 1
cmdid        : 0
status_field : 0x4016(INVALID_NS)
parm_err_loc : 0xffff
lba          : 0
nsid         : 0x1
vs           : 0

The nvme ssd has only 1 namespace (NSID: 1) and its the one being used for all commands in the codebase. So what could be the issue? Any help in this matter will be appreciated.