
Comments (16)

enfiskutensykkel commented on May 21, 2024

Hi Zaid,

This probably means that there is a division by zero somewhere. I'm not really sure where that would be. Would it be possible for you to add some print statements in examples/identify/module.c and examples/identify/common.c to see how far it gets before the bug occurs, and ideally which library call fails? I'm wondering whether the error occurs in the reset_ctrl function (in common.c) or in the identify_ctrl function (also in common.c).
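For example, something as simple as a trace macro pasted at the top of both files (just a sketch, not code that exists in the repository) would do:

    #include <stdio.h>

    /* Print the current location; sprinkle TRACE() before and after the
     * library calls around reset_ctrl() and identify_ctrl(). */
    #define TRACE() fprintf(stderr, "reached %s:%d (%s)\n", __FILE__, __LINE__, __func__)

The last line printed before the crash should then point at the failing call.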

Thanks

ZaidQureshi commented on May 21, 2024

I did some debugging, and it seems the problem is on line 51 of include/nvm_queue.h, which contains the following check:
if ((uint16_t) ((sq->tail - sq->head) % sq->max_entries) == sq->max_entries - 1)
So for some reason the tail, head, and max_entries of the admin queue are all 0.
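With max_entries equal to 0, that modulo is an integer division by zero, which would explain the crash. A trivial standalone reproduction (not the library code, just the same expression shape):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* volatile so the compiler cannot fold the division away */
        volatile uint16_t tail = 0, head = 0, max_entries = 0;

        /* same shape as the check in nvm_queue.h: modulo by zero raises SIGFPE on x86 */
        uint16_t full = (uint16_t) ((tail - head) % max_entries);

        printf("%u\n", full);
        return 0;
    }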

enfiskutensykkel commented on May 21, 2024

Thank you for your effort.

I suspect that this may be because it is not able to read controller registers properly. I assume you are using the kernel module version (and not the SmartIO version)?

In examples/identify/module.c, could you print out the members of ctrl directly after the call to nvm_ctrl_init, particularly the values of ctrl->page_size and ctrl->max_entries?
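Something along these lines should do (a sketch; the casts are just to keep the format specifiers safe regardless of the exact field types):

    /* Dump the controller handle right after nvm_ctrl_init() returns successfully. */
    printf("page_size   = %zu\n", (size_t) ctrl->page_size);
    printf("dstrd       = %u\n",  (unsigned) ctrl->dstrd);
    printf("timeout     = %u\n",  (unsigned) ctrl->timeout);
    printf("max_entries = %zu\n", (size_t) ctrl->max_entries);
    printf("mm_size     = %zu\n", (size_t) ctrl->mm_size);
    printf("mm_ptr      = %p\n",  (void *) ctrl->mm_ptr);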

ZaidQureshi commented on May 21, 2024

These are the values for each of the members of ctrl after the nvm_ctrl_init call:

CTRL->DSTRD: 0
CTRL->timeout: 30000
CTRL->MAX_ENTRIES: 0
CTRL->MM_SIZE: 8192
CTRL->mm_ptr: 0x7f511d31f000

enfiskutensykkel commented on May 21, 2024

Thank you again.

This is a bit strange: it appears that it managed to read the time-out value from the registers, but the maximum number of entries has become 0. I may be doing something wrong in src/ctrl.c, or maybe I have interpreted the spec wrong. Some fields are defined as zero-based (a value of 0 means 1), so I need to check whether that is the case here.

Out of curiosity, what kind of NVMe controller/disk is this?

ZaidQureshi commented on May 21, 2024

This is a Corsair NX500 SSD.

ZaidQureshi commented on May 21, 2024

So the CAP.MQES register value read in initialize_handle in src/ctrl.c is 65535 for me. When we add 1, it overflows and wraps to 0. Now the question is why it is 65535.
When I remove the plus 1, I get valid values for the admin queue's max entries, and the controller information is returned correctly.
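The wrap-around is easy to reproduce in isolation:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint16_t mqes = 0xffff;           /* CAP.MQES as reported by this controller */
        uint16_t max_entries = mqes + 1;  /* 65536 does not fit in 16 bits: wraps to 0 */

        printf("mqes = %u, max_entries = %u\n", mqes, max_entries);  /* prints 65535, 0 */
        return 0;
    }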

enfiskutensykkel commented on May 21, 2024

The reason I add 1 is that MQES is, according to the spec:

"This is a 0's based value. The minimum value is 1h, indicating two entries."

Meaning that a value of 0 is illegal, a value of 1 means two entries, and so on.

If it returns 65535, I suspect a different type of issue, namely that all MMIO reads are returning 0xFFs (this is a common symptom when reading does not work). If you attempt to read any other values directly from the memory-mapped pointer, do they all return 0xFFs? Could you, for example, try this:
printf("%lx\n", *((uint64_t*) mm_ptr));

ZaidQureshi commented on May 21, 2024

This is the value printed by printf("%lx\n", *((uint64_t*) ctrl->mm_ptr)); after calling nvm_ctrl_init:
203c03ffff

enfiskutensykkel commented on May 21, 2024

That's a good thing!

The controller probably supports 65536 entries, and the bug is that I'm using a 16-bit variable to read out the value and add one. I'm trying to find a datasheet or documentation to verify that it supports that many queue entries, but I suspect that this is the case.

I'll add a quick fix for it and credit you in the commit message (if you want), or you can send me a patch or a pull request instead. In the meantime, you can probably just comment out the +1.

enfiskutensykkel commented on May 21, 2024

Thank you for your time and effort, by the way.

ZaidQureshi commented on May 21, 2024

Wait, but the specification states that this register is only 16 bits.

enfiskutensykkel commented on May 21, 2024

Yes, the register is only 16 bits, but I believe the value is allowed to be 65536. I'll look around and check whether the spec errata mention it, and I'll also have a look at what the Linux kernel NVMe driver does. Does the identify example program work for you if you don't add one?

ZaidQureshi commented on May 21, 2024

Yes, that example works, but I think people typically always use the MQES + 1 value, so I think you will need to change the type of the max_entries field throughout your code. I believe this is also leading to another issue when memory is unmapped at the end of the latency benchmark, as I get the following error:
[unmap_memory] Page unmapping kernel request failed: Invalid argument

I am trying to debug it and see what is happening.

enfiskutensykkel commented on May 21, 2024

The spec says in the introduction chapter:

"Support for 65,535 I/O queues, with each I/O queue supporting up to 64K outstanding commands."

Since that is 2^16 - 1 I/O queues, perhaps 64K in this case means 2^16?

Regardless, the unmapping is a known bug (see #18); I haven't had time to rewrite the kernel module yet. The problem is how I'm currently storing address mappings: everything goes into one global list, when I should instead create a list per open descriptor. It happens with the nvm-latency-benchmark because I check against the PID, and for some reason CUDA programs fork into child processes, so the "wrong" PID ends up attempting to clean up the stored mapping.

It will clean up automatically once you unload the module, so do that every now and then if you run the nvm-latency-bench program multiple times. The warnings and errors look bad, but they are expected.
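For reference, a rough sketch of what the per-descriptor approach could look like (all names here are made up; this is not the module's actual code): hang the mapping list off the open file in open() and tear it down in release(), so cleanup no longer depends on which PID makes the call.

    #include <linux/fs.h>
    #include <linux/list.h>
    #include <linux/mutex.h>
    #include <linux/slab.h>

    /* Hypothetical per-descriptor context replacing the global mapping list. */
    struct ctrl_file_ctx {
        struct list_head maps;   /* mappings created through this descriptor */
        struct mutex     lock;
    };

    static int ctrl_open(struct inode *inode, struct file *file)
    {
        struct ctrl_file_ctx *ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
        if (ctx == NULL)
            return -ENOMEM;

        INIT_LIST_HEAD(&ctx->maps);
        mutex_init(&ctx->lock);
        file->private_data = ctx;   /* instead of one global list keyed by PID */
        return 0;
    }

    static int ctrl_release(struct inode *inode, struct file *file)
    {
        struct ctrl_file_ctx *ctx = file->private_data;

        /* Unmap everything this descriptor created, no matter which process
         * (parent or forked CUDA child) ends up closing it. */
        /* ... walk ctx->maps here and release each mapping ... */

        mutex_destroy(&ctx->lock);
        kfree(ctx);
        return 0;
    }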

enfiskutensykkel commented on May 21, 2024

The permanent fix would be to change the type of max_entries to uint32_t, as you suggested, or alternatively to always apply the + 1 wherever I reference it. I'll have to look into the impact of both solutions.
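A minimal sketch of the uint32_t variant, assuming the value is read from the low 16 bits of the CAP register (the helper name and surrounding code are hypothetical, not the actual src/ctrl.c):

    #include <stdint.h>

    /* CAP.MQES lives in bits 15:0 of the 64-bit CAP register and is zero-based. */
    static uint32_t cap_max_entries(uint64_t cap)
    {
        uint16_t mqes = (uint16_t) (cap & 0xffff);
        return (uint32_t) mqes + 1;   /* 0xffff + 1 = 65536, which needs more than 16 bits */
    }

With the CAP value printed earlier in this thread (0x203c03ffff), this gives 65536.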
