
Comments (21)

jmontalt commented on July 19, 2024

That is frustrating! Thank you for your efforts. Is this happening on the CPU or on the GPU?

Eager mode and graph mode follow the same code path. As far as I understand, the main difference between the two modes for our purposes is that in graph mode multiple ops may be executed concurrently. The errors might be related to this concurrent execution.

It might be worth trying to set tf.config.threading.set_inter_op_parallelism_threads to 1. If this makes the issue disappear, it would confirm the suspicion that it's related to concurrent op execution.
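For reference, a minimal sketch of that setting (it has to be called at start-up, before any ops have run):

    import tensorflow as tf

    # Run ops sequentially so that independent ops cannot execute concurrently.
    # This must be set at start-up, before any ops have been executed.
    tf.config.threading.set_inter_op_parallelism_threads(1)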

chaithyagr commented on July 19, 2024

Is this happening on the CPU or on the GPU?

Currently I am only working on the GPU, so it is happening on the GPU.

It might be worth trying to set tf.config.threading.set_inter_op_parallelism_threads to 1. If this makes the issue disappear, it would confirm the suspicion that it's related to concurrent op execution.

That's a good thing to test, and I will try it. However, that still leaves the question of how we would fix such an issue if this turns out to be the cause. I at least hope this will be faster to run than eager mode.

chaithyagr commented on July 19, 2024

All the runs I launched yesterday ran fine for 20 h. These are:

  1. Repeat an earlier run which resulted in NaN
  2. Run in eager mode
  3. Run with tf.config.threading.set_inter_op_parallelism_threads as 1

This is getting really frustrating, as I just don't have a setup to repro, to even be ...
All the above runs were made with all the seeds set to 0. I am pretty much lost at this point, but I will keep the issue open to track it. Is that fine, @jmontalt?

jmontalt commented on July 19, 2024

Hi @chaithyagr, thank you so much for running these tests. It's very useful to know this. Just to be clear, you have never been able to reproduce the issue when either running in eager mode or setting inter-op parallelism threads to 1. Is that correct?

Yes, let's keep this issue open. Would you be willing to share some code that has failed before, even if it does not reproduce the issue consistently? It would give me a starting point to run some tests with code that has at least been known to fail.

Interestingly, I don't currently observe this problem, and at this point I'm not sure whether this could be due to differences in our usage or something else (perhaps it's less likely on my hardware?).

chaithyagr commented on July 19, 2024

Just to be clear, you have never been able to reproduce the issue when either running in eager mode or setting inter-op parallelism threads to 1. Is that correct?

Correct. I get your point that you expect this to fix our issue, and frankly I don't get a huge speedup with inter-op parallelism set to anything other than 1 (I am working in 3D, so I guess we are pretty much using all resources already!).
However, as the repeat run also passed, I am confused :P

Yes, let's keep this issue open. Would you be willing to share some code that has failed before, even if it does not reproduce the issue consistently? It would give me a starting point to run some tests with code that has at least been known to fail.

I think the same code from #20 (comment) should be a good enough minimal repro. One problem I see with this version, though, is that in TFMRI we now threshold the density (https://github.com/mrphys/tensorflow-mri/blob/cfd8930ee5281e7f6dceb17c4a5acaf625fd3243/tensorflow_mri/python/ops/traj_ops.py#L938), and this no-NaN reciprocal could be bad:

out = tf.math.reciprocal_no_nan(tfmri.sampling.estimate_density(traj, img_size, method, max_iter))
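For context, a small illustration of the edge-case behaviour of tf.math.reciprocal_no_nan (zero densities become zero weights rather than inf/NaN, so such samples are silently dropped from the density compensation):

    import tensorflow as tf

    # reciprocal_no_nan returns 0 where the input is 0, instead of inf/NaN.
    density = tf.constant([0.0, 0.5, 2.0])
    print(tf.math.reciprocal_no_nan(density).numpy())  # -> [0.  2.  0.5]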

Interestingly, I don't currently observe this problem, and at this point I'm not sure whether this could be due to differences in our usage or something else (perhaps it's less likely on my hardware?).

I don't think it could be hardware, as I run it on multiple machines and different GPUs. It is possible that something is wrong with my code specifically. I now plan to save the input when we get NaNs (although this may not be of much use, as an exact repeat of a run, which produces the same output, did not result in NaN, signifying that it is a more random issue).

chaithyagr commented on July 19, 2024

A minor question, @jmontalt: I was wondering what happens if the input points are not in [-pi, pi]. Could this possibly cause errors / NaNs? Do we have any guard checks?

jmontalt commented on July 19, 2024

It should support up to the extended range [-3 * pi, 3 * pi], with points beyond [-pi, pi] simply being folded back to this range.

This extended range should give some flexibility and robustness against points that are close to [-pi, pi], but note that I have never tested this. It would be useful to add such a test.

For points beyond this extended range, behaviour is undefined.

We do not currently assert that the points lie within the correct range. This would be rather inefficient, especially on the GPU.

This behaviour is the same as in the original FINUFFT.
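If it helps for debugging, a guard could be added on the user side. Here is an illustrative sketch (wrap_to_pi and check_points_range are hypothetical helpers, not part of tensorflow-nufft):

    import numpy as np
    import tensorflow as tf

    def wrap_to_pi(points):
      # Fold arbitrary k-space points back into the canonical range [-pi, pi).
      return tf.math.floormod(points + np.pi, 2.0 * np.pi) - np.pi

    def check_points_range(points, max_abs=3.0 * np.pi):
      # Debug-only assertion that all points lie within the supported range.
      tf.debugging.assert_less_equal(
          tf.reduce_max(tf.abs(points)), tf.cast(max_abs, points.dtype),
          message='NUFFT points lie outside the supported range')
      return points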

chaithyagr commented on July 19, 2024

Well, while I agree with the wrapping from a mathematical point of view, I think this isn't valid in MRI. But tensorflow-nufft is a generic package, so I guess it's fine.

jmontalt commented on July 19, 2024

It does remain valid in MRI. Note that pi is, by definition, the maximum spatial frequency (in radians per pixel) that you can have in an image of a certain size (which is one of the inputs to nufft). In order to usefully sample higher frequencies than that, you would need a finer grid. But if you have a finer grid, then pi is simply redefined to be the maximum spatial frequency for that grid.
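As a concrete sketch of that normalization (the variable names and numbers are only illustrative; physical k-space coordinates in 1/m are scaled by the pixel size so that the Nyquist frequency maps to pi):

    import numpy as np

    pixel_size = 1e-3                              # illustrative pixel size: 1 mm
    k_physical = np.linspace(-500.0, 500.0, 256)   # illustrative k-space samples in 1/m (Nyquist = 1 / (2 * pixel_size))

    # Spatial frequency in radians per pixel: omega = 2 * pi * k * pixel_size.
    # At the Nyquist limit k = 1 / (2 * pixel_size) this gives omega = pi, the
    # maximum frequency representable on a grid with that pixel size.
    points = 2.0 * np.pi * k_physical * pixel_size  # lands in [-pi, pi]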

chaithyagr commented on July 19, 2024

Yes, I get that. What I mean is that, in the case of MRI, if the inputs are given as is (not re-normalized to pi), then:

with points beyond [-pi, pi] simply being folded back to this range.

this would be wrong. But then again, as I said, I don't think that's the job of tensorflow-nufft, as it is a more generic library.

chaithyagr commented on July 19, 2024

For points beyond this extended range, behaviour is undefined.

If this is the case, I think this could be the issue. I observed that on adding a lot of randomness to my k-space trajectory, the NaN issue came up suddenly. While I still can't pinpoint why, my repro has always been in a controlled setting where there are no points outside the range. Let's see how the code behaves in this scenario. I will try to create such an odd repro for now.

chaithyagr commented on July 19, 2024

OK, I think that was the issue!
If I take the repro from above and simply add np.pi to the k-space locations, the NUFFT itself works fine, but the gradients with respect to the trajectories are NaN!

Not sure if this is realistic, though. But proper guard checks are now added on my side; I will report back if the issue remains.
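For the record, a rough sketch of that check (assuming a type-2 tfft.nufft call with default arguments; the sizes and the loss are placeholders):

    import numpy as np
    import tensorflow as tf
    import tensorflow_nufft as tfft

    image = tf.complex(tf.random.normal([128, 128]), tf.random.normal([128, 128]))
    points = tf.random.uniform([1000, 2], -np.pi, np.pi) + np.pi  # shifted outside [-pi, pi]

    with tf.GradientTape() as tape:
      tape.watch(points)
      kspace = tfft.nufft(image, points)   # type-2 NUFFT: image -> non-uniform points
      loss = tf.reduce_mean(tf.abs(kspace))

    grad = tape.gradient(loss, points)
    print(bool(tf.reduce_any(tf.math.is_nan(grad))))  # True when the gradient contains NaNs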

jmontalt commented on July 19, 2024

Great, thanks for keeping me in the loop.

So the forward NUFFT works as expected (i.e., with wrapping) with points in [-3 * pi, 3 * pi], but the gradient only works in [-pi, pi]? Or is it that you had points beyond [-3 * pi, 3 * pi]?

I'll make a note to add an option to assert that the points are in the right range, which would be useful for debugging purposes (I don't want to enable this by default because of the likely performance hit).

jmontalt commented on July 19, 2024

I will also document this behaviour on the website to reduce the chances of it happening again in the future.

I'm keeping the issue open until everything is clear.

chaithyagr commented on July 19, 2024

Well, as I just added pi, I don't think we can have points beyond [-3 * pi, 3 * pi].
But I must add that I also tried adding 0.1 * pi and 0.5 * pi and didn't face any issues there. So we can't say much yet; we need some more thorough analysis of what's going wrong, where and why...

chaithyagr commented on July 19, 2024

I'll make a note to add an option to assert that the points are in the right range, which would be useful for debugging purposes (I don't want to enable this by default because of the likely performance hit).

This is really helpful!

chaithyagr commented on July 19, 2024

I think I finally have a complicated case that is reproducible!

If I have a loss which is multiscale SSIM (MS-SSIM), I get NaNs in graph mode, but no NaNs in eager mode.

However, with tf.config.threading.set_inter_op_parallelism_threads(1) the issue still persists! It only vanishes in eager mode.

I am not sure what the issue is specifically yet, but I think with this I can make a good minimal repro.

An additional thing I observed: the output from the network is sometimes simply wrong even when it isn't NaN (I observed this as I had all the seeds set to 0 and ...), which prevents its proper use anyway...

To me, this could be because the MS-SSIM is launched before the NUFFT operation has finished.
This is very weird to me, as I feel the MS-SSIM should technically not launch before the NUFFT operations are completed.
We should possibly have a look at a basic network implementation to be sure we are not failing to block on the output.
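For reference, the rough shape of that setup is something like the sketch below (illustrative only; the reconstruction step, shapes, and names are placeholders rather than the actual training code):

    import tensorflow as tf
    import tensorflow_nufft as tfft

    @tf.function  # graph mode; removing this decorator corresponds to eager mode
    def train_step(image, points, reference):
      with tf.GradientTape() as tape:
        tape.watch(points)
        kspace = tfft.nufft(image, points)                # forward (type-2) NUFFT
        recon = tfft.nufft(kspace, points,
                           grid_shape=image.shape,
                           transform_type='type_1')       # type-1 NUFFT back onto the grid
        loss = 1.0 - tf.reduce_mean(tf.image.ssim_multiscale(
            tf.abs(recon)[..., None], tf.abs(reference)[..., None], max_val=1.0))
      return loss, tape.gradient(loss, points)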

chaithyagr commented on July 19, 2024

I wonder how much this issue is related to tensorflow/tensorflow#57353.
Although my hypothesis above seems more plausible to me, as my results are sometimes just plain wrong...

jmontalt commented on July 19, 2024

Hi @chaithyagr, thank you for your efforts. My best guess would be that there's a missing device synchronisation somewhere, which would explain the randomness of the results. It'll be some time until I have the chance to take a look, but if you want to share your repro I will keep it in mind. There's also some significant refactoring in progress which should make debugging easier in the future.

chaithyagr commented on July 19, 2024

Updating with some more info.
I had some models that ran for a really long time with a lot of NUFFTs!

With these, I seem to have hit NaN issues even in eager mode.
I have a wrapper that recomputes if a NaN occurs and also reports where the NaN came from, and it is clearly the NUFFT operation.

I feel eager mode runs slower, thereby making NaNs due to a missing device sync less probable.

For now, I added a device sync, as seen in chaithyagr#1.
I don't see a huge performance hit, even though it is an extreme step at the moment. I think we need to check with nvprof whether the work is running on a different stream, to see if this is even needed.

For now, I don't see any issues in my runs with this update. I have no idea, though; I seem to have tackled this issue in some form or other every time, only for it to come back a month later :P

jmontalt commented on July 19, 2024

That's great, thank you! Fingers crossed it is finally fixed now. Good luck!
