Comments (18)

chaithyagr commented on July 1, 2024

Adding here that we never faced this in the original cufinufft, so I suspect this is an issue specific to the tensorflow-nufft wrapper.


jmontalt commented on July 1, 2024

I have never observed this behaviour personally, although it's not the first report I have had of runs occasionally going wrong. Unfortunately I have not had the time to investigate that yet. It may be hardware-specific or problem-specific. Could you share some code (as simple as possible) that causes the issue, even if only occasionally?

Do you know if it's possible to run into this issue with eager execution, or does it only happen in graph execution?

As for debugging it yourself, I use a VS Code container for development, which has all the necessary dependencies to compile and run the op. It might be a good starting point, and you can configure whatever debugger you would like to use inside it.
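
In case it helps, one quick way to compare the two modes is to force eager execution globally before running the failing code (a minimal sketch; place it wherever fits your script):

import tensorflow as tf

# Force eager execution even for tf.function-decorated code, run the failing
# snippet, then switch back to graph execution and compare behaviour.
tf.config.run_functions_eagerly(True)
# ... run the code that triggers the issue ...
tf.config.run_functions_eagerly(False)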


chaithyagr commented on July 1, 2024

It occurs in both eager mode and graph execution. I checked that.

I will try to set up a simple repro for you.

I will use the container for some debugging.


jmontalt commented on July 1, 2024

Great, thank you! I'll be very interested to hear about anything you can find out. I'll also do my best to have a look as soon as possible.


chaithyagr commented on July 1, 2024

Adding additional details as I observe them.

This issue occurs only when (take this with a pinch of salt, as I really can't be sure given how random it is; I just run 100 iterations to be reasonably sure for now):

  1. Gradient tape is enabled.
  2. We have a multiscale SSIM loss on the output.
  3. It crops up even when the optimizer is not used to update the trajectory.

I am currently still setting up a minimal repro; sadly, the code seems to work fine in the repro so far, but I think it's a matter of closing in on where exactly the issue is triggered.


chaithyagr commented on July 1, 2024

I finally have a working reproducible example / test, although it only triggers the issue after running the code for many iterations.

Here is a simple, boiled-down case of a model which tries to learn a k-space sampling pattern for a simple density-compensated adjoint (here is the paper describing parts of it):

import tensorflow as tf
import numpy as np
import tensorflow_nufft as tfft
import tensorflow_mri as tfmri


# We assume no gradients with respect to density compensation
def nograd_density_estimator(traj, img_size, method, max_iter):
    @tf.custom_gradient
    def get_density(traj):
        out = tf.math.reciprocal_no_nan(tfmri.sampling.estimate_density(traj, img_size, method, max_iter))
        # Get approx scaling by take adj_op(op(1))
        test_im = tf.ones(img_size, dtype=tf.complex64)
        test_im_recon = tfft.nufft(
            tf.cast(out, tf.complex64) * tfft.nufft(
                test_im,
                traj,
                transform_type='type_2',
                fft_direction='forward'
            ),
            traj,
            grid_shape=img_size,
            transform_type='type_1',
            fft_direction='backward'
        )
        ratio = tf.reduce_mean(tf.math.abs(test_im_recon))
        out = out / tf.cast(ratio, out.dtype)
        def grad(dy):
            return None
        return out, grad
    return tf.stop_gradient(get_density(tf.transpose(traj)))

# Simple model to learn trajectory on Density compensated adjoint
class TrajModel(tf.keras.Model):
    def __init__(self, shape=(2, 32*512), img_size=(320, 320)):
        super(TrajModel, self).__init__()
        self.shape = shape
        # Radial initialization of trajectory
        self.init = tf.transpose(tf.reshape(tfmri.radial_trajectory(base_resolution=512 // 2, views=32), (-1, 2)))
        self.traj = self.add_weight(shape=self.shape, dtype=tf.float32)
        self.img_size = img_size
        self.traj.assign(self.init)

    def call(self, image):
        density_comp = nograd_density_estimator(self.traj, self.img_size, method='pipe', max_iter=10)
        kspace = tfft.nufft(image, tf.transpose(self.traj), transform_type='type_2', fft_direction='forward')
        kspace = kspace * tf.cast(density_comp, kspace.dtype)
        recon = tfft.nufft(kspace, tf.transpose(self.traj), grid_shape=image.shape[-2:], transform_type='type_1', fft_direction='backward')
        return tf.abs(recon[..., None])

#tf.config.run_functions_eagerly(True)
my_model = TrajModel()
np.random.seed(0)
I = np.random.random((1, 1, 320, 320)) + 1j * np.random.random((1, 1, 320, 320))
I = I.astype('complex64')
opt = tf.keras.optimizers.Adam(learning_rate=1e-3)
for i in range(1000):
    print(i)
    with tf.GradientTape(persistent=True) as tape:
        recon = my_model(I)
        # loss = tf.reduce_mean(tf.abs(recon - tf.abs(I[..., None])))
        loss = tf.reduce_mean(1 - tf.image.ssim_multiscale(tf.abs(I[..., None]), recon, max_val=tf.reduce_max(tf.abs(I))))
        print("Loss : " + str(loss))
        print("Max value of Recon : " + str(np.max(recon)))
    grads = tape.gradient(loss, my_model.trainable_variables)
    print(tf.math.reduce_max(grads))

The same code seems to work fine (again, pinch of salt; in my runs :P) with an L1 loss (if you comment out the mSSIM loss and enable just the line above it).

All the details I provided in the above comments (#20 (comment) and #20 (comment)) hold true here as well.


chaithyagr commented on July 1, 2024

I think I resolved the issue. It is as I expected: a missing cudaDeviceSynchronize. (Also see https://stackoverflow.com/questions/55975775/non-deterministic-results-using-gpus-with-tensorflow-and-tensorflow-serving and https://forums.developer.nvidia.com/t/multiple-runs-different-results/37950.)
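
For anyone who wants to check for this kind of non-determinism on their own setup, a rough sketch (the trajectory and data below are random placeholders, not my actual test case) is to run the same NUFFT twice with identical inputs and compare:

import numpy as np
import tensorflow as tf
import tensorflow_nufft as tfft

img_size = (320, 320)
rng = np.random.default_rng(0)
# Random 2D sampling points in [-pi, pi] and random k-space samples
points = tf.constant(rng.uniform(-np.pi, np.pi, size=(32 * 512, 2)), dtype=tf.float32)
kspace = tf.constant(
    (rng.standard_normal(32 * 512) + 1j * rng.standard_normal(32 * 512)).astype(np.complex64))

# Run the same type-1 transform twice; a large discrepancy between the two runs
# points at a race condition or uninitialized memory on the GPU.
results = [
    tfft.nufft(kspace, points, grid_shape=img_size,
               transform_type='type_1', fft_direction='backward')
    for _ in range(2)
]
print("max abs difference:", tf.reduce_max(tf.abs(results[0] - results[1])).numpy())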

I will make a PR soon, after final tests. We can't exactly have a test in place to detect this, as it would run on CPU where it would be fine, and the above repro isn't exactly perfect either.

I am just doing some final tests to check everything is running fine.

Thank you @jmontalt for your help!


jmontalt commented on July 1, 2024

Excellent! Race conditions of this sort are often easily fixed once you've closed in on the problem, but they can be very tricky to pinpoint. I was not looking forward to it. So thanks to you @chaithyagr!

I agree we'll need to do without a test in this case. The nature of the problem means it's impossible to get deterministic behaviour, and the GitHub runner does not have a GPU anyway.

I will merge this ASAP.


chaithyagr commented on July 1, 2024

Well, ideally race conditions like this can be found using NVIDIA's racecheck tool. It just so happened that I found this while looking through the code, even before I got racecheck running with line information.
I think cufinufft has a similar issue, and a colleague of mine, @paquiteau, isolated it around the same time.
Turns out the same discoveries at the same time do happen xD


chaithyagr commented on July 1, 2024

Hey, it looks like the issue is still not fixed. I am not yet sure what the problem is. @jmontalt, can you please reopen this issue? I will see if I can get another PR in with the fix.


jmontalt commented on July 1, 2024

Oh, sure!


chaithyagr commented on July 1, 2024

@jmontalt I took some time and went through the CUDA API calls and the trace in Nsight. I suspect that most of the calls occur in the same stream, which means we mostly won't need a device sync (my earlier fix isn't needed). It was a false fix; given how variable this issue is, I just got my first diagnosis wrong.

I had a question:

I see that for the spread step we do not set the global memory to 0, as is done for execute_type_1. Is there any particular reason? Do we expect TensorFlow to do this?

I feel most of our issues crop up from this: I face the issue when I have density compensation estimation in my test code, and if I run the same estimation but replace spread and interp with complete NUFFT operations (op and adj_op), everything runs smoothly for now (still testing to be very sure this time).

Further, I observed that the spread step runs kernels which issue atomicAdd operations to update the global memory, so any uninitialized values in that memory would persist in the result.
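
Not the actual CUDA code, but as a toy analogy of why the accumulation buffer must be zeroed first (a scatter-add only adds on top of whatever the buffer already holds):

import tensorflow as tf

indices = tf.constant([[0], [2], [2]])
updates = tf.constant([1.0, 2.0, 3.0])

clean_grid = tf.zeros([4])                      # what the spread step should start from
dirty_grid = tf.constant([9.0, 9.0, 9.0, 9.0])  # stand-in for uninitialized, leftover memory

print(tf.tensor_scatter_nd_add(clean_grid, indices, updates))  # [1. 0. 5. 0.]
print(tf.tensor_scatter_nd_add(dirty_grid, indices, updates))  # [10.  9. 14.  9.]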


jmontalt commented on July 1, 2024

Hey @chaithyagr, many thanks for looking into this. I was able to reproduce your example. On my machine, it seems to happen at every iteration after the first one. After including the latest fix, I have not observed the issue in 1,000 iterations. You may have nailed it down this time.

You're absolutely right: the grid data should be initialized to 0. This memory was originally allocated with OpKernelContext::allocate_temp, documented here. I have no reason to believe this initializes the memory.

I might remove the device synchronization instructions we added last time. There is no point in making the CPU wait if subsequent kernels will be launched in the same stream. Do you agree?


chaithyagr commented on July 1, 2024

Yes, I agree with you. I tested without them too; we don't need them. Looking at the CUDA trace, I can be sure that we run almost everything in the same stream.


jmontalt commented on July 1, 2024

Great. I'll prepare the merge.

That certainly should be the case. The CUDA stream is assigned by the TensorFlow runtime and our code should respect its wishes.


chaithyagr commented on July 1, 2024

Hello @jmontalt, I don't see the next version up; I thought it was set up as CD?
Perhaps we could use something like https://github.com/zaccharieramzi/tfkbnufft/blob/master/.github/workflows/publish.yml ?


jmontalt commented on July 1, 2024

It is indeed CD, but releases are only triggered on tagged commits. I tagged it this morning, and it was released half an hour ago.


chaithyagr commented on July 1, 2024

Ah I see, my bad. Thank you!

