Comments (21)
That is frustrating! Thank you for your efforts. Is this happening on the CPU or on the GPU?
Eager mode and graph mode follow the same code path. As far as I understand, the main difference between the two modes for our purposes is that in graph mode multiple ops may be executed concurrently. The errors might be related to this concurrent execution.
It might be worth trying to set tf.config.threading.set_inter_op_parallelism_threads to 1. If this makes the issue disappear, it would confirm the suspicion that it's related to concurrent op execution.
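For reference, the suggested setting is a one-line call that must run at program startup, before TensorFlow executes any ops:

```python
import tensorflow as tf

# Must be called before any ops run; changing it after the TF runtime
# has initialized raises a RuntimeError.
tf.config.threading.set_inter_op_parallelism_threads(1)
```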
from tensorflow-nufft.
Is this happening on the CPU or on the GPU?
Currently I am only working on the GPU, so it is happening on the GPU.
It might be worth trying to set tf.config.threading.set_inter_op_parallelism_threads to 1. If this makes the issue disappear, it would confirm the suspicion that it's related to concurrent op execution.
That's a good thing to test. I will try this. However, that still leaves the question of how we would fix such an issue if this turns out to be the cause. I at least hope this will be faster to run than eager mode.
from tensorflow-nufft.
All the runs I launched yesterday ran fine for 20 h. These are:
- A repeat of an earlier run which resulted in NaN
- A run in eager mode
- A run with tf.config.threading.set_inter_op_parallelism_threads set to 1
This is getting really frustrating, as I just don't have a setup to reproduce the issue to even be ...
All the above runs were made with all the seeds set to 0. I am pretty much lost at this point, but I will keep the issue open to track it. Is that fine, @jmontalt?
from tensorflow-nufft.
Hi @chaithyagr, thank you so much for running these tests. It's very useful to know this. Just to be clear, you have never been able to reproduce the issue if either running on eager mode or if inter-op parallelism threads is 1. Is that correct?
Yes, let's keep this issue open. Would you be willing to share some code that has failed before, even if it does not reproduce the issue consistently? It would give me a starting point to run some tests with code that has at least been known to fail.
Interestingly, I don't currently observe this problem, and at this point I'm not sure whether this could be due to differences in our usage or something else (perhaps it's less likely on my hardware?).
from tensorflow-nufft.
Just to be clear, you have never been able to reproduce the issue if either running on eager mode or if inter-op parallelism threads is 1. Is that correct?
Correct, and I get your point that you expect this to fix our issue. Frankly, I don't get a huge speedup with inter-op parallelism set to anything other than 1 (I am working in 3D, so I guess we are pretty much using all resources already!).
However, as the repeat run also passed, I am confused :P
Yes, let's keep this issue open. Would you be willing to share some code that has failed before, even if it does not reproduce the issue consistently? It would give me a starting point to run some tests with code that has at least been known to fail.
I think the same code from #20 (comment) should be a good enough minimum repro. Although one problem I see with this version is that in TFMRI we now threshold the density (https://github.com/mrphys/tensorflow-mri/blob/cfd8930ee5281e7f6dceb17c4a5acaf625fd3243/tensorflow_mri/python/ops/traj_ops.py#L938), and this no-NaN reciprocal could be bad:
out = tf.math.reciprocal_no_nan(tfmri.sampling.estimate_density(traj, img_size, method, max_iter))
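For context on why reciprocal_no_nan can be subtle here: it maps a zero input to zero instead of inf, so any density that is exactly zero (e.g., after thresholding) silently becomes a zero weight rather than producing a loud NaN/Inf. A minimal illustration:

```python
import tensorflow as tf

density = tf.constant([0.0, 1e-12, 2.0])
weights = tf.math.reciprocal_no_nan(density)
# 1/0 is mapped to 0.0 (not inf), so a zero density yields a zero
# weight silently; a tiny density like 1e-12 still blows up to ~1e12.
```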
Interestingly, I don't currently observe this problem, and at this point I'm not sure whether this could be due to differences in our usage or something else (perhaps it's less likely on my hardware?).
I don't think it could be hardware, as I run it on multiple machines and different GPUs. It is possible that something is wrong with my code specifically. I now plan to save the input when we get NaNs (although this may not be of much use, as an exact repeat of a run, which results in the same output, did not result in NaN, signifying that it is a more random issue).
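As an aside, for locating the first op that produces a NaN, TensorFlow ships a built-in checker that works in both eager and graph mode (it adds significant overhead, so it is for debugging only):

```python
import tensorflow as tf

# After this call, TF raises an error naming the offending op as soon
# as any op outputs a NaN or Inf (debug only; noticeable slowdown).
tf.debugging.enable_check_numerics()
```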
from tensorflow-nufft.
A minor question, @jmontalt: I was wondering what happens if the input points are not in [-pi, pi]. Could this possibly cause errors/NaNs? Do we have any guard checks?
from tensorflow-nufft.
It should support up to the extended range [-3 * pi, 3 * pi], with points beyond [-pi, pi] simply being folded back into this range.
This extended range should give some flexibility and robustness against points that are close to [-pi, pi], but note that I have never tested this. It would be useful to add such a test.
For points beyond this extended range, behaviour is undefined.
We do not currently assert that the points lie within the correct range. This would be rather inefficient, especially on the GPU.
This behaviour is the same as in the original FINUFFT.
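A minimal NumPy sketch of the folding behaviour described above (an illustration of modular wrapping, not tensorflow-nufft's actual implementation):

```python
import numpy as np

def fold(points):
    # Fold points from the extended range [-3*pi, 3*pi] back
    # into [-pi, pi) by modular wrapping.
    return (points + np.pi) % (2 * np.pi) - np.pi
```

For example, a point at 2.5 * pi folds to 0.5 * pi; both frequencies alias to the same value on the grid.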
from tensorflow-nufft.
Well, while I agree with the wrapping from a mathematical point of view, I think this isn't valid in MRI. But I guess tensorflow-nufft is a generic package, so it's fine.
from tensorflow-nufft.
It does remain valid in MRI. Note that pi is, by definition, the maximum spatial frequency (in radians per pixel) that you can have in an image of a certain size (which is one of the inputs to nufft). In order to usefully sample higher frequencies than that, you would need a finer grid. But if you have a finer grid, then pi is simply redefined to be the maximum spatial frequency for that grid.
from tensorflow-nufft.
Yes, I get that. What I mean is that in the case of MRI, if the inputs are given as is (not re-normalized to pi), then:
with points beyond [-pi, pi] simply being folded back to this range.
This would be wrong. But then again, as I said, I don't think that's the job of tensorflow-nufft, as it is a more generic library.
from tensorflow-nufft.
For points beyond this extended range, behaviour is undefined.
If this is valid, I think this could be the issue. I observed that on adding a lot of randomness to my k-space trajectory, the NaN issue came up suddenly. While I still can't pinpoint why, my repro has always been in a controlled setting where there are no points outside the range. Let's see how the code behaves in this scenario. I will try to create such an odd repro for now.
from tensorflow-nufft.
OK, I think that was the issue!
If I take the repro from above and basically add np.pi to the k-space locations, then while the NUFFT works fine, the gradients with respect to the trajectories are NaN!
Not sure if this is realistic though. But proper guard checks are added on my side; I will report back if the issue still remains.
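The guard check on the user side could look something like this (a hypothetical sketch, not the actual code used):

```python
import numpy as np

def check_trajectory(points, limit=np.pi):
    # Hypothetical guard: fail loudly if any k-space point lies
    # outside [-limit, limit] before it ever reaches the NUFFT.
    max_abs = float(np.max(np.abs(points)))
    if max_abs > limit:
        raise ValueError(
            f"k-space points out of range: max |point| = {max_abs:.4f} "
            f"> {limit:.4f}")
```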
from tensorflow-nufft.
Great, thanks for keeping me in the loop.
So the forward nufft works as expected (e.g., with wrapping) with points in [-3 * pi, 3 * pi], but the gradient only works in [-pi, pi]? Or is it that you had points beyond [-3 * pi, 3 * pi]?
I'll make a note to add an option to assert that the points are in the right range, which would be useful for debugging purposes (I don't want to enable this by default because of the likely performance hit).
from tensorflow-nufft.
I will also document this behaviour in the website to reduce the chances it happens again in the future.
I'm keeping the issue open until everything is clear.
from tensorflow-nufft.
Well, as I just added pi, I don't think we can have points beyond [-3 * pi, 3 * pi].
But I must add, I also tried adding 0.1 * pi and 0.5 * pi and didn't face any issues there. So we can't say much yet. We need some more careful analysis of what's going wrong, where, and why...
from tensorflow-nufft.
I'll make a note to add an option to assert that the points are in the right range, which would be useful for debugging purposes (I don't want to enable this by default because of likely performance hit).
This is really helpful!
from tensorflow-nufft.
I think I finally have a complicated case that is reproducible!
If I have a loss which is multiscale SSIM, I get NaNs in graph mode, but no NaNs in eager mode.
However, using tf.config.threading.set_inter_op_parallelism_threads(1), the issue still exists! It only vanishes in eager mode.
I am not sure what the issue is specifically yet; I think with this I can make a good minimum repro.
An additional thing I observed: the output from the network is sometimes just plain wrong even when it isn't NaN (observed this as I had all the seeds set to 0), which prevents its good use anyway...
To me, this could be because the MSSIM is launched before the end of the nufft operation.
This is very weird to me, as I feel that the MSSIM should technically not launch before the nufft operations are completed.
We should have a look at some basic network implementation, possibly, to be sure we are not missing blocking on the output.
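As an aside for the seeded-run comparisons: recent TensorFlow versions (2.9+, an assumption about the version in use here) can force deterministic op implementations, which helps rule out ordinary nondeterminism as the source of run-to-run differences:

```python
import tensorflow as tf

# Force deterministic op implementations where TF provides them, so
# seeded runs should be reproducible. Note this does not cover custom
# ops such as the NUFFT kernels, so it isolates them as the suspect.
tf.config.experimental.enable_op_determinism()
```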
from tensorflow-nufft.
I wonder how much this issue is related to tensorflow/tensorflow#57353.
Although my hypothesis above seems more valid to me, as my results are just plain wrong sometimes...
from tensorflow-nufft.
Hi @chaithyagr, thank you for your efforts. My best guess would be that there's a missing device synchronisation somewhere, which would explain the randomness of the results. It'll be some time until I have the chance to take a look, but if you want to share your repro I will keep it in mind. There's also some significant refactoring in progress which should make debugging easier in the future.
from tensorflow-nufft.
Updating with some more info.
I had some models that ran for a really long time with a lot of NUFFTs!
I seem to have hit NaN issues even in eager mode with these.
I have a wrapper to recompute if a NaN occurs, and also to log where the NaN comes from, and it clearly is from the NUFFT operation.
I feel eager mode runs slower, thereby making it harder to hit NaNs due to a missing device sync (less probable).
For now, I added a device sync as seen in chaithyagr#1.
I don't see a huge perf hit, even though it is an extreme step at the moment. I think we need to check with nvprof whether the run is happening in a different stream, to see if this is even needed.
For now, I don't see any issues in my runs with this update. I have no idea though; I seem to have tackled this issue in some form or other every time, only for it to be back a month later :P
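The recompute-on-NaN wrapper mentioned above could be sketched roughly like this (hypothetical names; not the actual wrapper used):

```python
import tensorflow as tf

def nan_guard(fn, name, retries=3):
    # Hypothetical sketch: rerun `fn` when its output contains NaNs,
    # logging which operation produced them.
    def wrapped(*args, **kwargs):
        out = fn(*args, **kwargs)
        for attempt in range(retries):
            # tf.abs makes the check work for complex outputs too.
            if not bool(tf.reduce_any(tf.math.is_nan(tf.abs(out)))):
                return out
            tf.print("NaN detected in", name, "- retry", attempt + 1)
            out = fn(*args, **kwargs)
        return out
    return wrapped
```

As written this works in eager mode; inside a tf.function the Python `bool(...)` check would have to be replaced with tf.cond.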
from tensorflow-nufft.
That's great, thank you! Fingers crossed it is finally fixed now. Good luck!
from tensorflow-nufft.