Hi, I've been debugging an issue I initially found in ASPIRE running our test suite vi

Help with possible pts dependent bug about cufinufft HOT 14 CLOSED

garrettwrong commented on August 30, 2024

Help with possible pts dependent bug

from cufinufft.

Comments (14)

ahbarnett commented on August 30, 2024

The first thing to look for would be global state of cufft. We had an issue like this with FFTW in FINUFFT (not involving 32 vs 64 bit types, but just with different threads). Maybe @MelodyShih @janden @JBlaschke have ideas.

PS @blackwer has started to integrate cufinufft into finufft, first at the repo level, but we plan to integrate as much code as possible, for long-term maintainability. There will be many decisions there, so I hope we can ping you about them or have a meeting.

Good to see you the other day. Best, Alex

garrettwrong commented on August 30, 2024

That makes sense. I did check that the plans have a distinct cufft plan but nothing further than that. I'll inspect some more there. Only failing for certain point sets is causing me some cognitive dissonance.

Yes that integration sounds great; big project. Was a pleasure, thanks!

garrettwrong commented on August 30, 2024

I think I narrowed corruption down to something occurring temporally around Step 3 of the second 2d1 execute; after returning from CUDECONVOLVE2D I see the junk... Continuing ...

garrettwrong commented on August 30, 2024

Getting back into debugging this. Seeing some strange values for fw, but haven't been able to track it down yet. I am able to tickle this just in single precision (without mixing data types or having multiple cufinufft plans). Using A100 w/ CUDA 11.7 driver and toolkit.

blackwer commented on August 30, 2024

Thanks Garrett! Let me know if there's anything I can do to help. So many changes it's hard to track this stuff down, so that kind of debugging should fall on me somewhat. I'd like to understand better anyway. Edit: lol I thought this was on my PR. I'm still going to check it out.

On Tue, Oct 18, 2022 at 9:25 AM Garrett Wright wrote:
> Getting back into debugging this. Seeing some strange values for fw, but haven't been able to track it down yet. I am able to tickle this just in single precision (without mixing data types or having multiple cufinufft plans).

garrettwrong commented on August 30, 2024

Hi! Today I created a CUDA-only case using the point set from the earlier Python script, hoping to remove Python from the equation entirely and make better use of my debugger time. It's basically just the shipped 2d API example changed to use my points and problem size. (Attached standalone CUDA cufinufft example.)

It seems I can get bad behavior in singles all by itself for 2d pretty reliably with this. Doubles don't seem to have any trouble that I have seen. The test (of a single value) that ships with the code seems to succeed, but collectively the values under singles look wrong for my test case. Example below. If you have a chance, could you confirm whether you can reproduce it?

~~The non deterministic behavior is a little concerning to me. I haven't been able to reproduce the non-deterministic behavior under cuda-memcheck yet. (interesting)~~

garrettwrong commented on August 30, 2024

There was an error in the points and printing for my last tarball, so please disregard. Will try to make another standalone test. Sorry about that.

garrettwrong commented on August 30, 2024

Easier to just stash work in a branch on my fork. Hopefully I'm making fewer mistakes now. I probably should have gone straight to CUDA earlier when I was fresher, but I was initially inclined to think it was a bug in ASPIRE.

master...garrettwrong:cufinufft:bug_example

ahbarnett commented on August 30, 2024

Hi Garrett,
Can I close this issue or is there still a bug, IYHO ? :) Best, Alex

garrettwrong commented on August 30, 2024

Hrmm. I don't believe the bug was ever addressed directly. I haven't tried to reproduce it with the recent (final) version. I will try to reproduce it tomorrow and update the issue :).

garrettwrong commented on August 30, 2024

Hi again. I am still able to reproduce this issue's example using the v1.3 code built on our machine.

Linux caf.math.princeton.edu.private 3.10.0-1160.92.1.el7.x86_64 #1 SMP Tue Jun 6 10:04:24 EDT 2023 x86_64 x86_64 x86_64 GNU/Linux

rh/devtoolset/8

Cuda compilation tools, release 11.7, V11.7.64 Build cuda_11.7.r11.7/compiler.31294372_0

(base) ➜  cuf_13_bug_repro_144 git clone -b bug_example git@github.com:garrettwrong/cufinufft.git
Cloning into 'cufinufft'...
Warning: the ECDSA host key for 'github.com' differs from the key for the IP address '140.82.114.3'
Offending key for IP in /u/gbwright/.ssh/known_hosts:11
Matching host key in /u/gbwright/.ssh/known_hosts:21
Are you sure you want to continue connecting (yes/no)? yes
remote: Enumerating objects: 4970, done.
remote: Counting objects: 100% (1284/1284), done.
remote: Compressing objects: 100% (148/148), done.
remote: Total 4970 (delta 1177), reused 1146 (delta 1136), pack-reused 3686
Receiving objects: 100% (4970/4970), 1.72 MiB | 0 bytes/s, done.
Resolving deltas: 100% (3579/3579), done.
(base) ➜  cuf_13_bug_repro_144 cd cufinufft 
(base) ➜  cufinufft git:(bug_example) git remote add upstream git@github.com:flatironinstitute/cufinufft.git
(base) ➜  cufinufft git:(bug_example) git fetch upstream

# < snip>

From github.com:flatironinstitute/cufinufft
 * [new tag]         v1.3       -> v1.3
(base) ➜  cufinufft git:(bug_example) git rebase v1.3      
First, rewinding head to replay your work on top of it...
Applying: Add site for tigergpu at Princeton
Applying: Update README with Princeton sites
Applying: add debug example cases
Applying: Add PACM site
Applying: debug symbols seems to tickle much more often...
(base) ➜  cufinufft git:(bug_example) module load  cudatoolkit/11.7 rh/devtoolset/8
(base) ➜  cufinufft git:(bug_example) make all site=PACM -j

<snip>

(base) ➜  cufinufft git:(bug_example) bin/example_1_32 
[gpu ] one targ: rel err in c[64] is 9.3e+04
fk[0:8]: -16466.515625, -6339.429688, 2939.932373, -18116.044922, -37109.199219, -18116.048828, 2939.932617, -6339.421387, -5856.102539, -11939.065430, 
(base) ➜  cufinufft git:(bug_example) bin/example_1_64
[gpu ] one targ: rel err in c[64] is 1.71e+03
fk[0:8]: 14.591308, 19.666844, 23.890922, 21.967780, 20.733888, 21.967777, 23.890929, 19.666834, 19.666833, 22.993997, 
(base) ➜  cufinufft git:(bug_example)

blackwer commented on August 30, 2024

Hi @garrettwrong. Significant changes have been made across the board in the current version of this in the finufft repo. I'll try reproducing it there instead and link in this issue from there if it's reproducible. If you'd rather do the lifting, let me know. I'll probably start working on it in 10 minutes or so :)

blackwer commented on August 30, 2024

@garrettwrong I'm unable to reproduce this on the current master branch of finufft (flatironinstitute/finufft@710f6b6). GCC 11.4.0, cuda toolkit 11.8.0, on A6000 with: Driver Version: 535.104.05 and CUDA Version: 12.2. Can you try to reproduce this in your environment? CLI recipe and source are below. Make sure to update the cuda architecture in the cmake command.

Edit: the double and float builds give the same results, though the relative error is still very high. I'm not sure whether that error is claimed to be part of the issue.

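#include <cstdint>        // int64_t
#include <cstdio>         // printf
#include <iostream>
#include <iomanip>
#include <math.h>
#include <helper_cuda.h>  // checkCudaErrors, from the CUDA samples
#include <complex>

#include <cuComplex.h>    // cuFloatComplex
#include <cufinufft.h>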

template <typename T>
T infnorm(int n, std::complex<T> *a) {
    T nrm = 0.0;
    for (int m = 0; m < n; ++m) {
        T aa = real(conj(a[m]) * a[m]);
        if (aa > nrm)
            nrm = aa;
    }
    return sqrt(nrm);
}

using namespace std;

/* The following represent a point set saved from ASPIRE. */
float freqs1[128] = {
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.01216906, 0.062310982,
0.14539924, 0.25023413, 0.3626602, 0.46749508, 0.55058336, 0.6007253, 0.023870474, 0.12222739, 0.28521088, 0.49085197,
0.7113836, 0.9170247, 1.0800081, 1.178365, 0.034654554, 0.17744668, 0.41406205, 0.71260655, 1.0327688, 1.3313134,
1.5679288, 1.7107208, 0.04410688, 0.22584677, 0.52700096, 0.9069761, 1.3144654, 1.6944405, 1.9955949, 2.1773348,
0.051864203, 0.26556772, 0.6196876, 1.0664911, 1.5456479, 1.9924514, 2.3465714, 2.5602748, 0.05762841, 0.29508302,
0.68855995, 1.1850214, 1.7174318, 2.2138932, 2.6073701, 2.8448248, 0.061178003, 0.31325847, 0.7309714, 1.258012,
1.823216, 2.3502564, 2.7679696, 3.02005, 0.062376548, 0.31939557, 0.74529195, 1.2826579, 1.8589348, 2.3963008,
2.8221972, 3.0792162, 0.061177995, 0.31325844, 0.73097134, 1.2580119, 1.8232158, 2.3502564, 2.7679694, 3.0200498,
0.05762841, 0.29508302, 0.68855995, 1.1850214, 1.7174318, 2.2138932, 2.6073701, 2.8448248, 0.0518642, 0.2655677,
0.61968756, 1.0664909, 1.5456476, 1.9924511, 2.346571, 2.5602746, 0.04410688, 0.22584677, 0.52700096, 0.9069761,
1.3144654, 1.6944405, 1.9955949, 2.1773348, 0.03465455, 0.17744665, 0.414062, 0.7126065, 1.0327687, 1.3313133,
1.5679287, 1.7107207, 0.02387046, 0.12222735, 0.2852108, 0.49085173, 0.7113832, 0.9170242, 1.0800077, 1.1783645,
0.012169059, 0.06231098, 0.14539923, 0.2502341, 0.36266017, 0.46749502, 0.55058336, 0.60072523
};

float freqs2[128] = {
0.062376548, 0.31939557, 0.74529195, 1.2826579, 1.8589348, 2.3963008, 2.8221972, 3.0792162, 0.061177995, 0.31325844,
0.73097134, 1.2580119, 1.8232158, 2.3502564, 2.7679694, 3.0200498, 0.05762841, 0.29508302, 0.68855995, 1.1850214,
1.7174318, 2.2138932, 2.6073701, 2.8448248, 0.051864203, 0.2655677, 0.61968756, 1.066491, 1.5456477, 1.9924512,
2.3465712, 2.5602746, 0.04410688, 0.22584677, 0.52700096, 0.9069761, 1.3144654, 1.6944405, 1.9955949, 2.1773348,
0.03465455, 0.17744665, 0.414062, 0.7126065, 1.0327687, 1.3313133, 1.5679287, 1.7107207, 0.023870474, 0.12222739,
0.28521088, 0.49085197, 0.7113836, 0.9170247, 1.0800081, 1.178365, 0.012169055, 0.062310956, 0.14539917, 0.250234,
0.36266002, 0.46749485, 0.5505831, 0.600725, -2.7265654e-09, -1.3961223e-08, -3.2577745e-08, -5.6066757e-08, -8.1256616e-08, -1.04745624e-07,
-1.2336216e-07, -1.3459682e-07, -0.01216906, -0.062310982, -0.14539924, -0.25023413, -0.3626602, -0.46749508, -0.55058336, -0.6007253,
-0.023870474, -0.12222741, -0.2852109, -0.490852, -0.71138364, -0.9170247, -1.0800083, -1.1783651, -0.03465456, -0.17744671,
-0.41406208, -0.7126067, -1.0327691, -1.3313136, -1.5679291, -1.7107211, -0.04410688, -0.22584677, -0.52700096, -0.9069761,
-1.3144654, -1.6944405, -1.9955949, -2.1773348, -0.051864203, -0.26556772, -0.6196876, -1.0664911, -1.5456479, -1.9924514,
-2.3465714, -2.5602748, -0.05762842, -0.29508305, -0.68856, -1.1850214, -1.717432, -2.2138934, -2.6073704, -2.844825,
-0.061178003, -0.31325847, -0.7309714, -1.258012, -1.823216, -2.3502564, -2.7679696, -3.02005
};


int main(int argc, char* argv[])
{
  int N1 = 8;
  int N2 = 8;
  int M = 128;

  double tol=1e-7;

  int iflag=1;

  cout<<scientific<<setprecision(3);
  int ier;

  // malloc host arrays
  float *x, *y;
  complex<float> *c, *fk;
  checkCudaErrors(cudaMallocHost(&x, M*sizeof(float)));
  checkCudaErrors(cudaMallocHost(&y, M*sizeof(float)));
  checkCudaErrors(cudaMallocHost(&c, M*sizeof(complex<float>)));
  checkCudaErrors(cudaMallocHost(&fk, N1*N2*sizeof(complex<float>)));

  // malloc device arrays
  float *d_x, *d_y;
  cuFloatComplex *d_c, *d_fk;
  checkCudaErrors(cudaMalloc(&d_x, M*sizeof(float)));
  checkCudaErrors(cudaMalloc(&d_y, M*sizeof(float)));
  checkCudaErrors(cudaMalloc(&d_c, M*sizeof(cuFloatComplex)));
  checkCudaErrors(cudaMalloc(&d_fk, N1*N2*sizeof(cuFloatComplex)));


  // Making data
  for (int i = 0; i < M; i++) {
    x[i] = freqs1[i];
    y[i] = freqs2[i];
  }
  for(int i=0; i<N1*N2; i++){
    fk[i].real(0);
    fk[i].imag(0);
  }
  for(int i=0; i<M; i++){
    c[i].real(1);
    c[i].imag(0);
  }


  // Copy data to device memory, real users might just populate in memory.
  checkCudaErrors(cudaMemcpy(d_x, x, M*sizeof(float),cudaMemcpyHostToDevice));
  checkCudaErrors(cudaMemcpy(d_y, y, M*sizeof(float),cudaMemcpyHostToDevice));
  checkCudaErrors(cudaMemcpy(d_fk, fk, N1*N2*sizeof(complex<float>), cudaMemcpyHostToDevice));
  checkCudaErrors(cudaMemcpy(d_c, c, M*sizeof(complex<float>), cudaMemcpyHostToDevice));


  // construct plan
  cufinufftf_plan dplan;
  int dim = 2;
  int type = 1;

  int64_t nmodes[3];
  int ntransf = 1;
  nmodes[0] = N1;
  nmodes[1] = N2;
  nmodes[2] = 1;

  // Make Plan
  ier = cufinufftf_makeplan(type, dim, nmodes, iflag, ntransf, tol, &dplan, NULL);

  if (ier!=0){
    printf("err: makeplan\n");
    return ier;
  }


  // Set Non uniform points
  ier=cufinufftf_setpts(dplan, M, d_x, d_y, NULL, 0, NULL, NULL, NULL);
  if (ier!=0){
    printf("err: setpts\n");
    return ier;
  }


  // Execute the plan on the data
  ier=cufinufftf_execute(dplan, d_c, d_fk);

  if (ier!=0){
    printf("err: exec\n");
    return ier;
  }


  // Destroy the plan when done processing
  ier=cufinufftf_destroy(dplan);
  if (ier!=0){
    printf("err: destroy\n");
    return ier;
  }


  // Copy test data back to host and compare
  checkCudaErrors(cudaMemcpy(fk, d_fk, N1*N2*sizeof(complex<float>), cudaMemcpyDeviceToHost));

  complex<float> J = complex<float>(0,1)*(float)iflag;
  complex<float> ct = complex<float>(0,0);
  int m=0;
  int jt = M/2;          // check arbitrary choice of one targ pt
  for (int m2=-(N2/2); m2<=(N2-1)/2; ++m2)  // loop in correct order over F
    for (int m1=-(N1/2); m1<=(N1-1)/2; ++m1)
      ct += fk[m++] * exp(J*(m1*x[jt] + m2*y[jt]));   // crude direct
  printf("[gpu ] one targ: rel err in c[%ld] is %.3g\n",(int64_t)jt,abs(c[jt]-ct)/infnorm(M,c));


  /* Look at some other data */
  int cm1;
  printf("fk[0:%d]: ", N1);
  for(cm1=0; cm1<10; cm1++){
    printf("%f, ", (double)fk[cm1].real());
  };
  printf("\n");


  // Cleanup
  checkCudaErrors(cudaFreeHost(x));
  checkCudaErrors(cudaFreeHost(y));
  checkCudaErrors(cudaFreeHost(c));
  checkCudaErrors(cudaFreeHost(fk));
  checkCudaErrors(cudaFree(d_x));
  checkCudaErrors(cudaFree(d_y));
  checkCudaErrors(cudaFree(d_c));
  checkCudaErrors(cudaFree(d_fk));

  return 0;
}

% mkdir -p build && cd build
% cmake .. -DFINUFFT_USE_CUDA=on -DCMAKE_CUDA_ARCHITECTURES="86" -DCMAKE_BUILD_TYPE=relwithdebinfo
% make -j
% nvcc float_bug.cu -I../include -I../include/cufinufft/contrib/cuda_samples libcufinufft_static.a -lcuda -lcufft
% ./a.out
setup_spreader: warning, increasing tol=1e-07 to eps_mach=1.19e-07.
[gpu ] one targ: rel err in c[64] is 1.71e+03
fk[0:8]: 14.591318, 19.666874, 23.890945, 21.967796, 20.733913, 21.967796, 23.890951, 19.666851, 19.666849, 22.994013,
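
The "one targ" check in the source above sums `fk` against complex exponentials at one point and compares the result with `c[jt]`. That is the shape of a type-2 check, while the plan here is type 1 (where `c` is the input), so the large printed relative error may not say much about this transform. A minimal host-side sketch of a direct type-1 reference follows. It assumes the same mode ordering as the loop in the source (`m1` fastest, running from `-N1/2` to `(N1-1)/2`), and the function name `nudft2d1_direct` is hypothetical, not part of cufinufft.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Direct O(N1*N2*M) evaluation of the 2D type-1 sum
//   fk[k1,k2] = sum_j c[j] * exp(i * iflag * (k1*x[j] + k2*y[j]))
// Assumed ordering: k1 varies fastest, k1 in [-N1/2, (N1-1)/2],
// k2 in [-N2/2, (N2-1)/2]. Accumulates in double precision.
std::vector<std::complex<double>> nudft2d1_direct(
    int N1, int N2, int iflag,
    const std::vector<double>& x,
    const std::vector<double>& y,
    const std::vector<std::complex<double>>& c) {
  assert(x.size() == y.size() && x.size() == c.size());
  std::vector<std::complex<double>> fk(static_cast<std::size_t>(N1) * N2);
  std::size_t m = 0;
  for (int k2 = -(N2 / 2); k2 <= (N2 - 1) / 2; ++k2) {
    for (int k1 = -(N1 / 2); k1 <= (N1 - 1) / 2; ++k1) {
      std::complex<double> acc(0.0, 0.0);
      for (std::size_t j = 0; j < x.size(); ++j) {
        const double phase = iflag * (k1 * x[j] + k2 * y[j]);
        acc += c[j] * std::complex<double>(std::cos(phase), std::sin(phase));
      }
      fk[m++] = acc;
    }
  }
  return fk;
}
```

Feeding it `freqs1`, `freqs2` and unit strengths (converted to double) with `N1 = N2 = 8` and `iflag = 1` would give a reference array to compare against the `fk` copied back from the device, for example via the largest absolute difference divided by the largest reference magnitude.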

garrettwrong commented on August 30, 2024

Hi, using the finufft master code does seem to resolve the issue on our platform as well.

Not worried about the error. The issue was the junk results in floats (it was often yielding NaNs as well). Right now we have to cast everything to doubles to run with cufinufft. It will be nice to resolve that when finufft is released and we update things.
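
To put a number on "junk results in floats", the printed `fk` values from the three runs in this thread can be compared directly. The sketch below is only an illustration: the arrays are the ten real parts copied from the outputs above, and the helper names (`rel_max_err`, `count_nonfinite`) are made up for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Real parts of the first ten output modes, as printed in the thread.
const std::vector<double> fk_v13_float = {   // cufinufft v1.3, example_1_32
    -16466.515625, -6339.429688, 2939.932373, -18116.044922, -37109.199219,
    -18116.048828, 2939.932617, -6339.421387, -5856.102539, -11939.065430};
const std::vector<double> fk_v13_double = {  // cufinufft v1.3, example_1_64
    14.591308, 19.666844, 23.890922, 21.967780, 20.733888,
    21.967777, 23.890929, 19.666834, 19.666833, 22.993997};
const std::vector<double> fk_master_float = {  // finufft master, float build
    14.591318, 19.666874, 23.890945, 21.967796, 20.733913,
    21.967796, 23.890951, 19.666851, 19.666849, 22.994013};

// Number of NaN or Inf entries, a quick first sanity check on any output.
std::size_t count_nonfinite(const std::vector<double>& v) {
  return static_cast<std::size_t>(std::count_if(
      v.begin(), v.end(), [](double a) { return !std::isfinite(a); }));
}

// Largest absolute difference, scaled by the largest reference magnitude.
double rel_max_err(const std::vector<double>& test,
                   const std::vector<double>& ref) {
  assert(test.size() == ref.size() && !ref.empty());
  double num = 0.0, den = 0.0;
  for (std::size_t i = 0; i < ref.size(); ++i) {
    num = std::max(num, std::fabs(test[i] - ref[i]));
    den = std::max(den, std::fabs(ref[i]));
  }
  return num / den;
}
```

With these values, `rel_max_err(fk_v13_float, fk_v13_double)` comes out above a thousand, while `rel_max_err(fk_master_float, fk_v13_double)` is roughly 1e-6, which is about what single precision at the requested tolerance would be expected to give.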

@ahbarnett, it seems this can be resolved as fixed upstream in finufft. Thanks for following up. Closing.
