Comments (22)

albapa commented on August 20, 2024

We've studied this quite a bit with @Sideboard and @jameskermode, and our conclusion was that the linear system is so ill-conditioned that even the smallest perturbation will change the weights quite considerably. The good thing is that the physics of the model does not seem to be sensitive. Our idea for a test was a Delta-type test, and those work very reliably.
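
(To get a feel for the scale of the effect, here is a minimal, hypothetical illustration using a Hilbert matrix, a standard example of an ill-conditioned system. The dimensions and perturbation size are arbitrary, and this is not the actual GAP fit.)

import numpy as np

# Hypothetical illustration, not the GAP fit: a Hilbert matrix is a
# classic ill-conditioned system. A ~1e-10 perturbation of the
# right-hand side changes the solution weights by a much larger amount.
n = 12
A = np.array([[1.0 / (i + j + 1) for j in range(n)] for i in range(n)])
b = A @ np.ones(n)

x = np.linalg.solve(A, b)
x_pert = np.linalg.solve(A, b + 1e-10 * np.random.default_rng(0).standard_normal(n))

print("condition number:", np.linalg.cond(A))
print("max change in weights:", np.max(np.abs(x - x_pert)))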

albapa commented on August 20, 2024

And even for differences way larger than 1e-8, the model is close to equivalent.

bernstei commented on August 20, 2024

Why is there any change? It should be bit-exact. My problem is the usual one: I use the GAP to do minimizations, make another GAP from those, etc., and eventually it's different. I agree that the physics isn't different, but it makes it harder to write a unit test. I can work around it, but why is it even an issue?

albapa commented on August 20, 2024

Can you change the test to something that uses np.isclose() or equivalent?
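
(Something along these lines, for example - a sketch only; the file names and tolerance are placeholders.)

import numpy as np

# Sketch of a tolerance-based check instead of a bit-exact comparison.
# "ref_energies.txt" / "new_energies.txt" are placeholder file names.
ref = np.loadtxt("ref_energies.txt")
new = np.loadtxt("new_energies.txt")

assert np.allclose(new, ref, rtol=0.0, atol=1e-5), \
    f"max deviation {np.max(np.abs(new - ref)):.3e} exceeds 1e-5"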

bernstei commented on August 20, 2024

Is the seed being applied only to the sparse points but not the jitter, or something?

albapa commented on August 20, 2024

> Why is there any change? It should be bit-exact. My problem is the usual one: I use the GAP to do minimizations, make another GAP from those, etc., and eventually it's different. I agree that the physics isn't different, but it makes it harder to write a unit test. I can work around it, but why is it even an issue?

I remember a discussion on precisely this on Slack - the conclusion was that floating-point arithmetic isn't exact, that's why.

bernstei commented on August 20, 2024

> I remember a discussion on precisely this on Slack - the conclusion was that floating-point arithmetic isn't exact, that's why.

It's not exact, but it is deterministic (in the absence of OpenMP anyway - I guess I should check that it's disabled for this test, but I think it is).

albapa commented on August 20, 2024

The jitter is a simple constant, although that's somewhat counterintuitive. If the sparse points are the same (and they should be) then it's down to the way the covariance matrix is built, and then the solution of the linear system.
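
(For readers following along: schematically, the jitter is a constant added to the diagonal of the covariance matrix before the solve. A sketch with made-up data, not the gap_fit internals.)

import numpy as np

# Schematic sketch, not the gap_fit internals: a constant jitter on the
# diagonal regularises an otherwise ill-conditioned covariance solve.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))                       # stand-in sparse points
K = np.exp(-np.sum((X[:, None] - X[None, :])**2, -1))  # squared-exponential kernel
y = rng.standard_normal(50)                            # stand-in targets

jitter = 1e-8
alpha = np.linalg.solve(K + jitter * np.eye(len(K)), y)
print("cond(K):           ", np.linalg.cond(K))
print("cond(K + jitter*I):", np.linalg.cond(K + jitter * np.eye(len(K))))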

albapa commented on August 20, 2024

Are the array operations in Fortran deterministic? Like addition, multiplication, sums, etc.? We make extensive use of those.

But @Sideboard built in a series of prints that can dump all kinds of intermediate matrices, so it is possible to check where things go awry.

bernstei commented on August 20, 2024

I believe that without OpenMP they are deterministic, but I can investigate this. If you could tell me the interface to the intermediate print stuff (just verbosity?) it'd save me the time to add exactly that.
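
(For reference, what I mean by disabling OpenMP is pinning it to one thread in the environment of whatever launches gap_fit. A sketch - the gap_fit arguments below are placeholders, not a complete command line.)

import os
import subprocess

# Sketch: pin OpenMP to a single thread for the fit. The gap_fit
# arguments are placeholders, not a working command line.
env = dict(os.environ, OMP_NUM_THREADS="1")
subprocess.run(["gap_fit", "at_file=train.xyz", "gp_file=gap.xml"],
               env=env, check=True)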

bernstei commented on August 20, 2024

The alphas are similar, so I assume that the sparse points are the same.

bernstei commented on August 20, 2024

Interesting - it looks like it may be an OpenMP thing, because when I run it twice manually it is exactly reproducible. If I convince myself of this, I'll close this issue.

gabor1 commented on August 20, 2024

Is this true also for turbosoap?

bernstei commented on August 20, 2024

Don't know yet. But I currently think it's a false positive, and my fault for not fully disabling OpenMP.

gabor1 commented on August 20, 2024

For unit testing you could make the design matrix small enough, and the environments distinct enough, that the condition number is not huge - then you'd get better reproducibility.

bernstei commented on August 20, 2024

It's already not bad, even with OpenMP - for the first 2 iterations I compare the predicted energies and they are within 1e-5. It's only that it builds on itself, and by the 3rd iteration the RSS minima are different, and then it falls apart. But as I said, I'm pretty sure I just failed to disable OpenMP everywhere (specifically in the run generating the reference data, which is eventually used by the final pytest, where OpenMP was disabled).

[added] The design matrix is already small, just so the fits are fast.

bernstei commented on August 20, 2024

Looks like it's a subtle np.dot() issue, perhaps a bug. Apparently I get a slightly different dot product, therefore a different config energy_sigma, therefore a different GAP, and it all builds from there.

bernstei commented on August 20, 2024

It's not a determinism issue, apparently - just different results from np.dot(v1, v2) and np.sum(v1 * v2) on different machines (a compute node generating the reference data and the head node running the final pytest).

bernstei commented on August 20, 2024

Here's a script that shows the problem. On our older CPUs it prints 0.0, but on newer CPUs (with OpenBLAS, set to 1 thread) it prints a number of order 1e-16. The head node running pytest is new enough to give the 1e-16, while the node generating reference data was old enough to give 0.0.

import numpy as np

# Two mathematically identical ways to compute the same dot product.
p = np.asarray([16.90513420661038423986610723659396171569824218750000,
                0.50000000000000000000000000000000000000000000000000,
                -1.71583683999999991875995419832179322838783264160156])
v = np.asarray([0.01628648030646704172874628113731887424364686012268,
                -0.17188353256547875269610869963798904791474342346191,
                -0.98498264035059979182307188239064998924732208251953])

val_dot = np.dot(v, p)   # dispatched to the BLAS dot kernel
val_sum = np.sum(v * p)  # elementwise multiply, then numpy's own reduction

np.show_config()         # shows which BLAS numpy is linked against
print("dot - sum", val_dot - val_sum)
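
(If anyone needs a reduction that is identical on every machine, one option is math.fsum, which returns the correctly rounded sum regardless of which kernel numpy dispatches to. A sketch using the same vectors:)

import math
import numpy as np

# Sketch: math.fsum gives the correctly rounded sum of the elementwise
# products, so the result does not depend on CPU or BLAS kernel.
p = np.asarray([16.90513420661038423986610723659396171569824218750000,
                0.50000000000000000000000000000000000000000000000000,
                -1.71583683999999991875995419832179322838783264160156])
v = np.asarray([0.01628648030646704172874628113731887424364686012268,
                -0.17188353256547875269610869963798904791474342346191,
                -0.98498264035059979182307188239064998924732208251953])

print("fsum dot:", math.fsum(v * p))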

bernstei commented on August 20, 2024

I opened OpenMathLib/OpenBLAS#3583 for this. Once I'm 100% sure it's this, I'll close this issue, but we should all be careful with numpy, which uses OpenBLAS by default.

Sideboard commented on August 20, 2024

I remember having different results for identical Turbomole runs once (in the last digits). The reason was, as far as I remember, different math libraries on the nodes. I suppose you use a cluster with uniform nodes, so that shouldn't be a problem.

For testing gap_fit I also encountered the problem that even OpenMP runs with identical options (threads, chunk size) are non-deterministic, since floating-point addition is not associative and threads may finish faster or slower. This can be mitigated by an explicit ordered reduction. I have a branch where I tested it in one spot. It works but makes the code more verbose, so the question is whether it's worth the effort. The restriction to identical OpenMP options would still apply.
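
(The non-associativity is easy to demonstrate:)

# Floating-point addition is commutative but not associative, so the
# order in which partial sums are combined can change the result.
a, b, c = 1e16, -1e16, 1.0
print((a + b) + c)  # 1.0
print(a + (b + c))  # 0.0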

bernstei commented on August 20, 2024

Sorry, I should have closed this. Serial gap_fit is deterministic. I agree that it may be possible to make OpenMP and/or MPI deterministic as well (I know I did something like that a few years ago for CP2K with ScaLAPACK or PBLAS or something), but I don't think it's worth the effort for what I need. My real problem is that np.dot is architecture-dependent when using OpenBLAS (at least the version provided by conda), because on AVX512 machines it uses a kernel that is slightly more accurate than the default. That was affecting my GAP multistage fit script, and therefore giving slightly different potentials.
