
Comments (17)

julmb avatar julmb commented on September 2, 2024

I looked into this a little today. My findings so far:

The function Numeric.AD.Internal.Reverse.Double.partials does not behave the same for the pure and the +ffi version. In my example, the pure version returns [Infinity,NaN,0.0], while the +ffi version returns [Infinity,0.0]. It is very concerning to me that these are not even the same length! It seems that the number of items returned by the pure version is calculated as part of the backPropagate function, while the +ffi version always returns pTape->variables many items.

The way these results are subsequently used in Numeric.AD.Internal.Reverse.Double.partialArrayOf pads them out with zeroes, so that the result of this function is array (0,2) [(0,Infinity),(1,NaN),(2,0.0)] for the pure version and array (0,2) [(0,Infinity),(1,0.0),(2,0.0)] for the +ffi version. It seems that the actual grad function then uses the first two items of this array.

I also looked at the tapes. The pure version produces the tape Head 5 (Unary 4 Infinity (Binary 3 1 1.0 0.0 (Unary 0 1.0 Nil))). The +ffi version produces a tape data structure that looks like this:

tape->idx 3
tape->offset 2
tape->size 4096
tape->variables 2
tape->val 0xa5b1b8
tape->lnk 0xa6b1b8
tape->prev (nil)
tape->val[0, 1] 1.000000 0.000000
tape->lnk[0, 1] 0 0
tape->val[2, 3] 1.000000 0.000000
tape->lnk[2, 3] 2 1
tape->val[4, 5] inf 0.000000
tape->lnk[4, 5] 3 0

I believe that this would roughly translate to Unary 3 Infinity (Binary 2 1 1.0 0.0 (Unary 0 1.0 Nil)). So it seems like the indexing is different here.

from ad.

julmb avatar julmb commented on September 2, 2024

After some more digging, the issue seems to be in this block of code:

ad/cbits/tape.c

Lines 90 to 105 in 8155b0e

double v = buffer[idx + pTape->offset];
if (v == 0.0) continue;
int i = pTape->lnk[idx*2];
double x = pTape->val[idx*2];
if (x != 0.0)
{
    buffer[i] += v*x;
}
int j = pTape->lnk[idx*2 + 1];
double y = pTape->val[idx*2 + 1];
if (y != 0.0)
{
    buffer[j] += v*y;
}

The checks if (x != 0.0) and if (y != 0.0) appear to be optimizations, guarding the increments buffer[i] += v*x; and buffer[j] += v*y; from being executed if the second factor is zero, as this would imply that the product is also zero and thus the increment operation would do nothing. However, this assumption does not hold in the presence of IEEE floating point special values, as for example Infinity * 0 = NaN, and thus the increment operation can indeed have an effect.

I suspect that the if (v == 0.0) continue; check could have a similar effect if buffer contains special values, but I have not tried to come up with an example for this yet.

I am not sure how to fix this issue, as I do not know why these optimizations were introduced or how much time they actually save. Removing all three of the conditionals makes the discrepancy go away, but there might be a performance penalty. Checking for NaN values as part of the conditionals should also work, but that seems error prone and the code would not be "obviously correct". Furthermore, that could make the check itself as expensive as the operation it intends to optimize away, so I do not know if that is a good idea.

cartazio avatar cartazio commented on September 2, 2024

I see what you mean, when would v have either infinity or nan as a value?

RyanGlScott avatar RyanGlScott commented on September 2, 2024

@sofusmortensen, as the author of the FFI code, do you have any insight into why these if (x != 0.0)/if (y != 0.0) checks are present?

julmb avatar julmb commented on September 2, 2024

> I see what you mean, when would v have either infinity or nan as a value?

This happens in the example in the original post of this issue, where buffer will contain Infinity as the partial derivative d sqrt (x * 1) / d x at x = 0.

julmb avatar julmb commented on September 2, 2024

It turns out this is even more complicated than I thought. While the pure implementation has separate constructors for unary and binary operation entries on the tape, the +ffi implementation encodes unary operations as binary operations with the second operand set to variable index 0 and a partial derivative of 0.0.

This means that the buffer entry for variable 0 can be inadvertently set to NaN by this "fake" binary entry. This was previously caught by the if (y != 0.0) check, although I am not sure this was deliberate, as the completely symmetrical if (x != 0.0) check serves no such purpose.

Thus, in the absence of the conditionals, the example works for x = [1, 0], but it does not work for x = [0, 1], giving a different (but still wrong) result of [NaN,NaN].

My intuition is that this encoding is fundamentally ambiguous and cannot be handled properly in the presence of IEEE floating point special values. Given an entry Binary 1 0 1.0 0.0, I see no way of knowing whether to treat it as a binary operation between variables 1 and 0, which would require multiplying by 0.0 to get the correct behavior for special values, or as a unary operation on variable 1, which would require ignoring the second part of the entry entirely to get the correct behavior.

RyanGlScott avatar RyanGlScott commented on September 2, 2024

> My intuition is that this encoding is fundamentally ambiguous and cannot be handled properly in the presence of IEEE floating point special values.

Alright. Do you think that this pitfall is worth advertising in the Haddocks somewhere?

julmb avatar julmb commented on September 2, 2024

Well, ideally, I think we should fix the problem so that all implementations yield the same results.

The variable indices in the C struct are of type int, so we could use -1 instead of 0 to indicate that an operand is not present (as is the case with a unary tape entry). It's not pretty, but the function tape_backPropagate is the only place that needs to take this into account, so it should be a fairly minor change. The index -1 can never conflict with a real variable, so we can then reliably check whether to incorporate a value into the calculation or not, rather than relying on the == 0.0 check.

Once that change is in place, the code could be adjusted to handle IEEE floating point special values correctly.

If that sounds like an acceptable way of dealing with this, I could probably create a pull request in the next few days (or a separate pull request for the refactoring and the bugfix, if that would be preferable).

RyanGlScott avatar RyanGlScott commented on September 2, 2024

OK. I trust your judgment on this, so I'd welcome a PR that changes the FFI implementation to use -1 as placeholder values for missing operands. (Also, feel free to submit a single PR that does the refactoring and bugfix in separate commits.)

julmb avatar julmb commented on September 2, 2024

I am still working on the pull request. In the meantime, I have found another example that causes an issue. The expression RD.grad (\ [x, y] -> sqrt x * sqrt y) [0, 0] evaluates to [0.0,0.0] when using +ffi. The correct result, which is yielded by the other modes, is [NaN,NaN].

This is due to the if (v == 0.0) continue; statement skipping the loop body when v = 0 and x = Infinity. That is not correct: v * x would have resulted in NaN, so the contribution cannot be skipped.

cartazio avatar cartazio commented on September 2, 2024

So this is another example where the guard should be the product?

cartazio avatar cartazio commented on September 2, 2024

How can I help you on this? Also this is wonderful sleuthing

julmb avatar julmb commented on September 2, 2024

> So this is another example where the guard should be the product?

Pretty much, although there are two products involved here (v*x and v*y).

After performing the previously suggested refactoring involving -1 indices, I tried removing both the if (x != 0.0) and if (y != 0.0) optimizations (A) as well as the if (v == 0.0) optimization (B). Removing both A and B resolves all the inconsistencies in the +ffi version of Reverse.Double that I have observed so far. Unfortunately, there are some performance penalties involved.

Removing optimization A had no measurable performance impact in the bench/BlackScholes.hs benchmark or in my application. Removing optimization B had no measurable performance impact in the bench/BlackScholes.hs benchmark, but it did slow down my application by a factor of almost 2 (calculating a Hessian went from 2.1s to 3.9s). It turns out that the if (v == 0.0) guard was able to skip 80% of the backpropagation steps (despite the resulting Hessian being a dense 56x56 matrix with no zero entries) and this had a significant impact on performance.

This is very unfortunate, because it means that if we want correctness, then about half the performance is lost in checking for a condition that should probably never arise in normal operation anyway (assuming that not many people would want to calculate gradients with NaN values in them).

As an aside, I suspect that this zero check could also be used to speed up the non-ffi version of Reverse.Double in a similar way, albeit with the same correctness issues in the presence of IEEE floating point special values.

> How can I help you on this? Also this is wonderful sleuthing

Thank you! I am not sure how to proceed from here, so any feedback would be appreciated. I will do some more thinking and probably additional tests tomorrow to see if anything can be done to achieve both correctness and good performance.

cartazio avatar cartazio commented on September 2, 2024

Did you measure with doing the if then with the equality test against the product? That way you preserve the tape sparsity? It's not the multiply that's expensive. It's writing down add zero onto the tape I think.

julmb avatar julmb commented on September 2, 2024

> Did you measure with doing the if then with the equality test against the product? That way you preserve the tape sparsity? It's not the multiply that's expensive. It's writing down add zero onto the tape I think.

You are right, checking for zeroness of v*x and v*y does make a big difference. I thought that since even with these checks, 4 out of the 6 memory accesses would still happen, there would not be much to gain. But it looks like the reads (which are mostly sequential, although backwards) are much cheaper than the writes. There is still a factor of 1.2 performance impact, though.

I will submit a pull request so we can discuss what should be done.

cartazio avatar cartazio commented on September 2, 2024

Awesome. I think you're measuring worst-case overhead rather than average-case, but either way, I look forward to the PR.
