
[Bug Report] sigmoid_cross_entropy_with_logits: automatic differentiation over the decomposed primitive ops disagrees with the backward kernel (paddle, CLOSED)

zeroRains commented on June 2, 2024


Comments (3)

zeroRains commented on June 2, 2024

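The earlier analysis had a flaw: the derivation ignored the gradients of std::abs and of T term1 = (x > 0) ? x : 0;, both of which are used in the forward computation. The forward formula is therefore revised as follows: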

$$ res = \operatorname{where}(x > 0,\, x,\, 0) - x \cdot label + \ln\left(1 + e^{-|x|}\right) \cdot posWeight $$

Differentiating this gives the backward gradient:

$$ \frac{\partial res}{\partial x} = \operatorname{where}(x > 0,\, 1,\, 0) - label + \frac{-e^{-|x|} \cdot \operatorname{where}(x \ge 0,\, 1,\, -1) \cdot posWeight}{1 + e^{-|x|}} $$

Here where(x>0, 1, 0) is the gradient of T term1 = (x > 0) ? x : 0; in the forward computation, and where(x>=0, 1, -1) is the gradient of std::abs.
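As a sanity check on the derivation above, the forward formula and its gradient can be transcribed into NumPy. This is a minimal sketch, not the Paddle kernel: the function names forward and backward are hypothetical, and the denominator uses 1 + e^{-|x|}, which is what differentiating ln(1 + e^{-|x|}) yields. With posWeight = 1 the gradient should reduce to sigmoid(x) - label away from x = 0, which the last lines compare.

```python
import numpy as np


def forward(x, label, pos_weight=1.0):
    """res = where(x > 0, x, 0) - x * label + ln(1 + e^{-|x|}) * pos_weight."""
    return (np.where(x > 0, x, 0.0) - x * label
            + np.log1p(np.exp(-np.abs(x))) * pos_weight)


def backward(x, label, pos_weight=1.0):
    """Element-wise d(res)/dx, written term by term as in the derivation."""
    term1_grad = np.where(x > 0, 1.0, 0.0)    # gradient of (x > 0) ? x : 0
    abs_grad = np.where(x >= 0, 1.0, -1.0)    # gradient of std::abs
    e = np.exp(-np.abs(x))
    return term1_grad - label + (-e * abs_grad * pos_weight) / (1.0 + e)


# Illustrative inputs (not taken from the unit test); x = 0 is avoided
# because both where(...) branches are non-differentiable there.
x = np.array([-2.0, -0.5, 0.5, 2.0])
label = np.array([0.0, 1.0, 0.0, 1.0])

sigmoid = 1.0 / (1.0 + np.exp(-x))
print(np.allclose(backward(x, label), sigmoid - label))
```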

The corresponding fix PR:


zeroRains commented on June 2, 2024

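The backward kernel's result is checked against a gradient computed numerically in numpy (see source: op_test.py#L148-L323), whereas the gradient of the decomposed operator is obtained by automatic differentiation and is then compared with the backward kernel's result. This suggests that the problem lies in the kernel's backward implementation. The verification is as follows: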

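In TestSigmoidCrossEntropyWithLogitsOp4, which runs the sigmoid_cross_entropy_with_logits op, the relative error tolerance is set to max_relative_error=0.005, which is fairly loose. With this value, the backward kernel on the current develop branch passes the unit test (although it passes, the two tensors are visibly different):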

W0515 05:27:49.445072 36810 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 12.4, Runtime API Version: 11.2
W0515 05:27:49.449699 36810 gpu_resources.cc:164] device: 0, cuDNN Version: 8.1.
numeric : 
[array([[ 3.77183495e-04, -2.20797240e-04, -3.14791652e-04, ...,
        -2.02644761e-04,  2.45672779e-04, -6.17090210e-06],
       [ 4.84435251e-04,  2.84202716e-04,  1.83931716e-05, ...,
         6.29521346e-04,  5.20318281e-04,  2.33842612e-05],
       [ 3.12574747e-04, -4.71084098e-04,  1.10182442e-04, ...,
         6.98864401e-04,  2.33956572e-04, -7.56920161e-05],
       ...,
       [ 7.15934865e-04, -3.74937504e-04,  3.26225586e-04, ...,
         3.84216391e-05, -5.20641936e-04, -4.17575856e-04],
       [ 1.96946960e-05,  3.88698082e-04, -2.81023718e-04, ...,
        -5.38852117e-05,  3.67850861e-04, -1.84393860e-04],
       [-9.46350590e-05,  1.44749951e-05, -2.59066396e-04, ...,
         5.43415898e-04,  5.17161748e-05,  5.20940836e-04]])]
analytic_grads : 
[array([[ 1.84299020e-04, -2.20797405e-04, -3.14791844e-04, ...,
        -2.02644764e-04,  1.03724500e-04, -3.28235993e-04],
       [ 1.30288861e-04, -4.92659295e-04, -9.86970736e-05, ...,
         3.65435473e-04,  4.38155145e-04, -7.09606712e-04],
       [-1.62492418e-04, -4.71084140e-04,  1.10182313e-04, ...,
         1.25990168e-04,  1.87285167e-04, -5.22634377e-04],
       ...,
       [-3.11346327e-05, -3.74937645e-04,  3.26225574e-04, ...,
        -3.02651920e-04, -5.20642019e-04, -4.17576055e-04],
       [-3.08952871e-04,  2.82421633e-04, -2.81023766e-04, ...,
        -5.38854093e-05,  6.31943427e-05, -1.84394142e-04],
       [-1.67904546e-04, -1.19940036e-05, -2.59066405e-04, ...,
         3.36552688e-04,  2.25882243e-05, -9.09629301e-05]])]
max_relative_error : 
0.005
.
----------------------------------------------------------------------
Ran 1 test in 2.453s

OK

However, when I tighten the tolerance to max_relative_error=0.0005, I get the following result.

I0515 06:12:58.692179 17707 program_interpreter.cc:221] New Executor is Running.
I0515 06:12:58.693336 17707 interpreter_util.cc:652] Standalone Executor is Used.
numeric : 
[array([[ 3.77183495e-04, -2.20797240e-04, -3.14791652e-04, ...,
        -2.02644761e-04,  2.45672779e-04, -6.17090210e-06],
       [ 4.84435251e-04,  2.84202716e-04,  1.83931716e-05, ...,
         6.29521346e-04,  5.20318281e-04,  2.33842612e-05],
       [ 3.12574747e-04, -4.71084098e-04,  1.10182442e-04, ...,
         6.98864401e-04,  2.33956572e-04, -7.56920161e-05],
       ...,
       [ 7.15934865e-04, -3.74937504e-04,  3.26225586e-04, ...,
         3.84216391e-05, -5.20641936e-04, -4.17575856e-04],
       [ 1.96946960e-05,  3.88698082e-04, -2.81023718e-04, ...,
        -5.38852117e-05,  3.67850861e-04, -1.84393860e-04],
       [-9.46350590e-05,  1.44749951e-05, -2.59066396e-04, ...,
         5.43415898e-04,  5.17161748e-05,  5.20940836e-04]])]
analytic_grads : 
[array([[ 1.84299020e-04, -2.20797405e-04, -3.14791844e-04, ...,
        -2.02644764e-04,  1.03724500e-04, -3.28235993e-04],
       [ 1.30288861e-04, -4.92659295e-04, -9.86970736e-05, ...,
         3.65435473e-04,  4.38155145e-04, -7.09606712e-04],
       [-1.62492418e-04, -4.71084140e-04,  1.10182313e-04, ...,
         1.25990168e-04,  1.87285167e-04, -5.22634377e-04],
       ...,
       [-3.11346327e-05, -3.74937645e-04,  3.26225574e-04, ...,
        -3.02651920e-04, -5.20642019e-04, -4.17576055e-04],
       [-3.08952871e-04,  2.82421633e-04, -2.81023766e-04, ...,
        -5.38854093e-05,  6.31943427e-05, -1.84394142e-04],
       [-1.67904546e-04, -1.19940036e-05, -2.59066405e-04, ...,
         3.36552688e-04,  2.25882243e-05, -9.09629301e-05]])]
max_relative_error : 
0.0005
F
======================================================================
FAIL: test_check_grad (__main__.TestSigmoidCrossEntropyWithLogitsOp4)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/paddle/test/deprecated/legacy_test/test_sigmoid_cross_entropy_with_logits_op.py", line 178, in test_check_grad
    self.check_grad(['X'], 'Out', check_pir=True)
  File "/paddle/build/test/legacy_test/op_test.py", line 2986, in check_grad
    self.check_grad_with_place(
  File "/paddle/build/test/legacy_test/op_test.py", line 3298, in check_grad_with_place
    numeric_grads = self.check_grad_with_place_for_static(
  File "/paddle/build/test/legacy_test/op_test.py", line 3089, in check_grad_with_place_for_static
    self._assert_is_close(
  File "/paddle/build/test/legacy_test/op_test.py", line 2942, in _assert_is_close
    self.assertLessEqual(max_diff, max_relative_error, err_msg())
AssertionError: 0.0007811970012982192 not less than or equal to 0.0005 : Operator sigmoid_cross_entropy_with_logits error, Gradient Check On Place(cpu) variable X (shape: (64, 20), dtype: float64) max gradient diff 7.811970e-04 over limit 5.000000e-04, the first error element is 3, expected 5.481218e-04, but got 2.099690e-05.

----------------------------------------------------------------------
Ran 1 test in 0.521s

FAILED (failures=1)

So it can be inferred that the loose tolerance is what kept the incorrect backward computation from being exposed.
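The effect of the threshold can be reproduced on the single element reported in the failure message above (expected 5.481218e-04, got 2.099690e-05). The sketch below assumes a metric that divides by |expected| except where that value falls below a small floor, in which case the absolute difference is used. The helper name max_relative_diff and the floor of 1e-3 are assumptions for illustration, not a copy of op_test.py.

```python
import numpy as np


def max_relative_diff(expected, actual, floor=1e-3):
    """Largest |expected - actual| / |expected|, using a denominator of 1
    wherever |expected| < floor (hypothetical metric, see lead-in)."""
    expected = np.asarray(expected, dtype=np.float64)
    actual = np.asarray(actual, dtype=np.float64)
    denom = np.abs(expected)          # fresh array, safe to modify in place
    denom[denom < floor] = 1.0
    return float(np.max(np.abs(expected - actual) / denom))


# The first failing element quoted in the assertion message.
expected = np.array([5.481218e-04])
actual = np.array([2.099690e-05])

diff = max_relative_diff(expected, actual)
print(diff <= 0.005)    # loose tolerance
print(diff <= 0.0005)   # tightened tolerance
```

Because the gradients here are on the order of 1e-4, an error of the same magnitude as the value itself still sits well under 0.005, which is consistent with the test passing at the original setting.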

With the fix PR applied and max_relative_error=0.0005, the computed result still matches the numeric gradient within tolerance, as shown below:

W0515 06:13:53.214535 18318 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 12.4, Runtime API Version: 11.2
W0515 06:13:53.220155 18318 gpu_resources.cc:164] device: 0, cuDNN Version: 8.1.
numeric : 
[array([[ 3.77183495e-04, -2.20797240e-04, -3.14791652e-04, ...,
        -2.02644761e-04,  2.45672779e-04, -6.17090210e-06],
       [ 4.84435251e-04,  2.84202716e-04,  1.83931716e-05, ...,
         6.29521346e-04,  5.20318281e-04,  2.33842612e-05],
       [ 3.12574747e-04, -4.71084098e-04,  1.10182442e-04, ...,
         6.98864401e-04,  2.33956572e-04, -7.56920161e-05],
       ...,
       [ 7.15934865e-04, -3.74937504e-04,  3.26225586e-04, ...,
         3.84216391e-05, -5.20641936e-04, -4.17575856e-04],
       [ 1.96946960e-05,  3.88698082e-04, -2.81023718e-04, ...,
        -5.38852117e-05,  3.67850861e-04, -1.84393860e-04],
       [-9.46350590e-05,  1.44749951e-05, -2.59066396e-04, ...,
         5.43415898e-04,  5.17161748e-05,  5.20940836e-04]])]
analytic_grads : 
[array([[ 3.77183699e-04, -2.20797405e-04, -3.14791844e-04, ...,
        -2.02644764e-04,  2.45672821e-04, -6.17087178e-06],
       [ 4.84435362e-04,  2.84202718e-04,  1.83934196e-05, ...,
         6.29521507e-04,  5.20318417e-04,  2.33842613e-05],
       [ 3.12574821e-04, -4.71084140e-04,  1.10182313e-04, ...,
         6.98864469e-04,  2.33956823e-04, -7.56920088e-05],
       ...,
       [ 7.15934877e-04, -3.74937645e-04,  3.26225574e-04, ...,
         3.84217363e-05, -5.20642019e-04, -4.17576055e-04],
       [ 1.96948667e-05,  3.88698345e-04, -2.81023766e-04, ...,
        -5.38854093e-05,  3.67850951e-04, -1.84394142e-04],
       [-9.46347782e-05,  1.44751433e-05, -2.59066405e-04, ...,
         5.43416127e-04,  5.17164533e-05,  5.20940836e-04]])]
max_relative_error : 
0.0005
.
----------------------------------------------------------------------
Ran 1 test in 2.753s

OK
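The "numeric" arrays in the logs above come from a numerical gradient check. A minimal sketch of such a check with central differences is shown below. It assumes the scalar objective is the mean of the element-wise loss and uses a hypothetical helper numeric_grad; the reference implementation in op_test.py may differ in step size and reduction.

```python
import numpy as np


def forward(x, label, pos_weight=1.0):
    """Element-wise loss from the revised forward formula."""
    return (np.where(x > 0, x, 0.0) - x * label
            + np.log1p(np.exp(-np.abs(x))) * pos_weight)


def numeric_grad(scalar_fn, x, delta=1e-6):
    """Central-difference gradient of scalar_fn with respect to each x[idx]."""
    x = np.array(x, dtype=np.float64)     # private copy; caller's x is untouched
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        orig = x[idx]
        x[idx] = orig + delta
        up = scalar_fn(x)
        x[idx] = orig - delta
        down = scalar_fn(x)
        x[idx] = orig                     # restore before the next element
        grad[idx] = (up - down) / (2.0 * delta)
    return grad


# Illustrative data only; shape and seed are not taken from the unit test.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(4, 5))
label = rng.integers(0, 2, size=(4, 5)).astype(np.float64)

grad = numeric_grad(lambda v: forward(v, label).mean(), x)
print(grad.shape)
```

Comparing this output with an analytic gradient, element by element, is the kind of comparison the "numeric" and "analytic_grads" arrays in the logs represent.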


zeroRains commented on June 2, 2024

The bug has been fixed; see the PR for details.
