
Can't kill sa-solver. about silentarmy HOT 16 CLOSED

mbevand avatar mbevand commented on July 20, 2024
Can't kill sa-solver.

from silentarmy.

Comments (16)

mbevand avatar mbevand commented on July 20, 2024

Can you run "sa-solver --nonces 1000000000" on one or more GPUs, let it run for a few hours, and tell me whether it crashes, and with what error message? It seems to be an issue with your hardware.

 avatar commented on July 20, 2024

@mbevand I ran your command:

sa-solver --nonces 100000000 --use 5

I selected the sixth GPU (--use 5) because that card was the first one to stop working.

It worked for 20-25 minutes and then just stopped. The process simply hangs, with no segmentation fault or other error; it looks like it is waiting for something.

Example:

Nonce 261b000000000000000000000000000000000000000000000000000000000000: 5 sols
Nonce 271b000000000000000000000000000000000000000000000000000000000000: 0 sols
Nonce 281b000000000000000000000000000000000000000000000000000000000000: 1 sol
Nonce 291b000000000000000000000000000000000000000000000000000000000000: 2 sols
Nonce 2a1b000000000000000000000000000000000000000000000000000000000000: 4 sols
Nonce 2b1b000000000000000000000000000000000000000000000000000000000000: 5 sols
Nonce 2c1b000000000000000000000000000000000000000000000000000000000000: 1 sol
Nonce 2d1b000000000000000000000000000000000000000000000000000000000000: 0 sols
Nonce 2e1b000000000000000000000000000000000000000000000000000000000000: 3 sols
Nonce 2f1b000000000000000000000000000000000000000000000000000000000000: 1 sol
Nonce 301b000000000000000000000000000000000000000000000000000000000000: 2 sols
Nonce 311b000000000000000000000000000000000000000000000000000000000000: 1 sol
Nonce 321b000000000000000000000000000000000000000000000000000000000000: 2 sols
Nonce 331b000000000000000000000000000000000000000000000000000000000000: 0 sols
Nonce 341b000000000000000000000000000000000000000000000000000000000000: 3 sols
Nonce 351b000000000000000000000000000000000000000000000000000000000000: 2 sols
Nonce 361b000000000000000000000000000000000000000000000000000000000000: 1 sol
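When a run stalls like this, it can help to record how far the solver got before the output stopped. The following is a minimal sketch, not part of silentarmy: it assumes the output has been saved as text lines in the "Nonce <hex>: N sols" format shown above, and the names NONCE_LINE and summarize are illustrative.

```python
import re

# Matches result lines such as "Nonce 261b00...00: 5 sols" or "...: 1 sol".
NONCE_LINE = re.compile(r"^Nonce ([0-9a-f]+): (\d+) sols?$")

def summarize(lines):
    """Return (nonces_processed, total_solutions, last_nonce) for solver output."""
    count = 0
    total = 0
    last = None
    for line in lines:
        m = NONCE_LINE.match(line.strip())
        if m is None:
            continue  # ignore anything that is not a per-nonce result line
        count += 1
        total += int(m.group(2))
        last = m.group(1)
    return count, total, last

# The last three lines of the output quoted above.
sample = [
    "Nonce 341b000000000000000000000000000000000000000000000000000000000000: 3 sols",
    "Nonce 351b000000000000000000000000000000000000000000000000000000000000: 2 sols",
    "Nonce 361b000000000000000000000000000000000000000000000000000000000000: 1 sol",
]
count, total, last = summarize(sample)
print(count, total, last[:4])  # prints: 3 6 361b
```

The last nonce printed before the stall gives a rough point of comparison between runs on different cards.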

mbevand avatar mbevand commented on July 20, 2024

Yeah this really looks like a hardware hang. Can you show me the output of "dmesg" when it hangs?

 avatar commented on July 20, 2024

@mbevand dmesg shows nothing about that.

Full dmesg log:

https://gist.github.com/ddctd143/2930af337b6eb98e1dad583b95f96d45

Last lines of the dmesg output:

[   14.645697] audit: type=1400 audit(1478386725.379:2): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/sbin/dhclient" pid=741 comm="apparmor_parser"
[   14.645704] audit: type=1400 audit(1478386725.379:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/NetworkManager/nm-dhcp-client.action" pid=741 comm="apparmor_parser"
[   14.645707] audit: type=1400 audit(1478386725.379:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/NetworkManager/nm-dhcp-helper" pid=741 comm="apparmor_parser"
[   14.645711] audit: type=1400 audit(1478386725.379:5): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/connman/scripts/dhclient-script" pid=741 comm="apparmor_parser"
[   14.706941] audit: type=1400 audit(1478386725.439:6): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/lightdm/lightdm-guest-session" pid=740 comm="apparmor_parser"
[   14.706949] audit: type=1400 audit(1478386725.439:7): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/lightdm/lightdm-guest-session//chromium" pid=740 comm="apparmor_parser"
[   14.727493] audit: type=1400 audit(1478386725.459:8): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/bin/ubuntu-core-launcher" pid=744 comm="apparmor_parser"
[   14.773709] audit: type=1400 audit(1478386725.507:9): apparmor="STATUS" operation="profile_load" profile="unconfined" name="webbrowser-app" pid=745 comm="apparmor_parser"
[   14.773717] audit: type=1400 audit(1478386725.507:10): apparmor="STATUS" operation="profile_load" profile="unconfined" name="webbrowser-app//oxide_helper" pid=745 comm="apparmor_parser"
[   14.799727] audit: type=1400 audit(1478386725.531:11): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/sbin/cups-browsed" pid=746 comm="apparmor_parser"
[   18.489534] IPv6: ADDRCONF(NETDEV_UP): enp7s0: link is not ready
[   18.592354] r8169 0000:07:00.0 enp7s0: link down
[   18.592374] r8169 0000:07:00.0 enp7s0: link down
[   18.592446] IPv6: ADDRCONF(NETDEV_UP): enp7s0: link is not ready
[   20.235716] r8169 0000:07:00.0 enp7s0: link up
[   20.235724] IPv6: ADDRCONF(NETDEV_CHANGE): enp7s0: link becomes ready

In total I have 3 rigs, all with the same configuration:

each has 6 Sapphire RX 480 4 GB Nitro+ OC GPUs.

Only one rig has this problem.
All rigs also run the same Ubuntu version and the same driver.

 avatar commented on July 20, 2024

Also, after I kill the silentarmy screen session, I can't list devices anymore.

This command just hangs:

xxx@worker1:~/mining/zcash/silentarmy/default$ ./sa-solver --list

and

clinfo

also just hangs.

But when silentarmy has been killed and I then try to run sa-solver with --nonces, I get this dmesg output:

[ 1079.910237] INFO: task sa-solver:2290 blocked for more than 120 seconds.
[ 1079.910262]       Tainted: G           OE   4.4.0-45-generic #66-Ubuntu
[ 1079.910280] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1079.910301] sa-solver       D ffff88029323fb38     0  2290      1 0x00000002
[ 1079.910304]  ffff88029323fb38 ffff88029323fbb8 ffff8802bd18e600 ffff8802bd5b5940
[ 1079.910306]  ffff880293240000 ffffffffc004e004 ffff8802bd5b5940 00000000ffffffff
[ 1079.910307]  ffffffffc004e008 ffff88029323fb50 ffffffff8182d7c5 ffffffffc004e000
[ 1079.910309] Call Trace:
[ 1079.910316]  [<ffffffff8182d7c5>] schedule+0x35/0x80
[ 1079.910319]  [<ffffffff8182da6e>] schedule_preempt_disabled+0xe/0x10
[ 1079.910321]  [<ffffffff8182f6a9>] __mutex_lock_slowpath+0xb9/0x130
[ 1079.910322]  [<ffffffff8182f73f>] mutex_lock+0x1f/0x30
[ 1079.910336]  [<ffffffffc0013bc8>] drm_release+0x38/0x500 [drm]
[ 1079.910340]  [<ffffffff8120f594>] __fput+0xe4/0x220
[ 1079.910342]  [<ffffffff8120f70e>] ____fput+0xe/0x10
[ 1079.910344]  [<ffffffff8109ed41>] task_work_run+0x81/0xa0
[ 1079.910347]  [<ffffffff81083e51>] do_exit+0x2e1/0xb00
[ 1079.910349]  [<ffffffff815ad00d>] ? fence_wait_timeout+0x7d/0x160
[ 1079.910351]  [<ffffffff810846f3>] do_group_exit+0x43/0xb0
[ 1079.910353]  [<ffffffff810908c2>] get_signal+0x292/0x600
[ 1079.910356]  [<ffffffff8102e537>] do_signal+0x37/0x6f0
[ 1079.910385]  [<ffffffffc0181071>] ? amdgpu_drm_ioctl+0x71/0x80 [amdgpu]
[ 1079.910387]  [<ffffffff8122124f>] ? do_vfs_ioctl+0x29f/0x490
[ 1079.910390]  [<ffffffff8100320c>] exit_to_usermode_loop+0x8c/0xd0
[ 1079.910392]  [<ffffffff81003c6e>] syscall_return_slowpath+0x4e/0x60
[ 1079.910394]  [<ffffffff81831a10>] int_ret_from_sys_call+0x25/0x8f
[ 1079.910395] INFO: task sa-solver:2270 blocked for more than 120 seconds.
[ 1079.910414]       Tainted: G           OE   4.4.0-45-generic #66-Ubuntu
[ 1079.910431] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1079.910452] sa-solver       D ffff8802c03afb38     0  2270      1 0x00000006
[ 1079.910454]  ffff8802c03afb38 ffff8802c03afbb8 ffff8802bd18e600 ffff8802bd5372c0
[ 1079.910455]  ffff8802c03b0000 ffffffffc004e004 ffff8802bd5372c0 00000000ffffffff
[ 1079.910457]  ffffffffc004e008 ffff8802c03afb50 ffffffff8182d7c5 ffffffffc004e000
[ 1079.910459] Call Trace:
[ 1079.910461]  [<ffffffff8182d7c5>] schedule+0x35/0x80
[ 1079.910463]  [<ffffffff8182da6e>] schedule_preempt_disabled+0xe/0x10
[ 1079.910464]  [<ffffffff8182f6a9>] __mutex_lock_slowpath+0xb9/0x130
[ 1079.910466]  [<ffffffff8182f73f>] mutex_lock+0x1f/0x30
[ 1079.910475]  [<ffffffffc0013bc8>] drm_release+0x38/0x500 [drm]
[ 1079.910477]  [<ffffffff8120f594>] __fput+0xe4/0x220
[ 1079.910479]  [<ffffffff8120f70e>] ____fput+0xe/0x10
[ 1079.910481]  [<ffffffff8109ed41>] task_work_run+0x81/0xa0
[ 1079.910483]  [<ffffffff81083e51>] do_exit+0x2e1/0xb00
[ 1079.910485]  [<ffffffff810ffe1c>] ? __unqueue_futex+0x2c/0x60
[ 1079.910487]  [<ffffffff8110093e>] ? futex_wait+0x16e/0x280
[ 1079.910489]  [<ffffffff810846f3>] do_group_exit+0x43/0xb0
[ 1079.910491]  [<ffffffff810908c2>] get_signal+0x292/0x600
[ 1079.910493]  [<ffffffff8102e537>] do_signal+0x37/0x6f0
[ 1079.910495]  [<ffffffff810be235>] ? pick_next_task_fair+0x335/0x4f0
[ 1079.910497]  [<ffffffff81037e19>] ? sched_clock+0x9/0x10
[ 1079.910498]  [<ffffffff8100320c>] exit_to_usermode_loop+0x8c/0xd0
[ 1079.910500]  [<ffffffff81003c6e>] syscall_return_slowpath+0x4e/0x60
[ 1079.910502]  [<ffffffff81831a10>] int_ret_from_sys_call+0x25/0x8f
[ 1079.910503] INFO: task sa-solver:2271 blocked for more than 120 seconds.
[ 1079.910521]       Tainted: G           OE   4.4.0-45-generic #66-Ubuntu
[ 1079.910538] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1079.910558] sa-solver       D ffff8802bc857b38     0  2271      1 0x00000006
[ 1079.910560]  ffff8802bc857b38 ffff8802bc857bb8 ffff8802bd18e600 ffff8802bd188cc0
[ 1079.910561]  ffff8802bc858000 ffffffffc004e004 ffff8802bd188cc0 00000000ffffffff
[ 1079.910563]  ffffffffc004e008 ffff8802bc857b50 ffffffff8182d7c5 ffffffffc004e000
[ 1079.910564] Call Trace:
[ 1079.910566]  [<ffffffff8182d7c5>] schedule+0x35/0x80
[ 1079.910568]  [<ffffffff8182da6e>] schedule_preempt_disabled+0xe/0x10
[ 1079.910569]  [<ffffffff8182f6a9>] __mutex_lock_slowpath+0xb9/0x130
[ 1079.910571]  [<ffffffff8182f73f>] mutex_lock+0x1f/0x30
[ 1079.910577]  [<ffffffffc0013bc8>] drm_release+0x38/0x500 [drm]
[ 1079.910579]  [<ffffffff8120f594>] __fput+0xe4/0x220
[ 1079.910581]  [<ffffffff8120f70e>] ____fput+0xe/0x10
[ 1079.910583]  [<ffffffff8109ed41>] task_work_run+0x81/0xa0
[ 1079.910584]  [<ffffffff81083e51>] do_exit+0x2e1/0xb00
[ 1079.910586]  [<ffffffff810ffe1c>] ? __unqueue_futex+0x2c/0x60
[ 1079.910588]  [<ffffffff8110093e>] ? futex_wait+0x16e/0x280
[ 1079.910589]  [<ffffffff810846f3>] do_group_exit+0x43/0xb0
[ 1079.910591]  [<ffffffff810908c2>] get_signal+0x292/0x600
[ 1079.910593]  [<ffffffff8102e537>] do_signal+0x37/0x6f0
[ 1079.910594]  [<ffffffff810be235>] ? pick_next_task_fair+0x335/0x4f0
[ 1079.910596]  [<ffffffff81037e19>] ? sched_clock+0x9/0x10
[ 1079.910597]  [<ffffffff8100320c>] exit_to_usermode_loop+0x8c/0xd0
[ 1079.910599]  [<ffffffff81003c6e>] syscall_return_slowpath+0x4e/0x60
[ 1079.910600]  [<ffffffff81831a10>] int_ret_from_sys_call+0x25/0x8f
[ 1079.910601] INFO: task sa-solver:2272 blocked for more than 120 seconds.
[ 1079.910619]       Tainted: G           OE   4.4.0-45-generic #66-Ubuntu
[ 1079.910636] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1079.910656] sa-solver       D ffff8802c3117b38     0  2272      1 0x00000006
[ 1079.910658]  ffff8802c3117b38 ffff8802c3117bb8 ffffffff81e11500 ffff8802bf0d1980
[ 1079.910659]  ffff8802c3118000 ffffffffc004e004 ffff8802bf0d1980 00000000ffffffff
[ 1079.910661]  ffffffffc004e008 ffff8802c3117b50 ffffffff8182d7c5 ffffffffc004e000
[ 1079.910662] Call Trace:
[ 1079.910664]  [<ffffffff8182d7c5>] schedule+0x35/0x80
[ 1079.910666]  [<ffffffff8182da6e>] schedule_preempt_disabled+0xe/0x10
[ 1079.910667]  [<ffffffff8182f6a9>] __mutex_lock_slowpath+0xb9/0x130
[ 1079.910669]  [<ffffffff8182f73f>] mutex_lock+0x1f/0x30
[ 1079.910675]  [<ffffffffc0013bc8>] drm_release+0x38/0x500 [drm]
[ 1079.910678]  [<ffffffff8141a832>] ? __percpu_counter_add+0x52/0x70
[ 1079.910680]  [<ffffffff8120f594>] __fput+0xe4/0x220
[ 1079.910681]  [<ffffffff8120f70e>] ____fput+0xe/0x10
[ 1079.910683]  [<ffffffff8109ed41>] task_work_run+0x81/0xa0
[ 1079.910685]  [<ffffffff81083e51>] do_exit+0x2e1/0xb00
[ 1079.910687]  [<ffffffff810ffe1c>] ? __unqueue_futex+0x2c/0x60
[ 1079.910688]  [<ffffffff8110093e>] ? futex_wait+0x16e/0x280
[ 1079.910690]  [<ffffffff815ad00d>] ? fence_wait_timeout+0x7d/0x160
[ 1079.910691]  [<ffffffff810846f3>] do_group_exit+0x43/0xb0
[ 1079.910693]  [<ffffffff810908c2>] get_signal+0x292/0x600
[ 1079.910695]  [<ffffffff8102e537>] do_signal+0x37/0x6f0
[ 1079.910696]  [<ffffffff810be235>] ? pick_next_task_fair+0x335/0x4f0
[ 1079.910698]  [<ffffffff81037e19>] ? sched_clock+0x9/0x10
[ 1079.910699]  [<ffffffff8100320c>] exit_to_usermode_loop+0x8c/0xd0
[ 1079.910701]  [<ffffffff81003c6e>] syscall_return_slowpath+0x4e/0x60
[ 1079.910702]  [<ffffffff81831a10>] int_ret_from_sys_call+0x25/0x8f
[ 1079.910704] INFO: task sa-solver:2284 blocked for more than 120 seconds.
[ 1079.910722]       Tainted: G           OE   4.4.0-45-generic #66-Ubuntu
[ 1079.910739] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1079.910759] sa-solver       D ffff88029524fb38     0  2284      1 0x00000002
[ 1079.910761]  ffff88029524fb38 ffff88029524fbb8 ffff8802bee60cc0 ffff880015562640
[ 1079.910762]  ffff880295250000 ffffffffc004e004 ffff880015562640 00000000ffffffff
[ 1079.910764]  ffffffffc004e008 ffff88029524fb50 ffffffff8182d7c5 ffffffffc004e000
[ 1079.910765] Call Trace:
[ 1079.910767]  [<ffffffff8182d7c5>] schedule+0x35/0x80
[ 1079.910769]  [<ffffffff8182da6e>] schedule_preempt_disabled+0xe/0x10
[ 1079.910770]  [<ffffffff8182f6a9>] __mutex_lock_slowpath+0xb9/0x130
[ 1079.910771]  [<ffffffff8182f73f>] mutex_lock+0x1f/0x30
[ 1079.910777]  [<ffffffffc0013bc8>] drm_release+0x38/0x500 [drm]
[ 1079.910779]  [<ffffffff8120f594>] __fput+0xe4/0x220
[ 1079.910781]  [<ffffffff8120f70e>] ____fput+0xe/0x10
[ 1079.910783]  [<ffffffff8109ed41>] task_work_run+0x81/0xa0
[ 1079.910785]  [<ffffffff81083e51>] do_exit+0x2e1/0xb00
[ 1079.910786]  [<ffffffff810ffe1c>] ? __unqueue_futex+0x2c/0x60
[ 1079.910788]  [<ffffffff8110093e>] ? futex_wait+0x16e/0x280
[ 1079.910789]  [<ffffffff810846f3>] do_group_exit+0x43/0xb0
[ 1079.910791]  [<ffffffff810908c2>] get_signal+0x292/0x600
[ 1079.910793]  [<ffffffff8102e537>] do_signal+0x37/0x6f0
[ 1079.910794]  [<ffffffff810be235>] ? pick_next_task_fair+0x335/0x4f0
[ 1079.910796]  [<ffffffff81037e19>] ? sched_clock+0x9/0x10
[ 1079.910797]  [<ffffffff8100320c>] exit_to_usermode_loop+0x8c/0xd0
[ 1079.910799]  [<ffffffff81003c6e>] syscall_return_slowpath+0x4e/0x60
[ 1079.910800]  [<ffffffff81831a10>] int_ret_from_sys_call+0x25/0x8f
[ 1079.910802] INFO: task sa-solver:2281 blocked for more than 120 seconds.
[ 1079.910819]       Tainted: G           OE   4.4.0-45-generic #66-Ubuntu
[ 1079.910837] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1079.910857] sa-solver       D ffff88029796bb38     0  2281      1 0x00000002
[ 1079.910858]  ffff88029796bb38 ffff88029796bbb8 ffff8802bb3f5940 ffff8802bb3f0cc0
[ 1079.910860]  ffff88029796c000 ffffffffc004e004 ffff8802bb3f0cc0 00000000ffffffff
[ 1079.910861]  ffffffffc004e008 ffff88029796bb50 ffffffff8182d7c5 ffffffffc004e000
[ 1079.910863] Call Trace:
[ 1079.910865]  [<ffffffff8182d7c5>] schedule+0x35/0x80
[ 1079.910867]  [<ffffffff8182da6e>] schedule_preempt_disabled+0xe/0x10
[ 1079.910868]  [<ffffffff8182f6a9>] __mutex_lock_slowpath+0xb9/0x130
[ 1079.910869]  [<ffffffff8182f73f>] mutex_lock+0x1f/0x30
[ 1079.910875]  [<ffffffffc0013bc8>] drm_release+0x38/0x500 [drm]
[ 1079.910877]  [<ffffffff8120f594>] __fput+0xe4/0x220
[ 1079.910879]  [<ffffffff8120f70e>] ____fput+0xe/0x10
[ 1079.910881]  [<ffffffff8109ed41>] task_work_run+0x81/0xa0
[ 1079.910882]  [<ffffffff81083e51>] do_exit+0x2e1/0xb00
[ 1079.910884]  [<ffffffff810ffe1c>] ? __unqueue_futex+0x2c/0x60
[ 1079.910885]  [<ffffffff8110093e>] ? futex_wait+0x16e/0x280
[ 1079.910887]  [<ffffffff810846f3>] do_group_exit+0x43/0xb0
[ 1079.910889]  [<ffffffff810908c2>] get_signal+0x292/0x600
[ 1079.910890]  [<ffffffff8102e537>] do_signal+0x37/0x6f0
[ 1079.910892]  [<ffffffff810be235>] ? pick_next_task_fair+0x335/0x4f0
[ 1079.910893]  [<ffffffff81037e19>] ? sched_clock+0x9/0x10
[ 1079.910895]  [<ffffffff8100320c>] exit_to_usermode_loop+0x8c/0xd0
[ 1079.910896]  [<ffffffff81003c6e>] syscall_return_slowpath+0x4e/0x60
[ 1079.910898]  [<ffffffff81831a10>] int_ret_from_sys_call+0x25/0x8f
[ 1079.910899] INFO: task sa-solver:2285 blocked for more than 120 seconds.
[ 1079.910917]       Tainted: G           OE   4.4.0-45-generic #66-Ubuntu
[ 1079.910934] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1079.910954] sa-solver       D ffff8802952d7ac8     0  2285      1 0x00000002
[ 1079.910956]  ffff8802952d7ac8 ffffffff81407011 ffff8802bcf65940 ffff8802bede9980
[ 1079.910958]  ffff8802952d8000 ffff8802bb4e5930 ffff8802bb4e58f8 ffff88029e1ff000
[ 1079.910959]  ffff8802bb4e28c8 ffff8802952d7ae0 ffffffff8182d7c5 ffff88029e1ff4b0
[ 1079.910960] Call Trace:
[ 1079.910963]  [<ffffffff81407011>] ? __kfifo_free+0x11/0x40
[ 1079.910965]  [<ffffffff8182d7c5>] schedule+0x35/0x80
[ 1079.910990]  [<ffffffffc020a160>] amd_sched_entity_fini+0x70/0x100 [amdgpu]
[ 1079.910993]  [<ffffffff810c3dd0>] ? wake_atomic_t_function+0x60/0x60
[ 1079.911012]  [<ffffffffc01ada71>] amdgpu_ctx_do_release+0xa1/0xd0 [amdgpu]
[ 1079.911031]  [<ffffffffc01ae23a>] amdgpu_ctx_mgr_fini+0x7a/0x90 [amdgpu]
[ 1079.911046]  [<ffffffffc018631e>] amdgpu_driver_postclose_kms+0x3e/0xb0 [amdgpu]
[ 1079.911053]  [<ffffffffc0013de4>] drm_release+0x254/0x500 [drm]
[ 1079.911055]  [<ffffffff8120f594>] __fput+0xe4/0x220
[ 1079.911057]  [<ffffffff8120f70e>] ____fput+0xe/0x10
[ 1079.911059]  [<ffffffff8109ed41>] task_work_run+0x81/0xa0
[ 1079.911061]  [<ffffffff81083e51>] do_exit+0x2e1/0xb00
[ 1079.911062]  [<ffffffff815ad00d>] ? fence_wait_timeout+0x7d/0x160
[ 1079.911064]  [<ffffffff810846f3>] do_group_exit+0x43/0xb0
[ 1079.911066]  [<ffffffff810908c2>] get_signal+0x292/0x600
[ 1079.911067]  [<ffffffff8102e537>] do_signal+0x37/0x6f0
[ 1079.911082]  [<ffffffffc0181071>] ? amdgpu_drm_ioctl+0x71/0x80 [amdgpu]
[ 1079.911084]  [<ffffffff8122124f>] ? do_vfs_ioctl+0x29f/0x490
[ 1079.911085]  [<ffffffff8100320c>] exit_to_usermode_loop+0x8c/0xd0
[ 1079.911087]  [<ffffffff81003c6e>] syscall_return_slowpath+0x4e/0x60
[ 1079.911088]  [<ffffffff81831a10>] int_ret_from_sys_call+0x25/0x8f
[ 1079.911090] INFO: task sa-solver:2282 blocked for more than 120 seconds.
[ 1079.911108]       Tainted: G           OE   4.4.0-45-generic #66-Ubuntu
[ 1079.911126] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1079.911146] sa-solver       D ffff880296b23b38     0  2282      1 0x00000002
[ 1079.911148]  ffff880296b23b38 ffff880296b23bb8 ffff8802bedecc80 ffff8802bb3f5940
[ 1079.911150]  ffff880296b24000 ffffffffc004e004 ffff8802bb3f5940 00000000ffffffff
[ 1079.911151]  ffffffffc004e008 ffff880296b23b50 ffffffff8182d7c5 ffffffffc004e000
[ 1079.911152] Call Trace:
[ 1079.911155]  [<ffffffff8182d7c5>] schedule+0x35/0x80
[ 1079.911157]  [<ffffffff8182da6e>] schedule_preempt_disabled+0xe/0x10
[ 1079.911158]  [<ffffffff8182f6a9>] __mutex_lock_slowpath+0xb9/0x130
[ 1079.911159]  [<ffffffff8182f73f>] mutex_lock+0x1f/0x30
[ 1079.911166]  [<ffffffffc0013bc8>] drm_release+0x38/0x500 [drm]
[ 1079.911168]  [<ffffffff8120f594>] __fput+0xe4/0x220
[ 1079.911170]  [<ffffffff8120f70e>] ____fput+0xe/0x10
[ 1079.911171]  [<ffffffff8109ed41>] task_work_run+0x81/0xa0
[ 1079.911173]  [<ffffffff81083e51>] do_exit+0x2e1/0xb00
[ 1079.911175]  [<ffffffff815ad00d>] ? fence_wait_timeout+0x7d/0x160
[ 1079.911177]  [<ffffffff810846f3>] do_group_exit+0x43/0xb0
[ 1079.911178]  [<ffffffff810908c2>] get_signal+0x292/0x600
[ 1079.911180]  [<ffffffff8102e537>] do_signal+0x37/0x6f0
[ 1079.911194]  [<ffffffffc0181071>] ? amdgpu_drm_ioctl+0x71/0x80 [amdgpu]
[ 1079.911196]  [<ffffffff8122124f>] ? do_vfs_ioctl+0x29f/0x490
[ 1079.911198]  [<ffffffff8100320c>] exit_to_usermode_loop+0x8c/0xd0
[ 1079.911199]  [<ffffffff81003c6e>] syscall_return_slowpath+0x4e/0x60
[ 1079.911201]  [<ffffffff81831a10>] int_ret_from_sys_call+0x25/0x8f
[ 1079.911202] INFO: task sa-solver:2288 blocked for more than 120 seconds.
[ 1079.911220]       Tainted: G           OE   4.4.0-45-generic #66-Ubuntu
[ 1079.911238] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1079.911258] sa-solver       D ffff880294bbbb38     0  2288      1 0x00000002
[ 1079.911260]  ffff880294bbbb38 ffff880294bbbbb8 ffff880015562640 ffff8802beded940
[ 1079.911261]  ffff880294bbc000 ffffffffc004e004 ffff8802beded940 00000000ffffffff
[ 1079.911263]  ffffffffc004e008 ffff880294bbbb50 ffffffff8182d7c5 ffffffffc004e000
[ 1079.911264] Call Trace:
[ 1079.911266]  [<ffffffff8182d7c5>] schedule+0x35/0x80
[ 1079.911268]  [<ffffffff8182da6e>] schedule_preempt_disabled+0xe/0x10
[ 1079.911269]  [<ffffffff8182f6a9>] __mutex_lock_slowpath+0xb9/0x130
[ 1079.911271]  [<ffffffff8182f73f>] mutex_lock+0x1f/0x30
[ 1079.911277]  [<ffffffffc0013bc8>] drm_release+0x38/0x500 [drm]
[ 1079.911279]  [<ffffffff8141a832>] ? __percpu_counter_add+0x52/0x70
[ 1079.911281]  [<ffffffff8120f594>] __fput+0xe4/0x220
[ 1079.911283]  [<ffffffff8120f70e>] ____fput+0xe/0x10
[ 1079.911285]  [<ffffffff8109ed41>] task_work_run+0x81/0xa0
[ 1079.911286]  [<ffffffff81083e51>] do_exit+0x2e1/0xb00
[ 1079.911288]  [<ffffffff815ad00d>] ? fence_wait_timeout+0x7d/0x160
[ 1079.911290]  [<ffffffff810846f3>] do_group_exit+0x43/0xb0
[ 1079.911291]  [<ffffffff810908c2>] get_signal+0x292/0x600
[ 1079.911293]  [<ffffffff8102e537>] do_signal+0x37/0x6f0
[ 1079.911307]  [<ffffffffc0181071>] ? amdgpu_drm_ioctl+0x71/0x80 [amdgpu]
[ 1079.911309]  [<ffffffff8122124f>] ? do_vfs_ioctl+0x29f/0x490
[ 1079.911310]  [<ffffffff8100320c>] exit_to_usermode_loop+0x8c/0xd0
[ 1079.911312]  [<ffffffff81003c6e>] syscall_return_slowpath+0x4e/0x60
[ 1079.911313]  [<ffffffff81831a10>] int_ret_from_sys_call+0x25/0x8f
[ 1079.911315] INFO: task sa-solver:2291 blocked for more than 120 seconds.
[ 1079.911333]       Tainted: G           OE   4.4.0-45-generic #66-Ubuntu
[ 1079.911350] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1079.911370] sa-solver       D ffff880292ca7b38     0  2291      1 0x00000002
[ 1079.911372]  ffff880292ca7b38 ffff880292ca7bb8 ffff8802bee60cc0 ffff8802bedecc80
[ 1079.911374]  ffff880292ca8000 ffffffffc004e004 ffff8802bedecc80 00000000ffffffff
[ 1079.911375]  ffffffffc004e008 ffff880292ca7b50 ffffffff8182d7c5 ffffffffc004e000
[ 1079.911377] Call Trace:
[ 1079.911379]  [<ffffffff8182d7c5>] schedule+0x35/0x80
[ 1079.911381]  [<ffffffff8182da6e>] schedule_preempt_disabled+0xe/0x10
[ 1079.911382]  [<ffffffff8182f6a9>] __mutex_lock_slowpath+0xb9/0x130
[ 1079.911383]  [<ffffffff8182f73f>] mutex_lock+0x1f/0x30
[ 1079.911389]  [<ffffffffc0013bc8>] drm_release+0x38/0x500 [drm]
[ 1079.911391]  [<ffffffff8120f594>] __fput+0xe4/0x220
[ 1079.911393]  [<ffffffff8120f70e>] ____fput+0xe/0x10
[ 1079.911395]  [<ffffffff8109ed41>] task_work_run+0x81/0xa0
[ 1079.911397]  [<ffffffff81083e51>] do_exit+0x2e1/0xb00
[ 1079.911398]  [<ffffffff815ad00d>] ? fence_wait_timeout+0x7d/0x160
[ 1079.911400]  [<ffffffff810846f3>] do_group_exit+0x43/0xb0
[ 1079.911402]  [<ffffffff810908c2>] get_signal+0x292/0x600
[ 1079.911403]  [<ffffffff8102e537>] do_signal+0x37/0x6f0
[ 1079.911418]  [<ffffffffc0181071>] ? amdgpu_drm_ioctl+0x71/0x80 [amdgpu]
[ 1079.911419]  [<ffffffff8122124f>] ? do_vfs_ioctl+0x29f/0x490
[ 1079.911421]  [<ffffffff8100320c>] exit_to_usermode_loop+0x8c/0xd0
[ 1079.911422]  [<ffffffff81003c6e>] syscall_return_slowpath+0x4e/0x60
[ 1079.911424]  [<ffffffff81831a10>] int_ret_from_sys_call+0x25/0x8f
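Read by hand, the traces above split into two groups: nine of the ten blocked sa-solver tasks are waiting in mutex_lock() called from drm_release(), while one (PID 2285) is further inside drm_release(), waiting in amd_sched_entity_fini(). That pattern would be consistent with one task holding a lock that the others need, although the log alone does not prove it. A short script can make the split visible in a long log. This is a minimal sketch, not part of silentarmy; the names blocked_tasks and WAIT_HELPERS are illustrative assumptions, and the sample is an excerpt of the log above.

```python
import re
from collections import Counter

TASK = re.compile(r"INFO: task (\S+):(\d+) blocked for more than (\d+) seconds")
# A stack frame; a leading "? " marks a frame the kernel considers unreliable.
FRAME = re.compile(r"\[<[0-9a-f]+>\] (\? )?([\w.]+)\+0x")
# Generic scheduling/locking frames that say "waiting" but not "for what".
WAIT_HELPERS = {"schedule", "schedule_preempt_disabled",
                "__mutex_lock_slowpath", "mutex_lock"}

def blocked_tasks(dmesg_text):
    """Map each hung task's PID to the first reliable non-helper frame of its trace."""
    result = {}
    pid = None
    for line in dmesg_text.splitlines():
        task = TASK.search(line)
        if task:
            pid = int(task.group(2))
            continue
        frame = FRAME.search(line)
        if frame and pid is not None and pid not in result:
            unreliable, name = frame.group(1), frame.group(2)
            if unreliable or name in WAIT_HELPERS:
                continue
            result[pid] = name
    return result

sample = """\
[ 1079.910237] INFO: task sa-solver:2290 blocked for more than 120 seconds.
[ 1079.910309] Call Trace:
[ 1079.910316]  [<ffffffff8182d7c5>] schedule+0x35/0x80
[ 1079.910319]  [<ffffffff8182da6e>] schedule_preempt_disabled+0xe/0x10
[ 1079.910321]  [<ffffffff8182f6a9>] __mutex_lock_slowpath+0xb9/0x130
[ 1079.910322]  [<ffffffff8182f73f>] mutex_lock+0x1f/0x30
[ 1079.910336]  [<ffffffffc0013bc8>] drm_release+0x38/0x500 [drm]
[ 1079.910899] INFO: task sa-solver:2285 blocked for more than 120 seconds.
[ 1079.910960] Call Trace:
[ 1079.910963]  [<ffffffff81407011>] ? __kfifo_free+0x11/0x40
[ 1079.910965]  [<ffffffff8182d7c5>] schedule+0x35/0x80
[ 1079.910990]  [<ffffffffc020a160>] amd_sched_entity_fini+0x70/0x100 [amdgpu]
"""
tasks = blocked_tasks(sample)
print(tasks)                    # PID -> function the task is blocked in
print(Counter(tasks.values()))  # how many tasks are blocked in each function
```

Fed the full log, the same function would be expected to group the nine mutex waiters under drm_release and single out PID 2285.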

Phistr90 avatar Phistr90 commented on July 20, 2024

I have the exact same problem, also with an RX 480 Nitro (8 GB).

mbevand avatar mbevand commented on July 20, 2024

In the last dmesg I see amdgpu_drm_ioctl() in the stack trace. This is more evidence pointing toward "asic hangs" and flaky/unreliable hardware. Not sure I can be of much help :-(

 avatar commented on July 20, 2024

@mbevand

I disconnected GPU #5 (the sixth card) and ran the miner overnight. Of the 5 remaining cards, only the first one is still working; all the others report 0 sol/s.

In dmesg I see only one line:

[ 4449.551183] perf interrupt took too long (2504 > 2500), lowering kernel.perf_event_max_sample_rate to 50000

Phistr90 avatar Phistr90 commented on July 20, 2024

Yeah, it's unlikely to be the GPU, because I swapped the cards and they work in the other rig with 4 GPUs. And on the 6-GPU rig, other cards are failing now...

 avatar commented on July 20, 2024

@Phistr90 I'm checking my hardware right now. It could also be a PSU.
Do you have a dual-PSU setup for the six cards, or a single PSU? Can you post your rig's full specs?

Phistr90 avatar Phistr90 commented on July 20, 2024

@ddctd143: 6x RX 480, H81 Pro BTC, unpowered riser cables, 1250 W OCZ PSU, 2x 4 GB RAM, Celeron CPU.

It's unlikely to be the PSU as well, because I swapped those too :)

 avatar commented on July 20, 2024

I have a dual-PSU (750 W each) rig with an H81 Pro BTC, 6x RX 480 4 GB, and powered risers. I disconnected the second PSU along with its 3 cards and risers. The first 3 cards then worked for 40 minutes without a problem. Now I have reconnected the second PSU and the first card on that PSU. I will try to check whether this could be a PSU or riser problem.

 avatar commented on July 20, 2024

Update.

I tried changing the RAM, the PSU, and the risers, but on two rigs some of the GPUs still stop working and hang. After that I can't kill the zombie processes, and I can't restart or shut down either; only a hard power-off helps. It does not look like a hardware problem, as other people on forum.z.cash also complain about freezes.

mbevand avatar mbevand commented on July 20, 2024

I am sorry but this really does look like a hardware problem. I used to run a 20 kW GPU farm and these issues were always symptoms of hardware reliability problems.

What happens is that when a GPU becomes flaky, the kernel driver may never release some software locks, and this deadlocks other GPUs (which are totally OK hardware-wise). That explains why many of them drop to 0.

You have either a system-wide problem (e.g. an insufficient PSU) or a GPU-specific problem. In the second case, you can try to isolate the faulty GPU by removing one GPU at a time from the system until it runs stably.

I am going to close this github issue, but feel free to continue discussing here and/or forum.z.cash.
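For anyone trying the one-GPU-at-a-time isolation described above, a stall detector can save watching the terminal for hours. The following is a minimal sketch under stated assumptions, not part of silentarmy: run_until_stall is a hypothetical helper, the stall threshold is arbitrary, and the demo uses stand-in child processes instead of the solver. On a rig, the command would be the one quoted earlier in this thread, for example ["./sa-solver", "--nonces", "100000000", "--use", "5"], repeated for each GPU index.

```python
import queue
import subprocess
import sys
import threading

def run_until_stall(cmd, stall_seconds):
    """Run cmd; return ("exited", code), or ("stalled", last_line) if output stops."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    lines = queue.Queue()

    def pump():
        # Forward each output line to the main thread as it arrives.
        for line in proc.stdout:
            lines.put(line.rstrip("\n"))
        lines.put(None)  # sentinel: the child closed its output

    threading.Thread(target=pump, daemon=True).start()
    last = None
    while True:
        try:
            item = lines.get(timeout=stall_seconds)
        except queue.Empty:
            # No output for stall_seconds. kill() may have no effect if the
            # process is stuck inside the kernel, as in the dmesg traces above.
            proc.kill()
            try:
                proc.wait(timeout=5)
            except subprocess.TimeoutExpired:
                pass
            return ("stalled", last)
        if item is None:
            return ("exited", proc.wait())
        last = item

# Stand-in commands: one that finishes, one that prints a line and then goes quiet.
ok = [sys.executable, "-c", "print('Nonce 00: 1 sol')"]
hang = [sys.executable, "-c",
        "import time; print('Nonce 00: 1 sol', flush=True); time.sleep(30)"]
r_ok = run_until_stall(ok, 5)
r_hang = run_until_stall(hang, 3)
print(r_ok)
print(r_hang)
```

The last line seen before a stall is returned so that runs on different cards can be compared. A process blocked in the driver will usually survive the kill attempt, so a stalled result is a cue to check dmesg rather than a guarantee of cleanup.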

knicefire avatar knicefire commented on July 20, 2024

@ddctd143 Have you solved the issue? It seems I'm having the exact same problem with my RX 480, but I'm using ethminer.
The stack traces helped me find your issue.

kyjak avatar kyjak commented on July 20, 2024

Same issue here, any progress?
