Comments (21)

ftsguy commented on August 19, 2024

I'll get some more logs here in a bit that may help.

from teamredminer.

ftsguy commented on August 19, 2024

Ignore the fan = 0 on GPU 6. It's an ASUS card and has never reported right.

todxx commented on August 19, 2024

Based on those logs, it seems the kernel is stuck waiting for the dead GPU to respond, which is why the miner won't exit. Can you share a little about what your restart script does?

ftsguy commented on August 19, 2024

Sure, it just attempts a reboot, but it seems it never gets that far. I tried a simple echo "testing" in the script as well, and it never echoed.

#!/bin/bash
sudo shutdown -r now

also tried this
#!/bin/bash
echo "testing"

todxx commented on August 19, 2024

Have you tested that the script is run correctly when you use the --watchdog_test option?

ftsguy commented on August 19, 2024

Yes, it works before a GPU failure.

ftsguy commented on August 19, 2024

I'm running on 18.5 drivers, tempted to try 18.3.

todxx commented on August 19, 2024

Well, that is really frustrating. The only thing I can think of is that attempting to execute the script causes a page fault in the miner, which then hangs because the hung DMA operation has the memory pinned.

How much DRAM does this machine have? Is it possible to add another stick to see if that resolves the problem?

ftsguy commented on August 19, 2024

4 GB. I don't have a swap partition either, if you think that would matter. I'll try the RAM and report back.

ftsguy commented on August 19, 2024

Also, the kernel still responds to normal bash commands while it's frozen, if that helps.

ftsguy commented on August 19, 2024

If you have any code mod ideas, let me know. It's driving me nuts. I'll test it.

ftsguy commented on August 19, 2024

Darn, I was hoping it was the RAM. Still the same issue. Are there any other logs I might be able to pull from somewhere?

Mar 28 01:59:54 rig2 kernel: INFO: task teamredminer:1839 blocked for more than 120 seconds.
Mar 28 01:59:54 rig2 kernel: Tainted: G OE 4.15.0-46-generic #49-Ubuntu
Mar 28 01:59:54 rig2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 28 01:59:54 rig2 kernel: teamredminer D 0 1839 1809 0x00000000
Mar 28 01:59:54 rig2 kernel: Call Trace:
Mar 28 01:59:54 rig2 kernel: __schedule+0x291/0x8a0
Mar 28 01:59:54 rig2 kernel: ? amdgpu_cs_report_moved_bytes+0x60/0x60 [amdgpu]
Mar 28 01:59:54 rig2 kernel: schedule+0x2c/0x80
Mar 28 01:59:54 rig2 kernel: rwsem_down_read_failed+0xee/0x150
Mar 28 01:59:54 rig2 kernel: call_rwsem_down_read_failed+0x18/0x30
Mar 28 01:59:54 rig2 kernel: ? call_rwsem_down_read_failed+0x18/0x30
Mar 28 01:59:54 rig2 kernel: down_read+0x20/0x40
Mar 28 01:59:54 rig2 kernel: __do_page_fault+0x43d/0x4d0
Mar 28 01:59:54 rig2 kernel: do_page_fault+0x2e/0xe0
Mar 28 01:59:54 rig2 kernel: ? page_fault+0x2f/0x50
Mar 28 01:59:54 rig2 kernel: page_fault+0x45/0x50
Mar 28 01:59:54 rig2 kernel: RIP: 0033:0x7fd81a1f2180
Mar 28 01:59:54 rig2 kernel: RSP: 002b:00007fd7f431fcb0 EFLAGS: 00010202
Mar 28 01:59:54 rig2 kernel: RAX: 0000000000000000 RBX: 00007fd7d8000b50 RCX: 0000000000000000
Mar 28 01:59:54 rig2 kernel: RDX: 00007fd7f431fc80 RSI: 0000000000001000 RDI: 0000000000000000
Mar 28 01:59:54 rig2 kernel: RBP: 00007fd7d8007ad8 R08: 0000000000000300 R09: ffffffffffffffff
Mar 28 01:59:54 rig2 kernel: R10: 00007fd7f431f960 R11: 0000000000000202 R12: 00007fd7d832b860
Mar 28 01:59:54 rig2 kernel: R13: 0000000000000000 R14: 00007fd7f0337370 R15: 00007fd7f431fdc0
Mar 28 01:59:54 rig2 kernel: INFO: task teamredminer:1840 blocked for more than 120 seconds.
Mar 28 01:59:54 rig2 kernel: Tainted: G OE 4.15.0-46-generic #49-Ubuntu
Mar 28 01:59:54 rig2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 28 01:59:54 rig2 kernel: teamredminer D 0 1840 1809 0x00000000
Mar 28 01:59:54 rig2 kernel: Call Trace:
Mar 28 01:59:54 rig2 kernel: __schedule+0x291/0x8a0
Mar 28 01:59:54 rig2 kernel: ? amdgpu_cs_report_moved_bytes+0x60/0x60 [amdgpu]
Mar 28 01:59:54 rig2 kernel: schedule+0x2c/0x80
Mar 28 01:59:54 rig2 kernel: rwsem_down_read_failed+0xee/0x150
Mar 28 01:59:54 rig2 kernel: call_rwsem_down_read_failed+0x18/0x30
Mar 28 01:59:54 rig2 kernel: ? call_rwsem_down_read_failed+0x18/0x30
Mar 28 01:59:54 rig2 kernel: down_read+0x20/0x40
Mar 28 01:59:54 rig2 kernel: __do_page_fault+0x43d/0x4d0
Mar 28 01:59:54 rig2 kernel: ? SyS_futex+0x13b/0x180
Mar 28 01:59:54 rig2 kernel: do_page_fault+0x2e/0xe0
Mar 28 01:59:54 rig2 kernel: ? page_fault+0x2f/0x50
Mar 28 01:59:54 rig2 kernel: page_fault+0x45/0x50
Mar 28 01:59:54 rig2 kernel: RIP: 0033:0x7fd81a1814c7
Mar 28 01:59:54 rig2 kernel: RSP: 002b:00007fd7d3ebeb60 EFLAGS: 00010246
Mar 28 01:59:54 rig2 kernel: RAX: 00007fd821c106d0 RBX: 00007fd7cc009d30 RCX: 00007fd819e3dbb0
Mar 28 01:59:54 rig2 kernel: RDX: 00007fd7cc104938 RSI: 00007fd7cc012c20 RDI: 0000000000000000
Mar 28 01:59:54 rig2 kernel: RBP: 00007fd7cc104938 R08: 00000000000002de R09: ffffffffffffffff
Mar 28 01:59:54 rig2 kernel: R10: 00007fd7d3ebe960 R11: 0000000000000202 R12: 0000000000000001
Mar 28 01:59:54 rig2 kernel: R13: 000000000000000f R14: 00007fd7cc007c50 R15: 00007fd7cc104fd0
Mar 28 01:59:54 rig2 kernel: INFO: task teamredminer:1848 blocked for more than 120 seconds.
Mar 28 01:59:54 rig2 kernel: Tainted: G OE 4.15.0-46-generic #49-Ubuntu
Mar 28 01:59:54 rig2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 28 01:59:54 rig2 kernel: teamredminer D 0 1848 1809 0x00000000
Mar 28 01:59:54 rig2 kernel: Call Trace:
Mar 28 01:59:54 rig2 kernel: __schedule+0x291/0x8a0
Mar 28 01:59:54 rig2 kernel: schedule+0x2c/0x80
Mar 28 01:59:54 rig2 kernel: rwsem_down_read_failed+0xee/0x150
Mar 28 01:59:54 rig2 kernel: call_rwsem_down_read_failed+0x18/0x30
Mar 28 01:59:54 rig2 kernel: ? call_rwsem_down_read_failed+0x18/0x30
Mar 28 01:59:54 rig2 kernel: down_read+0x20/0x40
Mar 28 01:59:54 rig2 kernel: __do_page_fault+0x43d/0x4d0
Mar 28 01:59:54 rig2 kernel: ? SyS_futex+0x13b/0x180
Mar 28 01:59:54 rig2 kernel: do_page_fault+0x2e/0xe0
Mar 28 01:59:54 rig2 kernel: ? page_fault+0x2f/0x50
Mar 28 01:59:54 rig2 kernel: page_fault+0x45/0x50
Mar 28 01:59:54 rig2 kernel: RIP: 0033:0x7fd81f2f5a70
Mar 28 01:59:54 rig2 kernel: RSP: 002b:00007fd7caffcdd0 EFLAGS: 00010203
Mar 28 01:59:54 rig2 kernel: RAX: 00000000000000a1 RBX: 0000000000000003 RCX: 00007fd81f2f5a66
Mar 28 01:59:54 rig2 kernel: RDX: 0000000000001000 RSI: 00007fd81ffb66a4 RDI: 0000000000000000
Mar 28 01:59:54 rig2 kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
Mar 28 01:59:54 rig2 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00007fd81ffb66a4
Mar 28 01:59:54 rig2 kernel: R13: 0000000000001000 R14: 0000000000000000 R15: 0000000000003c00
Mar 28 01:59:54 rig2 kernel: INFO: task teamredminer:1849 blocked for more than 120 seconds.
Mar 28 01:59:54 rig2 kernel: Tainted: G OE 4.15.0-46-generic #49-Ubuntu
Mar 28 01:59:54 rig2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 28 01:59:54 rig2 kernel: teamredminer D 0 1849 1809 0x00000000
Mar 28 01:59:54 rig2 kernel: Call Trace:
Mar 28 01:59:54 rig2 kernel: __schedule+0x291/0x8a0
Mar 28 01:59:54 rig2 kernel: schedule+0x2c/0x80
Mar 28 01:59:54 rig2 kernel: rwsem_down_read_failed+0xee/0x150
Mar 28 01:59:54 rig2 kernel: call_rwsem_down_read_failed+0x18/0x30
Mar 28 01:59:54 rig2 kernel: ? call_rwsem_down_read_failed+0x18/0x30
Mar 28 01:59:54 rig2 kernel: down_read+0x20/0x40
Mar 28 01:59:54 rig2 kernel: __do_page_fault+0x43d/0x4d0
Mar 28 01:59:54 rig2 kernel: ? SyS_futex+0x13b/0x180
Mar 28 01:59:54 rig2 kernel: do_page_fault+0x2e/0xe0
Mar 28 01:59:54 rig2 kernel: ? page_fault+0x2f/0x50
Mar 28 01:59:54 rig2 kernel: page_fault+0x45/0x50
Mar 28 01:59:54 rig2 kernel: RIP: 0033:0x7fd81f2f5a70
Mar 28 01:59:54 rig2 kernel: RSP: 002b:00007fd7ca7fbc90 EFLAGS: 00010207
Mar 28 01:59:54 rig2 kernel: RAX: 0000000000000147 RBX: 0000000000000004 RCX: 00007fd81f2f5a66
Mar 28 01:59:54 rig2 kernel: RDX: 0000000000010000 RSI: 00007fd81ffbff8c RDI: 0000000000000000
Mar 28 01:59:54 rig2 kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
Mar 28 01:59:54 rig2 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00007fd81ffbff8c
Mar 28 01:59:54 rig2 kernel: R13: 0000000000010000 R14: 0000000000000000 R15: 00007fd81ffbff90
Mar 28 01:59:54 rig2 kernel: INFO: task teamredminer:1850 blocked for more than 120 seconds.
Mar 28 01:59:54 rig2 kernel: Tainted: G OE 4.15.0-46-generic #49-Ubuntu
Mar 28 01:59:54 rig2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 28 01:59:54 rig2 kernel: teamredminer D 0 1850 1809 0x00000000
Mar 28 01:59:54 rig2 kernel: Call Trace:
Mar 28 01:59:54 rig2 kernel: __schedule+0x291/0x8a0
Mar 28 01:59:54 rig2 kernel: schedule+0x2c/0x80
Mar 28 01:59:54 rig2 kernel: rwsem_down_read_failed+0xee/0x150
Mar 28 01:59:54 rig2 kernel: call_rwsem_down_read_failed+0x18/0x30
Mar 28 01:59:54 rig2 kernel: ? call_rwsem_down_read_failed+0x18/0x30
Mar 28 01:59:54 rig2 kernel: down_read+0x20/0x40
Mar 28 01:59:54 rig2 kernel: __do_page_fault+0x43d/0x4d0
Mar 28 01:59:54 rig2 kernel: do_page_fault+0x2e/0xe0
Mar 28 01:59:54 rig2 kernel: ? page_fault+0x2f/0x50
Mar 28 01:59:54 rig2 kernel: page_fault+0x45/0x50
Mar 28 01:59:54 rig2 kernel: RIP: 0033:0x7fd81f2b79da
Mar 28 01:59:54 rig2 kernel: RSP: 002b:00007fd7c9ffa950 EFLAGS: 00010207
Mar 28 01:59:54 rig2 kernel: RAX: 0000000000000000 RBX: 00007fd7c9ffaa50 RCX: 00007fd81f2b79d0
Mar 28 01:59:54 rig2 kernel: RDX: 0000000000000000 RSI: 00007fd7c9ffaa50 RDI: 0000000000000000
Mar 28 01:59:54 rig2 kernel: RBP: 00007fd7c9ffaa50 R08: 0000000000000000 R09: 00007fd820061be8
Mar 28 01:59:54 rig2 kernel: R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000003
Mar 28 01:59:54 rig2 kernel: R13: 0000000000000003 R14: 0000000000000001 R15: 0000000000000003
Mar 28 01:59:54 rig2 kernel: INFO: task teamredminer:1851 blocked for more than 120 seconds.
Mar 28 01:59:54 rig2 kernel: Tainted: G OE 4.15.0-46-generic #49-Ubuntu
Mar 28 01:59:54 rig2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 28 01:59:54 rig2 kernel: teamredminer D 0 1851 1809 0x00000000
Mar 28 01:59:54 rig2 kernel: Call Trace:
Mar 28 01:59:54 rig2 kernel: __schedule+0x291/0x8a0
Mar 28 01:59:54 rig2 kernel: schedule+0x2c/0x80
Mar 28 01:59:54 rig2 kernel: schedule_timeout+0x1cf/0x350
Mar 28 01:59:54 rig2 kernel: ? memcg_kmem_charge_memcg+0x7d/0xb0
Mar 28 01:59:54 rig2 kernel: dma_fence_default_wait+0x1c7/0x260
Mar 28 01:59:54 rig2 kernel: ? dma_fence_release+0xa0/0xa0
Mar 28 01:59:54 rig2 kernel: kcl_fence_default_wait+0x12/0x20 [amdkcl]
Mar 28 01:59:54 rig2 kernel: dma_fence_wait_timeout+0x3e/0xf0
Mar 28 01:59:54 rig2 kernel: reservation_object_wait_timeout_rcu+0x17d/0x370
Mar 28 01:59:54 rig2 kernel: amdgpu_mn_invalidate_range_start_gfx+0xa0/0x100 [amdgpu]
Mar 28 01:59:54 rig2 kernel: __mmu_notifier_invalidate_range_start+0x58/0x80
Mar 28 01:59:54 rig2 kernel: copy_page_range+0x650/0x6a0
Mar 28 01:59:54 rig2 kernel: ? memcg_kmem_get_cache+0x5d/0x160
Mar 28 01:59:54 rig2 kernel: ? vma_gap_callbacks_rotate+0x1e/0x30
Mar 28 01:59:54 rig2 kernel: ? __rb_insert_augmented+0x1b3/0x250
Mar 28 01:59:54 rig2 kernel: copy_process.part.35+0xd93/0x1b00
Mar 28 01:59:54 rig2 kernel: _do_fork+0xdf/0x400
Mar 28 01:59:54 rig2 kernel: ? SyS_futex+0x13b/0x180
Mar 28 01:59:54 rig2 kernel: ? _cond_resched+0x19/0x40
Mar 28 01:59:54 rig2 kernel: ? task_work_run+0x46/0xc0
Mar 28 01:59:54 rig2 kernel: SyS_clone+0x19/0x20
Mar 28 01:59:54 rig2 kernel: do_syscall_64+0x73/0x130
Mar 28 01:59:54 rig2 kernel: entry_SYSCALL_64_after_hwframe+0x3d/0xa2
Mar 28 01:59:54 rig2 kernel: RIP: 0033:0x7fd81f2b7b1c
Mar 28 01:59:54 rig2 kernel: RSP: 002b:00007fd7c97f9d50 EFLAGS: 00000246 ORIG_RAX: 0000000000000038
Mar 28 01:59:54 rig2 kernel: RAX: ffffffffffffffda RBX: 00007fd7c97f9d50 RCX: 00007fd81f2b7b1c
Mar 28 01:59:54 rig2 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
Mar 28 01:59:54 rig2 kernel: RBP: 00007fd7c97f9dc0 R08: 00007fd7c97fa700 R09: 0000000000000000
Mar 28 01:59:54 rig2 kernel: R10: 00007fd7c97fa9d0 R11: 0000000000000246 R12: 0000000000000000
Mar 28 01:59:54 rig2 kernel: R13: 0000000000000020 R14: 0000000000000001 R15: 00007fd820467498

todxx commented on August 19, 2024

Yeah, that's really frustrating. It looks like the miner is trying to fork() to run the script, but the fork() gets hung in the kernel because the amdgpu code is effectively stuck holding some locks.

You could always try reduced clocks to prevent the GPU from crashing.
Alternatively, you can run a cron-job type of script that checks whether the kernel is reporting any hung tasks and, if so, reboots the machine.
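
A minimal sketch of such a cron-style check, assuming the kernel hung-task messages look like the ones in the log above ("INFO: task teamredminer:1839 blocked for more than 120 seconds."). The function name count_hung_tasks and the --run flag are made up for illustration, and reading the messages from dmesg is an assumption (it normally needs root).

```shell
#!/bin/bash
# Sketch only: count kernel hung-task reports for one process name.
# Kernel log text is read from stdin, so it can come from dmesg or a saved log.
count_hung_tasks() {
    local name="$1"
    # grep -c prints 0 and exits non-zero when nothing matches; keep going anyway.
    grep -c "INFO: task ${name}:[0-9]* blocked for more than" || true
}

# Only act when explicitly asked to, e.g. from root's crontab.
if [ "${1:-}" = "--run" ]; then
    hung=$(dmesg | count_hung_tasks teamredminer)
    if [ "$hung" -gt 0 ]; then
        logger "hung-check: ${hung} hung teamredminer task reports, rebooting"
        shutdown -r now
    fi
fi
```

Run it with --run every few minutes from root's crontab. Note that an ordinary shutdown may itself hang waiting for the stuck miner, as reported further down in this thread.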

ftsguy commented on August 19, 2024

Yeah, I was thinking the cron script would work, but it seems even a forced shutdown locks up waiting for the miner to exit. I can't get a systemctl command to stop it either. I'm trying all kinds of stuff now with the GPU params in grub.

ftsguy commented on August 19, 2024

I can get xmrstak to crash a GPU and it seems to exit fine, but your hash rate is way higher, so I'd hate to fall back to that.

ftsguy commented on August 19, 2024

Well, a little progress, I believe. I think I have a messed-up card in the bunch. I enabled amdgpu.halt_if_hws_hang=1 on kernel load, and then HW failures started to appear under the hw stat after 45 minutes or so. Notice GPU 2. I think I can base my cron job on that to reboot. Any thoughts?

Mar 29 03:17:01 rig2 mine.sh[946]: [2019-03-29 03:17:01] Stats Uptime: 0 days, 01:24:00
Mar 29 03:17:01 rig2 mine.sh[946]: [2019-03-29 03:17:01] GPU 0 [54C, fan 28%] cnr: 505.8 h/s, avg 503.0 h/s, pool 476.7 h/s a:32 r:0 hw:0
Mar 29 03:17:01 rig2 mine.sh[946]: [2019-03-29 03:17:01] GPU 1 [65C, fan 28%] cnr: 506.0 h/s, avg 503.2 h/s, pool 595.9 h/s a:40 r:0 hw:0
Mar 29 03:17:01 rig2 mine.sh[946]: [2019-03-29 03:17:01] GPU 2 [57C, fan 28%] cnr: 532.9 h/s, avg 515.8 h/s, pool 327.8 h/s a:22 r:0 hw:24
Mar 29 03:17:01 rig2 mine.sh[946]: [2019-03-29 03:17:01] GPU 3 [56C, fan 28%] cnr: 505.5 h/s, avg 502.8 h/s, pool 506.6 h/s a:34 r:0 hw:0
Mar 29 03:17:01 rig2 mine.sh[946]: [2019-03-29 03:17:01] GPU 4 [57C, fan 28%] cnr: 505.8 h/s, avg 503.0 h/s, pool 596.0 h/s a:40 r:0 hw:0
Mar 29 03:17:01 rig2 mine.sh[946]: [2019-03-29 03:17:01] GPU 5 [58C, fan 28%] cnr: 505.8 h/s, avg 503.1 h/s, pool 566.2 h/s a:38 r:0 hw:0
Mar 29 03:17:01 rig2 mine.sh[946]: [2019-03-29 03:17:01] GPU 6 [52C, fan 23%] cnr: 506.0 h/s, avg 503.2 h/s, pool 461.9 h/s a:31 r:0 hw:0
Mar 29 03:17:01 rig2 mine.sh[946]: [2019-03-29 03:17:01] Total cnr: 3.568kh/s, avg 3.534kh/s, pool 3.531kh/s a:237 r:0 hw:24
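
A rough sketch of how a cron job could key off the hw counter, assuming the stats lines keep the exact format shown above ("GPU N [..] ... hw:M"). The function name gpus_over_hw_threshold and the threshold value are invented for illustration, and where the log text comes from (a file, the journal) is left to the caller.

```shell
#!/bin/bash
# Sketch only: print the index of every GPU whose hw: counter exceeds a threshold.
# Miner stats lines are read from stdin; the "Total" line is ignored because it
# does not contain "GPU N [".
gpus_over_hw_threshold() {
    local threshold="$1"
    sed -n 's/.*GPU \([0-9][0-9]*\) \[.*hw:\([0-9][0-9]*\).*/\1 \2/p' |
        awk -v t="$threshold" '$2 > t { print $1 }' |
        sort -un
}
```

For example, piping the stats block above into `gpus_over_hw_threshold 10` would flag GPU 2, and the cron job could restart the miner or reboot whenever the output is non-empty.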

ftsguy commented on August 19, 2024

I stopped and restarted the miner and the card came back to life. Would love a feature to call a script when hw > x :)

todxx commented on August 19, 2024

Yes, that might be something for us to look at in the future. If you felt like automating it before we can address it, I believe the data you need is available via the API.

ftsguy commented on August 19, 2024

So the issue reappeared, but I did end up finding a solution. I turned off the watchdog and wrote a simple app that calls the API to monitor the miner. Since Linux was failing even on safe reboots, I had to enable /proc/sysrq-trigger to mimic a hard reset. So keep the API functionality, I depend on it now ;). Thanks for the suggestions.
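
The monitoring app itself was not posted, so the following is only a guess at the general shape of such a monitor. It assumes the miner API is listening on 127.0.0.1:4028, accepts a cgminer/sgminer-style "devs" text command, and reports a "Hardware Errors=N" field per GPU; none of that is confirmed in this thread. The function names and the --run flag are hypothetical. Writing "b" to /proc/sysrq-trigger as root reboots immediately without syncing or unmounting disks.

```shell
#!/bin/bash
# Sketch only: poll the miner API and hard-reset the rig on too many hw errors.
# ASSUMED (not from the thread): API at 127.0.0.1:4028, a "devs" command, and
# "Hardware Errors=N" fields in a '|' and ',' separated text reply.

# Sum all "Hardware Errors" fields from an API reply on stdin.
total_hw_errors() {
    tr '|,' '\n\n' |
        awk -F= '$1 == "Hardware Errors" { sum += $2 } END { print sum + 0 }'
}

# Immediate reboot through the kernel, bypassing the normal shutdown path.
hard_reset() {
    echo 1 > /proc/sys/kernel/sysrq
    echo b > /proc/sysrq-trigger
}

# Only act when explicitly asked to; the optional second argument is the limit.
if [ "${1:-}" = "--run" ]; then
    limit="${2:-10}"
    reply=$(printf 'devs' | nc -w 5 127.0.0.1 4028 | tr -d '\0')
    errors=$(printf '%s' "$reply" | total_hw_errors)
    if [ "$errors" -gt "$limit" ]; then
        hard_reset
    fi
fi
```

This sketch treats an unreachable API as zero errors, so a real monitor would probably also want to react when the miner stops answering at all.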

todxx commented on August 19, 2024

Glad you found a solution :)

The API functionality will not be removed any time soon, so your setup is safe.
Since your script seems to have resolved the problem, I'll close this issue now.

iGerald1 commented on August 19, 2024

My first RX 6800 is always saying GPU 0 detected dead.
