
Is this still valid? (ryzen-test, open issue)

suaefar commented on July 23, 2024
Is this still valid?

from ryzen-test.

Comments (51)

infoveinx commented on July 23, 2024

I've been building computers since the 1990s, and many of those have been specifically for running Linux as either a desktop or server for personal use. In all of those years I've never encountered the kind of problems I've seen with this CPU. I've used both AMD and Intel too. I feel like 2017/2018 have been the worst given these issues, not to mention things like Spectre/Meltdown muddying the waters even more.

AMD needs to open up about this issue, because it's quite obvious that there are real problems with this generation of processors. I know I'm preaching to the choir here. If you look way down the link I posted above, you'll see where folks have gone to several of the top tech news sites and reported some of the issues with these processors. They either get no response, or a response stating that the site isn't seeing the issues reported. I can't imagine that's the case with so many people reporting the same problems. It feels like one giant cover-up, if you ask me, that even involves news and tech sites that test and write reviews on hardware.

suaefar commented on July 23, 2024

There is no official information on which CPUs are affected (or not affected).
Your description here does fit the ryzen segfault bug.
Prime95 and Memtest86 are not (as) sensitive to the bug as this workload.
If you hit a segfault that rapidly (less than 5 minutes) and only one or a few processes fail (and some continue running), then your CPU is probably affected.
If all processes fail within a short period it may be due to another problem.

You can still check the build logs in /mnt/ramdisk/workdir for problems other than a faulty CPU.
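The rule of thumb above can be written down as a small check. This is only a sketch: the function name, the return strings, and the 300-second default are illustrative and are not part of ryzen-test.

```python
def classify_run(fail_times_s, total_loops, quick_s=300):
    """Apply the rule of thumb from the comment above to one test run.

    fail_times_s: seconds until failure, one entry per loop that failed.
    total_loops:  number of build loops that were started.
    quick_s:      what counts as "rapid" (the comment says under 5 minutes).
    """
    if not fail_times_s:
        return "no failure observed"
    all_failed = len(fail_times_s) >= total_loops
    all_quick = all(t < quick_s for t in fail_times_s)
    if all_failed and all_quick:
        # Every loop died early: more likely a setup, RAM, or disk problem.
        return "all loops failed quickly: check the build logs for another cause"
    if any(t < quick_s for t in fail_times_s):
        # A few loops segfault early while the others keep building.
        return "CPU probably affected"
    return "inconclusive"


print(classify_run([147, 226], total_loops=16))
```

The example call uses the two failure times from the run posted further down in this thread (147 s and 226 s out of 16 loops).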

jstarcher commented on July 23, 2024

Welp, I think this script confirmed it. Another one for RMA. After much toying around (aka wasting valuable time) I tried the suggestion from #23 about disabling the opcode cache. As soon as I did that, the kill-ryzen script ran for about 20 minutes without crashing, by far the longest it has gone yet. I had to stop the script as I needed to get back on the machine, but this proved that my CPU is affected as well.

YD1700BBM88AE
UA 1743SUT

Very disappointing that AMD still hasn't gotten this under control. Now Newegg is giving me hassle about replacing it too. Ugh!

Anyway, thanks for the script and the response to this issue!

jstarcher commented on July 23, 2024

@suaefar I just installed a new replacement I purchased and hit this AGAIN. The new CPU is a week 33: 1733PGS. Testing was the same - a fresh 17.04 flash drive.

sudo dmidecode -t memory | grep -i -E "(rank|speed|part)" | grep -v -i unknown
Speed: 3200 MHz
Part Number: F4-3200C14-8GFX
Rank: 1
Configured Clock Speed: 1600 MHz
Speed: 3200 MHz
Part Number: F4-3200C14-8GFX
Rank: 1
Configured Clock Speed: 1600 MHz
uname -a
Linux ubuntu 4.10.0-19-generic #21-Ubuntu SMP Thu Apr 6 17:04:57 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
cat /proc/sys/kernel/randomize_va_space
2
/ /mnt/ramdisk/workdir /mnt/ramdisk/workdir
Using 16 parallel processes
[KERN] -- Logs begin at Fri 2018-02-23 15:36:39 EST. --
[KERN] Feb 23 15:36:55 ubuntu systemd[1]: snapd.refresh.timer: Adding 3h 21min 14.449848s random time.
[KERN] Feb 23 15:36:55 ubuntu systemd[1]: apt-daily.timer: Adding 2h 14min 40.656684s random time.
[KERN] Feb 23 15:36:55 ubuntu systemd[1]: motd-news.timer: Adding 51min 54.931579s random time.
[KERN] Feb 23 15:37:05 ubuntu systemd[1]: snapd.refresh.timer: Adding 1h 54min 27.129499s random time.
[KERN] Feb 23 15:37:05 ubuntu systemd[1]: snapd.refresh.timer: Adding 44min 49.245281s random time.
[KERN] Feb 23 15:37:05 ubuntu systemd[1]: apt-daily.timer: Adding 4h 55min 48.521122s random time.
[KERN] Feb 23 15:37:05 ubuntu systemd[1]: motd-news.timer: Adding 18min 53.032133s random time.
[KERN] Feb 23 15:39:10 ubuntu kernel: zram: Added device: zram0
[KERN] Feb 23 15:39:10 ubuntu kernel: zram0: detected capacity change from 0 to 68719476736
[KERN] Feb 23 15:39:10 ubuntu kernel: EXT4-fs (zram0): mounted filesystem with ordered data mode. Opts: discard
[loop-0] Fri Feb 23 15:39:46 EST 2018 start 0
[loop-1] Fri Feb 23 15:39:47 EST 2018 start 0
[loop-2] Fri Feb 23 15:39:48 EST 2018 start 0
[loop-3] Fri Feb 23 15:39:49 EST 2018 start 0
[loop-4] Fri Feb 23 15:39:50 EST 2018 start 0
[loop-5] Fri Feb 23 15:39:51 EST 2018 start 0
[loop-6] Fri Feb 23 15:39:52 EST 2018 start 0
[loop-7] Fri Feb 23 15:39:53 EST 2018 start 0
[loop-8] Fri Feb 23 15:39:54 EST 2018 start 0
[loop-9] Fri Feb 23 15:39:55 EST 2018 start 0
[loop-10] Fri Feb 23 15:39:56 EST 2018 start 0
[loop-11] Fri Feb 23 15:39:57 EST 2018 start 0
[loop-12] Fri Feb 23 15:39:58 EST 2018 start 0
[loop-13] Fri Feb 23 15:39:59 EST 2018 start 0
[loop-14] Fri Feb 23 15:40:00 EST 2018 start 0
[loop-15] Fri Feb 23 15:40:01 EST 2018 start 0
[loop-12] Fri Feb 23 15:42:13 EST 2018 build failed
[loop-12] TIME TO FAIL: 147 s
[KERN] Feb 23 15:42:13 ubuntu kernel: traps: bash[32728] general protection ip:445b20 sp:7fff1ce38448 error:0
[KERN] Feb 23 15:42:13 ubuntu kernel: in bash[400000+100000]
[KERN] Feb 23 15:42:26 ubuntu kernel: IPv6: ADDRCONF(NETDEV_UP): wlp9s0: link is not ready
[loop-8] Fri Feb 23 15:43:32 EST 2018 build failed
[loop-8] TIME TO FAIL: 226 s
[KERN] Feb 23 15:43:32 ubuntu kernel: bash[21958]: segfault at d ip 0000000000431f2e sp 00007ffc28f648c0 error 4 in bash[400000+100000]
[KERN] Feb 23 15:47:41 ubuntu kernel: IPv6: ADDRCONF(NETDEV_UP): wlp9s0: link is not ready
[KERN] Feb 23 15:52:57 ubuntu kernel: IPv6: ADDRCONF(NETDEV_UP): wlp9s0: link is not ready
[KERN] Feb 23 15:58:12 ubuntu kernel: IPv6: ADDRCONF(NETDEV_UP): wlp9s0: link is not ready

Checking build-8 log I see:
/bin/bash ../libtool --tag=CC --mode=link gcc -DNO_ASM -g -version-info 5:4:1 -static-libstdc++ -static-libgcc -o libmpfr.la -rpath /usr/local/lib exceptions.lo extract.lo uceil_exp2.lo uceil_log2.lo ufloor_log2.lo add.lo add1.lo add_ui.lo agm.lo clear.lo cmp.lo cmp_abs.lo cmp_si.lo cmp_ui.lo comparisons.lo div_2exp.lo div_2si.lo div_2ui.lo div.lo div_ui.lo dump.lo eq.lo exp10.lo exp2.lo exp3.lo exp.lo frac.lo frexp.lo get_d.lo get_exp.lo get_str.lo init.lo inp_str.lo isinteger.lo isinf.lo isnan.lo isnum.lo const_log2.lo log.lo modf.lo mul_2exp.lo mul_2si.lo mul_2ui.lo mul.lo mul_ui.lo neg.lo next.lo out_str.lo printf.lo vasprintf.lo const_pi.lo pow.lo pow_si.lo pow_ui.lo print_raw.lo print_rnd_mode.lo reldiff.lo round_prec.lo set.lo setmax.lo setmin.lo set_d.lo set_dfl_prec.lo set_exp.lo set_rnd.lo set_f.lo set_prc_raw.lo set_prec.lo set_q.lo set_si.lo set_str.lo set_str_raw.lo set_ui.lo set_z.lo sqrt.lo sqrt_ui.lo sub.lo sub1.lo sub_ui.lo rint.lo ui_div.lo ui_sub.lo urandom.lo urandomb.lo get_z_exp.lo swap.lo factorial.lo cosh.lo sinh.lo tanh.lo sinh_cosh.lo acosh.lo asinh.lo atanh.lo atan.lo cmp2.lo exp_2.lo asin.lo const_euler.lo cos.lo sin.lo tan.lo fma.lo fms.lo hypot.lo log1p.lo expm1.lo log2.lo log10.lo ui_pow.lo ui_pow_ui.lo minmax.lo dim.lo signbit.lo copysign.lo setsign.lo gmp_op.lo init2.lo acos.lo sin_cos.lo set_nan.lo set_inf.lo set_zero.lo powerof2.lo gamma.lo set_ld.lo get_ld.lo cbrt.lo volatile.lo fits_sshort.lo fits_sint.lo fits_slong.lo fits_ushort.lo fits_uint.lo fits_ulong.lo fits_uintmax.lo fits_intmax.lo get_si.lo get_ui.lo zeta.lo cmp_d.lo erf.lo inits.lo inits2.lo clears.lo sgn.lo check.lo sub1sp.lo version.lo mpn_exp.lo mpfr-gmp.lo mp_clz_tab.lo sum.lo add1sp.lo free_cache.lo si_op.lo cmp_ld.lo set_ui_2exp.lo set_si_2exp.lo set_uj.lo set_sj.lo get_sj.lo get_uj.lo get_z.lo iszero.lo cache.lo sqr.lo int_ceil_log2.lo isqrt.lo strtofr.lo pow_z.lo logging.lo mulders.lo get_f.lo round_p.lo erfc.lo atan2.lo subnormal.lo const_catalan.lo root.lo sec.lo csc.lo cot.lo eint.lo sech.lo csch.lo coth.lo round_near_x.lo constant.lo abort_prec_max.lo stack_interface.lo lngamma.lo zeta_ui.lo set_d64.lo get_d64.lo jn.lo yn.lo rem1.lo get_patches.lo add_d.lo sub_d.lo d_sub.lo mul_d.lo div_d.lo d_div.lo li2.lo rec_sqrt.lo min_prec.lo buildopt.lo digamma.lo bernoulli.lo isregular.lo set_flt.lo get_flt.lo scale2.lo set_z_exp.lo ai.lo gammaonethird.lo grandom.lo -lgmp
ßßßßßßßßßßß^K
Makefile:518: recipe for target 'libmpfr.la' failed
make[5]: *** [libmpfr.la] Segmentation fault (core dumped)
make[5]: Leaving directory '/mnt/ramdisk/workdir/buildloop.d/loop-8/mpfr/src'
Makefile:446: recipe for target 'all' failed
make[4]: *** [all] Error 2
make[4]: Leaving directory '/mnt/ramdisk/workdir/buildloop.d/loop-8/mpfr/src'
Makefile:468: recipe for target 'all-recursive' failed
make[3]: *** [all-recursive] Error 1
make[3]: Leaving directory '/mnt/ramdisk/workdir/buildloop.d/loop-8/mpfr'
Makefile:6475: recipe for target 'all-stage1-mpfr' failed
make[2]: *** [all-stage1-mpfr] Error 2
make[2]: Leaving directory '/mnt/ramdisk/workdir/buildloop.d/loop-8'
Makefile:27079: recipe for target 'stage1-bubble' failed
make[1]: *** [stage1-bubble] Error 2
make[1]: Leaving directory '/mnt/ramdisk/workdir/buildloop.d/loop-8'
Makefile:941: recipe for target 'all' failed
make: *** [all] Error 2

and loop-12:
Makefile:864: recipe for target 'libgmp.la' failed
make[5]: *** [libgmp.la] Segmentation fault (core dumped)
make[5]: Leaving directory '/mnt/ramdisk/workdir/buildloop.d/loop-12/gmp'
Makefile:954: recipe for target 'all-recursive' failed
make[4]: *** [all-recursive] Error 1
make[4]: Leaving directory '/mnt/ramdisk/workdir/buildloop.d/loop-12/gmp'
Makefile:773: recipe for target 'all' failed
make[3]: *** [all] Error 2
make[3]: Leaving directory '/mnt/ramdisk/workdir/buildloop.d/loop-12/gmp'
Makefile:5521: recipe for target 'all-stage1-gmp' failed
make[2]: *** [all-stage1-gmp] Error 2
make[2]: Leaving directory '/mnt/ramdisk/workdir/buildloop.d/loop-12'
Makefile:27079: recipe for target 'stage1-bubble' failed
make[1]: *** [stage1-bubble] Error 2
make[1]: Leaving directory '/mnt/ramdisk/workdir/buildloop.d/loop-12'
Makefile:941: recipe for target 'all' failed
make: *** [all] Error 2
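To find which loop directories are worth opening after a run like the one above, the "TIME TO FAIL" lines can be pulled out of the script output. This is a rough sketch that assumes the output format and the buildloop.d path layout seen in this post; the helper names are made up for illustration.

```python
import re

# Matches lines such as "[loop-12] TIME TO FAIL: 147 s" from the output above.
FAIL_RE = re.compile(r"\[(loop-\d+)\] TIME TO FAIL: (\d+) s")


def failed_loops(output):
    """Return {loop name: seconds until failure} for every failed loop."""
    return {m.group(1): int(m.group(2)) for m in FAIL_RE.finditer(output)}


def loop_dir(loop, workdir="/mnt/ramdisk/workdir"):
    """Directory of one build loop, following the paths in the make errors."""
    return f"{workdir}/buildloop.d/{loop}"


sample = (
    "[loop-12] Fri Feb 23 15:42:13 EST 2018 build failed\n"
    "[loop-12] TIME TO FAIL: 147 s\n"
    "[loop-8] Fri Feb 23 15:43:32 EST 2018 build failed\n"
    "[loop-8] TIME TO FAIL: 226 s\n"
)
for name, seconds in failed_loops(sample).items():
    print(name, seconds, loop_dir(name))
```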

Do you think I'm really that unlucky to get two post week 25 chips with the bug? I've already replaced the motherboard as well.

Thanks!

m-r-s commented on July 23, 2024

You should try to run the memory at stock settings, just to be sure.
Unstable memory also can result in segfaults.

jstarcher commented on July 23, 2024

I’ve tried JEDEC SPD, XMP, and everything in between with the same results. Also tried bumping DRAM and SOC voltage above the XMP values of 1.35 V and 1.1 V respectively, to no avail.

I just tried something new: popped out one stick of RAM so I’m down to 8 GB. The test ran for 10 minutes before running out of memory. That’s 8 minutes longer than ever before, so maybe it’s RAM after all, or maybe the bug affects the memory controller?

Any recommendations on params to run with 8 GB? 2 loops and 2 threads?

disturbednny commented on July 23, 2024

I'm in the same boat as you. My first R7 1700 was 1734 and had the bug show up. Went through the RMA process with NewEgg and just received... 1734PGS... same week, showing segfaults so far... let's see if it segfaults at 1.35 volts...

m-r-s commented on July 23, 2024

Then you both probably got faulty CPUs again.
We don't know what exactly is wrong, possibly something memory-related... the controller, or cache coherency.
We don't know how to distinguish good from bad ones (AMD did not tell us, maybe even they don't know).
The only tool we have is to run workloads on these CPUs which are likely to trigger the behavior.

I cannot understand how AMD gets away with this.
There must be thousands of faulty CPUs around, and they still sell them :(

I am deeply disappointed.

With 8 GB of RAM, better go for 3 loops with 5 threads each, or 2 loops with 8 threads each.

Good luck!

jstarcher commented on July 23, 2024

Thanks! Yet a third CPU will be here tomorrow. I’m thinking it might be time to look at getting a class action suit together to get them to talk. I used to love AMD but this is beyond ridiculous!

disturbednny commented on July 23, 2024

Where do you live that NewEgg RMAs so fast? Or did you buy from another source? What mobo and RAM do you have? I'm even trying older BIOS versions to see if that might help... hasn't so far. I use this PC for work and have lost hours to this. I need a place that will do an advanced replacement... or suck it up and get a Zen+ processor when they come out. Hopefully it's not in that microarchitecture as well...

jstarcher commented on July 23, 2024

Newegg refused to exchange it because it’s past 30 days. I was fighting some other issues and decided to RMA the motherboard first. By the time I had switched out the motherboard, Newegg had closed my RMA for the CPU. They also pissed me off because they wouldn’t take the motherboard back since I had sent the UPC in for the rebate.

This time I ordered from Amazon. I also use my PC for work as I work from home and couldn’t afford downtime. Newegg won’t do advanced replacement or returns on CPUs btw. I ended up filing a claim with my credit card for the return protection because of this mess.

I finally got back a response from AMD days later and they approved an RMA no questions asked and gave me a 2 day label.

Amazon was a one-click exchange and they do advanced replacement, so hopefully the one coming tomorrow is not bugged. If it is, I’ll go through the AMD RMA.

One way or another I’m getting to the bottom of this. I compile code on Linux for work and need this stable!

I’ll send that back an

jstarcher commented on July 23, 2024

Motherboard is an ASRock X370 Taichi. I’ve tried different BIOS versions without luck. For memory I’ve got 2x8 GB G.Skill Flare X.

disturbednny commented on July 23, 2024

I have the same RAM as you but have the Aorus Gaming K7. I think I'm going to wait until Zen+ comes out to RMA it, then sell it, because I seem to have bad luck. At least this one is 100% stable with my RAM's XMP profile according to stressapptest and AIDA64.

jstarcher commented on July 23, 2024

Okay, so I received my third CPU, which is a 1744SUS this time, and this script failed in about two minutes again with the segfault error. Given that this is the third post-week-25 chip I've had fail, I'm pretty convinced at this point that something else is going on: either something with the way this script runs on my machine (some weird thing when using zram?), memory settings, etc. I did experience some random lockups in both Windows and Linux without any MCEs or BSODs with my first chip, so I definitely think that one had something wrong.

At this point I'm going to try running some real-world workloads and see if I can reproduce it. If so I'll dig deeper into the motherboard and RAM. One other interesting note is that AMD told me "please update your motherboard BIOS to the latest version with AGESA 1.0.0.6b after installing the CPU", which I have, but it does tell me that something about the BIOS could be impacting this. I've been on 3.30 but I'll try a few other versions.

jstarcher commented on July 23, 2024

A short update on the segfault saga: I've determined that disabling ASLR does indeed work around the segfault issue, or at least makes it so I can't reproduce it with this script. I'm not sure I want to leave it disabled though, as running without it is a small security risk.
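For anyone wanting to check which ASLR mode a machine is in before and after trying this, here is a minimal sketch that reads the same /proc/sys/kernel/randomize_va_space file shown in the earlier log (where it printed 2). The function names are illustrative only.

```python
# Values of kernel.randomize_va_space on Linux.
ASLR_MODES = {
    0: "disabled",
    1: "conservative randomization (stack, mmap base, VDSO)",
    2: "full randomization (also the heap/brk area)",
}


def describe_aslr(raw):
    """Turn the file contents (for example "2\n") into a description."""
    return ASLR_MODES.get(int(raw.strip()), "unknown")


def read_aslr(path="/proc/sys/kernel/randomize_va_space"):
    """Read the current setting; this only works on Linux."""
    with open(path) as f:
        return describe_aslr(f.read())


print(describe_aslr("2\n"))
```

Writing 0 to that file as root (or `sysctl kernel.randomize_va_space=0`) turns ASLR off until the next reboot, which makes it easy to compare runs without a permanent change.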

I also received my RMA replacement from AMD today and it's a 1733SUS. Funny thing is it seems to be very common to get this batch number when you RMA it for this issue so perhaps it's a "known good" batch. I'll get it installed this week and run some tests.

jstarcher commented on July 23, 2024

Finally an end to all this. I was able to complete 12 hours of the ryzen test without any issues using the 1733SUS that AMD sent as a replacement.

It's very peculiar that 3/3 of the retail purchases were bugged but AMD sent me a non-bugged item. I also notice MANY people are getting the 1733SUS back as a replacement. It makes me wonder if this is some sort of golden batch that is known to work and AMD kept them to use for replacements. Meanwhile the other CPUs on the shelf are most likely bugged regardless of the week number, at least in my painful experience.

So the answer is yes, this test is still valid. Thanks for putting this together and shame on AMD for selling known bugged CPUs!

m-r-s commented on July 23, 2024

Thank you for sharing your story.
It is unbelievable that they get away with this...

suaefar commented on July 23, 2024

I will re-open this issue until it finally is no issue anymore...

disturbednny commented on July 23, 2024

I'm witnessing something interesting.

When I run kill-ryzen.sh with no parameters it runs a lot longer before failing, compared to kill-ryzen.sh 4 4, which fails in under two minutes. Is it just how the processor is being stressed that causes the difference in rate of failure? I'm waiting until the 2700X has been out for a while before buying it as my replacement, so that others can test it and make sure the segfault bug doesn't exist in the refresh.

Oxalin commented on July 23, 2024

@disturbednny : what is the exact error you are hitting? Running 4 X 4 means you have four loops with 4 threads each; without any parameters, you are running as many loops as there are threads on your CPU. Each loop will take longer to compile GCC. However, the stress will be similar or a bit higher with the latter. If it takes more time to fail with no parameters, that could indicate a problem with the compilation itself, not with the CPU.
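The way the two positional parameters map to loops and threads can be sketched as below. This is not the script's own code; in particular, the default of one build thread per loop when no parameters are given is an assumption, since the comment only states the default number of loops.

```python
import os


def resolve_params(args, logical_cpus=None):
    """Return (loops, threads_per_loop) for a kill-ryzen.sh style call.

    args: positional parameters, e.g. ["4", "4"] or [].
    logical_cpus: override for the CPU thread count (defaults to this host).
    """
    cpus = logical_cpus or os.cpu_count() or 1
    loops = int(args[0]) if len(args) > 0 else cpus
    # Assumption: one build thread per loop when the second value is omitted.
    threads = int(args[1]) if len(args) > 1 else 1
    return loops, threads


for call in (["4", "4"], []):
    loops, threads = resolve_params(call, logical_cpus=16)
    print(call, "->", loops, "loops x", threads, "threads =", loops * threads)
```

On a 16-thread CPU both calls keep about the same number of compiler processes busy (16), which matches the point that the overall stress is similar while each loop in the default mode takes longer to finish.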

disturbednny commented on July 23, 2024

I'll have to run them again to get the segfault errors, but they are kernel segfault messages that show up when I type dmesg, and they follow the error format in the log entries jstarcher posted, with the line starting with make[5].

disturbednny commented on July 23, 2024

Here's the dmesg output:
[KERN] Apr 15 22:40:00 ubuntu kernel: traps: bash[12803] general protection ip:435bc4 sp:7ffe2774fec0 error:0
[KERN] Apr 15 22:40:05 ubuntu kernel: bash[18352]: segfault at 6e61c4 ip 000000000043d790 sp 00007ffe54c53900 error 6 in bash[400000+100000]

loop-2 log
make[5]: *** [rint.lo] Segmentation fault (core dumped)

loop-0 log
Makefile:761: recipe for target 'set_ui.lo' failed
make[5]: *** [set_ui.lo] Segmentation fault (core dumped)
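Kernel lines like the two above can be picked apart with a couple of regular expressions, which helps when comparing many runs. The patterns below are modelled only on the lines quoted in this thread, and the function names are illustrative. The error value is printed in hex by the kernel; for a segfault on x86 it is the page-fault error code, whose low three bits are decoded here.

```python
import re

SEGFAULT_RE = re.compile(
    r"(?P<proc>[^\s\[]+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<error>[0-9a-f]+)"
)
GP_RE = re.compile(
    r"traps: (?P<proc>[^\s\[]+)\[(?P<pid>\d+)\] general protection "
    r"ip:(?P<ip>[0-9a-f]+) sp:(?P<sp>[0-9a-f]+) error:(?P<error>[0-9a-f]+)"
)


def parse_fault(line):
    """Return a dict describing the fault, or None for unrelated lines."""
    for kind, pattern in (("segfault", SEGFAULT_RE), ("general protection", GP_RE)):
        m = pattern.search(line)
        if m:
            return {"kind": kind, "proc": m["proc"], "pid": int(m["pid"]),
                    "error": int(m["error"], 16)}
    return None


def decode_page_fault(code):
    """Decode the low bits of an x86 page-fault error code (segfault lines only)."""
    return {
        "page_present": bool(code & 1),  # 0 means the page was not mapped
        "write": bool(code & 2),         # 0 means the access was a read
        "user_mode": bool(code & 4),     # 1 means the fault came from user space
    }


line = ("[KERN] Apr 15 22:40:05 ubuntu kernel: bash[18352]: segfault at 6e61c4 "
        "ip 000000000043d790 sp 00007ffe54c53900 error 6 in bash[400000+100000]")
fault = parse_fault(line)
print(fault, decode_page_fault(fault["error"]))
```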

m-r-s commented on July 23, 2024

This looks like you got a faulty Ryzen :(
I wonder how many are still out there producing erroneous results every day...

suaefar commented on July 23, 2024

Probably many.
People still report faulty CPUs as of week 48 in 2017: UA 1748PGS (https://community.amd.com/message/2857007#comment-2857007)
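Since batch codes keep coming up in this thread, here is a small sketch for reading the production year and week out of one, following how the codes are read here (1748PGS is week 48 of 2017, 1733PGS is week 33). The meaning of the three-letter suffix is not explained anywhere in the thread, so it is returned unchanged; the function name is illustrative.

```python
import re

# Two digits of year, two digits of week, then a three-letter suffix.
BATCH_RE = re.compile(r"(?<!\d)(\d{2})(\d{2})([A-Z]{3})(?![A-Z])")


def production_week(code):
    """Return (year, week, suffix) from a code like "UA 1748PGS", else None."""
    m = BATCH_RE.search(code)
    if not m:
        return None
    year, week = 2000 + int(m.group(1)), int(m.group(2))
    if not 1 <= week <= 53:
        return None  # e.g. a model string such as YD1700BBM88AE
    return year, week, m.group(3)


for code in ("UA 1748PGS", "UA1733SUS", "UA 1743SUT"):
    print(code, production_week(code))
```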

disturbednny commented on July 23, 2024

Some good news:

I received my R7 2700X this past Saturday, and successfully ran the kill-ryzen script for 8 hours straight with no segfault. So it looks like it is not present in the R7 and R5 2000 series. RMA'd my 1700 after installing the 2700X, so we'll see what they give me.

m-r-s commented on July 23, 2024

That's good news! I was really hoping that they would get it under control eventually.

7Z0t99 commented on July 23, 2024

Thank you very much for providing this test!
A few days ago, I got a Ryzen 5 1600 (lot 1743SUS), which failed the test in under 3 minutes.
The dealer was kind enough to take it back and let me order a Ryzen 5 2600 (lot 1806SUT) as a replacement, which seems to work just fine.

infoveinx commented on July 23, 2024

Hi. So to be clear, is disabling ASLR the answer to some of these issues? I've had a Ryzen 1700 1733PGS since Oct 2017 and the thing has been nothing but trouble. I'm dealing with this https://bugzilla.kernel.org/show_bug.cgi?id=196683 in addition to the general protection faults.

I am running the latest AGESA. My memory has tested OK. Basically I can reproduce a general protection fault very easily by just running something that uses several threads. For instance, using SaltStack config management commands I could repro a fault just about every time I ran a somewhat intensive job with ASLR on. With ASLR off I get no protection faults.

Thanks!

Oxalin commented on July 23, 2024

@infoveinx : short answer is we don't know. As long as AMD won't recognize and disclose the problem, we can't tell for sure.

protox commented on July 23, 2024

Basically AMD needs to release more info, because I don't think we fully understand this issue. Hopefully the 2700s are fixed, as you say.

My UA1733PGS has been running with no issues since my other thread and upped voltages.

m-r-s commented on July 23, 2024

Increased voltages mean increased power consumption, more heat, higher temperatures and possibly lower performance... it is a workaround but no fix.
Nobody should need to touch the stock voltages to get a stable system.

protox commented on July 23, 2024

Yes there is obviously an inherent issue, wouldn't be surprised if it's a design flaw in the end.

skarr commented on July 23, 2024

suaefar, thanks for the effort from your side to help isolate and reproduce the problem.

I've got a 1700 and three 1800X CPUs. I bought my 1700 in March 2017; I expected problems with a new microarchitecture, as we have seen in the past. I was surprised when I found my 1800X CPUs are from the first week of production, even though I bought them individually between December 2017 and January 2018.

I tried to RMA one (1707) in April 2018. I can't afford to stop using all of them and I wasn't sure if replacements would work. AMD accepted my request (thanks to your "ryzen-test"), however when it came to shipping address I found out that AMD does not support my country. It looks like I am in it for the long run.

Until a month ago I did not have so many problems, but since I started using newer kernels (>= 4.15) I've seen multiple crashes per day, resulting in filesystem corruption beyond the point that fsck can repair. I've tried various distributions/kernels, CPU pinning, and hugepages to try to isolate workloads in virtual machines. Thus far my best was 188 days of uptime on my 1700 using pve-manager/5.0-23/af4267bf (running kernel: 4.10.15-1-pve); it's based on Debian 9.4. A strange observation: my machine with the most memory did the best; it has 4x16 GB RAM.

I know this project aims to determine if your hardware is faulty or not. I'm begging for any advice on a workaround that would make my system(s) stable without spending a huge sum of money i.e. buying new CPUs or Windows license for each machine. Can we pressure AMD to assist kernel devs or provide more information on what they changed between chip revisions?

suaefar commented on July 23, 2024

Hi skarr,

Disabling ASLR, µOP caching, and SMT was on some occasions reported to increase stability.
Depending on your workload, it might also help to pin processes to certain CPUs (with "taskset"), but I am not sure.
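As a sketch of the pinning idea, the helper below builds a taskset command line that restricts a workload to a chosen set of logical CPUs. The helper name and the example CPU numbers are illustrative; which CPUs to pick (for example one thread per core) depends on the machine's topology and is not something this thread settles.

```python
def taskset_command(cpus, command):
    """Build an argv list that runs `command` pinned to the given logical CPUs."""
    cpu_list = ",".join(str(c) for c in sorted(set(cpus)))
    # "taskset -c <list> <command>" takes a CPU list such as "0,2,4".
    return ["taskset", "-c", cpu_list, *command]


argv = taskset_command([2, 0], ["make", "-j2"])
print(" ".join(argv))
```

The resulting list can be passed to `subprocess.run` on a Linux machine with util-linux installed.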

Unfortunately, we know close to nothing because AMD never shared any bit of information on this issue with us.
You bought a product which does not work as expected.
I would simply return it.

skarr commented on July 23, 2024

Thanks for the advice, I really appreciate it.

I found a product errata document by AMD in this post. This one stood out to me:

1109 MWAIT Instruction May Hang a Thread

Description
Under a highly specific and detailed set of internal timing conditions, the MWAIT instruction may cause a thread to hang in SMT (Simultaneous Multithreading) Mode.
Potential Effect on System
The system may hang or reset.
Suggested Workaround
System software may contain the workaround for this erratum.
Fix Planned
No fix planned

I also found new responses in the Kernel.org Bugzilla stating that idle=nomwait fixed all hangs. I am in the process of testing this for myself. My long-term strategy is to try these and other workarounds while I continue the fight with the local suppliers. It looks like reddit users are happy with the RMA process, which is not helping my case. Now, as many people obtain newer CPUs, it looks like this issue is being swept under the carpet: "nothing to see here, please disperse".
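When testing a boot parameter like this, it is worth confirming that the running kernel actually received it. A minimal sketch, assuming a Linux host where /proc/cmdline is readable; the function names are illustrative.

```python
def has_kernel_param(cmdline, param):
    """True if `param` appears as a whole word on the kernel command line."""
    return param in cmdline.split()


def read_cmdline(path="/proc/cmdline"):
    """Return the command line the running kernel was booted with (Linux only)."""
    with open(path) as f:
        return f.read()


example = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro idle=nomwait quiet"
print(has_kernel_param(example, "idle=nomwait"))
```

On a real system, `has_kernel_param(read_cmdline(), "idle=nomwait")` gives the answer for the current boot.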

I will create a new issue/update this one if I find anything useful.

jstarcher commented on July 23, 2024

@skarr have a look at this project: https://github.com/qrwteyrutiyoup/ryzen-stabilizator

Disabling C6 and ASLR, and enabling the power supply idle workaround, helped me. Without these even my replacement “not bugged” CPU had random reboots and soft lockups on Ubuntu. I created a systemd startup unit to make these changes automatically.

TL;DR there’s other problems with first gen Ryzen in Linux outside of this compilation bug :(

infoveinx commented on July 23, 2024

Finally did the RMA. They sent me a UA1733SUS; the one I sent back was a UA1733PGS. The new one is also not behaving. I reset the BIOS and I'm also running the latest BIOS. The only changes I made under the advanced CPU section were to enable typical current idle and SVM. I run about 5 qemu-kvm VMs. Basically the ones that actually do stuff are crashing with kernel panics randomly (just like before). If you let one sit long enough in a panic state without killing it, the host system will eventually have some kind of kernel issue and lock up. Generally the system is close to idle though, as the VMs don't have much activity.

To be clear, I'm now on a 4.17 kernel from Debian Stretch backports. I've used 4.12, 4.13, 4.14, 4.15, and 4.16 prior. All of them were unstable (although once upon a time 4.13 had a long uptime, but that was after disabling C-states both in BIOS and in software, and disabling ASLR). Subsequent kernel versions didn't seem to make a difference even with all of that disabled. Even with 4.13, every so often I would see a VM go to 100% and have to be restarted, though much less frequently. I also had a much older BIOS at that time.

During the RMA process I moved all of the VM images back to an old 2012 Intel i3 that I had used prior. Not a single problem from that system in the week that I ran it and at times under heavy load. I was going to try a bunch of stuff like, CPU pinning with VMs etc, but I've read other folks tried that and it still crashed. I'm not going to continue trying to make this work. I'm just not going to buy AMD ever again. In fact if I must I will purchase older gen processors after doing research to ensure they can handle running VMs under Linux.

This has been a fight since November 2017 and I've lost countless hours. If anyone has suggestions I'm open to them, but at this point it seems like a lost fight and time to move on.

infoveinx commented on July 23, 2024

Update. The initial crashes that I encountered with the replacement CPU were still under 4.16 kernel. The only incident I had with 4.17 kernel was starting the 5 linux VMs simultaneously and then one of them crashed not long after startup.

Something else I did was disable IOMMU in BIOS almost two days ago, and I can't quite recall if I did this prior to the single 4.17 kernel incident. I let it sit mostly idle for a little over one day and didn't experience a crash or idle lockup. Today I tried every method that had crashed it before and it never experienced a single hiccup. I'm not sure what to make of it yet, so I'm going to let it go longer and see what happens. Unfortunately in the past I've seen it crash anywhere from within minutes to multiple days.

infoveinx commented on July 23, 2024

Latest update. System went 7 days without issue but is now back to being completely unstable. During those 7 days I ran it through a gambit of things from normal tasks, to many simultaneous things involving network I/O, disk both SATA and USB 3.0 transfers along with some stress-ng runs. I had 6 Linux vms running on it and it never hinted a single issue. I find it on 7th day locked up with a kernel panic. Since then it is completely unstable, with or without vms running. It's very hard for me to understand how it can run so perfect and the suddenly become so unstable with no changes.

Some thoughts on this. I would assume that if the mobo were bad, the behavior would have shown well before 7 days. The same goes for the PSU. I did run the RAM through an 8 hour memtest some time back and saw no problems. The only conclusion I can come to here is that the Linux kernel itself is just not working well with this CPU for whatever reason. The other thing I wondered is whether something about the mobo is providing incorrect voltages and over time is degrading the CPU in some way.

I can't really deal with this any longer, so I think for now the system will just get shelved and replaced with a previous gen Intel. I've been reading around again and I see several folks who are essentially dealing with the same kind of conditions, i.e. stable, then completely unstable with general protection faults/segfaults, basic system lockups, etc. Yes, it does seem like many people who RMA are getting 2017 Week 33 replacements.

jstarcher commented on July 23, 2024

Have you tried everything outlined here: http://blog.programster.org/stabilizing-ubuntu-16-04-on-ryzen

These changes seemed to really help me. At this point the only thing I get is a soft lockup on occasion, which I’m 99% sure is an Nvidia driver issue. I can ssh to the machine while Xorg is locked up, and I have errors in the Xorg log, but I haven’t seen evidence of CPU instability. Also try with completely stock RAM settings, not an XMP profile. XMP isn’t guaranteed to be stable. Worth upping SoC voltage to 1.1 V if you haven’t yet as well.

infoveinx commented on July 23, 2024

Yeah, thanks. I've tried all of those things and then some. The longest run I had was on kernel 4.13 (I'm running Debian 9.x so I'm using backports to get newer kernels). Basically it came down to disabling C-states in bios, disabling the remaining states via the Zenstates script, disabling ASLR, and finally blacklisting the nouveau driver. I went over 100 days of uptime with that; however, I still had the occasional VM lockup when I would do a heavy file transfer over the network on the host system itself (not a VM on said system). Since then I've updated to the latest bios and iterated over kernels 4.14, 4.15, 4.16, and now 4.17.

I have also tried not using XMP, with no change in results. Running XMP is showing the RAM's rated timings in bios, fwiw. That 100+ days of uptime was also on the bios that came with the mobo, which was quite old and prior to the configurable power supply idle states that AMD added. I haven't had a soft lockup in a very long time; my issues all appear to be related to memory now. I also tried to disable SMT, Opcache, etc. In the end the only way to keep it working was to pass maxcpus=1 to the kernel so that only a single CPU was used. In that case I was able to copy files across the network (Samba) and back and forth to a USB 3.0 drive (as backup) with no crashes.
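For anyone comparing notes on these workarounds, here is a minimal Python sketch that reports whether maxcpus is set on the running kernel's command line and what the current ASLR mode is. It assumes a Linux system exposing /proc/cmdline and /proc/sys/kernel/randomize_va_space; the helper names parse_cmdline and describe are made up for illustration, and quoted parameters containing spaces are not handled.

```python
from pathlib import Path


def parse_cmdline(text: str) -> dict[str, str]:
    """Split a kernel command line into {name: value}; bare flags map to ''."""
    params = {}
    for token in text.split():
        name, _, value = token.partition("=")
        params[name] = value
    return params


def describe(cmdline: str, aslr: str) -> list[str]:
    """Summarize the two settings discussed above as human-readable lines."""
    params = parse_cmdline(cmdline)
    # Meaning of kernel.randomize_va_space: 0 = off, 1 = partial, 2 = full.
    aslr_modes = {"0": "disabled", "1": "partial", "2": "full"}
    return [
        f"maxcpus: {params.get('maxcpus', 'not set')}",
        f"ASLR: {aslr_modes.get(aslr.strip(), 'unknown')}",
    ]


if __name__ == "__main__":
    # Assumption: running on Linux, where both files are readable.
    cmdline = Path("/proc/cmdline").read_text()
    aslr = Path("/proc/sys/kernel/randomize_va_space").read_text()
    print("\n".join(describe(cmdline, aslr)))
```

This only reads the current state; it does not change any setting.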

I'm not running Xorg or any GUI on this system, but I do have a Geforce 210 as the video card. I've yet to see kernel errors related to video. I tried adding voltage slowly and that seemed to increase problems (on the returned CPU), but I am willing to try it again with new CPU.

For clarity here are details of my system.

Gigabyte AB350-Gaming 3 Bios F23d
CORSAIR CX-M Series CX550M 550W PSU
Ryzen 7 1700
G.SKILL Flare X Series 32GB (4 x 8GB) 288-Pin DDR4 SDRAM DDR4 2400
Geforce 210 video card
Intel EXPI9301CTBLK Network Adapter 10/100/1000Mbps PCI-Express
SAMSUNG 850 PRO 512GB SSD
HGST Deskstar NAS 3.5" 10TB x2 running in Raid 1
Debian 9.5 kernel 4.17 running on the Samsung SSD

infoveinx commented on July 23, 2024

Also noticing these in boot log. Don't recall seeing them prior. I know there has been talk of fixes related to this in newer kernels. No idea how it relates but as stated prior I'm already on 4.17 kernel.

[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
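
The same firmware bug message shows up 16 times above, which would fit one line per logical CPU on a 16-thread Ryzen 7 1700, though that mapping is an assumption. A small Python sketch like the following can collapse repeated kernel log lines into counts so a long boot log is easier to scan. The function name summarize and the timestamp pattern are illustrative choices, not part of any existing tool.

```python
import re
import sys
from collections import Counter

# Optional dmesg-style "[    0.123456] " timestamp prefix, stripped before counting.
TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s*")


def summarize(log_text: str) -> Counter:
    """Count how often each distinct log message occurs, ignoring timestamps."""
    counts = Counter()
    for line in log_text.splitlines():
        message = TIMESTAMP.sub("", line.strip())
        if message:
            counts[message] += 1
    return counts


if __name__ == "__main__":
    # Reads log text from standard input and prints the most frequent messages first.
    for message, count in summarize(sys.stdin.read()).most_common():
        print(f"{count:4d}x {message}")
```

Saved under a hypothetical name such as summarize_log.py, it could be fed with `dmesg | python3 summarize_log.py`.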

infoveinx commented on July 23, 2024

Did the following.

  • Reset bios to optimized defaults
  • Turned on SVM
  • Set PSU to typical current
  • Made sure XMP Profile disabled
  • Set VCORE SOC to 1.116 (fluctuates from 1.104 - 1.128)
  • Disabled IOMMU
  • CSM mode with UEFI for boot devices

System is still randomly unstable. Had a hard locked CPU related to KVM. Random segfaults may or may not happen after each reboot while trying to run a command that I know generates them sometimes. It still boggles my mind that it went 7 days with no issue running all kinds of tasks.
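
To put a number on "may or may not happen", one option is to run the suspect command many times and count how often it dies from SIGSEGV. The following is a minimal Python sketch under the assumption of a POSIX system, where subprocess reports a child killed by signal N as return code -N. The function name count_segfaults and the run count of 100 are illustrative, and the command to test is whatever you pass in.

```python
import signal
import subprocess
import sys


def count_segfaults(command: list[str], runs: int) -> int:
    """Run `command` `runs` times and count how many runs were killed by SIGSEGV."""
    segfaults = 0
    for _ in range(runs):
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        # On POSIX, a negative return code -N means the child died from signal N.
        if result.returncode == -signal.SIGSEGV:
            segfaults += 1
    return segfaults


if __name__ == "__main__":
    command = sys.argv[1:]
    if not command:
        sys.exit("usage: count_segfaults.py COMMAND [ARGS...]")
    print(f"{count_segfaults(command, runs=100)} segfaults in 100 runs")
```

A count that changes from boot to boot would match the behavior described above, but a count of zero does not show the system is stable.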

No vms running, got these when I went to copy the qcow2 images to the Raid 1 to start backup and rebuild process with another system.

page:fffff89a1dd21880 count:0 mapcount:0 mapping:0000000000000f00 index:0x1

I rebooted and set SOC back to Auto and then copied the 67G worth of qcow2 images to the Raid 1 with no problem. Maybe there is just some kind of voltage regulator problem here, I'm not sure. I'll mess with it on the side while I have the hopefully stable replacement up.

I put a load of 21 on it last night via converting some h.265 to h.264 video with ffmpeg, all while running other things in a loop to try to break it, and of course it had zero problems. Running VMs though is a matter of time (and much shorter time lately).

m-r-s commented on July 23, 2024

random segfaults may or may not happen after each reboot while trying to run a command that I know generates them sometimes

This sounds familiar...
My advice: "If it does not run stable with stock settings, save the time and RMA it."

infoveinx commented on July 23, 2024

This sounds familiar...
My advice: "If it does not run stable with stock settings, save the time and RMA it."

Unfortunately this is with the RMA CPU. I sent in a 1733PGS and got a 1733SUS back.
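
For anyone collecting batch codes to compare, here is a small Python sketch that splits a code such as 1733PGS into year, week and suffix. Reading the first four digits as two-digit year plus week number is an assumption, chosen because it lines up with the "2017 Week 33" remark earlier in the thread. The meaning of the letter suffix is not interpreted, and parse_batch_code is a made-up helper name.

```python
import re


def parse_batch_code(code: str) -> tuple[int, int, str]:
    """Split a batch code like '1733PGS' into (year, week, suffix).

    Assumption: digits are YYWW (two-digit year, then week of that year).
    """
    match = re.fullmatch(r"(\d{2})(\d{2})([A-Z]+)", code.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized batch code: {code!r}")
    year = 2000 + int(match.group(1))
    week = int(match.group(2))
    if not 1 <= week <= 53:
        raise ValueError(f"week out of range in batch code: {code!r}")
    return year, week, match.group(3)


if __name__ == "__main__":
    for code in ("1733PGS", "1733SUS"):
        year, week, suffix = parse_batch_code(code)
        print(f"{code}: year {year}, week {week}, suffix {suffix}")
```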

jstarcher commented on July 23, 2024

I agree, might be time to RMA the motherboard and/or RAM. Maybe try one stick of ram at a time to try to isolate if you have a bad stick.

v0idwalker commented on July 23, 2024

There is no official information on which CPUs are affected (or not affected).
Your description here does fit the ryzen segfault bug.
Prime95 and Memtest86 are not (as) sensitive to the bug as this workload.
If you hit a segfault that rapidly (less than 5 minutes) and only one or a few processes fail (and some continue running), then your CPU is probably affected.
If all processes fail within a short period it may be due to another problem.

You can still check the build logs in /mnt/ramdisk/workdir for problems other than a faulty CPU.

Hello, just a quick question.
Can memtest86 be influenced by this bug? Can memtest86 trigger this bug and make the RAM look faulty?

m-r-s commented on July 23, 2024

Theoretically, yes.
But one of the particular observations was that memtest86 ran fine on the faulty CPUs while the compilation of GCC failed.

jstarcher commented on July 23, 2024

Same experience here. Memtest ran overnight without finding any errors. This bug requires heavy CPU usage across all threads to trigger, which memtest doesn’t do.

That isn’t to say it is impossible for it to cause it to fail though.
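
To illustrate what "heavy CPU usage across all threads" with a self-check can look like, here is a hedged Python sketch that runs the same deterministic hashing workload on several threads at once and checks that every thread gets the same result. This is not the ryzen-test workload (which, per the discussion above, compiles GCC), and nothing here is known to trigger the segfault bug. It relies on CPython's hashlib releasing the GIL for inputs larger than 2047 bytes, and the names digest_rounds and run_check are made up for illustration.

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor


def digest_rounds(seed: bytes, rounds: int) -> str:
    """Hash a large buffer repeatedly, chaining each digest into the next round."""
    block = seed * 4096  # large enough that hashlib releases the GIL while hashing
    digest = b""
    for _ in range(rounds):
        digest = hashlib.sha256(block + digest).digest()
    return digest.hex()


def run_check(workers: int, rounds: int) -> bool:
    """Run identical work on `workers` threads; True if all results agree."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda _: digest_rounds(b"ryzen", rounds), range(workers)))
    return len(set(results)) == 1


if __name__ == "__main__":
    workers = os.cpu_count() or 1
    ok = run_check(workers=workers, rounds=20000)
    print("all threads agree" if ok else "MISMATCH between threads")
```

A mismatch would point at some kind of instability, but agreement proves nothing about this particular bug.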

v0idwalker commented on July 23, 2024

Well, I am sending my 1700X in for RMA. Meanwhile I borrowed a 2600 and will check if the problem persists.
Was this bug observed on Zen+ too? (I have an ASRock Taichi X470, so there should be no problem with compatibility.)

Also, what is the expected final step of this script?

doug65536 commented on July 23, 2024

If you have this problem today, you can disable the uop cache in AMD CBS settings in your BIOS (UEFI settings). uop as in micro-op, as in mu-op (μop). It hardly reduces performance if you disable it, but it completely fixes this issue. The code has to be tons of huge instructions to even be able to measure a difference in performance. The instruction decoder is so good, you hardly even need the uop cache. I didn't see "uop" mentioned in this thread, hopefully this isn't redundant.
