andikleen / mcelog

Linux kernel machine check handling middleware

Home Page: http://www.mcelog.org

License: GNU General Public License v2.0

Makefile 1.45% C 82.96% Python 0.53% Shell 9.82% Roff 5.25%

machine-check ras memory predictive-failure-analysis intel linux

mcelog's People

Contributors

bldewolf birdcai123 distrotech vertipub dangzhiqiang everpalm yaoyfy bocohan llsdimple harpsichord pokab dllhlx baruch srilakshmidj matthewrsj ugiwgh mjtrangoni wangandrew2016 splbio paulmenzel landscape82 rousya madhutummi neocui yanjason vhulagov lihongguang ptabc evilutopia wbcuc appla zoucao-ali anselmolsm lmingcsce aegl xiaochunlee hygonsoc shubhamnarlawar77 prarit yang-treep konglai2020 jiridluhosrh devninja7-c mingli-yu buffernihility global-localhost global19 global19-atlassian-net havetrytwo yangzz-97 huangxiaodui meow-watermelon clayne pauldx fruitfly638 vfazio taylorj-406 ggvl 120cs0121 listout danielodelgado nanningtigo

mcelog's Issues

mcelog doesn't start

Yesterday I installed mcelog with apt install mcelog, which worked fine. The device /dev/mcelog did not exist afterwards, so I created it manually with mknod /dev/mcelog c 10 227, following the mcelog website.

When I then try to start mcelog with systemctl, I get the following:

systemctl status mcelog.service
● mcelog.service - LSB: Machine Check Exceptions (MCE) collector & decoder
Loaded: loaded (/etc/init.d/mcelog; generated; vendor preset: enabled)
Active: failed (Result: exit-code) since Wed 2018-01-31 08:09:35 CET; 9s ago
Docs: man:systemd-sysv-generator(8)
Process: 26961 ExecStop=/etc/init.d/mcelog stop (code=exited, status=0/SUCCESS)
Process: 26979 ExecStart=/etc/init.d/mcelog start (code=exited, status=1/FAILURE)

Jan 31 08:09:35 NAS systemd[1]: Starting LSB: Machine Check Exceptions (MCE) collector & decoder...
Jan 31 08:09:35 NAS mcelog[26979]: mcelog: Family 6 Model 92 CPU: only decoding architectural errors
Jan 31 08:09:35 NAS mcelog[26983]: Family 6 Model 92 CPU: only decoding architectural errors
Jan 31 08:09:35 NAS mcelog[26979]: Starting Machine Check Exceptions decoder:
Jan 31 08:09:35 NAS systemd[1]: mcelog.service: Control process exited, code=exited status=1
Jan 31 08:09:35 NAS systemd[1]: Failed to start LSB: Machine Check Exceptions (MCE) collector & decoder.
Jan 31 08:09:35 NAS systemd[1]: mcelog.service: Unit entered failed state.
Jan 31 08:09:35 NAS systemd[1]: mcelog.service: Failed with result 'exit-code'.

(root)-(/usr/lib/znc)->journalctl -p err -u mcelog.service
-- Logs begin at Wed 2018-01-17 13:39:01 CET, end at Wed 2018-01-31 08:09:35 CET. --
Jan 31 08:09:27 NAS mcelog[26971]: Family 6 Model 92 CPU: only decoding architectural errors
Jan 31 08:09:27 NAS systemd[1]: Failed to start LSB: Machine Check Exceptions (MCE) collector & decoder.
Jan 31 08:09:35 NAS mcelog[26983]: Family 6 Model 92 CPU: only decoding architectural errors
Jan 31 08:09:35 NAS systemd[1]: Failed to start LSB: Machine Check Exceptions (MCE) collector & decoder.

In mcelog/tests/socket, each trigger is executed twice for an injected socket error

(Observed in RedHat.)
For an error injected in socket space (in mcelog/tests/socket), the trigger script is invoked twice, which causes the test to fail with a complaint about an incorrect number of trigger invocations. Looking at the code, I found that memory_error() in memdb.c really does account socket errors twice:

/* ... */
if (sockdb_enabled) {
	md = get_memdimm(m->socketid, -1, -1, 1);
	account_over(&sockets, md, m, corr_err_cnt);
	account_memdb(&sockets, md, m);
}
/* ... */

This invokes two triggers (provided that thresholds are set appropriately): one from account_over (with note "Fallback memory error"), and another one from account_memdb.
As this is, more or less, the expected behavior, I think the test should be updated to take this into account.

setterm: $TERM is not defined. /etc/cron.hourly/mcelog.cron:

Hello,

I'm running RHEL 5.9 on an Intel Xeon E7-4860 with mcelog-0.9pre-1.32.el5, and I get this error by mail every hour from /etc/cron.hourly/mcelog.cron:
setterm: $TERM is not defined.
setterm: $TERM is not defined.
/etc/cron.hourly/mcelog.cron:

setterm: $TERM is not defined.
Could you help with this error? It is filling up our mailbox.

Thanks in advance

Report to external services

For proper system integration, it would be nice if reading and parsing a text log file were not the only way to learn that something happened.

I'd envision a pipe that some other process can read in binary format with the structures defined in a provided header file. A text message is also fine if it is made in an easily parsed form in a consistent format that is kept as an ABI.

throttling

There are instances when page offlining can't succeed. If this occurs in tandem with a "stuck bit", mcelog repeatedly tries and fails to offline a page. Is there a technique or configuration for giving up on page offlining after N attempts?

Similarly is there a way to stop logging after a certain number of correctable errors?

MCE can't trigger dimm error when error occurs on the purley platform

Kernel crashes from 'make test'

Using the included tests (make test) crashes my system. It's a remote server, so I can't really determine the cause of the panic. I'm not using exactly the standard 'make test', as I'm running under SELinux: all the changes I made can be found at https://github.com/Feandil/lerya.net-overlay/tree/master/app-admin/mcelog (one sed inside the ebuild and various patches in the files directory).

Is this normal? Is there a way to have crash-free tests?

Here are the last logs before the crash:

Jan 14 16:25:26 lerya kernel: [ 1202.285787] soft_offline: 0x1bc3: unknown non LRU page type 100000000000400
Jan 14 16:25:26 lerya kernel: [ 1202.286006] MCE 0x1bc3: non LRU page recovery: Ignored
++++++++++++ running memdb test +++++++++++++++++++
Please delete /tmp/tmp.XdYY3kQOia after you checked /tmp/tmp.XdYY3kQOia/*.log /tmp/tmp.XdYY3kQOia/return
Jan 14 16:25:27 lerya kernel: [ 1203.476569] type=1400 audit(1358177127.877:195): avc: denied { getsched } for pid=3660 comm="mce-inject" ipaddr=82.67.68.201 scontext=staff_u:sysadm_r:mcelog_inject_t tcontext=staff_u:sysadm_r:mcelog_inject_t tclass=process
Jan 14 16:25:27 lerya kernel: [ 1203.476600] Starting machine check poll CPU 0

Jan 14 16:25:27 lerya kernel: [ 1203.476611] Machine check poll done on CPU 0

Processor information (/proc/cpuinfo):
processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 4
model name : Intel(R) Celeron(R) CPU 2.66GHz
stepping : 9
microcode : 0x3
cpu MHz : 2659.972
cache size : 256 KB
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc pebs bts nopl pni dtes64 monitor ds_cpl tm2 cid cx16 xtpr lahf_lm
bogomips : 5319.94
clflush size : 64
cache_alignment : 128
address sizes : 36 bits physical, 48 bits virtual
power management:

Suppress specific messages

It would be great to be able to suppress specific messages.
Reason: mcelog sends me this message seemingly at random:

mcelog: Unsupported new Family 6 Model 3c CPU: only decoding architectural errors

If I understand correctly, this is nothing really bad, right? Of course I want to be informed if something goes wrong with the hardware, but I don't want to get this message every day. Using an email filter is not a very nice solution, and I'm not a Linux expert and don't know how to suppress this with some filter script in cron.
I think it would be nice if users could list exceptions in the config file for messages they don't want sent by mail.
What do you think?

mcelog doesn't detect edac_mce_amd

I have an AMD Ryzen CPU. mcelog doesn't detect edac_mce_amd, even though it is loaded.

mcelog: ERROR: AMD Processor family 23: mcelog does not support this processor.  Please use the edac_mce_amd module instead.
CPU is unsupported

lsmod |grep edac_mce_amd
edac_mce_amd 28672 0

[Web site] Improve SSL configuration

It’d be great if the SSL configuration of the Web server could be improved.

$ curl -I https://www.mcelog.org
curl: (60) SSL certificate problem: unable to get local issuer certificate
More details here: https://curl.haxx.se/docs/sslcerts.html
[…]
$ openssl s_client -showcerts -verify_hostname www.mcelog.org -connect www.mcelog.org:443
[…]

The Qualys SSL Report for the domain grades an F.

So it’d be great if the Web server sent the intermediate certificate.

Let's Encrypt Authority X3
Fingerprint SHA1: e6a3b45b062d509b3382282d196efe97d5956ccb
Pin SHA256: YLh1dUR9y6Kja30RrAn7JKnbQG/uEtLMkBgFF2Fuihg=
RSA 2048 bits (e 65537) / SHA256withRSA

The Web site Cipherli.st also contains SSL configuration snippets for popular HTTPD servers.

mcelog not working on ubuntu kernel

Hi,

I have 3 different machines where I can't get mcelog running. All of them run Ubuntu 12.04. I have also tried your git version and get the same result. How would I go about debugging this issue?

tlb@tlbdesk:/git/webqueue$ cat /proc/cpuinfo|grep 'model name'
model name : Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz
tlb@tlbdesk:/git/webqueue$ sudo mcelog
mcelog: warning: 16 bytes ignored in each record
mcelog: consider an update

tlb@tlbserv:$ cat /proc/cpuinfo|grep 'model name'
model name : Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz
tlb@tlbserv:$ sudo mcelog
[sudo] password for tlb:
mcelog: warning: 16 bytes ignored in each record
mcelog: consider an update

root@web6:/sys/devices/system/machinecheck# cat /proc/cpuinfo|grep 'model name'
model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
tlb@tlbserv:~$ sudo mcelog
[sudo] password for tlb:
mcelog: warning: 16 bytes ignored in each record
mcelog: consider an update

mcelog.c:747: possible off by one error ?

[mcelog.c:747]: (error) Width 100 given in format string (no. 3) is larger than destination buffer 'symbol[100]', use %99s to prevent overflowing it.

Source code is

        n = sscanf(s, "%02x:<%016Lx> {%100s}%n",
               &cs,
               &m.ip,
               symbol, &next);
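
A minimal sketch of the bounded conversion the warning asks for; it is not the actual mcelog patch. The helper name `parse_rip_line` and the `SYMBOL_LEN` macro are made up for illustration. The sketch also reads the symbol with a `%99[^}]` scanset instead of `%s` (an assumption, so that the closing brace is not swallowed into the buffer) and omits the trailing `%n`.

```c
#include <stdio.h>

#define SYMBOL_LEN 100 /* same size as symbol[100] in the report */

/*
 * Hypothetical helper: parse "cs:<ip> {symbol}" from s.
 * The field width is SYMBOL_LEN - 1, which leaves room for the
 * terminating NUL that sscanf always appends after the match.
 * Returns the number of converted fields (3 on success).
 */
int parse_rip_line(const char *s, unsigned *cs,
                   unsigned long long *ip, char symbol[SYMBOL_LEN])
{
    return sscanf(s, "%02x:<%016llx> {%99[^}]}", cs, ip, symbol);
}
```

With a width of 99, a symbol longer than the buffer is truncated rather than written past the end of `symbol`.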

fail to run mce-log test at ubuntu16.04 vm

I use an Ubuntu 16.04 VM for testing; the working directory is mce-log/tests.
With one CPU:

./test cache ""
++++++++++++ running cache test +++++++++++++++++++
mcelog: no process found
<stdin>:5: cpu 1 not online

<stdin>:5: cpu 1 not online

<stdin>:5: cpu 1 not online

Then it hangs with no further output.
With two CPUs:

./test cache ""
++++++++++++ running cache test +++++++++++++++++++
mcelog: no process found
<stdin>:5: larger machine check bank 2 than supported on this cpu (0)

<stdin>:5: larger machine check bank 2 than supported on this cpu (0)

<stdin>:5: larger machine check bank 2 than supported on this cpu (0)

Then it hangs with no further output as well.

mcelog is not logging mem & page errors

Hi,
I tested injecting errors using the scripts in the test folder. The cache error injection seems to trigger logging in syslog, but the memdb and page error injections seem to be ignored by mcelog. They do get reported by the kernel and EDAC. Are there any specific knobs I need to enable to turn on this feature?

P.S. Is there a mailing list for posting questions, or are the GitHub issues the right place to ask? Thanks.

"warning: 8 bytes ignored in each record" should be a critical error

Please correct me if I am wrong, but it seems this "warning" actually prevents mcelog from producing any output. This is not a "warning" but rather a critical error and should be marked as such.

build failure with dash shell

When /bin/sh is pointing to dash or some other non-bash shells, the Makefile code generating version.c fails. The "echo -n" literally prints an unwanted "-n" into the file. This results in the build failure below:

x86_64-pc-linux-gnu-gcc -c -O2 -pipe -march=native -Wall -Wextra -Wno-missing-field-initializers -Wno-unused-parameter -Wstrict-prototypes -Wformat-security -Wmissing-declarations -Wdeclaration-after-statement -o version.o version.c
version.c:1:1: error: expected identifier or ‘(’ before ‘-’ token
-n char version[] = "
^
version.c:1:21: warning: missing terminating " character
-n char version[] = "
^
version.c:1:1: error: missing terminating " character
-n char version[] = "
^
version.c:3:1: warning: missing terminating " character
";
^
version.c:3:1: error: missing terminating " character
make: *** [Makefile:88: version.o] Error 1

AMD CPU Family 15 read error

mcelog uses /proc/cpuinfo to read the CPU family information. However, it reads 21 (decimal) for CPU family 15 (hex) and aborts, saying it doesn't support anything other than 15. This needs to be fixed. I have access to 6 AMD machines, and mcelog runs on none of them because of this.
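
The decimal/hex mix-up described here can be illustrated with a small sketch. It only shows the number-base distinction; it is not mcelog's parsing code, and it says nothing about whether family 15h decoding is actually implemented. The helper names and the `AMD_FAMILY_15H` macro are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

#define AMD_FAMILY_15H 0x15 /* 21 in decimal */

/*
 * Hypothetical helper: extract the family from a /proc/cpuinfo line
 * such as "cpu family\t: 21". The kernel prints the value in decimal,
 * so it is parsed with base 10. Returns -1 if the line does not match.
 */
int parse_cpu_family(const char *line)
{
    const char *colon;

    if (strncmp(line, "cpu family", 10) != 0)
        return -1;
    colon = strchr(line, ':');
    if (colon == NULL)
        return -1;
    return (int)strtol(colon + 1, NULL, 10);
}

/* Compare against the hex constant, not against decimal 15. */
int is_amd_family_15h(const char *line)
{
    return parse_cpu_family(line) == AMD_FAMILY_15H;
}
```

A line reading `cpu family : 15` is family 0xF, which is a different processor generation from family 15h.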

'make test' fails

hi,

I have just compiled mcelog from git on CentOS 6.8, however make test fails:

make test

make -C tests test DEBUG=""
make[1]: Entering directory `/tmp/mcelog/tests'
./test cache ""
++++++++++++ running cache test +++++++++++++++++++
mcelog: no process killed
./inject: line 5: mce-inject: command not found
./inject: line 6: mce-inject: command not found
./inject: line 7: mce-inject: command not found

and it hangs like this.

this is always reproducible.

cheers.

`mcelog --version` returns `mcelog unknown`

$ git describe --tag
v147-6-g9a11988
$ make
$ ./mcelog --version
mcelog unknown

DIMM ID and Channel ID Numbering missing after running the mcelog

Hello All,

To test memory errors, we are using the mce-inject test application. When memory errors are injected with this tool, an alarm is raised once the threshold value for memory errors is exceeded.
But the alarm information is wrong on our board.

Alarm raised on Board1:-
2014 Jun 2 16:39:24 ALARM RAISE SP=70370 MO=/CLA-0 AP=fsClusterId=ClusterRoot SE=0 NINFO="Single bit error threshold
count exceeded on Unit={CPU= ,Cache Level=}" TIME=1401716364342 UTCSHIFT=180

Alarm raised on Board2:-
2014 Jun 2 12:30:02 ALARM RAISE SP=70370 MO=/CLA-0 AP=fsClusterId=ClusterRoot SE=0 NINFO="Single bit error threshold
count exceeded on Unit={DIMM ID=1 ,Channel ID=1}" TIME=1401701402634 UTCSHIFT=180

Why are the DIMM ID and Channel ID missing on Board1? I have run dmidecode on the board; please find the output below.

Eg:-

Handle 0x003F, DMI type 17, 28 bytes
Memory Device
Array Handle: 0x0038
Error Information Handle: Not Provided
Total Width: 128 bits
Data Width: 64 bits
Size: 2048 MB
Form Factor: RIMM
Set: None
Locator: J5A2
Bank Locator: CHANNEL F DIMM 0
Type:
Type Detail: RAMBus Synchronous
Speed: 1067 MHz
Manufacturer: CE00
Serial Number: FFEA0E06
Asset Tag: 0123456789
Part Number: M392B5673EH1-CF8
Rank: Unknown

mcelog read: No such device

mcelog: failed to prefill DIMM database from DMI data
mcelog: mcelog read: No such device
Hardware event. This is not a software error.
MCE 0
CPU 9 BANK 5
MISC 100
TIME 1549963550 Tue Feb 12 17:25:50 2019
MCG status:
MCi status:
Error overflow
Uncorrected error
Error enabled
MCi_MISC register valid
Processor context corrupt
MCA: Internal unclassified error: 405
STATUS fa00000000400405 MCGSTATUS 0
MCGCAP 1000c18 APICID 42 SOCKETID 1
CPUID Vendor Intel Family 6 Model 47

mcelog --client outputs nothing when an error is injected into memory, but the messages log captures the error

We recently tested the RAS memory error injection function on a Lenovo SR630 (Purley platform). The OS is RHEL 8 with kernel-4.18.0-67.el8, and the mcelog version is 159.
The test steps are listed below:

Mount the einj module
linux-1rz0:~ # modprobe einj param_extension=1
linux-1rz0:~ #
Start the mcelog daemon
linux-1rz0:~ # mcelog --daemon
linux-1rz0:~ #
Check whether the einj module loaded successfully
linux-1rz0:~ # cd /sys/kernel/debug/apei/einj/
linux-1rz0:/sys/kernel/debug/apei/einj #
linux-1rz0:/sys/kernel/debug/apei/einj # ls
available_error_type error_inject error_type flags notrigger param1 param2 param3 param4 vendor vendor_flags
linux-1rz0:/sys/kernel/debug/apei/einj #

Inject an uncorrectable error into the memory mirror range
linux-1rz0:/sys/kernel/debug/apei/einj # echo 0x10 > error_type
linux-1rz0:/sys/kernel/debug/apei/einj # echo 0x12345 > param1
linux-1rz0:/sys/kernel/debug/apei/einj #
linux-1rz0:/sys/kernel/debug/apei/einj # echo 0xfffffffffffff000 > param2
linux-1rz0:/sys/kernel/debug/apei/einj #
linux-1rz0:/sys/kernel/debug/apei/einj # echo 1 > error_inject
linux-1rz0:/sys/kernel/debug/apei/einj #
linux-1rz0:/sys/kernel/debug/apei/einj # echo 1 > notrigger
linux-1rz0:/sys/kernel/debug/apei/einj #

Below is some information about the outcome:

[root@rhel8-ose-test rastools]# systemctl status mcelog
● mcelog.service - Machine Check Exception Logging Daemon
Loaded: loaded (/usr/lib/systemd/system/mcelog.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2019-01-30 02:22:29 EST; 51min ago
Main PID: 1177 (mcelog)
Tasks: 1 (limit: 26213)
Memory: 856.0K
CGroup: /system.slice/mcelog.service
└─1177 /usr/sbin/mcelog --ignorenodev --daemon --foreground
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: Error enabled
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MCi_MISC register valid
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MCi_ADDR register valid
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: SRAR
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MCA: Data CACHE Level-0 Data-Read Error
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: STATUS bd80000000100134 MCGSTATUS f
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MCGCAP f000c14 APICID 17 SOCKETID 0
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: PPIN 2f5f92f94c7e6989
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MICROCODE 2000055
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: CPUID Vendor Intel Family 6 Model 85

[root@rhel8-ose-test rastools]#tail -f /var/log/dmesg
Jan 30 02:29:26 rhel8-ose-test kernel: mce: Uncorrected hardware memory error in user-access at 6696d1040
Jan 30 02:29:26 rhel8-ose-test kernel: mce: [Hardware Error]: Machine check events logged
Jan 30 02:29:26 rhel8-ose-test kernel: Memory failure: 0x6696d1: Killing einj_mem_uc:8974 due to hardware memory corruption
Jan 30 02:29:26 rhel8-ose-test kernel: Memory failure: 0x6696d1: recovery action for dirty LRU page: Recovered
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: Hardware event. This is not a software error.
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MCE 0
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: CPU 21 BANK 1 TSC 8a4ce5d5aa0
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: RIP 33:403c4b
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MISC 86 ADDR 6696d1040
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: TIME 1548833366 Wed Jan 30 02:29:26 2019
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MCG status:RIPV EIPV MCIP LMCE
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MCi status:
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: Uncorrected error
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: Error enabled
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MCi_MISC register valid
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MCi_ADDR register valid
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: SRAR
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MCA: Data CACHE Level-0 Data-Read Error
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: STATUS bd80000000100134 MCGSTATUS f
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MCGCAP f000c14 APICID 17 SOCKETID 0
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: PPIN 2f5f92f94c7e6989
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MICROCODE 2000055
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: CPUID Vendor Intel Family 6 Model 85

But when we execute mcelog --client, there is no output. So I studied the mcelog code and found that the report is blocked by the MCE status bit settings. The relevant part of the mcelog source is listed below:

static int intel_memory_error(struct mce *m, unsigned recordlen)
{
	u32 mca = m->status & 0xffff;
	if ((mca >> 7) == 1) {
		unsigned corr_err_cnt = 0;
		int channel[2] = { (mca & 0xf) == 0xf ? -1 : (int)(mca & 0xf), -1 };
		int dimm[2] = { -1, -1 };
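
To make the check in the quoted snippet concrete, here is a small standalone sketch that applies the same two expressions to a raw STATUS value. The function names are hypothetical and the code is not taken from mcelog. For the STATUS logged above (bd80000000100134), the low 16 bits are 0x0134, and 0x0134 >> 7 is 2, so the `(mca >> 7) == 1` branch is not taken.

```c
#include <stdint.h>

/*
 * Same test as the quoted snippet: the low 16 bits of the status word
 * hold the MCA error code, and the branch is only entered when
 * (mca >> 7) == 1.
 */
int is_memory_error_code(uint64_t status)
{
    uint32_t mca = status & 0xffff;

    return (mca >> 7) == 1;
}

/* Channel as computed in the snippet: low nibble, with 0xf mapped to -1. */
int memory_error_channel(uint64_t status)
{
    uint32_t mca = status & 0xffff;

    return (mca & 0xf) == 0xf ? -1 : (int)(mca & 0xf);
}
```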

Injected uncorrectable errors are not showing up

All other injected errors are reported by mcelog (bus/IOMCA, unknown, correctable DIMM/Socket, cache, page), but uncorrectable errors are not reported.
mcelog is running in daemon mode on a Broadwell-DE (Linux 3.19.5).

I am using the following to generate an uncorrectable error.
GENMEM 0 1 0 1 1 | mce-inject

With tolerant level at default of 1, the system panics as expected.

With a tolerant level of 3, no uncorrectable errors are reported by mcelog:
echo 3 > /sys/devices/system/machinecheck/machinecheck0/tolerant

retrieve hardware failure information from the mcelog socket

Hi,

I'm writing a small program to detect hardware failure and would like to use mcelog socket to retrieve error messages. However, it seems the socket just returns "done" but nothing else.

[server] 
# socat -d -d TCP-LISTEN:8080,fork UNIX:/var/run/mcelog-client 

[client] 
# echo "pages" | nc localhost 8080
done
# echo "dump bios all" | nc localhost 8080
done

I expected to see output like that described at http://mcelog.org/protocol.html. Is this normal behavior? If yes, is there any way to retrieve hardware failures from mcelog?

Thanks.

How to decode error?

The following was logged by mcelog.

2017-05-05T08:57:16+02:00 avaritia mcelog: MCG status:
2017-05-05T08:57:16+02:00 avaritia mcelog: MCi status:
2017-05-05T08:57:16+02:00 avaritia mcelog: Corrected error
2017-05-05T08:57:16+02:00 avaritia mcelog: Error enabled
2017-05-05T08:57:16+02:00 avaritia mcelog: MCA: MEMORY CONTROLLER GEN_CHANNELunspecified_ERR
2017-05-05T08:57:16+02:00 avaritia mcelog: Transaction: Generic undefined request
2017-05-05T08:57:16+02:00 avaritia mcelog: STATUS 900000400009008f MCGSTATUS 0
2017-05-05T08:57:16+02:00 avaritia mcelog: MCGCAP 1000c18 APICID 80 SOCKETID 2 
2017-05-05T08:57:16+02:00 avaritia mcelog: CPUID Vendor Intel Family 6 Model 47
2017-05-05T08:57:16+02:00 avaritia mcelog: Hardware event. This is not a software error.
2017-05-05T08:57:16+02:00 avaritia mcelog: MCE 15
2017-05-05T08:57:16+02:00 avaritia mcelog: CPU 2 BANK 9 
2017-05-05T08:57:16+02:00 avaritia mcelog: TIME 1493932594 Thu May  4 23:16:34 2017

Reading the FAQ, I understand that I should

$ ls -l /dev/mcelog
crw------- 1 root system 10, 227 May  7 12:18 /dev/mcelog
$ more /dev/shm/test.txt
2017-05-05T08:57:16+02:00 avaritia mcelog: MCG status:
2017-05-05T08:57:16+02:00 avaritia mcelog: MCi status:
2017-05-05T08:57:16+02:00 avaritia mcelog: Corrected error
2017-05-05T08:57:16+02:00 avaritia mcelog: Error enabled
2017-05-05T08:57:16+02:00 avaritia mcelog: MCA: MEMORY CONTROLLER GEN_CHANNELunspecified_ERR
2017-05-05T08:57:16+02:00 avaritia mcelog: Transaction: Generic undefined request
2017-05-05T08:57:16+02:00 avaritia mcelog: STATUS 900000400009008f MCGSTATUS 0
2017-05-05T08:57:16+02:00 avaritia mcelog: MCGCAP 1000c18 APICID 80 SOCKETID 2 
2017-05-05T08:57:16+02:00 avaritia mcelog: CPUID Vendor Intel Family 6 Model 47
2017-05-05T08:57:16+02:00 avaritia mcelog: Hardware event. This is not a software error.
2017-05-05T08:57:16+02:00 avaritia mcelog: MCE 15
2017-05-05T08:57:16+02:00 avaritia mcelog: CPU 2 BANK 9 
2017-05-05T08:57:16+02:00 avaritia mcelog: TIME 1493932594 Thu May  4 23:16:34 2017
$ sudo ./mcelog --client < /dev/shm/test.txt
mcelog: client connect: No such file or directory
mcelog: client command write: Transport endpoint is not connected
mcelog: client read: Invalid argument
mcelog: client connect: No such file or directory
mcelog: client command write: Transport endpoint is not connected
mcelog: client read: Invalid argument

I am probably confusing something, so sorry in advance.
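
The STATUS word in the log can also be read by hand. The sketch below assumes the architectural MCi_STATUS layout from the Intel SDM (flag bits 63 down to 57, MCA error code in the low 16 bits); the macro and function names are illustrative and not taken from mcelog. For STATUS 900000400009008f, only the valid and enabled flags are set, which matches the "Corrected error" and "Error enabled" lines, and the MCA error code is 0x008f.

```c
#include <stdint.h>

/* Architectural MCi_STATUS flag bits (per the Intel SDM). */
#define MCI_STATUS_VAL   (1ULL << 63) /* register contents valid */
#define MCI_STATUS_OVER  (1ULL << 62) /* error overflow */
#define MCI_STATUS_UC    (1ULL << 61) /* uncorrected error */
#define MCI_STATUS_EN    (1ULL << 60) /* error reporting enabled */
#define MCI_STATUS_MISCV (1ULL << 59) /* MCi_MISC register valid */
#define MCI_STATUS_ADDRV (1ULL << 58) /* MCi_ADDR register valid */
#define MCI_STATUS_PCC   (1ULL << 57) /* processor context corrupt */

/* Returns nonzero if the given flag bit is set in status. */
int status_has(uint64_t status, uint64_t flag)
{
    return (status & flag) != 0;
}

/* Low 16 bits: the MCA error code. */
uint16_t status_mca_code(uint64_t status)
{
    return (uint16_t)(status & 0xffff);
}
```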

Error count

I am a bit confused with the reporting of mcelog --client.

Either my understanding is wrong or there is something odd with how the total number of errors is counted:

Memory errors
SOCKET 1 CHANNEL any DIMM any
corrected memory errors:
2654 total
100 in 24h
uncorrected memory errors:
0 total
0 in 24h

SOCKET 1 CHANNEL 3 DIMM any
corrected memory errors:
1003 total
10 in 24h
uncorrected memory errors:
0 total
0 in 24h

SOCKET 1 CHANNEL 3 DIMM 0
corrected memory errors:
1651 total
10 in 24h
uncorrected memory errors:
0 total
0 in 24h

Shouldn't the number of corrected errors on SOCKET 1 CHANNEL 3 DIMM any be the same as on SOCKET 1 CHANNEL 3 DIMM 0 if only one DIMM is failing?

Can you enlighten me on this? I am afraid I misunderstand the meaning of the output.

mcelog triggers not firing?

Hi Andi. Thanks for mcelog.

I am injecting errors with mce-inject, but I can see no evidence that triggers are being executed. Built mcelog from latest git. Injected many many mces (threshold is 10/24h).

Should I expect triggers to work when injecting fake mces?

I created .local triggers for dimm/socket/page which touched files in /tmp, wrote to /proc/kmsg, etc. Nothing appeared. I ran "strace -f mcelog" and see no evidence that it forks any triggers.

cat m1
CPU 0 BANK 4
STATUS CORRECTED
ADDR 0xabcd

mce-inject < m1 #ran this command dozens of times.

Kernel: 3.0.42

/etc/mcelog/mcelog.conf:
daemon = yes
filter = yes
raw = yes
syslog = no
no-syslog = yes
logfile = /dev/kmsg
[server]
client-user = root
[dimm]
dimm-tracking-enabled = yes
dmi-prepopulate = yes
uc-error-threshold = 1 / 24h
ce-error-threshold = 10 / 24h
[socket]
socket-tracking-enabled = yes
mem-uc-error-threshold = 10 / 24h
mem-ce-error-trigger = socket-memory-error-trigger
mem-ce-error-threshold = 10 / 24h
mem-ce-error-log = yes
[cache]
cache-threshold-trigger = cache-error-trigger
cache-threshold-log = yes
[page]
memory-ce-threshold = 10 / 24h
memory-ce-trigger = page-error-trigger
memory-ce-log = yes
memory-ce-action = soft-then-hard
[trigger]
children-max = 2
directory = /etc/mcelog

leaky bucket

The recent leaky bucket update looks wrong to me. Testing with mce-inject shows that the threshold is exceeded on every event up to the bucket capacity. This is because __bucket_account() changed from >= to < in its comparison.

I have a simple fix, but I'm not clear on the correct repeated threshold behavior if the bucket fills faster than it ages. What is the purpose of "excess"?

Please add domains {,www.}mcelog.org to certificate

Currently, the browser shows the warning below.

www.mcelog.org uses an invalid security certificate.

The certificate is only valid for the following names:
firstfloor.org, www.firstfloor.org

Error code: SSL_ERROR_BAD_CERT_DOMAIN

It’d be great if the domains could be added to the certificate.

mcelog does not start on Haswell-EP (bisected)

Running strace mcelog --daemon fails with the following error:

[...]
connect(6, {sa_family=AF_LOCAL, sun_path="/dev/log"}, 110) = -1 ENOENT (No such file or directory)
close(6) = 0
fstat(3, {st_mode=S_IFREG|0644, st_size=3997, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f040abc1000
open("/sys/devices/system/cpu", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 6
fstat(6, {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
getdents(6, /* 22 entries */, 32768) = 624
getdents(6, /* 0 entries */, 32768) = 0
close(6) = 0
open("/dev/cpu/0/msr", O_RDWR) = 6
pread64(6, 0x7ffe6c0b8320, 8, 383) = -1 EIO (Input/output error)
socket(PF_LOCAL, SOCK_DGRAM|SOCK_CLOEXEC, 0) = 7
connect(7, {sa_family=AF_LOCAL, sun_path="/dev/log"}, 110) = -1 ENOENT (No such file or directory)
close(7) = 0
write(3, "Kernel does not support page off"..., 126) = 126
exit_group(1) = ?
+++ exited with 1 +++

I've bisected the problem. The offending commit is 0bcf16f which seems to match my hardware:

Xeon E5-1620 v3
Supermicro X10SRL-F, BIOS v2.0
The kernel version is 4.3.5.

Use secure HTTPS in GitHub description

Could the URL in the GitHub project description please be changed to the more secure HTTPS URL?

Why is the THRESHOLD env. variable increasing with the number of errors?

Hello,

I would like to ask why mcelog sets the THRESHOLD environment variable the way it does.

One system had a hardware-related problem that triggered a lot of correctable errors. The logging was odd: as you can see below, the logged threshold was always the same as the number of correctable errors, i.e. the threshold value kept increasing with the number of correctable errors!
Log snippet (a few minutes elapsed between each line):

MCE-SOCKET-TRIGGER: Socket  0  , Correctable  19899  , Uncorrectable  0  , Treshold  19899 in 24h
[...]
MCE-SOCKET-TRIGGER: Socket  0  , Correctable  12559069  , Uncorrectable  0  , Treshold  12559069 in 24h
[...]
MCE-SOCKET-TRIGGER: Socket  0  , Correctable  432343640  , Uncorrectable  0  , Treshold  432343640 in 24h
[...]
MCE-SOCKET-TRIGGER: Socket  0  , Correctable  874393634  , Uncorrectable  0  , Treshold  874393634 in 24h

From checking the mcelog source code, I understand that this behavior may be by design. Is it really working as designed? What is the purpose of logging a threshold that increases continuously with the number of errors? Shouldn't the threshold be logged as a different value, such as the bucket size?

As I understand it, the code works as follows:
When a predefined per-socket threshold is exceeded, mcelog calls the "socket-memory-error-trigger.local" script. This script produces the printout above, in which the threshold increases continuously with the error count:

echo "MCE-SOCKET-TRIGGER: Socket " $SOCKETID " , Correctable " $CECOUNT " , Uncorrectable " $UCCOUNT " , Treshold " $THRESHOLD > $MCE_LOG_FILE

This script prints environment variables such as THRESHOLD. Since the same number is logged for the correctable count and the threshold, I think CECOUNT equals the THRESHOLD environment variable in this case.

Environment variables are set by memdb.c :: memdb_trigger(). It sets the THRESHOLD env. variable to the value of the thresh variable, which in turn is set by leaky-bucket.c :: bucket_output() to b->count + b->excess.
So what we see logged as the threshold is the bucket’s count plus the bucket’s excess value.

When an error arrives, leaky-bucket.c :: bucket_account() is called. It increases the bucket’s count value. If count reaches the capacity of the bucket, excess is increased by count, and count is then reset to zero.
For example, with a bucket capacity of 10: each incoming error increases count. After 10 errors, count reaches 10, so excess is increased by 10 and becomes 10, and count is reset to 0. More errors arrive: after another 10 errors, count reaches 10 again, excess is increased by 10 and becomes 20, and count is reset to 0.
Since the THRESHOLD env. variable is set to count+excess, it always equals the number of errors registered since the bucket was first initialized, so it increases continuously with the number of detected errors. I therefore don’t see how the printed THRESHOLD environment variable is actually a threshold.
Is this really how mcelog is supposed to calculate and set this value?
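
The accounting described in this report can be reproduced with a simplified model. This is a sketch of the behavior as described here, not mcelog's leaky-bucket.c: aging of the bucket over time is left out, and the struct and function names are hypothetical.

```c
/* Simplified model of the described accounting; aging is omitted. */
struct bucket_model {
    unsigned count;    /* errors since the last overflow */
    unsigned excess;   /* errors moved out of count on overflow */
    unsigned capacity; /* configured threshold, e.g. 10 */
};

/* Account one error. Returns 1 when the bucket just overflowed. */
int model_account(struct bucket_model *b)
{
    b->count++;
    if (b->count >= b->capacity) {
        b->excess += b->count;
        b->count = 0;
        return 1;
    }
    return 0;
}

/* The value that, per the report, ends up in THRESHOLD. */
unsigned model_reported_threshold(const struct bucket_model *b)
{
    return b->count + b->excess;
}
```

In this model, count + excess always equals the total number of accounted errors, regardless of the capacity, which is the effect the log lines show.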

Thanks,
Ádám Szabó

mcelog client fails with "permission denied"

There is an issue on Linux systems running kernel XXX or higher (any kernel where https://android.googlesource.com/kernel/common.git/+/16e5726269611b71c930054ffe9b858c1cea88eb has been applied): the mcelog client randomly fails with a "permission denied" error.

Here is how it happens:

1. mcelog daemon creates a unix streaming socket with listen()
2. mcelog client connects to it
3. mcelog daemon calls accept()
4. mcelog daemon calls setsockopt(SO_PASSCRED) and asks the kernel to pass uid/gid of the client
5. mcelog client sends its commands
6. mcelog daemon receives commands, checks uid to be 0, responds, all good

It is all good when it works, but there is a race between steps 4 and 5. If the client is too fast, the message arrives without credentials and the kernel just sets uid=nobody. The mcelog daemon checks that and denies the request.

After the initial kernel commit, a follow-up fix landed in 90c6bd34f884cd9cee21f1d152baf6c18bcac949 (described here: https://lists.ubuntu.com/archives/kernel-team/2013-October/033188.html). With that fix in place, changing the mcelog code to set SO_PASSCRED on the listening socket would close the race.
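
A minimal sketch of the proposed ordering, assuming Linux and assuming (as this issue describes the kernel fix) that a socket returned by accept() carries over the option from the listening socket. The function names are hypothetical and this is not mcelog's code; it only shows SO_PASSCRED being set right after socket(), before bind() and listen().

```c
#define _GNU_SOURCE
#include <sys/socket.h>
#include <unistd.h>

/* Create a unix stream socket with SO_PASSCRED already enabled.
 * Returns the descriptor, or -1 on error. The caller would go on
 * to bind(), listen() and accept() on it. */
int make_passcred_socket(void)
{
	int on = 1;
	int fd = socket(AF_UNIX, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;
	/* Set the option before any client can connect, instead of
	 * after accept() as in step 4 above. */
	if (setsockopt(fd, SOL_SOCKET, SO_PASSCRED, &on, sizeof on) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/* Report whether SO_PASSCRED is set on fd: 1, 0, or -1 on error. */
int passcred_enabled(int fd)
{
	int value = 0;
	socklen_t len = sizeof value;

	if (getsockopt(fd, SOL_SOCKET, SO_PASSCRED, &value, &len) < 0)
		return -1;
	return value != 0;
}
```

The second helper is only there to check the result; a daemon would normally not need it.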

non-standard snprintf use

bitfield.c and bitfield.h use the conversion specifier %Lu, a GNUism for %llu that alternative C libraries such as musl libc do not implement. The ISO C and POSIX-compatible %llu and %llx should be used instead.
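
As a small illustration of the portable form (the helper names are hypothetical and not taken from bitfield.c), an unsigned long long can be formatted with the standard length modifier like this:

```c
#include <stdio.h>

/* Format v in decimal using the ISO C "ll" length modifier.
 * Returns what snprintf() returns: the length of the full result. */
int format_dec(char *buf, size_t size, unsigned long long v)
{
	return snprintf(buf, size, "%llu", v);
}

/* Same value in hexadecimal, again with the portable "%llx". */
int format_hex(char *buf, size_t size, unsigned long long v)
{
	return snprintf(buf, size, "%llx", v);
}
```

Both specifiers are defined by ISO C99 and POSIX, so the same code should behave identically on glibc and musl.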

Please link to https://git.kernel.org

The download page links to the Linux kernel cgit instance.

The latest changes are shown in gitweb

This site is available over HTTPS, so it’d be great if you could update the link.

NUC7PJYH (J5005) - mce: [Hardware Error]

Description of problem:

From time to time the system becomes unstable and several applications report
strange exceptions (not application errors, but system/hardware errors).

dmesg reports this message:

$ dmesg | grep mce
[    0.039982] mce: CPU supports 7 MCE banks
[    0.060928] mce: [Hardware Error]: Machine check events logged
[    0.060932] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: a600000000020408
[    0.060941] mce: [Hardware Error]: TSC 0 ADDR fef4c9e0 
[    0.060949] mce: [Hardware Error]: PROCESSOR 0:706a1 TIME 1530266046 SOCKET 0 APIC 0 microcode 22

$ cat /proc/cpuinfo 
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 122
model name	: Intel(R) Pentium(R) Silver J5005 CPU @ 1.50GHz
stepping	: 1
microcode	: 0x22
cpu MHz		: 1997.494
cache size	: 4096 KB
physical id	: 0
siblings	: 4
core id		: 0
cpu cores	: 4
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 24
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave rdrand lahf_lm 3dnowprefetch cpuid_fault cat_l2 pti cdp_l2 ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust smep erms mpx rdt_a rdseed smap clflushopt intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts umip rdpid arch_capabilities
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass
bogomips	: 2995.20
clflush size	: 64
cache_alignment	: 64
address sizes	: 39 bits physical, 48 bits virtual
power management:

processor	: 1
vendor_id	: GenuineIntel
cpu family	: 6
model		: 122
model name	: Intel(R) Pentium(R) Silver J5005 CPU @ 1.50GHz
stepping	: 1
microcode	: 0x22
cpu MHz		: 1595.178
cache size	: 4096 KB
physical id	: 0
siblings	: 4
core id		: 1
cpu cores	: 4
apicid		: 2
initial apicid	: 2
fpu		: yes
fpu_exception	: yes
cpuid level	: 24
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave rdrand lahf_lm 3dnowprefetch cpuid_fault cat_l2 pti cdp_l2 ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust smep erms mpx rdt_a rdseed smap clflushopt intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts umip rdpid arch_capabilities
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass
bogomips	: 2995.20
clflush size	: 64
cache_alignment	: 64
address sizes	: 39 bits physical, 48 bits virtual
power management:

processor	: 2
vendor_id	: GenuineIntel
cpu family	: 6
model		: 122
model name	: Intel(R) Pentium(R) Silver J5005 CPU @ 1.50GHz
stepping	: 1
microcode	: 0x22
cpu MHz		: 2559.706
cache size	: 4096 KB
physical id	: 0
siblings	: 4
core id		: 2
cpu cores	: 4
apicid		: 4
initial apicid	: 4
fpu		: yes
fpu_exception	: yes
cpuid level	: 24
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave rdrand lahf_lm 3dnowprefetch cpuid_fault cat_l2 pti cdp_l2 ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust smep erms mpx rdt_a rdseed smap clflushopt intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts umip rdpid arch_capabilities
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass
bogomips	: 2995.20
clflush size	: 64
cache_alignment	: 64
address sizes	: 39 bits physical, 48 bits virtual
power management:

processor	: 3
vendor_id	: GenuineIntel
cpu family	: 6
model		: 122
model name	: Intel(R) Pentium(R) Silver J5005 CPU @ 1.50GHz
stepping	: 1
microcode	: 0x22
cpu MHz		: 2607.391
cache size	: 4096 KB
physical id	: 0
siblings	: 4
core id		: 3
cpu cores	: 4
apicid		: 6
initial apicid	: 6
fpu		: yes
fpu_exception	: yes
cpuid level	: 24
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave rdrand lahf_lm 3dnowprefetch cpuid_fault cat_l2 pti cdp_l2 ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust smep erms mpx rdt_a rdseed smap clflushopt intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts umip rdpid arch_capabilities
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass
bogomips	: 2995.20
clflush size	: 64
cache_alignment	: 64
address sizes	: 39 bits physical, 48 bits virtual
power management:

I tried several kernels (4.15.0-23; 4.17.0-041700; 4.17.1-041701; 4.17.2-041702; 4.18.0-041800rc1; 4.18.0-041800rc2) - and even the last available

$ cat /proc/version
Linux version 4.18.0-041800rc2-generic (root@ubuntu) (gcc version 7.3.0 (Ubuntu 7.3.0-16ubuntu3)) #201806241430 SMP Fri Jun 29 09:34:50 CEST 2018

List of installed microcode (microcode-20180425):

$ ls /lib/firmware/intel-ucode/
06-03-02  06-06-05  06-08-01  06-0a-01  06-0f-02  06-16-01  06-1c-02  06-26-01  06-3a-09.initramfs  06-3e-06            06-45-01            06-4e-03            06-56-03  06-8e-09  0f-00-0a  0f-02-09  0f-04-04  0f-06-04
06-05-00  06-06-0a  06-08-03  06-0b-01  06-0f-06  06-17-06  06-1c-0a  06-2a-07  06-3c-03            06-3e-07            06-45-01.initramfs  06-4f-01.initramfs  06-56-04  06-8e-0a  0f-01-02  0f-03-02  0f-04-07  0f-06-05
06-05-01  06-06-0d  06-08-06  06-0b-04  06-0f-07  06-17-07  06-1d-01  06-2d-06  06-3c-03.initramfs  06-3f-02            06-46-01            06-55-03            06-56-05  06-9e-09  0f-02-04  0f-03-03  0f-04-08  0f-06-08
06-05-02  06-07-01  06-08-0a  06-0d-06  06-0f-0a  06-17-0a  06-1e-05  06-2d-07  06-3d-04            06-3f-02.initramfs  06-46-01.initramfs  06-55-04            06-5c-09  06-9e-0a  0f-02-05  0f-03-04  0f-04-09
06-05-03  06-07-02  06-09-05  06-0e-08  06-0f-0b  06-1a-04  06-25-02  06-2f-02  06-3d-04.initramfs  06-3f-04            06-47-01            06-56-02            06-5e-03  06-9e-0b  0f-02-06  0f-04-01  0f-04-0a
06-06-00  06-07-03  06-0a-00  06-0e-0c  06-0f-0d  06-1a-05  06-25-05  06-3a-09  06-3e-04            06-3f-04.initramfs  06-47-01.initramfs  06-56-02.initramfs  06-7a-01  0f-00-07  0f-02-07  0f-04-03  0f-06-02

CPU supported :

mcelog$ if ./mcelog --is-cpu-supported; then echo "CPU is supported!"; else echo "No luck!"; fi
mcelog: Family 6 Model 122 CPU: only decoding architectural errors
CPU is supported!

How reproducible:
Difficult to describe. Some applications seem to mess up the system.
The system immediately seems to get into an unstable state. Applications
start showing strange errors with some indications that the
system/hardware may have a problem.

mcelog fails on centos 7 with systemd, but works when executed manually.

https://gist.github.com/anonymous/6862f71fc60cf184491a

^ Here is the log showing systemd trying to start it.

If I do this though: /usr/sbin/mcelog --daemon --foreground --config-file /etc/mcelog/mcelog.conf

It works fine.

/etc/mcelog/mcelog.setup: https://gist.github.com/anonymous/8e71915d69a1e1055def has a reference to you. It says:
"""
An upstream kernel bug prevents mcelog from starting normally in
daemon mode the first time it is run. So, in the systemd service,
we want to start it twice - one as a ExecStartPre that will fail.
But systemd will abort the process if the "pre" fails, so we use
this script - temporarily - to start the first process.
Waiting on Andi Kleen to fix upstream.
"""

Is this fixed in upstream already? I tried a fedora 21 rpm from here: https://kojipkgs.fedoraproject.org//packages/mcelog/101/1.9bfaad8f92c5.fc21/x86_64/mcelog-101-1.9bfaad8f92c5.fc21.x86_64.rpm

This rpm uses this commit: 9bfaad8 which is about 1.5 months old relative to your latest commit in master.

Any idea on how I can fix this so service mcelog start, or service mcelog status using systemd will work? I'm going to have hundreds of machines running mcelog through puppet, and it has to be managed as a service.

Review Debian patches

Hello. The Debian mcelog package has quite a lot of patches, all of them seem useful to all users. Please review them and apply ones that are useful.

http://anonscm.debian.org/cgit/collab-maint/mcelog.git/tree/debian/patches

Segmentation fault of the mcelog daemon on Broadwell E5-2660V4

The mcelog daemon crashes with a segmentation fault on our Broadwell E5-2660V4 processors. We see the following kernel messages on hosts with ECC memory errors:

Mar 13 15:17:17 server kernel: [22312023.287737] mcelog[473830]: segfault at 80 ip 000000000040aac8 sp 00007fffa4bec658 error 4 in mcelog[400000+15000]
Mar 14 10:23:54 server kernel: [22380816.522131] mcelog[169928]: segfault at 70 ip 000000000040aac8 sp 00007ffc191d87e8 error 4 in mcelog[400000+15000]

We use version 153, but the problem also reproduces on the latest version, 156. The kernel version is 4.4.88-42.

To simplify analysis, we compiled the binary mcelog.debug with the -g3 -O0 options and captured a core dump. Both files are in mcelog-crash.zip.

mcelog test failures

I have run the mcelog tests on two test machines in my lab: an Intel Atom and an Intel Xeon. Both have a couple failures (not bad).

Atom:

[root@atompc tests]# make test
./test cache ""
++++++++++++ running cache test +++++++++++++++++++
mcelog: cache.c:92: parse_cpumap: Assertion `len == c * sizeof(unsigned)' failed.
./test: line 42: 3198 Aborted $D ../../mcelog --foreground --daemon --debug-numerrors --config $conf --logfile $log >> result

[root@atompc tests]# cat */results
cache.conf: no triggers at all
cache.conf: triggers did not trigger as expected: 2 != 0

Xeon:

[root@rincon1-ew tests]# cat */results

socket-1.conf: triggers did not trigger as expected: 2 != 4
socket-2.conf: triggers did not trigger as expected: 1 != 2
socket-memdb.conf: triggers did not trigger as expected: 4 != 6

The O/S on both systems is CentOS 6.4 x86_64, and I wrote these instructions to run the mcelog tests:

Testing mcelog triggers requires mce-inject from the ras-utils package and page-types.c from the kernel-doc package.

yum install ras-utils kernel-doc

cd /usr/share/doc/kernel-doc-2.6.32/Documentation/vm

gcc -o page-types page-types.c

mv page-types /usr/bin/

Run the mcelog package test suite.

cd /root/rpmbuild/BUILD/mcelog-1.0pre3_20120814_2/tests

service mcelogd stop

ln -s /usr/sbin/mcelog ../mcelog

modprobe mce-inject

make clean

make test

The results are below.

Thanks,

Larry Baker

----- Atom -----

[root@atompc tests]# rpm -q -a | grep mce
mcelog-1.0pre3_20120814_2-0.6.el6.x86_64

[root@atompc tests]# more /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 28
model name : Intel(R) Atom(TM) CPU D525 @ 1.80GHz
stepping : 10
cpu MHz : 1799.899
cache size : 512 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant
_tsc arch_perfmon pebs bts rep_good aperfmperf pni dtes64 monitor ds_cpl tm2 sss
e3 cx16 xtpr pdcm movbe lahf_lm dts
bogomips : 3599.79
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

processor : 1
--More--(0%)

[root@atompc tests]# make clean
rm -f */*log
rm -f */results
[root@atompc tests]# make test
./test cache ""
++++++++++++ running cache test +++++++++++++++++++
mcelog: no process killed
mcelog: cache.c:92: parse_cpumap: Assertion `len == c * sizeof(unsigned)' failed.
./test: line 42: 3198 Aborted $D ../../mcelog --foreground --daemon --debug-numerrors --config $conf --logfile $log >> result
./test page ""
++++++++++++ running page test +++++++++++++++++++
mcelog: no process killed
./test memdb ""
++++++++++++ running memdb test +++++++++++++++++++
mcelog: no process killed
./test socket ""
++++++++++++ running socket test +++++++++++++++++++
mcelog: no process killed
./test pfa ""
++++++++++++ running pfa test +++++++++++++++++++
mcelog: no process killed
+++ start the injection for page-account.conf +++
inject for page type slab at physical address 0x13ca44000 [ NO. 0 ]
inject for page type slab at physical address 0x13ca44000 [ NO. 1 ]
+++ start the injection for page-hard.conf +++
inject for page type slab at physical address 0x137d77000 [ NO. 0 ]
inject for page type slab at physical address 0x137d77000 [ NO. 1 ]
+++ start the injection for page-soft.conf +++
inject for page type slab at physical address 0x12a355000 [ NO. 0 ]
inject for page type slab at physical address 0x12a355000 [ NO. 1 ]
+++ start the injection for page-soft-then-hard.conf +++
inject for page type slab at physical address 0x1255a7000 [ NO. 0 ]
[root@atompc tests]# cat */results
cache.conf: no triggers at all
cache.conf: triggers did not trigger as expected: 2 != 0
memdb-1.conf: triggers trigger as expected
memdb-2.conf: triggers trigger as expected
page-account.conf: triggers trigger as expected
page-hard.conf: triggers trigger as expected
page-memdb.conf: triggers trigger as expected
page-off.conf: triggers trigger as expected
page-soft.conf: triggers trigger as expected
page-soft-then-hard.conf: triggers trigger as expected
page-account.conf: triggers trigger as expected
page-hard.conf: triggers trigger as expected
page-soft.conf: triggers trigger as expected
page-soft-then-hard.conf: triggers trigger as expected
socket-1.conf: triggers trigger as expected
socket-2.conf: triggers trigger as expected
socket-memdb.conf: triggers trigger as expected

----- Xeon -----

[root@rincon1-ew tests]# rpm -q -a | grep mce
mcelog-1.0pre3_20120814_2-0.6.el6.x86_64

[root@rincon1-ew tests]# more /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 44
model name : Intel(R) Xeon(R) CPU L5630 @ 2.13GHz
stepping : 2
cpu MHz : 1596.000
cache size : 12288 KB
physical id : 0
siblings : 8
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdt
scp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmp
erf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pci
d dca sse4_1 sse4_2 popcnt aes lahf_lm arat epb dts tpr_shadow vnmi flexpriority
ept vpid
bogomips : 4267.09
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:

processor : 1
--More--(0%)

[root@rincon1-ew tests]# make clean
rm -f */*log
rm -f */results
[root@rincon1-ew tests]# make test
./test cache ""
++++++++++++ running cache test +++++++++++++++++++
mcelog: no process killed
./test page ""
++++++++++++ running page test +++++++++++++++++++
mcelog: no process killed
./test memdb ""
++++++++++++ running memdb test +++++++++++++++++++
mcelog: no process killed
./test socket ""
++++++++++++ running socket test +++++++++++++++++++
mcelog: no process killed
./test pfa ""
++++++++++++ running pfa test +++++++++++++++++++
mcelog: no process killed
+++ start the injection for page-account.conf +++
inject for page type slab at physical address 0x160096000 [ NO. 0 ]
inject for page type slab at physical address 0x160096000 [ NO. 1 ]
+++ start the injection for page-hard.conf +++
inject for page type slab at physical address 0x177742000 [ NO. 0 ]
inject for page type slab at physical address 0x177742000 [ NO. 1 ]
+++ start the injection for page-soft.conf +++
inject for page type slab at physical address 0x15556e000 [ NO. 0 ]
inject for page type slab at physical address 0x15556e000 [ NO. 1 ]
+++ start the injection for page-soft-then-hard.conf +++
inject for page type slab at physical address 0x179bfa000 [ NO. 0 ]
[root@rincon1-ew tests]# cat */results
cache.conf: triggers trigger as expected
memdb-1.conf: triggers trigger as expected
memdb-2.conf: triggers trigger as expected
page-account.conf: triggers trigger as expected
page-hard.conf: triggers trigger as expected
page-memdb.conf: triggers trigger as expected
page-off.conf: triggers trigger as expected
page-soft.conf: triggers trigger as expected
page-soft-then-hard.conf: triggers trigger as expected
page-account.conf: triggers trigger as expected
page-hard.conf: triggers trigger as expected
page-soft.conf: triggers trigger as expected
page-soft-then-hard.conf: triggers trigger as expected
socket-1.conf: triggers did not trigger as expected: 2 != 4
socket-2.conf: triggers did not trigger as expected: 1 != 2
socket-memdb.conf: triggers did not trigger as expected: 4 != 6

For uncorrected errors, how do we determine at which memory address the error occurred?

For example, consider the log for an uncorrected error:
Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 2 TSC 41cc435c49086
MISC 40000
TIME 1462844438 Tue May 10 01:40:38 2016
MCG status:
MCi status:
Uncorrected error
Error enabled

Consider the log for a corrected error:
Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 2
MISC 50000 ADDR 389f000
TIME 1462783860 Mon May 9 08:51:00 2016
MCG status:
MCi status:
Corrected error

Why isn't "ADDR 389f000" present for uncorrected errors, and how do we determine the address?

suspicious RCU usage

[36915.633804] =============================
[36915.633805] WARNING: suspicious RCU usage
[36915.633808] 4.13.4-301.fc27.x86_64+debug #1 Not tainted
[36915.633809] -----------------------------
[36915.633811] arch/x86/kernel/cpu/mcheck/dev-mcelog.c:60 suspicious mce_log_get_idx_check() usage!
[36915.633812]
other info that might help us debug this:

[36915.633813]
rcu_scheduler_active = 2, debug_locks = 1
[36915.633815] 3 locks held by kworker/1:2/14637:
[36915.633816] #0: ("events"){.+.+.+}, at: [] process_one_work+0x1d0/0x6a0
[36915.633827] #1: ((&mce_work)){+.+...}, at: [] process_one_work+0x1d0/0x6a0
[36915.633833] #2: ((x86_mce_decoder_chain).rwsem){++++..}, at: [] blocking_notifier_call_chain+0x2f/0x70
[36915.633840]
stack backtrace:
[36915.633843] CPU: 1 PID: 14637 Comm: kworker/1:2 Not tainted 4.13.4-301.fc27.x86_64+debug #1
[36915.633844] Hardware name: Gigabyte Technology Co., Ltd. Z87M-D3H/Z87M-D3H, BIOS F11 08/12/2014
[36915.633847] Workqueue: events mce_gen_pool_process
[36915.633849] Call Trace:
[36915.633854] dump_stack+0x8e/0xd6
[36915.633858] lockdep_rcu_suspicious+0xc5/0x100
[36915.633862] dev_mce_log+0xf6/0x1e0
[36915.633865] notifier_call_chain+0x39/0x90
[36915.633869] blocking_notifier_call_chain+0x49/0x70
[36915.633873] mce_gen_pool_process+0x41/0x70
[36915.633876] process_one_work+0x253/0x6a0
[36915.633883] worker_thread+0x4d/0x3b0
[36915.633888] kthread+0x133/0x150
[36915.633890] ? process_one_work+0x6a0/0x6a0
[36915.633892] ? kthread_create_on_node+0x70/0x70
[36915.633896] ret_from_fork+0x2a/0x40

dmesg.txt

What about this problem?

mcelog: ERROR: AMD Processor family 22: mcelog does not support this processor.

Hello dear Developer!

I have a CPU: Quad Core AMD A8-6410 APU with AMD Radeon R5 Graphics (-MCP-)

For every linux distribution I get this error:
mcelog: ERROR: AMD Processor family 22: mcelog does not support this processor. Please use the edac_mce_amd modul

Please fix this bug.

Thank you very much!
László Kardos

Specify CPU on cli by family and model

It is far easier for the user to find the CPU family and model by looking at /proc/cpuinfo than by knowing the family name. The code already knows how to translate the family and model to the right decoding functions.
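
To illustrate how little is needed to obtain these two numbers, here is a minimal sketch that extracts a numeric field from one /proc/cpuinfo line. The function name and its interface are hypothetical; this is not an existing mcelog option or helper.

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 and stores the number in *value if line has the form
 * "<key><padding>: <number>", otherwise returns 0.
 * A key of "model" does not match the "model name" line, because
 * only spaces and tabs may sit between the key and the colon. */
int cpuinfo_field(const char *line, const char *key, int *value)
{
	size_t klen = strlen(key);
	const char *p;

	if (strncmp(line, key, klen) != 0)
		return 0;
	p = line + klen;
	/* Skip the padding between the key and the colon. */
	while (*p == ' ' || *p == '\t')
		p++;
	if (*p != ':')
		return 0;
	return sscanf(p + 1, "%d", value) == 1;
}
```

Applied to lines such as "cpu family : 6" and "model : 122", this yields the decimal values as /proc/cpuinfo prints them; a command-line option would then only need to accept the same pair.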

only decoding architectural errors

localhost user.err mcelog: Family 6 Model 4d CPU: only decoding architectural errors
localhost user.err mcelog: Kernel does not support page offline interfacealsh..

We get these errors during Linux boot. What we are wondering is whether only the decoding of the errors is missing, or whether the actual errors reported by the CPU are missing as well...

mcelog fails to initialize when 'dmi' option is turned on

After enabling the dmi option in the mcelog.conf file, restarting the mcelog service fails. 'service mcelog status' shows the following:
linux-y3iu:/home/vpedabal # service mcelog status
mcelog.service - Machine Check Exception Logging Daemon
Loaded: loaded (/usr/lib/systemd/system/mcelog.service; enabled)
Active: failed (Result: exit-code) since Tue 2015-03-17 00:34:01 PDT; 1s ago
Process: 2761 ExecStart=/usr/sbin/mcelog --ignorenodev --daemon --foreground (code=exited, status=1/FAILURE)
Main PID: 2761 (code=exited, status=1/FAILURE)

Is there a way to see what was the underlying failure? The syslog doesn't seem to report anything.

mcelog throws "only decoding architectural errors" on skylake processors

mcelog needs support for the new Intel Skylake processors. The errors below are seen in /var/log/mcelog.

mcelog: mcelog read: No such device
mcelog: Family 6 Model 55 CPU: only decoding architectural errors
Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 11
MISC 229aa040900086 ADDR fffc4b00
TIME 1458750006 Wed Mar 23 12:20:06 2016
MCG status:
MCi status:
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ae2000000003110a MCGSTATUS 0
MCGCAP 6000c14 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 85
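
The flag lines in the log above correspond to the architectural bits of the STATUS value. The sketch below decodes them using the bit positions from the MCi_STATUS layout in the Intel SDM; the macro and function names are hypothetical and this is not mcelog's decoder.

```c
#include <stdint.h>

/* Architectural MCi_STATUS flag bits (positions per the Intel SDM). */
#define SKETCH_STATUS_VAL   (1ULL << 63) /* register contents valid */
#define SKETCH_STATUS_OVER  (1ULL << 62) /* error overflow */
#define SKETCH_STATUS_UC    (1ULL << 61) /* uncorrected error */
#define SKETCH_STATUS_EN    (1ULL << 60) /* error reporting enabled */
#define SKETCH_STATUS_MISCV (1ULL << 59) /* MCi_MISC register valid */
#define SKETCH_STATUS_ADDRV (1ULL << 58) /* MCi_ADDR register valid */
#define SKETCH_STATUS_PCC   (1ULL << 57) /* processor context corrupt */

/* Returns 1 if the given flag bit is set in status, else 0. */
int sketch_status_flag(uint64_t status, uint64_t flag)
{
	return (status & flag) != 0;
}

/* The MCA error code lives in the low 16 bits of MCi_STATUS. */
unsigned sketch_mca_code(uint64_t status)
{
	return (unsigned)(status & 0xffff);
}
```

For STATUS ae2000000003110a this reports UC, MISCV, ADDRV and PCC as set and EN as clear, which agrees with the decoded lines above. The ADDRV bit is the one that says whether the hardware recorded an address for the event.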

Are trigger files needed?

I want to send myself an e-mail when an mcelog trigger occurs on a remote server. I thought I would find the right trigger file and add a .local extension file. However, the Red Hat mcelog-1.0pre3_20120814_2-0.6.el6.x86_64 RPM for RHEL 6.4 does not include any trigger files. I'm puzzled. I could not figure out from the documentation whether trigger files are required. From what I see on the Git source site, they look necessary for mcelog to function properly. Is that the case? Why would Red Hat remove them?

Thank you.

Larry Baker

mcelog hangs in the non-daemon mode when unknown error is encountered

In the non-daemon mode, when the child processes exit the SIGCHLD
is being read off the queue, but the child handler is not called,
and the num_children is not decremented. This leads to the process
hanging in trigger_wait() function in an infinite loop waiting for
SIGCHLD signals that never arrive - (sigwait(&mask, &sig)).

This behaviour was discovered when the mcelog tried to process an
unknown error on a machine, and created child processes. The parent
process never exited. Since mcelog was run by a cron task every x
minutes, numerous hung-mcelog processes accumulated whenever it hit
the unknown-error-trigger.

Sample error lines:
mcelog: Running trigger `unknown-error-trigger'
mcelog: Running trigger `unknown-error-trigger'
mcelog: CPU 5 on socket 0 received unknown error
mcelog: CPU 5 on socket 0 received unknown error

The fix is to call waitpid() to reap the exit status of all terminating
children, calling finish_child() for each. This fix has been tested
on the machine the issue was discovered on.
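
The idea behind the fix can be sketched as follows. This is a simplified, self-contained model and not the actual patch: the counter and the function names are hypothetical stand-ins for mcelog's num_children and finish_child() bookkeeping.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for the parent's count of live children. */
static int sketch_num_children;

/* Start n children that exit immediately; returns how many started. */
int sketch_spawn(int n)
{
	int started = 0;

	while (started < n) {
		pid_t pid = fork();

		if (pid < 0)
			break;          /* fork failed, stop spawning */
		if (pid == 0)
			_exit(0);       /* child: nothing to do */
		sketch_num_children++;
		started++;
	}
	return started;
}

/* Reap children with waitpid() until none are left, updating the
 * counter for each one. Returns the number of children reaped. */
int sketch_wait_all(void)
{
	int reaped = 0;
	int status;

	while (sketch_num_children > 0) {
		pid_t pid = waitpid(-1, &status, 0);

		if (pid < 0) {
			if (errno == EINTR)
				continue;   /* interrupted, try again */
			break;          /* ECHILD or other error: give up */
		}
		sketch_num_children--;  /* what finish_child() would do */
		reaped++;
	}
	return reaped;
}
```

Because the loop is driven by waitpid() results rather than by counting delivered SIGCHLD signals, it still terminates when several children exit close together and their signals are merged into one.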

srilakshmidj-patch-1 on branch srilakshmidj/mcelog fixes the problem.

Signed-off-by: Tim Bingham
Signed-off-by: Sri Jayaramappa


