Comments (27)

CounterPly commented on July 29, 2024

@Rechenschieber

Thank you very much. These reports confirm my suspicion that asmFish is crashing for an entirely different reason than Stockfish. All of your reported crashes (as well as the crash I generated) indicate that the problem is not in the NUMA code like it was in SF, but rather in the following code block in QSearch:

if PvNode = 1
lea r8, [._pv]
mov r9, qword[rbx+State.pv]
mov dword[.oldAlpha], ecx
mov qword[rbx+1*sizeof.State+State.pv], r8
mov dword[r9], edx
end if

Specifically, the mov instruction on line 94 (mov dword[r9], edx) is executed prematurely, before the destination address has been fully loaded into r9 on line 91 (mov r9, qword[rbx+State.pv]). According to your reports, this is what causes the memory access violation (0xC0000005) and the resulting crash.

I believe this crash occurs due to an extremely rare occurrence of the L1 DCache speculatively executing load/store requests in an order different from the one written. Usually, this type of out-of-order execution is desirable since it can essentially recycle data from an earlier store to the same address, saving considerable time. In this particular case, however, the ordering of the instructions seems to confuse the memory disambiguator in such a way that it fails to predict the dependency of line 94 on line 91.

Excerpt from Page 75 of Intel's 64 and IA-32 Architectures Optimization Reference Manual:

Memory Disambiguation
A load operation may depend on a preceding store. Many microarchitectures block loads until all preceding store addresses are known. The memory disambiguator predicts which loads will not depend on any previous stores whose addresses aren’t yet known. When the disambiguator predicts that a load does not have such a dependency, the load takes its data from an earlier store to the same address. This hides the load latency. Eventually, the prediction is verified. If the load did indeed depend on a store whose address was unknown at the time the load executed, this conflict is detected and the load and all succeeding instructions are re-executed.

In theory, this should be fixable through some minor changes and slight reordering of the instructions. If I can get a continuous analysis to run for longer than 4-5 days on my hardware, then I will send you that asmFish version to test/verify on your machines.

from asmfish.

CounterPly commented on July 29, 2024

You mentioned this was a problem in SF 10. Did it eventually get corrected? If you post the patch that corrected this behavior, I would be happy to port it for you.

The asmFish version on my asmFish fork corresponds to Stockfish 10+ (i.e. Stockfish from December 2018, see commit notes). Let me know if you encounter the same problem there. You may need to assemble the latest asmFish first using make.bat as I don't build executables with every new commit.

Rechenschieber commented on July 29, 2024

Thank you very much. These crashes are really rare events - in fact I didn't see any in the last 4 days. I started writing log files then - maybe this can help.

Rechenschieber commented on July 29, 2024

What I found out so far:
These crashes are not related to computers with more than 1 processor group - they also occur on a 32 core Threadripper running 64 threads. I got very good support from Joe Drake, a chess enthusiast who owns one.
I tried to reproduce these crashes by repeating the commands issued before each crash, but could not reproduce them.
asmFishW_10 also crashes - like Stockfish 10 does.
asmFishW-04-05 does not crash, so the problem was added between 04-05 and 05-18.
Stockfish 010719 also doesn't crash - so the problem has been fixed.

MichaelB7 commented on July 29, 2024

This is typical behavior as the machines (hardware) are starting to outpace the software design - overflows will almost always cause a crash or some other undefined and unwanted behavior. Dann Corbit has found numerous bugs since he started running big hardware for days on one position. Unfortunately, the best way to find the bug is to use the debug version, as the error messages are often obscure and the failures are not easy to reproduce with multiple threads: multithreading makes the search non-deterministic, so a bug may not reappear at the same point in processing.

Rechenschieber commented on July 29, 2024

I think I've found the problem: [https://github.com/official-stockfish/Stockfish/commit/0194da0d80f35faba08c86e3a6845bdc1268e4c6(url)]
https://groups.google.com/forum/?fromgroups=#!topic/fishcooking/gA6aoMEuOwg

Rechenschieber commented on July 29, 2024

Sorry first link doesn't work
https://github.com/official-stockfish/Stockfish/commit/0194da0d80f35faba08c86e3a6845bdc1268e4c6

CounterPly commented on July 29, 2024

@Rechenschieber Thank you for these links. I'll take a look at it this weekend.

CounterPly commented on July 29, 2024

On Wednesday, December 26, 2018 at 10:54:42 PM UTC, Ronald de Man wrote:

OK, I'll explain this to myself :-)

The <= condition was correct, but the problem is the ptr->Size check. Once the last record has been processed, ptr points to the first element AFTER the buffer, so accessing ptr->Size can segfault.

So ptr->Size >0 should indeed be replaced with "offset < returnLength". The additional "&& offset + ptr->Size <= returnLength" is optional.
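
The check Ronald describes can be sketched roughly as follows. This is a minimal illustration, not the Stockfish code: `RecordHeader`, `appendRecord` and `countRecords` are hypothetical names, and the header is a simplified stand-in for the real Windows structure. The point is only that the offset is tested against `returnLength` before any field of the record is read.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical, simplified record header. The real Windows structure has
// more fields, but it likewise stores its own total size.
struct RecordHeader {
    std::uint32_t Relationship;
    std::uint32_t Size;  // total size of this record in bytes, header included
};

// Test helper: appends one zero-filled record of `size` bytes to `buffer`.
void appendRecord(std::vector<unsigned char>& buffer, std::uint32_t size) {
    assert(size >= sizeof(RecordHeader));
    RecordHeader header{0, size};
    const std::size_t start = buffer.size();
    buffer.resize(start + size, 0);
    std::memcpy(buffer.data() + start, &header, sizeof header);
}

// Counts the records in the first `returnLength` bytes of `buffer`.
// The header is only read after the offset is known to be in range.
std::size_t countRecords(const std::vector<unsigned char>& buffer,
                         std::size_t returnLength) {
    if (returnLength > buffer.size()) returnLength = buffer.size();
    std::size_t count = 0;
    std::size_t offset = 0;
    while (offset < returnLength) {                  // "offset < returnLength"
        if (returnLength - offset < sizeof(RecordHeader)) break;
        RecordHeader header;
        std::memcpy(&header, buffer.data() + offset, sizeof header);
        // Optional extra check: the record must end inside the buffer.
        if (header.Size < sizeof(RecordHeader) ||
            header.Size > returnLength - offset) {
            break;
        }
        ++count;
        offset += header.Size;
    }
    return count;
}
```

With the loop written this way, the iteration that follows the last record fails the `offset < returnLength` test and never touches memory past the buffer.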

@tthsqe12
Question for you when you have the time.

Is the segfault Ronald refers to in the quote above (corrected in official-stockfish/Stockfish@0194da0) also observed in your NUMA implementation? As far as I can tell, Os_SetThreadPoolInfo (in OsWindows.asm) should not experience this same crash. I suspect the crashes reported here are coming from elsewhere, but it would give me peace of mind if you could confirm that this section is fine.

asmFish/x86/OsWindows.asm

Lines 904 to 933 in b9ddd4e

.NextNumaNode:
cmp r12d, MAX_NUMANODES
jae .NumaNodesDone
cmp rsi, rbx
jae .NumaNodesDone
mov ecx, dword[rsi+WinNumaNode.NodeNumber+8*0]
or edx, -1
mov r8, qword[.Affinity]
call QueryNodeAffinity
test eax, eax
jz .SkipNumaNode
xor eax, eax
mov edx, dword[rsi+WinNumaNode.NodeNumber+8*0]
mov r8, qword[rsi+WinNumaNode.GroupMask+8*0]
mov r9, qword[rsi+WinNumaNode.GroupMask+8*1]
mov dword[rdi+NumaNode.nodeNumber], edx
mov dword[rdi+NumaNode.coreCnt], eax ; initialize to zero, will increment later
mov qword[rdi+NumaNode.cmhTable], rax ; initialize to NULL, will allocate as needed
mov qword[rdi+NumaNode.parent], rdi ; initialize to self
mov qword[rdi+NumaNode.groupMask+8*0], r8
mov qword[rdi+NumaNode.groupMask+8*1], r9
add r12d, 1
add rdi, sizeof.NumaNode
.SkipNumaNode:
mov eax, dword[rsi+WinNumaNode.Size]
add rsi, rax
jmp .NextNumaNode

CounterPly commented on July 29, 2024

@Rechenschieber

You were right about this being a very rare event. It took me 4+ days of straight analysis on a multi NUMA node system, but I was finally able to generate a single crash. I would like to compare it to yours to see if we are experiencing the same event.

Could you please share the exception code(s) and exception offset(s) for the crashes you experienced? If you or Joe could share a .dmp file like the one I've attached below, that would be immensely helpful.

asmFishW_10_popcnt.exe.17980.dmp.zip

Rechenschieber commented on July 29, 2024

I'm very sorry - I didn't write any dmp files. The information from event viewer for some crashes is this:

Log Name: Application
Source: Application Error
Date: 29.06.2019 10:24:15
Event ID: 1000
Task Category: (100)
Level: Error
Keywords: Classic
User: N/A
Computer: E2679v4
Description:
Faulting application name: asmFishW_2019-05-18_bmi2.exe, version: 0.0.0.0, time stamp: 0x5ce0cddb
Faulting module name: asmFishW_2019-05-18_bmi2.exe, version: 0.0.0.0, time stamp: 0x5ce0cddb
Exception code: 0xc0000005
Fault offset: 0x0000000000164362
Faulting process id: 0x15b4
Faulting application start time: 0x01d52e3704c27b5d
Faulting application path: C:\Users\Cluster\Desktop\FishTool10 BMI2\asmFishW_2019-05-18_bmi2.exe
Faulting module path: C:\Users\Cluster\Desktop\FishTool10 BMI2\asmFishW_2019-05-18_bmi2.exe
Report Id: 475a3afe-9a47-11e9-a5ee-2cfda1341b64

Log Name: Application
Source: Application Error
Date: 27.06.2019 13:14:18
Event ID: 1000
Task Category: (100)
Level: Error
Keywords: Classic
User: N/A
Computer: E2679v4
Description:
Faulting application name: asmFishW_2019-05-18_bmi2.exe, version: 0.0.0.0, time stamp: 0x5ce0cddb
Faulting module name: asmFishW_2019-05-18_bmi2.exe, version: 0.0.0.0, time stamp: 0x5ce0cddb
Exception code: 0xc0000005
Fault offset: 0x0000000000164362
Faulting process id: 0x15e4
Faulting application start time: 0x01d52cba0e8411d2
Faulting application path: C:\Users\Cluster\Desktop\FishTool10 BMI2\asmFishW_2019-05-18_bmi2.exe
Faulting module path: C:\Users\Cluster\Desktop\FishTool10 BMI2\asmFishW_2019-05-18_bmi2.exe
Report Id: b40017e5-98cc-11e9-a5ee-2cfda1341b64

Log Name: Application
Source: Application Error
Date: 25.06.2019 11:32:25
Event ID: 1000
Task Category: (100)
Level: Error
Keywords: Classic
User: N/A
Computer: E2679v4
Description:
Faulting application name: asmFishW_2019-05-18_bmi2.exe, version: 0.0.0.0, time stamp: 0x5ce0cddb
Faulting module name: asmFishW_2019-05-18_bmi2.exe, version: 0.0.0.0, time stamp: 0x5ce0cddb
Exception code: 0xc0000005
Fault offset: 0x0000000000164362
Faulting process id: 0xf28
Faulting application start time: 0x01d52b2f357386f8
Faulting application path: C:\Users\Cluster\Desktop\FishTool10 BMI2\asmFishW_2019-05-18_bmi2.exe
Faulting module path: C:\Users\Cluster\Desktop\FishTool10 BMI2\asmFishW_2019-05-18_bmi2.exe
Report Id: 2388379b-972c-11e9-a5ee-2cfda1341b64

Log Name: Application
Source: Application Error
Date: 24.06.2019 18:57:33
Event ID: 1000
Task Category: (100)
Level: Error
Keywords: Classic
User: N/A
Computer: E2679v4
Description:
Faulting application name: asmFishW_2019-05-18_bmi2.exe, version: 0.0.0.0, time stamp: 0x5ce0cddb
Faulting module name: asmFishW_2019-05-18_bmi2.exe, version: 0.0.0.0, time stamp: 0x5ce0cddb
Exception code: 0xc0000005
Fault offset: 0x0000000000164362
Faulting process id: 0x17f4
Faulting application start time: 0x01d52994902514fb
Faulting application path: C:\Users\Cluster\Desktop\FishTool10 BMI2\asmFishW_2019-05-18_bmi2.exe
Faulting module path: C:\Users\Cluster\Desktop\FishTool10 BMI2\asmFishW_2019-05-18_bmi2.exe
Report Id: 28505d49-96a1-11e9-a5ee-2cfda1341b64

Rechenschieber commented on July 29, 2024

And this is from Stockfish 10 - same exception code:

Log Name: Application
Source: Application Error
Date: 14.08.2019 23:55:36
Event ID: 1000
Task Category: (100)
Level: Error
Keywords: Classic
User: N/A
Computer: E4650
Description:
Faulting application name: stockfish_10_x64_popcnt.exe, version: 0.0.0.0, time stamp: 0x00000000
Faulting module name: stockfish_10_x64_popcnt.exe, version: 0.0.0.0, time stamp: 0x00000000
Exception code: 0xc0000005
Fault offset: 0x00000000000d9716
Faulting process id: 0x1074
Faulting application start time: 0x01d55212b42f68ca
Faulting application path: C:\Users\Administrator\Desktop\StockFishTool10 BMI2\stockfish_10_x64_popcnt.exe
Faulting module path: C:\Users\Administrator\Desktop\StockFishTool10 BMI2\stockfish_10_x64_popcnt.exe
Report Id: 3e77083d-bede-11e9-a7d3-002590595f85

CounterPly commented on July 29, 2024

@Rechenschieber
The last commit (edc0dbc) appears to have corrected the issue on my hardware. If you could confirm this, it would be much appreciated.

The executables are located here:
https://github.com/lantonov/asmFish/tree/bugfix/WindowsOS_binaries

Rechenschieber commented on July 29, 2024

Bad news: I had 2 crashes after only a few hours:
Log Name: Application
Source: Application Error
Date: 18.09.2019 20:49:03
Event ID: 1000
Task Category: (100)
Level: Error
Keywords: Classic
User: N/A
Computer: E2679v4
Description:
Faulting application name: asmFishW_10_bmi2.exe, version: 0.0.0.0, time stamp: 0x5d8179a7
Faulting module name: asmFishW_10_bmi2.exe, version: 0.0.0.0, time stamp: 0x5d8179a7
Exception code: 0xc0000005
Fault offset: 0x0000000000164773
Faulting process id: 0x147c
Faulting application start time: 0x01d56e3a212008b5
Faulting application path: C:\Users\Cluster\Desktop\FishTool10 BMI2\asmFishW_10_bmi2.exe
Faulting module path: C:\Users\Cluster\Desktop\FishTool10 BMI2\asmFishW_10_bmi2.exe
Report Id: fb6d69de-da44-11e9-a5ee-2cfda1341b64

This time I made dump files, but they are huge: nearly 16 GB - so I guess it won't be possible to upload them here.

CounterPly commented on July 29, 2024

Interesting. Mine is still running at the moment - no crashes yet (~48 hrs uptime). I will investigate using the new offset value you've provided. Please let me know if you find a way to compress/transfer as the dump files you mentioned could be quite helpful.
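
For comparing reports, the fault offsets can be pulled out of the pasted Event Viewer text mechanically. The following is a rough sketch under the assumption that each report contains a line of the form "Fault offset: 0x..." as in the logs above; `tallyFaultOffsets` is a hypothetical helper, not part of asmFish.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>

// Tallies how often each fault offset appears in pasted Event Viewer text.
// Only lines containing "Fault offset:" are considered.
std::map<std::uint64_t, int> tallyFaultOffsets(const std::string& logText) {
    const std::string key = "Fault offset:";
    std::map<std::uint64_t, int> counts;
    std::istringstream lines(logText);
    std::string line;
    while (std::getline(lines, line)) {
        const std::size_t pos = line.find(key);
        if (pos == std::string::npos) continue;
        try {
            // With base 16, std::stoull skips leading whitespace and
            // accepts an optional "0x" prefix.
            const std::uint64_t offset =
                std::stoull(line.substr(pos + key.size()), nullptr, 16);
            ++counts[offset];
        } catch (const std::exception&) {
            // Malformed value: ignore this line.
        }
    }
    return counts;
}
```

Run over the reports in this thread, such a tally would show 0x164362 for every crash of the 2019-05-18 build and 0x164773 for the build with the edc0dbc change, which makes it easy to see that the fault moved rather than disappeared.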

Rechenschieber commented on July 29, 2024

Compressing didn't work - the size is still 15 GB. I'll now try to get a crash using much less hash. Originally I had 4GB, now I'll try 64 / 128 MB.

Rechenschieber commented on July 29, 2024

Less hash had no impact on file size :(
I'll try to find some way to send the dump file.

CounterPly commented on July 29, 2024

Filemail is free and seems to work fine on files up to 50 GB in size. Here is a test using the 7-man TB for KBPPvKNP (26.75 GB in size).

https://www.filemail.com/d/dcvxfblnuoflrup

Using their Windows client will give you considerably faster upload speed.

CounterPly commented on July 29, 2024

~112 hours of straight analysis and still no crash here. Unfortunately, it does not look like I will be able to reproduce the crash this time with my hardware.

While we wait on the dump file transfer, please feel free to try out the following 4 patched versions of asmFish on your hardware. At the moment, these are my best guesses on how to go about fixing the crash you are experiencing. I'm primarily interested in seeing what the changes in fault offset between these four patches look like, but if one of them manages to fix the problem entirely, that would be great too.

4patches.zip

Thank you again for your help in troubleshooting this.

Rechenschieber commented on July 29, 2024

Thanks a lot.
Link to dump file: https://fil.email/w2wsSYWf
Please tell me if you need more dump files.

Edit: Fixed broken link.

CounterPly commented on July 29, 2024

@Rechenschieber

Edit: Never mind, I found a workaround. I have access to the relevant information from the dump now.

At this point, I still think the best course of action would be to try out the 4 patched asmFish versions I posted earlier. If you get another crash please let me know the error code/offset and include a dump file again if possible.

The problematic section of code is again in QSearch (a few hundred lines after the last fix) in a very similar block of code to the last. I am hoping that this is the last of such issues, but if not, it may point to the need for a more explicit redesign and/or assembly-time switch to better accommodate big hardware in this section of code.

CounterPly commented on July 29, 2024

Here is one more patch for you to try. This should better align QSearch with your processor's cache line. Hopefully, this will reduce/eliminate these rare crashes.

asmFishW_10_bmi2_patch5.zip
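
The patch itself is not shown in this thread, so the following only illustrates the arithmetic behind aligning a code or data address to a cache line. It assumes a 64-byte line, which is typical for current x86-64 processors but not guaranteed; `alignUp` and `paddingTo` are hypothetical helper names.

```cpp
#include <cstdint>

// Assumed cache-line size in bytes (an assumption, not read from hardware).
constexpr std::uint64_t kCacheLine = 64;

// Rounds `address` up to the next multiple of `alignment`.
// `alignment` must be a power of two.
constexpr std::uint64_t alignUp(std::uint64_t address, std::uint64_t alignment) {
    return (address + alignment - 1) & ~(alignment - 1);
}

// Number of padding bytes needed to reach the next aligned address.
constexpr std::uint64_t paddingTo(std::uint64_t address, std::uint64_t alignment) {
    return alignUp(address, alignment) - address;
}
```

For example, a routine that would otherwise start at 0x1003 needs 61 bytes of padding to begin at 0x1040, the next 64-byte boundary. In an assembler this padding is normally requested with an alignment directive rather than computed by hand.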

Rechenschieber commented on July 29, 2024

Patches 1 and 2 didn't stop the crashes. I'm now trying patch 5. All of my big computers but one have W7 / server 2008 R2. The only one with W10 is a 32 core Threadripper - I'll try that.

CounterPly commented on July 29, 2024

Update:
I was finally able to reproduce the exact crash you were reporting. It took around 11 days of analysis on my hardware. I'll revert edc0dbc since I could reproduce this crash with more reliability and will keep looking for a more sustainable fix. Any of your .dmp files from patches 1-5 would be interesting for me to see and may be helpful in pinpointing the issue.

No worries about producing a minidump. I was able to find a workaround, so other large dumps like the 14 GB dump you previously shared would be perfect.

Thank you again for your help.

Rechenschieber commented on July 29, 2024

Bad news - I got a crash with patch #5
https://fil.email/89Q2R3yA
This is the error message:
Log Name: Application
Source: Application Error
Date: 12.10.2019 20:06:34
Event ID: 1000
Task Category: (100)
Level: Error
Keywords: Classic
User: N/A
Computer: AsusWS
Description:
Faulting application name: asmFishW_10_bmi2_patch5.exe, version: 0.0.0.0, time stamp: 0x5d8816f9
Faulting module name: asmFishW_10_bmi2_patch5.exe, version: 0.0.0.0, time stamp: 0x5d8816f9
Exception code: 0xc0000005
Fault offset: 0x0000000000162c0f
Faulting process id: 0x1378
Faulting application start time: 0x01d5810f287061a3
Faulting application path: C:\Users\Cluster\Desktop\FishTool10 BMI2\asmFishW_10_bmi2_patch5.exe
Faulting module path: C:\Users\Cluster\Desktop\FishTool10 BMI2\asmFishW_10_bmi2_patch5.exe
Report Id: 0636220a-ed1b-11e9-b39d-2c4d5443f486

tthsqe12 commented on July 29, 2024

@CounterPly What you describe on speculative loads should have no bearing on the correct execution of the program. If the guess is incorrect and would lead to a segfault, the processor should throw out this guess.
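
A small sketch of that architectural guarantee, as seen from a single thread: a load that follows a store in program order always observes the result the program order implies, whether or not the two addresses alias. Any speculation happens below this level and is discarded on a misprediction. The function name `storeThenLoad` is hypothetical and only serves the illustration.

```cpp
// Stores through one pointer, then loads through another. The two pointers
// may or may not refer to the same object; the hardware has to sort that out
// without changing the visible result.
int storeThenLoad(int* storeTo, const int* loadFrom, int value) {
    *storeTo = value;   // store whose address may alias the load below
    return *loadFrom;   // load: must see `value` if the addresses alias
}
```

Called with the same address twice, the function returns the value just stored; called with two different addresses, it returns the untouched contents of the second. Neither outcome depends on how the processor chose to schedule the two memory operations internally.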

CounterPly commented on July 29, 2024

@tthsqe12 Thanks for this input. I realize that it has been nearly three years since this issue was opened and that @Rechenschieber has moved on to other things by now, but it has honestly always bothered me that I was never able to pin down a fix for this. Do you have any theories based on what you've read?

As a reference, we ultimately thought it may be related to this discussion, specifically the last two comments by Ronald de Man and Dann Corbit.

Note: I realize asmFish has no practical value anymore; I'm asking purely from an academic perspective since I'm legitimately curious and this particular problem has puzzled me for years.
