
Comments (17)

gwoltman avatar gwoltman commented on June 15, 2024

Another example.

Objconv:

vpcmpeqq ymm2, ymm2, ymmword ptr [YMM_FMA_ONE] ; 6ED1 _ C4 E2 ED: 29. 15, 00000000(rel)

gdb:

0: c4 e2 ed 29 (bad)
4: 15 00 00 00 00 adc eax,0x0

from uasm.

john-terraspace avatar john-terraspace commented on June 15, 2024

It should be:

0: c4 41 2d 58 d3 vaddpd ymm10,ymm10,ymm11

We have it as:

000000013fee1000 C4 41 AD 58 D3 vaddpd ymm10, ymm10, ymm11

Will investigate now and fix.

From: gwoltman [mailto:[email protected]]
Sent: 13 November 2016 04:37 PM
To: Terraspace/HJWasm [email protected]
Subject: [Terraspace/HJWasm] Request for gcc compatible output (#38)

One example is the instruction vaddpd ymm10, ymm10, ymm11. Output from objconv:

vaddpd ymm10, ymm10, ymm11 ; 6ECC _ C4 41 AD: 58. D3 ; Note: Prefix bit or byte has no meaning in this context

Output from https://defuse.ca/online-x86-assembler.htm#disassembly2

0: c4 41 ad 58 (bad)
4: d3 .byte 0xd3

Knights Landing will execute this code correctly. However, debugging is a bit hard in gdb as disassembly is not possible.

One of my users is getting crashes on Piledriver. I don't have access to an AMD cpu so I do not know if the crash is due to encodings like the above or a bug in my code.



john-terraspace avatar john-terraspace commented on June 15, 2024

Working on this now.

I don’t think the bit being set would cause a crash, but for the sake of accuracy it should be fixed anyway!


john-terraspace avatar john-terraspace commented on June 15, 2024

This is what I’m getting with the latest version:

vpcmpeqq ymm2, ymm2, ymmword ptr [YMM_FMA_ONE]

000000013f2e1000 C4 E2 ED 29 15 F7 2F 00 00 vpcmpeqq ymm2, ymm2, ymmword ptr [rip+0x2ff7]

According to the manual and defuse it should be:

c4 e2 ed 29 15 00 00 00 00

So that one seems right now?

vaddpd ymm10,ymm10,ymm11

gives us:

000000013f2e1009 C4 41 AD 58 D3 vaddpd ymm10, ymm10, ymm11

And should be:

c4 41 2d 58 d3

AD vs 2D means the W bit (bit 7 of the third VEX byte) is set in our output. Depending on the opcode byte, W is an opcode-specific extension, is used like REX.W, or is ignored.

Specifically for vaddpd we have:

VEX.NDS.256.66.0F.WIG 58 /r

WIG: can use C5H form (if not requiring VEX.mmmmm) or VEX.W value is ignored in the C4H form of VEX prefix.

If WIG is present, the instruction may be encoded using either the two-byte form or the three-byte form of VEX. When encoding the instruction using the three-byte form of VEX, the value of VEX.W is ignored.

So that shouldn’t be a problem and should be safe there.

It would be worth testing the instruction on its own on an AMD chip, just to make sure they don’t have a different take on the W bit, but I wouldn’t imagine so.
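To make the AD vs 2D difference concrete, here is a minimal Python sketch that splits a three-byte VEX prefix into its fields, following the layout in the Intel manual. The function name `decode_vex3` and the dictionary keys are made up for this illustration and are not part of UASM.

```python
def decode_vex3(encoding: bytes) -> dict:
    """Split a three-byte VEX prefix (C4 xx xx) into its named fields."""
    if len(encoding) < 3 or encoding[0] != 0xC4:
        raise ValueError("not a three-byte VEX prefix")
    b1, b2 = encoding[1], encoding[2]
    return {
        # R, X, B and vvvv are stored inverted in the prefix bytes.
        "R": ((b1 >> 7) & 1) ^ 1,
        "X": ((b1 >> 6) & 1) ^ 1,
        "B": ((b1 >> 5) & 1) ^ 1,
        "mmmmm": b1 & 0x1F,            # opcode map select (1 = 0F)
        "W": (b2 >> 7) & 1,            # the bit under discussion
        "vvvv": ((b2 >> 3) & 0xF) ^ 0xF,
        "L": (b2 >> 2) & 1,            # 1 = 256-bit
        "pp": b2 & 0x3,                # 1 = 66 prefix
    }


# vaddpd ymm10, ymm10, ymm11 as emitted by us and as expected.
ours = decode_vex3(bytes.fromhex("c441ad58d3"))
expected = decode_vex3(bytes.fromhex("c4412d58d3"))

for name in ours:
    marker = "" if ours[name] == expected[name] else "  <-- differs"
    print(f"{name:6} ours={ours[name]:2} expected={expected[name]:2}{marker}")
```

Only the W line is flagged; every other field, including vvvv = 10 for ymm10, is identical in both encodings.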


habran avatar habran commented on June 15, 2024

I'll fix that anyway; it won't be hard to clear that bit when the C4 prefix is in use.
I just need to find out which instructions are affected.


habran avatar habran commented on June 15, 2024

codegen.c line 598:
/* This fixes AVX REX_W wide 32 <-> 64 instructions third byte bit W */
//lbyte &= ~EVEX_P1WMASK; //make sure it is not set if not 64 bit
//lbyte |= ((CodeInfo->pinstr->prefix) >> 8 & 0x80); // set only W bit if 64 bit
fixes the problem
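The idea of clearing the W bit on a C4-prefixed instruction can be illustrated outside the assembler. The Python sketch below is not the UASM code: `clear_vex_w` and the constant names are hypothetical, and it assumes the caller already knows that the instruction is WIG (W ignored). Clearing W on an instruction where W selects the operand size or extends the opcode would change its meaning.

```python
VEX3_ESCAPE = 0xC4   # first byte of a three-byte VEX prefix
VEX_W_MASK = 0x80    # bit 7 of the third prefix byte


def clear_vex_w(encoding: bytes) -> bytes:
    """Return the encoding with VEX.W cleared if it starts with a C4 prefix.

    Assumption: the caller has checked that the instruction is WIG.
    Encodings that do not start with C4 are returned unchanged.
    """
    if len(encoding) >= 3 and encoding[0] == VEX3_ESCAPE:
        patched = bytearray(encoding)
        patched[2] &= ~VEX_W_MASK & 0xFF
        return bytes(patched)
    return encoding


# vaddpd ymm10, ymm10, ymm11: C4 41 AD 58 D3 becomes C4 41 2D 58 D3.
print(clear_vex_w(bytes.fromhex("c441ad58d3")).hex(" "))
# The two-byte C5 form has no W bit, so it passes through untouched.
print(clear_vex_w(bytes.fromhex("c5edefd2")).hex(" "))
```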


gwoltman avatar gwoltman commented on June 15, 2024

The vaddpd problem is fixed. The vpcmpeqq one is not.

I think this stretch of code:

vpxor   ymm2, ymm2, ymm2
vpcmpeqq ymm3, ymm3, ymm2
vpand   ymm9, ymm9, YMMWORD PTR YMM_28TH_BIT        ;; Test for positive dword values in QF1 (test 28th bit)
vpcmpeqq ymm9, ymm9, ymm2
vpmovmskb rdx, ymm3
and edx, 0FFFFFFFFh                 ;; See if INVFAC values changed
jnz short invfac_adjust             ;; Jump INVFACs need adjustment
vpmovmskb rdx, ymm9

comes out as this by objdump:

3cd5:       c5 ed ef d2             vpxor  %ymm2,%ymm2,%ymm2
3cd9:       c4 e2 e5 29             (bad)
3cdd:       da c5                   fcmovb %st(5),%st
3cdf:       35 db 0d 00 00          xor    $0xddb,%eax
3ce4:       00 00                   add    %al,(%rax)
3ce6:       c4 62 b5 29             (bad)
3cea:       ca c4 e1                lret   $0xe1c4
3ced:       7d d7                   jge    3cc6 <factor64_tf+0x35a3>
3cef:       d3 83 e2 ff 75 0a       roll   %cl,0xa75ffe2(%rbx)
3cf5:       c4 c1 7d d7 d1          vpmovmskb %ymm9,%edx

and:

vpsrlq  ymm9, ymm9, 30                  ;; Q1 = top bits of quotient

comes out as:

3c05:       c4 c1 b5 73             (bad)
3c09:       d1 1e                   rcrl   (%rsi)

from uasm.

john-terraspace avatar john-terraspace commented on June 15, 2024

If I disassemble that sequence (using Intel SDE) I get the following, which seems correct:

70: vpxor ymm2, ymm2, ymm2

000000013fa31000 C5 ED EF D2 vpxor ymm2, ymm2, ymm2

71:       vpcmpeqq ymm3, ymm3, ymm2

000000013fa31004 C4 E2 E5 29 DA vpcmpeqq ymm3, ymm3, ymm2

72:       vpand   ymm9, ymm9, YMMWORD PTR mop        ;; Test for positive dword values in QF1 (test 28th bit)

000000013fa31009 C5 35 DB 0D EF 2F 00 00 vpand ymm9, ymm9, ymmword ptr [rip+0x2fef]

73:       vpcmpeqq ymm9, ymm9, ymm2

000000013fa31011 C4 62 B5 29 CA vpcmpeqq ymm9, ymm9, ymm2

74:       vpmovmskb rdx, ymm3

000000013fa31016 C4 E1 7D D7 D3 vpmovmskb edx, ymm3

75:       and edx, 0FFFFFFFFh                 ;; See if INVFAC values changed

000000013fa3101b 83 E2 FF and edx, 0xffffffff

76:       jnz short invfac_adjust             ;; Jump INVFACs need adjustment

000000013fa3101e 75 05 jnz 0x13fa31025

77:       vpmovmskb rdx, ymm9

000000013fa31020 C4 C1 7D D7 D1 vpmovmskb edx, ymm9

78: invfac_adjust:

79:       vpsrlq  ymm9, ymm9, 30

000000013fa31025 C4 C1 B5 73 D1 1E vpsrlq ymm9, ymm9, 0x1e

Visual Studio 2015 also agrees with:

--- vcall2.asm -----------------------------------------------------------------

start:

000000013F0B1000 C5 ED EF D2 vpxor ymm2,ymm2,ymm2

000000013F0B1004 C4 E2 E5 29 DA vpcmpeqq ymm3,ymm3,ymm2

000000013F0B1009 C5 35 DB 0D EF 2F 00 00 vpand ymm9,ymm9,ymmword ptr [mop (013F0B4000h)]

000000013F0B1011 C4 62 B5 29 CA vpcmpeqq ymm9,ymm9,ymm2

000000013F0B1016 C4 E1 7D D7 D3 vpmovmskb edx,ymm3

000000013F0B101B 83 E2 FF and edx,0FFFFFFFFh

000000013F0B101E 75 05 jne invfac_adjust (013F0B1025h)

000000013F0B1020 C4 C1 7D D7 D1 vpmovmskb edx,ymm9

invfac_adjust:

OBJConv too (apart from the prefix bit which shouldn’t matter):

; Disassembly of file: vcall2.obj

; Thu Nov 17 08:53:53 2016

; Mode: 64 bits

; Syntax: MASM/ML64

; Instruction set: AVX-512, x64

option dotname

public start

public invfac_adjust

public mop

public YMM_FMA_ONE

_text SEGMENT PARA 'CODE' ; section number 1

start PROC

$$$00001 LABEL NEAR

    vpxor   ymm2, ymm2, ymm2                        ; 0000 _ C5 ED: EF. D2

; Note: Prefix bit or byte has no meaning in this context

    vpcmpeqq ymm3, ymm3, ymm2                       ; 0004 _ C4 E2 E5: 29. DA

    vpand   ymm9, ymm9, ymmword ptr [mop]           ; 0009 _ C5 35: DB. 0D, 00000000(rel)

; Note: Prefix bit or byte has no meaning in this context

    vpcmpeqq ymm9, ymm9, ymm2                       ; 0011 _ C4 62 B5: 29. CA

    vpmovmskb rdx, ymm3                             ; 0016 _ C4 E1 7D: D7. D3

    and     edx, 0FFFFFFFFH                         ; 001B _ 83. E2, FF

    jnz     invfac_adjust                           ; 001E _ 75, 05

    vpmovmskb rdx, ymm9                             ; 0020 _ C4 C1 7D: D7. D1

The only one that has an issue is Defuse, so I think in this case they’re wrong, probably because they can’t recognise the instruction with the prefix bit set :)

We’ll fix that anyway, but it should all be working nonetheless.


john-terraspace avatar john-terraspace commented on June 15, 2024

Fixed a bunch of these encoding bits now:

The following test code:

    vpcmpeqq ymm3, ymm3, ymm2



    vpsrlq  ymm9, ymm9, 30                  



    vpxor   ymm2, ymm2, ymm2

    vpcmpeqq ymm3, ymm3, ymm2

    vpand   ymm9, ymm9, YMMWORD PTR mop        ;; Test for positive dword values in QF1 (test 28th bit)

    vpcmpeqq ymm9, ymm9, ymm2

    vpmovmskb rdx, ymm3

    and edx, 0FFFFFFFFh                 ;; See if INVFAC values changed

    jnz short invfac_adjust             ;; Jump INVFACs need adjustment

    vpmovmskb rdx, ymm9

invfac_adjust:

    vpsrlq  ymm9, ymm9, 30



    vpslldq ymm1,ymm20, 30



    VPBROADCASTD zmm10, xmm3

    VPBROADCASTD zmm10, xmm14



    VPBROADCASTD ymm20, dword ptr mop

    vbroadcastsd zmm16{k1}, QWORD PTR mop

    vbroadcastsd zmm16, REAL8 PTR mop





    VPBROADCASTD ymm1, xmm2

    VPBROADCASTD ymm1, dword ptr mop

    VPBROADCASTD ymm9, xmm2

    VPBROADCASTD zmm20, dword ptr mop

    VPBROADCASTD zmm10{k2}, xmm3



    vbroadcastsd zmm16{k1}, QWORD PTR mop

    vbroadcastsd zmm16, REAL8 PTR mop



    vpsrlq ymm9, ymm7, 3

    vpsllq ymm9, ymm7, 2



    vpsrlq ymm1, ymm20, 30                   

    vpsllq ymm1, ymm20, 30                   



    vpsrlq ymm9, ymm9, 30

    vpsllq ymm9, ymm9, 30

    vpslldq ymm9,ymm9, 30

    vpsllw ymm8,ymm9,30

    vpslld ymm9, ymm9, 30

    vpsraw ymm9, ymm9, 30

    vpsrad ymm9,ymm9, 30

    vpsrldq ymm9, ymm9, 30





    vpsrlq ymm9, ymm7, [r11]

    vfnmadd231pd zmm27, zmm24, zmm23



    ;c4 e2 6d 29 14 25 00       ;us

    ;c4 e2 ed 29 15 00 00 00 00 ;should be 

    vpcmpeqq ymm2, ymm2, ymmword ptr [YMM_FMA_ONE]

    vpcmpeqq ymm9, ymm9, ymmword ptr [YMM_FMA_ONE]



    vmovapd ymm1, ymmword ptr [YMM_FMA_ONE]



    vmovapd ymm1, ymm9



    ;c4 41 ad 58 d3 ;us

    ;c4 41 2d 58 d3 ;should be

    vaddpd ymm10,ymm10,ymm11



    vaddpd zmm17, zmm17, zmm20

    vmovapd zmm16,zmm18

    vmovapd zmm18,zmm16

    vmulpd zmm17, zmm17, zmm20

    vdivpd zmm17, zmm17, zmm20



    vaddpd zmm17, zmm17, zmm10

    vmovapd zmm16,zmm8

    vmovapd zmm18,zmm6

    vmulpd zmm17, zmm17, zmm12

    vdivpd zmm17, zmm17, zmm13



    vaddpd zmm3, zmm17, zmm6

    vmovapd zmm16,zmm2

    vmovapd zmm18,zmm6

    vmulpd zmm1, zmm17, zmm4

    vdivpd zmm17, zmm3, zmm13



    korw k5, k1, k2



    bob exec



    vmovapd ymm2,YMMWORD PTR [ebx+48*SZPTR+576+16*SZPTR+24*8]



    noexec vmovapd  ymm2, YMM_BIGVAL

    vxorpd  ymm2, ymm2, ymm2



    vmovapd ymm1, YMM_BIGVAL                ;; Load comparison valueno base2 

    vxorpd  ymm1, ymm1, ymm1                ;; Create comparison value      

    vcmppd  ymm0, ymm2, ymm1, 0Ch           ;; Are any carries non-zero     

    vmovmskpd eax, ymm0                     ;; Extract 4 comparison bitsbase2       

    vxorpd  ymm1, ymm1, ymm1                ;; High carry words are always compared to zero 

    vcmppd  ymm0, ymm3, ymm1, 0Ch           ;; Are any carries non-zero     

    vmovmskpd ecx, ymm0                     ;; Extract 4 comparison bits    

    or      eax, ecx



    movzx   eax, BYTE PTR [edi]             ;; Load big vs. little flags    

    vmovapd ymm0, [esi]                     ;; Load values1ttp      

    vmulpd  ymm0, ymm0, [ebp]               ;; Mul values1 by two-to-minus-phittp   

    vmulpd  ymm0, ymm0, YMM_BIGVAL;; Mul by FFTLEN/2        

    vaddpd  ymm0, ymm0, ymm2                ;; x1 = values1 + carry split_upper_carry_zpad_word ttp, base2, ymm3, ymm1, ymm2, rax*2no const vmulpd        ymm2, ymm3, YMM_K_LO            ;; low bits of high FFT carry * k_loconst        

    vmulpd  ymm2, ymm3, YMM_BIGVAL ;; low bits of high_FFT_carry * k_lo     

    vaddpd  ymm0, ymm0, ymm2                ;; x1 = x1 + low bits of high_FFT_carry * k_lono const 

    vmulpd ymm3, ymm3, YMM_BIGVAL           ;; low bits of high FFT carry * k_hiconst       vmulpd  ymm3, ymm3, YMM_K_TIMES_MULCONST_HI ;; low bits of high FFT carry * k_hittp 

    vmulpd  ymm3, ymm3, YMM_BIGVAL[rax*2] ;; shift low bits of high FFT carry * k_hino ttp  vmulpd  ymm3, ymm3, YMM_LIMIT_INVERSE[0] ;; shift low bits of high FFT carry * k_hi 

    vroundpd ymm3, ymm3, 0                  ;; WASTEFUL.  Round (k_hi * limit_inverse) should be precomputed        rounding ttp, base2, noexec, ymm0, ymm2, ymm4, rax*2    

    vaddpd  ymm2, ymm2, ymm3                ;; Carry += shifted low bits of high_FFT_carry * k_hittp        

    vmulpd  ymm0, ymm0, [ebp+32]            ;; new value1 = val * two-to-phi        ystore  [rsi], ymm0                     ;; Save new value1      

    vmovapd ymm3, ymm1                      ;; Next high FFT carry = high bits of current high FFT carryttp bump        rdi, 1                          ;; Advance pointers     bump    rsi, 64ttp      bump    rbp, 64 sub     rdx, 1                                ;; Test counter jnz     section_loop                    ;; More cache lines in section, add carry in   ;; Section ended.  Rotate carries again and add the new next section carry values       ;; into the previously calculated next section carry values        rotate_carries base2, ymm2, ymm4, ymm0, ymm1        rotate_carries noexec, ymm3, ymm5, ymm0, ymm1base2      vsubpd  ymm4, ymm4, YMM_BIGVAL  vaddpd  ymm4, ymm4, YMM_TMP1        vaddpd  ymm5, ymm5, YMM_TMP2    jmp     section_start

    ret

Produces (and is gcc compliant):

Disassembly:

0: c4 e2 65 29 da vpcmpeqq ymm3,ymm3,ymm2
5: c4 c1 35 73 d1 1e vpsrlq ymm9,ymm9,0x1e
b: c5 ed ef d2 vpxor ymm2,ymm2,ymm2
f: c4 e2 65 29 da vpcmpeqq ymm3,ymm3,ymm2
14: c5 35 db 0d e4 2f 00 vpand ymm9,ymm9,YMMWORD PTR [rip+0x2fe4] # 0x3000
1b: 00
1c: c4 62 35 29 ca vpcmpeqq ymm9,ymm9,ymm2
21: c4 e1 7d d7 d3 vpmovmskb edx,ymm3
26: 83 e2 ff and edx,0xffffffff
29: 75 05 jne 0x30
2b: c4 c1 7d d7 d1 vpmovmskb edx,ymm9
30: c4 c1 35 73 d1 1e vpsrlq ymm9,ymm9,0x1e
36: 62 b1 f5 28 73 fc 1e vpslldq ymm1,ymm20,0x1e
3d: 62 72 7d 48 58 d3 vpbroadcastd zmm10,xmm3
43: 62 52 7d 48 58 d6 vpbroadcastd zmm10,xmm14
49: 62 e2 7d 28 58 25 ad vpbroadcastd ymm20,DWORD PTR [rip+0x2fad] # 0x3000
50: 2f 00 00
53: 62 e2 fd 49 19 05 a3 vbroadcastsd zmm16{k1},QWORD PTR [rip+0x2fa3] # 0x3000
5a: 2f 00 00
5d: 62 e2 fd 48 19 05 99 vbroadcastsd zmm16,QWORD PTR [rip+0x2f99] # 0x3000
64: 2f 00 00
67: c4 e2 7d 58 ca vpbroadcastd ymm1,xmm2
6c: c4 e2 7d 58 0d 8b 2f vpbroadcastd ymm1,DWORD PTR [rip+0x2f8b] # 0x3000
73: 00 00
75: c4 62 7d 58 ca vpbroadcastd ymm9,xmm2
7a: 62 e2 7d 48 58 25 7c vpbroadcastd zmm20,DWORD PTR [rip+0x2f7c] # 0x3000
81: 2f 00 00
84: 62 72 7d 4a 58 d3 vpbroadcastd zmm10{k2},xmm3
8a: 62 e2 fd 49 19 05 6c vbroadcastsd zmm16{k1},QWORD PTR [rip+0x2f6c] # 0x3000
91: 2f 00 00
94: 62 e2 fd 48 19 05 62 vbroadcastsd zmm16,QWORD PTR [rip+0x2f62] # 0x3000
9b: 2f 00 00
9e: c5 35 73 d7 03 vpsrlq ymm9,ymm7,0x3
a3: c5 35 73 f7 02 vpsllq ymm9,ymm7,0x2
a8: 62 b1 f5 28 73 d4 1e vpsrlq ymm1,ymm20,0x1e
af: 62 b1 f5 28 73 f4 1e vpsllq ymm1,ymm20,0x1e
b6: c4 c1 35 73 d1 1e vpsrlq ymm9,ymm9,0x1e
bc: c4 c1 35 73 f1 1e vpsllq ymm9,ymm9,0x1e
c2: c4 c1 35 73 f9 1e vpslldq ymm9,ymm9,0x1e
c8: c4 c1 3d 71 f1 1e vpsllw ymm8,ymm9,0x1e
ce: c4 c1 35 72 f1 1e vpslld ymm9,ymm9,0x1e
d4: c4 c1 35 71 e1 1e vpsraw ymm9,ymm9,0x1e
da: c4 c1 35 72 e1 1e vpsrad ymm9,ymm9,0x1e
e0: c4 c1 35 73 d9 1e vpsrldq ymm9,ymm9,0x1e
e6: c4 41 45 d3 0b vpsrlq ymm9,ymm7,XMMWORD PTR [r11]
eb: 62 22 bd 40 bc df vfnmadd231pd zmm27,zmm24,zmm23
f1: c4 e2 6d 29 15 26 2f vpcmpeqq ymm2,ymm2,YMMWORD PTR [rip+0x2f26] # 0x3020
f8: 00 00
fa: c4 62 35 29 0d 1d 2f vpcmpeqq ymm9,ymm9,YMMWORD PTR [rip+0x2f1d] # 0x3020
101: 00 00
103: c5 fd 28 0d 15 2f 00 vmovapd ymm1,YMMWORD PTR [rip+0x2f15] # 0x3020
10a: 00
10b: c4 c1 7d 28 c9 vmovapd ymm1,ymm9
110: c4 41 2d 58 d3 vaddpd ymm10,ymm10,ymm11
115: 62 a1 f5 40 58 cc vaddpd zmm17,zmm17,zmm20
11b: 62 a1 fd 48 28 c2 vmovapd zmm16,zmm18
121: 62 a1 fd 48 28 d0 vmovapd zmm18,zmm16
127: 62 a1 f5 40 59 cc vmulpd zmm17,zmm17,zmm20
12d: 62 a1 f5 40 5e cc vdivpd zmm17,zmm17,zmm20
133: 62 c1 f5 40 58 ca vaddpd zmm17,zmm17,zmm10
139: 62 c1 fd 48 28 c0 vmovapd zmm16,zmm8
13f: 62 e1 fd 48 28 d6 vmovapd zmm18,zmm6
145: 62 c1 f5 40 59 cc vmulpd zmm17,zmm17,zmm12
14b: 62 c1 f5 40 5e cd vdivpd zmm17,zmm17,zmm13
151: 62 f1 f5 40 58 de vaddpd zmm3,zmm17,zmm6
157: 62 e1 fd 48 28 c2 vmovapd zmm16,zmm2
15d: 62 e1 fd 48 28 d6 vmovapd zmm18,zmm6
163: 62 f1 f5 40 59 cc vmulpd zmm1,zmm17,zmm4
169: 62 c1 e5 48 5e cd vdivpd zmm17,zmm3,zmm13
16f: c5 f4 45 ea korw k5,k1,k2
173: c4 c1 7d 28 8b 00 05 vmovapd ymm1,YMMWORD PTR [r11+0x500]
17a: 00 00
17c: c4 c1 7d 28 a3 00 05 vmovapd ymm4,YMMWORD PTR [r11+0x500]
183: 00 00
185: 67 c5 fd 28 93 00 05 vmovapd ymm2,YMMWORD PTR [ebx+0x500]
18c: 00 00
18e: c5 ed 57 d2 vxorpd ymm2,ymm2,ymm2
192: c4 c1 7d 28 8b 00 05 vmovapd ymm1,YMMWORD PTR [r11+0x500]
199: 00 00
19b: c5 f5 57 c9 vxorpd ymm1,ymm1,ymm1
19f: c5 ed c2 c1 0c vcmpneq_oqpd ymm0,ymm2,ymm1
1a4: c5 fd 50 c0 vmovmskpd eax,ymm0
1a8: c5 f5 57 c9 vxorpd ymm1,ymm1,ymm1
1ac: c5 e5 c2 c1 0c vcmpneq_oqpd ymm0,ymm3,ymm1
1b1: c5 fd 50 c8 vmovmskpd ecx,ymm0
1b5: 0b c1 or eax,ecx
1b7: 67 0f b6 07 movzx eax,BYTE PTR [edi]
1bb: 67 c5 fd 28 06 vmovapd ymm0,YMMWORD PTR [esi]
1c0: 67 c5 fd 59 45 00 vmulpd ymm0,ymm0,YMMWORD PTR [ebp+0x0]
1c6: c4 c1 7d 59 83 00 05 vmulpd ymm0,ymm0,YMMWORD PTR [r11+0x500]
1cd: 00 00
1cf: c5 fd 58 c2 vaddpd ymm0,ymm0,ymm2
1d3: c4 c1 65 59 93 00 05 vmulpd ymm2,ymm3,YMMWORD PTR [r11+0x500]
1da: 00 00
1dc: c5 fd 58 c2 vaddpd ymm0,ymm0,ymm2
1e0: c4 c1 65 59 9b 00 05 vmulpd ymm3,ymm3,YMMWORD PTR [r11+0x500]
1e7: 00 00
1e9: c4 c1 65 59 9c 43 00 vmulpd ymm3,ymm3,YMMWORD PTR [r11+rax*2+0x500]
1f0: 05 00 00
1f3: c4 e3 7d 09 db 00 vroundpd ymm3,ymm3,0x0
1f9: c5 ed 58 d3 vaddpd ymm2,ymm2,ymm3
1fd: 67 c5 fd 59 45 20 vmulpd ymm0,ymm0,YMMWORD PTR [ebp+0x20]
203: c5 fd 28 d9 vmovapd ymm3,ymm1
207: c3 ret
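As a quick sanity check, the encodings from the earlier SDE listing can be compared against the ones in this disassembly to confirm that the W bit is the only thing that changed. This is a minimal Python sketch; `differs_only_in_vex_w` is a hypothetical helper written for this comparison, and the byte strings are copied from the two listings in this thread.

```python
def differs_only_in_vex_w(before: bytes, after: bytes) -> bool:
    """True if both are C4-prefixed and differ only in bit 7 of byte 3."""
    if len(before) != len(after) or len(before) < 3:
        return False
    if before[0] != 0xC4 or after[0] != 0xC4:
        return False
    changed = [i for i in range(len(before)) if before[i] != after[i]]
    return changed == [2] and (before[2] ^ after[2]) == 0x80


# (encoding before the fix, encoding after the fix, source line)
pairs = [
    ("c4e2e529da", "c4e26529da", "vpcmpeqq ymm3, ymm3, ymm2"),
    ("c462b529ca", "c4623529ca", "vpcmpeqq ymm9, ymm9, ymm2"),
    ("c4c1b573d11e", "c4c13573d11e", "vpsrlq ymm9, ymm9, 30"),
]

for old, new, text in pairs:
    ok = differs_only_in_vex_w(bytes.fromhex(old), bytes.fromhex(new))
    print(f"{text:28} W-only change: {ok}")
```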


john-terraspace avatar john-terraspace commented on June 15, 2024

New packages on the site dated 18th Nov with all the gcc compliant fixes included.


gwoltman avatar gwoltman commented on June 15, 2024

Better. vpmuludq exhibits the same symptoms. Two examples:

vpmuludq ymm9, ymm9, ymm8               
vpmuludq ymm4, ymm4, YMMWORD PTR YMM_TWO_120_MODF3  

from uasm.

john-terraspace avatar john-terraspace commented on June 15, 2024

This one is fixed now.

We’ve found a few others which we’re doing now as well. Will upload an updated package as soon as they’re all done.


john-terraspace avatar john-terraspace commented on June 15, 2024

Packages are updated on the site. They should fix the vpmuludq issue, along with some others involving vpermilpd, vblend and vcvtpd2ps.


gwoltman avatar gwoltman commented on June 15, 2024

Sorry to report these in dribs and drabs:

vpaddq  ymm10, ymm10, ymm8          

from uasm.

john-terraspace avatar john-terraspace commented on June 15, 2024

That’s ok… sorry to fix them in dribs and drabs :)


john-terraspace avatar john-terraspace commented on June 15, 2024

Fixed, packages updated on the site.


gwoltman avatar gwoltman commented on June 15, 2024

I cannot find any more gcc incompatibilities in all my assembled source files!

Well done!
