jserv / amacc
Small C Compiler generating ELF executable Arm architecture, supporting JIT execution
License: Other
Creating an issue for this so it can be tracked.
Using an array operator '[ ]' is currently supported for use with (certain) pointer variables in array expressions, but not always. The goal of this ticket is to be able to use the array operator '[ ]' in expressions, working as expected under every possible circumstance. Note that having '[ ]' work for declarations also needs to be improved, and is submitted as a separate issue.
I noticed in a larger program that I compiled JIT with amacc, which relies on modulo for prng, that the results were wrong. So I made a small test program, and tried a freshly cloned amacc on it - as you can see below the results are completely wrong.
(Here is a hacky fix for now, which requires the use of mod(a, b) instead of a % b; it gives the correct results for the small test program shown.)
$ git clone git@github.com:jserv/amacc.git
Cloning into 'amacc'...
remote: Enumerating objects: 71, done.
remote: Counting objects: 100% (71/71), done.
remote: Compressing objects: 100% (51/51), done.
remote: Total 734 (delta 33), reused 52 (delta 19), pack-reused 663
Receiving objects: 100% (734/734), 297.06 KiB | 1.18 MiB/s, done.
Resolving deltas: 100% (411/411), done.
$ cd amacc/
$ make
CC+LD amacc
CC+LD amacc-native
amacc.c:2107:5: warning: return type of ‘main’ is not ‘int’ [-Wmain]
int main(int argc, char **argv)
^~~~
amacc.c:2107:5: warning: first argument of ‘main’ should be ‘int’ [-Wmain]
$ ./amacc ~/tmp/modtest.c
39628
1015073
3
2
$ more ~/tmp/modtest.c
int main() {
int a = 2131119850;
int b = 53777;
printf("%d\n", (a % b));
int c = 841495917;
int d = 829;
printf("%d\n", (c % d));
int e = 100;
int f = 26;
printf("%d\n", e % f);
int g = 2;
int h = 1;
printf("%d\n", g % h);
return 0;
}
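For reference, the four values printed above (39628, 1015073, 3, 2) equal the quotients a / b, c / d, e / f and g / h rather than the remainders. A workaround of the kind mentioned could look like the sketch below. It assumes that integer division and multiplication are compiled correctly; the name mod comes from the note above, but the body is an assumption and not the original hack.

```c
/* Hypothetical workaround: compute the remainder without using '%'.
 * Assumes b != 0 and that '/' and '*' are compiled correctly. */
int mod(int a, int b)
{
    return a - (a / b) * b;
}
```

With this helper, the test program should print 44894, 400, 22 and 0.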
int main()
{
int *m;
*m = *(m-2); // bad dereference compiler error, RHS
return 0;
}
Because AMaCC was influenced by the design of c4, it assumes a 32-bit target. This becomes confusing once 64-bit targets are considered.
There were preliminary patches for proposed portability changes:
Expected output:
As AMaCC contains substantial rswier/c4 code, please change your license to GPLv2.
A recent problem has appeared:
$ gcc -o amacc-gcc amacc.c -ldl
$ ./amacc-gcc -o fib tests/fib.c
$ ./fib 20
10946
$ ./amacc-gcc -o amacc amacc.c # amacc bootstrap compile
$ ./amacc tests/fib.c 20
10946
$ ./amacc -o fib-amacc tests/fib.c
Segmentation fault
$
At first, I thought this was because of changes I made to the compiler, but I rolled back to Dec 16, 2020 before I had made any changes to the compiler, and tried again, with the same result:
$ cd /tmp
$ git clone https://github.com/jserv/amacc.git
$ cd amacc
$ git checkout -b old_state 684c54b # Dec 16 2020
$ gcc -o amacc-gcc amacc.c -ldl
$ ./amacc-gcc -o fib tests/fib.c
$ ./fib 20
10946
$ ./amacc-gcc -o amacc amacc.c
$ ./amacc tests/fib.c 20
10946
$ ./amacc -o fib-amacc tests/fib.c
Segmentation fault
$
I suspect that this may be due to an ABI change in the Linux kernel that I pulled in with a recent 'sudo apt update ; sudo apt full-upgrade'.
I am not an ELF expert, so I'm not sure I can fix this, especially since I don't have a debugger that will work on the bootstrapped amacc executable produced by the compile commands above. If anyone can point me to a debugger that will work on amacc ELF files, I can try to make some progress.
PS Just getting back to amacc work after a forced hiatus.
On my platform:
ubuntu 18.04
gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.0
With gcc, a left shift by a negative count behaves like a right shift, but AMaCC sets the result to 0.
Do we need to change AMaCC to keep the behavior consistent with gcc?
This is the failure message from the check test:
FAIL: test_shift (__main__.TestCC_SC)
----------------------------------------------------------------------
Traceback (most recent call last):
File "runtest.py", line 55, in test
self.assertEqual(amacc_out.decode('utf-8'), gcc_out.decode('utf-8'))
AssertionError: '1 <<[70 chars]-1 = 0\n-1 << 1 = fffffffe\n-1 << 0 = ffffffff[197 chars]68\n' != '1 <<[70 chars]-1 = 2\n-1 << 1 = fffffffe\n-1 << 0 = ffffffff[197 chars]68\n'
1 << 0 = 1
1 << 2 = 4
0 << 4 = 0
1 << 31 = 80000000
1 << 32 = 0
- 4 << -1 = 0
? ^
+ 4 << -1 = 2
? ^
-1 << 1 = fffffffe
-1 << 0 = ffffffff
4 >> 1 = 2
4 >> 5 = 0
0x80000000 >> 31 = ffffffff
-1 >> 2 = ffffffff
-1091119768 - -1091119766 = -2(fffffffe)
a = f6d5680c, b = bef6d568, c = fff6d568
a = 76d5680c, b = bef6d568, c = bef6d568
At present, AMaCC translates C source into an internal bytecode form, which the code generator relies on to produce Arm machine code. The -s argument dumps the bytecode generated from the parse tree. Here is an example:
$ ./amacc -s tests/hello.c
1: #include <stdio.h>
2:
3: int main()
4: {
5: printf("hello, world\n");
ENT 0
IMM -568971248
PSH
PRTF
ADJ 1
6: return 0;
IMM 0
LEV
7: }
Branch interpreter provides incomplete interpreter support, which can accept bytecode. It would be great if we could use the standalone interpreter to validate AMaCC internally.
Reference: Write a C interpreter
A 'static' storage class could potentially be supported by internally creating a 'hidden' global variable having the name <funcname>_<varname>. This could help prevent the use of global variables, at least in the source code. I think all the machinery is already there to support this in about ten lines of source code changes.
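The idea can be illustrated in plain C. The sketch below shows a function-local static next to the global that a compiler could create for it internally; the function and variable names are illustrative only and do not come from amacc.c.

```c
/* What the programmer writes: a function-local static counter. */
int next_id(void)
{
    static int counter = 0;
    return ++counter;
}

/* What the compiler could treat it as internally: a global whose name
 * is built as <funcname>_<varname>. The names here are hypothetical. */
int next_id2_counter = 0;

int next_id2(void)
{
    return ++next_id2_counter;
}
```

Both versions keep their value across calls; the only difference is that the hidden global would not be nameable from the source code.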
Branch c5-AST introduced the following enhancements:
Proposed changes were initially described in c4-enhance.patch.
It seems that left shift on negative numbers is not handled properly.
make check reports the following error:
- 4 << -1 = 0
? ^
+ 4 << -1 = 2
? ^
I traced the source code, and it seems that negative numbers have not been taken into consideration in our semantics.
At line 618 of amacc.c:
case Shl: next(); *++e = PSH; expr(Add); *++e = SHL; ty = INT; break;
So is it a bug or a feature of our compiler? Please do correct me if I missed out something.
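For context, the C standard leaves the behavior undefined when the shift count is negative or not smaller than the width of the promoted left operand, so neither 0 nor 2 is mandated for 4 << -1. If a deterministic result is wanted, one option is to define the semantics explicitly. The helper below is a hypothetical sketch of such a definition, not code from amacc.c.

```c
#include <stdint.h>

/* Hypothetical helper with fully defined behaviour for any count:
 * a negative count shifts in the opposite direction, and a magnitude
 * of 32 or more yields 0. */
uint32_t shl_defined(uint32_t value, int count)
{
    if (count >= 32 || count <= -32)
        return 0;
    if (count < 0)
        return value >> -count;
    return value << count;
}
```

Under this definition, shl_defined(4, -1) is 2 and shl_defined(1, 32) is 0, matching the gcc side of the test log for those two cases.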
The += operator in C is one of the language's compound assignment operators. It is essentially a shorthand notation for incrementing the variable on the left by an arbitrary value on the right.
The following two lines of C code are identical, in terms of their effect on the variable z:
z = z + y; // increment z by y
z += y; // increment z by y
Both of these statements do the same thing. The current value of z and the current value of y are added together, and the result of the addition is placed into variable z, replacing the value that had been stored in z.
At least the +=, -=, and *= operators should be supported.
int main() {
int a;
a = 0;
if (1) {
int b;
printf("%d\n", a);
int c;
b = 1;
a += b;
printf("%d\n", a);
}
}
Results in
0
0
1
Clearly it should not print 3 numbers!
Creating an issue for this so it can be tracked.
Using an array operator '[ ]' is currently supported for use with (certain) pointer variables in array expressions, but never supported for array declarations. This ticket is to specifically address declarations using '[ ]' directly.
pi@anchor:~/Languages/amacc $ uname -a
Linux anchor 5.10.17-v7l+ #1403 SMP Mon Feb 22 11:33:35 GMT 2021 armv7l GNU/Linux
pi@anchor:~/Languages/amacc $ gcc amacc.c -ldl
pi@anchor:~/Languages/amacc $ cat count.c
int main()
{
int i;
for (i=0; i<10; ++i)
printf("%d\n", i) ;
return i;
}
pi@anchor:~/Languages/amacc $ ./a.out -o foo.o count.c
pi@anchor:~/Languages/amacc $ file foo.o
foo.o: ELF 32-bit LSB executable, ARM, EABI5 version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux-armhf.so.3, stripped
pi@anchor:~/Languages/amacc $ objdump -d foo.o
objdump: foo.o: file format not recognized
??? --- Why doesn't this work?
pi@anchor:~/Languages/amacc $ ./a.out count.c
0
1
2
3
4
5
6
7
8
9
pi@anchor:~/Languages/amacc $
maze.c.gz is a simple maze generator written by Joe Wingbermuehle. The modified program can be parsed and executed by AMaCC. However, the program output is NOT correct, although it was expected to be some kind of maze. The maze-solving functionality was also incorrect.
Recent div/mod operator support breaks the classification of opcode enumeration:
/* system call shortcuts */
OPEN,READ,WRIT,CLOS,PRTF,MALC,FREE,MSET,MCMP,MCPY,MMAP,DSYM,BSCH,STRT,DLOP,DIV,MOD,EXIT,
Both DIV and MOD should be moved somewhere near SUB and MUL.
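The benefit of keeping the groups contiguous is that an opcode can be classified with a single range comparison. The sketch below uses a shortened, illustrative opcode list; the real enumeration in amacc.c is longer, and the presence of ADD and the exact positions are assumptions.

```c
/* Illustrative subset only; names and positions are assumptions. */
enum opcode {
    /* arithmetic opcodes */
    ADD, SUB, MUL, DIV, MOD,
    /* system call shortcuts */
    OPEN, READ, WRIT, CLOS, PRTF, EXIT
};

/* With DIV and MOD grouped among the arithmetic opcodes, a single
 * range comparison is enough to recognise a system call shortcut. */
int is_syscall(int op)
{
    return op >= OPEN && op <= EXIT;
}
```

If DIV and MOD stay between DLOP and EXIT, the same range check would wrongly classify them as system calls.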
I'm guessing 32bit floating point would take 150-400 lines of C source code to implement. For instance, now that dynamic library support has been expanded, AMaCC can call strtof() to parse float constants, saving many lines of compiler code. New IR symbols can be added for FP operations, probably around a dozen. It's my belief that type promotions and function call ABI would be the hardest part of the whole implementation. I think the AMaCC type system is already well positioned to handle float, although some comparison expressions in amacc.c may get longer if there are three base types to consider rather than two. The question for discussion is, just because it can be added, should it be added?
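As a rough illustration of the strtof() point, a float constant could be parsed and turned into a 32-bit pattern in a few lines. This is a sketch only: parse_float_bits is a hypothetical name, and it assumes float is an IEEE-754 binary32 value on the host.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: parse a float literal starting at 'text' and
 * return its 32-bit pattern, which could then be emitted as an
 * immediate. If 'end' is not NULL, '*end' receives the first
 * character after the literal. */
uint32_t parse_float_bits(const char *text, char **end)
{
    float value = strtof(text, end);
    uint32_t bits;
    memcpy(&bits, &value, sizeof bits); /* copy the bytes, no aliasing issues */
    return bits;
}
```

The lexer would only need to detect that a numeric token contains '.' or an exponent and hand it to such a helper.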
id->class is 0 for passed parameters vs. class 'Par' or 'Loc'. There may be other inconsistencies for other storage types.
CLCA is never used. I think it should be removed, since it is misleading.
Branch decl-refactor introduced excellent work: variables can now be declared anywhere within a function, not just at the beginning, and enums can go into functions too. That means we can initialize local variables as int i = [expr].
Commits:
res = bsearch(&sym, sym, 2, 1, (void*) _start); // hack to jump into a function pointer
if (((void*) 0) != res)
return retval;
else {
printf("Error: can't find the function pointer\n");
exit(0);
}
If the return value from _start is not equal to 0, res will be NULL.
In some tests (tests/ptr.c, tests/shift.c, tests/arginc.c, ...), main() returns a non-zero value or has no return statement, and the error is printed.
The error message "can't find the function pointer" is pretty weird.
Can't find what pointer? At this point the JIT program has already finished executing; if the pointer means _start, "can't find" is really meaningless.
Would a message such as "main function returned non-zero value: %d" be better, so that test files (e.g. tests/arginc.c, which only returns argc + 2) make sense?
In test file tests/char.c, the amacc execution result of the following code:
p = malloc(128);
p[0] = -1;
v = p[0];
printf("%x %d %d %x\n", p[0], p[0], v, p[1]);
is different from gcc's.
amacc generates ffffffff -1 -1 65, while gcc generates ff 255 255 65.
I am not sure whether this is an issue. It may be related to amacc handling char as signed char, but I am not sure for now.
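The C standard leaves the signedness of plain char implementation-defined, and the Arm EABI used by gcc on Linux makes it unsigned, which is consistent with the gcc output above. The two behaviours can be compared with the small sketch below; the function names are hypothetical.

```c
/* Store a value into a one-byte object and widen it back to int.
 * Plain 'char' behaves like exactly one of these two variants,
 * depending on the implementation. */
int widen_signed(int stored)
{
    signed char c = (signed char)stored;
    return c;
}

int widen_unsigned(int stored)
{
    unsigned char c = (unsigned char)stored;
    return c;
}
```

For a stored value of -1, the signed variant yields -1 (printed as ffffffff with %x) and the unsigned variant yields 255 (printed as ff).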
Struct/union assignments are not currently allowed, since they require moving mem blocks:
struct foo s1, s2;
s1 = s2; // not allowed
That said, the AST tracks all the type information needed to support this. Assignment/copy just needs to be converted to a SYSC to memcpy(). Also, for function return values and parameters, the C-standard ABI conventions need to be followed. I looked at the C89 standard and didn't see it spelled out. Some compilers used to convert struct parameters to pointers automatically if the struct size exceeded a certain limit, and I am not sure what the rules are now.
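The memcpy() lowering mentioned above can be sketched in C as follows. The members of struct foo are made up for the example, and foo_assign is a hypothetical stand-in for the code the compiler would emit.

```c
#include <string.h>

/* Example layout; the members are invented for illustration. */
struct foo {
    int id;
    char tag[8];
};

/* Equivalent of 's1 = s2' expressed as a block copy, which a compiler
 * could emit once it knows sizeof(struct foo) from the AST. */
void foo_assign(struct foo *dst, const struct foo *src)
{
    memcpy(dst, src, sizeof *dst);
}
```

Passing or returning structs by value is a separate matter, since it depends on the ABI rules discussed above.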
"s.memb += xxx ;" does not work. I believe the type handling state machine implemented in expr() can be simplified, but it could take as much as a full day of work to get it right.
Line 1746 in 1c320c6
On my Ubuntu 14.04, cloc is not installed by default.
However, make check uses cloc to count lines of code.
Should this be mentioned in the README file?
Although it compiles, it cannot self-host via the ELF32 format.
For now, this is the error message:
[gapry@E130 amacc]$ make check
[ JIT ]
hello, world
[ compiled ]
could not open(elf/hello)
Makefile:20: recipe for target 'check' failed
make: *** [check] Error 255
Is it possible to create a new branch to develop the ELF32 loader?
I can't load the PC value, which is a memory location, into the general-purpose register r0.
My current implementation is jit-arm.dasm.
Line 16 is that instruction: ldr r0, [pc, #12]
All the instructions are translated into an unsigned integer array that contains the machine code.
//|.actionlist actions
static const unsigned int actions[6] = {
0xe59f000c,
0xe5901000,
0xe2811009,
0xe5801000,
0xe1a0f00e,
0x00000000
};
It is the same as tests/jit.c.
Hence, I don't know why the following error occurs when it executes the instruction ldr r0, [pc, #12]:
lua ./dynasm.lua -o jit-arm.h jit-arm.dasc
arm-linux-gnueabihf-gcc -Wall -std=gnu99 -I. -o jit-arm -DJIT=\"jit-arm.h\" dynasm-driver.c
qemu-arm -L /usr/arm-linux-gnueabihf jit-arm
qemu: uncaught target signal 11 (Segmentation fault) - core dumped
make: *** [run-jit-arm] Segmentation fault (core dumped)
Any suggestions?
(BTW, if I change PC to SP, everything is OK.)
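When checking a pc-relative load by hand, it helps to remember that in A32 state a read of pc yields the address of the current instruction plus 8. The sketch below decodes such a load and computes which word of the actions array it reads; the helper names are hypothetical, and only positive immediate offsets are handled.

```c
#include <stdint.h>

/* Returns 1 if 'insn' is an A32 'ldr rd, [pc, #imm]' with a positive
 * immediate offset, and stores the offset in *imm. The mask checks
 * the LDR immediate-offset form (P=1, U=1, B=0, W=0, L=1) with
 * Rn = pc, ignoring the condition field and Rd. */
int is_ldr_pc_literal(uint32_t insn, uint32_t *imm)
{
    if ((insn & 0x0fff0000u) != 0x059f0000u)
        return 0;
    *imm = insn & 0xfffu;
    return 1;
}

/* Word index read by a pc-relative load located at 'insn_index':
 * pc reads as the instruction address plus 8. */
uint32_t pc_literal_index(uint32_t insn_index, uint32_t imm)
{
    return (insn_index * 4u + 8u + imm) / 4u;
}
```

For the listing above, 0xe59f000c at index 0 reads the word at index 5, which is the trailing 0x00000000. If that slot is not patched with a valid address before the code runs, r0 would be 0 and the next instruction, ldr r1, [r0] (0xe5901000), would fault, so this may be worth checking.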
In test file tests/ptr.c, the execution result of the following code is incorrect:
s = (int*)0xbebebeb0;
e = (int*)0xbebebeb4;
v = e - s;
if (v == 1)
printf("passed\n");
else
printf("failed, e - s = %x\n", v);
It prints failed, e - s = 0
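In C, subtracting two pointers yields the distance in elements, that is, the byte distance divided by the size of the pointed-to type. The sketch below spells out that rule with plain integers; ptr_diff is a hypothetical helper, and the conversion of a wrapped unsigned difference to a negative value assumes a two's complement host.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of elements between two addresses: the byte distance divided
 * by the element size (4 for 'int' on the 32-bit Arm target). */
ptrdiff_t ptr_diff(uintptr_t end, uintptr_t start, size_t elem_size)
{
    return (ptrdiff_t)(end - start) / (ptrdiff_t)elem_size;
}
```

For the addresses in the test, the byte distance is 4 and the element size is 4, so the expected result is 1.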
To reproduce this, try to compile:
int main(int argc, char **argv)
{
int i, j;
for (i = 0; i < 5; i++)
for (j = 0; j < 5; j++)
printf("loop %d\n", i);
return 0;
}
and we get an infinite loop.
This is a bug in the current implementation of the for loop. As we know, the order of the code chunks while parsing is not what we expect in assembly form:
// We parse it as:
// Init -> Cond -> Bz -> After -> Body -> Jmp
// But we want it to be:
// Init -> Cond -> Bz -> Body -> After -> Jmp
To solve this, I simply swap the After and Body chunks, and here comes the problem: all addresses used in Body remain the same, which causes unexpected behavior. Here are the two ways I have in mind:
Body
+ higher performance
- more complex work
- adds a bunch of code (I think we should keep this project small and simple)

- lower performance
+ keeps everything as simple as it is now
+ need only lines of code to achieve

Not sure which approach would fit the spirit of this project more.
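One layout that leaves the chunks in parse order and relies on extra jumps to get the right execution order can be modelled in plain C with goto. This is only an illustration of the control flow under that assumption; it is not necessarily either of the options listed above, nor the fix that was eventually adopted.

```c
/* Counts the body executions of
 *     for (i = 0; i < outer; i++)
 *         for (j = 0; j < inner; j++)
 *             body;
 * with each loop laid out in parse order (Init, Cond, Bz, After, Body)
 * and extra jumps supplying the execution order. */
int nested_loop_count(int outer, int inner)
{
    int i, j, count = 0;

    i = 0;                              /* Init */
outer_cond:
    if (!(i < outer)) goto outer_end;   /* Cond, Bz */
    goto outer_body;                    /* skip After on the way in */
outer_after:
    i++;                                /* After */
    goto outer_cond;
outer_body:
    j = 0;                              /* inner Init */
inner_cond:
    if (!(j < inner)) goto inner_end;   /* inner Cond, Bz */
    goto inner_body;
inner_after:
    j++;                                /* inner After */
    goto inner_cond;
inner_body:
    count++;                            /* innermost Body */
    goto inner_after;
inner_end:
    goto outer_after;                   /* end of the outer Body */
outer_end:
    return count;
}
```

Because no chunk is moved after it has been emitted, the addresses inside Body stay valid; the cost is two additional jumps per iteration.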
With the newer AST support, *e can be changed to a local variable.
The following error occurred when scripts/runtest.py was being run.
FAIL: test_shift (__main__.TestCC_SC)
----------------------------------------------------------------------
Traceback (most recent call last):
File "scripts/runtest.py", line 54, in test
self.assertEqual(amacc_out.decode('utf-8'), gcc_out.decode('utf-8'))
AssertionError: '1 <<[70 chars]-1 = 2\n4 >> -1 = 8\n-1 << 1 = fffffffe\n-1 <<[210 chars]68\n' != '1 <<[70 chars]-1 = 0\n4 >> -1 = 0\n-1 << 1 = fffffffe\n-1 <<[210 chars]68\n'
1 << 0 = 1
1 << 2 = 4
0 << 4 = 0
1 << 31 = 80000000
1 << 32 = 0
- 4 << -1 = 2
? ^
+ 4 << -1 = 0
? ^
- 4 >> -1 = 8
? ^
+ 4 >> -1 = 0
? ^
-1 << 1 = fffffffe
-1 << 0 = ffffffff
4 >> 1 = 2
4 >> 5 = 0
0x80000000 >> 31 = ffffffff
-1 >> 2 = ffffffff
-1091119768 - -1091119766 = -2(fffffffe)
a = f6d5680c, b = bef6d568, c = fff6d568
a = 76d5680c, b = bef6d568, c = bef6d568
Should verbose be checked before executing this line? I am not sure.
https://github.com/jserv/amacc/blob/master/amacc.c#L145
Hi,
Can you add a file to the doc directory describing the calling convention for amacc? The comments attached to the LEA instruction are not enough.
I have written a standalone peephole optimizer that reduces the number of assembly language instructions by half at this point (mostly because it eliminates all the push/pop instructions) and layers other optimizations on top, but I could go much further if I could wrap my head around the calling convention. It looks like a function sometimes returns an r0 register. It also seems like it is safe to 'step on' most of the registers without saving/restoring at function entry/exit, but I just want to verify. I am flying totally blind.
Finally, if you have any insight into how the (non-standard) ELF format "works", any information at all there would be helpful. Specifically, I am interested in relocating and (most importantly) aligning functions that call each other.
Just a few lines of information in a doc file would help since I currently have utter confusion, and have miraculously not broken anything with my optimizer yet, at least nothing I know of through my five tests (including maze.c).
PS Here's what the optimizer currently does. In a month, it might beat optimizing compilers. :)
int fact(int n)
{
int i ;
int retVal = 1 ;
for (i=2; i<=n; ++i)
retVal *= i ;
return retVal ;
}
after peephole:
50: e92d4800 push {fp, lr}
54: e28db000 add fp, sp, #0
58: e24dd008 sub sp, sp, #8
5c: e3a00001 mov r0, #1
60: e50b0008 str r0, [fp, #-8]
64: e3a00002 mov r0, #2
68: e50b0004 str r0, [fp, #-4]
6c: e1a00000 nop ; (mov r0, r0)
70: e1a00000 nop ; (mov r0, r0)
74: e1a00000 nop ; (mov r0, r0)
78: e51b8004 ldr r8, [fp, #-4]
7c: e59b0008 ldr r0, [fp, #8]
80: e1580000 cmp r8, r0
84: ca000013 bgt 0xb0
88: e51b7008 ldr r7, [fp, #-8]
8c: e51b0004 ldr r0, [fp, #-4]
90: e0000097 mul r0, r7, r0
94: e50b0008 str r0, [fp, #-8]
98: e51b7004 ldr r7, [fp, #-4]
9c: e3a00001 mov r0, #1
a0: e0800007 add r0, r0, r7
a4: e50b0004 str r0, [fp, #-4]
a8: eafffff2 b 0x78
ac: e1a00000 nop ; (mov r0, r0)
b0: e51b0008 ldr r0, [fp, #-8]
b4: e28bd000 add sp, fp, #0
b8: e8bd8800 pop {fp, pc}
from maze.c:
int maze_rand()
{
return ((maze_rand_v = maze_rand_v * 214013 + 2531011) >> 16) & 0x7fff;
}
after peephole:
50: e92d4800 push {fp, lr}
54: e28db000 add fp, sp, #0
58: e59f8064 ldr r8, [pc, #100] ; 0xc4
5c: e59f005c ldr r0, [pc, #92] ; 0xc0
60: e5907000 ldr r7, [r0]
64: e59f004c ldr r0, [pc, #76] ; 0xb8
68: e0070097 mul r7, r7, r0
6c: e59f0038 ldr r0, [pc, #56] ; 0xac
70: e0807007 add r7, r0, r7
74: e5887000 str r7, [r8]
78: e3a00010 mov r0, #16
7c: e1a08057 asr r8, r7, r0
80: e59f000c ldr r0, [pc, #12] ; 0x94
84: e0080000 and r0, r8, r0
88: e28bd000 add sp, fp, #0
8c: e8bd8800 pop {fp, pc}
May I separate the lexer and parser?
It would be easier to generate the AST if the lexer and parser were separate.
In test file tests/shift.c, the execution result of the following code:
printf("4 << -1 = %x\n", 4 << -1);
is 4 << -1 = 0, while gcc outputs 4 << -1 = 2.
It is possible that a left shift by a negative number is handled as if the count were an unsigned number, thus wiping out all bits.
Dmytro Sirenko wrote excellent documentation for c4-based x86 JIT compilation: JIT.md.
We expect to prepare an Aarch32-specific entry describing how JIT works in AMaCC as well.