pilif0 / basilisk

LLVM frontend for my pet programming language

License: MIT License

CMake 1.53% C++ 98.43% Shell 0.04%

llvm programming-language cpp cpp17 compiler-frontend cmake llvm-frontend

basilisk's Introduction

Basilisk

This project is my effort to learn the basics of compiler design and LLVM. It is an LLVM frontend for a simple C-like language with no particular purpose. I will be adding features to the language as I learn about them and how they can be implemented.

Built With

Usage

The main executable is basilisk, located in the tools directory. This executable handles the full compilation from a source file to a native object file for the host machine. It supports input and output through standard streams and files, and output can be generated at any stage of the process by using command line options (--lex, --parse, ...). For a full usage description, run basilisk -h to display the help screen.

Building

Build Requirements

Boost >= 1.69.0,
LLVM (developed with 8.0.0svn)

The main supported compiler is Clang, but GCC should also work with minimal adjustments. The recommended linker is LLVM's lld, mainly for the easy to understand warning messages.

CMake Options

basilisk_BUILD_TEST — build tests (default: ON),
basilisk_BUILD_DOC — build documentation (default: ON),
basilisk_LLVM — LLVM build directory (default: /opt/llvm),
basilisk_BOOST — Boost directory (default: /opt/boost)

Compilation Warnings

The compiler warnings enabled are all of -Wall, -Wextra and -Wpedantic. There is also a configuration file and run script for clang-tidy. The library itself should produce no warnings with either of these if at all possible.

Contributing

Please read the Contributing Guide for details on the contribution process.

Versioning

This project uses Semantic Versioning.

Authors

Filip Smola

Licence

This project is licensed under the MIT licence - see LICENCE.md for details.

basilisk's Issues

Identifiers starting with underscore

Currently all identifiers have to start with a letter. I think it would be good to expand this to allow identifiers starting with an underscore, which is often used in other languages.
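
The proposed rule can be sketched as a small predicate. This is only an illustration: is_identifier is a hypothetical helper, not a function from the Basilisk sources, and the set of characters allowed after the first one (letters, digits and underscores) is an assumption, since the issue only talks about the first character.

```cpp
#include <cctype>
#include <string_view>

// Hypothetical helper, not taken from the Basilisk sources.
// Accepts names that start with a letter or an underscore and continue
// with letters, digits or underscores (the continuation set is assumed).
bool is_identifier(std::string_view text) {
    if (text.empty()) {
        return false;
    }
    auto is_start = [](char c) {
        return std::isalpha(static_cast<unsigned char>(c)) != 0 || c == '_';
    };
    auto is_rest = [](char c) {
        return std::isalnum(static_cast<unsigned char>(c)) != 0 || c == '_';
    };
    if (!is_start(text.front())) {
        return false;
    }
    for (char c : text.substr(1)) {
        if (!is_rest(c)) {
            return false;
        }
    }
    return true;
}
```

With this rule, _tmp and x1 are accepted while 1x is still rejected.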

Can order of definitions in a program not matter?

As discussed in the LLVM IR generation pull request (#4), there is a question of whether order of definitions in a program should matter. At the time of that pull request, making definition order not matter would produce ambiguities and would require handling of special cases. Therefore the decision was taken to make the order matter.

This issue is created to continuously examine when the change to definition order not mattering could be made, and what it would entail.

Data types

The language needs to have more data types than just doubles. I propose at least the following types based on the types in LLVM:

Void
Boolean
Integers of widths 8, 16, 32 and 64 bits (byte, short, int, long)
Floating-point values of widths 32 and 64 bits (float, double)
Array
Structure

It might also be good to implement further types while this is being done, such as vectors, and prepare for later implementation of pointers.

With these new types, it seems appropriate to expand the set of valid literals:

Decimal, binary, hexadecimal integer literals - e.g. 7, 0b111, 0x7
Floating point literals - e.g. 3.14
Scientific notation - e.g. 3.6e2, 2e-4
Boolean literals - true and false

Furthermore, underscores should be allowed and discarded in literals, allowing more readable formatting, for example 0xffff_f0f0_abcd_1234 instead of 0xfffff0f0abcd1234.
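
A minimal sketch of how the integer literal forms above could be read, with underscores discarded. The function name parse_integer_literal and the choice of a 64-bit unsigned result are assumptions made for illustration; the sketch also does not restrict where underscores may appear, which a real lexer would probably want to do.

```cpp
#include <cstdint>
#include <limits>
#include <optional>
#include <string_view>

// Hypothetical helper: parses decimal, 0b (binary) and 0x (hexadecimal)
// integer literals, skipping underscores. Returns std::nullopt for
// malformed input or for values that do not fit in 64 bits.
std::optional<std::uint64_t> parse_integer_literal(std::string_view text) {
    std::uint64_t base = 10;
    if (text.size() > 2 && text[0] == '0' && (text[1] == 'x' || text[1] == 'b')) {
        base = (text[1] == 'x') ? 16 : 2;
        text.remove_prefix(2);
    }

    constexpr std::uint64_t max = std::numeric_limits<std::uint64_t>::max();
    std::uint64_t value = 0;
    bool seen_digit = false;
    for (char c : text) {
        if (c == '_') {
            continue;  // underscores are only there for readability
        }
        std::uint64_t digit = 0;
        if (c >= '0' && c <= '9') {
            digit = static_cast<std::uint64_t>(c - '0');
        } else if (c >= 'a' && c <= 'f') {
            digit = static_cast<std::uint64_t>(c - 'a') + 10;
        } else if (c >= 'A' && c <= 'F') {
            digit = static_cast<std::uint64_t>(c - 'A') + 10;
        } else {
            return std::nullopt;  // not a digit in any supported base
        }
        if (digit >= base) {
            return std::nullopt;  // e.g. '2' in a binary literal
        }
        if (value > (max - digit) / base) {
            return std::nullopt;  // would overflow 64 bits
        }
        value = value * base + digit;
        seen_digit = true;
    }
    if (!seen_digit) {
        return std::nullopt;
    }
    return value;
}
```

Under these assumptions, 0xffff_f0f0_abcd_1234 and 0xfffff0f0abcd1234 parse to the same value, and 7, 0b111 and 0x7 all yield seven.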

Unsigned versions of the types should also be considered. LLVM doesn't distinguish between signed and unsigned integer types; signedness is decided by the instruction selected (for example sdiv versus udiv).
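
To illustrate why signedness would have to be tracked by the frontend, the following sketch applies a signed and an unsigned division to the same 8-bit pattern in plain C++. The function names are made up for this example; the comparison to LLVM's udiv and sdiv instructions is only an analogy.

```cpp
#include <cstdint>

// Treats the 8-bit pattern as unsigned before dividing,
// comparable to applying LLVM's udiv to an i8 value.
int halve_as_unsigned(std::uint8_t bits) {
    return bits / 2;
}

// Reinterprets the same pattern as signed before dividing, comparable to
// sdiv. The conversion to int8_t wraps in two's complement on mainstream
// compilers and is guaranteed to do so since C++20.
int halve_as_signed(std::uint8_t bits) {
    return static_cast<std::int8_t>(bits) / 2;
}
```

For the pattern 0xFE, the unsigned reading gives 127 and the signed reading gives -1, even though the stored bits are identical.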

Possibility of dropping semicolon requirement

Currently all statements are required to end in a semicolon. While thinking about designs for other features, I started wondering whether this requirement is really necessary or could be dropped.

The semicolon currently works to divide statements. As one of my main principles for basilisk is that whitespace should not matter beyond dividing tokens, I can't replace it with a newline and force each statement onto a separate line. This would give meaning to whitespace and make it less suitable for formatting code without impacting function.

This issue is focused on simply dropping the semicolon and seeing what ambiguities are produced and if they can be reconciled. If it seems that all possible ambiguities can be easily solved, I would proceed with removing the requirement while keeping the option to include a semicolon there in case it is preferable for readability.

Error token recognition

Currently error tokens are picked up by the parser as unexpected tokens (as error tokens are never expected). It would be better if a unified way of intercepting error tokens were added. Then they could be reported better, with possible recommendations based on the context. One of the main requirements for the solution is that it interferes as little as possible with the actual parsing, in order to keep the parser as easy to expand as possible.
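
One possible shape for such an interception layer is a filter that sits between the lexer and the parser. The sketch below is an assumption about how this could look, not the project's design: TokenKind, Token and ErrorFilter are hypothetical names, and a real version would pull tokens from the lexer lazily rather than from a vector.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical token model; Basilisk's real token type may differ.
enum class TokenKind { Identifier, Number, Semicolon, Error, End };

struct Token {
    TokenKind kind;
    std::string text;
};

// Sits between lexer and parser: error tokens are recorded and dropped,
// so the parser itself never has to handle them.
class ErrorFilter {
public:
    explicit ErrorFilter(std::vector<Token> input) : input_(std::move(input)) {}

    // Returns the next non-error token, or an End token once the input
    // is exhausted.
    Token next() {
        while (pos_ < input_.size()) {
            Token tok = input_[pos_++];
            if (tok.kind != TokenKind::Error) {
                return tok;
            }
            errors_.push_back("unrecognised input '" + tok.text + "'");
        }
        return Token{TokenKind::End, ""};
    }

    // Messages collected for every error token seen so far.
    const std::vector<std::string>& errors() const { return errors_; }

private:
    std::vector<Token> input_;
    std::size_t pos_ = 0;
    std::vector<std::string> errors_;
};
```

Because the parser only calls next(), its grammar rules stay unchanged, which matches the requirement of interfering with parsing as little as possible. Context-based recommendations could later be added inside the filter.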

Global variable multiple initializers

Multiple definitions of the same global variable currently produce multiple initializers, with the variable taking on the value of the last initializer for the full execution. This behaviour is unintuitive and should be removed. A good time to straighten this out would be when adding more data types and differentiating variable definitions from assignments.
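
One way to remove the behaviour would be to reject a second definition outright. The sketch below shows that option only; GlobalTable is a hypothetical class, and whether Basilisk should report an error or handle redefinition differently is left open by the issue.

```cpp
#include <map>
#include <string>

// Hypothetical table of global variables. All values are doubles,
// matching the single data type the language currently has.
class GlobalTable {
public:
    // Registers a definition. Returns false if the name is already
    // defined, leaving the original initializer untouched so the caller
    // can report a redefinition error.
    bool define(const std::string& name, double initializer) {
        return globals_.emplace(name, initializer).second;
    }

    // Returns the initializer of a defined global; throws
    // std::out_of_range if the name is unknown.
    double initializer_of(const std::string& name) const {
        return globals_.at(name);
    }

private:
    std::map<std::string, double> globals_;
};
```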

Simple function definition variant

A feature I enjoyed in my work with Kotlin was defining functions with an expression body (see Kotlin reference). These make the code cleaner and easier to read, and shouldn't be too hard to implement.

In essence, a function definition:

f(x, y) = x + y;

would be equivalent to:

f(x, y) {
    return x + y;
}
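
On the compiler side, this equivalence suggests the parser could desugar the short form into the existing one. The following is a rough sketch under invented AST types (Expression, Statement, FunctionDefinition and from_expression_body are all hypothetical and much simpler than a real AST):

```cpp
#include <string>
#include <utility>
#include <vector>

// Placeholder AST types for illustration only.
struct Expression {
    std::string text;  // stands in for a real expression tree
};

struct Statement {
    enum class Kind { Return, Other };
    Kind kind;
    Expression value;
};

struct FunctionDefinition {
    std::string name;
    std::vector<std::string> parameters;
    std::vector<Statement> body;
};

// Builds the block-bodied definition that an expression-bodied one
// would be equivalent to: a body holding a single return statement.
FunctionDefinition from_expression_body(std::string name,
                                        std::vector<std::string> parameters,
                                        Expression body) {
    FunctionDefinition def{std::move(name), std::move(parameters), {}};
    def.body.push_back(Statement{Statement::Kind::Return, std::move(body)});
    return def;
}
```

With this approach the later compilation stages would never see the short form, so code generation would not need to change.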

Nested blocks

If we regard a block of statements as a statement in itself, they can naturally be nested. This would allow better management of scope as well as prepare for implementation of conditional statements and loops.

Tasks:

Adjust grammar to consider a block of statements as a statement
Add AST node type Block containing a set of statements to reflect this
Generate code from this node type by pushing a scope on the named values stack, executing the statements in sequence and popping the scope
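
The scope handling in the last task can be sketched as follows. ScopeStack is a hypothetical stand-in for the named values stack: it stores doubles to stay self-contained, whereas a real code generator would presumably map names to LLVM values.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for the named values stack.
class ScopeStack {
public:
    // Called when code generation enters a block.
    void push() { scopes_.emplace_back(); }

    // Called when code generation leaves a block; names declared in the
    // block disappear with it.
    void pop() {
        assert(!scopes_.empty());
        scopes_.pop_back();
    }

    // Declares (or overwrites) a name in the innermost scope.
    void declare(const std::string& name, double value) {
        assert(!scopes_.empty());
        scopes_.back()[name] = value;
    }

    // Looks a name up from the innermost scope outwards, so inner
    // declarations shadow outer ones.
    std::optional<double> lookup(const std::string& name) const {
        for (auto scope = scopes_.rbegin(); scope != scopes_.rend(); ++scope) {
            auto found = scope->find(name);
            if (found != scope->end()) {
                return found->second;
            }
        }
        return std::nullopt;
    }

private:
    std::vector<std::map<std::string, double>> scopes_;
};
```

Generating code for a Block node would then amount to calling push(), generating each contained statement in order, and calling pop().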

Remove main function wrapper

Due to everything being a double, there needs to be a wrapper around the main function that converts the double it returns into the integer that the system expects. Once more data types are added this wrapper can be removed.

Rich tokens

More information (e.g. line number and character) should be included in tokens. This information should then be used in the parser to improve error reporting.
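
As a rough sketch of what a richer token could carry, the example below records a 1-based line and column for each token. SourcePosition, RichToken and tokenize are hypothetical names, and splitting on whitespace is a deliberate simplification, not how the Basilisk lexer works.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Where a token starts in the source; both fields are 1-based.
struct SourcePosition {
    std::size_t line;
    std::size_t column;
};

struct RichToken {
    std::string text;
    SourcePosition position;
};

static bool is_blank(char c) {
    return c == ' ' || c == '\t' || c == '\r' || c == '\n';
}

// Splits the source on whitespace and records where each piece starts.
std::vector<RichToken> tokenize(std::string_view source) {
    std::vector<RichToken> tokens;
    std::size_t line = 1;
    std::size_t column = 1;
    std::size_t i = 0;
    while (i < source.size()) {
        if (source[i] == '\n') {
            ++line;
            column = 1;
            ++i;
        } else if (is_blank(source[i])) {
            ++column;
            ++i;
        } else {
            const std::size_t start = i;
            const SourcePosition position{line, column};
            while (i < source.size() && !is_blank(source[i])) {
                ++i;
                ++column;
            }
            tokens.push_back(RichToken{std::string(source.substr(start, i - start)), position});
        }
    }
    return tokens;
}
```

A parser holding such tokens could then prefix its error messages with the line and column of the offending token.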