
Nom

Nom is a prototype / toy programming language. I am developing it for fun, and to learn more about the inner workings of the compilation process.

Quick Start

Nom should work with the current version of Rust. At time of writing, that is Rust 1.80.0 (2021 Edition). To build and test this project:

  • Obtain an up-to-date version of Cargo.
  • Run cargo test. All tests should pass, unless I have recently started (and not finished) adding a new feature.

Then, you can have the project compile and run your Nom code. For instance:

  • Running cargo run < samples/successful/example_test.nom will compile and run one of the sample programs.

More about Nom

Nom is a compiled programming language, in the same sense that Java is compiled. A Nom program is transformed into a simple set of instructions, which run on a virtual machine (the Nom VM, or the Nom runtime). You could split hairs and say that it is interpreted, but you can say the same of Java, C#, etc. (although those have a JIT, and Nom certainly does not).

Here is a small program in Nom.

// Ruthlessly inefficient function computing fibonacci numbers.
fn fib(n: i32) -> i32 {
    var result: i32 = 0;
    if n == 1 {
        result = 1;
    }
    else if n == 2 {
        result = 1;
    }
    else {
        result = fib(n-2) + fib(n-1);
    };

    result
}

fn main() -> i32 {
    fib(20)
}

This and many other programs can be found in the samples folder. These are also the programs used in the tests, so you can verify that they all work and produce correct output.

Nom was strongly influenced by Rust and Zig, and to a lesser extent by every other language I know. It intends to offer direct control of memory, as those languages do, and a rich type system. It is currently a very simple language, but I intend for it to become more complex over time. The main goal of the project is to let me learn about implementing a language, so by necessity the language will get more and more complicated as I push into new areas.

Features:

  • Imperative programming basics. Variables, arithmetic, conditionals, while loops, and functions (even recursive functions). The syntax is much like Rust's, which for this subset of features means it should look reasonably familiar to those who know any C family language.
  • Expression-oriented syntax. Blocks, for instance, are expressions: they run some statements before evaluating to a final expression. Function bodies work like this as well, so you can omit the final return in a function, like in Rust. If statements are really just if expressions, evaluating to whichever block is selected. Returns and loops are also expressions, which provides a bit of flexibility.
  • Strong typing. Currently no support for type inference (beyond inferring literal types), but inference is planned. Built in types include the suite of normal integer types (e.g. i32, u8), as well as bool, unit (the type with one value, analogous to C's void) and bottom (the type with no value, symbolizing that a function never returns, or an expression never finishes evaluating).
  • User defined types, with struct. Structs can of course contain other structs. Struct expressions let you define an instance of a struct all at once (the only way struct values come into existence), but then a struct can be modified via the familiar member access notation (my_struct.my_field).
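
Since Nom's expression syntax tracks Rust's closely, the expression-oriented style above can be illustrated in ordinary Rust (this is Rust, not Nom):

```rust
// Blocks and `if` are expressions; the last expression in a block
// (the one without a semicolon) is the block's value.
fn classify(n: i32) -> &'static str {
    if n < 0 {
        "negative"
    } else if n == 0 {
        "zero"
    } else {
        "positive"
    }
}

fn main() {
    // The function body itself is a block expression, so there is no
    // explicit `return` above.
    println!("{}", classify(-5)); // prints "negative"
}
```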

Further documentation on the Nom programming language (not documentation for the project) can be found in nom_docs.md. I'll try to avoid letting it get too out of date.

Upcoming Features (probably):

  • Pointers are the next major milestone. They will enable comfortably working with arrays (without making copies all the time), and then we can do strings and I/O. I eventually want some sort of slice as well.
  • Multi-file programs, i.e. a module or import system. Some groundwork has been laid, but not fleshed out. With this, writing some basic utilities into a standard library could come next (perhaps with some functions being implemented natively in Rust?)
  • More user defined types. Tuples / anonymous structs are a huge convenience. Enums / unions that model sum types, completing an algebraic type system when combined with structs (product types). Perhaps we can have Zig-like support for optional or nullable values, and likewise for fallible values ("error-union type"). I also want to include a generic type system, but that may be implemented much further down the line.

This is a fairly high level roadmap. For a much more detailed version, see the project tab or the issues tab on GitHub.

More about the Project

Documentation for the code of the project is included in the code itself. However, it can be viewed in HTML form by running cargo doc. This only shows the (small) public interface, but documentation exists for private items as well, which can be viewed by running cargo doc --document-private-items.

Most of the code is contained in the library crate, rooted at lib.rs. There is also a simple binary crate (the single file, main.rs) which compiles and runs a program provided through standard input. In the future, main.rs should look more like the frontend of a compiler, handling flags and options and so on; right now, it is a very thin wrapper around the library.

The library code is broken up into modules. The first layer of modules is organized primarily by which stage of the compilation process they implement. This first layer is described below, but each of these primary modules may be further broken up into submodules.

  • lib.rs is the top level module, which includes a CompilationEnvironment struct that tracks all the currently known information about the program; its structure, its types, etc. It is important to note that this struct contains different amounts of information at different stages of the process, so env.types should be understood to reflect "currently known types", which means it may be empty early on. This struct also determines when different compilation stages take place for various files (so "scope checking" one file may trigger imports from another file).
  • token handles the first step of compilation: tokenization. It defines the Token struct, which describes a single unit of the program (an identifier, an operator, a brace, etc.). This module allows a string to be broken up into these tokens. Tokens also track "span" information, which describes where in the source file they came from, permitting nicer error messages.
  • One stage of compilation is not handled by code in this crate: parsing. By parsing, I mean specifically the process of discovering the structure of the program from a string of tokens. This step is handled by a separate, standalone crate I wrote called Parsley. Parsley has evolved in tandem with Nom, but Parsley is a far more general (and far more complete) tool. In this stage of compilation, the token string is transformed into a parsley::SyntaxTree<Token>. This captures the structure, and almost nothing else, of the program. That structure is described in grammar.parsley, which is given in a format similar to Backus-Naur form (BNF).
  • ast handles the process of transforming the SyntaxTree into an AST or abstract syntax tree. This entails validating many local features of the language, and performing simple transformations. The intent is to not capture the syntax the programmer typed, but the meaning they intended, so that all future stages of analysis proceed more easily. Extra information (e.g. the brackets surrounding an if block) are discarded. Data is stored in a more easily accessible manner. Finally, AST nodes have some additional data which allows them to be identified and associated with information discovered later in analysis (e.g. type information, which is computed later).
  • analysis controls the step where the program (as an AST) is repeatedly analyzed to determine further information. Further transformations to the code are also undertaken. Syntactic sugar is expanded into a standard form (desugar submodule). Items used in the programs are checked to ensure that they have been declared, and are in scope (scope_check submodule). Finally, the program is type checked (type_check submodule), which validates the type rules as well as discovering the types for every expression, variable, and function.
  • generate takes the finalized AST, and produces a list of instructions. There are some final optimizations made to just this list of instructions (currently very simple).
  • instructions naturally describes the instructions that the virtual machine / Nom runtime understands. A list of instructions fully describes a Nom program, so no other information is needed to run it. They can safely be stored somewhere to be run by the VM later.
  • runtime implements the virtual machine. It receives a list of instructions and runs them. The runtime is based loosely on Java's, but like the rest of Nom, much of it is developed from first principles. The architecture is stack based, so there are no registers beyond an instruction pointer, a stack frame base pointer, and a top-of-stack pointer. Nom's stack lives in Rust's heap, but it is actual memory which (in the future) can be manipulated directly by Nom programs, and perhaps someday even passed to external code (i.e. C code).
  • The other modules, error and util, are helper modules. As the language grows more complicated, I find I need to add better error reporting facilities in error in order to successfully write new tests / samples. In time, I hope that every possible error will generate a nice user-facing message / explanation, but right now many error messages are terse and intended for debugging the compiler, not the Nom program.
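
The stack-based runtime described above can be sketched as a minimal stack machine in Rust. The instruction names here are purely illustrative; Nom's actual instruction set differs:

```rust
// A minimal stack machine in the spirit of the Nom runtime description.
// Instruction names are hypothetical, not Nom's real instruction set.
#[derive(Clone, Copy)]
enum Instr {
    Push(i64),
    Add,
    Mul,
}

/// Runs a program and returns the value left on top of the stack,
/// or None if the stack underflows.
fn run(program: &[Instr]) -> Option<i64> {
    // The VM stack lives on the host (Rust) heap, as described above.
    let mut stack: Vec<i64> = Vec::new();
    for instr in program {
        match *instr {
            Instr::Push(v) => stack.push(v),
            Instr::Add => {
                let b = stack.pop()?;
                let a = stack.pop()?;
                stack.push(a + b);
            }
            Instr::Mul => {
                let b = stack.pop()?;
                let a = stack.pop()?;
                stack.push(a * b);
            }
        }
    }
    stack.pop()
}

fn main() {
    // Computes (2 + 3) * 4.
    let prog = [Instr::Push(2), Instr::Push(3), Instr::Add, Instr::Push(4), Instr::Mul];
    println!("{:?}", run(&prog)); // Some(20)
}
```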

The Nom runtime isn't particularly fast, as one might imagine from a toy language. While I say there are optimizations, they are currently very simple. That being said, a carefully written Nom program (written with an understanding of how it compiles) can achieve performance similar to a translation of that program into Python (even performing around 10% better), which is a pleasant surprise. Make sure you build the project in release mode (cargo build --release) before doing benchmarks.

I find a test driven approach is very helpful in developing this project. There are a couple of unit tests for some modules, but the majority of tests are integration tests for new language features. To extend the language, I recommend writing a test or two with the new feature, then going through all the modules in order, updating as necessary. Rust's type system is very helpful here - often I need to make just a few well thought out changes to, for instance, the AST type, and after that the rest of the work is just going through the remaining modules and making sure they compile. Rust points out all the places that need to consider how to handle the new AST variant, which is very helpful.

When I am done, cargo test should pass, with no compilation warnings either. cargo clippy (or cargo clippy -- -D clippy::pedantic) should also emit no warnings (to be fair, I have been known to silence those lints I disagree with). Finally, cargo fmt should be used to fix any formatting issues.


nom's Issues

Documentation

Read this, and apply to project: https://doc.rust-lang.org/rustdoc/

Note that cargo doc --document-private-items can generate full documentation including private items, which is good for my use case. Also, documentation of items appears in vscode tooltips, so that is also nice.

Return

Allow the return keyword in functions. I am not sure if return should be a statement or an expression.

One concern is type checking - ideally a branch that returns should have some sort of diverges type, which can coerce to any other type. Does this only work if we use return as a statement? Should using return as an expression statement be allowed in this case? I guess dead code is ok if it is being actively worked on or the live code is just a test, but it might necessitate some sort of analysis that says that a branch yields the bottom type if it is guaranteed to always hit a return...
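
For reference, Rust resolves this exact question with its never type `!`: a `return` expression has type `!`, which coerces to any other type, so a diverging branch type checks against the other branch:

```rust
// In Rust, `return 0` has the never type `!`, which coerces to i32,
// so both arms of this `if` expression unify to i32.
fn clamp_positive(n: i32) -> i32 {
    let v: i32 = if n >= 0 { n } else { return 0 }; // else branch diverges
    v
}

fn main() {
    println!("{} {}", clamp_positive(7), clamp_positive(-3)); // 7 0
}
```

This is the "bottom coerces to anything" rule the issue is describing, and it works precisely because `return` is usable as an expression.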

Inline functions

For some functions, calling the function is a significant portion of the overhead. I'd like to eventually allow functions to be inlined by the compiler if it determines it is a good idea.

I don't think I want to expose an inline keyword though, I think that could be kinda messy.

We should choose carefully which functions to inline. For instance, a function that uses each of its arguments exactly once and returns in one place might be apt for inlining. (Note - maybe this is not as good as it seems: if we want arguments to be resolved in a specific order, we are going to have to store them somewhere no matter what).

So determine if this is feasible, which functions it would be good for, and then implement inlining for those functions.

Floats

At time of writing, there is no support for floating point arithmetic. That should probably change.

Note that if casting has been implemented, this will likely require updating the casting rules.

Const Analysis

Nom currently has syntax to support declaring some variables as val (constant) and some as var (mutable). However, currently, this does nothing, and all variables are treated as mutable.

Update the scope checking system to also determine which variables are mutated, and emit an error if const correctness is violated.

Until further notice, function parameters should be treated as const.

Casts

There should be some syntax (as or maybe @) that allows casting, even if only between primitive types. Note that some numeric conversions have already been implemented in the bytecode, but some other casts (particularly those involving floats) might need more work.

Else statement

We have if expressions; now it's time for else blocks.

Be delicate with the syntax here - we want else-if for free of course.

Also, make sure that the type of an if-else expression is computed correctly.

Note - may require better type unification - easier said than done...

Structs

Implement a basic system for user defined types (structs).

Users should be able to declare new structs, with fields being any type that currently exists (e.g. other structs). Recursion is forbidden (though in the future, it should be o.k. to include pointers within a struct).

If needed, we can (temporarily) enforce a C-like rule where structs (and types in general) need to be declared in order. Eventually though it would be nice if we didn't need this.

Improve Built-In types and literals

  • Write literal_fits
  • Literals can definitely use some work overall
    • We need a way to write up to i64 and u64 literals, with validation and also
      proper conversion later on.
  • Implement rule that main must return i32.
  • Test unit type?
  • Builtin types beyond unit and i32
    • Likely we will want
    • Requires an actual type analysis. I'd like to have one that associates a type
      to every expression using the identity in NodeData.
    • Requires a conversion AST node, ergo requires the ability to modify the AST.
    • We need to generate code for this AST node, which cares mostly about alignment.
    • We might need some support from the VM to handle narrowing conversions gracefully.
      • Or maybe all conversions gracefully. Byte order is a nuisance.

Semicolons and blocks

Right now, blocks require semicolons in all cases.
Update the grammar so that a block can exist without a semicolon. This should generate the same expression statement as the version with a semicolon.

Note - due to the "no ambiguity" rule in the parser, this means disallowing empty expressions (i.e. extra semicolons). However, extra semicolons are weird, so it really doesn't matter.

Begin a standard library

Although we don't yet have strings (strings are surprisingly complex) we would like to be able to print some things. Perhaps our first standard library functions can be utilities to print common data types for tests (we no longer need to rely on debug printing the error code for tests).

Begin work on a standard library. This will entail adding some way for a nom file to import the standard library or certain functions from the standard library, and may require some name resolution feature.

Standard library functions should only be compiled if they are used. However, I think all functions written by the user should be compiled.

Adding printing in particular may require extending the runtime and referencing some builtin operation somehow. Perhaps some standard library functions can be implemented directly in the bytecode?

Token Error

It is getting annoying to have to determine where the code is broken even when writing (the longer) tests. Usually this is because I still naturally forget semicolons, especially where blocks end (see #19).

First, determine how to test errors that come out of Nom. Ideally we integrate them into test files like with the other tests.

Next, develop a system for displaying nice errors nicely. I've been poking around the rust compiler, and I like the way things are handled there. Ideally, we can put the wording of errors in a separate file, and we can emit them with a single line of code wherever they are spotted.

Finally, ensure Nom pretty prints an error whenever it cannot parse. It should give the location where the parse fails. I think it would be nice if we could print and mark the line where a failure occurs.

We already know how to get the "possible next tokens" when a parse fails. I think a brief, heuristic analysis of this should be enough to make a suggestion. In particular, we should be able to suggest things like commas, semicolons, and parentheses/brackets/braces.

Design: Type system

I have the following notes on the type system:

  • Type system: Simplest approach to start - everything is a value type.
    • Generalize the type system somewhat. Add more builtin types.
    • Permit type conversion (conversion to wider type)
    • Syntax for type coercion (cast to smaller type. Could be checked?)
    • User defined types:
      • Type Alias
      • Product Type
      • Sum Type
    • Arrays (and slices?)
      • Requires a sort of "builtin generic behavior". Maybe just cheat for now, or skip to actual generics?
      • May want an explicit allocator. Maybe hide in standard library.
    • Strings!
    • Exotic types:
      • void / unit
      • unreachable / never / diverging
        • I'm fond of adding an explicit 'unreachable' keyword like Zig. Rust has something similar as a macro.
        • However, it's unclear if divergence should be a type or a statement or both. Maybe I need to reconsider my stance on statements vs expressions.
      • any? I kinda hate interface{} or anytype or void*, but they can be useful in some places.
    • Convenience types:
      • e.g. tuples, slices, optionals and errors. Records? Newtype?
      • References / pointers / borrows / view / lens

At this point, the core of the language should be in reasonably good shape, and we would like to add many useful types. It is time to carefully design the type system. Rust and Zig are my primary motivation at this point, but there are some other good ideas out there.

When done, spawn issues for implementing support for each of the new types / ways for users to create types.

Modulus Operator

The modulus operator begins to look more and more useful now that we have loops.

Add modulus. Make sure to add mod-equals as well.

Boolean operators

Requires: #2 Booleans

Still need to choose syntax - are keyword operators (not, or, and) preferable to symbolic ones?

Implement short-circuiting logic. This might mean performing a transformation to the AST to include the implicit jump logic.

The pseudocode

if (p and q) {
  print("Hi");
}

might become an AST which could literally be interpreted as

if (if (p) {q} else {false}) {
  print("Hi");
}
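
The same desugaring can be checked in Rust, where `and_desugared` is a hypothetical helper standing in for the transformed AST:

```rust
// `p and q` desugared as `if p { q } else { false }`: the right-hand
// side is only evaluated when p is true, giving short-circuit behavior.
fn and_desugared(p: bool, mut q: impl FnMut() -> bool) -> bool {
    if p { q() } else { false }
}

fn main() {
    let mut evaluated = false;
    // With p == false, the desugared form never evaluates q.
    let result = and_desugared(false, || { evaluated = true; true });
    println!("{} {}", result, evaluated); // false false
}
```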

While loops

Requires #2 Booleans.

Once if statements are added, while loops should be pretty easy.

Should we add break and continue? How might those work in the future if we don't add them now? Can we break to an arbitrary outer loop? Outer block even (see Zig)?

Booleans

Add Booleans and related constructs.

  • Boolean Type
  • Boolean Literals
  • 6 standard integer comparison operators
  • if expression (with else)
    • Should always require a block, and takes on the type of the block, which must be unit if no else is provided.

Unary negate and negative literals.

In principle, this is easy.

However, we need to be careful not to introduce ambiguity into the grammar, so likely we will only have a negate operator (and not negative literals). Then, as an AST transformation, we should fold a unary negate + literal into just a negative literal. This is a little tricky still, since ideally we do bounds-checking after we perform the fold. That would happen during type analysis though, so this is not a huge issue.

One test should be to construct the largest and smallest i64 values.

Type checking must be smart enough to know that unsigned integers cannot be negated.
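
A sketch of the fold as an AST transformation, on a toy two-variant AST (the real Nom AST is of course richer):

```rust
// Folding `Neg(Literal(v))` into `Literal(-v)`, as described above.
#[derive(Debug, PartialEq)]
enum Expr {
    Literal(i64),
    Neg(Box<Expr>),
}

fn fold_neg(e: Expr) -> Expr {
    match e {
        Expr::Neg(inner) => match fold_neg(*inner) {
            // The child folded down to a literal: absorb the negation.
            Expr::Literal(v) => Expr::Literal(-v),
            // Otherwise keep the negation node around the folded child.
            folded => Expr::Neg(Box::new(folded)),
        },
        other => other,
    }
}

fn main() {
    let e = Expr::Neg(Box::new(Expr::Literal(5)));
    println!("{:?}", fold_neg(e)); // Literal(-5)
}
```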

Explicit `never` type

Refer to #14. We have a bottom type, but it is inaccessible to users (only used internally for the value of return expressions).

Compound Assignment Operators

Now that we have loops, compound assignment operators feel long overdue. It might be time to add a stage to the analysis step to process syntactic sugar.

While here, consider breaking the analysis module into several files. It is getting kinda bloated.

I am considering the visitor pattern as well. The concern is that some recursive processes are too different.
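
A sketch of what the desugaring stage might do for `+=`, on a toy AST (the names and shapes here are illustrative, not Nom's actual AST):

```rust
// Rewriting `x += e` into `x = x + e` as a pure AST transformation.
#[derive(Debug, PartialEq, Clone)]
enum Expr {
    Var(String),
    Literal(i64),
    Add(Box<Expr>, Box<Expr>),
}

#[derive(Debug, PartialEq)]
enum Stmt {
    Assign(String, Expr),
    AddAssign(String, Expr),
}

fn desugar(s: Stmt) -> Stmt {
    match s {
        Stmt::AddAssign(name, rhs) => {
            // `x += e`  becomes  `x = x + e`.
            let lhs = Expr::Var(name.clone());
            Stmt::Assign(name, Expr::Add(Box::new(lhs), Box::new(rhs)))
        }
        other => other,
    }
}

fn main() {
    let s = Stmt::AddAssign("x".to_string(), Expr::Literal(1));
    println!("{:?}", desugar(s));
}
```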

Better Scope Analysis

Currently, scopes are "leaky", meaning a variable defined inside one block can be used outside of that block.
This will become particularly problematic after we add if statements.

Make it so that variables are checked to see if they follow scope rules. Emit errors if a variable disobeys scope rules.

Ideally we could prepare for adding support to the global scope, or just add it directly. Then, compilation can error early in the case of a misnamed function.
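
One standard fix is a stack of scopes, pushed and popped as blocks are entered and exited. A minimal Rust sketch (a hypothetical API, not the project's actual scope checker):

```rust
// A scope stack: declarations made inside an inner block disappear
// when that block ends, so variables can no longer "leak" out.
struct Scopes {
    stack: Vec<Vec<String>>,
}

impl Scopes {
    fn new() -> Self {
        Scopes { stack: vec![Vec::new()] } // start with one outer scope
    }
    fn enter(&mut self) {
        self.stack.push(Vec::new()); // entering a block
    }
    fn exit(&mut self) {
        self.stack.pop(); // leaving a block drops its declarations
    }
    fn declare(&mut self, name: &str) {
        self.stack.last_mut().unwrap().push(name.to_string());
    }
    fn in_scope(&self, name: &str) -> bool {
        self.stack.iter().any(|scope| scope.iter().any(|n| n == name))
    }
}

fn main() {
    let mut scopes = Scopes::new();
    scopes.declare("a");
    scopes.enter();
    scopes.declare("b");
    scopes.exit();
    // "b" was declared in the inner block, so it is no longer visible.
    println!("{} {}", scopes.in_scope("a"), scopes.in_scope("b")); // true false
}
```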

Global Variables

Requires #5 Scope Analysis.

Allow variable declarations in the global scope. They should be allowed to be mutable (even though that is considered a bad decision in many cases).

This will require extending the VM, as we need some sort of global segment to be prepared. Perhaps we will add a Pseudo instruction that will be transformed after the relevant globals are placed.

Note that when pointers or references become available, we would like a way to take the address of a global. Perhaps we could implement global access through Nom pointers? This of course might not be ideal - the tradeoff is efficiency (this would be inefficient) vs. simplicity of the instruction set, and simplicity in eventually determining how to build pointers.

Token Spans

Tokens should describe their location in the file. Namely, a line and character, as well as a file name. If possible, the file name should be passed by reference, which means we might use a Cow<?>.

AST nodes should also remember span location (which may span multiple tokens). This could be stored in the NodeData field. Alternatively, a hash table indexed by the node ID.

AST transformations should allow new nodes to inherit spans from the nodes they replace.

Eventually, analysis errors could contain span information, which could allow for much prettier and more specific error messages. This requires reopening the erroring file of course.
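
A minimal sketch of such a span type in Rust, using `Cow` so the file name can be borrowed rather than copied (field names are hypothetical):

```rust
use std::borrow::Cow;

// A hypothetical span: file name plus line and column. `Cow` lets the
// common case borrow one shared file name instead of cloning a String
// into every token.
#[derive(Debug, Clone)]
struct Span<'a> {
    file: Cow<'a, str>,
    line: u32,
    col: u32,
}

fn main() {
    let file = String::from("samples/successful/example_test.nom");
    // Thousands of tokens can borrow `file`; none of them copy it.
    let span = Span { file: Cow::Borrowed(file.as_str()), line: 3, col: 7 };
    println!("{}:{}:{}", span.file, span.line, span.col);
}
```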

Package compiled programs into a binary format

Right now, the code is compiled and immediately run. The runtime and the compiler should be separated into different binaries (or maybe just guarded by flags).

Create a binary file format that serializes the program data. Of course, we need a way to deserialize it and begin running the program again.

Pay attention to the tests, which should compile and run ideally without emitting extra files. Of course, we can also have a different test that does emit a file.
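
A tiny Rust sketch of such a round trip, using a made-up two-instruction encoding (the byte tags and layout here are illustrative, not a proposed format):

```rust
// Round-tripping an instruction list through bytes: one tag byte per
// instruction, with Push carrying an 8-byte little-endian operand.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Instr {
    Push(i64),
    Add,
}

fn serialize(prog: &[Instr]) -> Vec<u8> {
    let mut out = Vec::new();
    for instr in prog {
        match *instr {
            Instr::Push(v) => {
                out.push(0); // tag 0 = Push
                out.extend_from_slice(&v.to_le_bytes());
            }
            Instr::Add => out.push(1), // tag 1 = Add
        }
    }
    out
}

fn deserialize(bytes: &[u8]) -> Option<Vec<Instr>> {
    let mut prog = Vec::new();
    let mut i = 0;
    while i < bytes.len() {
        match bytes[i] {
            0 => {
                // Push: read the 8-byte operand, failing on truncation.
                let operand = bytes.get(i + 1..i + 9)?;
                let v = i64::from_le_bytes(operand.try_into().ok()?);
                prog.push(Instr::Push(v));
                i += 9;
            }
            1 => {
                prog.push(Instr::Add);
                i += 1;
            }
            _ => return None, // unknown tag: reject the file
        }
    }
    Some(prog)
}

fn main() {
    let prog = vec![Instr::Push(2), Instr::Push(3), Instr::Add];
    let bytes = serialize(&prog);
    println!("{:?}", deserialize(&bytes)); // Some([Push(2), Push(3), Add])
}
```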
