andgate / higher-c
Higher-C compiler toolchain, written in Haskell.
License: BSD 3-Clause "New" or "Revised" License
Hawk being primarily a personal research language, I've continued to dig into the literature on linear typing. Originally, the plan was for a linearity checker that would run after type checking and ensure that values in functions were used according to the rules of linear typing. However, further research has led me to conclude that this work should instead be done by the type checker.
However, there is the issue of the noconsume tag. It does not fit neatly into the model of the typechecker, and it turns out there is a better way to provide this functionality. The values of a function marked noconsume essentially have the following needs:

1) Can be used any number of times, including zero
2) Cannot substitute a linear value (as it may be free'd)
3) Cannot be introduced through let binding
4) Can be introduced through abstraction
5) Can not be the result of abstraction elimination
Number one suggests immediately that we are working within traditional intuitionistic typing. Relevant typing would suffice, but pattern matching requires that some values may never be used. Since these values are not intended to be memory managed, we cannot provide an escape hatch in the form of free, so usage is bounded below by zero.
It seems to me that we are dealing with two kinds of types, intuitionistic and linear. We can define these as such:
data Kind = * | o | Kind -> Kind
Here we have star (*) kinds for intuitionistic types, and pop (o) kinds for linear types.
Differentiating between linear and intuitionistic types at the kind level has advantages over other methods. I've considered using ! to mark types in a type signature as intuitionistic, but this approach requires that we introduce subtyping into the type system to avoid complications. Moving this specification into the kind system simplifies the problem. Subkinding is a more manageable solution, as the kind system will be simpler than the type system. Linear kinds are, quite naturally, a subkind of intuitionistic kinds.
o <: *
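As a sketch of how this subkind relation might be checked in the compiler (the Kind type and subKind function are my own illustration, not actual hkc code; the arrow-kind case is an extra assumption):

```haskell
-- Kinds for the proposed system: Star (*) for intuitionistic types,
-- Pop (o) for linear types, and arrow kinds for type constructors.
data Kind
  = Star
  | Pop
  | KArr Kind Kind
  deriving (Eq, Show)

-- The subkind relation: o <: *, every kind is a subkind of itself,
-- and (as an assumption) arrow kinds are contravariant in their argument.
subKind :: Kind -> Kind -> Bool
subKind Pop Star = True
subKind (KArr a1 r1) (KArr a2 r2) = subKind a2 a1 && subKind r1 r2
subKind k1 k2 = k1 == k2
```

Note that subKind Star Pop fails, which is what enforces requirement 2 below.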
With subkinding, the solutions to the other requirements follow naturally.

2) Cannot substitute a linear value (as it may be free'd)
Since * is not a o, we won't be able to use values with star types in functions expecting a linear type. Since o is a *, we can still use o in places where * is expected. This is the exact behavior we want.

3) Cannot be introduced through let binding
Let bindings are restricted to introducing values of kind o. This is a simple solution that can be type checked.

4) Can be introduced through abstraction
Abstraction introduces values of kind *. By the subkind relation, we get o as well.

5) Can not be the result of abstraction elimination
This last one leads to an interesting result. It suggests that no arrow type should have kind *. So, my solution is to have two arrows: one linear, one non-linear.
(->) : * -> o -> o
(-o) : o -> o -> o
infixr 0 -> -o
This is not a restriction we could make with the ! syntax that exists in most of the literature on linear typing. With kinds, we are specifying that -o only accepts a linear kinded type in its first argument, while -> can accept non-linear and, by the subtyping relation, linear types. This allows us to build types like Int -> Int -> Int that are specified to take non-linear values, but ultimately produce a linear value. That is, we can define non-consumers as regular functions, but they get all the restrictions we desired of them, and they can be sanely typechecked!
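As an illustration, the kinding of the two arrows can be sketched like so (my own stand-in types, with the kinds and subkind relation repeated so the sketch stands alone, and assuming Int : o):

```haskell
-- Kinds and the subkind relation, repeated so this sketch is self-contained.
data Kind = Star | Pop | KArr Kind Kind
  deriving (Eq, Show)

subKind :: Kind -> Kind -> Bool
subKind Pop Star = True
subKind k1  k2   = k1 == k2

-- The two arrows, with the kinds given above:
--   (->) : * -> o -> o      (-o) : o -> o -> o
arrowKind, linArrowKind :: Kind
arrowKind    = KArr Star (KArr Pop Pop)
linArrowKind = KArr Pop  (KArr Pop Pop)

-- Apply a constructor's kind to an argument's kind, admitting a
-- subkind wherever a superkind is expected.
applyKind :: Kind -> Kind -> Maybe Kind
applyKind (KArr param result) arg
  | subKind arg param = Just result
applyKind _ _ = Nothing

-- Building Int -> Int -> Int with Int : o: the linear Int is accepted
-- in the * position by subkinding, and the whole type has kind o.
intArrowIntArrowInt :: Maybe Kind
intArrowIntArrowInt =
  applyKind arrowKind Pop >>= \k ->
  applyKind k Pop
```

By contrast, applyKind linArrowKind Star fails, since * is not a subkind of o.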
I'm very excited about this system. I'm currently researching how I can implement a typechecker for it, but that seems well within the realm of possibility: it doesn't produce unnecessary complications, and it naturally meets all our requirements. Interestingly, I had this idea independently, and then discovered this paper on System F-pop.
Every programming language needs a module system, and Hawk is no exception. A module system is necessary to control symbol access between source files.
A project's source code is organized into files. Each file is considered a module. A module's path is the source file's path relative to the root source directory. Each module contains a set of items. An item is a declaration of a function, type, dependency, etc. A dependency is a path to a module, item, or set of items outside of its module.
Module paths are essentially period-separated file paths. Given some file "A/Foo.hk", the module path is written as "A.Foo". Suppose that module contains items named x, y, and z. Their paths would be written as "A.Foo.x", "A.Foo.y", and "A.Foo.z", or more simply as "A.Foo(x, y, z)". Parentheses are used to essentially open up a module and specify multiple contents at once. Items can also be excluded from the path, using a backslash after the left parenthesis. The path "A.Foo(\ x)" specifies all the items in "A.Foo" except for "x".
A dependency is declared on a newline with a left-to-right arrow, "->", followed by a module path. So "-> A.Foo.x" would declare the item "x" as a dependency of whatever module it is placed in. If item x exists, then modules containing this dependency will have access to item x. Dependencies can only be declared at the item level, and cannot be declared inside of functions or other items.
A propagated dependency uses a different left-to-right arrow, "=>", and behaves the same as a regular dependency, except that the dependency is shared with other modules that depend on that module. So if module A depends on module B, and B contains a propagated dependency, then A will inherit that dependency as well. Propagated items keep their path; they are not treated as if they are a part of the module propagating them.
Qualification is the act of prepending a path to an item name inside of a given item. Names require minimum qualification to identify their location. So, given "A.Foo.x" and "A.Bar.x", it is required that x be qualified as either "Foo.x" or "Bar.x". However, the full paths "A.Foo.x" and "A.Bar.x" can also be used. If there were no conflict, an item x could simply be referred to unqualified.
Qualification can be forced by a dependency by prepending an exclamation mark to the dependency's path. In the previous case, the dependency "-> !A.Foo.x" would require that the item x from module "A.Foo" always be qualified. This also frees Bar from the conflict, allowing the unqualified item x to mean "A.Bar.x".
An alias can be introduced by appending an at symbol to the dependency path. For instance, "-> A.Foo.x @ Stuff" would allow x to be qualified as "Stuff.x". That qualification can be forced as well. An aliased dependency can only be qualified with its alias. A propagated dependency with an alias will propagate with the given alias.
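A sketch of how these dependency declarations might be parsed (the Dep record and parseDep are hypothetical, not from the hkc sources, and I'm assuming whitespace around the path and the "@" alias marker as written in the examples above):

```haskell
-- Hypothetical representation of a dependency declaration.
data Dep = Dep
  { depPath       :: String        -- e.g. "A.Foo.x"
  , depPropagated :: Bool          -- declared with "=>" instead of "->"
  , depQualified  :: Bool          -- "!" prefix forces qualification
  , depAlias      :: Maybe String  -- "@ Alias" suffix
  } deriving (Eq, Show)

-- Parse one dependency line, e.g. "-> !A.Foo.x @ Stuff".
parseDep :: String -> Maybe Dep
parseDep line = do
  (propagated, rest) <- case words line of
    ("->" : ws) -> Just (False, ws)
    ("=>" : ws) -> Just (True, ws)
    _           -> Nothing
  (path, alias) <- case rest of
    [p]         -> Just (p, Nothing)
    [p, "@", a] -> Just (p, Just a)
    _           -> Nothing
  case path of
    ('!' : p@(_:_)) -> Just (Dep p propagated True alias)
    p@(_:_)         -> Just (Dep p propagated False alias)
    _               -> Nothing
```

For example, parseDep "=> !A.Foo.x @ Stuff" yields a propagated, forced-qualification dependency on "A.Foo.x" with the alias "Stuff".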
One of the largest innovations in programming languages has been polymorphism. Generalizing data types and functions to work with multiple types has resulted in smaller, easier-to-manage code bases. Given the enormous size of modern codebases, any modern language that lacks polymorphism is hardly worth considering. For Hawk, we want polymorphic code without incurring cost at runtime.
Polymorphism is typically achieved in two ways. The first, and most common, is through boxing. While flexible, this approach does not play well with a manually memory managed language like Hawk. Since Hawk plans to make heavy use of polymorphic code, managing boxes would either require explicit boxing everywhere or implicit boxing, which seems to require a garbage collector. Furthermore, boxing incurs an obvious runtime cost through indirection, and may play poorly with the CPU cache.
The other implementation to consider is monomorphization, where concrete instances of parametric code are generated at compile time, as if polymorphism were simply syntactic sugar. This is how C++ templates work. Runtime performance of monomorphized code is optimal, so this approach seems to be the way to go for Hawk. However, it is not without a large fundamental drawback.
For more information, see Implementing and Understanding Type Classes.
With monomorphization, hkc will not be able to produce shared libraries. Since types would be erased at runtime, it would not be possible for a binary to grab polymorphic instances from a shared library. Instead, a Hawk library would be a database containing a type-checked AST, and monomorphization would only be available when compiling to a binary. The impact of losing binary modularity on development is significant; there are strong arguments for shared libraries.
There is a way to work around this. An ABI requires that we provide a set of monomorphic functions. So, by tagging functions, say with an expose keyword, it is possible to specify the ABI of a shared library. The only requirement for expose is that the marked function (or possibly even class instance) is monomorphic, which the compiler can determine. This will allow for the construction of shared libraries that other programming languages can interface with, while also allowing Hawk programs to reap the full benefits of whole program optimization.
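The monomorphism check that expose would need is straightforward to sketch (the Type stand-in below is my own, not the hkc AST):

```haskell
-- A stand-in type AST; a function may be exposed only when its
-- signature contains no type variables.
data Type
  = TCon String    -- concrete constructor, e.g. Int
  | TVar String    -- type variable, e.g. a
  | TArr Type Type -- function arrow
  | TApp Type Type -- type application
  deriving (Eq, Show)

-- A type is monomorphic when no type variable occurs anywhere in it.
isMonomorphic :: Type -> Bool
isMonomorphic (TCon _)   = True
isMonomorphic (TVar _)   = False
isMonomorphic (TArr a b) = isMonomorphic a && isMonomorphic b
isMonomorphic (TApp f x) = isMonomorphic f && isMonomorphic x
```

Under this check, a signature like Int -> Int could be exposed, while a polymorphic one like a -> a would be rejected.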
Recently I've been experimenting with a new core calculus. Linear subtyping, while interesting, leaves much to be desired in terms of flexibility. I suspect most programmers feel more comfortable with managed, immutable objects than with consumable objects that are passed around and duplicated. With that in mind, I've been investigating different approaches to low-level functional programming.
From the beginning, I wanted to capture a model closest to the machine's. I want whatever gives me high-performance, functional programming constructs. Looking at machine architecture, we typically operate on regions of memory using instructions. From memory, we can create lightweight handles, called pointers. We can form abstractions over our regions of memory, and construct elaborate data schemes for expressing the intended structure of a region of memory.
We can consider data to be an abstraction of physical memory. From data, we construct actuals in memory, which can be thought of as objects. Variable names simply refer to these actuals. Objects can be constructed on the stack or the heap. The contents of an object are mutable. Objects on the heap must be erased explicitly, but objects allocated on the stack will be erased automatically at no cost.
From this framework, we can begin to construct our lambda calculus. This lambda calculus is based on something called assignment calculus, though it does a few things differently. Here is the basic calculus:
exp =
  -- Typical Lambda Calculus
  x, a, b   -- variables
  a b       -- application
  \x -> a   -- abstraction
  -- Constructors
  C         -- stack constructor
  !C        -- heap constructor
  -- Imperative operators
  a ; b     -- sequence
  x <- a    -- bind
  x = a     -- set
  free xs   -- free
This definition omits control flow and annotations (i.e. type and location annotations). However, this definition shows the terms that our calculus will introduce. Important to this imperative calculus is the sequence term. Sequencing two expressions allows the computer to evaluate one impure expression before it evaluates the other.
We differentiate between bind and set. Bind is used to introduce variables into scope, and set is used to modify existing variables. We can also free our variables. For instance, a simple program may look like
cat str =
    handle <- getFilehandle str
  ; contents <- readFrom handle
  ; print contents
  ; free contents handle
This will allow us to grab a file, get the contents, and finally free the handle and contents when we are done. We can use do notation to eliminate semi-colons.
cat str = do
  handle <- getFilehandle str
  contents <- readFrom handle
  print contents
  free contents handle
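As a sketch, the grammar and the cat example above can be transcribed into a Haskell AST like so (constructor names are my own, not from the hkc sources):

```haskell
-- Direct transcription of the calculus grammar; control flow and
-- type/location annotations are omitted, as in the text.
data Exp
  = Var String      -- x, a, b
  | App Exp Exp     -- a b
  | Lam String Exp  -- \x -> a
  | Con String      -- C, stack constructor
  | HeapCon String  -- !C, heap constructor
  | Seq Exp Exp     -- a ; b
  | Bind String Exp -- x <- a, introduce a variable
  | Set String Exp  -- x = a, mutate an existing variable
  | Free [String]   -- free xs
  deriving (Eq, Show)

-- The `cat` example, written as a term of this calculus.
catExp :: Exp
catExp =
  Lam "str" $
    Seq (Bind "handle" (App (Var "getFilehandle") (Var "str"))) $
    Seq (Bind "contents" (App (Var "readFrom") (Var "handle"))) $
    Seq (App (Var "print") (Var "contents")) $
    Free ["contents", "handle"]
```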
Hopefully it's obvious that, while unsafe and prone to grow in complexity due to impurity, this language could be very useful. The expressions are structurally similar to most assembly languages, though removed by a few layers of abstraction.
The evaluation model we propose is referred to as call-by-sharing, or call-by-object. The idea is that every value on the stack and on the heap is boxed. While a Box type for making pointers is provided, boxed values are unavoidable. All names are references to some data, so data is effectively boxed by default.
Needless to say, this approach plays terribly with the cache. Hopefully, optimizations may eventually be applied to unbox objects when possible. However, in giving up perfect performance, we retain manual memory management. In addition, all Hawk source code becomes very easy to share with C code. Functions with heavily abstract types will be call-by-object, so their arguments are always pointer-sized. This is a very helpful property.
All in all, we are strictly evaluating this calculus with call-by-sharing, with all names acting as references to objects, and references being passed around everywhere. Objects are mutable, and we write our programs in a functional, expression style. Objects can be allocated on the stack and automatically free'd at the end of their scope, or they can be constructed on the heap and free'd later by some other expression.
Because it is a lambda calculus, this language will have a type system that can effortlessly handle rank-n types, in addition to kinds. Type classes like functors, applicatives, and monads should be possible. Row types will also be included, which, combined with impurity, will allow this language to adopt a strange, efficient style.
All in all, this language will be very high level at very little cost to the machine. In the future, objects will be unboxed automatically through analysis. At that point, our language should be as fast as C, with much more expressiveness. In addition, a very fast native array type is planned. This will make Hawk VERY useful for high-performance computing.
Before, we had planned a language based on the linear calculus, but after much thought I decided that linearity really complicates the type system. Regions, which were the planned memory management scheme, aren't very efficient. Linearity is very restrictive, though a scheme was developed that might guarantee some mutation, at an unknown cost to memory. Also, what should be stack and what should be heap isn't exactly clear.
So the linear calculus is kinda wild. I believe that with imperative, impure, strict expressions, Hawk is a bit more tame. Call-by-object adds stability to the language that I think a lot of programmers are familiar with. This is important, because this calculus will also compile to very efficient assembly language. In theory, hawk should be efficient, easy to develop with, easy to extend to other platforms, and easy to integrate with existing code bases.
Hawk is intended for systems programming. It has had many different formulations in the past; the git history is rich with syntactic experimentation. Originally, Hawk was intended as a type-safe, polymorphic improvement over C, with type classes and manual memory management. One of the loftier goals was higher-kinded types, since I'm dissatisfied with the lack thereof in Rust.
A lot of thought has been put into this, and I've decided this is a misguided idea. Manual memory management is bad™ and mutability is bad™. Referential transparency is a good thing, and a language based on lambda calculus, unmangled by statements, is open to some amazing design decisions. It's really hard to put into words just how much better a purely functional language is over an inherently stateful, imperative one.
I guess if I had to sum it up, it seems clear to me that it's objectively better to start with a pure functional language and build it up to have a controlled, imperative style than to take an uncontrolled imperative language and attempt to tack on or force purity. One has referential transparency; the other never will. That's huge.
Insights like this, among many, have led me to decide that Hawk needs to be purely functional. A purely functional systems language needs some sort of automatic memory model, as manual free and new statements are not possible and would pollute programs with IO. There has been a lot of discussion on this, and the best and brightest of us have concluded that the way to go is a memory model based on a linear substructural type system.
Linear typing means that we impose the rule "all variables must be used exactly once." Functions in a linear type system essentially consume their arguments. Duplication and disuse are not permitted in a purely linear type system. An equality operation is also quite problematic.
dup :: a -> (a, a)
dup x = (x, x) -- Illegal!
forget :: (a, b) -> a
forget (a, b) = a -- Also illegal!
choose :: a -> b -> Either a b
choose a b =
  if a == b
    then Left a  -- Error, a was already used
    else Right b -- Error, b was already used
Well, that's not very useful! But linear typing does provide a model that is remarkably simple and allows for controlled, simplified memory management. Used vars can safely be deallocated or reused. Philip Wadler wrote a great paper on this back in 1990. You can effectively optimize a linear type system to infer mutation.
Linear typing is the way to go, but it needs to be wrangled a bit to work in a real programming language. To that end, I decided to tack on duplication and free expressions to Hawk's expression tree...
data Exp b
= EVar b
| EApp (Exp b) (Exp b)
| ELam b (Exp b)
| ELet (b, Exp b) (Exp b)
| EDup (Exp b)
| EFree b (Exp b)
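A minimal usage count for the linearity checker might look like the following (my own sketch over the tree above, repeated so it stands alone; I'm assuming free itself counts as the one permitted use):

```haskell
-- The expression tree from above, repeated so this sketch is self-contained.
data Exp b
  = EVar b
  | EApp (Exp b) (Exp b)
  | ELam b (Exp b)
  | ELet (b, Exp b) (Exp b)
  | EDup (Exp b)
  | EFree b (Exp b)
  deriving (Eq, Show)

-- Count the free occurrences of a variable, treating `free x` as a use.
uses :: Eq b => b -> Exp b -> Int
uses x (EVar y)        = if x == y then 1 else 0
uses x (EApp f a)      = uses x f + uses x a
uses x (ELam y b)      = if x == y then 0 else uses x b
uses x (ELet (y, e) b) = uses x e + (if x == y then 0 else uses x b)
uses x (EDup e)        = uses x e
uses x (EFree y b)     = (if x == y then 1 else 0) + uses x b

-- A lambda is linear when its binder is used exactly once in the body.
linearLam :: Eq b => Exp b -> Bool
linearLam (ELam x body) = uses x body == 1
linearLam _             = False
```

Under this count, \x -> x x fails linearity, which is exactly the dup problem discussed above.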
These expressions allow memory allocation and duplication to be managed in an explicit, functional way. However, adding duplication alone is not enough to solve the issue at hand. There is still the matter of equality. Linear Lisp was constructed in a similar way, but it included an equality operation in the expression tree.
Equality is basically something that will never free a value, so the linearity checker can ignore its uses of a variable. But if we include equality, we might as well include addition and all the other arithmetic operations as well. But then, what about type classes intended to wrap and dispatch to these exceptional functions? Any values passed to them HAVE to be freed after the result. It's impractical!
But not really. Think of it this way: linear typed functions are hungry little guys that will consume everything passed into them. They may either reproduce those values, allowing mutation to be inferred, or destroy them with free. But eq and the rest do not consume their values; they are non-consumers. We can mark them as such.
The rules for a non-consumer are simple. They give up the ability to free a value, or pass that value to a consumer, but in exchange, using them doesn't count as a usage to the linearity checker. They can also pass those values to dup, since dup itself can also be a non-consumer :) Definitions can be marked as non-consumers with noconsume foo at the top level. This makes it obvious to both humans and the compiler which functions are non-consumers.
Armed with the dup, free .. so .., and noconsume constructs, we have the basis for a simple, powerful, purely functional, memory-safe systems programming language. I believe this is a huge improvement over the complicated borrow-checker/lifetime system that arose in Rust, so I'm not just reinventing the wheel.
But, we can take this system EVEN FURTHER! Armed with these constructs, it's possible to write code like the problematic code from before. We can infer and desugar the expressions into linear form when they violate the rules!
pair a = (a, dup a)

forget (a, b) = free b so a

choose a b =
  if a == b
    then
      free b
      so a
    else
      free a
      so b
And of course, the compiler can always warn when this occurs, so the programmer can always be aware of what is going on.
So that about sums up Hawk's linear typed memory model. Going forward, this is the model that will be developed. I believe this system is solid, but if anyone would care to point out some major issues (does this play well with polymorphism?), I'm all ears.
The module system has been a pain point for a while. I feel like there is no satisfactory solution, and managing module imports is always a nasty chore. On LtU, there has been discussion of discarding module systems entirely.
I've mostly considered this idea nice but impractical. Global namespacing would lead to the situation in elisp, where names are accompanied by excess prefixes to denote where they belong. However, if definitions were differentiated with hashes, it could be possible to avoid this problem. But then that would once again require qualification with hashes in cases of conflict, which is an uncomfortable task for any human.
After finding Unison, it seems quite apparent that pchiusano has effectively identified and solved the problem. The reason we can't use a global namespace with definitions differentiated by hash is that it is difficult to specify that hash. However, with an editor that requires correctness, and a decent search/insert tool for constructing definitions, that is suddenly no longer a problem.
The search tool can quickly find all the possible functions with the shared name, and can be further configured to only show results with specific metadata, such as package name. This also calls for a radically different style of programming, but it is a serious improvement on the state of things. Paul has written a mountain of blog posts on this subject, and I encourage anyone interested in this topic to read them.
From now on, Hawk will be headed in the same direction as Unison. The current module system will be discarded, and the compiler will just parse items out of a file and store them in a hash table by name. This will be the state of things while I develop the memory model and code generation facilities. As the language advances, the global namespace will advance beyond a map of text to definitions, and eventually move towards a platform similar to Unison.
For the current language, the pipeline is simple.
parse files -> namecheck -> rename -> typecheck -> desugar -> declassifier -> monomorphization -> core generation -> codegen
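As a toy illustration of the stage ordering, the pipeline can be seen as a left-to-right composition of stage functions (strings stand in for the real ASTs here; none of these names are from the hkc sources):

```haskell
-- Each stage maps one intermediate representation to the next; here a
-- String tags which stages have run, purely to illustrate the ordering.
stage :: String -> String -> String
stage name input = input ++ " -> " ++ name

-- The full pipeline, applied left to right over the stage names.
pipeline :: String -> String
pipeline src = foldl (flip stage) src
  [ "namecheck", "rename", "typecheck", "desugar", "declassifier"
  , "monomorphization", "core generation", "codegen" ]
```

Applying pipeline to "parse files" reproduces the chain written above.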
Once this pipeline is implemented, we can close this issue, and the compiler will be mostly finished.
The intermediate ASTs required for this pipeline are:
Most of the transformations prepare Typed for translation to Core. The process eliminates "with" statements, classes, overloading, and polymorphism, and also simplifies cases and pattern matching. Applying these transformations to Typed will result in a Core AST, which will map very cleanly onto the final target, LLVM IR.
Make a document that describes the syntax and features of this language.