hikettei / cl-waffe2

[Experimental] Graph and Tensor Abstraction for Deep Learning all in Common Lisp

Home Page: https://hikettei.github.io/cl-waffe2/

License: MIT License

Common Lisp 90.37% Makefile 0.56% C 9.07%
ai autodiff common-lisp deep-learning-framework math matrix-operations optimizing

cl-waffe2's Introduction

hikettei🌙

I'm a high school student in Japan who dotes on Common Lisp. I also work as an AI engineer on deep learning compilers and quantization. I'm currently building deep learning ecosystems in Common Lisp, including cl-waffe2 and AbstractTensor.lisp.

(make-instance 'hikettei
    :name (apply #'whichever-you-like `(:rulia :hikettei))
    :languages '((:japanese . "mother tongue")
                 (:english  . "Typo Expert"))
    :interests '((:common-lisp . "I have 2~3 years of experience")
                 (:nlp         . NIL)
                 (:transformer . NIL)
                 (:deep-learning . NIL)
                 (:scientific-computing . NIL))
    :e-mail "<[email protected]> Feel free to contact me if you have any :)")

Links

Blog Posts

Talks


cl-waffe2's People

Contributors

atzmueller, elderica, hikettei


cl-waffe2's Issues

[Refactoring] I'm considering rewriting the entire code of VM

I'm considering rewriting the code directly under source/vm, for four reasons: the code is a mess, there are too many bugs, compilation is too slow, and performance is poor.

Addition: the cl-waffe2/vm package

Reimplement the VM without making breaking changes to ./source/vm/generic-tensor and ./source/vm/nodes.

Known issues so far:

  1. Build a new VM with a properly thought-out specification directly under cl-waffe2/vm
    • It turned out that re-compiling the entire Lisp code is a complete waste of time
    • Loading and composing the inlined code becomes the VM's job (no compilation is run for this part)
    • Computation nodes like (!sin x (!copy x)) are not topologically sorted (stupid); see the sketch after this list
    • When implementing RNNs, it would help if the IR specification were more solid: instructions such as If and Map
    • Rewrite call-with-view so it can be parallelized with lparallel
    • Is reverse mode actually in-place??
    • (build out) should first flatten everything into a one-dimensional data structure
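
A minimal sketch of the kind of topological sort this needs. The node representation is assumed (node-id and node-dependencies are hypothetical accessors, not the actual cl-waffe2 IR):

;; Minimal sketch, not the actual cl-waffe2 IR: depth-first topological sort
;; over backward dependencies, deduplicated by node id so that shared
;; subgraphs such as (!sin x (!copy x)) are scheduled exactly once.
(defun topological-sort (out)
  (let ((seen  (make-hash-table :test #'eql))
        (order nil))
    (labels ((visit (node)
               (unless (gethash (node-id node) seen)        ; NODE-ID: assumed accessor
                 (setf (gethash (node-id node) seen) t)
                 (mapc #'visit (node-dependencies node))    ; NODE-DEPENDENCIES: assumed accessor
                 (push node order))))
      (visit out)
      (nreverse order))))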

I will work on the vm-refactoring branch while testing compatibility with the VM in ./source/vm/generic-tensor.

  2. defnode-related functionality
    • Too unstable
    • shape.lisp is terrible code and I want to rewrite it

[Fix] a ton of undefined-type

When loading the cl-waffe2 test system at the top level, a ton of undefined-type warnings occur.

The classes are located in cl-waffe2/vm.facets-tmp, which means the problem is caused by nodes defined by defnode. More precisely, the nodes that produce the warning are defined with :device t: they refer to NODENAME-CPUTENSOR, but there is no such implementation because all backends use the NODENAME-T implementations.

I haven't yet found the place where NODENAME-CPUTENSOR is actually used.

[WIP] Petalisp as a high-level IR?

(This article is WIP)

An overview of Petalisp

https://github.com/marcoheisig/Petalisp/tree/master

(As far as I know,) Petalisp is a DSL implemented in Common Lisp for generating parallelized array-processing code, providing:

  • Petalisp works at a fundamental level:
    • abstract array processing, polyhedral compilation, and the like.
    • The IRs are sophisticated; lazy-reshape, transform, ranges, and the many optimization techniques specialized for them can provide by far the fastest and most systematic JIT compiler.
  • It could (perhaps with more effort) be applied to multiple backends, like tinygrad (i.e. it would be relatively easy to implement new backends such as Metal, x86, gcc, neon, etc.)

RISC or CISC?

Deep learning models are everywhere, but what about the technology behind them? Many deep learning frameworks are in development today, and there are DL compilers with a focus on efficient inference (or training). TVM could be a good option, but when you want to make a model specific to an arbitrary environment, there are always compatibility issues (e.g. pytorch/pytorch#49890, though this is a PyTorch case).

Concretely speaking, it is possible to implement gemm for many devices (e.g. CPU, GPU, NEON, AVX, Metal, etc.) and many data types (e.g. uint8, int8, int16, ..., float16, bfloat16, float32, ...). But can it be made easier?

With Petalisp, gemm written once at a higher layer (like a template) can run on various backends instead of being reimplemented for each.

;; Petalisp
(defun matrix-multiplication (A B)
  (lazy-reduce #'+
   (lazy #'*
    (lazy-reshape A (transform m n to n m 1))
    (lazy-reshape B (transform n k to n 1 k)))))
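
For reference, such a lazy result is realized with Petalisp's compute; a small usage sketch, assuming the petalisp system is loaded and these symbols are exported from the petalisp package:

;; Usage sketch: realize the lazy array (assumes (ql:quickload :petalisp)).
(let ((a (make-array '(8 8) :element-type 'single-float :initial-element 1.0))
      (b (make-array '(8 8) :element-type 'single-float :initial-element 2.0)))
  (petalisp:compute (matrix-multiplication a b)))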

The same computation written in tinygrad:

# Tinygrad
c = (a.reshape(N, 1, N) * b.permute(1,0).reshape(1, N, N)).sum(axis=2)

Users no longer need to worry about parallelization; they can just rely on the compiler.

If TVM were CISC, tinygrad would be a RISC.

Why Petalisp is a good choice for replacing the cl-waffe2 compiler

A survey of performance improvements

In terms of training time and memory usage, cl-waffe2 still faces a lot of challenges. In fact, even when training a simple MLP, cl-waffe2 is about 1.5 times slower than the same operations in PyTorch. However, this is because cl-waffe2 is a JIT-compilation-based framework and I only started this project a few months ago; it still has a large number of potential optimizations. The next-term goal is to optimize training time, so here's a list of things to be optimized:

cl-waffe2 IR

Graph-level optimization is still not enough. In particular, the number of MoveTensorNode instructions should be reduced.

FuseOps

FuseOps support is still poor. In the future, I want to create search-based instruction fusion: for example, users define the sequence of IR to be replaced with a (defpath ...) macro, and the compiler reads it.
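
A purely hypothetical sketch of such a rule (the defpath syntax is not designed yet; the pattern notation and Log1pNode below are made up for illustration): a sequence of IR instructions is matched and rewritten into one fused instruction backed by a specialized kernel.

;; Hypothetical sketch only; defpath's real syntax is undecided.
;; Idea: match a sequence of IR instructions and rewrite it into a single
;; fused instruction for which the backend provides a dedicated kernel.
(defpath fuse-log1p (x)
  (:match   (LogNode (AddNode x 1.0)))  ; (log (1+ x)) as it appears in the IR
  (:rewrite (Log1pNode x)))             ; numerically stabler fused node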

The full use of SIMD Ops

  • Use SLEEF

The full use of lparallel

Maximum speed-up can be achieved by putting all data into SIMD registers and then parallelizing with lparallel.
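
A minimal sketch of the lparallel side, independent of cl-waffe2 (it only assumes the lparallel system is loaded): an elementwise kernel mapped over a flat single-float vector with worker threads.

;; Minimal sketch using plain lparallel, not cl-waffe2 internals:
;; apply SIN elementwise over a flat single-float vector in parallel.
(ql:quickload :lparallel)
(setf lparallel:*kernel* (lparallel:make-kernel 4)) ; 4 worker threads

(defun parallel-sin! (x)
  (declare (type (simple-array single-float (*)) x))
  (lparallel:pmap-into x #'sin x))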

[API Enhancement] Project simplification for those who are new to Common Lisp

  • TODO
    • Step-by-step tutorial for those who aren't familiar with Deep Learning.
    • Tutorial written in Jupyter Notebook
    • Translate tutorial_jp.lisp into English
    • Visualized explanation of cl-waffe2 (should I make a slide?)
    • I think deftrainer usage should be documented in much more detail, because it is unique and complicated.

Various discussions arising from examples and tutorials

When running the examples, the following error always occurs:

Couldn't find any implementation of MATMULNODE for (LISPTENSOR).
[Condition of type CL-WAFFE2/VM.NODES::NODE-NOT-FOUND]

How can this be addressed?

By the way, your waffe2 system looks great, please keep it up!

examples/mnist/mlp.lisp - reset-compiled-function-cache! question

Using the (current version) of mlp.lisp, for the first call of, e.g.,
(train-and-valid-mlp :epoch-num 11 :benchmark-p nil)
the training loss in the first epoch is around 0.26 usually.

For further runs (when evaluating (train-and-valid-mlp :epoch-num 11 :benchmark-p nil)), the loss is larger (around 0.76 in the first epoch). I suspect that this is caused by some caching in the compiler and different initializations of the compiled structures, since if I evaluate
(cl-waffe2/vm.generic-tensor::reset-compiled-function-cache!)
before evaluating
(train-and-valid-mlp :epoch-num 11 :benchmark-p nil),
then the loss is in the same range as for the very first run.

Is this the intended behavior, or should the reset be applied somewhere when the model is built/compiled?

[FixME] Dynamically Shaped Conv2D isn't working due to !reshape

;; Code
(call (Conv2D 3 6 `(5 5)) (make-input `(N C H W) :X))

won't work, since the implementation includes:

...
(asnode #'!reshape (* N C-out H-out) t)

Generally speaking, when a mathematical (arithmetic) function needs to be called in the shape calculation, it also has to be evaluated lazily, because the value of N isn't determined at that time.

(TODO: More details)
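
A conceptual sketch of what that lazy evaluation could look like (this is not the cl-waffe2 API; make-lazy-dim and force-dim are made-up helpers): the product (* N C-out H-out) is kept as a thunk and only forced once N has a concrete value.

;; Conceptual sketch only, not the cl-waffe2 API: represent an undetermined
;; dimension as a thunk and force it once the concrete value of N is known.
(defun make-lazy-dim (thunk) thunk)

(defun force-dim (dim)
  (if (functionp dim) (funcall dim) dim))

;; e.g. instead of evaluating (* N C-out H-out) eagerly at definition time,
;; pass (make-lazy-dim (lambda () (* (force-dim n) c-out h-out))) and force it
;; only when the runtime value of N is bound.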

[Fix] Some specifications on Subscript DSL should be changed in the future release

The current problems (as far as I know) in the Subscript DSL are the following:

Difficulty in expressing complicated transmission states

	   :where (Input[N C_in H_in W_in] -> Output[N C_out H_out W_out]
			   where
			   C_in  = in-channels
			   C_out = out-channels
			   ;; H_out = floor(((H_in + 2 * padding[0] - dilation[0] * (kernel_size[0] - 1) - 1) / stride[0]) + 1)
			   H_out = (if (numberp H_in) ;; If H_in is a symbol, return -1 (=undetermined, later determined.)
				       (floor (+ 1 (/ (+ H_in (* 2 (car padding)) (* (- (car dilation)) (- (car kernel-size) 1)) -1)
						      (car stride))))
				       -1)
			   ;; W_out = floor(((W_in + 2 * padding[1] - dilation[1] * (kernel_size[1] - 1) - 1) / stride[1]) + 1)
			   W_out = (if (numberp W_in)
				       (floor (+ 1 (/ (+ W_in (* 2 (second padding)) (* (- (second dilation)) (- (second kernel-size) 1)) -1)
						      (second stride))))
				       -1))

(at https://github.com/hikettei/cl-waffe2/blob/master/source/nn/conv.lisp#L76)

Since both Convolution and Pooling have transmission states too complicated to express lazily, the :where form returns -1 (can't predict) when the result won't be an integer. This ugly behaviour should be fixed in a future release, but I don't have a good solution yet.

The implementation is ugly

Should be refactored: https://github.com/hikettei/cl-waffe2/blob/master/source/vm/nodes/shape.lisp

Polish generated Shape-Error

(At https://github.com/hikettei/cl-waffe2/blob/master/source/vm/nodes/shape-error.lisp)

Shape errors should be more pinpointed. In fact, it is possible to make it easy to know what should be fixed and at which node the error occurred.

My TODO List

A TODO list of things I'm currently working on and open issues.

  • Refactoring around Docgen (when I feel like it)
  • metal backend
  • cl-waffe2/nn -> generic
  • Stable Diffusion Inference

Environments / Backends

  • Testing on CCL/SBCL/LispWorks/Allegro (Modern Mode Support?)
  • SIMD Extension [Add] SSE/AVX512/Neon Support.
  • [Add] CUDATensor Backend
  • [Add] MetalBackend
  • Optimization level settings and runtime error handling in the VM
  • build receiving multiple inputs (LazyCons)
  • Multiple dispatch for defmodel (by backend name)

cl-waffe2/base-impl

  • Implement Unfold in C++ and vectorize it, or interop with oneDNN
  • Row-major gemm
  • Sparse Gemm, Sparse Matrix Support
  • Nodes with backend=t such as LazyCons/Permute/Reshape/View/Setq: replace them with Load (A* = B*;)

cl-waffe2/vm.nodes

  • Build and test the network-construction APIs (defnode/define-impl/define-impl-op/defmodel/defmodel-as) (implemented fairly well)
  • As a plain NumPy-like matrix-operation library: a set of compiled cl-waffe2 programs could be provided as a library in a project separated from cl-waffe2 (e.g. topi in TVM)
  • Regarding the RNN implementation
    • There are two options: implement control flow in the VM, or connect networks partially compiled with defmodel-as in a define-by-run style. RNNs will probably be implemented with the latter.
  • Optimization of defmodel-as:
    - AOT compiler: when compiling into an AbstractNode, make it reusable by only changing the memory allocation afterwards!
  • The biggest challenge in implementing a define-by-run mode is compile time.
    • (Method dispatch is heavy.) Optimize in two separate stages: 1. building the AbstractNode network, 2. compiling the network.
  • The Subscript DSL bug in Conv2D
  • forward -> couldn't this be inlined with a compiler macro? (see the sketch after this list)
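
A sketch of the compiler-macro idea from the last item (illustrative only; sin-forward-inlined is a hypothetical function, and this is not existing cl-waffe2 code): when the node's class is statically apparent, the generic forward call could be rewritten into a direct call at compile time.

;; Illustrative sketch only; not existing cl-waffe2 code.
;; SIN-FORWARD-INLINED is a hypothetical specialized entry point.
(define-compiler-macro forward (&whole form node &rest inputs)
  (if (and (consp node) (eq (first node) 'SinNode)) ; node type statically visible
      `(sin-forward-inlined ,@inputs)
      form))                                        ; otherwise keep generic dispatch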

cl-waffe2/vm.generic-tensor

  • define-impl with :cache-when-compiled=nil causes (compile nil body) to run at build time, so emit a warning

    • Instead, the instruction should be created as a lambda function with define-impl-op
    • In that case, a function-style version of the call-with-view API is needed, something like apply-ranked-op
  • Shape checking for the forward method

  • Memory Pool

    • At compile time, assign each Tensor one of three aspects; it should be possible to solve an optimization problem that minimizes the number of allocations during training
    • Fix the instability of dynamic shaping
    • Fix: when the result of forward is dynamically shaped, the strides are not updated, so an error occurs unless make-input is used
    • Implement: make-input does not guarantee that elements are zero-filled, because it is cached (trace the shapes and allocate in advance?)
    • Adjust the batch with the view function, without copying and without using tensor-displace-to
  • Add a retain-grad option by setting tensor-grad-n=1

  • During backpropagation, don't apply the "inline top-level Moves + in-place mutation" optimization, because it becomes unclear where the optimizer's side effects end up.

  • call-with-view: evaluate the performance of lparallel parallelization on large matrices and enable it

  • Data types (Dtype)

    • Casting
    • Support user-defined dtypes
    • Try whether a String dtype can be implemented

cl-waffe2/vm

  • Add device-specific optimizations via search-based FusionOps with the defpath macro
    • For example, !sum is based on broadcasting and AddNode, but writing a dedicated summation kernel would give better speed and accuracy.
  • A FusionOp for the derivative of (log (1+ x)) is a MUST for numerical stability; implement it as a FusionOp.
  • Optimization: there are many places where patterns like (EXP X) -> A, B can be detected and optimized. This seems feasible if the sort is done on AbstractNode IDs rather than on Tensors.
  • Optimization: sin(x, out): doing copy(x) for out is wasteful; make it a computation node that allocates.

cl-waffe2

  • Deprecate deftrainer. Together with how defoptimizer is used, make it a more elegant API.
  • I want to write this when my head is actually working.

Miscellaneous

  • AOT ShapeError
  • Check for forgotten out-scalar-p
  • Take benchmarks under fair conditions against NumPy+Numba/Theano/Petalisp and put them in Readme.md
  • Stabilize GPT-2 inference
  • Once about 99% of this TODO list is done, register the project on Quicklisp.
  • step-by-step
  • Delete or rewrite the tutorial in the official documentation
  • Write slides
  • Optimizer: Add RAdam
  • Make the hash table storing models created via model-compiler GC-reachable (trivial-garbage), and free their allocations when they are no longer used
  • The out parameter of math functions -> make-tensor is fine instead of Move
  • Replace the ScalarMul used to reset out in sum with Fill
  • Saving model parameters
  • defpath -> Fusion Pattern ~~(Replaced with Symbolic Diff)~~
  • Fix: Lazy Stride with do-compiled-loop
  • Fix: !reshape with lazy tensors
  • mgl-pax
  • Im2Col Node for CPUTensor
  • JITLispTensor
  • IfNode and LoopNode in the same spirit as RepeatN (also fixing defmodel-as into :node)
  • Zero-cost forward, zero-cost tensor creation
  • build differentiable
  • Add: OpenMP. When batch_size is small, cl-waffe2 beats PyTorch by a wide margin, but as batch_size grows, cl-waffe2 becomes slower.

Reading List📕

Lens

https://arxiv.org/pdf/1810.07951.pdf

https://arxiv.org/abs/1809.00738

https://github.com/JuliaDiff/Diffractor.jl

Existing approaches of deep learning compilers

https://arxiv.org/pdf/2002.03794.pdf

https://tvm.apache.org/docs/arch/index.html

https://web.ist.utl.pt/nuno.lopes/pubs/torchy-cc23-extended.pdf

https://towardsdatascience.com/how-pytorch-2-0-accelerates-deep-learning-with-operator-fusion-and-cpu-gpu-code-generation-35132a85bd26

Dynamically working frameworks

https://arxiv.org/pdf/2307.12187.pdf

https://arxiv.org/pdf/1611.06945.pdf

[Refactor] APIs for Network Construction

Overview of Computation Node in cl-waffe2

[AbstractNode] The fundamental unit, which binds together the forward/backward propagations.
 defnode - Declares the general definition of an AbstractNode
      L define-impl     Implements an AbstractNode. Its forward definition is given as a macro (to allow inlining / call-with-view); later, (compile nil body) is called and the result is cached.
      L define-impl-op  Implements an AbstractNode as a lambda function.

define-op = defnode + define-impl-op

[Composite] Bundles several AbstractNodes; defined by the defmodel macro.
  defmodel - Defines a new Composite
     L defmodel-as  Redefines an existing Composite as a function or an AbstractNode, to reduce compiling time or to use cl-waffe2 as a define-by-run library.

Accordingly, these macros will be deleted in a future release: define-static-node, define-composite-function.

[Enhancement] macros to add

deftrainer

deftrainer is a macro to describe:

  1. the criterion and optimizer
  2. the nodes of both the training phase and the predicting phase

define-impl-1d-kernel

A template macro for users to implement a new backend without being familiar with complicated APIs (e.g.: call-with-view)
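
Since this macro doesn't exist yet, the following is only a rough illustration of the intent (the name and signature are made up): the user supplies just a flat 1D loop body, and the template would take care of views, offsets, and parallelization.

;; Made-up illustration of the proposed template macro; not an existing API.
;; The user writes only the flat elementwise loop; iteration over views and
;; offsets, and parallelization, would be generated behind the scenes.
(define-impl-1d-kernel (SinNode :device MyTensor)
    ((x out size)
     (dotimes (i size)
       (setf (aref out i) (sin (aref x i))))))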

[Enhancement] Compiling time remains to be optimized.

cl-waffe2 instantly generates and compiles forward kernels depending on the given tensors' dimensions and views. This approach reduces the cost of computing multidimensional offsets and lets multithreading be scheduled in advance. However, this compiling is never done at the top level, but through the (compile nil ...) function. About 80% of compiling time is this kernel compiling time (e.g. expansions of SinNode).

For example, (!sin (!sin (!sin x))) uses exactly the same code each time, yet it is compiled three times. Therefore, one primary strategy for reducing compiling time is to reuse the compiled kernels.
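
A minimal sketch of that reuse, independent of the actual cl-waffe2 internals: key the compiled kernel on the generated body form, so that identical expansions are compiled only once.

;; Minimal sketch of kernel reuse (not the actual cl-waffe2 cache):
;; identical kernel bodies, e.g. the three SinNode expansions above,
;; hit the cache and (compile nil ...) runs only once.
(defvar *compiled-kernel-cache* (make-hash-table :test #'equal))

(defun compile-or-reuse (body)
  (or (gethash body *compiled-kernel-cache*)
      (setf (gethash body *compiled-kernel-cache*)
            (compile nil `(lambda () ,body)))))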

Wrapping/adapting other frameworks?

Since the graph-processing layer is what cl-waffe2 is really about, I feel that building the entire backend from scratch is reinventing the wheel. I don't know which is best, but the following are the candidates:

  • GGML Wrapper and Interop
  • oneDNN (Most Promising for CPU)
    • with the power of oneDNN, we can provide bfloat16 training and uint8 inference
