certik / fastGPT
Fast GPT-2 inference written in Fortran
License: MIT License
Currently the attention over heads runs in serial:
Line 101 in 01eb84b
We should try to parallelize it and see whether we get a speedup.
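To illustrate why this loop is a good parallelization target, here is a toy Python sketch (made-up sizes, not the fastGPT code): each head's attention is independent of the others, so the heads can be farmed out to a thread pool. In the Fortran code this would correspond to something like an `!$omp parallel do` over the head index.

```python
# Toy multi-head attention: the loop over heads carries no dependencies,
# so serial and parallel execution give identical results.
import math
from concurrent.futures import ThreadPoolExecutor

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def one_head(q, k, v):
    # q, k, v: lists of per-position vectors for a single head
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        w = softmax(scores)
        out.append([sum(wi * vj[t] for wi, vj in zip(w, v)) for t in range(d)])
    return out

def attention_serial(heads):
    return [one_head(q, k, v) for q, k, v in heads]

def attention_parallel(heads):
    # each head runs independently, so we can map them onto worker threads
    with ThreadPoolExecutor() as ex:
        return list(ex.map(lambda h: one_head(*h), heads))

heads = [([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]])
         for _ in range(4)]
assert attention_serial(heads) == attention_parallel(heads)
```

Since the per-head work is identical and independent, the results are bit-for-bit the same in either order; the open question is only whether the heads are large enough for the parallel overhead to pay off.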
We should add a version number to model.dat. This will allow us to keep changing the format without breaking things. If an older version of fastGPT is used with a newer file, it will say "Incompatible model.dat version, use the create_model.py script to generate a compatible version."
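A minimal sketch of the idea in Python (the header layout and version constant here are hypothetical, not the actual model.dat format):

```python
# Hypothetical versioned header: a little-endian integer at the front of
# the file, checked before anything else is read.
import io
import struct

MODEL_DAT_VERSION = 1  # bump whenever the format changes

def write_header(f):
    f.write(struct.pack("<i", MODEL_DAT_VERSION))

def check_header(f):
    (version,) = struct.unpack("<i", f.read(4))
    if version != MODEL_DAT_VERSION:
        raise RuntimeError(
            "Incompatible model.dat version, use the create_model.py "
            "script to generate a compatible version.")

buf = io.BytesIO()
write_header(buf)
buf.seek(0)
check_header(buf)  # a matching version passes silently
```

The Fortran side would do the equivalent with one integer read at the top of the loader and a `stop` with the same message on mismatch.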
I suggest adding the topic gpt-2 in the About section.
And change the instructions / workflow to simply download model.dat. That way we eliminate the need to use Python at all, and things become more robust. One would only use the Python create_model.py script to re-generate model.dat.
We should upload all 4 versions, so probably something like:
model-fastGPT-124M.dat
model-fastGPT-1558M.dat
Not sure if we should put the version number into the filename itself as well or not.
We should only do this after #30 is fixed, to prevent downloading an incompatible model.dat (from an older or newer version of fastGPT).
The decoding is done token by token, and token boundaries do not align with word boundaries, so we can wait until there is a space or punctuation before printing, and then print the buffered text.
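A toy Python sketch of the buffering logic (the decoded fragments and helper names below are made up for illustration):

```python
# Buffer decoded fragments and flush only up to the last space or
# punctuation mark, so a partial word is never printed.
import string

BREAKS = set(string.whitespace + string.punctuation)

def flush_point(buf):
    # index one past the last break character, or 0 if there is none
    for i in range(len(buf) - 1, -1, -1):
        if buf[i] in BREAKS:
            return i + 1
    return 0

buf = ""
printed = []
for piece in ["The", " wor", "ld of", " tomo", "rrow."]:  # decoded fragments
    buf += piece
    cut = flush_point(buf)
    if cut:
        printed.append(buf[:cut])  # safe to show: ends at a break
        buf = buf[cut:]
printed.append(buf)  # flush whatever remains at the end
assert "".join(printed) == "The world of tomorrow."
```

Every flushed chunk ends at a space or punctuation mark, so the user sees whole words streaming out while nothing is held back longer than one word.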
Currently we only support the default Fortran matmul and the macOS Accelerate framework. Add support for OpenBLAS as well, using the same technique (implement a Fortran module for it and allow selecting it in CMake).
Currently only the "greedy" sampling is implemented (the token with the highest probability is selected).
Implement other sampling methods, some options are:
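For illustration, a Python sketch of two standard alternatives, temperature scaling and top-k sampling (these are the generic formulations, not anything fastGPT currently defines):

```python
# Greedy picks argmax(logits). Temperature rescales the distribution
# (lower = closer to greedy), and top-k restricts sampling to the k
# highest-probability tokens.
import math
import random

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def sample(logits, temperature=1.0, top_k=None, rng=random):
    if top_k is not None:
        kth = sorted(logits, reverse=True)[top_k - 1]
        # mask everything below the k-th best (ties may keep a few more)
        logits = [l if l >= kth else float("-inf") for l in logits]
    probs = softmax([l / temperature for l in logits])
    return rng.choices(range(len(probs)), weights=probs)[0]

logits = [2.0, 1.0, 0.1, -3.0]
tok = sample(logits, temperature=0.8, top_k=2)
assert tok in (0, 1)  # top-2 restricts the draw to the two best tokens
```

Top-p (nucleus) sampling is similar, except the cutoff is the smallest set of tokens whose cumulative probability exceeds p rather than a fixed count k.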
Investigate what the best way to parallelize is across nodes using MPI or coarrays.
So far we only benchmarked against PyTorch+OpenBLAS. We should also benchmark against PyTorch+Accelerate.
Here are a few ways to do it:
conda install "libblas=*=*_accelerate"
Here is some information on what a KV cache is: https://kipp.ly/blog/transformer-inference-arithmetic/#kv-cache
Roughly speaking, when new tokens are appended to the input and a new token is generated, a lot of the computation can be reused from the previous iteration. We need to cache the results and reuse them.
Here is a reference implementation in picoGPT: jaymody/picoGPT#7 (and the accompanying blog post https://immortal3.github.io/becoming-the-unbeatable/posts/gpt-kvcache/) that should be straightforward to adapt.
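A minimal Python sketch of the caching idea (the per-token "projection" below is a stand-in for the real K/V computation, not actual GPT-2 math):

```python
# Per layer, keep K and V for tokens already processed; on each
# generation step, only the newly added token needs projecting.
def kv_for(token):
    # stand-in for the expensive per-token K/V projection
    return (token * 2, token * 3)

class Layer:
    def __init__(self):
        self.k_cache, self.v_cache = [], []

    def step(self, new_tokens):
        # only the new tokens are projected; cached entries are reused
        for t in new_tokens:
            k, v = kv_for(t)
            self.k_cache.append(k)
            self.v_cache.append(v)
        return self.k_cache, self.v_cache

layer = Layer()
layer.step([464, 995, 286])   # prompt: K/V computed once, up front
k, v = layer.step([9439])     # one new token: only one projection
assert len(k) == 4 and k[-1] == 9439 * 2
```

This turns the per-step cost from O(sequence length) projections into O(1), at the price of storing the K/V arrays per layer for the whole context.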
Currently the input tokenizer is in Python, taken from the original OpenAI implementation: https://github.com/certik/fastGPT/blob/01eb84b015d89a567245da0445c0abb7d53a8500/encode_input.py. We should implement it in Fortran; that will eliminate the need to call a Python script before running fastGPT.
We have to write tests that exercise each code path in the Python implementation to ensure our Fortran implementation is correct.
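For reference, the core of the GPT-2 BPE encoder is a loop that repeatedly merges the highest-priority adjacent pair; this is the main code path the tests need to cover. A toy Python sketch with a made-up merge table (the real vocabulary and merge ranks come from OpenAI's released files):

```python
# Repeatedly merge the adjacent pair with the lowest (best) rank until
# no mergeable pair remains -- the heart of GPT-2's BPE algorithm.
def get_pairs(word):
    return {(a, b) for a, b in zip(word, word[1:])}

def bpe(word, ranks):
    word = list(word)
    while True:
        pairs = get_pairs(word)
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")),
                   default=None)
        if best is None or best not in ranks:
            break
        a, b = best
        merged, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and word[i] == a and word[i + 1] == b:
                merged.append(a + b)  # fuse the pair into one symbol
                i += 2
            else:
                merged.append(word[i])
                i += 1
        word = merged
    return word

ranks = {("l", "o"): 0, ("lo", "w"): 1}  # hypothetical merge priorities
assert bpe("low", ranks) == ["low"]
```

The other code paths in encode_input.py (the regex pre-tokenizer, the byte-to-unicode mapping, the cache) each need their own test cases, since the Fortran port has to reproduce all of them exactly.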
Using the 1558M model and the following input:
python encode_input.py \
"Alan Turing theorized that computers would one day become very powerful, but even he could not imagine" \
-n 100
I get the following output with tanh() (equal to PyTorch):
Output tokens:
703 484 561 307 973 13 198 198 1 40 836 470 892 314 1053 1683 1775 257 3644 326 714 466 1997 326 257 1692 852 714 466 553 339 531 13 198 198 1537 783 11 5176 284 262 670 286 257 1074 286 4837 379 262 2059 286 3442 11 14727 11 9061 389 852 973 284 466 1243 326 547 1752 1807 5340 13 198 198 464 1074 468 4166 257 3644 326 460 711 262 983 286 1514 11 257 3716 4811 983 326 9018 3867 5207 1088 257 3096 13 198 198 464 3644
Decoded output as text:
how they would be used.
"I don't think I've ever seen a computer that could do anything that a human being could do," he said.
But now, thanks to the work of a team of researchers at the University of California, Berkeley, computers are being used to do things that were once thought impossible.
The team has developed a computer that can play the game of Go, a complex strategy game that involves moving pieces around a board.
The computer
But with fast_tanh() I get the following output:
Output tokens:
703 484 561 307 973 13 198 198 1 40 836 470 892 314 1053 1683 1775 257 3644 326 714 466 1997 326 257 1692 852 714 466 553 339 531 13 198 198 1537 783 11 5176 284 262 670 286 257 1074 286 4837 379 262 2059 286 3442 11 14727 11 9061 389 852 973 284 466 1243 326 547 1752 1807 5340 13 198 198 464 1074 468 4166 257 3644 326 460 711 262 983 286 1514 11 257 3716 4811 983 326 9018 3867 257 3704 1088 257 3096 284 8006 517 7674
Decoded output as text:
how they would be used.
"I don't think I've ever seen a computer that could do anything that a human being could do," he said.
But now, thanks to the work of a team of researchers at the University of California, Berkeley, computers are being used to do things that were once thought impossible.
The team has developed a computer that can play the game of Go, a complex strategy game that involves moving a piece around a board to capture more territory
The last 9 tokens are different.
So the exact numerical shape of the tanh function makes a difference. At the very least, from a reproducibility perspective we have to maintain both versions. I don't know how to judge the quality: it may be that the quality is the same, just with slightly different probabilities that give different results in "greedy" mode while being statistically equivalent.
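To illustrate the effect, here is a Python comparison of math.tanh against one common rational approximation (this particular formula is an assumption; fastGPT's fast_tanh() may use a different one). The two agree to only a few decimal places, and since tanh sits inside GELU in every layer, such tiny differences can compound across layers and flip a near-tie in the greedy argmax, which is exactly the kind of late divergence seen above:

```python
# Compare the exact tanh with a cheap rational approximation.
import math

def fast_tanh(x):
    # hypothetical rational approximation; fastGPT's fast_tanh() may differ
    x2 = x * x
    return x * (27.0 + x2) / (27.0 + 9.0 * x2)

# maximum deviation from the exact tanh on [-4, 4]
diff = max(abs(math.tanh(0.1 * i) - fast_tanh(0.1 * i))
           for i in range(-40, 41))
# diff is small but nonzero, so two logits within ~diff of each other
# can swap order between the two implementations
```

Because a monotone approximation preserves the ordering of its own inputs, the flips only appear after the perturbed activations have been mixed through subsequent matmuls, consistent with the two runs agreeing for most of the output and diverging near the end.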
Batching means that multiple input streams are computed at the same time, which can vectorize better and thus speed up the inference per token.
cmake --build build
[ 6%] Building Fortran object CMakeFiles/fastgpt.dir/tokenizer.f90.o
[ 12%] Building Fortran object CMakeFiles/fastgpt.dir/gpt2.f90.o
[ 18%] Building Fortran object CMakeFiles/fastgpt.dir/omp_dummy.f90.o
[ 25%] Building Fortran object CMakeFiles/fastgpt.dir/driver.f90.o
semantic error: Namelists not implemented yet
--> /home/christoph/computing/fastGPT/driver.f90:20:1
|
20 | namelist / input_fastGPT / n_tokens_to_generate
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Note: Please report unclear or confusing messages as bugs at
https://github.com/lfortran/lfortran/issues.
gmake[2]: *** [CMakeFiles/fastgpt.dir/build.make:101: CMakeFiles/fastgpt.dir/driver.f90.o] Error 1
gmake[1]: *** [CMakeFiles/Makefile2:93: CMakeFiles/fastgpt.dir/all] Error 2
gmake: *** [Makefile:101: all] Error 2
With the ifort compiler I get a crash after a while.
$ FC=ifort cmake .. -DFASTGPT_BLAS=Fortran -DCMAKE_BUILD_TYPE=Debug
-- The C compiler identification is GNU 9.4.0
-- The CXX compiler identification is GNU 9.4.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- The Fortran compiler identification is Intel 20.2.5.20211109
-- Check for working Fortran compiler: /opt/intel/oneapi/compiler/2022.0.2/linux/bin/intel64/ifort
-- Check for working Fortran compiler: /opt/intel/oneapi/compiler/2022.0.2/linux/bin/intel64/ifort -- works
-- Detecting Fortran compiler ABI info
-- Detecting Fortran compiler ABI info - done
-- Checking whether /opt/intel/oneapi/compiler/2022.0.2/linux/bin/intel64/ifort supports Fortran 90
-- Checking whether /opt/intel/oneapi/compiler/2022.0.2/linux/bin/intel64/ifort supports Fortran 90 -- yes
Configuration results
---------------------
Fortran compiler: /opt/intel/oneapi/compiler/2022.0.2/linux/bin/intel64/ifort
Build type: Debug
Fortran compiler flags: -warn all -check all,noarg_temp_created -traceback -O1 -g
Installation prefix: /usr/local
FASTGPT_BLAS: Fortran
-- Configuring done
-- Generating done
-- Build files have been written to: /home/ivan/lrz/fastGPT/build
Output:
(fastgpt) ivan@maxwell:~/lrz/fastGPT$ gprofng collect app ./build/gpt2
Creating experiment directory test.10.er (Process ID: 260289) ...
Loading the model...
done.
Model parameters:
n_vocab = 50257
n_ctx = 1024
n_embd = 768
n_layer = 12
n_head = 12
Input parameters:
n_seq = 6
n_tokens_to_generate = 20
Input tokens:
464 995 286 9439 14448 284
Decoded input as text:
The world of tomorrow belongs to
Running model...
1 262
2 661
3 286
4 9439
5 13
6 198
7 198
8 464
9 995
10 286
11 9439
12 14448
13 284
14 262
15 661
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
gpt2 000000000042F1FA Unknown Unknown Unknown
libpthread-2.31.s 00007F2B36905420 Unknown Unknown Unknown
gpt2 00000000004268D3 linalg_mp_matmul_ 18 linalg_f.f90
gpt2 000000000041EB03 gpt2_mod_mp_gpt2_ 165 gpt2.f90
gpt2 00000000004208D6 gpt2_mod_mp_gener 194 gpt2.f90
gpt2 000000000040A56F MAIN__ 85 main.f90
gpt2 0000000000403862 Unknown Unknown Unknown
libc-2.31.so 00007F2B3671D083 __libc_start_main Unknown Unknown
gpt2 000000000040376E Unknown Unknown Unknown
I'm guessing this is related to the stack vs. heap array issue with the Intel compiler and matmul (https://fortran-lang.discourse.group/t/testing-the-performance-of-matmul-under-default-compiler-settings/4096/27).
Add a mode to interact with fastGPT as a chat bot.