NYCU 2022 spring AI Final Project

The main purpose of this project is to train a GPT-2 model and generate text with it.

Dependencies:
attr: 0.3.1
attrs: 21.4.0
matplotlib: 3.5.1
tensorflow: 2.9.1
tokenizers: 0.12.1
transformers: 4.19.4

Setup:
- Download the release.
- Follow the steps below to construct the directories.
- Put the files in the right positions.

Train the model:
python train.py

Write with the model:
python write.py [--dir <model_path>] [--max_len <expected_len>]

Structure of the program

Files:
- src/*
  - config.py: Stores the program configuration; it is instantiated in train.py or anywhere else it needs to be used.
  - model.py: Covers model initialization, training, saving, visualization, and log output; a combination of the core functions.
  - tokenization.py: Trains a BPE tokenizer (python tokenization.py). This needs to be run if there is no corresponding tokenizer yet.
- train.py: Loads the tokenizer, builds the model, sets up the project configuration, and starts training the model.
- write.py: Generates text with existing models, which are stored in the trained_model directory; a generation sketch follows.
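
The internals of write.py are not shown in this README; the following is a hedged sketch of what text generation with a trained model can look like. The model path, prompt, and sampling settings are all assumptions for illustration, not the script's actual defaults:

```python
from transformers import GPT2Tokenizer, TFGPT2LMHeadModel

model_path = "trained_model"  # hypothetical; corresponds to the --dir option
tokenizer = GPT2Tokenizer.from_pretrained(model_path)
model = TFGPT2LMHeadModel.from_pretrained(model_path)

# Encode a prompt and sample a continuation from the model.
inputs = tokenizer("Once upon a time", return_tensors="tf")
output_ids = model.generate(
    inputs["input_ids"],
    max_length=100,   # corresponds to the --max_len option
    do_sample=True,   # sample instead of greedy decoding
    top_k=50,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```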

Some directories may need to be created before the program runs:

mkdir trained_data
mkdir tokenized_data
mkdir trained_model

Make sure to put the data you want to train on under the trained_data directory.

An example structure, with the provided pretrained model and the data put in place, looks like the following.
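
The original example figure is not reproduced here; the tree below is a hypothetical reconstruction based only on the files and directories named in this README (the simplebooks-2 name comes from the train.py configuration shown below):

```
.
├── src/
│   ├── config.py
│   ├── model.py
│   └── tokenization.py
├── train.py
├── write.py
├── trained_data/
│   └── simplebooks-2/
├── tokenized_data/
└── trained_model/
```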

Some code may need to be modified for local use.

train.py:

""" Metadata ... """
# ...
config = ProjectConfig(
    ...,
    data_name="simplebooks-2",
)

data_name can be modified to point at your own dataset under trained_data.

Implement a BPE tokenizer to pre-process the text data (a training sketch follows this list):
- Summary of the tokenizers (huggingface.co)
- Aims to translate between human-readable text and numeric indices.
- Indices will be mapped to word embeddings (numerical representations of words); this is done by an embedding layer inside the model.
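
A minimal sketch of training a byte-level BPE tokenizer with the huggingface tokenizers library; the corpus path, vocabulary size, and special token here are assumptions, not the project's actual settings:

```python
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["trained_data/simplebooks-2/train.txt"],  # hypothetical corpus path
    vocab_size=20000,                                # assumed vocabulary size
    special_tokens=["<|endoftext|>"],                # GPT-2's conventional end-of-text token
)
tokenizer.save_model("tokenized_data")  # writes vocab.json and merges.txt
```

The vocab.json and merges.txt written by save_model are the files that GPT2Tokenizer.from_pretrained can later load.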

Load the tokenizer

This tokenizer is loaded during model initialization:
from transformers import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained(tokenizer_path)
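
As a quick check of the text-to-indices translation described above, a hypothetical round trip:

```python
ids = tokenizer.encode("Once upon a time")  # text -> numeric indices
text = tokenizer.decode(ids)                # indices -> text
print(ids, text)
```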

- Use transformers to construct the GPT-2 model (see the training sketch after this list).
- The stored History object is used to visualize the training history.
- Use matplotlib.pyplot to visualize the data (see the plotting sketch after this list).
- Validate the trained model against the loss value on the test set.
  - The training dataset is 10 times larger than the test dataset, and the two have no intersection.
  - This provides a way to test and tune the hyper-parameters.
  - The loss curves show no overfitting or underfitting. Since the training dataset is much larger than the test dataset, the training loss is lower than the test loss over the same number of batches.
- The visualization results are stored in trained_model/figure/.
- Detailed log output can be found in media/detail_output.md.
- The performance is compared with another project that also trains a GPT-2 model.
- The way to retrieve the baseline parameters is described in the "baseline" branch.
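
A minimal sketch of how a from-scratch GPT-2 can be built and trained with the TensorFlow classes in transformers. Every hyper-parameter value here is an illustrative assumption, and train_ds / test_ds stand in for hypothetical tf.data pipelines of (input_ids, labels) batches; details such as shifting labels by one position are omitted:

```python
import tensorflow as tf
from transformers import GPT2Config, TFGPT2LMHeadModel

config = GPT2Config(
    vocab_size=20000,  # must match the trained BPE tokenizer
    n_positions=512,   # maximum sequence length
    n_embd=256,
    n_layer=6,
    n_head=8,
)
model = TFGPT2LMHeadModel(config)

# Labels are plain token ids rather than one-hot vectors, so the sparse
# variant of cross entropy applies (see the cross-entropy reference below).
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)

# fit() returns the History object that is later visualized.
history = model.fit(train_ds, validation_data=test_ds, epochs=10)
```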
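
A sketch of visualizing the stored History object with matplotlib.pyplot; the output path mirrors the trained_model/figure/ directory mentioned above:

```python
import matplotlib.pyplot as plt

def plot_loss(history, out_path="trained_model/figure/loss.png"):
    """Plot train/test loss curves from a Keras History object."""
    plt.figure()
    plt.plot(history.history["loss"], label="train loss")
    plt.plot(history.history["val_loss"], label="test loss")
    plt.xlabel("epoch")
    plt.ylabel("loss")
    plt.legend()
    plt.savefig(out_path)
```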

References:
- Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX (github.com)
- Difference between Sparse Cross Entropy and Categorical Cross Entropy
- python - How to disable printing reports after each epoch in Keras? - Stack Overflow
- How to add some new special tokens to a pretrained tokenizer?