NYCU 2022 spring AI Final Project

The main purpose of this project is to train a GPT-2 model and generate text with it.

Dependencies:
attr: 0.3.1
attrs: 21.4.0
matplotlib: 3.5.1
tensorflow: 2.9.1
tokenizers: 0.12.1
transformers: 4.19.4

Setup:
- Download the release.
- Follow the steps below to construct the directories.
- Put the files in the right positions.

Train the model:
python train.py

Write with the model:
python write.py [--dir <model_path>] [--max_len <expected_len>]

Structure of the program

Files:
- src/*
  - config.py: Stores the program configuration; it is instantiated in train.py or anywhere else it needs to be used.
  - model.py: Covers model initialization, training, saving, visualization, and log output; a combination of the core functions.
  - tokenization.py: Trains a BPE tokenizer (python tokenization.py). This needs to be run if there is no corresponding tokenizer yet.
- train.py: Loads the tokenizer, builds the model, sets up the project configuration, and starts training the model.
- write.py: Generates text with existing models, which are stored in the trained_model directory; a generation sketch follows.
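
The internals of write.py are not shown in this README; the following is a hedged sketch of what text generation with a trained model can look like. The model path, prompt, and sampling settings are all assumptions for illustration, not the script's actual defaults:

```python
from transformers import GPT2Tokenizer, TFGPT2LMHeadModel

model_path = "trained_model"  # hypothetical; corresponds to the --dir option
tokenizer = GPT2Tokenizer.from_pretrained(model_path)
model = TFGPT2LMHeadModel.from_pretrained(model_path)

# Encode a prompt and sample a continuation from the model.
inputs = tokenizer("Once upon a time", return_tensors="tf")
output_ids = model.generate(
    inputs["input_ids"],
    max_length=100,   # corresponds to the --max_len option
    do_sample=True,   # sample instead of greedy decoding
    top_k=50,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```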

Some directories may need to be created before the program runs:

mkdir trained_data
mkdir tokenized_data
mkdir trained_model

Make sure to put the data you want to train on under the trained_data directory.

An example structure, with the provided pretrained model and the data put in place, looks like the following.
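
The original example figure is not reproduced here; the tree below is a hypothetical reconstruction based only on the files and directories named in this README (the simplebooks-2 name comes from the train.py configuration shown below):

```
.
├── src/
│   ├── config.py
│   ├── model.py
│   └── tokenization.py
├── train.py
├── write.py
├── trained_data/
│   └── simplebooks-2/
├── tokenized_data/
└── trained_model/
```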

Some code may need to be modified for local use.

train.py:

""" Metadata ... """
# ...
config = ProjectConfig(
    ...,
    data_name="simplebooks-2",
)

data_name can be modified to point at your own dataset under trained_data.

Implement a BPE tokenizer to pre-process the text data (a training sketch follows this list):
- Summary of the tokenizers (huggingface.co)
- Aims to translate between human-readable text and numeric indices.
- Indices will be mapped to word embeddings (numerical representations of words); this is done by an embedding layer inside the model.
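
A minimal sketch of training a byte-level BPE tokenizer with the huggingface tokenizers library; the corpus path, vocabulary size, and special token here are assumptions, not the project's actual settings:

```python
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["trained_data/simplebooks-2/train.txt"],  # hypothetical corpus path
    vocab_size=20000,                                # assumed vocabulary size
    special_tokens=["<|endoftext|>"],                # GPT-2's conventional end-of-text token
)
tokenizer.save_model("tokenized_data")  # writes vocab.json and merges.txt
```

The vocab.json and merges.txt written by save_model are the files that GPT2Tokenizer.from_pretrained can later load.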

Load the tokenizer

This tokenizer is loaded during model initialization:
from transformers import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained(tokenizer_path)
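
As a quick check of the text-to-indices translation described above, a hypothetical round trip:

```python
ids = tokenizer.encode("Once upon a time")  # text -> numeric indices
text = tokenizer.decode(ids)                # indices -> text
print(ids, text)
```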

- Use transformers to construct the GPT-2 model (see the training sketch after this list).
- The stored History object is used to visualize the training history.
- Use matplotlib.pyplot to visualize the data (see the plotting sketch after this list).
- Validate the trained model against the loss value on the test set.
  - The training dataset is 10 times larger than the test dataset, and the two have no intersection.
  - This provides a way to test and tune the hyper-parameters.
  - The loss curves show no overfitting or underfitting. Since the training dataset is much larger than the test dataset, the training loss is lower than the test loss over the same number of batches.
- The visualization results are stored in trained_model/figure/.
- Detailed log output can be found in media/detail_output.md.
- The performance is compared with another project that also trains a GPT-2 model.
- The way to retrieve the baseline parameters is described in the "baseline" branch.
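
A minimal sketch of how a from-scratch GPT-2 can be built and trained with the TensorFlow classes in transformers. Every hyper-parameter value here is an illustrative assumption, and train_ds / test_ds stand in for hypothetical tf.data pipelines of (input_ids, labels) batches; details such as shifting labels by one position are omitted:

```python
import tensorflow as tf
from transformers import GPT2Config, TFGPT2LMHeadModel

config = GPT2Config(
    vocab_size=20000,  # must match the trained BPE tokenizer
    n_positions=512,   # maximum sequence length
    n_embd=256,
    n_layer=6,
    n_head=8,
)
model = TFGPT2LMHeadModel(config)

# Labels are plain token ids rather than one-hot vectors, so the sparse
# variant of cross entropy applies (see the cross-entropy reference below).
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)

# fit() returns the History object that is later visualized.
history = model.fit(train_ds, validation_data=test_ds, epochs=10)
```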
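
A sketch of visualizing the stored History object with matplotlib.pyplot; the output path mirrors the trained_model/figure/ directory mentioned above:

```python
import matplotlib.pyplot as plt

def plot_loss(history, out_path="trained_model/figure/loss.png"):
    """Plot train/test loss curves from a Keras History object."""
    plt.figure()
    plt.plot(history.history["loss"], label="train loss")
    plt.plot(history.history["val_loss"], label="test loss")
    plt.xlabel("epoch")
    plt.ylabel("loss")
    plt.legend()
    plt.savefig(out_path)
```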

References:
- Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX (github.com)
- Difference between Sparse Cross Entropy and Categorical Cross Entropy
- python - How to disable printing reports after each epoch in Keras? - Stack Overflow
- How to add some new special tokens to a pretrained tokenizer?