
Writing with AI

Purpose

  • NYCU 2022 spring AI Final Project

  • The main goal is to train a GPT-2 model and use it to generate text

Requirements

attr: 0.3.1
attrs: 21.4.0
matplotlib: 3.5.1
tensorflow: 2.9.1
tokenizers: 0.12.1
transformers: 4.19.4

Simple Use

  • Download the release.

  • Follow the steps below to construct the directories.

  • Put the files in the right positions.

  • Train the model:

    python train.py
  • Write with the model:

    python write.py [--dir <model_path>] [--max_len <expected_len>]

Code Structure

Structure of Program

  • Structure of the program

    (figure: program structure)

  • files

    • src/*

      • config.py: Stores the program configuration; it is instantiated in train.py or wherever it needs to be used.

      • model.py: A collection of the core functions: model initialization, training, saving, visualization, and log output.

      • tokenization.py: Trains a BPE tokenizer.

        python tokenization.py

        This needs to be run if no corresponding tokenizer exists yet.

    • train.py

      • It loads the tokenizer, builds the model, sets up the project configuration, and starts training the model.
    • write.py

      • It can be used to generate text with existing models, which are stored in the trained_model directory.

Mkdir

  • Some directories need to be created before the program runs

    mkdir trained_data
    mkdir tokenized_data
    mkdir trained_model

    Make sure to put the data you want to train on under the trained_data directory.

  • An example structure, with the provided pretrained model and the data in place:

    (figure: sample directory layout)

Modify the config

  • Some code may need to be modified for local use

  • train.py

    """ Metadata
    ...
    """
    # ...
    
    config = ProjectConfig(
    	...,
    	data_name="simplebooks-2"
    )

    data_name can be modified to select a different dataset
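For reference, ProjectConfig might look roughly like the dataclass sketch below; apart from data_name and the directory names from this README, the field names and defaults are assumptions, not the project's actual code:

```python
from dataclasses import dataclass

@dataclass
class ProjectConfig:
    data_name: str = "simplebooks-2"      # dataset directory under trained_data/
    tokenizer_dir: str = "tokenized_data"
    model_dir: str = "trained_model"
    epochs: int = 30                      # hypothetical defaults
    batch_size: int = 16

# Mirrors the instantiation shown in train.py above.
config = ProjectConfig(data_name="simplebooks-2")
print(config.data_name)  # simplebooks-2
```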

Preprocessing

BPE Tokenizer

  • Implement a BPE tokenizer to preprocess the text data

    • Summary of the tokenizers (huggingface.co)
    • Aims to translate between human-readable text and numeric indices
    • Indices will be mapped to word embeddings (numerical representations of words) -> this is done by an embedding layer inside the model
  • Load the tokenizer

    • The tokenizer is loaded during model initialization

      tokenizer = GPT2Tokenizer.from_pretrained(tokenizer_path)
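To illustrate the text-to-indices mapping, here is a hedged sketch of training and using a byte-level BPE tokenizer with the tokenizers library; the toy in-memory corpus and vocabulary size are assumptions (tokenization.py would instead train on files under trained_data/ and save the result into tokenized_data/ for GPT2Tokenizer.from_pretrained to load):

```python
from tokenizers import ByteLevelBPETokenizer

# Toy in-memory corpus; the real script would pass file paths instead.
corpus = ["The quick brown fox jumps over the lazy dog."] * 100

tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(corpus, vocab_size=400, special_tokens=["<|endoftext|>"])

enc = tokenizer.encode("The quick brown fox")
print(enc.ids)                    # human-readable text -> numeric indices
print(tokenizer.decode(enc.ids))  # numeric indices -> human-readable text
```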

Training

TFGPT2LMHeadModel

  • Use transformers to construct the GPT-2 model

History

  • The stored History object is used to visualize the training history.
  • matplotlib.pyplot is used to visualize the data
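The plotting step might look like the sketch below; the loss values are made-up stand-ins for history.history, and the output file name is an assumption:

```python
import matplotlib
matplotlib.use("Agg")  # render to a file, no display needed
import matplotlib.pyplot as plt

# Stand-in for the history.history dict returned by model.fit().
history = {"loss": [4.2, 3.1, 2.6], "val_loss": [4.0, 3.3, 2.9]}

plt.plot(history["loss"], label="train loss")
plt.plot(history["val_loss"], label="test loss")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.savefig("loss_per_epoch.png")  # the real code saves under trained_model/figure/
```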

Validation

  • Validate the trained model against the loss value on the test set
  • The training dataset is 10 times larger than the test dataset, and the two sets do not intersect
  • A way to test and tune the hyper-parameters

The loss curves show no fitting problems.

Since the training dataset is much larger than the test dataset, the model reaches a lower loss value on the training set over the same number of batches.
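The 10:1 disjoint split described above can be sketched with tf.data; the dataset contents and sizes here are placeholders:

```python
import tensorflow as tf

# Stand-in for the tokenized training examples; sizes are placeholders.
examples = tf.data.Dataset.range(1100)
test_ds = examples.take(100)    # test set
train_ds = examples.skip(100)   # 10x larger, disjoint from the test set

print(int(train_ds.cardinality()), int(test_ds.cardinality()))  # 1000 100
```

Because take and skip use the same cut point, no example appears in both sets.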

Result

  • This visualization result will be stored in trained_model/figure/
  • Detail log output can be found in media/detail_output.md

Text Generation

Loss Value Per Epoch

(figure: loss per epoch)

Loss Value Per Batch

(figure: loss per batch)

Performance

  • The performance is compared with another project that also trains the GPT-2 model

    (figure: baseline_comparison)

  • The way to retrieve the baseline parameters is described in the "baseline" branch

Different Normalizer

(figure: simplebooks2-30-diff-normalizer)


Contributors

  • lyz508
