
PyFormer

PyFormer is a Transformer-based machine translation model that converts English sentences into Python code.

Dataset

The dataset is a small collection of English descriptions paired with their corresponding Python code, approximately 4,000 entries. You can download the dataset here

Project Setup

Training

Place the dataset file in the root folder.
You can set the configuration in the config file.
To train the model, run the command below.
$ python3 main.py
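The config file is referenced throughout this README. As a rough idea of what it might contain, here is a hypothetical sketch; only `embeddings_training` is actually named in this README, and all other option names, filenames, and values are illustrative assumptions:

```python
# Hypothetical config.py sketch — only embeddings_training appears in this
# README; the other option names and values are illustrative assumptions.
dataset_path = "english_python_data.txt"  # dataset file placed in the root folder
inference_file = "sentences.txt"          # file of English sentences to translate
epochs = 50
batch_size = 32
embeddings_training = True                # set to False to skip GloVe training
```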

Inferencing

The training script saves the model along with the source and target vocabularies.
If you migrate to another machine, place them in the root directory of the project.
Add your English sentences to a file and set the filename in the config.
The command below runs inference with the model.

$ python3 pyformer.py

Data Cleaning and Preparation

The data was cleaned manually by removing unnecessary indentation and comments.

The data needs to be tokenized properly before being fed to the model. The English sentences were tokenized using spaCy. To tokenize the Python code, a lexical analyzer tool was written that tokenizes the code according to Python's grammar.
If you wish to use the model for another language, you can modify the tool here
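The custom lexer itself is not shown here, but Python's standard library offers a comparable starting point. A minimal sketch using the built-in `tokenize` module (this is an illustrative alternative, not the author's tool):

```python
import io
import tokenize

def python_tokens(code: str) -> list[str]:
    """Split a snippet of Python source into its lexical tokens."""
    return [
        tok.string
        for tok in tokenize.generate_tokens(io.StringIO(code).readline)
        if tok.string  # drop empty DEDENT/ENDMARKER tokens
    ]

print(python_tokens("def add(a, b):\n    return a + b\n"))
```

Unlike whitespace splitting, this keeps operators, indentation, and newlines as separate tokens, which matters because indentation is syntactically significant in Python.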

Embeddings Training

The embeddings are trained using GloVe. If you do not wish to train the embeddings, set embeddings_training to False in the config. GloVe minimizes a weighted least-squares error over token co-occurrence statistics to capture correlations between vocabulary tokens. The trained embedding weights are copied directly into the Decoder's embedding layer.
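Copying pretrained vectors into a decoder embedding layer can be sketched as follows. This is a minimal PyTorch sketch with illustrative shapes and a random stand-in matrix; the project's actual code may differ:

```python
import torch
import torch.nn as nn

# Illustrative shapes: vocabulary of 5000 Python tokens, 256-dim vectors.
vocab_size, emb_dim = 5000, 256
glove_weights = torch.randn(vocab_size, emb_dim)  # stand-in for the trained GloVe matrix

decoder_embedding = nn.Embedding(vocab_size, emb_dim)
with torch.no_grad():
    decoder_embedding.weight.copy_(glove_weights)

# Optionally freeze the layer so training does not update the pretrained vectors.
decoder_embedding.weight.requires_grad = False
```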

Model Architecture

The model uses an Encoder-Decoder architecture with Multi-Head Attention.
Click here for a clear explanation and visualization of the architecture.
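As a rough sketch of this kind of architecture, an encoder-decoder Transformer with multi-head attention can be instantiated with PyTorch's built-in module. The hyperparameters below are illustrative assumptions, not the project's actual settings:

```python
import torch
import torch.nn as nn

# Illustrative encoder-decoder Transformer with multi-head attention.
model = nn.Transformer(
    d_model=256, nhead=8,
    num_encoder_layers=3, num_decoder_layers=3,
    batch_first=True,
)
src = torch.randn(2, 10, 256)  # (batch, src_len, d_model) — already embedded
tgt = torch.randn(2, 7, 256)   # (batch, tgt_len, d_model)
out = model(src, tgt)
print(out.shape)  # torch.Size([2, 7, 256])
```

The decoder output has one vector per target position, which is then projected onto the target vocabulary to predict the next Python token.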

Results

The model gives fairly good results.
You can view the results of 35+ examples here

Metrics

The model uses the cross-entropy loss function.
The model achieves a minimum validation loss of 1.4 at the 15th epoch; the validation perplexity is 4.1.

The model achieves a minimum training loss of 0.2 at the 50th epoch; the training perplexity is 1.2.
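Perplexity here is simply the exponential of the cross-entropy loss, which is consistent with the numbers above:

```python
import math

# Perplexity = exp(cross-entropy loss)
print(round(math.exp(1.4), 1))  # validation: 4.1
print(round(math.exp(0.2), 1))  # training:   1.2
```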

The model was also evaluated with BLEU score and word error rate; however, since the length of the generated code depends on the logic it implements, neither is a suitable metric for this problem.
