
ghost's People

Contributors: ljferrer

ghost's Issues

tBERT: Transformer-XL + BERT

Training:
Running average of Transformer-XL's output tokens over the preceding text + [CLS] token --> BERT

Generation:
BERT's input moves like a sliding window of bigrams, prepended with a [TXL] token from previous text

Notes:

  • May need to fine-tune both models before joining them.
  • How to align embedding spaces of these two models?
  • May need to train both from scratch.
  • Need to compare to Holographic BERT (Uses [CLS] token output from previous sequence)
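The "running average of Transformer-XL's output tokens" conditioning can be sketched with plain Python lists; `running_average`, `txl_outputs`, and `context_vector` are illustrative names, not actual model APIs:

```python
def running_average(vectors):
    """Incrementally averaged embedding over a sequence of vectors.

    Returns the mean of all vectors seen so far, updated one vector at
    a time as new Transformer-XL outputs arrive for the preceding text.
    """
    avg = None
    for n, vec in enumerate(vectors, start=1):
        if avg is None:
            avg = list(vec)
        else:
            # Incremental mean update: avg += (vec - avg) / n
            avg = [a + (v - a) / n for a, v in zip(avg, vec)]
    return avg

# Hypothetical Transformer-XL outputs for three preceding tokens (dim=2)
txl_outputs = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
context_vector = running_average(txl_outputs)  # -> [1.0, 1.0]
```

The resulting `context_vector` would be what gets prepended (alongside [CLS]) to BERT's input; the embedding-space alignment question in the notes is exactly about whether this vector is meaningful to BERT.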

Baseline BERT

  • First pass: Get word proposals from BERT base uncased

Extended Batch Preprocessing

The scope of #2 should be broken down into two parts:

  1. Get something running
  2. Fine-tuning experiments

In order to run fine-tuning experiments, we need a preprocessing script that takes CleanedRapCorpus and TrainingRegiment as input and serves formatted batches with tunable frequency distributions.

  • Use the BERT tokenizer from pytorch-pretrained-bert
  • Uniform Sequence Lengths per batch
  • Maximize batch information (i.e. minimize padding tokens by holding the number of non-padding tokens per batch constant and dynamically expanding batch width for a given target sequence length)
  • Term Frequency requirements
  • Minimum songs per artist
  • Fine-tuning regiment

Control N Tokens per Batch

Edit main() to have ordered epoch generation -> write to jsonl:

  • Batches are predefined to have n_target_tokens = batch width (bw) * uniform seq len
  • Shuffle batch seq lens
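The packing rule above can be sketched in a few lines: derive the batch width from each batch's uniform sequence length so every batch carries roughly n_target_tokens non-padding tokens, then shuffle the batches so sequence lengths vary across the epoch. Function and variable names are illustrative:

```python
import random

def make_batches(seqs_by_len, n_target_tokens):
    """Group same-length sequences into batches of width
    n_target_tokens // seq_len, so each batch holds roughly the same
    number of non-padding tokens regardless of sequence length."""
    batches = []
    for seq_len, seqs in seqs_by_len.items():
        bw = max(1, n_target_tokens // seq_len)  # batch width for this length
        for i in range(0, len(seqs), bw):
            batches.append(seqs[i:i + bw])
    random.shuffle(batches)  # shuffle batch seq lens across the epoch
    return batches
```

For example, with a 40-token budget, length-10 sequences get batch width 4 while length-20 sequences get width 2, keeping per-batch token counts comparable without padding waste.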

Fine-Tuning Script

#8 reduces scope: just get a running fine-tuning script with model evals

  • Perplexity from dev set
  • Save train logs (b5a77ca)
  • Save train checkpoints (8f79846)
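For the dev-set eval above, perplexity is the exponential of the mean per-token cross-entropy loss. A minimal sketch (the loss values are hypothetical placeholders for what the model would report):

```python
import math

def perplexity(token_losses):
    """Perplexity = exp(mean per-token negative log-likelihood)."""
    return math.exp(sum(token_losses) / len(token_losses))

dev_losses = [2.0, 2.5, 3.0]  # hypothetical per-token NLL on the dev set
ppl = perplexity(dev_losses)  # exp(2.5), roughly 12.18
```

Lower is better; logging this alongside the train checkpoints gives a single comparable number per checkpoint.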

Handling Multi-token Words

Some words are broken down into multiple tokens, e.g. 'disrespectful' = "di", "##sr", "##es", "##pe", "##ct", "##ful".

When a user asks for a word, they may want a multi-token suggestion. How can we search across a variable number of tokens?

Beam Search? or something else?
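One answer to the beam-search question, sketched generically: expand candidate words token by token, keeping the top-k partial sequences by cumulative log-probability, and stop a candidate when no continuation exists. The `toy_scorer` below is a hand-built stand-in for BERT's masked-LM probabilities, not real model output:

```python
def beam_search(score_next, beam_width, max_tokens):
    """Beam search over variable-length token sequences.

    score_next(prefix) returns {token: log_prob} for possible next
    tokens, or an empty dict when the word is complete.
    """
    beams = [((), 0.0)]  # (token prefix, cumulative log-prob)
    finished = []
    for _ in range(max_tokens):
        candidates = []
        for prefix, score in beams:
            options = score_next(prefix)
            if not options:
                finished.append((prefix, score))
            else:
                for tok, logp in options.items():
                    candidates.append((prefix + (tok,), score + logp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if not beams:
            break
    finished.extend(beams)
    return max(finished, key=lambda c: c[1])

# Toy stand-in for masked-LM scores: two ways to build a word
VOCAB = {
    (): {"disrespect": -1.0, "di": -2.0},
    ("disrespect",): {"##ful": -0.5},
    ("di",): {"##sr": -0.5},
}
def toy_scorer(prefix):
    return VOCAB.get(prefix, {})
```

With a real model, `score_next` would come from re-running BERT with one more [MASK] slot appended; the beam keeps the variable-length search tractable.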

Dataset Datasheet

  • Create README.md for dataset
  • Data Acquisition
  • Data Cleaning
  • Tokenization
  • Artist Statistics
    • n songs
    • dates released
    • vocabulary sizes
    • Term frequencies
    • Key term dispersion plots
    • etc.

Young BERT

bert-large-uncased

  • Multiple GPUs
  • Batch size: 32 - 64 - 128
  • Learning rate: 3e-5 - 5e-5 - 6e-5
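The sweep above is a small grid; `itertools.product` enumerates all nine (batch size, learning rate) configurations to launch:

```python
import itertools

batch_sizes = [32, 64, 128]
learning_rates = [3e-5, 5e-5, 6e-5]

# 9 (batch_size, lr) configurations for fine-tuning bert-large-uncased
grid = list(itertools.product(batch_sizes, learning_rates))
for bs, lr in grid:
    pass  # launch a fine-tuning run with these settings
```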

Download Rap Lyric Dataset

  • iPynb to scrape artists

n Artists Downloaded:

  • 340/1362
  • 680/1362
  • 1020/1362
  • 1362/1362

Artists: Wikipedia Top Hip Hop Musicians
Source: Genius.com
Means: LyricsGenius pip package

Notes to be aware of:

  • Sometimes the artist names from Wikipedia's list are automatically changed by LyricsGenius
    • 80% of the time, it trivially changes punctuation to find the correct artist index, as one would expect
    • 20% of the time, it finds a similar(ish) name of a verifiably different artist (i.e. F(something)??? --> Fall Out Boy (manually deleted with rm lyrics_falloutboy_*.json))
  • Using excluded_terms = ['(Remix)', '(Live)', '(Translation)'] automatically skips songs with 'live' anywhere in the title string (a plain substring match)
    • 30% of the time, a song is skipped erroneously because the title contains a word like 'Alive'
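The 'Alive' false positives come from substring matching; anchoring each excluded term at word boundaries avoids them. A sketch of the fix (the regex here is my suggestion, not LyricsGenius behavior):

```python
import re

EXCLUDED_TERMS = ["Remix", "Live", "Translation"]
# \b ensures 'Live' matches only as a whole word, so 'Alive' passes
EXCLUDED_RE = re.compile(
    r"\b(" + "|".join(map(re.escape, EXCLUDED_TERMS)) + r")\b",
    re.IGNORECASE,
)

def should_skip(title):
    """True if the song title contains an excluded term as a whole word."""
    return bool(EXCLUDED_RE.search(title))
```

`should_skip("Song (Live)")` is true while `should_skip("Staying Alive")` is false, which is the behavior the notes above want.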

Script to Aggregate & Clean Dataset

Rules:

  • Assert detected language == English
  • Skip songs with lines > 80 chars
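The two rules can be sketched as a single filter; `detect_language` here is a stand-in for a real language detector (e.g. the langdetect package), passed in rather than implemented:

```python
def keep_song(lyrics, detect_language, max_line_len=80):
    """Apply the cleaning rules: English-only, no lines over 80 chars."""
    if detect_language(lyrics) != "en":
        return False
    return all(len(line) <= max_line_len for line in lyrics.splitlines())
```

Overlong lines are a useful proxy for non-lyric content (liner notes, scraped prose) leaking into the corpus, which is why the 80-char rule sits next to the language check.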

Copy fields:

  • artist: " "
  • title: " "
  • year: "YYYY-MM-DD"
  • image: "url"
  • raw_lyrics: "..."

Clean lyrics fields:

  • sections: [ {type: [Header], lines: [ " ", ... ] } ]
    • Remove ad-libs (wut wut)

Mask Rhymes more often

  • Mask the token before '[SEP]'

Optional:

  • This type of masking probability should be an additional CLI argument (default 50%) -- settled on a hard-coded 40/60 rhyme/random split
  • Avoid masking the first rhyming word -- masked tokens are randomly sampled
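The hard-coded 40/60 split above amounts to: with probability 0.4 mask the token before '[SEP]' (the rhyme position), otherwise mask a uniformly sampled non-special token. A sketch (function name and rng plumbing are illustrative):

```python
import random

def choose_mask_index(tokens, rhyme_prob=0.4, rng=random):
    """Pick a token index to mask: the token before '[SEP]' with
    probability rhyme_prob, else a random non-special token."""
    sep = tokens.index("[SEP]")
    if rng.random() < rhyme_prob:
        return sep - 1  # rhyme position: last token before [SEP]
    candidates = [i for i, t in enumerate(tokens)
                  if t not in ("[CLS]", "[SEP]")]
    return rng.choice(candidates)
```

Because the random branch samples uniformly, the rhyme token can still be masked there too, so the effective rhyme-masking rate is slightly above 40%.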
