
ghost's People

Contributors: ljferrer

ghost's Issues

tBERT: Transformer-XL + BERT

Training:
Running average of Transformer-XL's output tokens over the preceding text + [CLS] token --> BERT

Generation:
BERT's input moves like a sliding window of bigrams, prepended with a [TXL] token from previous text

Notes:

  • May need to fine-tune both models before joining them.
  • How to align embedding spaces of these two models?
  • May need to train both from scratch.
  • Need to compare to Holographic BERT (Uses [CLS] token output from previous sequence)
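The "running average of Transformer-XL's output tokens" conditioning can be sketched with plain Python lists; `running_average`, `txl_outputs`, and `context_vector` are illustrative names, not actual model APIs:

```python
def running_average(vectors):
    """Incrementally averaged embedding over a sequence of vectors.

    Returns the mean of all vectors seen so far, updated one vector at
    a time as new Transformer-XL outputs arrive for the preceding text.
    """
    avg = None
    for n, vec in enumerate(vectors, start=1):
        if avg is None:
            avg = list(vec)
        else:
            # Incremental mean update: avg += (vec - avg) / n
            avg = [a + (v - a) / n for a, v in zip(avg, vec)]
    return avg

# Hypothetical Transformer-XL outputs for three preceding tokens (dim=2)
txl_outputs = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
context_vector = running_average(txl_outputs)  # -> [1.0, 1.0]
```

The resulting `context_vector` would be what gets prepended (alongside [CLS]) to BERT's input; the embedding-space alignment question in the notes is exactly about whether this vector is meaningful to BERT.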

Baseline BERT

  • First pass: Get word proposals from BERT base uncased

Extended Batch Preprocessing

The scope of #2 should be broken down into two parts:

  1. Get something running
  2. Fine-tuning experiments

In order to run fine-tuning experiments, we need a preprocessing script that takes CleanedRapCorpus and TrainingRegiment as input and serves formatted batches with tunable frequency distributions.

  • Use the BERT tokenizer from pytorch-pretrained-bert
  • Uniform Sequence Lengths per batch
  • Maximize batch information (i.e. minimize padding tokens by holding the number of non-padding tokens per batch constant and dynamically expanding batch width for a given target sequence length)
  • Term Frequency requirements
  • Minimum songs per artist
  • Fine-tuning regiment

Control N Tokens per Batch

Edit main() to have ordered epoch generation -> write to jsonl:

  • Batches are predefined to have n_target_tokens = batch width (bw) * uniform seq len
  • Shuffle batch seq lens
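The packing rule above can be sketched in a few lines: derive the batch width from each batch's uniform sequence length so every batch carries roughly n_target_tokens non-padding tokens, then shuffle the batches so sequence lengths vary across the epoch. Function and variable names are illustrative:

```python
import random

def make_batches(seqs_by_len, n_target_tokens):
    """Group same-length sequences into batches of width
    n_target_tokens // seq_len, so each batch holds roughly the same
    number of non-padding tokens regardless of sequence length."""
    batches = []
    for seq_len, seqs in seqs_by_len.items():
        bw = max(1, n_target_tokens // seq_len)  # batch width for this length
        for i in range(0, len(seqs), bw):
            batches.append(seqs[i:i + bw])
    random.shuffle(batches)  # shuffle batch seq lens across the epoch
    return batches
```

For example, with a 40-token budget, length-10 sequences get batch width 4 while length-20 sequences get width 2, keeping per-batch token counts comparable without padding waste.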

Fine-Tuning Script

#8 reduces scope: just get a running fine-tuning script with model evals

  • Perplexity from dev set
  • Save train logs (b5a77ca)
  • Save train checkpoints (8f79846)
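For the dev-set eval above, perplexity is the exponential of the mean per-token cross-entropy loss. A minimal sketch (the loss values are hypothetical placeholders for what the model would report):

```python
import math

def perplexity(token_losses):
    """Perplexity = exp(mean per-token negative log-likelihood)."""
    return math.exp(sum(token_losses) / len(token_losses))

dev_losses = [2.0, 2.5, 3.0]  # hypothetical per-token NLL on the dev set
ppl = perplexity(dev_losses)  # exp(2.5), roughly 12.18
```

Lower is better; logging this alongside the train checkpoints gives a single comparable number per checkpoint.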

Handling Multi-token Words

Some words are broken down into multiple tokens, e.g. 'disrespectful' = "di", "##sr", "##es", "##pe", "##ct", "##ful".

When a user asks for a word, they may want a multi-token suggestion. How can we search across a variable number of tokens?

Beam Search? or something else?
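One answer to the beam-search question, sketched generically: expand candidate words token by token, keeping the top-k partial sequences by cumulative log-probability, and stop a candidate when no continuation exists. The `toy_scorer` below is a hand-built stand-in for BERT's masked-LM probabilities, not real model output:

```python
def beam_search(score_next, beam_width, max_tokens):
    """Beam search over variable-length token sequences.

    score_next(prefix) returns {token: log_prob} for possible next
    tokens, or an empty dict when the word is complete.
    """
    beams = [((), 0.0)]  # (token prefix, cumulative log-prob)
    finished = []
    for _ in range(max_tokens):
        candidates = []
        for prefix, score in beams:
            options = score_next(prefix)
            if not options:
                finished.append((prefix, score))
            else:
                for tok, logp in options.items():
                    candidates.append((prefix + (tok,), score + logp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if not beams:
            break
    finished.extend(beams)
    return max(finished, key=lambda c: c[1])

# Toy stand-in for masked-LM scores: two ways to build a word
VOCAB = {
    (): {"disrespect": -1.0, "di": -2.0},
    ("disrespect",): {"##ful": -0.5},
    ("di",): {"##sr": -0.5},
}
def toy_scorer(prefix):
    return VOCAB.get(prefix, {})
```

With a real model, `score_next` would come from re-running BERT with one more [MASK] slot appended; the beam keeps the variable-length search tractable.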

Dataset Datasheet

  • Create README.md for dataset
  • Data Acquisition
  • Data Cleaning
  • Tokenization
  • Artist Statistics
    • n songs
    • dates released
    • vocabulary sizes
    • Term frequencies
    • Key term dispersion plots
    • etc.

Young BERT

bert-large-uncased

  • Multiple GPUs
  • Batch size: 32 - 64 - 128
  • Learning rate: 3e-5 - 5e-5 - 6e-5
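The sweep above is a small grid; `itertools.product` enumerates all nine (batch size, learning rate) configurations to launch:

```python
import itertools

batch_sizes = [32, 64, 128]
learning_rates = [3e-5, 5e-5, 6e-5]

# 9 (batch_size, lr) configurations for fine-tuning bert-large-uncased
grid = list(itertools.product(batch_sizes, learning_rates))
for bs, lr in grid:
    pass  # launch a fine-tuning run with these settings
```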

Download Rap Lyric Dataset

  • iPynb to scrape artists

n Artists Downloaded:

  • 340/1362
  • 680/1362
  • 1020/1362
  • 1362/1362

Artists: Wikipedia Top Hip Hop Musicians
Source: Genius.com
Means: LyricsGenius pip package

Notes to be aware of:

  • Sometimes the artist names from Wikipedia's list are automatically changed by LyricsGenius
    • 80% of the time, it trivially changes punctuation to find the correct artist index, as one would expect
    • 20% of the time, it finds a similar(ish) name of a verifiably different artist (i.e. F(something)??? --> Fall Out Boy (manually deleted with rm lyrics_falloutboy_*.json))
  • Using excluded_terms = ['(Remix)', '(Live)', '(Translation)'] automatically skips songs with 'live' anywhere in the title string (a plain substring match)
    • 30% of the time, a song is skipped erroneously because the title contains a word like 'Alive'
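The 'Alive' false positives come from substring matching; anchoring each excluded term at word boundaries avoids them. A sketch of the fix (the regex here is my suggestion, not LyricsGenius behavior):

```python
import re

EXCLUDED_TERMS = ["Remix", "Live", "Translation"]
# \b ensures 'Live' matches only as a whole word, so 'Alive' passes
EXCLUDED_RE = re.compile(
    r"\b(" + "|".join(map(re.escape, EXCLUDED_TERMS)) + r")\b",
    re.IGNORECASE,
)

def should_skip(title):
    """True if the song title contains an excluded term as a whole word."""
    return bool(EXCLUDED_RE.search(title))
```

`should_skip("Song (Live)")` is true while `should_skip("Staying Alive")` is false, which is the behavior the notes above want.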

Script to Aggregate & Clean Dataset

Rules:

  • Assert detected language == English
  • Skip songs with lines > 80 chars
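The two rules can be sketched as a single filter; `detect_language` here is a stand-in for a real language detector (e.g. the langdetect package), passed in rather than implemented:

```python
def keep_song(lyrics, detect_language, max_line_len=80):
    """Apply the cleaning rules: English-only, no lines over 80 chars."""
    if detect_language(lyrics) != "en":
        return False
    return all(len(line) <= max_line_len for line in lyrics.splitlines())
```

Overlong lines are a useful proxy for non-lyric content (liner notes, scraped prose) leaking into the corpus, which is why the 80-char rule sits next to the language check.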

Copy fields:

  • artist: " "
  • title: " "
  • year: "YYYY-MM-DD"
  • image: "url"
  • raw_lyrics: "..."

Clean lyrics fields:

  • sections: [ {type: [Header], lines: [ " ", ... ] } ]
    • Remove ad-libs (wut wut)

Mask Rhymes more often

  • Mask the token before '[SEP]'

Optional:

  • This type of masking probability should be an additional CLI argument (default 50%) -- settled on a hard-coded 40/60 rhyme/random split
  • Avoid masking the first rhyming word -- masked tokens are randomly sampled
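The hard-coded 40/60 split above amounts to: with probability 0.4 mask the token before '[SEP]' (the rhyme position), otherwise mask a uniformly sampled non-special token. A sketch (function name and rng plumbing are illustrative):

```python
import random

def choose_mask_index(tokens, rhyme_prob=0.4, rng=random):
    """Pick a token index to mask: the token before '[SEP]' with
    probability rhyme_prob, else a random non-special token."""
    sep = tokens.index("[SEP]")
    if rng.random() < rhyme_prob:
        return sep - 1  # rhyme position: last token before [SEP]
    candidates = [i for i, t in enumerate(tokens)
                  if t not in ("[CLS]", "[SEP]")]
    return rng.choice(candidates)
```

Because the random branch samples uniformly, the rhyme token can still be masked there too, so the effective rhyme-masking rate is slightly above 40%.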
