
minimaxir / aitextgen


A robust Python tool for text-based AI training and generation using GPT-2.

Home Page: https://docs.aitextgen.io

License: MIT License

Python 58.91% Jupyter Notebook 40.84% Dockerfile 0.24%

aitextgen's Introduction

aitextgen

A robust Python tool for text-based AI training and generation using OpenAI's GPT-2 and EleutherAI's GPT Neo/GPT-3 architecture.

aitextgen is a Python package that leverages PyTorch, Hugging Face Transformers and pytorch-lightning with specific optimizations for text generation using GPT-2, plus many added features. It is the successor to textgenrnn and gpt-2-simple, taking the best of both packages:

  • Finetunes on a pretrained 124M/355M/774M GPT-2 model from OpenAI or a 125M/350M GPT Neo model from EleutherAI...or create your own GPT-2/GPT Neo model + tokenizer and train from scratch!
  • Generates text faster than gpt-2-simple and with better memory efficiency!
  • With Transformers, aitextgen preserves compatibility with the base package, allowing you to use the model for other NLP tasks, download custom GPT-2 models from the HuggingFace model repository, and upload your own models! It also provides a generate() function that offers extensive control over the generated text.
  • With pytorch-lightning, aitextgen trains models not just on CPUs and single GPUs, but also on multiple GPUs and (eventually) TPUs! It also includes a pretty training progress bar, with the ability to add optional loggers.
  • The input dataset is its own object, allowing you to encode megabytes of data in seconds, cache and compress it on a local computer before transporting it to a remote server, merge datasets without biasing the result, and cross-train on multiple datasets to create blended output (see the sketch below).
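
As a rough sketch of that dataset-merging workflow (a minimal example only: it assumes the merge_datasets helper exported from aitextgen.TokenDataset, which is referenced in the issues below, an equalize flag, and two placeholder corpus files):

from aitextgen.TokenDataset import TokenDataset, merge_datasets

# Encode two corpora separately (the file names are placeholders)
data1 = TokenDataset("corpus_a.txt", block_size=64)
data2 = TokenDataset("corpus_b.txt", block_size=64)

# Merge them; equalize=True (if available) samples both datasets evenly
# so that neither corpus dominates the blended result.
merged = merge_datasets([data1, data2], equalize=True)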

You can read more about aitextgen in the documentation!

Demo

You can play with aitextgen for free with powerful GPUs using these Colaboratory Notebooks!

You can also play with custom Reddit and Hacker News demo models on your own PC.
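
If those demo models are hosted on the Hugging Face model repository, they can likely be loaded by name; a sketch (the model ID below is an assumption, not a confirmed path):

from aitextgen import aitextgen

# Hypothetical Hugging Face model ID for the Hacker News demo model
ai = aitextgen(model="minimaxir/hacker-news")
ai.generate(n=3, max_length=100)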

Installation

aitextgen can be installed from PyPI:

pip3 install aitextgen

Quick Examples

Here's how you can quickly test out aitextgen on your own computer, even if you don't have a GPU!

For generating text from a pretrained GPT-2 model:

from aitextgen import aitextgen

# Without any parameters, aitextgen() will download, cache, and load the 124M GPT-2 "small" model
ai = aitextgen()

ai.generate()
ai.generate(n=3, max_length=100)
ai.generate(n=3, prompt="I believe in unicorns because", max_length=100)
ai.generate_to_file(n=10, prompt="I believe in unicorns because", max_length=100, temperature=1.2)
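
Because generation is delegated to Transformers, additional sampling parameters can be passed to generate() as well; a small sketch, assuming extra keyword arguments such as top_k and top_p are forwarded to the underlying Transformers generate() call:

# Top-k / nucleus sampling, forwarded to the Transformers generator
ai.generate(n=3,
            prompt="I believe in unicorns because",
            max_length=100,
            temperature=0.9,
            top_k=40,
            top_p=0.95)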

You can also generate from the command line:

aitextgen generate
aitextgen generate --prompt "I believe in unicorns because" --to_file False

Want to train your own mini GPT-2 model on your own computer? You can follow along in this Jupyter Notebook, or download this text file of Shakespeare's plays, cd to that directory in a Terminal, open up a python3 console, and go:

from aitextgen.TokenDataset import TokenDataset
from aitextgen.tokenizers import train_tokenizer
from aitextgen.utils import GPT2ConfigCPU
from aitextgen import aitextgen

# The name of the downloaded Shakespeare text for training
file_name = "input.txt"

# Train a custom BPE Tokenizer on the downloaded text
# This will save one file: `aitextgen.tokenizer.json`, which contains the
# information needed to rebuild the tokenizer.
train_tokenizer(file_name)
tokenizer_file = "aitextgen.tokenizer.json"

# GPT2ConfigCPU is a mini variant of GPT-2 optimized for CPU-training
# e.g. the # of input tokens here is 64 vs. 1024 for base GPT-2.
config = GPT2ConfigCPU()

# Instantiate aitextgen using the created tokenizer and config
ai = aitextgen(tokenizer_file=tokenizer_file, config=config)

# You can build datasets for training by creating TokenDatasets,
# which automatically processes the dataset with the appropriate size.
data = TokenDataset(file_name, tokenizer_file=tokenizer_file, block_size=64)

# Train the model! It will save pytorch_model.bin periodically and after completion to the `trained_model` folder.
# On a 2020 8-core iMac, this took ~25 minutes to run.
ai.train(data, batch_size=8, num_steps=50000, generate_every=5000, save_every=5000)

# Generate text from it!
ai.generate(10, prompt="ROMEO:")

# With your trained model, you can reload the model at any time by
# providing the folder containing the pytorch_model.bin model weights + the config, and providing the tokenizer.
ai2 = aitextgen(model_folder="trained_model",
                tokenizer_file="aitextgen.tokenizer.json")

ai2.generate(10, prompt="ROMEO:")

Want to run aitextgen and finetune GPT-2? Use the Colab notebooks in the Demos section, or follow the documentation to get more information and learn some helpful tips!
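
For reference, a minimal local finetuning sketch (assuming a CUDA GPU and a placeholder text file of your own; the tf_gpt2 loading path also appears in the issues below):

from aitextgen import aitextgen
from aitextgen.TokenDataset import TokenDataset

# Download OpenAI's 124M GPT-2 model and load it onto the GPU
ai = aitextgen(tf_gpt2="124M", to_gpu=True)

# "my_corpus.txt" is a placeholder for your own training text
data = TokenDataset("my_corpus.txt", block_size=1024)

ai.train(data, batch_size=1, num_steps=5000,
         generate_every=1000, save_every=1000)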

Known Issues

  • TPUs cannot be used to train a model: although you can train an aitextgen model on TPUs by setting n_tpu_cores=8 in an appropriate runtime, and the training loss indeed does decrease, there are a number of miscellaneous blocking problems. [Tracking GitHub Issue]

Upcoming Features

The current release (v0.5.X) of aitextgen is considered to be a beta, targeting the most common use cases. The Notebooks and examples written so far are tested to work, but more fleshing out of the docs/use cases will be done over the next few months in addition to fixing the known issues noted above.

The next versions of aitextgen (and one of the reasons I made this package in the first place) will have native support for schema-based generation. (See this repo for a rough proof-of-concept.)

Additionally, I plan to develop an aitextgen SaaS to allow anyone to run aitextgen in the cloud and build APIs/Twitter+Slack+Discord bots with just a few clicks. (The primary constraint is compute cost; if any venture capitalists are interested in funding the development of such a service, let me know.)

I've listed more tentative features in the UPCOMING document.

Ethics

aitextgen is a tool primarily intended to help facilitate creative content. It is not a tool intended to deceive. Although parody accounts are an obvious use case for this package, make sure you are as upfront as possible with the methodology of the text you create. This includes:

  • State that the text was generated using aitextgen and/or a GPT-2 model architecture. (A link to this repo would be a bonus!)
  • If parodying a person, explicitly state that it is a parody, and reference who it is parodying.
  • Indicate whether the generated text is human-curated or unsupervised random output.
  • Indicate who is maintaining/curating the AI-generated text.
  • Make a good-faith effort to remove overfit output from the generated text that matches the input text verbatim.

It's fun to anthropomorphise the nameless "AI" as an abstract genius, but part of the reason I made aitextgen (and all my previous text-generation projects) is to make the technology more accessible and to accurately demonstrate both its promise and its limitations. Any AI text-generation projects that are deliberately deceptive may be disavowed.

Maintainer/Creator

Max Woolf (@minimaxir)

Max's open-source projects are supported by his Patreon and GitHub Sponsors. If you found this project helpful, any monetary contributions to the Patreon are appreciated and will be put to good creative use.

License

MIT

aitextgen's People

Contributors

bpcz, cdpierse, cuuupid, kaushikb11, leetfin, llimllib, minimaxir, reducibly, tientr, vectorrent


aitextgen's Issues

Sentencepiece Support

Could you add support for Google's SentencePiece tokenizer? It would help support the Vietnamese language. Thank you!

Torchscript Text Generation Support

Although we can export and load TorchScript models, the forward() function of a TorchScript model must match the signature it was traced with.

The traced TorchScript model has a forward() with 2 parameters, but the raw forward() function of GPT2LMHeadModel has 9 parameters.

A solution may be to use an extended Model class just for TorchScript, with an overridden forward() that takes 2 parameters.

TypeError on 1000th step (?) of ai.train()

I just received a TypeError: sequence item 0: expected str instance, NoneType found while training. It seems it crashed at step 1000. Can I debug this?

code

from aitextgen.TokenDataset import TokenDataset
from aitextgen.tokenizers import train_tokenizer
from aitextgen import aitextgen

file_name = "lyrics.txt"

train_tokenizer(file_name)
vocab_file = "aitextgen-vocab.json"
merges_file = "aitextgen-merges.txt"

ai = aitextgen(vocab_file=vocab_file, merges_file=merges_file, to_gpu=True)
data = TokenDataset(file_name, vocab_file=vocab_file, merges_file=merges_file, block_size=64)
ai.train(data, batch_size=16, num_steps=5000)
GPU available: True, used: True
INFO:lightning:GPU available: True, used: True
No environment variable for node rank defined. Set as 0.
WARNING:lightning:No environment variable for node rank defined. Set as 0.
CUDA_VISIBLE_DEVICES: [0]
INFO:lightning:CUDA_VISIBLE_DEVICES: [0]

1,000 steps reached: saving model to /trained_model                                     
1,000 steps reached: generating sample texts.                                           
Loss: 4.361 — Avg: 4.437 — GPU Mem: 5486 MB:  20%|██        | 1000/5000 [04:20<17:20,  3.84it/s]

-----------------------------------------------
TypeError     Traceback (most recent call last)
<ipython-input-6-360102ef47bb> in <module>
----> 1 ai.train(data, batch_size=16, num_steps=5000)

/run/media/pablo/084A2BF94A2BE264/data/venv/lib/python3.8/site-packages/aitextgen/aitextgen.py in train(self, train_data, output_dir, fp16, fp16_opt_level, n_gpu, n_tpu_cores, max_grad_norm, gradient_accumulation_steps, seed, learning_rate, weight_decay, adam_epsilon, warmup_steps, num_steps, save_every, generate_every, n_generate, loggers, batch_size, num_workers, benchmark, avg_loss_smoothing, save_gdrive, run_id, **kwargs)
    561 
    562         trainer = pl.Trainer(**train_params)
--> 563         trainer.fit(train_model)
    564 
    565         logger.info(f"Saving trained model pytorch_model.bin to /{output_dir}")

/run/media/pablo/084A2BF94A2BE264/data/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py in fit(self, model, train_dataloader, val_dataloaders)
    857 
    858         elif self.single_gpu:
--> 859             self.single_gpu_train(model)
    860 
    861         elif self.use_tpu:  # pragma: no-cover

/run/media/pablo/084A2BF94A2BE264/data/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/distrib_parts.py in single_gpu_train(self, model)
    501             self.optimizers = optimizers
    502 
--> 503         self.run_pretrain_routine(model)
    504 
    505     def tpu_train(self, tpu_core_idx, model):

/run/media/pablo/084A2BF94A2BE264/data/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py in run_pretrain_routine(self, model)
   1013 
   1014         # CORE TRAINING LOOP
-> 1015         self.train()
   1016 
   1017     def test(

/run/media/pablo/084A2BF94A2BE264/data/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py in train(self)
    345                 # RUN TNG EPOCH
    346                 # -----------------
--> 347                 self.run_training_epoch()
    348 
    349                 # update LR schedulers

/run/media/pablo/084A2BF94A2BE264/data/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py in run_training_epoch(self)
    417             # RUN TRAIN STEP
    418             # ---------------
--> 419             _outputs = self.run_training_batch(batch, batch_idx)
    420             batch_result, grad_norm_dic, batch_step_metrics, batch_output = _outputs
    421 

/run/media/pablo/084A2BF94A2BE264/data/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py in run_training_batch(self, batch, batch_idx)
    636         with self.profiler.profile('on_batch_end'):
    637             # callbacks
--> 638             self.on_batch_end()
    639             # model hooks
    640             if self.is_function_implemented('on_batch_end'):

/run/media/pablo/084A2BF94A2BE264/data/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/callback_hook.py in on_batch_end(self)
     61         """Called when the training batch ends."""
     62         for callback in self.callbacks:
---> 63             callback.on_batch_end(self, self.get_model())
     64 
     65     def on_validation_batch_start(self):

/run/media/pablo/084A2BF94A2BE264/data/venv/lib/python3.8/site-packages/aitextgen/train.py in on_batch_end(self, trainer, pl_module)
    182                 and self.steps % self.generate_every == 0
    183             ):
--> 184                 self.generate_sample_text(trainer, pl_module)
    185 
    186     def generate_sample_text(self, trainer, pl_module):

/run/media/pablo/084A2BF94A2BE264/data/venv/lib/python3.8/site-packages/aitextgen/train.py in generate_sample_text(self, trainer, pl_module)
    197             temperature=0.7,
    198         )
--> 199         gen_texts = [
    200             pl_module.tokenizer.decode(output, skip_special_tokens=True)
    201             for output in outputs

/run/media/pablo/084A2BF94A2BE264/data/venv/lib/python3.8/site-packages/aitextgen/train.py in <listcomp>(.0)
    198         )
    199         gen_texts = [
--> 200             pl_module.tokenizer.decode(output, skip_special_tokens=True)
    201             for output in outputs
    202         ]

/run/media/pablo/084A2BF94A2BE264/data/venv/lib/python3.8/site-packages/transformers/tokenization_utils.py in decode(self, token_ids, skip_special_tokens, clean_up_tokenization_spaces)
   2172                 current_sub_text.append(token)
   2173         if current_sub_text:
-> 2174             sub_texts.append(self.convert_tokens_to_string(current_sub_text))
   2175         text = " ".join(sub_texts)
   2176 

/run/media/pablo/084A2BF94A2BE264/data/venv/lib/python3.8/site-packages/transformers/tokenization_gpt2.py in convert_tokens_to_string(self, tokens)
    233     def convert_tokens_to_string(self, tokens):
    234         """ Converts a sequence of tokens (string) in a single string. """
--> 235         text = "".join(tokens)
    236         text = bytearray([self.byte_decoder[c] for c in text]).decode("utf-8", errors=self.errors)
    237         return text

TypeError: sequence item 0: expected str instance, NoneType found

Generation with Multi GPUs?

Is there anything I need to do to get it to use multiple GPUs for text generation? I remember you added it to gpt-2-simple, and I've seen that aitextgen has it for training, but I was curious about generation. I finally caved and signed up for Google Cloud, and I was curious whether I would see much of a speed increase running a Jupyter notebook with 2 GPUs. Maybe just a T4 with fp16 is faster than T4x2? Thanks in advance!

Train on large textfile

Hi, I'm trying to train a model from scratch because I want it to generate text in another language (Swedish).
My training data is a large collection of about 22,000 novels, all in one single .txt file, with each novel delimited by a line containing only <s>.
The .txt file is about 300 MB in size.
However, whether I try to train it from scratch using the Colab notebook (with a P100 GPU) or locally on my desktop, it runs out of memory and crashes.
My desktop has 32 GB RAM and a GeForce 2080 Ti with 11 GB VRAM.

Is there any way to make aitextgen work with 300 MB of training data?
Are there any parameters I can tweak to have it use less memory?
Should I arrange the training data in another way?
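
Not a confirmed fix, but one workaround to sketch is encoding and caching the dataset once (the save_cache / from_cache flags appear in TokenDataset's signature elsewhere in these issues), so the expensive tokenization pass isn't repeated on every attempt:

from aitextgen.TokenDataset import TokenDataset

# One-time encode; save_cache writes a compressed cache file to disk
data = TokenDataset("novels.txt", save_cache=True, block_size=1024)

# Later runs can reload the cache instead of re-tokenizing 300 MB of text
# (the cache file name below is an assumption; use whatever file was written)
data = TokenDataset("dataset_cache.tar.gz", from_cache=True, block_size=1024)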

TypeError: optimizer_step() got an unexpected keyword argument 'using_native_amp'

Hi,
I am trying to finetune a GPT-2 model ("124M") on local hardware. I keep getting this error:

ValueError: Keyword arguments {'return_attention_masks': False} not recognized.
0%| | 0/683971 [00:00<?, ?it/s]

I read that it could be due to a broken transformers 3.0 release; I upgraded to 3.0.1 and still got the error. However, downgrading to 2.11.0 (or 2.9.1) did solve it.

Now the tokenization finished normally, but when the training (fine-tuning) started, I got this error:
TypeError: optimizer_step() got an unexpected keyword argument 'using_native_amp'
0%| | 0/5000 [00:46<?, ?it/s]

Any suggestion?

I am also wondering if we can directly load a tokenized (.npz) text file that was previously prepared with the original gpt-2-simple repository, i.e. before aitextgen was created?

Thanks

Best practices for dataset preparation to avoid MemoryError: Unable to allocate

I'm trying to train a Reddit model, and the problem I've run into is specific to custom model training.

I have exported ~500 MB of Reddit comments from a custom list of non-English subs to a .txt file, and separated every post with:
<|endoftext|>

Every section could contain thousands of comments.

The problem is that when I run tokenization, I get an error after the "Compute merges" stage:
MemoryError: Unable to allocate 57.3 GiB for an array with shape (3126653, 9842) and data type uint16.

I have tried to use a .csv instead with line_by_line=True, where every comment is on a new line. After 50,000 steps on this csv dataset, the model was much worse "contextually" than the one trained for 15,000 steps on the txt file without line_by_line=True.

So, the question is: what is the best way to work with txt files, and how many characters should be inside one <|endoftext|> section?

My config:
config = build_gpt2_config(vocab_size=10000, max_length=512, dropout=0.0, n_embd=512, n_layer=8, n_head=8)
data = TokenDataset(file_name, save_cache=True, vocab_file=vocab_file, merges_file=merges_file, block_size=512)

If anyone has any ideas how to fix this, I would appreciate it ✨

User-defined sos_token and eos_token

I tried to initialize a model with the following line, and it produces errors on CUDA:
RuntimeError: cublas runtime error : resource allocation failed at /pytorch/aten/src/THC/THCGeneral.cpp:216

Code:
ai = aitextgen(tf_gpt2="124M", bos_token="<|startoftext|>", eos_token="<|endoftext|>", to_gpu=True)

I guess <|startoftext|> wasn't part of the vocabulary?

Thanks,

size of the generated model

My input data file is 20 GB; I have broken it into 100 MB files and train them with this code:

for file in os.listdir(directory):
    filename = os.fsdecode(file)
    train_tokenizer(filename)
    vocab_file = "aitextgen-vocab.json"
    merges_file = "aitextgen-merges.txt"
    ai = aitextgen(model="trained_model/pytorch_model.bin", vocab_file=vocab_file,
                   merges_file=merges_file, config=config)
    ai.train(filename, line_by_line=False, from_cache=False,
             num_steps=40000, generate_every=20000, save_every=20000,
             save_gdrive=True, learning_rate=1e-4, batch_size=16)
    copy_file_to_gdrive("trained_model/pytorch_model.bin")
    copy_file_to_gdrive("trained_model/config.json")
    os.remove(filename)

For each iteration I save the trained model and then load it again for the next iteration of training.

However, I notice that the generated model is stuck at 6 MB after processing a gigabyte of data. Am I doing something terribly stupid, or is there an issue with the generated model?

Stepwise Generation

A stepwise implementation of generation (instead of using generate() wholesale) is necessary for:

  • Generating text beyond the max_length of the model (infinite generation via sliding windows).
  • Returning partial generated text to generate in "real time".

Should be similar to the Huggingface implementation using the past parameter.
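
For context, a rough sketch of that kind of stepwise decoding written directly against Hugging Face Transformers (the cache argument is named past_key_values in recent versions and past in older ones; this is an illustration, not aitextgen's implementation):

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer.encode("I believe in unicorns because", return_tensors="pt")
generated = input_ids
past = None

with torch.no_grad():
    for _ in range(50):
        outputs = model(input_ids, past_key_values=past, use_cache=True)
        logits, past = outputs[0], outputs[1]
        # Sample the next token from the final position only
        probs = torch.softmax(logits[:, -1, :] / 0.7, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        generated = torch.cat([generated, next_token], dim=-1)
        # Feed only the new token; the cached past carries the earlier context
        input_ids = next_token

print(tokenizer.decode(generated[0]))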

torch>=1.3

Hi! Sorry for such a noob question, but when I try to pip3 install aitextgen I get this error:
ERROR: Could not find a version that satisfies the requirement torch>=1.3 (from pytorch-lightning>=0.7.6->aitextgen) (from versions: 0.1.2, 0.1.2.post1, 0.1.2.post2)
ERROR: No matching distribution found for torch>=1.3 (from pytorch-lightning>=0.7.6->aitextgen)

I tried to Google it, but found nothing similar.

Bypassing 1024 max length in sample size

Is there a way to do it? If so, what is it?

In your notebook example for creating GPT-2 from scratch, there's this line:
config = build_gpt2_config(vocab_size=5000, max_length=32, dropout=0.0, n_embd=256, n_layer=8, n_head=8)

Can I change max_length to 4096 for example?

Size mismatch error on converted TF weights during generation

I'm trying to use a converted gpt-2-simple model. I used the command in the docs to convert the TF weights to PyTorch, but I'm getting a whole lot of errors when I try to generate from it. The code I used and the error message are as follows:

from aitextgen import aitextgen

ai = aitextgen(model="pytorch/pytorch_model.bin")
ai.generate()

2020-06-24 21:48:04.187031: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'cudart64_101.dll'; dlerror: cudart64_101.dll not found
2020-06-24 21:48:04.189511: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
INFO:aitextgen:Loading GPT-2 model from provided pytorch/pytorch_model.bin.
Traceback (most recent call last):
File "testgen.py", line 4, in
ai = aitextgen(model="pytorch/pytorch_model.bin")
File "D:\Anaconda3\envs\gpt2-pytorch\lib\site-packages\aitextgen\aitextgen.py", line 176, in init
self.model = GPT2LMHeadModel.from_pretrained(model, config=config)
File "D:\Anaconda3\envs\gpt2-pytorch\lib\site-packages\transformers\modeling_utils.py", line 751, in from_pretrained
raise RuntimeError(
RuntimeError: Error(s) in loading state_dict for GPT2LMHeadModel:
size mismatch for wte.weight: copying a param with shape torch.Size([50257, 1024]) from checkpoint, the shape in current model is torch.Size([50257, 768]).
size mismatch for wpe.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([1024, 768]).
size mismatch for h.0.ln_1.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
size mismatch for h.0.ln_1.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
size mismatch for h.0.attn.c_attn.weight: copying a param with shape torch.Size([1024, 3072]) from checkpoint, the shape in current model is torch.Size([768, 2304]).
size mismatch for h.0.attn.c_attn.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([2304]).
size mismatch for h.0.attn.c_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([768, 768]).
size mismatch for h.0.attn.c_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
size mismatch for h.0.ln_2.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
size mismatch for h.0.ln_2.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
size mismatch for h.0.mlp.c_fc.weight: copying a param with shape torch.Size([1024, 4096]) from checkpoint, the shape in current model is torch.Size([768, 3072]).
size mismatch for h.0.mlp.c_fc.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([3072]).
size mismatch for h.0.mlp.c_proj.weight: copying a param with shape torch.Size([4096, 1024]) from checkpoint, the shape in current model is torch.Size([3072, 768]).
size mismatch for h.0.mlp.c_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
[... identical size mismatch errors repeat for transformer blocks h.1 through h.11 ...]
size mismatch for ln_f.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
size mismatch for ln_f.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).

I'm running Python 3.8.3, PyTorch 1.5.1, and TF 2.2.0 on Windows 10 Anaconda.
Also tested on Python 3.7.6, PyTorch 1.3.1, and TF 1.14.0.

I can also generate properly with the given 124M model, just not with my own 355M model.

Default config for foreign language texts?

I'd like one, especially considering that it'd be useful not only for foreign languages, but also for things like generating Reddit posts. I'm playing around with the model right now, but cannot really find good proportions for training on a 60 MB Russian dataset. It would be cool to just have another model option for that instead of picking the parameters yourself, considering how long it takes to start training (about 10 minutes for me on Colab just to start it).

Training not initiating. FileNotFound error.

Win10, GTX 980 TI

It seems to fail on ai.train(data, batch_size=16, num_steps=5000)

GPU available: True, used: True
INFO:lightning:GPU available: True, used: True
No environment variable for node rank defined. Set as 0.
WARNING:lightning:No environment variable for node rank defined. Set as 0.
CUDA_VISIBLE_DEVICES: [0]
INFO:lightning:CUDA_VISIBLE_DEVICES: [0]

  0%|          | 0/5000 [00:00<?, ?it/s]
2020-05-19 18:35:19.768502: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'cudart64_100.dll'; dlerror: cudart64_100.dll not found
2020-05-19 18:35:19.773454: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
[... the same cudart64_100.dll warning/info pair repeats 15 more times ...]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Program Files\Python37\lib\site-packages\aitextgen\aitextgen.py", line 563, in train
    trainer.fit(train_model)
  File "C:\Program Files\Python37\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 859, in fit
    self.single_gpu_train(model)
  File "C:\Program Files\Python37\lib\site-packages\pytorch_lightning\trainer\distrib_parts.py", line 503, in single_gpu_train
    self.run_pretrain_routine(model)
  File "C:\Program Files\Python37\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1015, in run_pretrain_routine
    self.train()
  File "C:\Program Files\Python37\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 347, in train
    self.run_training_epoch()
  File "C:\Program Files\Python37\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 419, in run_training_epoch
    _outputs = self.run_training_batch(batch, batch_idx)
  File "C:\Program Files\Python37\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 638, in run_training_batch
    self.on_batch_end()
  File "C:\Program Files\Python37\lib\site-packages\pytorch_lightning\trainer\callback_hook.py", line 63, in on_batch_end
    callback.on_batch_end(self, self.get_model())
  File "C:\Program Files\Python37\lib\site-packages\aitextgen\train.py", line 168, in on_batch_end
    desc += f" — GPU Mem: {get_gpu_memory_map()['gpu_0']} MB"
  File "C:\Program Files\Python37\lib\site-packages\pytorch_lightning\core\memory.py", line 279, in get_gpu_memory_map
    check=True)
  File "C:\Program Files\Python37\lib\subprocess.py", line 488, in run
    with Popen(*popenargs, **kwargs) as process:
  File "C:\Program Files\Python37\lib\subprocess.py", line 800, in __init__
    restore_signals, start_new_session)
  File "C:\Program Files\Python37\lib\subprocess.py", line 1207, in _execute_child
    startupinfo)
FileNotFoundError: [WinError 2] The system cannot find the file specified

Support for Parallel CPU execution (distributed parallelization)

Hi,
Just wondering if aitextgen finetuning supports multi-threaded/multi-CPU (parallel/distributed) training. I understand that execution is much faster on GPUs, but in case a GPU is not available, can we get comparable performance using many CPUs?
I tried running the code twice, doubling the number of CPUs, but saw no increase in performance.

Thanks

Support for DistilGPT2 model

I know that most people are looking for the bigger models, but is it possible to train DistilGPT2 to get lower latency when it comes to inference?
Lower latency would come in handy if you want to build a real-time application.
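
Not confirmed by the maintainer, but since the README notes compatibility with custom GPT-2 models from the Hugging Face model repository, loading DistilGPT2 might look something like this (assuming the model argument accepts a Hugging Face model name):

from aitextgen import aitextgen

# "distilgpt2" is the Hugging Face hub name for DistilGPT2
ai = aitextgen(model="distilgpt2")
ai.generate(n=3, max_length=100)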

merge_datasets() doesn't work post-numpy migration

From #6:

/usr/local/lib/python3.6/dist-packages/aitextgen/TokenDataset.py in init(self, file_path, vocab_file, merges_file, texts, line_by_line, from_cache, header, save_cache, cache_destination, compress, block_size, tokenized_texts, text_delim, bos_token, eos_token, unk_token, pad_token, progress_bar_refresh_rate, **kwargs)
75 if tokenized_texts:
76 self.tokens = tokenized_texts
---> 77 self.num_subsets = self.tokens.shape[0] - block_size
78 self.block_size = block_size
79 self.file_path = "merged TokenDataset"

AttributeError: 'list' object has no attribute 'shape'

Not able to finetune the 124M GPT-2 model on Colab

When running ai.train() I get the error below.

06/30/2020 06:51:36 — INFO — aitextgen.TokenDataset — Encoding 269 sets of tokens from article_mod.txt.

ValueError Traceback (most recent call last)
in ()
7 save_gdrive=False,
8 learning_rate=1e-4,
----> 9 batch_size=1,)

5 frames
/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils_fast.py in _batch_encode_plus(self, batch_text_or_text_pairs, add_special_tokens, padding_strategy, truncation_strategy, max_length, stride, is_pretokenized, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
311
312 if kwargs:
--> 313 raise ValueError(f"Keyword arguments {kwargs} not recognized.")
314
315 # Set the truncation and padding strategy and restore the initial configuration

ValueError: Keyword arguments {'return_attention_masks': False} not recognized.
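The keyword appears to have been renamed around Transformers 3.0 from the plural return_attention_masks to the singular return_attention_mask, so code written against the old name trips this check; pinning an older transformers release is the other obvious workaround. A small verification sketch, assuming a stock GPT-2 tokenizer:

from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# On transformers >= 3.0 the plural keyword raises the ValueError from the traceback;
# the singular form is accepted:
enc = tokenizer(["some example text"], return_attention_mask=False)
print(enc.keys())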

Error when running 'Train a Custom GPT-2 Model + Tokenizer' on Colab

I get the following error when running train_tokenizer(file_name)

TypeError Traceback (most recent call last)
in ()
----> 1 train_tokenizer(file_name)

1 frames

/usr/local/lib/python3.6/dist-packages/aitextgen/tokenizers.py in train_tokenizer(files, dropout, vocab_size, min_frequency, save_path, added_tokens, bos_token, eos_token, unk_token)
58 + "You will need both files to build the GPT2Tokenizer."
59 )
---> 60 tokenizer.save(save_path, PREFIX)

/usr/local/lib/python3.6/dist-packages/tokenizers/implementations/base_tokenizer.py in save(self, path, pretty)
330 A path to the destination Tokenizer file
331 """
--> 332 return self._tokenizer.save(path, pretty)
333
334 def to_str(self, pretty: bool = False):

TypeError:
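This looks like the tokenizers-library API change in which save(path, pretty) now writes a single serialized JSON file and no longer accepts a directory plus filename prefix, so passing PREFIX as the second argument fails. A hedged sketch of the two call styles under the newer API (trained on a throwaway file; names are illustrative):

from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=["input.txt"], vocab_size=1000, min_frequency=2)

tokenizer.save("aitextgen.tokenizer.json")   # newer API: one serialized tokenizer file
tokenizer.save_model(".", "aitextgen")       # writes aitextgen-vocab.json and aitextgen-merges.txt

Pinning the tokenizers package to the version the notebook was written against is the other workaround.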

Slow generation on GPU with 100% CPU used

Hi!

I've trained my own model (about 29M parameters) and am trying to run generation with it on AWS EC2 instances.
I've tried running the generator on both g4dn.large and p3.2xlarge instances (they have a Tesla T4 and a V100, respectively). Here is the code:

ai = aitextgen(
    model=model_dir+'pytorch_model.bin', 
    vocab_file=model_dir+'aitextgen-vocab.json', 
    merges_file=model_dir+'aitextgen-merges.txt', 
    config=model_dir+'config.json',
    to_gpu=True
)

ai.generate(n=5, return_as_list=True, max_length=random.randint(80, 140), temperature=0.7, repetition_penalty=1.2)

Unfortunately, it takes about 3-5 seconds to generate on both the Tesla T4 and the Tesla V100, and one of the CPU cores goes up to 100% during that time. The GPU has about 1.5 GB of GPU RAM in use, with at most 25% GPU utilization on the T4 and about 5% on the V100.

So it looks like the CPU is the bottleneck.

  1. Is it normal for the CPU to be so heavily utilized during the generation process?
  2. Is it possible to generate using only the CPU?
  3. Is it possible to run generation in parallel on multiple cores?

Thank you!

Transformers 3.0.0 changes

Umbrella issue for the changes needed to support compatibility.

The base required version will need to be bumped to 3.0.0 as well.

ONNX Export

ONNX has good Transformers export support and is a point of focus for the ONNX team. ONNX export may be more useful/flexible than TorchScript (I have a specific secret use case in mind...)
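As a rough feasibility check, a plain torch.onnx.export of the GPT-2 forward pass is one baseline to compare against TorchScript; the generation-time pieces (cached past key/values, the sampling loop) are the part that needs real design work and are omitted from this sketch:

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

dummy = tokenizer.encode("hello world", return_tensors="pt")

# Exports only the forward pass (logits); no past-key-value caching, no generate() loop.
torch.onnx.export(
    model,
    dummy,
    "gpt2.onnx",
    opset_version=11,
    input_names=["input_ids"],
    output_names=["logits"],
    dynamic_axes={"input_ids": {0: "batch", 1: "sequence"}},
)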

Generate to array of strings

It seems the only options are generating to stdout or to a file. I want the result as an array of strings, so I do this as a workaround:

ai = aitextgen()
ai.generate_to_file(n=10)

with open('generated_file.txt', 'r') as file:
    data = file.read().replace('\n', '')

data_arr = data.split("====================")[0:-1]
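For what it's worth, the slow-generation report earlier in this list calls generate() with return_as_list=True, which suggests the file round-trip isn't necessary; a minimal sketch assuming that parameter does what its name implies:

from aitextgen import aitextgen

ai = aitextgen()
texts = ai.generate(n=10, return_as_list=True)   # returns the generated texts as a list of strings
print(len(texts))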

Incremental training?

This is an awesome project! However, being able to train a model from scratch incrementally would be great, because Google Colab crashes when training from scratch on any dataset larger than about 10 MB.

Bolded Prompt text limits available character sequences

A user is unable to generate text with a prompt containing a pipe (|) or other characters that are special in a regular expression. Doing so results in the following exception:
Example:

ai.generate(1, prompt_text="This is | a test |")
INFO:aitextgen:Loading GPT-2 model from provided trained_model/pytorch_model.bin.
INFO:aitextgen:Using a custom tokenizer.
Traceback (most recent call last):
  File "./chatbot-infer.py", line 21, in <module>
    ai.generate(1, prompt="0|agent1|what are you doing?")
  File ".../lib/python3.7/site-packages/aitextgen/aitextgen.py", line 301, in generate
    for text in gen_texts
  File ".../lib/python3.7/site-packages/aitextgen/aitextgen.py", line 301, in <listcomp>
    for text in gen_texts
  File "/usr/lib/python3.7/re.py", line 192, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "/usr/lib/python3.7/re.py", line 286, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/lib/python3.7/sre_compile.py", line 764, in compile
    p = sre_parse.parse(p, flags)
  File "/usr/lib/python3.7/sre_parse.py", line 924, in parse
    p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
  File "/usr/lib/python3.7/sre_parse.py", line 420, in _parse_sub
    not nested and not items))
  File "/usr/lib/python3.7/sre_parse.py", line 645, in _parse
    source.tell() - here + len(this))
re.error: nothing to repeat at position 3

This appears to be due to: https://github.com/minimaxir/aitextgen/blob/master/aitextgen/aitextgen.py#L299 where the prompt_text is being used in a regex.

Would it be better to do a plain string substitution, or just eliminate the bolding feature?

For context, I have a GPT-2 model and a custom dataset that uses pipes (|) in sequences to denote agents. I'd be happy to remove the bolding feature, especially given that it is a display/presentation concern and not necessarily something the library should be doing (IMO).
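Since the prompt only enters a regex in order to re-highlight it in the output, escaping it before compiling (or dropping the bolding, as suggested above) avoids the crash; a small sketch of the escaping approach:

import re

prompt = "This is | a test |"

# Unescaped, '|' and '?' are regex metacharacters and can raise re.error;
# re.escape() makes every character literal.
pattern = re.compile(re.escape(prompt))
print(pattern.search("This is | a test | of escaping") is not None)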

TokenDataset loading from cache

Hi! I have a very simple question: I made an encoded dataset and uploaded it to the "Training from scratch" Colab notebook. How can I train the tokenizer on it? If I just run the "Training the Tokenizer" step with this file, I get the error "AssertionError: files must be a string or a list."
In the finetuning Colab, however, it works with the file and loads it from the cache.
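The tokenizer-training step expects raw text files (hence "files must be a string or a list"), while an already-encoded dataset is reloaded through TokenDataset itself; the tokenizer has to be trained on, or reused from, the original text. A minimal sketch, assuming the cache file was produced with save_cache and using a compressed cache filename as an example:

from aitextgen.TokenDataset import TokenDataset

data = TokenDataset("dataset_cache.tar.gz", from_cache=True)   # reload the encoded dataset directly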

Training Not Initiating (Windows 10)

OS: Windows 10
GPU: GTX 1060

Everything appears to run fine up until ".train" is hit, then everything comes to a halt.

`[00:00:00] Reading files █████████████████████████████████████████████████████████████████████████ 100
[00:00:01] Tokenize words █████████████████████████████████████████████████████████████████████████ 15057 / 15057
[00:00:00] Count pairs █████████████████████████████████████████████████████████████████████████ 15057 / 15057
[00:00:00] Compute merges █████████████████████████████████████████████████████████████████████████ 4743 / 4743

INFO:aitextgen.tokenizers:Saving aitextgen-vocab.json and aitextgen-merges.txt to the current directory. You will need both files to build the GPT2Tokenizer.
INFO:aitextgen:Constructing GPT-2 model from provided config.
INFO:aitextgen:Using a custom tokenizer.
GPU available: True, used: True
INFO:lightning:GPU available: True, used: True
No environment variable for node rank defined. Set as 0.
WARNING:lightning:No environment variable for node rank defined. Set as 0.
CUDA_VISIBLE_DEVICES: [0]
INFO:lightning:CUDA_VISIBLE_DEVICES: [0]
0%| | 0/5000 [00:00<?, ?it/s]

(Two tracebacks were printed interleaved, one from the spawned DataLoader worker process and one from the parent script; they are separated below.)

Worker process:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users_\Anaconda3\envs\aitext\lib\multiprocessing\spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "C:\Users_\Anaconda3\envs\aitext\lib\multiprocessing\spawn.py", line 114, in _main
    prepare(preparation_data)
  File "C:\Users_\Anaconda3\envs\aitext\lib\multiprocessing\spawn.py", line 225, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "C:\Users_\Anaconda3\envs\aitext\lib\multiprocessing\spawn.py", line 277, in _fixup_main_from_path
    run_name="__mp_main__")
  File "C:\Users_\Anaconda3\envs\aitext\lib\runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "C:\Users_\Anaconda3\envs\aitext\lib\runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "C:\Users_\Anaconda3\envs\aitext\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "Z:\0__0\0_seo\aitextgen\shootme.py", line 16, in <module>
    ai.train(data, batch_size=16, num_steps=5000)
  File "Z:\0__0\0_seo\aitextgen\aitextgen\aitextgen.py", line 563, in train
    trainer.fit(train_model)
  File "C:\Users_\Anaconda3\envs\aitext\lib\site-packages\pytorch_lightning-0.7.6-py3.7.egg\pytorch_lightning\trainer\trainer.py", line 859, in fit
    self.single_gpu_train(model)
  File "C:\Users_\Anaconda3\envs\aitext\lib\site-packages\pytorch_lightning-0.7.6-py3.7.egg\pytorch_lightning\trainer\distrib_parts.py", line 503, in single_gpu_train
    self.run_pretrain_routine(model)
  File "C:\Users_\Anaconda3\envs\aitext\lib\site-packages\pytorch_lightning-0.7.6-py3.7.egg\pytorch_lightning\trainer\trainer.py", line 1015, in run_pretrain_routine
    self.train()
  File "C:\Users_\Anaconda3\envs\aitext\lib\site-packages\pytorch_lightning-0.7.6-py3.7.egg\pytorch_lightning\trainer\training_loop.py", line 347, in train
    self.run_training_epoch()
  File "C:\Users_\Anaconda3\envs\aitext\lib\site-packages\pytorch_lightning-0.7.6-py3.7.egg\pytorch_lightning\trainer\training_loop.py", line 406, in run_training_epoch
    enumerate(_with_is_last(train_dataloader)), "get_train_batch"
  File "C:\Users_\Anaconda3\envs\aitext\lib\site-packages\pytorch_lightning-0.7.6-py3.7.egg\pytorch_lightning\profiler\profilers.py", line 64, in profile_iterable
    value = next(iterator)
  File "C:\Users_\Anaconda3\envs\aitext\lib\site-packages\pytorch_lightning-0.7.6-py3.7.egg\pytorch_lightning\trainer\training_loop.py", line 800, in _with_is_last
    it = iter(iterable)
  File "C:\Users_\Anaconda3\envs\aitext\lib\site-packages\torch\utils\data\dataloader.py", line 279, in __iter__
    return _MultiProcessingDataLoaderIter(self)
  File "C:\Users_\Anaconda3\envs\aitext\lib\site-packages\torch\utils\data\dataloader.py", line 719, in __init__
    w.start()
  File "C:\Users_\Anaconda3\envs\aitext\lib\multiprocessing\process.py", line 112, in start
    self._popen = self._Popen(self)
  File "C:\Users_\Anaconda3\envs\aitext\lib\multiprocessing\context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Users_\Anaconda3\envs\aitext\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "C:\Users_\Anaconda3\envs\aitext\lib\multiprocessing\popen_spawn_win32.py", line 46, in __init__
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "C:\Users_\Anaconda3\envs\aitext\lib\multiprocessing\spawn.py", line 143, in get_preparation_data
    _check_not_importing_main()
  File "C:\Users_\Anaconda3\envs\aitext\lib\multiprocessing\spawn.py", line 136, in _check_not_importing_main
    is not going to be frozen to produce an executable.''')
RuntimeError:
    An attempt has been made to start a new process before the
    current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.

Parent process:

Traceback (most recent call last):
  File "shootme.py", line 16, in <module>
    ai.train(data, batch_size=16, num_steps=5000)
  File "Z:\0__0\0_seo\aitextgen\aitextgen\aitextgen.py", line 563, in train
    trainer.fit(train_model)
  [... same pytorch-lightning and DataLoader frames as in the worker traceback, down to w.start() ...]
  File "C:\Users_\Anaconda3\envs\aitext\lib\multiprocessing\popen_spawn_win32.py", line 89, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Users_\Anaconda3\envs\aitext\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
BrokenPipeError: [Errno 32] Broken pipe

0%| | 0/5000 [00:06<?, ?it/s]
0%| | 0/5000 [00:00<?, ?it/s]`
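The RuntimeError above spells out the fix itself: with the Windows spawn start method, the DataLoader worker re-imports the main module, so the training call must sit behind the __main__ guard instead of running at import time. A sketch of the same script restructured (the dataset and model construction are stand-ins for whatever shootme.py actually does):

from aitextgen import aitextgen
from aitextgen.TokenDataset import TokenDataset

def main():
    data = TokenDataset("input.txt")      # stand-in for the script's dataset construction
    ai = aitextgen()                      # stand-in for the custom model built in the log above
    ai.train(data, batch_size=16, num_steps=5000)

if __name__ == "__main__":
    # Required on Windows: spawned worker processes re-import this module and
    # must not re-execute the training call during that import.
    main()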

TPU Support

Although you can train an aitextgen model on TPUs by setting n_tpu_cores=8 in an appropriate runtime, and the training loss does indeed decrease, there are a number of miscellaneous blocking problems:

  • The model stored in aitextgen does not update, even after training.
  • Saving the model via save_pretrained() causes a hang, even with xm.rendezvous().
  • Memory leaks on the host system (especially with large batch sizes).
  • fp16 doesn't work at all, and there's no training loss decrease.

Will gladly take any suggestions/PRs to help resolve these!

Generating 'complete' text

Using a small (10) or large (100) max_length, I tend to get chopped-off sentences.
For example:
One in five Americans are overweight, according to a new

I don't know whether the cause of this is aitextgen or GPT-2 itself, but I thought it was worth asking if there's a way to get complete sentences. My workaround, shown below, is to chop off anything after the last period/!/? (this works better with longer text).
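The workaround described above (trimming to the last sentence-ending punctuation mark) is easy to express directly; a short sketch:

import re

text = "One in five Americans are overweight, according to a new study. The study also fou"

# Keep everything up to and including the last ., ! or ?; if none exists, keep the text as-is.
match = re.search(r"^.*[.!?]", text, flags=re.S)
trimmed = match.group(0) if match else text
print(trimmed)   # 'One in five Americans are overweight, according to a new study.'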

GPT-3 support (?)

https://github.com/openai/gpt-3

It's still very unclear how OpenAI is treating this, but if they do release the small model, then I'll definitely add support.

Implementation depends on:

  • base Huggingface integration
  • new training loop (might be the same as the old training loop)
  • Need to reduce GPT2 hardcoding where available
  • Casting for FP16 to FP32 (or more native FP16 support)

Building TokenDataset consumes excessive amounts of RAM


Following the documented method to build and cache a line-by-line TokenDataset with a single text file as input, the process crashed with an OOM error. I'm now attempting to reduce the file size by half.

Data specs: 17.1 GB -> 15,494,632 lines
Tokenizer: 350k vocab size
Merges: 4.9 MB
Vocab: 6.4 MB
Max length: 512

VM specs: AWS g4dn.16xlarge [64 vCPU / 256 GB RAM / 9 GB swap]

Possible fixes: write to file as an optional checkpointed batching step, or delete entries from text_list as they are tokenized to prevent Python from holding onto the memory?
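One way to act on that suggestion without touching aitextgen internals is to tokenize the file in bounded chunks and release each chunk of raw text as soon as it has been encoded, instead of holding all 15 million lines in memory at once. A rough standalone sketch (the file name, chunk size, and stock GPT-2 tokenizer are placeholders, and BOS/EOS handling is ignored):

from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")   # stand-in for the custom 350k-vocab tokenizer
token_ids = []
chunk = []

with open("big_corpus.txt", encoding="utf-8") as f:
    for line in f:
        chunk.append(line)
        if len(chunk) == 10_000:                 # encode in bounded batches
            for ids in tokenizer(chunk)["input_ids"]:
                token_ids.extend(ids)
            chunk.clear()                        # free the raw text immediately
    if chunk:                                    # final partial batch
        for ids in tokenizer(chunk)["input_ids"]:
            token_ids.extend(ids)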

Loading a tf_gpt2 model from local folder

Hi,
It may be a trivial question, but I am wondering how we can load a tf_gpt2 model from a local folder.
I tried:
ai = aitextgen(tf_gpt2="124M")

But it ignores the 124M folder I created and looks for the model on Google's servers. The problem is that, for security reasons, the cluster cannot connect to external servers, so I have to download the model locally and copy the folder to a cluster directory in order to load and fine-tune it.

Any suggestions?

Thanks,
Marawan
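The tf_gpt2 argument always tries to fetch and convert the Google-hosted checkpoint, but the constructor pattern used in the slow-generation report earlier in this list (explicit local files) avoids network access entirely, assuming the TensorFlow checkpoint has already been converted to a pytorch_model.bin plus config.json. A sketch with placeholder paths; note that the tokenizer files would also need to be available locally on an air-gapped cluster:

from aitextgen import aitextgen

model_dir = "/cluster/models/gpt2-124M/"   # placeholder path for the copied folder

ai = aitextgen(
    model=model_dir + "pytorch_model.bin",
    config=model_dir + "config.json",
)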

Is there a way to continue training on an existing model?

Like, after 2,000 steps I've got a trained model, and I want to continue training it. In gpt-2-simple, this was automatic and only asked you for a run name. Is it even implemented yet? I couldn't find anything about it in either the docs or the source code.
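Nothing in the constructor prevents reloading a previously trained checkpoint and calling train() again, which is how the local-model examples elsewhere in this list are loaded; whether that counts as proper "resume" support is what this issue is asking. A sketch using the trained_model/ paths that appear in the logs above; note the optimizer state and learning-rate schedule start fresh, so it is not identical to gpt-2-simple's resume behavior:

from aitextgen import aitextgen
from aitextgen.TokenDataset import TokenDataset

ai = aitextgen(
    model="trained_model/pytorch_model.bin",
    config="trained_model/config.json",
)
data = TokenDataset("input.txt")
ai.train(data, num_steps=2000)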

Generation Filter Callback

Generation quality still has some weirdness. Allow users to provide a function that filters generated texts even further (and, for functions like generate_one, keep generating until there is a valid output).
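Until such a callback exists, the same effect can be approximated on the caller's side by looping over generate(return_as_list=True) and applying a predicate; a sketch with a purely hypothetical filter:

from aitextgen import aitextgen

def is_valid(text: str) -> bool:
    # hypothetical quality filter: minimum length plus a terminated final sentence
    return len(text) > 40 and text.rstrip().endswith((".", "!", "?"))

ai = aitextgen()
keep = []
while len(keep) < 5:
    batch = ai.generate(n=10, return_as_list=True, max_length=100)
    keep.extend(t for t in batch if is_valid(t))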

Custom model displays nan loss with to_fp16 parameter

If you build a custom model (for training from scratch) with the fp16 option set to True, the loss will be nan during training, and the first text-generation attempt will throw a CUDA error.
The error states: cuda runtime error (710) : device-side assert triggered at /pytorch/aten/src/THC/THCReduceAll.cuh:327

Different tokenization methods

Which tokenization methods does aitextgen support (BPE, WordPiece, etc.)?
If it's only one, how can we expand it to use others?

Thanks!
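aitextgen's bundled train_tokenizer trains a byte-level BPE tokenizer; the underlying tokenizers library also ships other algorithms, so one way to experiment is to train, say, a WordPiece tokenizer directly. Whether the resulting vocabulary plugs cleanly into aitextgen's GPT-2 pipeline is exactly the open question here; the sketch below only covers the tokenizer-training side:

from tokenizers import BertWordPieceTokenizer

# WordPiece, the BERT-style algorithm, trained on the same raw text file
tokenizer = BertWordPieceTokenizer(lowercase=False)
tokenizer.train(files=["input.txt"], vocab_size=10000, min_frequency=2)
tokenizer.save_model(".", "wordpiece")   # writes wordpiece-vocab.txt to the current directory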
