
[Image: GPT-2 generated lyrics]

Juli Zeh Lyrics Generator

Natural Language Generation Neural Net

pretrained GPT-2 for text generation in German


Fine-tuned and presented by Hannah Bohle, November 2020

Warning: The following GPT-2 model was pretrained on an unknown German corpus by an anonymous user at huggingface. We therefore cannot rule out that the corpus introduces biases or enables unintended negative uses. In addition, the German novel used for fine-tuning contains explicit language. When using the model, please be aware that all content used for pretraining and fine-tuning will affect the generated text.

Installation:

For installation instructions, see huggingface.

Most models in transformers are available for TensorFlow, PyTorch, or both. The German GPT-2 used here is (as of November 2020) only available for PyTorch.
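
A minimal setup inside a virtual environment might look like this (a sketch; package versions are left unpinned and are an assumption):

    python -m venv venv
    source venv/bin/activate

    # the German GPT-2 requires the PyTorch backend
    pip install torch transformers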

Content:

  1. Installation inside a venv: (1) TensorFlow and/or (2) PyTorch, and (3) transformers
  2. Install dependencies
  3. Load the dataset
  4. Prepare the dataset and build a TextDataset
  5. Load the pre-trained German GPT-2 model and tokenizer
  6. Initialize the Trainer with TrainingArguments
  7. Train and save the model (a sketch of steps 4-7 appears in the Model section below)
  8. Test the model

Dependencies

Data

The model was pre-trained on German text and fine-tuned on the novel "Spieltrieb" by Juli Zeh. The data contains only parts of the novel, with chapters in randomized order, to prevent copyright violations.

The txt file contains 79,886 words and can be found in the 'data' folder.
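
The word count can be verified directly (a sketch; the file name inside the 'data' folder is hypothetical):

    # count whitespace-separated words in the fine-tuning file
    with open("data/spieltrieb.txt", encoding="utf-8") as f:
        n_words = len(f.read().split())
    print(n_words)  # 79886 according to this README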

Preprocessing

For text preprocessing it is important to use the same tokenizer that was used for pretraining the model. In this case we use the "german-gpt2" tokenizer provided by an anonymous user at huggingface.

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("anonymous-german-nlp/german-gpt2")
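
As a quick check, the tokenizer can be inspected on a short string (a sketch; the example sentence is arbitrary):

    text = "Ada liebte ihre Katze"
    print(tokenizer.tokenize(text))  # subword pieces
    print(tokenizer.encode(text))    # corresponding token ids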

Model

Caution: We use a GPT-2 pretrained on an unknown German corpus. We therefore cannot rule out that the corpus introduces biases or enables unintended negative uses.

For the English version, huggingface states that GPT-2 "was trained on millions of webpages with a causal language modeling objective". However, it is unclear which sources were used to pretrain the German model and how large the corpus was.

    from transformers import AutoModelWithLMHead

    model = AutoModelWithLMHead.from_pretrained("anonymous-german-nlp/german-gpt2")
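
The fine-tuning steps listed in the Content section (TextDataset, Trainer, training, saving) might look roughly like this, assuming the transformers Trainer API as of late 2020; the file path, output directory, and hyperparameters are assumptions, not the values used for this project:

    from transformers import (TextDataset, DataCollatorForLanguageModeling,
                              Trainer, TrainingArguments)

    # build a TextDataset from the fine-tuning text (path is hypothetical)
    train_dataset = TextDataset(
        tokenizer=tokenizer,
        file_path="data/spieltrieb.txt",
        block_size=128,
    )

    # GPT-2 uses a causal LM objective, so masked language modeling is off
    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

    training_args = TrainingArguments(
        output_dir="./german-gpt2-zeh",   # hypothetical output directory
        num_train_epochs=3,               # assumed hyperparameters
        per_device_train_batch_size=4,
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        data_collator=data_collator,
        train_dataset=train_dataset,
    )

    trainer.train()
    trainer.save_model("./german-gpt2-zeh")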

Result

The model is trained, saved, and able to generate lyrics. Unfortunately, the fine-tuned model cannot be stored on GitHub because of its size (~500 MB).

Neural nets calculate probabilities. In the basic version, greedy search, the neural net simply selects the word with the highest probability as the next word. As a result, AI-generated texts often suffer from repetitions.

There are different ways to make AI-generated text sound more human-like, for instance by penalizing n-gram repetitions or by adding some randomness to text generation, e.g. by varying the temperature (for different temperature options, see also my LSTM RNN).
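
Both options are available as parameters of generate. A minimal sketch (the prompt matches the example further below; the parameter values are arbitrary choices, not the ones used for this project):

    # encode a prompt as input ids
    inputs = tokenizer.encode("Ada liebte ihre Katze", return_tensors="pt")

    outputs = model.generate(
        inputs,
        max_length=50,
        no_repeat_ngram_size=2,  # block repeated 2-grams
        do_sample=True,
        temperature=0.7,         # <1.0 sharpens, >1.0 flattens the next-word distribution
    )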

The lyrics presented at the beginning of this README were generated with two more advanced sampling methods available for GPT-2, namely top_k and top_p.

    outputs = model.generate(
        inputs,
        max_length=max_length,
        do_sample=True,
        top_p=0.95,
        top_k=50,
        num_return_sequences=2,
    )

top_k => With top-k sampling, only the K most likely next words are kept, and the probability mass is redistributed among those K words.

top_p => With p=0.95, top-p sampling picks the smallest set of words whose cumulative probability exceeds 0.95, and samples only from that set.

num_return_sequences makes it possible to generate several samples from the same prompt.
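
The returned sequences are token ids; decoding them back to text might look like this (a sketch, reusing the outputs from the call above):

    for i, sample in enumerate(outputs):
        print(f"--- Sample {i} ---")
        print(tokenizer.decode(sample, skip_special_tokens=True))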

When prompted with "Ada liebte ihre Katze" ("Ada loved her cat"), the GPT-2 generated the following lyrics. They might sound a bit biblical and old-fashioned, but pretty human-like in my opinion.

"Ada liebte ihre Katze und den Esel, und nun wird sie von ihrer Mutter und vom Haus gewiesen, daß sie ihr Vater wird, und sie wird den Namen ihres Vaters Jela nennen. Und sein Vater war ein Richter in seiner Stadt. Und Jela sagte: Zu wem soll ich hinaufgehen, du schöner Mann? Er antwortete: Mein Herr, wir haben keinen Umgang miteinander. Mein Herr, so sollst du hinaufgehen und zu ihm hinabgehen und sagen: Ich bin Jela, dein Bruder; wir wollen hinfahren, daß wir hingehen. Und er sah Jela und seinen Vater an und sagte: Ich werde hingehen, Jela zu suchen; und ich werde zu ihr hinabgehen und sagen: Ich"

Outlook

According to huggingface, "Text generation is currently possible with GPT-2, OpenAi-GPT, CTRL, XLNet, Transfo-XL and Reformer in PyTorch and for most models in Tensorflow as well. [...] GPT-2 is usually a good choice for open-ended text generation because it was trained on millions of webpages with a causal language modeling objective." It would be really interesting to compare the quality of text generated by different German pretrained models.

