oliverguhr / german-sentiment

A data set and model for german sentiment classification.

License: MIT License

Python 97.04% Shell 2.96%
sentiment-analysis sentiment-classification german-language transformer bert-model fasttext machine-learning deep-learning

german-sentiment's Introduction

Broad-Coverage German Sentiment Classification Model for Dialog Systems

This repository contains the code and data for the paper "Training a Broad-Coverage German Sentiment Classification Model for Dialog Systems", published at LREC 2020.

Usage

If you would like to use the models for your own projects, please head over to this repository. It contains a Python package that provides an easy-to-use interface.
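
For a quick impression, after `pip install germansentiment` the package can be used roughly like this (a minimal sketch based on the package's documented interface; the printed labels are illustrative):

```python
from germansentiment import SentimentModel

model = SentimentModel()
print(model.predict_sentiment(["Das ist super", "Das war schlecht"]))
# e.g. ['positive', 'negative']
```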

Data Sets

We trained our models on a combination of self-created and existing data sets to cover a broad variety of topics and domains.

| Data Set | Positive Samples | Neutral Samples | Negative Samples | Total Samples |
|---|---|---|---|---|
| Emotions | 188 | 28 | 1,090 | 1,306 |
| filmstarts | 40,049 | 0 | 15,610 | 55,659 |
| GermEval-2017 | 1,371 | 16,309 | 5,845 | 23,525 |
| holidaycheck | 3,135,449 | 0 | 388,744 | 3,524,193 |
| Leipzig Wikipedia Corpus 2016 | 0 | 1,000,000 | 0 | 1,000,000 |
| PotTS | 3,448 | 2,487 | 1,569 | 7,504 |
| SB10k | 1,716 | 4,628 | 1,130 | 7,474 |
| SCARE | 538,103 | 0 | 197,279 | 735,382 |
| Sum | 3,720,324 | 1,023,452 | 611,267 | 5,355,043 |

All data sets except SCARE can be downloaded from here. Due to legal requirements, we cannot provide the SCARE data set directly. If you are interested in this data, please obtain it from the authors and integrate it using our provided scripts to create the combined data set.

The unprocessed data set can be downloaded from here (1.5 GB); it contains all hotel and movie reviews, plus a set of neutral German texts.

The Filmstarts data set consists of 71,229 user-written movie reviews in German. We collected this data from the German website filmstarts.de using a web crawler. Users can rate their reviews in the range of 0.5 to 5 stars. With 40,049 documents, the majority of the reviews in this data set are positive; only 15,610 reviews are negative. All data was downloaded between the 15th and 16th of October 2018 and contains reviews up to this date.

The holidaycheck data set contains hotel reviews from the German website holidaycheck.de. Users of this website can write a general review and rate their hotel. Additionally, they can review and rate six specific aspects: location & surroundings, rooms, service, cuisine, sports & entertainment, and hotel. A full review therefore contains seven texts and the associated star ratings in the range from zero to six stars. In total, we downloaded 4,832,001 text-rating pairs for hotels from ten destinations: Egypt, Bulgaria, China, Greece, India, Majorca, Mexico, Tenerife, Thailand, and Tunisia. The reviews were obtained from November to December 2018 and contain reviews up to this date. After removing all reviews with no stars or four stars, the data set contains 3,524,193 text-rating pairs.
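
Turning such star ratings into sentiment labels requires a threshold scheme along the following lines; the cut-offs below are purely illustrative assumptions, not the ones used for the paper:

```python
from typing import Optional

def rating_to_label(stars: float, max_stars: float) -> Optional[str]:
    # Illustrative thresholds only; not taken from the paper or this repository
    ratio = stars / max_stars
    if ratio <= 0.4:
        return "negative"
    if ratio >= 0.8:
        return "positive"
    return None  # ambiguous middle ratings could be discarded

print(rating_to_label(1.0, 5.0))  # 'negative'
print(rating_to_label(4.5, 5.0))  # 'positive'
```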

The Emotions data set contains a list of utterances that we recorded during "Wizard of Oz" experiments with our service robots. We noticed that people used insults while talking to the robot. Since most of these words are filtered on social media and review platforms, other data sets do not contain such words. We used synonym replacement as a data augmentation technique to generate new utterances based on our recordings. Besides negative feedback, this data set also contains positive feedback and phrases about sexual identity and orientation that were labelled as neutral. Overall, this data set contains 1,306 examples.
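
Synonym replacement itself is a simple operation; here is a minimal sketch (the synonym lists below are made up for illustration and are not the ones used for the Emotions data set):

```python
import random

# Made-up synonym lists, for illustration only
SYNONYMS = {"blöd": ["dumm", "doof"], "toll": ["super", "prima"]}

def synonym_replace(utterance: str) -> str:
    """Replace one random known word with one of its synonyms."""
    words = utterance.split()
    candidates = [i for i, w in enumerate(words) if w in SYNONYMS]
    if candidates:
        i = random.choice(candidates)
        words[i] = random.choice(SYNONYMS[words[i]])
    return " ".join(words)

print(synonym_replace("du bist blöd"))  # e.g. 'du bist dumm'
```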

Trained Models

You can download our trained FastText and BERT models here (6 GB). With these models we achieved the following results:

BERT

| Data Set | Balanced | Unbalanced |
|---|---|---|
| SCARE | 0.9409 | 0.9436 |
| GermEval-2017 | 0.7727 | 0.7885 |
| holidaycheck | 0.9552 | 0.9775 |
| SB10k | 0.6930 | 0.6720 |
| filmstarts | 0.9062 | 0.9219 |
| PotTS | 0.6423 | 0.6502 |
| emotions | 0.9652 | 0.9621 |
| Leipzig Wikipedia Corpus 2016 | 0.9983 | 0.9981 |
| combined | 0.9636 | 0.9744 |

Micro-averaged F1 scores for BERT trained on the balanced and the unbalanced data set.
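
As a reminder, micro-averaged F1 can be computed with scikit-learn like this (a generic sketch, not the repository's evaluation code):

```python
from sklearn.metrics import f1_score

y_true = ["positive", "negative", "neutral", "positive"]
y_pred = ["positive", "negative", "negative", "positive"]

# For single-label multi-class data, micro-averaged F1 equals accuracy
print(f1_score(y_true, y_pred, average="micro"))  # 0.75
```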

FastText

| Data Set | Balanced | Unbalanced |
|---|---|---|
| SCARE | 0.9071 | 0.9083 |
| GermEval-2017 | 0.6970 | 0.6980 |
| holidaycheck | 0.9296 | 0.9639 |
| SB10k | 0.6862 | 0.6213 |
| filmstarts | 0.8206 | 0.8432 |
| PotTS | 0.5268 | 0.5416 |
| emotions | 0.9913 | 0.9773 |
| Leipzig Wikipedia Corpus 2016 | 0.9883 | 0.9886 |
| combined | 0.9405 | 0.9573 |

Micro-averaged F1 scores for FastText trained on the balanced and the unbalanced data set.

Setup

We recommend installing this project in a Python virtual environment. To create and activate the virtual environment, execute these three commands:

```bash
pip3 install virtualenv
python3 -m venv ./venv
source venv/bin/activate
```

Make sure that you are using a recent Python version by running `python -V`. You need at least Python 3.6.

```bash
python -V
> Python 3.6.8
```

Next, install the required Python packages.

```bash
pip install -r requirements.txt
```

In order to reproduce the results, you need to download our models and data. We provide a script that downloads everything required:

```bash
sh download-models-and-data.sh
```

Paper & Citation

You can read the paper here. Please cite us if you find this work useful.

```bibtex
@InProceedings{guhr-EtAl:2020:LREC,
  author    = {Guhr, Oliver  and  Schumann, Anne-Kathrin  and  Bahrmann, Frank  and  Böhme, Hans Joachim},
  title     = {Training a Broad-Coverage German Sentiment Classification Model for Dialog Systems},
  booktitle      = {Proceedings of The 12th Language Resources and Evaluation Conference},
  month          = {May},
  year           = {2020},
  address        = {Marseille, France},
  publisher      = {European Language Resources Association},
  pages     = {1620--1625},
  url       = {https://www.aclweb.org/anthology/2020.lrec-1.202/}
}
```

If you use the combined data set for your work, you can use this list to cite all the contained data sets:

```bibtex
@LanguageResource{sanger_scare_2016,
	address = {Portorož, Slovenia},
	title = {{SCARE} ― {The} {Sentiment} {Corpus} of {App} {Reviews} with {Fine}-grained {Annotations} in {German}},
	url = {https://www.aclweb.org/anthology/L16-1178},	
	urldate = {2019-11-07},
	booktitle = {Proceedings of the {Tenth} {International} {Conference} on {Language} {Resources} and {Evaluation} ({LREC}'16)},
	publisher = {European Language Resources Association (ELRA)},
	author = {Sänger, Mario and Leser, Ulf and Kemmerer, Steffen and Adolphs, Peter and Klinger, Roman},
	year = {2016},
	pages = {1114--1121}
}

@LanguageResource{sidarenka_potts:_2016,
	address = {Paris, France},
	title = {{PotTS}: {The} {Potsdam} {Twitter} {Sentiment} {Corpus}},
	isbn = {978-2-9517408-9-1},
	language = {english},
	booktitle = {Proceedings of the {Tenth} {International} {Conference} on {Language} {Resources} and {Evaluation} ({LREC} 2016)},
	publisher = {European Language Resources Association (ELRA)},
	author = {Sidarenka, Uladzimir},
	editor = {Calzolari, Nicoletta and Choukri, Khalid and Declerck, Thierry and Goggi, Sara and Grobelnik, Marko and Maegaard, Bente and Mariani, Joseph and Mazo, Helene and Moreno, Asuncion and Odijk, Jan and Piperidis, Stelios},
	year = {2016},
	note = {event-place: Portorož, Slovenia}
}

@LanguageResource{cieliebak_twitter_2017,
	address = {Valencia, Spain},
	title = {A {Twitter} {Corpus} and {Benchmark} {Resources} for {German} {Sentiment} {Analysis}},
	url = {https://www.aclweb.org/anthology/W17-1106},
	doi = {10.18653/v1/W17-1106},
	urldate = {2019-11-07},
	booktitle = {Proceedings of the {Fifth} {International} {Workshop} on {Natural} {Language} {Processing} for {Social} {Media}},
	publisher = {Association for Computational Linguistics},
	author = {Cieliebak, Mark and Deriu, Jan Milan and Egger, Dominic and Uzdilli, Fatih},
	month = apr,
	year = {2017},
	pages = {45--51}
}

@LanguageResource{wojatzki_germeval_2017,
	address = {Berlin, Germany},
	title = {{GermEval} 2017: {Shared} {Task} on {Aspect}-based {Sentiment} in {Social} {Media} {Customer} {Feedback}},
	booktitle = {Proceedings of the {GermEval} 2017 – {Shared} {Task} on {Aspect}-based {Sentiment} in {Social} {Media} {Customer} {Feedback}},
	author = {Wojatzki, Michael and Ruppert, Eugen and Holschneider, Sarah and Zesch, Torsten and Biemann, Chris},
	year = {2017},
	pages = {1--12}	
}

@inproceedings{goldhahn-etal-2012-building,
    title = "Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages",
    author = "Goldhahn, Dirk  and
      Eckart, Thomas  and
      Quasthoff, Uwe",
    booktitle = "Proceedings of the Eighth International Conference on Language Resources and Evaluation ({LREC}'12)",
    month = may,
    year = "2012",
    address = "Istanbul, Turkey",
    publisher = "European Language Resources Association (ELRA)",
    url = "http://www.lrec-conf.org/proceedings/lrec2012/pdf/327_Paper.pdf",
    pages = "759--765"
}
```

german-sentiment's People

Contributors: dependabot[bot], oliverguhr

german-sentiment's Issues

germansentiment not working in conda / jupyter (?)

Hi Oliver, sorry, I couldn't see another way to contact you directly. I am trying to run germansentiment 1.1.0 in a conda environment and a Jupyter Lab notebook. I am using the default ipykernel (Python 3). I have run `pip install germansentiment` in the conda env, and indeed it has installed correctly to

/opt/anaconda3/envs/my_env/lib/python3.9/site-packages/germansentiment/

However, doing the import causes a kernel crash every time: `from germansentiment import SentimentModel`. I see that I am not alone with this error, at least: https://stackoverflow.com/questions/72396420/kernel-keeps-dying-while-using-bert-based-sentiment-analysis-model

Note that I also tested it in a standard Python env and it worked just fine. Is there a special trick required to get it working in conda / Jupyter? Or is it just something fishy about my setup? I'm on macOS.

Different results from the downloaded model compared to the Hugging Face API

Hello Oliver Guhr,

First of all, thank you for your great work.

After finding your model on Hugging Face, I tested it. I also used the german-sentiment-lib.
I found some anomalies when I was working locally, so I tested the same sentences using the Hugging Face API.
[Screenshot: results from the Hugging Face API for the sentiment BERT model]

The Hugging Face API gives far better results than the downloaded model. I spotted the anomalies while testing the following sentences:

Wie sicher ist das?
Im Ausland ist das anders.
Jeder kann zum Ziel werden.
Das soll mit dem Update der Apps ab Januar funktionieren.
Wie funktioniert das technisch?
Wie sicher ist das?
Irgend etwas passiert hier.
Worum geht es dabei?

My questions are:
- Are the hosted models and the models that we download the same in every aspect and configuration?
- If they are different, can you let me know what is different in the hosted API, and how I can improve the downloaded model's performance or reproduce the same results as the API?

Thank you for your time.
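
A rough sketch for reproducing such a comparison (the endpoint is the generic Hugging Face Inference API; the token is a placeholder):

```python
import requests
from germansentiment import SentimentModel

texts = ["Wie sicher ist das?", "Im Ausland ist das anders."]

# Local prediction via the germansentiment package
model = SentimentModel()
print(model.predict_sentiment(texts))

# Hosted prediction via the Hugging Face Inference API
API_URL = "https://api-inference.huggingface.co/models/oliverguhr/german-sentiment-bert"
headers = {"Authorization": "Bearer <YOUR_HF_TOKEN>"}  # placeholder token
print(requests.post(API_URL, headers=headers, json={"inputs": texts}).json())
```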

Export to ONNX

Could I add an ONNX export version?

My current attempt is:

```python
import json

import germansentiment
import torch

# Initialize the model
model = germansentiment.SentimentModel()

# Dummy input that matches the input dimensions of the model
dummy_input = torch.randint(0, 30_000, (1, 512), dtype=torch.long)

# Export to ONNX
torch.onnx.export(model.model, dummy_input, "german_sentiment_model.onnx")

# Export the vocab
with open("vocab.json", "w") as f:
    json.dump(model.tokenizer.vocab, f)
```

Then I used the model in Elixir:

```elixir
{model, params} = AxonOnnx.import("./models/models/german_sentiment_model.onnx")

{:ok, vocab_string} = File.read("./models/models/vocab.json")
{:ok, vocab_map} = Jason.decode(vocab_string)

# Tokenize
input_text = "Ein schlechter Film"
token_list = Enum.map(String.split(input_text, " "), fn x -> vocab_map[x] end)
token_tensor = Nx.tensor(List.duplicate(0, 512 - length(token_list)))
token_tensor = Nx.concatenate([Nx.tensor(token_list), token_tensor])

{init_fn, predict_fn} = Axon.build(model)

predict_fn.(params, token_tensor)
```

But I still have some problems/questions:

  1. Is this correct?
  2. Why do some keys in the vocab.json start with `##`?
  3. Why are some keys named `["unused{x}"]`?
  4. Why do the predictions not scale from 0 to 1, but are signed floats?
  5. Why do some strings not work in my version? The string "Ein scheiß Film" works on Hugging Face but not in the export.
  6. Why are some keys in capital letters, while the text is always converted to lower case?

About 4): I currently scale the prediction as follows:

```elixir
prediction = predict_fn.(params, token_tensor)
one_hot = Nx.divide(Nx.pow(2, prediction), Nx.sum(Nx.pow(2, prediction)))
political_score = 5 * (Nx.to_number(one_hot[0][0]) - Nx.to_number(one_hot[0][1]))
```

About 5): In my version above, keys that are not matched return nil. I changed that to 0, but that changes the meaning of the sentence.

I opened a question in the Elixir Forum about it here.
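
One way to address points 1, 4 and 5 might be to use the model's own WordPiece tokenizer (which also explains point 2: `##` keys mark subword continuations) and to pass the attention mask alongside the input IDs. The raw outputs are logits, so softmax maps them to values between 0 and 1 (point 4). A sketch under these assumptions, using standard `torch.onnx.export` options rather than anything from this repository:

```python
import torch
from germansentiment import SentimentModel

model = SentimentModel()
encoded = model.tokenizer(
    ["Ein scheiß Film"], padding="max_length", truncation=True,
    max_length=512, return_tensors="pt",
)

# Export with both input_ids and attention_mask so that padding is masked out
torch.onnx.export(
    model.model,
    (encoded["input_ids"], encoded["attention_mask"]),
    "german_sentiment_model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={"input_ids": {0: "batch"}, "attention_mask": {0: "batch"}},
)

# The model returns logits; softmax turns them into probabilities in [0, 1]
with torch.no_grad():
    logits = model.model(**encoded)[0]
print(torch.softmax(logits, dim=1))
```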


Question: Text cleaning for BERT and FastText

Hi there,

I have a question about the way the data was cleaned for both models.
I guess that before training the FastText model, all these cleaners were used: https://github.com/oliverguhr/german-sentiment/blob/master/fasttext/textcleaner.py#L66

This is why, for example, the FastText model doesn't "understand" emojis, while the BERT model does:
https://huggingface.co/oliverguhr/german-sentiment-bert?text=%F0%9F%98%A1

In the BERT folder, I don't see similar cleaners, but at the same time, the minimal example on the Hugging Face hub https://huggingface.co/oliverguhr/german-sentiment-bert?#a-minimal-working-sample shows a subset of the FastText cleaners/preprocessors.

Could you please clarify which cleaners are recommended/have been used for the preprocessing of the BERT train data?

Attention mask in the example

Hi there

There seems to be a tiny mistake in the minimal working sample on the Hugging Face model page. You are only passing the input IDs to the model and not the attention mask. That means all padding tokens receive attention when they should not, which can skew the results significantly, as my testing shows.

Here is a working version where all the return values of the tokenizer are given to the model, with the expected results.

```python
import re
from typing import List

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer


class SentimentModel():
    def __init__(self, model_name: str):
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)

        self.clean_chars = re.compile(r'[^A-Za-züöäÖÜÄß ]', re.MULTILINE)
        self.clean_http_urls = re.compile(r'https*\S+', re.MULTILINE)
        self.clean_at_mentions = re.compile(r'@\S+', re.MULTILINE)

    def predict_sentiment(self, texts: List[str]) -> List[str]:
        texts = [self.clean_text(text) for text in texts]
        # Pad, truncate where necessary, and return as tensors
        encoded = self.tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

        with torch.no_grad():
            logits = self.model(**encoded)

        # Get the highest scoring label IDs and convert to labels
        label_ids = torch.argmax(logits[0], dim=1)
        return [self.model.config.id2label[label_id.item()] for label_id in label_ids]

    def replace_numbers(self, text: str) -> str:
        return text.replace("0", " null").replace("1", " eins").replace("2", " zwei").replace("3", " drei").replace("4", " vier").replace("5", " fünf").replace("6", " sechs").replace("7", " sieben").replace("8", " acht").replace("9", " neun")

    def clean_text(self, text: str) -> str:
        text = text.replace("\n", " ")
        text = self.clean_http_urls.sub('', text)
        text = self.clean_at_mentions.sub('', text)
        text = self.replace_numbers(text)
        text = self.clean_chars.sub('', text)  # use only text chars
        text = ' '.join(text.split())  # substitute multiple whitespace with single whitespace
        text = text.strip().lower()
        return text
```
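
For completeness, a usage sketch of this class (the model name is the one published on Hugging Face; the printed labels are what one would expect, not a verified output):

```python
model = SentimentModel("oliverguhr/german-sentiment-bert")
print(model.predict_sentiment(["Das war super!", "Das war furchtbar."]))
# expected along the lines of: ['positive', 'negative']
```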

Wrong paper URL

Thanks for the package. It is very fast and a good idea. Also, thanks for actually delivering code with a scientific paper. Unfortunately, the URL to the paper seems to be wrong, both here (DOI, 404) and in the package manager (wrong paper). Maybe you want to change this; that would make it easier to read up on the background.

Wrong sentiment on some emojis

Some emojis are classified with the wrong sentiment:
😗 Negative
😘 Negative
😙 Negative
😚 Negative
😛 Negative
😜 Negative
😝 Negative
🙂 Negative
🙏 Negative
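
For reproduction, a sketch that queries the model directly, bypassing any text cleaning (the model id is the one published on Hugging Face):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("oliverguhr/german-sentiment-bert")
model = AutoModelForSequenceClassification.from_pretrained("oliverguhr/german-sentiment-bert")

encoded = tokenizer(["🙂", "😘", "🙏"], padding=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**encoded)[0]
for label_id in torch.argmax(logits, dim=1):
    print(model.config.id2label[label_id.item()])
```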

Feedback regarding case

Hi,
I just had a look at your training data and wanted to give some feedback.
You have all texts lowercased, but the language model you used is case-sensitive.

See tokenizer_config.json:

{"do_lower_case": false, "model_max_length": 512, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}

IMO this will reduce quality because "Dummkopf" != "dummkopf" for the pretrained language model.
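
The difference is easy to inspect with the tokenizer itself (a sketch; the exact subword splits depend on the vocabulary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("oliverguhr/german-sentiment-bert")

# With do_lower_case = false, the cased and lowercased forms tokenize differently
print(tokenizer.tokenize("Dummkopf"))
print(tokenizer.tokenize("dummkopf"))
```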

Small error that breaks example code on Hugging Face model card

Hi Oliver

First of all – thanks very much for making your model available. Really helpful and very much appreciated.

I noticed a small quirk that broke your sample code on Hugging Face for me:

[Screenshot: sample code from the Hugging Face model card, 2021-11-17]

In the function clean_text you start by replacing \n with spaces. If I just copy and paste the code, I don't copy the invisible \n but get an actual line break in my IDE (Sublime). This leads to wrong predictions.

If I replace the line break with an actual \n, I get correct predictions. It might be helpful to fix that in the sample code if possible.
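
For illustration, the two variants (the second is shown commented out because a literal line break inside a string is not valid Python):

```python
text = "erste Zeile\nzweite Zeile"

# Intended code: an escaped newline inside the string literal
text = text.replace("\n", " ")
print(text)  # 'erste Zeile zweite Zeile'

# What copy-pasting the rendered page can produce: an actual line break
# inside the literal, which breaks the code
# text = text.replace("
# ", " ")
```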

Again – thanks for your work. 👍

Have a great day!
