oliverguhr / german-sentiment

A data set and model for german sentiment classification.

License: MIT License

Python 97.04% Shell 2.96%
sentiment-analysis sentiment-classification german-language transformer bert-model fasttext machine-learning deep-learning

german-sentiment's Introduction

Broad-Coverage German Sentiment Classification Model for Dialog Systems

This repository contains the code and data for the paper "Training a Broad-Coverage German Sentiment Classification Model for Dialog Systems", published at LREC 2020.

Usage

If you would like to use the models for your own projects, please head over to this repository. It contains a Python package that provides an easy-to-use interface.
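
For a quick impression, after `pip install germansentiment` the package can be used roughly like this (a minimal sketch based on the package's documented interface; the printed labels are illustrative):

```python
from germansentiment import SentimentModel

model = SentimentModel()
print(model.predict_sentiment(["Das ist super", "Das war schlecht"]))
# e.g. ['positive', 'negative']
```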

Data Sets

We trained our models on a combination of self-created and existing data sets to cover a broad variety of topics and domains.

| Data Set | Positive Samples | Neutral Samples | Negative Samples | Total Samples |
|---|---|---|---|---|
| Emotions | 188 | 28 | 1,090 | 1,306 |
| filmstarts | 40,049 | 0 | 15,610 | 55,659 |
| GermEval-2017 | 1,371 | 16,309 | 5,845 | 23,525 |
| holidaycheck | 3,135,449 | 0 | 388,744 | 3,524,193 |
| Leipzig Wikipedia Corpus 2016 | 0 | 1,000,000 | 0 | 1,000,000 |
| PotTS | 3,448 | 2,487 | 1,569 | 7,504 |
| SB10k | 1,716 | 4,628 | 1,130 | 7,474 |
| SCARE | 538,103 | 0 | 197,279 | 735,382 |
| Sum | 3,720,324 | 1,023,452 | 611,267 | 5,355,043 |

All data sets except SCARE can be downloaded from here. Due to legal requirements, we cannot provide the SCARE data set directly. If you are interested in this data, please obtain it from the authors and integrate it using our provided scripts to create the combined data set.

The unprocessed data set can be downloaded from here (1.5 GB); it contains all hotel and movie reviews, plus a set of neutral German texts.

The Filmstarts data set consists of 71,229 user-written movie reviews in German. We collected this data from the German website filmstarts.de using a web crawler. Users can rate their reviews in the range of 0.5 to 5 stars. With 40,049 documents, the majority of the reviews in this data set are positive; only 15,610 reviews are negative. All data was downloaded between the 15th and 16th of October 2018 and contains reviews up to this date.

The holidaycheck data set contains hotel reviews from the German website holidaycheck.de. Users of this website can write a general review and rate their hotel. Additionally, they can review and rate six specific aspects: location & surroundings, rooms, service, cuisine, sports & entertainment, and hotel. A full review therefore contains seven texts and the associated star ratings in the range from zero to six stars. In total, we downloaded 4,832,001 text-rating pairs for hotels from ten destinations: Egypt, Bulgaria, China, Greece, India, Majorca, Mexico, Tenerife, Thailand, and Tunisia. The reviews were obtained from November to December 2018 and contain reviews up to this date. After removing all reviews with no stars or four stars, the data set contains 3,524,193 text-rating pairs.
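
Turning such star ratings into sentiment labels requires a threshold scheme along the following lines; the cut-offs below are purely illustrative assumptions, not the ones used for the paper:

```python
from typing import Optional

def rating_to_label(stars: float, max_stars: float) -> Optional[str]:
    # Illustrative thresholds only; not taken from the paper or this repository
    ratio = stars / max_stars
    if ratio <= 0.4:
        return "negative"
    if ratio >= 0.8:
        return "positive"
    return None  # ambiguous middle ratings could be discarded

print(rating_to_label(1.0, 5.0))  # 'negative'
print(rating_to_label(4.5, 5.0))  # 'positive'
```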

The Emotions data set contains a list of utterances that we recorded during "Wizard of Oz" experiments with our service robots. We noticed that people used insults while talking to the robot. Since most of these words are filtered on social media and review platforms, other data sets do not contain such words. We used synonym replacement as a data augmentation technique to generate new utterances based on our recordings. Besides negative feedback, this data set also contains positive feedback and phrases about sexual identity and orientation that were labelled as neutral. Overall, this data set contains 1,306 examples.
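
Synonym replacement itself is a simple operation; here is a minimal sketch (the synonym lists below are made up for illustration and are not the ones used for the Emotions data set):

```python
import random

# Made-up synonym lists, for illustration only
SYNONYMS = {"blöd": ["dumm", "doof"], "toll": ["super", "prima"]}

def synonym_replace(utterance: str) -> str:
    """Replace one random known word with one of its synonyms."""
    words = utterance.split()
    candidates = [i for i, w in enumerate(words) if w in SYNONYMS]
    if candidates:
        i = random.choice(candidates)
        words[i] = random.choice(SYNONYMS[words[i]])
    return " ".join(words)

print(synonym_replace("du bist blöd"))  # e.g. 'du bist dumm'
```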

Trained Models

You can download our trained FastText and BERT models here (6 GB). With these models we achieved the following results:

BERT

| Data Set | Balanced | Unbalanced |
|---|---|---|
| SCARE | 0.9409 | 0.9436 |
| GermEval-2017 | 0.7727 | 0.7885 |
| holidaycheck | 0.9552 | 0.9775 |
| SB10k | 0.6930 | 0.6720 |
| filmstarts | 0.9062 | 0.9219 |
| PotTS | 0.6423 | 0.6502 |
| emotions | 0.9652 | 0.9621 |
| Leipzig Wikipedia Corpus 2016 | 0.9983 | 0.9981 |
| combined | 0.9636 | 0.9744 |

Micro-averaged F1 scores for BERT trained on the balanced and the unbalanced data set.
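
As a reminder, micro-averaged F1 can be computed with scikit-learn like this (a generic sketch, not the repository's evaluation code):

```python
from sklearn.metrics import f1_score

y_true = ["positive", "negative", "neutral", "positive"]
y_pred = ["positive", "negative", "negative", "positive"]

# For single-label multi-class data, micro-averaged F1 equals accuracy
print(f1_score(y_true, y_pred, average="micro"))  # 0.75
```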

FastText

| Data Set | Balanced | Unbalanced |
|---|---|---|
| SCARE | 0.9071 | 0.9083 |
| GermEval-2017 | 0.6970 | 0.6980 |
| holidaycheck | 0.9296 | 0.9639 |
| SB10k | 0.6862 | 0.6213 |
| filmstarts | 0.8206 | 0.8432 |
| PotTS | 0.5268 | 0.5416 |
| emotions | 0.9913 | 0.9773 |
| Leipzig Wikipedia Corpus 2016 | 0.9883 | 0.9886 |
| combined | 0.9405 | 0.9573 |

Micro-averaged F1 scores for FastText trained on the balanced and the unbalanced data set.

Setup

We recommend installing this project in a Python virtual environment. To create and activate the virtual environment, execute these three commands:

```bash
pip3 install virtualenv
python3 -m venv ./venv
source venv/bin/activate
```

Make sure that you are using a recent Python version by running `python -V`. You need at least Python 3.6.

```bash
python -V
> Python 3.6.8
```

Next, install the required Python packages.

```bash
pip install -r requirements.txt
```

In order to reproduce the results, you need to download our models and data. We provide a script that downloads everything required:

```bash
sh download-models-and-data.sh
```

Paper & Citation

You can read the paper here. Please cite us if you find this work useful.

```bibtex
@InProceedings{guhr-EtAl:2020:LREC,
  author    = {Guhr, Oliver  and  Schumann, Anne-Kathrin  and  Bahrmann, Frank  and  Böhme, Hans Joachim},
  title     = {Training a Broad-Coverage German Sentiment Classification Model for Dialog Systems},
  booktitle      = {Proceedings of The 12th Language Resources and Evaluation Conference},
  month          = {May},
  year           = {2020},
  address        = {Marseille, France},
  publisher      = {European Language Resources Association},
  pages     = {1620--1625},
  url       = {https://www.aclweb.org/anthology/2020.lrec-1.202/}
}
```

If you use the combined data set for your work, you can use this list to cite all the contained data sets:

```bibtex
@LanguageResource{sanger_scare_2016,
	address = {Portorož, Slovenia},
	title = {{SCARE} ― {The} {Sentiment} {Corpus} of {App} {Reviews} with {Fine}-grained {Annotations} in {German}},
	url = {https://www.aclweb.org/anthology/L16-1178},	
	urldate = {2019-11-07},
	booktitle = {Proceedings of the {Tenth} {International} {Conference} on {Language} {Resources} and {Evaluation} ({LREC}'16)},
	publisher = {European Language Resources Association (ELRA)},
	author = {Sänger, Mario and Leser, Ulf and Kemmerer, Steffen and Adolphs, Peter and Klinger, Roman},
	year = {2016},
	pages = {1114--1121}
}

@LanguageResource{sidarenka_potts:_2016,
	address = {Paris, France},
	title = {{PotTS}: {The} {Potsdam} {Twitter} {Sentiment} {Corpus}},
	isbn = {978-2-9517408-9-1},
	language = {english},
	booktitle = {Proceedings of the {Tenth} {International} {Conference} on {Language} {Resources} and {Evaluation} ({LREC} 2016)},
	publisher = {European Language Resources Association (ELRA)},
	author = {Sidarenka, Uladzimir},
	editor = {Calzolari, Nicoletta and Choukri, Khalid and Declerck, Thierry and Goggi, Sara and Grobelnik, Marko and Maegaard, Bente and Mariani, Joseph and Mazo, Helene and Moreno, Asuncion and Odijk, Jan and Piperidis, Stelios},
	year = {2016},
	note = {event-place: Portorož, Slovenia}
}

@LanguageResource{cieliebak_twitter_2017,
	address = {Valencia, Spain},
	title = {A {Twitter} {Corpus} and {Benchmark} {Resources} for {German} {Sentiment} {Analysis}},
	url = {https://www.aclweb.org/anthology/W17-1106},
	doi = {10.18653/v1/W17-1106},
	urldate = {2019-11-07},
	booktitle = {Proceedings of the {Fifth} {International} {Workshop} on {Natural} {Language} {Processing} for {Social} {Media}},
	publisher = {Association for Computational Linguistics},
	author = {Cieliebak, Mark and Deriu, Jan Milan and Egger, Dominic and Uzdilli, Fatih},
	month = apr,
	year = {2017},
	pages = {45--51}
}

@LanguageResource{wojatzki_germeval_2017,
	address = {Berlin, Germany},
	title = {{GermEval} 2017: {Shared} {Task} on {Aspect}-based {Sentiment} in {Social} {Media} {Customer} {Feedback}},
	booktitle = {Proceedings of the {GermEval} 2017 – {Shared} {Task} on {Aspect}-based {Sentiment} in {Social} {Media} {Customer} {Feedback}},
	author = {Wojatzki, Michael and Ruppert, Eugen and Holschneider, Sarah and Zesch, Torsten and Biemann, Chris},
	year = {2017},
	pages = {1--12}	
}

@inproceedings{goldhahn-etal-2012-building,
    title = "Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages",
    author = "Goldhahn, Dirk  and
      Eckart, Thomas  and
      Quasthoff, Uwe",
    booktitle = "Proceedings of the Eighth International Conference on Language Resources and Evaluation ({LREC}'12)",
    month = may,
    year = "2012",
    address = "Istanbul, Turkey",
    publisher = "European Language Resources Association (ELRA)",
    url = "http://www.lrec-conf.org/proceedings/lrec2012/pdf/327_Paper.pdf",
    pages = "759--765"
}
```

german-sentiment's People

Contributors: dependabot[bot], oliverguhr

german-sentiment's Issues

germansentiment not working in conda / jupyter (?)

Hi Oliver, sorry, I couldn't see another way to contact you directly. I am trying to run germansentiment 1.1.0 in a conda environment and a Jupyter Lab notebook. I am using the default ipykernel (Python 3). I have run `pip install germansentiment` in the conda env, and indeed it has installed correctly to

/opt/anaconda3/envs/my_env/lib/python3.9/site-packages/germansentiment/

However, doing the import causes a kernel crash every time: `from germansentiment import SentimentModel`. I see that I am not alone with this error, at least: https://stackoverflow.com/questions/72396420/kernel-keeps-dying-while-using-bert-based-sentiment-analysis-model

Note that I also tested it in a standard Python env and it worked just fine. Is there a special trick required to get it working in conda / Jupyter? Or is it just something fishy about my setup? I'm on macOS.

Different results from the downloaded model compared to the Hugging Face API

Hello Oliver Guhr,

First of all, thank you for your great work.

After finding your model on Hugging Face, I tested it. I also used the german-sentiment-lib.
I found some anomalies when I was working locally, so I tested the same sentences using the Hugging Face API.
[Screenshot: results from the Hugging Face API for the sentiment BERT model]

The Hugging Face API gives far better results than the downloaded model. I spotted the anomalies while testing the following sentences:

Wie sicher ist das?
Im Ausland ist das anders.
Jeder kann zum Ziel werden.
Das soll mit dem Update der Apps ab Januar funktionieren.
Wie funktioniert das technisch?
Wie sicher ist das?
Irgend etwas passiert hier.
Worum geht es dabei?

My questions are:
- Are the hosted models and the models that we download the same in every aspect and configuration?
- If they are different, can you let me know what is different in the hosted API, and how I can improve the downloaded model's performance or reproduce the same results as the API?

Thank you for your time.
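
A rough sketch for reproducing such a comparison (the endpoint is the generic Hugging Face Inference API; the token is a placeholder):

```python
import requests
from germansentiment import SentimentModel

texts = ["Wie sicher ist das?", "Im Ausland ist das anders."]

# Local prediction via the germansentiment package
model = SentimentModel()
print(model.predict_sentiment(texts))

# Hosted prediction via the Hugging Face Inference API
API_URL = "https://api-inference.huggingface.co/models/oliverguhr/german-sentiment-bert"
headers = {"Authorization": "Bearer <YOUR_HF_TOKEN>"}  # placeholder token
print(requests.post(API_URL, headers=headers, json={"inputs": texts}).json())
```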

Export to ONNX

Could I add an ONNX export version?

My current attempt is:

```python
import json

import germansentiment
import torch

# Initialize the model
model = germansentiment.SentimentModel()

# Dummy input that matches the input dimensions of the model
dummy_input = torch.randint(0, 30_000, (1, 512), dtype=torch.long)

# Export to ONNX
torch.onnx.export(model.model, dummy_input, "german_sentiment_model.onnx")

# Export the vocab
with open("vocab.json", "w") as f:
    json.dump(model.tokenizer.vocab, f)
```

Then I used the model in Elixir:

```elixir
{model, params} = AxonOnnx.import("./models/models/german_sentiment_model.onnx")

{:ok, vocab_string} = File.read("./models/models/vocab.json")
{:ok, vocab_map} = Jason.decode(vocab_string)

# Tokenize
input_text = "Ein schlechter Film"
token_list = Enum.map(String.split(input_text, " "), fn x -> vocab_map[x] end)
token_tensor = Nx.tensor(List.duplicate(0, 512 - length(token_list)))
token_tensor = Nx.concatenate([Nx.tensor(token_list), token_tensor])

{init_fn, predict_fn} = Axon.build(model)

predict_fn.(params, token_tensor)
```

But I still have some problems/questions:

  1. Is this correct?
  2. Why do some keys in the vocab.json start with `##`?
  3. Why are some keys named `["unused{x}"]`?
  4. Why do the predictions not scale from 0 to 1, but are signed floats?
  5. Why do some strings not work in my version? The string "Ein scheiß Film" works on Hugging Face but not in the export.
  6. Why are some keys in capital letters, while the text is always converted to lower case?

About 4): I currently scale the prediction as follows:

```elixir
prediction = predict_fn.(params, token_tensor)
one_hot = Nx.divide(Nx.pow(2, prediction), Nx.sum(Nx.pow(2, prediction)))
political_score = 5 * (Nx.to_number(one_hot[0][0]) - Nx.to_number(one_hot[0][1]))
```

About 5): In my version above, keys that are not matched return nil. I changed that to 0, but that changes the meaning of the sentence.

I opened a question in the Elixir Forum about it here.
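
One way to address points 1, 4 and 5 might be to use the model's own WordPiece tokenizer (which also explains point 2: `##` keys mark subword continuations) and to pass the attention mask alongside the input IDs. The raw outputs are logits, so softmax maps them to values between 0 and 1 (point 4). A sketch under these assumptions, using standard `torch.onnx.export` options rather than anything from this repository:

```python
import torch
from germansentiment import SentimentModel

model = SentimentModel()
encoded = model.tokenizer(
    ["Ein scheiß Film"], padding="max_length", truncation=True,
    max_length=512, return_tensors="pt",
)

# Export with both input_ids and attention_mask so that padding is masked out
torch.onnx.export(
    model.model,
    (encoded["input_ids"], encoded["attention_mask"]),
    "german_sentiment_model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={"input_ids": {0: "batch"}, "attention_mask": {0: "batch"}},
)

# The model returns logits; softmax turns them into probabilities in [0, 1]
with torch.no_grad():
    logits = model.model(**encoded)[0]
print(torch.softmax(logits, dim=1))
```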


Question: Text cleaning for BERT and FastText

Hi there,

I have a question about the way the data was cleaned for both models.
I guess that before training the FastText model, all these cleaners were used: https://github.com/oliverguhr/german-sentiment/blob/master/fasttext/textcleaner.py#L66

This is why, for example, the FastText model doesn't "understand" emojis, while the BERT model does:
https://huggingface.co/oliverguhr/german-sentiment-bert?text=%F0%9F%98%A1

In the BERT folder, I don't see similar cleaners, but at the same time, the minimal example on the Hugging Face hub https://huggingface.co/oliverguhr/german-sentiment-bert?#a-minimal-working-sample shows a subset of the FastText cleaners/preprocessors.

Could you please clarify which cleaners are recommended/have been used for the preprocessing of the BERT train data?

Attention mask in the example

Hi there

There seems to be a tiny mistake in the minimal working sample on the Hugging Face model page. You are only passing the input IDs to the model and not the attention mask. That means all padding tokens receive attention when they should not, which can skew the results significantly, as my testing shows.

Here is a working version where all the return values of the tokenizer are given to the model, with the expected results.

```python
import re
from typing import List

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer


class SentimentModel():
    def __init__(self, model_name: str):
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)

        self.clean_chars = re.compile(r'[^A-Za-züöäÖÜÄß ]', re.MULTILINE)
        self.clean_http_urls = re.compile(r'https*\S+', re.MULTILINE)
        self.clean_at_mentions = re.compile(r'@\S+', re.MULTILINE)

    def predict_sentiment(self, texts: List[str]) -> List[str]:
        texts = [self.clean_text(text) for text in texts]
        # Pad, truncate where necessary, and return as tensors
        encoded = self.tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

        with torch.no_grad():
            logits = self.model(**encoded)

        # Get the highest scoring label IDs and convert to labels
        label_ids = torch.argmax(logits[0], dim=1)
        return [self.model.config.id2label[label_id.item()] for label_id in label_ids]

    def replace_numbers(self, text: str) -> str:
        return text.replace("0", " null").replace("1", " eins").replace("2", " zwei").replace("3", " drei").replace("4", " vier").replace("5", " fünf").replace("6", " sechs").replace("7", " sieben").replace("8", " acht").replace("9", " neun")

    def clean_text(self, text: str) -> str:
        text = text.replace("\n", " ")
        text = self.clean_http_urls.sub('', text)
        text = self.clean_at_mentions.sub('', text)
        text = self.replace_numbers(text)
        text = self.clean_chars.sub('', text)  # use only text chars
        text = ' '.join(text.split())  # substitute multiple whitespace with single whitespace
        text = text.strip().lower()
        return text
```
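
For completeness, a usage sketch of this class (the model name is the one published on Hugging Face; the printed labels are what one would expect, not a verified output):

```python
model = SentimentModel("oliverguhr/german-sentiment-bert")
print(model.predict_sentiment(["Das war super!", "Das war furchtbar."]))
# expected along the lines of: ['positive', 'negative']
```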

Wrong paper URL

Thanks for the package. It is very fast and a good idea. Also, thanks for actually delivering code with a scientific paper. Unfortunately, the URL to the paper seems to be wrong, both here (DOI, 404) and in the package manager (wrong paper). Maybe you want to change this; that would make it easier to read up on the background.

Wrong sentiment on some emojis

Some emojis are classified with the wrong sentiment:
😗 Negative
😘 Negative
😙 Negative
😚 Negative
😛 Negative
😜 Negative
😝 Negative
🙂 Negative
🙏 Negative
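
For reproduction, a sketch that queries the model directly, bypassing any text cleaning (the model id is the one published on Hugging Face):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("oliverguhr/german-sentiment-bert")
model = AutoModelForSequenceClassification.from_pretrained("oliverguhr/german-sentiment-bert")

encoded = tokenizer(["🙂", "😘", "🙏"], padding=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**encoded)[0]
for label_id in torch.argmax(logits, dim=1):
    print(model.config.id2label[label_id.item()])
```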

Feedback regarding case

Hi,
I just had a look at your training data and wanted to give some feedback.
You have all texts lowercased, but the language model you used is case-sensitive.

See tokenizer_config.json:

{"do_lower_case": false, "model_max_length": 512, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}

IMO this will reduce quality because "Dummkopf" != "dummkopf" for the pretrained language model.
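
The difference is easy to inspect with the tokenizer itself (a sketch; the exact subword splits depend on the vocabulary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("oliverguhr/german-sentiment-bert")

# With do_lower_case = false, the cased and lowercased forms tokenize differently
print(tokenizer.tokenize("Dummkopf"))
print(tokenizer.tokenize("dummkopf"))
```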

Small error that breaks example code on Hugging Face model card

Hi Oliver

First of all – thanks very much for making your model available. Really helpful and very much appreciated.

I noticed a small quirk that broke your sample code on Hugging Face for me:

[Screenshot: sample code from the Hugging Face model card, 2021-11-17]

In the function clean_text you start by replacing \n with spaces. If I just copy and paste the code, I don't copy the invisible \n but get an actual line break in my IDE (Sublime). This leads to wrong predictions.

If I replace the line break with an actual \n, I get correct predictions. It might be helpful to fix that in the sample code if possible.
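
For illustration, the two variants (the second is shown commented out because a literal line break inside a string is not valid Python):

```python
text = "erste Zeile\nzweite Zeile"

# Intended code: an escaped newline inside the string literal
text = text.replace("\n", " ")
print(text)  # 'erste Zeile zweite Zeile'

# What copy-pasting the rendered page can produce: an actual line break
# inside the literal, which breaks the code
# text = text.replace("
# ", " ")
```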

Again – thanks for your work. 👍

Have a great day!
