oliverguhr / german-sentiment

A data set and model for german sentiment classification.

License: MIT License

Python 97.04% Shell 2.96%
sentiment-analysis sentiment-classification german-language transformer bert-model fasttext machine-learning deep-learning

german-sentiment's People

Contributors

dependabot[bot], oliverguhr

german-sentiment's Issues

Export to ONNX

Could I add some ONNX export version?

My current attempt is:

import json

import torch
import germansentiment

# Initialize the model
model = germansentiment.SentimentModel()

# Dummy input that matches the input dimensions of the model
dummy_input = torch.randint(0, 30_000, (1, 512), dtype=torch.long)

# Export to ONNX
torch.onnx.export(model.model, dummy_input, "german_sentiment_model.onnx")

# Export the vocab
with open('vocab.json', 'w') as f:
    json.dump(model.tokenizer.vocab, f)

Then I used the model in Elixir:

{model, params} = AxonOnnx.import("./models/models/german_sentiment_model.onnx")

{:ok, vocab_string} = File.read("./models/models/vocab.json")
{:ok, vocab_map} = Jason.decode(vocab_string)

# Tokenize
input_text = "Ein schlechter Film"
token_list = Enum.map(String.split(input_text, " "), fn x -> vocab_map[x] end)
token_tensor = Nx.tensor(List.duplicate(0, 512 - length(token_list)))
token_tensor = Nx.concatenate([Nx.tensor(token_list), token_tensor])

{init_fn, predict_fn} = Axon.build(model)

predict_fn.(params, token_tensor)

But I still have some problems/questions:

  1. Is this correct?
  2. Why do some keys in vocab.json start with ##?
  3. Why are some keys named ["unused{x}"]?
  4. Why are the predictions signed floats instead of values scaled from 0 to 1?
  5. Why do some strings not work in my version? The string "Ein scheiß Film" works on Hugging Face but not in the export.
  6. Why are some keys capitalized when the text is always converted to lowercase?

About 4): I currently scale the prediction as follows:

prediction = predict_fn.(params, token_tensor)
one_hot = Nx.divide(Nx.pow(2, prediction), Nx.sum(Nx.pow(2, prediction)))
poltical_score = 5*(Nx.to_number(one_hot[0][0]) - Nx.to_number(one_hot[0][1])) 
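
For comparison, the signed floats are presumably raw logits, and the usual way to turn logits into values between 0 and 1 is a softmax with base e rather than base 2. The following is a minimal sketch in plain Python; the logit values are made up for illustration, and which index belongs to which label is an assumption that should be read from the model's id2label config.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing on large logits.
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


# Made-up logits for one sentence (three classes); not real model output.
example_logits = [2.0, -1.0, 0.5]
probabilities = softmax(example_logits)
print(probabilities)
```

The ranking of the classes is the same as with the base-2 variant; only the spread of the resulting values differs.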

About 5): In my version above, keys that are not matched return nil. I changed that to 0, but that changes the meaning of the sentence.
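
Regarding questions 2 and 5: BERT-style vocabularies are built for WordPiece tokenization, where a key starting with ## is a piece that continues a word instead of starting one. Splitting on spaces and looking up whole words therefore misses every word that only exists as a sequence of pieces. The sketch below shows the greedy longest-match idea on a tiny, hypothetical vocabulary (toy_vocab is not the real vocab.json); the real tokenizer additionally handles punctuation and adds special tokens such as [CLS] and [SEP].

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into WordPiece tokens."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word maps to the unknown token.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical miniature vocabulary, for illustration only.
toy_vocab = {"[UNK]", "ein", "film", "schlecht", "##er"}
tokens = [
    piece
    for word in "ein schlechter film".split()
    for piece in wordpiece(word, toy_vocab)
]
print(tokens)
```

With this approach an unmatched word becomes [UNK] instead of nil or 0, so it does not get confused with the padding token.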

I opened a question in the Elixir Forum about it here.

Different results in downloaded model compared to Hugging face API

Hello Oliver Guhr,

First of all, thank you for your great work.

After finding your model on Hugging Face, I tested it. I also used the german-sentiment-lib.
I found some anomalies when I was working locally, so I tested the same sentences using the Hugging Face API.

The Hugging Face API gives far better results than the downloaded model.
I spotted the anomalies while testing the following sentences.

Wie sicher ist das?
Im Ausland ist das anders.
Jeder kann zum Ziel werden.
Das soll mit dem Update der Apps ab Januar funktionieren.
Wie funktioniert das technisch?
Wie sicher ist das?
Irgend etwas passiert hier.
Worum geht es dabei?

My questions are:
Are the hosted model and the downloaded model the same in every aspect and configuration?
If they differ, could you let me know what is different in the hosted API, and how I can improve the downloaded model's performance or reproduce the API's results?

Thank you for your time.

Wrong sentiment on some emojis

Some emoticons have the wrong sentiment.
😗 Negative
😘 Negative
😙 Negative
😚 Negative
😛 Negative
😜 Negative
๐Ÿ˜ Negative
🙂 Negative
๐Ÿ™ Negative

Feedback regarding case

Hi,
I just had a look at your training data and wanted to give some feedback.
All your texts are lowercased, but the language model you used is case sensitive.

See tokenizer_config.json:

{"do_lower_case": false, "model_max_length": 512, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}

IMO this will reduce quality because "Dummkopf" != "dummkopf" for the pretrained language model.
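
A minimal sketch of the effect, using a hypothetical two-entry cased vocabulary (the real tokenizer would fall back to subword pieces rather than a plain lookup, but the mismatch is the same in spirit):

```python
# Hypothetical cased vocabulary, for illustration only.
toy_cased_vocab = {"[UNK]": 0, "Dummkopf": 1}


def lookup(token, vocab):
    """Return the token id, falling back to the unknown token."""
    return vocab.get(token, vocab["[UNK]"])


original_id = lookup("Dummkopf", toy_cased_vocab)
lowered_id = lookup("Dummkopf".lower(), toy_cased_vocab)
print(original_id, lowered_id)
```

The capitalized form hits the entry the pretrained model knows, while the lowercased form does not.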

Small error that breaks example code on Hugging Face model card

Hi Oliver

First of all, thanks very much for making your model available. Really helpful and very much appreciated.

I noticed a small quirk that broke your sample code on Hugging Face for me:

[Screenshot, 2021-11-17]

In the function clean_text you start by replacing \n with spaces. If I just copy and paste the code I don't copy the invisible \n but get an actual line break in my IDE (Sublime). This leads to wrong predictions.

If I replace the line break with an actual \n I get correct predictions. It might be helpful to fix that in the sample code if possible.
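
One way to make the sample robust against this copy-and-paste problem is to avoid the "\n" literal altogether. This is only a sketch (normalize_whitespace is a hypothetical helper name, not part of the model card); it relies on str.split() without arguments splitting on any whitespace, including line breaks.

```python
def normalize_whitespace(text):
    """Collapse every run of whitespace, including line breaks, to one space."""
    return " ".join(text.split())


# The same text written once with an escape and once with a real line break.
escaped = "Das ist gut.\nDas ist schlecht."
pasted = """Das ist gut.
Das ist schlecht."""

print(normalize_whitespace(escaped))
print(normalize_whitespace(pasted))
```

Because no escape sequence appears in the function itself, nothing can be lost when the code is rendered on a web page and copied from there.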

Again, thanks for your work.

Have a great day!

Question: Text cleaning for BERT and FastText

Hi there,

I have a question connected to the way the data was cleaned for both models.
I guess that before training the FastText model, all of these cleaners were used: https://github.com/oliverguhr/german-sentiment/blob/master/fasttext/textcleaner.py#L66

This is why, for example, the FastText model doesn't "understand" emojis, while the BERT model does:
https://huggingface.co/oliverguhr/german-sentiment-bert?text=%F0%9F%98%A1

In the BERT folder, I don't see similar cleaners, but at the same time, the minimal example on the Hugging Face hub https://huggingface.co/oliverguhr/german-sentiment-bert?#a-minimal-working-sample shows a subset of the FastText cleaners/preprocessors.

Could you please clarify which cleaners are recommended/have been used for the preprocessing of the BERT train data?

Attention mask in the example

Hi there

There seems to be a tiny mistake in the minimal working sample on the model page of Hugging Face. You are only passing the input IDs to the model but you do not include the attention mask. That means that all the padding tokens are receiving attention when they should not. This can skew your results significantly, as my testing shows.

Here is a working version where all the return values of the tokenizer are given to the model, with the expected results.

import re
from typing import List

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer


class SentimentModel():
    def __init__(self, model_name: str):
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)

        self.clean_chars = re.compile(r'[^A-Za-züöäÖÜÄß ]', re.MULTILINE)
        self.clean_http_urls = re.compile(r'https*\S+', re.MULTILINE)
        self.clean_at_mentions = re.compile(r'@\S+', re.MULTILINE)

    def predict_sentiment(self, texts: List[str]) -> List[str]:
        texts = [self.clean_text(text) for text in texts]
        # Pad, truncate where necessary, and return as tensors.
        # The result holds input_ids AND attention_mask.
        encoded = self.tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

        with torch.no_grad():
            # Pass every tokenizer output to the model, not just the input IDs
            outputs = self.model(**encoded)

        # Get the highest scoring label IDs and convert to labels
        label_ids = torch.argmax(outputs.logits, dim=1)
        return [self.model.config.id2label[label_id.item()] for label_id in label_ids]

    def replace_numbers(self, text: str) -> str:
        return text.replace("0"," null").replace("1"," eins").replace("2"," zwei").replace("3"," drei").replace("4"," vier").replace("5"," fünf").replace("6"," sechs").replace("7"," sieben").replace("8"," acht").replace("9"," neun")

    def clean_text(self, text: str) -> str:
        text = text.replace("\n", " ")
        text = self.clean_http_urls.sub('', text)
        text = self.clean_at_mentions.sub('', text)
        text = self.replace_numbers(text)
        text = self.clean_chars.sub('', text)  # use only text chars
        text = ' '.join(text.split())  # substitute multiple whitespace with single whitespace
        text = text.strip().lower()
        return text
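
To illustrate why the mask matters, here is a simplified sketch of what an attention mask does to a single row of attention scores. The scores and the mask are made-up values, and real frameworks typically add a very large negative number instead of using negative infinity; the sketch also assumes at least one unmasked position.

```python
import math


def masked_softmax(scores, mask):
    """Softmax over scores, giving zero weight to positions where mask is 0."""
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    peak = max(masked)  # assumes at least one position is unmasked
    exps = [math.exp(s - peak) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


# Two real tokens followed by two padding tokens (illustrative values).
scores = [1.0, 2.0, 0.5, 0.5]
attention_mask = [1, 1, 0, 0]

without_mask = masked_softmax(scores, [1, 1, 1, 1])
with_mask = masked_softmax(scores, attention_mask)
print(without_mask)
print(with_mask)
```

Without the mask the padding positions take a share of the attention weight, which grows as shorter sentences are padded to match longer ones in the same batch.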

Wrong paper URL

Thanks for the package. It is very fast and a good idea. Also thanks that code is actually delivered for a scientific paper. Unfortunately the URL to the paper seems to be wrong, both here (DOI, 404) and in the package manager (wrong paper). Maybe you want to change this, then it is easier to read the background.

Possible kernel crashing - Jupyter Notebooks

Hi Oliver, sorry, I couldn't see another way to directly contact you. I am trying to run germansentiment 1.1.0 in a conda environment and a jupyter lab notebook. I am using the default ipykernel (Python 3). I have run pip install germansentiment in the conda env, and indeed it has installed correctly to

/opt/anaconda3/envs/my_env/lib/python3.9/site-packages/germansentiment/

However doing the import causes a kernel crash every time: from germansentiment import SentimentModel. I see I am not alone in this error at least: https://stackoverflow.com/questions/72396420/kernel-keeps-dying-while-using-bert-based-sentiment-analysis-model

Is there a special trick required to get it working in Conda / Jupyter? Or is it just something fishy about my setup?

germansentiment not working in conda / jupyter (?)

Hi Oliver, sorry, I couldn't see another way to directly contact you. I am trying to run germansentiment 1.1.0 in a conda environment and a jupyter lab notebook. I am using the default ipykernel (Python 3). I have run pip install germansentiment in the conda env, and indeed it has installed correctly to

/opt/anaconda3/envs/my_env/lib/python3.9/site-packages/germansentiment/

However doing the import causes a kernel crash every time: from germansentiment import SentimentModel. I see I am not alone in this error at least: https://stackoverflow.com/questions/72396420/kernel-keeps-dying-while-using-bert-based-sentiment-analysis-model

Note that I also tested it in a standard python env and it worked just fine. Is there a special trick required to get it working in Conda / Jupyter? Or is it just something fishy about my setup? I'm on MacOSX.
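
A first diagnostic step for this kind of setup, sketched under the assumption that the notebook kernel might not be running the interpreter of the conda environment: print which interpreter the kernel uses and where the package would be loaded from, without actually importing it (so the check itself cannot trigger the crash). describe_environment is a hypothetical helper name.

```python
import importlib.util
import sys


def describe_environment(package_name):
    """Report the running interpreter and where a package would load from."""
    # find_spec locates a top-level package without importing it.
    spec = importlib.util.find_spec(package_name)
    return {
        "interpreter": sys.executable,
        "package_location": spec.origin if spec is not None else None,
    }


info = describe_environment("germansentiment")
print(info)
```

If the interpreter path does not point into the conda environment shown above, the kernel is using a different Python installation than the one pip installed into.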
