oliverguhr / german-sentiment
A data set and model for German sentiment classification.
License: MIT License
Could I add an ONNX export version?
My current attempt is:
import json

import germansentiment
import torch

# Initialize the model
model = germansentiment.SentimentModel()
# Dummy input that matches the input dimensions of the model
dummy_input = torch.randint(0, 30_000, (1, 512), dtype=torch.long)
# Export to ONNX
torch.onnx.export(model.model, dummy_input, "german_sentiment_model.onnx")
# Export the vocab
with open('vocab.json', 'w') as f:
    json.dump(model.tokenizer.vocab, f)
Then I used the model in Elixir:
{model, params} = AxonOnnx.import("./models/models/german_sentiment_model.onnx")
{:ok, vocab_string} = File.read("./models/models/vocab.json")
{:ok, vocab_map} = Jason.decode(vocab_string)
# Tokenize
input_text = "Ein schlechter Film"
token_list = Enum.map(String.split(input_text, " "), fn x -> vocab_map[x] end)
token_tensor = Nx.tensor(List.duplicate(0, 512 - length(token_list)))
token_tensor = Nx.concatenate([Nx.tensor(token_list), token_tensor])
{init_fn, predict_fn} = Axon.build(model)
predict_fn.(params, token_tensor)
But I still have some problems/questions:
##
About 4): I currently scale the prediction as follows:
prediction = predict_fn.(params, token_tensor)
one_hot = Nx.divide(Nx.pow(2, prediction), Nx.sum(Nx.pow(2, prediction)))
poltical_score = 5*(Nx.to_number(one_hot[0][0]) - Nx.to_number(one_hot[0][1]))
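The scaling above normalizes with base 2, whereas the usual softmax uses base e. A minimal Python sketch of the conventional version follows, for comparison only. The helper name `polarity_score`, the example logits, and the choice to contrast class 0 with class 1 (mirroring the indexing in the snippet above) are assumptions, not something the model documentation confirms here.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1 (base e)."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def polarity_score(logits, scale=5.0):
    """Hypothetical helper: scaled difference between class 0 and class 1."""
    probs = softmax(logits)
    return scale * (probs[0] - probs[1])


example_logits = [2.0, -1.0, 0.5]  # made-up values, not real model output
probs = softmax(example_logits)
score = polarity_score(example_logits)
```

Both bases give a valid probability distribution, but only base e matches the softmax that classification heads are normally trained with, so the base-2 scores will be flatter than the model's own probabilities.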
About 5): In my version above, keys that are not matched return nil. I changed that to 0, but that changes the meaning of the sentence.
I opened a question in the Elixir Forum about it here.
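A likely reason for the missing keys is that BERT-style vocabularies store sub-word pieces, so splitting on spaces and looking up whole words will miss many entries. The Python sketch below shows greedy longest-match WordPiece lookup with a toy vocabulary. The vocabulary, its ids, and the helper names are hypothetical; a real tokenizer also splits punctuation and applies further normalization that is left out here.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word maps to [UNK]
        pieces.append(match)
        start = end
    return pieces


def encode(text, vocab, max_len=8):
    """Return (input_ids, attention_mask), padded to max_len."""
    body = []
    for word in text.split():
        body.extend(wordpiece(word, vocab))
    tokens = ["[CLS]"] + body[: max_len - 2] + ["[SEP]"]
    ids = [vocab[t] for t in tokens]
    pad = max_len - len(ids)
    return ids + [vocab["[PAD]"]] * pad, [1] * len(ids) + [0] * pad


# Toy vocabulary with made-up ids, for illustration only
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "Ein": 4, "schlecht": 5, "##er": 6, "Film": 7}

input_ids, attention_mask = encode("Ein schlechter Film xyz", toy_vocab)
```

In this sketch an unknown word becomes `[UNK]` rather than id 0, which in BERT-style vocabularies is typically the padding token, and the sequence is wrapped in `[CLS]` and `[SEP]` as the model expects.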
Hello Oliver Guhr,
First of all, thank you for your great work.
After finding your model on Hugging Face, I tested it. I also used the german-sentiment-lib.
I found some anomalies when working locally, so I tested the same sentences using the Hugging Face API.
The Hugging Face API gives far better results than the downloaded model.
I spotted the anomalies while testing the following sentences.
Wie sicher ist das?
Im Ausland ist das anders.
Jeder kann zum Ziel werden.
Das soll mit dem Update der Apps ab Januar funktionieren.
Wie funktioniert das technisch?
Wie sicher ist das?
Irgend etwas passiert hier.
Worum geht es dabei?
My questions are:
Are the hosted models and the downloaded models the same in every aspect and configuration?
If they are different, can you let me know what differs in the hosted API, and how I can improve the downloaded model's performance or reproduce the API's results?
Thank you for your time.
IMO your example on the HF model card is missing a model.eval() call. From the HF docs:
- Set the model in evaluation mode to deactivate the DropOut modules
- This is IMPORTANT to have reproducible results during evaluation!
See here: https://huggingface.co/oliverguhr/german-sentiment-bert#a-minimal-working-sample
What do you think?
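To illustrate why evaluation mode matters, here is a toy dropout layer in plain Python. It is a stand-in for the idea only, not the PyTorch implementation; the class and its `training` flag are written for this sketch.

```python
import random


class Dropout:
    """Toy dropout: zeroes values at random while training, identity in eval mode."""

    def __init__(self, p=0.5):
        self.p = p
        self.training = True

    def eval(self):
        self.training = False
        return self

    def __call__(self, values):
        if not self.training:
            return list(values)  # evaluation mode: deterministic pass-through
        keep = 1.0 - self.p
        # training mode: drop each value with probability p, rescale the rest
        return [v / keep if random.random() < keep else 0.0 for v in values]


layer = Dropout(p=0.5)
activations = [1.0, 2.0, 3.0, 4.0]

train_out = layer(activations)  # may differ from call to call
layer.eval()
eval_a = layer(activations)
eval_b = layer(activations)     # identical to eval_a
```

As far as I know, `from_pretrained` in the transformers library already returns the model in evaluation mode, so an explicit `model.eval()` mainly guards against an earlier `train()` call; it is still a cheap way to make the sample unambiguous.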
Some emoticons have the wrong sentiment.
(nine emoji, unrecoverable from the source encoding, each classified as Negative)
Hi,
I just had a look on your training data and just wanted to give feedback.
All your texts are lowercased, but the language model you used is case-sensitive.
See tokenizer_config.json:
{"do_lower_case": false, "model_max_length": 512, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
IMO this will reduce quality because "Dummkopf" != "dummkopf" for the pretrained language model.
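A small Python sketch of the effect, using a toy cased vocabulary. The entries, ids, and the `unk_rate` helper are hypothetical. A real WordPiece tokenizer would usually fall back to smaller sub-word pieces before resorting to `[UNK]`, so the real effect is softer than shown, but the mismatch with what the model saw during pretraining is the same in kind.

```python
# Toy cased vocabulary with made-up ids, for illustration only
cased_vocab = {"[UNK]": 0, "Dummkopf": 1, "Film": 2, "schlecht": 3}


def lookup(word, vocab):
    """Whole-word lookup that falls back to the unknown token."""
    return vocab.get(word, vocab["[UNK]"])


def unk_rate(text, vocab):
    """Hypothetical helper: share of words that miss the vocabulary."""
    words = text.split()
    misses = sum(1 for w in words if lookup(w, vocab) == vocab["[UNK]"])
    return misses / len(words)


original = "Dummkopf Film schlecht"
rate_cased = unk_rate(original, cased_vocab)          # every word is found
rate_lower = unk_rate(original.lower(), cased_vocab)  # capitalized nouns now miss
```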
Hi Oliver
First of all, thanks very much for making your model available. Really helpful and very much appreciated.
I noticed a small quirk that broke your sample code on Hugging Face for me:
In the function clean_text you start by replacing \n with spaces. If I just copy and paste the code, I don't copy the invisible \n but get an actual line break in my IDE (Sublime). This leads to wrong predictions.
If I replace the line break with an actual \n I get correct predictions. It might be helpful to fix that in the sample code if possible.
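One way to make the cleaning step independent of how the `\n` escape survives copy and paste is to collapse whitespace with `str.split()` before any character filter runs. The Python sketch below uses a simplified letters-only filter as a stand-in for the sample's regex; both function names are made up for this illustration.

```python
import re

# Simplified stand-in for the sample's character filter (ASCII letters and spaces only)
letters_only = re.compile(r"[^A-Za-z ]")


def clean(text: str) -> str:
    """Collapse all whitespace first, then drop non-letter characters."""
    text = " ".join(text.split())  # split() with no argument handles \n, \r\n, tabs
    return letters_only.sub("", text)


def clean_without_newline_step(text: str) -> str:
    """What happens if the newline replacement is lost: line breaks get deleted."""
    return letters_only.sub("", text)


sample = "Das ist\nein guter Film"
good = clean(sample)
bad = clean_without_newline_step(sample)
```

Without the whitespace step the filter deletes the line break and glues two words together ("istein"), which is one plausible explanation for the wrong predictions described above.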
Again, thanks for your work.
Have a great day!
Hi there,
I have a question connected to the way the data was cleaned for both models.
I guess before training for the FastText model, all these cleaners have been used: https://github.com/oliverguhr/german-sentiment/blob/master/fasttext/textcleaner.py#L66
This is why, for example, the FastText model doesn't "understand" emojis, while the BERT model does:
https://huggingface.co/oliverguhr/german-sentiment-bert?text=%F0%9F%98%A1
In the BERT folder, I don't see similar cleaners, but at the same time, the minimal example on the Hugging Face hub https://huggingface.co/oliverguhr/german-sentiment-bert?#a-minimal-working-sample shows a subset of the FastText cleaners/preprocessors.
Could you please clarify which cleaners are recommended/have been used for the preprocessing of the BERT train data?
Hi there
There seems to be a tiny mistake in the minimal working sample on the model page of Hugging Face. You are only passing the input IDs to the model but you do not include the attention mask. That means that all the padding tokens are receiving attention when they should not. This can skew your results significantly, as my testing shows.
Here is a working version where all the return values of the tokenizer are given to the model, with the expected results.
import re
from typing import List

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer


class SentimentModel():
    def __init__(self, model_name: str):
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        # Keep only letters (including German umlauts and ß) and spaces
        self.clean_chars = re.compile(r'[^A-Za-züöäÖÜÄß ]', re.MULTILINE)
        self.clean_http_urls = re.compile(r'https*\S+', re.MULTILINE)
        self.clean_at_mentions = re.compile(r'@\S+', re.MULTILINE)

    def predict_sentiment(self, texts: List[str]) -> List[str]:
        texts = [self.clean_text(text) for text in texts]
        # Pad, truncate where necessary, and return as tensors
        encoded = self.tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            # Passing every tokenizer output includes the attention mask
            outputs = self.model(**encoded)
        # Get the highest scoring label IDs and convert to labels
        label_ids = torch.argmax(outputs.logits, dim=1)
        return [self.model.config.id2label[label_id.item()] for label_id in label_ids]

    def replace_numbers(self, text: str) -> str:
        return text.replace("0", " null").replace("1", " eins").replace("2", " zwei").replace("3", " drei").replace("4", " vier").replace("5", " fünf").replace("6", " sechs").replace("7", " sieben").replace("8", " acht").replace("9", " neun")

    def clean_text(self, text: str) -> str:
        text = text.replace("\n", " ")
        text = self.clean_http_urls.sub('', text)
        text = self.clean_at_mentions.sub('', text)
        text = self.replace_numbers(text)
        text = self.clean_chars.sub('', text)  # use only text chars
        text = ' '.join(text.split())  # substitute multiple whitespace with single whitespace
        text = text.strip().lower()
        return text
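To see why the attention mask matters, the following plain-Python sketch computes attention weights for one query over four positions, the last two of which stand for padding. The scores are made-up numbers and `attention_weights` is a helper written for this illustration; real implementations typically add a large negative value to masked scores rather than using negative infinity.

```python
import math


def attention_weights(scores, mask=None):
    """Softmax over attention scores; positions with mask 0 receive zero weight."""
    if mask is None:
        mask = [1] * len(scores)
    masked = [s if m else float("-inf") for s, m in zip(scores, mask)]
    peak = max(masked)  # assumes at least one position is unmasked
    exps = [math.exp(s - peak) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


scores = [1.0, 2.0, 0.5, 0.5]  # made-up scores; the last two are padding positions
mask = [1, 1, 0, 0]

without_mask = attention_weights(scores)      # padding takes a share of the attention
with_mask = attention_weights(scores, mask)   # padding is ignored
```

Without the mask, part of the attention goes to padding tokens, and that share grows with the amount of padding in a batch, so the same sentence can score differently depending on what it is batched with.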
Thanks for the package. It is very fast and a good idea. Also thanks that code is actually delivered for a scientific paper. Unfortunately the URL to the paper seems to be wrong, both here (DOI, 404) and in the package manager (wrong paper). Maybe you want to change this, then it is easier to read the background.
Hi Oliver, sorry, I couldn't see another way to directly contact you. I am trying to run germansentiment 1.1.0 in a conda environment and a jupyter lab notebook. I am using the default ipykernel (Python 3). I have run pip install germansentiment in the conda env, and indeed it has installed correctly to
/opt/anaconda3/envs/my_env/lib/python3.9/site-packages/germansentiment/
However doing the import causes a kernel crash every time: from germansentiment import SentimentModel. I see I am not alone in this error at least: https://stackoverflow.com/questions/72396420/kernel-keeps-dying-while-using-bert-based-sentiment-analysis-model
Note that I also tested it in a standard python env and it worked just fine. Is there a special trick required to get it working in Conda / Jupyter? Or is it just something fishy about my setup? I'm on MacOSX.
Hi,
According to your paper, you have compared the BERT model to a FastText model, the latter of which I would like to download and test.
Would you be willing to update the link with the FastText model?
I suppose https://www2.htw-dresden.de/~guhr/dist/sentiment/models.zip needs to be updated.
Thanks in advance,
Best
Hi @oliverguhr ,
I would like to train and open source a German sentiment model based on our new German Electra Language Model:
https://huggingface.co/german-nlp-group/electra-base-german-uncased
Would you share your preprocessed sentiment training data if I release the work under open source? That would be awesome!
Sincerely,
Philip