cardiffnlp / xlm-t
Repository for XLM-T, a framework for evaluating multilingual language models on Twitter data
License: Apache License 2.0
How can I get all three possible labels with their probabilities in the output?
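A minimal sketch of one way to do this, assuming the cardiffnlp/twitter-xlm-roberta-base-sentiment checkpoint and applying a softmax over the logits (an illustration, not necessarily the authors' recommended approach):

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_path = "cardiffnlp/twitter-xlm-roberta-base-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)

encoded_input = tokenizer("I love this!", return_tensors="pt")
with torch.no_grad():
    logits = model(**encoded_input).logits

# Softmax turns the raw logits into probabilities over all labels
probs = torch.softmax(logits, dim=-1)[0]
for idx, p in enumerate(probs):
    # id2label maps class indices to label names (e.g. negative/neutral/positive)
    print(model.config.id2label[idx], round(p.item(), 4))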
I followed the steps to run the model on a dataset with 5k tweets. However, when running 'output = model(**encoded_input)', Colab Pro+ always crashes. Is there any solution to this issue? Thanks a lot!
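Encoding all 5k tweets and passing them to the model in a single call can exhaust RAM/GPU memory. A sketch of batched inference, assuming the sentiment checkpoint and a list of tweet strings (names here are illustrative):

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_path = "cardiffnlp/twitter-xlm-roberta-base-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)
model.eval()

tweets = ["..."] * 5000  # placeholder for the 5k preprocessed tweets
batch_size = 32          # keeps peak memory small; tune for your runtime
all_probs = []

for i in range(0, len(tweets), batch_size):
    batch = tweets[i:i + batch_size]
    encoded_input = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():  # no gradients -> far less memory
        logits = model(**encoded_input).logits
    all_probs.append(torch.softmax(logits, dim=-1))

all_probs = torch.cat(all_probs)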
I believe this line may be a bug, as it overrides the global variable set earlier in the file:
xlm-t/src/adapter_finetuning.py
Line 63 in aa9b15e
Hello,
First, thanks for these great models! I was wondering if I could use these models for zero-shot classification, especially for emotion detection (Ekman). While doing so, I encountered this error regarding the config of the models and the missing entailment parameter, which seems mandatory for zero-shot classification:
Failed to determine 'entailment' label id from the label2id mapping in the model config. Setting to -1. Define a descriptive label2id mapping in the model config to ensure correct outputs.
Basic code to reproduce:
model_path = "cardiffnlp/twitter-xlm-roberta-base"
zero_shot = pipeline("zero-shot-classification", model=model_path, tokenizer=model_path,)
zero_shot("You are an awful man,", candidate_labels=["anger", "disgust", "fear", "joy", "sadness", "surprise"], multi_label=True,)
Indeed, it seems to lead to weird outputs compared to the XLM-Roberta XNLI model that can be tested here: https://huggingface.co/zero-shot/.
Is there something missing in head_config.json?
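For reference, a comparable call against an NLI-finetuned checkpoint (joeddav/xlm-roberta-large-xnli is assumed here as the model behind that demo) would look like:

from transformers import pipeline

# NLI-finetuned checkpoint (assumed); zero-shot classification expects an
# entailment/contradiction head rather than the base language model
xnli_zero_shot = pipeline("zero-shot-classification", model="joeddav/xlm-roberta-large-xnli")
xnli_zero_shot("You are an awful man,", candidate_labels=["anger", "disgust", "fear", "joy", "sadness", "surprise"], multi_label=True)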
Thanks!
Hello,
I was wondering if the preprocess function could be enhanced: right now, it strips punctuation before and after usernames/URLs. Or was this done on purpose? I couldn't find a justification for it in your paper.
Right now, the preprocess function below would convert:
I love you @louisia!!!!
to
I love you @user
# Preprocess text (username and link placeholders)
def preprocess(text):
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)
It seems to me that punctuation could help the model predict the sentiment of a tweet a little better if it were available to it. Another example: some users on Twitter start their tweets with a dot, like this:
.@rudy is really bad. What a shame.
They do that to avoid the reply system while still quoting a username. With the current pre-processing function, "@rudy" doesn't get replaced because there is a dot right before the @.
Is there any particular reason why the preprocessing function was done this way, or could we try to make it more flexible on our end by keeping the punctuation next to usernames or URLs? A possible sketch is below.
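For illustration, a regex-based variant that keeps surrounding punctuation might look like this (a sketch on my end, not the authors' method; names are illustrative):

import re

def preprocess_keep_punct(text):
    # Replace the handle itself but keep punctuation around it,
    # so ".@rudy is really bad." becomes ".@user is really bad."
    text = re.sub(r'@\w+', '@user', text)
    # Collapse links to a placeholder (note: \S+ also swallows punctuation
    # glued to the end of the URL, e.g. a trailing comma)
    text = re.sub(r'https?://\S+', 'http', text)
    return text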
Thank you!
I think monolingual models would also make a great contribution!
Ubuntu 16.04
adapter-transformers==1.1.1
NVIDIA-SMI 465.19.01, Driver Version: 465.19.01, CUDA Version: 11.3
GPU 0: NVIDIA GeForce (11019 MiB), 0% utilization, no running processes
When I run adapter_finetuning.py I get this error:
root@ubuntu:/home/project/xlm-t-main# python src/adapter_finetuning.py
Some weights of the model checkpoint at cardiffnlp/twitter-xlm-roberta-base were not used when initializing XLMRobertaModelWithHeads: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.decoder.bias']
Can anybody help?
The work on this is excellent. I've been using it for fine-tuning sentiment/topic models with great success!
One very nitpicky thing I noticed in the repo is that the README.md contains a link to https://huggingface.co/docs/transformers/model_doc/xlmroberta which is now broken. I believe that this should be https://huggingface.co/docs/transformers/model_doc/xlm-roberta.
Hi,
I'm using transformers version 4.26.1 on Databricks. The code below returns the following error:
Code:
model_path = "cardiffnlp/twitter-xlm-roberta-base-sentiment"
sentiment_task = pipeline("sentiment-analysis", model=model_path, tokenizer=model_path)
Error:
AttributeError: module 'google.protobuf.descriptor' has no attribute '_internal_create_key'
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<command-798773577171854> in <module>
15
16 model_path = "cardiffnlp/twitter-xlm-roberta-base-sentiment"
---> 17 sentiment_task = pipeline("sentiment-analysis", model=model_path, tokenizer=model_path)
18
19 df_group_notna['sentiment'] = df_group_notna['Feedback'].apply(lambda x: sentiment_task(x)[0]['label'])
/databricks/python/lib/python3.7/site-packages/transformers/pipelines/__init__.py in pipeline(task, model, config, tokenizer, feature_extractor, framework, revision, use_fast, use_auth_token, device, device_map, torch_dtype, trust_remote_code, model_kwargs, pipeline_class, **kwargs)
827
828 tokenizer = AutoTokenizer.from_pretrained(
--> 829 tokenizer_identifier, use_fast=use_fast, _from_pipeline=task, **hub_kwargs, **tokenizer_kwargs
830 )
831
/databricks/python/lib/python3.7/site-packages/transformers/models/auto/tokenization_auto.py in from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
674 tokenizer_class_py, tokenizer_class_fast = TOKENIZER_MAPPING[type(config)]
675 if tokenizer_class_fast and (use_fast or tokenizer_class_py is None):
--> 676 return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
677 else:
678 if tokenizer_class_py is not None:
/databricks/python/lib/python3.7/site-packages/transformers/tokenization_utils_base.py in from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs)
1811 local_files_only=local_files_only,
1812 _commit_hash=commit_hash,
-> 1813 **kwargs,
1814 )
1815
/databricks/python/lib/python3.7/site-packages/transformers/tokenization_utils_base.py in _from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, use_auth_token, cache_dir, local_files_only, _commit_hash, *init_inputs, **kwargs)
1957 # Instantiate tokenizer.
1958 try:
-> 1959 tokenizer = cls(*init_inputs, **init_kwargs)
1960 except OSError:
1961 raise OSError(
/databricks/python/lib/python3.7/site-packages/transformers/models/xlm_roberta/tokenization_xlm_roberta_fast.py in __init__(self, vocab_file, tokenizer_file, bos_token, eos_token, sep_token, cls_token, unk_token, pad_token, mask_token, **kwargs)
163 pad_token=pad_token,
164 mask_token=mask_token,
--> 165 **kwargs,
166 )
167
/databricks/python/lib/python3.7/site-packages/transformers/tokenization_utils_fast.py in __init__(self, *args, **kwargs)
112 elif slow_tokenizer is not None:
113 # We need to convert a slow tokenizer to build the backend
--> 114 fast_tokenizer = convert_slow_tokenizer(slow_tokenizer)
115 elif self.slow_tokenizer_class is not None:
116 # We need to create and convert a slow tokenizer to build the backend
/databricks/python/lib/python3.7/site-packages/transformers/convert_slow_tokenizer.py in convert_slow_tokenizer(transformer_tokenizer)
1160 converter_class = SLOW_TO_FAST_CONVERTERS[tokenizer_class_name]
1161
-> 1162 return converter_class(transformer_tokenizer).converted()
/databricks/python/lib/python3.7/site-packages/transformers/convert_slow_tokenizer.py in __init__(self, *args)
436 super().__init__(*args)
437
--> 438 from .utils import sentencepiece_model_pb2 as model_pb2
439
440 m = model_pb2.ModelProto()
/databricks/python/lib/python3.7/site-packages/transformers/utils/sentencepiece_model_pb2.py in <module>
32 syntax="proto2",
33 serialized_options=b"H\003",
---> 34 create_key=_descriptor._internal_create_key,
35 serialized_pb=(
36 b'\n\x19sentencepiece_model.proto\x12\rsentencepiece"\xa1\n\n\x0bTrainerSpec\x12\r\n\x05input\x18\x01'
AttributeError: module 'google.protobuf.descriptor' has no attribute '_internal_create_key'
Hi
xlm-t works really well on my dataset; however, to make it effective performance-wise, I was wondering if you are planning to release smaller pretrained or distilled models?
Best
Anup
The fine-tuning Colab notebook doesn't fully work.
The first cell in the Fine-tuning section errors out. I'm guessing it's probably due to specific versions of packages being required (which are not enforced in the first cell).
Also, as a very minor additional thing, this link is broken: "This notebook was modified from https://huggingface.co/transformers/custom_datasets.html".