
vinairesearch / bertweet

557 stars · 12 watchers · 50 forks · 137 KB

BERTweet: A pre-trained language model for English Tweets (EMNLP-2020)

License: MIT License

Python 100.00%
python3 bert roberta transformers fairseq language-model english part-of-speech-tagging ner named-entity-recognition

bertweet's Introduction

Table of contents

  1. Introduction
  2. Main results
  3. Using BERTweet with transformers
  4. Using BERTweet with fairseq

BERTweet: A pre-trained language model for English Tweets

BERTweet is the first public large-scale language model pre-trained for English Tweets. BERTweet is trained based on the RoBERTa pre-training procedure. The corpus used to pre-train BERTweet consists of 850M English Tweets (16B word tokens ~ 80GB), containing 845M Tweets streamed from 01/2012 to 08/2019 and 5M Tweets related to the COVID-19 pandemic. The general architecture and experimental results of BERTweet can be found in our paper:

@inproceedings{bertweet,
title     = {{BERTweet: A pre-trained language model for English Tweets}},
author    = {Dat Quoc Nguyen and Thanh Vu and Anh Tuan Nguyen},
booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations},
pages     = {9--14},
year      = {2020}
}

Please CITE our paper when BERTweet is used to help produce published results or is incorporated into other software.

Main results

(Result figures: main results on POS tagging, NER, sentiment analysis and irony detection benchmarks.)

Using BERTweet with transformers

Installation

  • Install transformers with pip: pip install transformers, or install transformers from source.
Note that we merged a slow tokenizer for BERTweet into the main transformers branch. Merging a fast tokenizer for BERTweet is still under discussion, as mentioned in this pull request. Users who would like to use the fast tokenizer can install transformers as follows:
git clone --single-branch --branch fast_tokenizers_BARTpho_PhoBERT_BERTweet https://github.com/datquocnguyen/transformers.git
cd transformers
pip3 install -e .
  • Install tokenizers with pip: pip3 install tokenizers

Pre-trained models

Model                                 #params   Arch.   Max length   Pre-training data
vinai/bertweet-base                   135M      base    128          850M English Tweets (cased)
vinai/bertweet-covid19-base-cased     135M      base    128          23M COVID-19 English Tweets (cased)
vinai/bertweet-covid19-base-uncased   135M      base    128          23M COVID-19 English Tweets (uncased)
vinai/bertweet-large                  355M      large   512          873M English Tweets (cased)
  • 09/2020: The two pre-trained models vinai/bertweet-covid19-base-cased and vinai/bertweet-covid19-base-uncased result from further pre-training vinai/bertweet-base on a corpus of 23M COVID-19 English Tweets.
  • 08/2021: Released vinai/bertweet-large.

Example usage

import torch
from transformers import AutoModel, AutoTokenizer 

bertweet = AutoModel.from_pretrained("vinai/bertweet-large")

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-large")

# INPUT TWEET IS ALREADY NORMALIZED!
line = "DHEC confirms HTTPURL via @USER :crying_face:"

input_ids = torch.tensor([tokenizer.encode(line)])

with torch.no_grad():
    features = bertweet(input_ids)  # Model outputs are now tuples
    
## With TensorFlow 2.0+:
# from transformers import TFAutoModel
# bertweet = TFAutoModel.from_pretrained("vinai/bertweet-large")

Normalize raw input Tweets

Before applying BPE to the pre-training corpus of English Tweets, we tokenized these Tweets using TweetTokenizer from the NLTK toolkit and used the emoji package to translate emotion icons into text strings (here, each icon is referred to as a word token). We also normalized the Tweets by converting user mentions and web/url links into the special tokens @USER and HTTPURL, respectively. We therefore recommend applying the same pre-processing step to the raw input Tweets in BERTweet-based downstream applications.

Given the raw input Tweets, to obtain the same pre-processing output, users could employ our TweetNormalizer module.

  • Installation: pip3 install nltk emoji==0.6.0
  • The emoji version must be either 0.5.4 or 0.6.0. Newer emoji versions have been updated to newer Emoji Charts and are therefore inconsistent with the version used to pre-process our pre-training Tweet corpus.
import torch
from transformers import AutoTokenizer
from TweetNormalizer import normalizeTweet

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-large")

line = normalizeTweet("DHEC confirms https://postandcourier.com/health/covid19/sc-has-first-two-presumptive-cases-of-coronavirus-dhec-confirms/article_bddfe4ae-5fd3-11ea-9ce4-5f495366cee6.html?utm_medium=social&utm_source=twitter&utm_campaign=user-share… via @postandcourier 😢")

input_ids = torch.tensor([tokenizer.encode(line)])
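For reference, the core of this normalization can be sketched as follows. This is a simplified re-implementation, not the repository's TweetNormalizer module itself (which additionally normalizes some punctuation and contractions); the function names here are illustrative only:

from nltk.tokenize import TweetTokenizer
from emoji import demojize

tweet_tokenizer = TweetTokenizer()

def normalize_token(token):
    lowered = token.lower()
    if token.startswith("@"):
        return "@USER"          # user mentions -> special token
    elif lowered.startswith("http") or lowered.startswith("www"):
        return "HTTPURL"        # web/url links -> special token
    elif len(token) == 1:
        return demojize(token)  # e.g. "😢" -> ":crying_face:"
    else:
        return token

def normalize_tweet(tweet):
    tokens = tweet_tokenizer.tokenize(tweet)
    return " ".join(normalize_token(t) for t in tokens)

print(normalize_tweet("DHEC confirms https://t.co/example via @postandcourier 😢"))
# DHEC confirms HTTPURL via @USER :crying_face: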

Using BERTweet with fairseq

Please see details at HERE!

License

MIT License

Copyright (c) 2020-2021 VinAI Research

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

bertweet's People

Contributors

datquocnguyen · tienthanhdhcn


bertweet's Issues

Preprocessing of tweets

Hello,
I saw your preprocessing steps (where you convert links and mentions to @USER and HTTPURL), but I am wondering: when you trained BERTweet, I imagine there can be strange tokens/symbols in the data, so when you mask 15% of the tokens for each sentence, why does the model not get confused when trying to predict, for example, ":@" or ":D" or some other strange symbols?

Regards

Can't load Tokenizer

When trying to load the Tokenizer using the following code:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")

I got these error messages:


AttributeError                            Traceback (most recent call last)
<ipython-input> in <module>()
      1 from transformers import AutoTokenizer
----> 2 tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")

/usr/local/lib/python3.6/dist-packages/transformers/models/auto/tokenization_auto.py in from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
    368         if use_fast and not config.tokenizer_class.endswith("Fast"):
    369             tokenizer_class_candidate = f"{config.tokenizer_class}Fast"
--> 370         tokenizer_class = tokenizer_class_from_name(tokenizer_class_candidate)
    371         if tokenizer_class is None:
    372             tokenizer_class_candidate = config.tokenizer_class

/usr/local/lib/python3.6/dist-packages/transformers/models/auto/tokenization_auto.py in tokenizer_class_from_name(class_name)
    271         )
    272         for c in all_tokenizer_classes:
--> 273             if c.__name__ == class_name:
    274                 return c
    275

AttributeError: 'NoneType' object has no attribute '__name__'

Python 3.6.9
Transformers 4.3.2

Question about normalization=True

Hello,
How do you save the tokenizer with the custom AutoTokenizer parameter normalization=True? And how do you point it to the preprocessing function?

Reproducing the results of fine-tuning XLMR large in the paper

Hi, I'm interested in your great work and tried to reproduce your results of fine-tuning XLMR (with my own code). I got 92.6 on Ritter11, 93.4 on ARK, and 95.0 on TB-v2. I find that the result on ARK is lower than the one reported in the paper. In the paper, you applied the "soft" and "hard" strategies to the dataset while I did nothing. Therefore, I think the reason is possibly the data pre-processing; am I right?

feature request

If you have a plan to release a larger model, please consider the following options as well:

  • larger positional embeddings (useful, for example, for conversational threads on Twitter)
  • a multilingual model

Thanks!

vinai/bertweet-large returns LABEL_0 all the time

Hi all, I have something fishy going on with the bertweet-large model. The code and the output are below. I also tested a dataset of 5000 tweets and it returns LABEL_0 for all of them. Do you have any idea what might be the issue?

Thanks

Best

CODE

classifier = pipeline('sentiment-analysis', model="vinai/bertweet-large") # , return_all_scores = True)

print(classifier('I hate you'))

print(classifier('I love you'))

print(classifier('I you'))

OUTPUT

[{'label': 'LABEL_0', 'score': 0.6324337720870972}]
[{'label': 'LABEL_0', 'score': 0.6408738493919373}]
[{'label': 'LABEL_0', 'score': 0.6261811256408691}]

Model outputs tuples

Hi, could you explain how you generate the tweet sentence embedding, please? I checked the shape of the output based on the example: features = bertweet(input_ids) seems to hold the embeddings of each token in features[0] (e.g., [1, 20, 768]) and the tweet sentence embedding in features[1] (e.g., [1, 768]). If so, could you please let me know how you generate features[1]? Is it based on the [CLS] token, or simply an average of the word token embeddings? Thanks!
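A note for context (not from the original thread): in the transformers implementation of RoBERTa-style models, the second output is the pooler output, which is computed from the first token (<s>) only, passed through a dense layer and a tanh, not by averaging. A sketch that reproduces it, assuming a recent transformers version:

import torch
from transformers import AutoModel, AutoTokenizer

bertweet = AutoModel.from_pretrained("vinai/bertweet-base")
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False)

input_ids = torch.tensor([tokenizer.encode("HTTPURL via @USER :cry:")])
with torch.no_grad():
    outputs = bertweet(input_ids)
    last_hidden_state = outputs[0]   # per-token states, e.g. (1, 20, 768)
    pooler_output = outputs[1]       # sentence-level vector, (1, 768)

    # pooler_output is derived from the <s> token's state alone:
    recomputed = torch.tanh(bertweet.pooler.dense(last_hidden_state[:, 0]))
    print(torch.allclose(recomputed, pooler_output, atol=1e-5))  # True

A common alternative sentence embedding is a simple mean over last_hidden_state along the token dimension.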

Config and SequenceClassification

Hi all and thanks for the cool contribution,

Now that the PR is merged on transformers, I am trying to include your model in the simpletransformers repository, in order to use it in my project.

  • I have read in the README that BERTweet has a BERT-base configuration (and shares its pre-training procedure with RoBERTa). Why, then, is it associated with a RobertaConfig in src/transformers/tokenization_auto.py (in TOKENIZER_MAPPING; see the changed files in the PR)? Shouldn't we use a BertConfig instead?
  • When loading the weights to fine-tune on a text classification downstream task, should we use BertForSequenceClassification or RobertaForSequenceClassification?

Thanks a lot in advance.
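A hedged note rather than an official reply: since BERTweet follows the RoBERTa pre-training procedure and its published config declares a RoBERTa architecture, the Roberta* classes (or the Auto* classes, which resolve to them) are the matching choice for fine-tuning. A minimal sketch:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# resolves to RobertaForSequenceClassification via the model's config;
# num_labels=3 is an illustrative choice for 3-class sentiment
model = AutoModelForSequenceClassification.from_pretrained(
    "vinai/bertweet-base", num_labels=3)
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False)
print(type(model).__name__)  # RobertaForSequenceClassification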

Model name was not found in tokenizers model name list

I tried to run your example usage:

import torch
from transformers import AutoModel, AutoTokenizer

bertweet = AutoModel.from_pretrained("vinai/bertweet-base")
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")

# INPUT TWEET IS ALREADY NORMALIZED!
line = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry:"

input_ids = torch.tensor([tokenizer.encode(line)])

with torch.no_grad():
    features = bertweet(input_ids)  # Models outputs are now tuples
    print(features)

I'm getting the following error:

OSError: Model name 'vinai/bertweet-base' was not found in tokenizers model name list (roberta-large-openai-detector, roberta-large-mnli, roberta-large, roberta-base-openai-detector, roberta-base, distilroberta-base). We assumed 'vinai/bertweet-base' was a path, a model identifier, or url to a directory containing vocabulary files named ['vocab.json', 'merges.txt'] but couldn't find such vocabulary files at this path or url.

My environment:
python 3.5.6
transformers 2.5.1
torch 1.4.0

Some emojis not tokenized properly

Hi dev team
I appreciate you guys making this model available to facilitate NLP research on tweets! I have been using BERTweet for my project on Twitter data; however, I think I've just found something weird in the tokenization step.
The 'Example usage' tab in README gives a sample tweet: "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @user 😢"
I tried to tokenize this tweet with AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False), then use print(tokenizer.convert_ids_to_tokens(tokenizer.encode(line))), I get:

['<s>', 'SC', 'has', 'first', 'two', 'presum@@', 'ptive', 'cases', 'of', 'coronavirus', ',', 'D@@', 'HE@@', 'C', 'confirms', 'HTTPURL', 'via', '@USER', '<unk>', ':', '</s>']

Or with AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False, normalization=True), I get:
['<s>', 'SC', 'has', 'first', 'two', 'presum@@', 'ptive', 'cases', 'of', 'coronavirus', ',', 'D@@', 'HE@@', 'C', 'confirms', 'HTTPURL', 'via', '@USER', ':', 'cry', ':', '</s>']

Either way, the tokenization is not correct for the emoji string ":cry:"

I have checked the source code implemented in Transformers; I think what went wrong is that for emoji.demojize(), you need to set the option use_aliases=True to cover all emojis, otherwise some just won't get included.

I have also checked tokenizer.get_vocab()[':cry:'], and it returns a KeyError
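To illustrate the mismatch described above: in the emoji 0.x releases, demojize supports two naming schemes, the standard names and the GitHub-style aliases. A quick check, assuming emoji==0.6.0 as pinned in the README:

import emoji

print(emoji.demojize("😢"))                    # ':crying_face:' (standard name)
print(emoji.demojize("😢", use_aliases=True))  # ':cry:'         (alias)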

What pre-processing steps are applied

I want to know what pre-processing steps are applied other than emoji, username and URL normalization.
Are tweets lowercased?
Are digits removed?
What about punctuation?

Huggingface version is not working

error

When I execute the huggingface version, it throws the following error:

OSError: Model name 'vinai/bertweet-base' was not found in tokenizers model name list (roberta-base, roberta-large, roberta-large-mnli, distilroberta-base, roberta-base-openai-detector, roberta-large-openai-detector). We assumed 'vinai/bertweet-base' was a path, a model identifier, or url to a directory containing vocabulary files named ['vocab.json', 'merges.txt'] but couldn't find such vocabulary files at this path or url.

reproduce

https://colab.research.google.com/drive/1bwWQAX9Ql0d1fTVSQd1KpP2AQledgyKZ?usp=sharing

Script to postprocess the prediction outputs on the Ritter11-T-POS test set

def convertTags(filein):
    """Post-process predicted POS tags; writes the corrected tags to a .post file."""
    writer = open(filein + ".post", "w")
    lines = open(filein, "r").readlines()
    for ind in range(len(lines)):
        line = lines[ind]
        tokTag = line.strip().split()
        if len(tokTag) == 0:  # blank line separates sentences
            writer.write("\n")
            continue

        # Deterministic tags for the normalized special tokens
        if tokTag[0] == "@USER":
            tokTag[1] = "USR"
        elif tokTag[0] == "HTTPURL":
            tokTag[1] = "URL"
        elif tokTag[0].startswith("#"):
            tokTag[1] = "HT"
        elif tokTag[0] == "RT":
            tokTag[1] = "RT"

        # A parenthesis adjacent to a ":"/UH token is part of an emoticon -> UH;
        # otherwise a parenthesis is tagged as itself
        if tokTag[0] == "(" or tokTag[0] == ")":
            if ind >= 1:
                tokTag_1 = lines[ind - 1].strip().split()
                if len(tokTag_1) == 2:
                    if tokTag_1[0] == ":" and tokTag_1[1] == "UH":
                        tokTag[1] = "UH"
                    else:
                        tokTag[1] = tokTag[0]
            if ind < len(lines) - 1:
                tokTag_1 = lines[ind + 1].strip().split()
                if len(tokTag_1) == 2:
                    if tokTag_1[0] == ":" and tokTag_1[1] == "UH":
                        tokTag[1] = "UH"
                    else:
                        tokTag[1] = tokTag[0]

        writer.write("\t".join(tokTag) + "\n")
    writer.close()

Reproducing the POS Tagger Results using Huggingface Tokenizer Offsets

Hi, I was planning on implementing the same POS tagger architecture with the bertweet-base model using Huggingface, but since it is not supported by PreTrainedTokenizerFast, you cannot access the offset_mappings, and thus cannot easily access the embeddings for a given token for POS tagging (I planned on pooling the subwords per token in Tweebank). The tokenizer doesn't seem to deviate from the Huggingface RoBERTa tokenizers except for the normalization functionality, so is there any way to use this feature, or could it be added (perhaps in a setting that doesn't use normalization)? It already works for the bertweet-large model, so I assume it's not impossible.

Issue when fine-tuning the model from huggingface hub

Thanks for making the model available in huggingface hub. I tried to use it with some existing code I have. I've been running the same code with some 10+ models from huggingface hub with no issue. When I try to run with: "vinai/bertweet-base"

I get the following error (note model loads fine and it seems it starts training for several iterations) - see below.

I'm not sure what the problem could be. Could the version of transformers and/or pytorch be the problem? Do you know which versions you tried it with? I'm using transformers 3.4 and torch 1.5.1+cu101

Thanks for your help!

| 44/1923 [00:11<08:11,  3.82it/s]
Traceback (most recent call last):
  File "../models/jigsaw/tr-3.4//run_puppets.py", line 284, in <module>
    main()
  File "../models/jigsaw/tr-3.4//run_puppets.py", line 195, in main
    trainer.train(
  File "/u/ioana/.conda/envs/tr34/lib/python3.8/site-packages/transformers/trainer.py", line 756, in train
    tr_loss += self.training_step(model, inputs)
  File "/u/ioana/.conda/envs/tr34/lib/python3.8/site-packages/transformers/trainer.py", line 1070, in training_step
    loss.backward()
  File "/u/ioana/.conda/envs/tr34/lib/python3.8/site-packages/torch/tensor.py", line 198, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/u/ioana/.conda/envs/tr34/lib/python3.8/site-packages/torch/autograd/__init__.py", line 98, in backward
    Variable._execution_engine.run_backward(
RuntimeError: CUDA error: device-side assert triggered (launch_kernel at /pytorch/aten/src/ATen/native/cuda/CUDALoops.cuh:217)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x46 (0x2b9e5852d536 in /u/ioana/.conda/envs/tr34/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xd43696 (0x2b9e2155e696 in /u/ioana/.conda/envs/tr34/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #2: void at::native::gpu_kernel_impl<__nv_hdl_wrapper_t<false, false, __nv_dl_tag<void (*)(at::TensorIterator&, c10::Scalar), &at::native::add_kernel_cuda, 4u>, float (float, float), float> >(at::TensorIterator&, __nv_hdl_wrapper_t<false, false, __nv_dl_tag<void (*)(at::TensorIterator&, c10::Scalar), &at::native::add_kernel_cuda, 4u>, float (float, float), float> const&) + 0x19e1 (0x2b9e2251ce11 in /u/ioana/.conda/envs/tr34/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)

Use model output for sentiment classification

Thanks a lot for the work on BERTweet. In the paper you describe using the model for 3-class sentiment analysis. Can you please provide an example of how it is done?

The readme example:

bertweet = AutoModel.from_pretrained("vinai/bertweet-base")
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False)
line = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :crying_face:"
input_ids = torch.tensor([tokenizer.encode(line)])
with torch.no_grad():
    features = bertweet(input_ids)

produces the following results:

# Inspect results
print(f'Pooler outputs shape: {features["pooler_output"].shape}')
print(f'Last hidden states shape: {features["last_hidden_state"].shape}')

Pooler outputs shape: torch.Size([1, 768])
Last hidden states shape: torch.Size([1, 20, 768])

How do you then use the model outputs to classify the sentiment of the tweet?
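Not an answer from the original thread, but the usual recipe: the base checkpoint only produces features, so for 3-class sentiment you attach a classification head and fine-tune it on labeled tweets. A minimal sketch with hypothetical data:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False)
model = AutoModelForSequenceClassification.from_pretrained(
    "vinai/bertweet-base", num_labels=3)  # e.g. negative / neutral / positive

texts = ["I love this :red_heart:", "This is terrible"]  # hypothetical examples
labels = torch.tensor([2, 0])

batch = tokenizer(texts, padding=True, truncation=True, max_length=128,
                  return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

outputs = model(**batch, labels=labels)  # returns cross-entropy loss + logits
outputs.loss.backward()
optimizer.step()

predictions = outputs.logits.argmax(dim=-1)  # class ids, meaningful after training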

error while running the sample code

Hi,
I am using Google Colab and trying to run the sample usage code you have given:

from fairseq.data.encoders.fastbpe import fastBPE
from fairseq import options

parser = options.get_preprocessing_parser()
parser.add_argument('--bpe-codes', type=str, help='path to fastBPE BPE', default="BERTweet_base_fairseq/bpe.codes")
args = parser.parse_args()

I'm facing an issue while passing args. Why do I get this error?
(Screenshot attached in the original issue.)

Can't load BERTweet Tokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")

results in:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/tam/anaconda3/lib/python3.8/site-packages/transformers/tokenization_auto.py", line 220, in from_pretrained
    return tokenizer_class_py.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/home/tam/anaconda3/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1425, in from_pretrained
    return cls._from_pretrained(*inputs, **kwargs)
  File "/home/tam/anaconda3/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1524, in _from_pretrained
    raise EnvironmentError(
OSError: Model name 'vinai/bertweet-base' was not found in tokenizers model name list (roberta-base, roberta-large, roberta-large-mnli, distilroberta-base, roberta-base-openai-detector, roberta-large-openai-detector). We assumed 'vinai/bertweet-base' was a path, a model identifier, or url to a directory containing vocabulary files named ['vocab.json', 'merges.txt'] but couldn't find such vocabulary files at this path or url.

transformers version: 3.1.0

AutoTokenizer gives error

The sample script provided here gives an error. The script is given below:

import torch
from transformers import AutoModel, AutoTokenizer

bertweet = AutoModel.from_pretrained("vinai/bertweet-base")
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")

# INPUT TWEET IS ALREADY NORMALIZED!
line = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry:"

input_ids = torch.tensor([tokenizer.encode(line)])

with torch.no_grad():
    features = bertweet(input_ids)

Error:

Traceback (most recent call last):
  File "temp.py", line 5, in <module>
    tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")
  File "<conda_env>/lib/python3.8/site-packages/transformers/models/auto/tokenization_auto.py", line 372, in from_pretrained
    tokenizer_class = tokenizer_class_from_name(tokenizer_class_candidate)
  File "<conda_env>/lib/python3.8/site-packages/transformers/models/auto/tokenization_auto.py", line 275, in tokenizer_class_from_name
    if c.__name__ == class_name:
AttributeError: 'NoneType' object has no attribute '__name__'

The error is resolved by using:

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False)

tokenizer does not pad the data

All the following calls to the tokenizer return the same ids

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-covid19-base-uncased")


input_ids = tokenizer.encode(line, return_tensors="pt")
input_ids = tokenizer.encode(line, padding=True, return_tensors="pt")
input_ids = tokenizer.encode(line, padding=True, max_length=128, return_tensors="pt")
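A likely explanation, not from the thread itself: padding=True means "pad to the longest sequence in the batch", so a single sentence is never padded; only padding="max_length" pads up to max_length. A quick check:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-covid19-base-uncased",
                                          use_fast=False)
line = "HTTPURL via @USER"  # stand-in for the reporter's input

a = tokenizer.encode(line, padding=True, return_tensors="pt")
b = tokenizer.encode(line, padding="max_length", max_length=128,
                     return_tensors="pt")
print(a.shape)  # (1, n): "longest in batch" is the sentence itself
print(b.shape)  # (1, 128): padded with <pad> ids up to max_length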

BPE installation error

I am working on a remote Ubuntu server over SSH, with Python 3.6. I am unable to install fastBPE and get this error:
error: command 'gcc' failed with exit status 1
As a result, I'm unable to execute my code. I am unable to install it with either pip or conda. Please help.

Sentiment analysis of tweets

I want to do sentiment analysis of tweets and I want to use this model for that purpose.

Can someone provide me with a high-level overview of what/how I should be doing to accomplish my task?

Applying BERTweet to a huge pandas dataframe

Hello everyone :)

I'm a psychologist researcher studying user behaviour on social media.

As part of my research, I collect a huge number of tweets on a specific hashtag (~25,000,000 tweets).

I would like to do sentiment analysis on this dataset. I previously used the default HuggingFace pipeline for SA, but the results weren't that great:

classifier = pipeline("sentiment-analysis")

tweets_df['sentiment'] = tweets_df['text'].apply(lambda row : (classifier(row))[0]['label'])

I ran the example code on the main GitHub page (screenshot in the original issue).

I'm here asking for some guidance: how can I apply it in a performant way and get a new column called 'sentiment' with the sentiment of each row?
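One way to make this tractable, sketched under two assumptions: a BERTweet checkpoint fine-tuned for sentiment is available (the bare pre-trained model has no trained classification head), and tweets_df has a text column as in the snippet above. Run the classifier over the column in chunks instead of a per-row apply:

from transformers import pipeline

# "your-bertweet-sentiment-checkpoint" is a placeholder for a fine-tuned model
classifier = pipeline("sentiment-analysis", model="your-bertweet-sentiment-checkpoint")

texts = tweets_df["text"].tolist()
results = []
for start in range(0, len(texts), 256):   # chunked calls keep memory bounded
    results.extend(classifier(texts[start:start + 256]))

tweets_df["sentiment"] = [r["label"] for r in results]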

How to deal with batch data input

How do I deal with input batches of different lengths? For example, with batch_size = 2, how do I feed the two sentences "I like playing basketball" and "it's not a good day" as input?
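A sketch of the standard approach, not an official reply: tokenize the list of sentences together with padding enabled, and pass the attention mask so the model ignores the padded positions:

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False)
bertweet = AutoModel.from_pretrained("vinai/bertweet-base")

batch = tokenizer(["I like playing basketball", "it's not a good day"],
                  padding=True, return_tensors="pt")  # pads to the longer one
with torch.no_grad():
    outputs = bertweet(input_ids=batch["input_ids"],
                       attention_mask=batch["attention_mask"])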

synergy Jina <> BERTweet

hi VinAI team,

Great work 👍 I'm the founder & CEO of Jina AI; you may have used or heard of my previous work on Fashion-MNIST and bert-as-service. I'm the creator of those two OSS projects.

I'm asking if we can build a synergy between Jina <> BERTweet (& PhoBERT, posted separately). https://github.com/jina-ai/jina

Simply put, Jina is a universal neural search engine: a search infrastructure that can be used for searching text2text, image2image, audio2audio, etc. We already have examples using Jina for QA and semantic text search; full examples can be found here.

Potential synergy

  1. I see great potential to apply this in production. Therefore I kindly ask if you are interested in porting it into jina or jina-hub, so that people can use it as one of their search components in Jina.

  2. If you are interested in long-term collaboration, we also have a Slack channel, where we can invite you for more discussion. We also welcome your thoughts on it.

Re-training the language model

Hi all,

Thank you for the great work! This solves the problem of adapting BERT (trained on Wikipedia and the book corpus) to the tweet domain. Of course, the problem of adapting it to one's own domain of tweet data is still there. For this purpose, it would be useful to re-train the BERTweet language model first, to teach BERTweet the language of a specific domain. I have been investigating some tutorials and the Trainer module of transformers. Do you have any guidance, script, or tutorial that can be shared?

Many thanks
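For what it's worth, domain-adaptive pre-training can be sketched with the Trainer and the masked-LM collator. This is an illustrative outline, with my_tweets.txt (one normalized tweet per line) as a hypothetical input file and the datasets library assumed installed:

from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False)
model = AutoModelForMaskedLM.from_pretrained("vinai/bertweet-base")

dataset = load_dataset("text", data_files={"train": "my_tweets.txt"})
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"])

# randomly masks 15% of tokens, matching the standard MLM objective
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm_probability=0.15)
args = TrainingArguments(output_dir="bertweet-domain-adapted",
                         num_train_epochs=1, per_device_train_batch_size=32)
Trainer(model=model, args=args, data_collator=collator,
        train_dataset=tokenized["train"]).train()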

Sentence embeddings

How do I generate entire tweet embeddings instead of word embeddings using BERTweet?
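One common recipe, a sketch rather than an official answer: mean-pool the token states, weighting by the attention mask so padding does not contribute:

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False)
bertweet = AutoModel.from_pretrained("vinai/bertweet-base")

batch = tokenizer(["HTTPURL via @USER :cry:"], padding=True, return_tensors="pt")
with torch.no_grad():
    token_states = bertweet(**batch)[0]          # (batch, seq_len, 768)

mask = batch["attention_mask"].unsqueeze(-1)     # (batch, seq_len, 1)
tweet_embedding = (token_states * mask).sum(1) / mask.sum(1)  # (batch, 768)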

IndexError: index out of range in self

Hi, I encountered an error: "IndexError: index out of range in self." Below is my code. Can you help identify where the problem is? Is it related to the length of the sequence? I can provide the specific text if you need it.

pretrain_model = 'vinai/bertweet-base'
tokenizer = AutoTokenizer.from_pretrained(pretrain_model)
model = AutoModelForMaskedLM.from_pretrained(pretrain_model)

inputs = tokenizer(text, return_tensors="pt", padding='max_length', truncation=True, max_length=512)
last_hidden_states = model(**inputs, output_hidden_states=True).hidden_states[-1]
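A plausible cause, not from the thread itself but consistent with the model table earlier in this README: vinai/bertweet-base supports a max length of 128, so padding to max_length=512 produces position indices beyond the position-embedding table, which raises exactly this IndexError inside the embedding lookup. Capping max_length fixes it:

from transformers import AutoModelForMaskedLM, AutoTokenizer

pretrain_model = 'vinai/bertweet-base'
tokenizer = AutoTokenizer.from_pretrained(pretrain_model, use_fast=False)
model = AutoModelForMaskedLM.from_pretrained(pretrain_model)

text = "HTTPURL via @USER :cry:"  # stand-in for the reporter's text
inputs = tokenizer(text, return_tensors="pt", padding='max_length',
                   truncation=True, max_length=128)  # 128, not 512
last_hidden_states = model(**inputs, output_hidden_states=True).hidden_states[-1]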

Using BERTweet with FARM

When I try to use the BERTweet model with the FARM package I get the following error. It seems to be struggling to find the model, but I do not understand why. I am using the Jupyter notebook described in this article and have included the cell code with the error below.

lang_model = "vinai/bertweet-base"
do_lower_case = False

tokenizer = Tokenizer.load(
    pretrained_model_name_or_path=lang_model,
    do_lower_case=do_lower_case)
---------------------------------------------------------------------------

OSError                                   Traceback (most recent call last)

/var/folders/5r/p050t_sd4l130ytlj_x4wxyh0000gn/T/ipykernel_39034/78292544.py in <module>
      2 do_lower_case = False
      3 
----> 4 tokenizer = Tokenizer.load(
      5     pretrained_model_name_or_path=lang_model,
      6     do_lower_case=do_lower_case)

~/Git/Trade_with_Twitter/venv/lib/python3.8/site-packages/farm/modeling/tokenization.py in load(cls, pretrained_model_name_or_path, revision, tokenizer_class, use_fast, **kwargs)
     95         elif "RobertaTokenizer" in tokenizer_class:
     96             if use_fast:
---> 97                 ret = RobertaTokenizerFast.from_pretrained(pretrained_model_name_or_path, **kwargs)
     98             else:
     99                 ret = RobertaTokenizer.from_pretrained(pretrained_model_name_or_path, **kwargs)

~/Git/Trade_with_Twitter/venv/lib/python3.8/site-packages/transformers/tokenization_utils_base.py in from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs)
   1706                 f"- or '{pretrained_model_name_or_path}' is the correct path to a directory containing relevant tokenizer files\n\n"
   1707             )
-> 1708             raise EnvironmentError(msg)
   1709 
   1710         for file_id, file_path in vocab_files.items():

OSError: Can't load tokenizer for 'vinai/bertweet-base'. Make sure that:

- 'vinai/bertweet-base' is a correct model identifier listed on 'https://huggingface.co/models'

- or 'vinai/bertweet-base' is the correct path to a directory containing relevant tokenizer files

Could you share the pre-processed tweet data?

Hi all,

Thanks for the great work! I used it to make this little emoji recommender:

http://rensdimmendaal.com/emoji/

I'd love to expand the number of different emoji I can recommend. However, I cannot at the moment, because some emoji are split into multiple tokens. Would you be willing to share the preprocessed data of bertweet-base so I can add these as single tokens to the vocabulary and tune the model?

next sentence prediction

Thanks for developing BERTweet.
Here is a conceptual question about utilizing tweets for training a BERT model; I am curious how you have handled it.

The BERT language model has a "next sentence prediction" (NSP) objective, where the LM is optimized to predict the next sentence.

Since tweets are short and often contain only one sentence, I am curious how you handled that and how you bypassed the NSP part?

Thank you again.

Tokenizer vinai/bertweet-covid19-base-uncased

Does vinai/bertweet-covid19-base-uncased use the same tokenizer as bertweet-base? I've been trying to run code on my annotated data, and it keeps giving me the error "index is out of range in self" during the training stage if I download the model and tokenizer from bertweet-covid19-base-uncased. The only way it worked for me was using the model from bertweet-covid19-base-uncased and the tokenizer from bertweet-base.
