
vinairesearch / bertweet

557 stars · 12 watchers · 50 forks · 137 KB

BERTweet: A pre-trained language model for English Tweets (EMNLP-2020)

License: MIT License

Python 100.00%
python3 bert roberta transformers fairseq language-model english part-of-speech-tagging ner named-entity-recognition

bertweet's Introduction

Table of contents

  1. Introduction
  2. Main results
  3. Using BERTweet with transformers
  4. Using BERTweet with fairseq

BERTweet: A pre-trained language model for English Tweets

BERTweet is the first public large-scale language model pre-trained for English Tweets. BERTweet is trained based on the RoBERTa pre-training procedure. The corpus used to pre-train BERTweet consists of 850M English Tweets (16B word tokens ~ 80GB), containing 845M Tweets streamed from 01/2012 to 08/2019 and 5M Tweets related to the COVID-19 pandemic. The general architecture and experimental results of BERTweet can be found in our paper:

@inproceedings{bertweet,
title     = {{BERTweet: A pre-trained language model for English Tweets}},
author    = {Dat Quoc Nguyen and Thanh Vu and Anh Tuan Nguyen},
booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations},
pages     = {9--14},
year      = {2020}
}

Please CITE our paper when BERTweet is used to help produce published results or is incorporated into other software.

Main results

(Result figures: main results on POS tagging, NER, sentiment analysis and irony detection benchmarks.)

Using BERTweet with transformers

Installation

  • Install transformers with pip: pip install transformers, or install transformers from source.
Note that we merged a slow tokenizer for BERTweet into the main transformers branch. Merging a fast tokenizer for BERTweet is still under discussion, as mentioned in this pull request. Users who would like to use the fast tokenizer can install transformers as follows:
git clone --single-branch --branch fast_tokenizers_BARTpho_PhoBERT_BERTweet https://github.com/datquocnguyen/transformers.git
cd transformers
pip3 install -e .
  • Install tokenizers with pip: pip3 install tokenizers

Pre-trained models

Model                                 #params   Arch.   Max length   Pre-training data
vinai/bertweet-base                   135M      base    128          850M English Tweets (cased)
vinai/bertweet-covid19-base-cased     135M      base    128          23M COVID-19 English Tweets (cased)
vinai/bertweet-covid19-base-uncased   135M      base    128          23M COVID-19 English Tweets (uncased)
vinai/bertweet-large                  355M      large   512          873M English Tweets (cased)
  • 09/2020: The two pre-trained models vinai/bertweet-covid19-base-cased and vinai/bertweet-covid19-base-uncased result from further pre-training vinai/bertweet-base on a corpus of 23M COVID-19 English Tweets.
  • 08/2021: Released vinai/bertweet-large.

Example usage

import torch
from transformers import AutoModel, AutoTokenizer 

bertweet = AutoModel.from_pretrained("vinai/bertweet-large")

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-large")

# INPUT TWEET IS ALREADY NORMALIZED!
line = "DHEC confirms HTTPURL via @USER :crying_face:"

input_ids = torch.tensor([tokenizer.encode(line)])

with torch.no_grad():
    features = bertweet(input_ids)  # Model outputs are now tuples
    
## With TensorFlow 2.0+:
# from transformers import TFAutoModel
# bertweet = TFAutoModel.from_pretrained("vinai/bertweet-large")

Normalize raw input Tweets

Before applying BPE to the pre-training corpus of English Tweets, we tokenized these Tweets using TweetTokenizer from the NLTK toolkit and used the emoji package to translate emotion icons into text strings (here, each icon is referred to as a word token). We also normalized the Tweets by converting user mentions and web/url links into the special tokens @USER and HTTPURL, respectively. We therefore recommend applying the same pre-processing step to the raw input Tweets in BERTweet-based downstream applications.

Given the raw input Tweets, to obtain the same pre-processing output, users could employ our TweetNormalizer module.

  • Installation: pip3 install nltk emoji==0.6.0
  • The emoji version must be either 0.5.4 or 0.6.0. Newer emoji versions have been updated to newer Emoji Charts and are therefore inconsistent with the version used to pre-process our pre-training Tweet corpus.
import torch
from transformers import AutoTokenizer
from TweetNormalizer import normalizeTweet

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-large")

line = normalizeTweet("DHEC confirms https://postandcourier.com/health/covid19/sc-has-first-two-presumptive-cases-of-coronavirus-dhec-confirms/article_bddfe4ae-5fd3-11ea-9ce4-5f495366cee6.html?utm_medium=social&utm_source=twitter&utm_campaign=user-share… via @postandcourier 😢")

input_ids = torch.tensor([tokenizer.encode(line)])
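For reference, the core of this normalization can be sketched as follows. This is a simplified re-implementation, not the repository's TweetNormalizer module itself (which additionally normalizes some punctuation and contractions); the function names here are illustrative only:

from nltk.tokenize import TweetTokenizer
from emoji import demojize

tweet_tokenizer = TweetTokenizer()

def normalize_token(token):
    lowered = token.lower()
    if token.startswith("@"):
        return "@USER"          # user mentions -> special token
    elif lowered.startswith("http") or lowered.startswith("www"):
        return "HTTPURL"        # web/url links -> special token
    elif len(token) == 1:
        return demojize(token)  # e.g. "😢" -> ":crying_face:"
    else:
        return token

def normalize_tweet(tweet):
    tokens = tweet_tokenizer.tokenize(tweet)
    return " ".join(normalize_token(t) for t in tokens)

print(normalize_tweet("DHEC confirms https://t.co/example via @postandcourier 😢"))
# DHEC confirms HTTPURL via @USER :crying_face: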

Using BERTweet with fairseq

Please see details at HERE!

License

MIT License

Copyright (c) 2020-2021 VinAI Research

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

bertweet's People

Contributors

datquocnguyen · tienthanhdhcn


bertweet's Issues

Preprocessing of tweets

Hello,
I saw your preprocessing steps (where you convert links and mentions to @USER and HTTPURL), but I am wondering: when you trained BERTweet, I imagine there can be strange tokens/symbols in the data, so when you mask 15% of the tokens for each sentence, why does the model not get confused when trying to predict, for example, ":@" or ":D" or some other strange symbols?

Regards

Can't load Tokenizer

When trying to load the Tokenizer using the following code:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")

I got these error messages:


AttributeError                            Traceback (most recent call last)
<ipython-input> in <module>()
      1 from transformers import AutoTokenizer
----> 2 tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")

/usr/local/lib/python3.6/dist-packages/transformers/models/auto/tokenization_auto.py in from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
    368         if use_fast and not config.tokenizer_class.endswith("Fast"):
    369             tokenizer_class_candidate = f"{config.tokenizer_class}Fast"
--> 370         tokenizer_class = tokenizer_class_from_name(tokenizer_class_candidate)
    371         if tokenizer_class is None:
    372             tokenizer_class_candidate = config.tokenizer_class

/usr/local/lib/python3.6/dist-packages/transformers/models/auto/tokenization_auto.py in tokenizer_class_from_name(class_name)
    271         )
    272         for c in all_tokenizer_classes:
--> 273             if c.__name__ == class_name:
    274                 return c
    275

AttributeError: 'NoneType' object has no attribute '__name__'

Python 3.6.9
Transformers 4.3.2

Question about normalization=True

Hello,
How do you save the tokenizer with the custom AutoTokenizer parameter normalization=True? And how do you point it to the preprocessing function?

Reproducing the results of fine-tuning XLMR large in the paper

Hi, I'm interested in your great work and tried to reproduce your results of fine-tuning XLMR (with my own code). I got 92.6 on Ritter11, 93.4 on ARK, and 95.0 on TB-v2. I find that the result on ARK is lower than the one reported in the paper. In the paper, you applied the "soft" and "hard" strategies to the dataset while I did nothing. Therefore, I think the reason is possibly the data pre-processing; am I right?

feature request

If you have a plan to release a larger model, please consider the following options as well:

  • larger positional embeddings (useful, for example, for conversational threads on Twitter)
  • a multilingual model

Thanks!

vinai/bertweet-large returns LABEL_0 all the time

Hi all, I have something fishy going on with the bertweet-large model. The code and the output are below. I also tested a dataset of 5000 tweets and it returns LABEL_0 for all of them. Do you have any idea what might be the issue?

Thanks

Best

CODE

classifier = pipeline('sentiment-analysis', model="vinai/bertweet-large") # , return_all_scores = True)

print(classifier('I hate you'))

print(classifier('I love you'))

print(classifier('I you'))

OUTPUT

[{'label': 'LABEL_0', 'score': 0.6324337720870972}]
[{'label': 'LABEL_0', 'score': 0.6408738493919373}]
[{'label': 'LABEL_0', 'score': 0.6261811256408691}]

Model outputs tuples

Hi, could you explain how you generate the tweet sentence embedding, please? I checked the shape of the output based on the example: features = bertweet(input_ids) seems to hold the embeddings of each token in features[0] (e.g., [1, 20, 768]) and the tweet sentence embedding in features[1] (e.g., [1, 768]). If so, could you please let me know how you generate features[1]? Is it based on the [CLS] token, or simply an average of the word token embeddings? Thanks!
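A note for context (not from the original thread): in the transformers implementation of RoBERTa-style models, the second output is the pooler output, which is computed from the first token (<s>) only, passed through a dense layer and a tanh, not by averaging. A sketch that reproduces it, assuming a recent transformers version:

import torch
from transformers import AutoModel, AutoTokenizer

bertweet = AutoModel.from_pretrained("vinai/bertweet-base")
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False)

input_ids = torch.tensor([tokenizer.encode("HTTPURL via @USER :cry:")])
with torch.no_grad():
    outputs = bertweet(input_ids)
    last_hidden_state = outputs[0]   # per-token states, e.g. (1, 20, 768)
    pooler_output = outputs[1]       # sentence-level vector, (1, 768)

    # pooler_output is derived from the <s> token's state alone:
    recomputed = torch.tanh(bertweet.pooler.dense(last_hidden_state[:, 0]))
    print(torch.allclose(recomputed, pooler_output, atol=1e-5))  # True

A common alternative sentence embedding is a simple mean over last_hidden_state along the token dimension.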

Config and SequenceClassification

Hi all and thanks for the cool contribution,

Now that the PR is merged on transformers, I am trying to include your model in the simpletransformers repository, in order to use it in my project.

  • I have read in the README that BERTweet has a BERT-base configuration (and shares its pre-training procedure with RoBERTa). Why, then, is it associated with a RobertaConfig in src/transformers/tokenization_auto.py (in TOKENIZER_MAPPING; see the changed files in the PR)? Shouldn't we use a BertConfig instead?
  • When loading the weights to fine-tune on a text classification downstream task, should we use BertForSequenceClassification or RobertaForSequenceClassification?

Thanks a lot in advance.
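A hedged note rather than an official reply: since BERTweet follows the RoBERTa pre-training procedure and its published config declares a RoBERTa architecture, the Roberta* classes (or the Auto* classes, which resolve to them) are the matching choice for fine-tuning. A minimal sketch:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# resolves to RobertaForSequenceClassification via the model's config;
# num_labels=3 is an illustrative choice for 3-class sentiment
model = AutoModelForSequenceClassification.from_pretrained(
    "vinai/bertweet-base", num_labels=3)
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False)
print(type(model).__name__)  # RobertaForSequenceClassification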

Model name was not found in tokenizers model name list

I tried to run your example usage:

import torch
from transformers import AutoModel, AutoTokenizer

bertweet = AutoModel.from_pretrained("vinai/bertweet-base")
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")

# INPUT TWEET IS ALREADY NORMALIZED!
line = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry:"

input_ids = torch.tensor([tokenizer.encode(line)])

with torch.no_grad():
    features = bertweet(input_ids)  # Models outputs are now tuples
    print(features)

I'm getting the following error:

OSError: Model name 'vinai/bertweet-base' was not found in tokenizers model name list (roberta-large-openai-detector, roberta-large-mnli, roberta-large, roberta-base-openai-detector, roberta-base, distilroberta-base). We assumed 'vinai/bertweet-base' was a path, a model identifier, or url to a directory containing vocabulary files named ['vocab.json', 'merges.txt'] but couldn't find such vocabulary files at this path or url.

My environment:
python 3.5.6
transformers 2.5.1
torch 1.4.0

Some emojis not tokenized properly

Hi dev team
I appreciate you guys making this model available to facilitate NLP research on tweets! I have been using BERTweet for my project on Twitter data; however, I think I've just found something weird in the tokenization step.
The 'Example usage' tab in README gives a sample tweet: "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @user 😢"
I tried to tokenize this tweet with AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False), then use print(tokenizer.convert_ids_to_tokens(tokenizer.encode(line))), I get:

['<s>', 'SC', 'has', 'first', 'two', 'presum@@', 'ptive', 'cases', 'of', 'coronavirus', ',', 'D@@', 'HE@@', 'C', 'confirms', 'HTTPURL', 'via', '@USER', '<unk>', ':', '</s>']

Or with AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False, normalization=True), I get:
['<s>', 'SC', 'has', 'first', 'two', 'presum@@', 'ptive', 'cases', 'of', 'coronavirus', ',', 'D@@', 'HE@@', 'C', 'confirms', 'HTTPURL', 'via', '@USER', ':', 'cry', ':', '</s>']

Either way, the tokenization is not correct for the emoji string ":cry:"

I have checked the source code implemented in Transformers; I think what went wrong is that for emoji.demojize(), you need to set the option use_aliases=True to cover all emojis, otherwise some just won't get included.

I have also checked tokenizer.get_vocab()[':cry:'], and it returns a KeyError
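To illustrate the mismatch described above: in the emoji 0.x releases, demojize supports two naming schemes, the standard names and the GitHub-style aliases. A quick check, assuming emoji==0.6.0 as pinned in the README:

import emoji

print(emoji.demojize("😢"))                    # ':crying_face:' (standard name)
print(emoji.demojize("😢", use_aliases=True))  # ':cry:'         (alias)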

What pre-processing steps are applied

I want to know what pre-processing steps are applied other than emoji, username and URL normalization.
Are tweets lowercased?
Are digits removed?
What about punctuation?

Huggingface version is not working

error

When I execute the huggingface version, it throws the following error:

OSError: Model name 'vinai/bertweet-base' was not found in tokenizers model name list (roberta-base, roberta-large, roberta-large-mnli, distilroberta-base, roberta-base-openai-detector, roberta-large-openai-detector). We assumed 'vinai/bertweet-base' was a path, a model identifier, or url to a directory containing vocabulary files named ['vocab.json', 'merges.txt'] but couldn't find such vocabulary files at this path or url.

reproduce

https://colab.research.google.com/drive/1bwWQAX9Ql0d1fTVSQd1KpP2AQledgyKZ?usp=sharing

Script to postprocess the prediction outputs on the Ritter11-T-POS test set

def convertTags(filein):
    """Post-process predicted POS tags; writes the corrected tags to a .post file."""
    writer = open(filein + ".post", "w")
    lines = open(filein, "r").readlines()
    for ind in range(len(lines)):
        line = lines[ind]
        tokTag = line.strip().split()
        if len(tokTag) == 0:  # blank line separates sentences
            writer.write("\n")
            continue

        # Deterministic tags for the normalized special tokens
        if tokTag[0] == "@USER":
            tokTag[1] = "USR"
        elif tokTag[0] == "HTTPURL":
            tokTag[1] = "URL"
        elif tokTag[0].startswith("#"):
            tokTag[1] = "HT"
        elif tokTag[0] == "RT":
            tokTag[1] = "RT"

        # A parenthesis adjacent to a ":"/UH token is part of an emoticon -> UH;
        # otherwise a parenthesis is tagged as itself
        if tokTag[0] == "(" or tokTag[0] == ")":
            if ind >= 1:
                tokTag_1 = lines[ind - 1].strip().split()
                if len(tokTag_1) == 2:
                    if tokTag_1[0] == ":" and tokTag_1[1] == "UH":
                        tokTag[1] = "UH"
                    else:
                        tokTag[1] = tokTag[0]
            if ind < len(lines) - 1:
                tokTag_1 = lines[ind + 1].strip().split()
                if len(tokTag_1) == 2:
                    if tokTag_1[0] == ":" and tokTag_1[1] == "UH":
                        tokTag[1] = "UH"
                    else:
                        tokTag[1] = tokTag[0]

        writer.write("\t".join(tokTag) + "\n")
    writer.close()

Reproducing the POS Tagger Results using Huggingface Tokenizer Offsets

Hi, I was planning on implementing the same POS tagger architecture with the bertweet-base model using Huggingface, but since it is not supported by PreTrainedTokenizerFast, you cannot access the offset_mappings, and thus cannot easily access the embeddings for a given token for POS tagging (I planned on pooling the subwords per token in Tweebank). The tokenizer doesn't seem to deviate from the Huggingface RoBERTa tokenizers except for the normalization functionality, so is there any way to use this feature, or could it be added (perhaps in a setting that doesn't use normalization)? It already works for the bertweet-large model, so I assume it's not impossible.

Issue when fine-tuning the model from huggingface hub

Thanks for making the model available in huggingface hub. I tried to use it with some existing code I have. I've been running the same code with some 10+ models from huggingface hub with no issue. When I try to run with: "vinai/bertweet-base"

I get the following error (note model loads fine and it seems it starts training for several iterations) - see below.

I'm not sure what the problem could be. Could the version of transformers and/or pytorch be the problem? Do you know which versions you tried it with? I'm using transformers 3.4 and torch 1.5.1+cu101

Thanks for your help!

| 44/1923 [00:11<08:11,  3.82it/s]
Traceback (most recent call last):
  File "../models/jigsaw/tr-3.4//run_puppets.py", line 284, in <module>
    main()
  File "../models/jigsaw/tr-3.4//run_puppets.py", line 195, in main
    trainer.train(
  File "/u/ioana/.conda/envs/tr34/lib/python3.8/site-packages/transformers/trainer.py", line 756, in train
    tr_loss += self.training_step(model, inputs)
  File "/u/ioana/.conda/envs/tr34/lib/python3.8/site-packages/transformers/trainer.py", line 1070, in training_step
    loss.backward()
  File "/u/ioana/.conda/envs/tr34/lib/python3.8/site-packages/torch/tensor.py", line 198, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/u/ioana/.conda/envs/tr34/lib/python3.8/site-packages/torch/autograd/__init__.py", line 98, in backward
    Variable._execution_engine.run_backward(
RuntimeError: CUDA error: device-side assert triggered (launch_kernel at /pytorch/aten/src/ATen/native/cuda/CUDALoops.cuh:217)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x46 (0x2b9e5852d536 in /u/ioana/.conda/envs/tr34/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xd43696 (0x2b9e2155e696 in /u/ioana/.conda/envs/tr34/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #2: void at::native::gpu_kernel_impl<__nv_hdl_wrapper_t<false, false, __nv_dl_tag<void (*)(at::TensorIterator&, c10::Scalar), &at::native::add_kernel_cuda, 4u>, float (float, float), float> >(at::TensorIterator&, __nv_hdl_wrapper_t<false, false, __nv_dl_tag<void (*)(at::TensorIterator&, c10::Scalar), &at::native::add_kernel_cuda, 4u>, float (float, float), float> const&) + 0x19e1 (0x2b9e2251ce11 in /u/ioana/.conda/envs/tr34/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)

Use model output for sentiment classification

Thanks a lot for the work on BERTweet. In the paper you describe using the model for 3-class sentiment analysis. Can you please provide an example of how it is done?

The readme example:

bertweet = AutoModel.from_pretrained("vinai/bertweet-base")
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False)
line = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :crying_face:"
input_ids = torch.tensor([tokenizer.encode(line)])
with torch.no_grad():
    features = bertweet(input_ids)

produces the following results:

# Inspect results
print(f'Pooler outputs shape: {features["pooler_output"].shape}')
print(f'Last hidden states shape: {features["last_hidden_state"].shape}')

Pooler outputs shape: torch.Size([1, 768])
Last hidden states shape: torch.Size([1, 20, 768])

How do you then use the model outputs to classify the sentiment of the tweet?
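Not an answer from the original thread, but the usual recipe: the base checkpoint only produces features, so for 3-class sentiment you attach a classification head and fine-tune it on labeled tweets. A minimal sketch with hypothetical data:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False)
model = AutoModelForSequenceClassification.from_pretrained(
    "vinai/bertweet-base", num_labels=3)  # e.g. negative / neutral / positive

texts = ["I love this :red_heart:", "This is terrible"]  # hypothetical examples
labels = torch.tensor([2, 0])

batch = tokenizer(texts, padding=True, truncation=True, max_length=128,
                  return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

outputs = model(**batch, labels=labels)  # returns cross-entropy loss + logits
outputs.loss.backward()
optimizer.step()

predictions = outputs.logits.argmax(dim=-1)  # class ids, meaningful after training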

error while running the sample code

Hi,
I am using Google Colab and trying to run the sample usage code you have given:

from fairseq.data.encoders.fastbpe import fastBPE
from fairseq import options

parser = options.get_preprocessing_parser()
parser.add_argument('--bpe-codes', type=str, help='path to fastBPE BPE', default="BERTweet_base_fairseq/bpe.codes")
args = parser.parse_args()

I'm facing an issue while passing args. Why do I get this error?
(Screenshot attached in the original issue.)

Can't load BERTweet Tokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")

results in:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/tam/anaconda3/lib/python3.8/site-packages/transformers/tokenization_auto.py", line 220, in from_pretrained
    return tokenizer_class_py.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/home/tam/anaconda3/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1425, in from_pretrained
    return cls._from_pretrained(*inputs, **kwargs)
  File "/home/tam/anaconda3/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1524, in _from_pretrained
    raise EnvironmentError(
OSError: Model name 'vinai/bertweet-base' was not found in tokenizers model name list (roberta-base, roberta-large, roberta-large-mnli, distilroberta-base, roberta-base-openai-detector, roberta-large-openai-detector). We assumed 'vinai/bertweet-base' was a path, a model identifier, or url to a directory containing vocabulary files named ['vocab.json', 'merges.txt'] but couldn't find such vocabulary files at this path or url.

transformers version: 3.1.0

AutoTokenizer gives error

The sample script provided here gives an error. The script is given below:

import torch
from transformers import AutoModel, AutoTokenizer

bertweet = AutoModel.from_pretrained("vinai/bertweet-base")
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")

# INPUT TWEET IS ALREADY NORMALIZED!
line = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry:"

input_ids = torch.tensor([tokenizer.encode(line)])

with torch.no_grad():
    features = bertweet(input_ids)

Error:

Traceback (most recent call last):
  File "temp.py", line 5, in <module>
    tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")
  File "<conda_env>/lib/python3.8/site-packages/transformers/models/auto/tokenization_auto.py", line 372, in from_pretrained
    tokenizer_class = tokenizer_class_from_name(tokenizer_class_candidate)
  File "<conda_env>/lib/python3.8/site-packages/transformers/models/auto/tokenization_auto.py", line 275, in tokenizer_class_from_name
    if c.__name__ == class_name:
AttributeError: 'NoneType' object has no attribute '__name__'

The error is resolved by using:

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False)

tokenizer does not pad the data

All the following calls to the tokenizer return the same ids

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-covid19-base-uncased")


input_ids = tokenizer.encode(line, return_tensors="pt")
input_ids = tokenizer.encode(line, padding=True, return_tensors="pt")
input_ids = tokenizer.encode(line, padding=True, max_length=128, return_tensors="pt")
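A likely explanation, not from the thread itself: padding=True means "pad to the longest sequence in the batch", so a single sentence is never padded; only padding="max_length" pads up to max_length. A quick check:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-covid19-base-uncased",
                                          use_fast=False)
line = "HTTPURL via @USER"  # stand-in for the reporter's input

a = tokenizer.encode(line, padding=True, return_tensors="pt")
b = tokenizer.encode(line, padding="max_length", max_length=128,
                     return_tensors="pt")
print(a.shape)  # (1, n): "longest in batch" is the sentence itself
print(b.shape)  # (1, 128): padded with <pad> ids up to max_length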

BPE installation error

I am working on a remote Ubuntu server over SSH, with Python 3.6. I am unable to install fastBPE and get this error:
error: command 'gcc' failed with exit status 1
As a result, I'm unable to execute my code. I am unable to install it with either pip or conda. Please help.

Sentiment analysis of tweets

I want to do sentiment analysis of tweets and I want to use this model for that purpose.

Can someone provide me with a high-level overview of what/how I should be doing to accomplish my task?

Applying BERTweet to a huge pandas dataframe

Hello everyone :)

I'm a psychologist researcher studying user behaviour on social media.

As part of my research, I collect a huge number of tweets on a specific hashtag (~25,000,000 tweets).

I would like to do sentiment analysis on this dataset. I previously used the default HuggingFace pipeline for SA, but the results weren't that great:

classifier = pipeline("sentiment-analysis")

tweets_df['sentiment'] = tweets_df['text'].apply(lambda row : (classifier(row))[0]['label'])

I ran the example code on the main GitHub page (screenshot in the original issue).

I'm here asking for some guidance: how can I apply it in a performant way and get a new column called 'sentiment' with the sentiment of each row?
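One way to make this tractable, sketched under two assumptions: a BERTweet checkpoint fine-tuned for sentiment is available (the bare pre-trained model has no trained classification head), and tweets_df has a text column as in the snippet above. Run the classifier over the column in chunks instead of a per-row apply:

from transformers import pipeline

# "your-bertweet-sentiment-checkpoint" is a placeholder for a fine-tuned model
classifier = pipeline("sentiment-analysis", model="your-bertweet-sentiment-checkpoint")

texts = tweets_df["text"].tolist()
results = []
for start in range(0, len(texts), 256):   # chunked calls keep memory bounded
    results.extend(classifier(texts[start:start + 256]))

tweets_df["sentiment"] = [r["label"] for r in results]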

How to deal with batch data input

How do I deal with input batches of different lengths? For example, with batch_size = 2, how do I feed the two sentences "I like playing basketball" and "it's not a good day" as input?
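A sketch of the standard approach, not an official reply: tokenize the list of sentences together with padding enabled, and pass the attention mask so the model ignores the padded positions:

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False)
bertweet = AutoModel.from_pretrained("vinai/bertweet-base")

batch = tokenizer(["I like playing basketball", "it's not a good day"],
                  padding=True, return_tensors="pt")  # pads to the longer one
with torch.no_grad():
    outputs = bertweet(input_ids=batch["input_ids"],
                       attention_mask=batch["attention_mask"])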

synergy Jina <> BERTweet

hi VinAI team,

Great work 👍 I'm the founder & CEO of Jina AI; you may have used or heard of my previous work on Fashion-MNIST and bert-as-service. I'm the creator of those two OSS projects.

I'm asking if we can build a synergy between Jina <> BERTweet (& PhoBERT, posted separately). https://github.com/jina-ai/jina

Simply put, Jina is a universal neural search engine: a search infrastructure that can be used for searching text2text, image2image, audio2audio, etc. We already have examples using Jina for QA and semantic text search; full examples can be found here.

Potential synergy

  1. I see great potential to apply this in production. Therefore I kindly ask if you are interested in porting it into jina or jina-hub, so that people can use it as one of their search components in Jina.

  2. If you are interested in long-term collaboration, we also have a Slack channel, where we can invite you for more discussion. We also welcome your thoughts on it.

Re-training the language model

Hi all,

Thank you for the great work! This solves the problem of adapting BERT (trained on Wikipedia and the book corpus) to the tweet domain. Of course, the problem of adapting it to one's own domain of tweet data is still there. For this purpose, it would be useful to re-train the BERTweet language model first, to teach BERTweet the language of a specific domain. I have been investigating some tutorials and the Trainer module of transformers. Do you have any guidance, script, or tutorial that can be shared?

Many thanks
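For what it's worth, domain-adaptive pre-training can be sketched with the Trainer and the masked-LM collator. This is an illustrative outline, with my_tweets.txt (one normalized tweet per line) as a hypothetical input file and the datasets library assumed installed:

from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False)
model = AutoModelForMaskedLM.from_pretrained("vinai/bertweet-base")

dataset = load_dataset("text", data_files={"train": "my_tweets.txt"})
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"])

# randomly masks 15% of tokens, matching the standard MLM objective
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm_probability=0.15)
args = TrainingArguments(output_dir="bertweet-domain-adapted",
                         num_train_epochs=1, per_device_train_batch_size=32)
Trainer(model=model, args=args, data_collator=collator,
        train_dataset=tokenized["train"]).train()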

Sentence embeddings

How do I generate entire tweet embeddings instead of word embeddings using BERTweet?
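One common recipe, a sketch rather than an official answer: mean-pool the token states, weighting by the attention mask so padding does not contribute:

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False)
bertweet = AutoModel.from_pretrained("vinai/bertweet-base")

batch = tokenizer(["HTTPURL via @USER :cry:"], padding=True, return_tensors="pt")
with torch.no_grad():
    token_states = bertweet(**batch)[0]          # (batch, seq_len, 768)

mask = batch["attention_mask"].unsqueeze(-1)     # (batch, seq_len, 1)
tweet_embedding = (token_states * mask).sum(1) / mask.sum(1)  # (batch, 768)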

IndexError: index out of range in self

Hi, I encountered an error: "IndexError: index out of range in self." Below is my code. Can you help identify where the problem is? Is it related to the length of the sequence? I can provide the specific text if you need it.

pretrain_model = 'vinai/bertweet-base'
tokenizer = AutoTokenizer.from_pretrained(pretrain_model)
model = AutoModelForMaskedLM.from_pretrained(pretrain_model)

inputs = tokenizer(text, return_tensors="pt", padding='max_length', truncation=True, max_length=512)
last_hidden_states = model(**inputs, output_hidden_states=True).hidden_states[-1]
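A plausible cause, not from the thread itself but consistent with the model table earlier in this README: vinai/bertweet-base supports a max length of 128, so padding to max_length=512 produces position indices beyond the position-embedding table, which raises exactly this IndexError inside the embedding lookup. Capping max_length fixes it:

from transformers import AutoModelForMaskedLM, AutoTokenizer

pretrain_model = 'vinai/bertweet-base'
tokenizer = AutoTokenizer.from_pretrained(pretrain_model, use_fast=False)
model = AutoModelForMaskedLM.from_pretrained(pretrain_model)

text = "HTTPURL via @USER :cry:"  # stand-in for the reporter's text
inputs = tokenizer(text, return_tensors="pt", padding='max_length',
                   truncation=True, max_length=128)  # 128, not 512
last_hidden_states = model(**inputs, output_hidden_states=True).hidden_states[-1]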

Using BERTweet with FARM

When I try to use the BERTweet model with the FARM package I get the following error. It seems to be struggling to find the model, but I do not understand why. I am using the Jupyter notebook described in this article and have included the cell code with the error below.

lang_model = "vinai/bertweet-base"
do_lower_case = False

tokenizer = Tokenizer.load(
    pretrained_model_name_or_path=lang_model,
    do_lower_case=do_lower_case)
---------------------------------------------------------------------------

OSError                                   Traceback (most recent call last)

/var/folders/5r/p050t_sd4l130ytlj_x4wxyh0000gn/T/ipykernel_39034/78292544.py in <module>
      2 do_lower_case = False
      3 
----> 4 tokenizer = Tokenizer.load(
      5     pretrained_model_name_or_path=lang_model,
      6     do_lower_case=do_lower_case)

~/Git/Trade_with_Twitter/venv/lib/python3.8/site-packages/farm/modeling/tokenization.py in load(cls, pretrained_model_name_or_path, revision, tokenizer_class, use_fast, **kwargs)
     95         elif "RobertaTokenizer" in tokenizer_class:
     96             if use_fast:
---> 97                 ret = RobertaTokenizerFast.from_pretrained(pretrained_model_name_or_path, **kwargs)
     98             else:
     99                 ret = RobertaTokenizer.from_pretrained(pretrained_model_name_or_path, **kwargs)

~/Git/Trade_with_Twitter/venv/lib/python3.8/site-packages/transformers/tokenization_utils_base.py in from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs)
   1706                 f"- or '{pretrained_model_name_or_path}' is the correct path to a directory containing relevant tokenizer files\n\n"
   1707             )
-> 1708             raise EnvironmentError(msg)
   1709 
   1710         for file_id, file_path in vocab_files.items():

OSError: Can't load tokenizer for 'vinai/bertweet-base'. Make sure that:

- 'vinai/bertweet-base' is a correct model identifier listed on 'https://huggingface.co/models'

- or 'vinai/bertweet-base' is the correct path to a directory containing relevant tokenizer files

Could you share the pre-processed tweet data?

Hi all,

Thanks for the great work! I used it to make this little emoji recommender:

http://rensdimmendaal.com/emoji/

I'd love to expand the number of different emoji I can recommend. However, I cannot at the moment, because some emoji are split into multiple tokens. Would you be willing to share the preprocessed data of bertweet-base so I can add these as single tokens to the vocabulary and tune the model?

next sentence prediction

Thanks for developing BERTweet.
Here is a conceptual question about utilizing tweets for training a BERT model; I am curious how you have handled it.

The BERT language model has a "next sentence prediction" (NSP) objective, where the LM is optimized to predict the next sentence.

Since tweets are short and often contain only one sentence, I am curious how you handled that and how you bypassed the NSP part?

Thank you again.

Tokenizer vinai/bertweet-covid19-base-uncased

Does vinai/bertweet-covid19-base-uncased use the same tokenizer as bertweet-base? I've been trying to run code on my annotated data, and it keeps giving me the error "index is out of range in self" during the training stage if I download the model and tokenizer from bertweet-covid19-base-uncased. The only way it worked for me was using the model from bertweet-covid19-base-uncased and the tokenizer from bertweet-base.
