bfelbo / deepmoji

State-of-the-art deep learning model for analyzing sentiment, emotion, sarcasm etc.

Home Page: https://deepmoji.mit.edu/

License: MIT License

Python 100.00%
ai deep-learning keras machine-learning natural-language-processing neural-networks nlp python sentiment-analysis tensorflow text-classification

deepmoji's People

Contributors

bfelbo, cyberzhg, priyankaranke, rht, ruiqizhong, stefan-it


deepmoji's Issues

Benchmark Dataset Splits

Hello,

I was trying to recreate the results of the new method in the paper when I got confused about the splits provided in the dataset. For example, the SS-Youtube dataset is said to have 1000 training samples and 1142 test samples in Table 4 of the paper. However, the splits given in the raw.pickle file of the dataset result in 800 training, 100 validation, and 1242 test samples. Does this mean the splits were changed at some point, or is there a typo in the paper?

Thank you.

correlation matrix calculation & data preprocessing

Hi, I'm a bit confused about the calculation of the correlation matrix: how can we use the pretraining test set to obtain it? Here is my assumption:

  1. For each emoji, we have 10K test samples. Using the pretrained model to predict the distribution over the 64 emojis, we get a (10K, 64) matrix for each emoji.
  2. Averaging along axis 0, we get a (1, 64) vector for each emoji and then calculate the correlations between these vectors (see the sketch below). Do I understand correctly?
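
To make the assumption concrete, here is a sketch of the computation I have in mind (my own illustration with random stand-in predictions, not code from this repo):

import numpy as np

# Stand-in predictions: for each of the 64 emojis, the pretrained model's
# predicted distribution over 64 classes for its test samples (10K in the
# real setup; 1K here to keep the sketch light).
preds = np.random.rand(64, 1000, 64)
preds /= preds.sum(axis=-1, keepdims=True)  # normalize rows to distributions

mean_dists = preds.mean(axis=1)   # (64, 64): one mean distribution per emoji
corr = np.corrcoef(mean_dists)    # (64, 64) correlation matrix between emojis
print(corr.shape)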

Also, when labeling a text with its emoji(s) from the original tweet, what happens if an emoji doesn't occur in the selected set of 64? Is it dropped, or kept in the text as an ordinary character?

Looking forward to your reply.

Unable to freeze Bidirectional Layer

I've run the script examples/finetune_youtube_last.py. Only the last layer's weights should be trainable, but I found that the Bidirectional layers are not frozen. I use TensorFlow 1.3.0 as the backend, and the Keras version is 2.0.9.

(screenshot omitted)
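
For reference, the freezing pattern I expected to work, as a minimal sketch with a toy stand-in model (not the repo's finetuning code). Note that Keras only applies trainable changes at the next compile():

from keras.models import Model
from keras.layers import Input, Dense, LSTM, Bidirectional

# Toy stand-in model, not DeepMoji's actual architecture.
inp = Input(shape=(30, 8))
x = Bidirectional(LSTM(16), name='bi_lstm_0')(inp)
out = Dense(5, activation='softmax', name='softmax')(x)
model = Model(inp, out)

# Freeze everything except the softmax, then compile. Setting `trainable`
# after compile() has no effect until the model is compiled again.
for layer in model.layers:
    layer.trainable = (layer.name == 'softmax')
model.compile(loss='categorical_crossentropy', optimizer='adam')

print([w.name for w in model.trainable_weights])  # should list only softmax weights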

Training Time?

Just wondering if you have rough figures on how long it took to train the released model (type/number of GPUs, epochs, total time)? I'm hoping to build a similar corpus and am trying to plan while the data collects. Thanks.

"New" method

Hello,

I am unsure about the meaning of the "new" method for transfer learning in DeepMoji. In finetuning.py it seems that the only difference between "full" and "new" is the loss rate. The paper describes "new" as a "model trained without pretraining", indicating that it is equivalent to a network with the same architecture but without any transferred weights. However, when I run "new" against both "full" and a newly instantiated model (using deepmoji_architecture), I get differing results. I would very much appreciate an explanation. Thank you!

Problem adjusting finetuning examples with my data

Hi there,

First, thank you so much for all of your work on this project. I'm really excited about what this model can be used for.

I'm having a problem finetuning the model for my data. I'm using the finetuning_insults_chain-thaw.py example.

My data has 5 classes, and when I run it through the load_benchmark() function, it separates my data into training, validation, and test sets with the following shapes:
(1040, 30)
(116, 30)
(290, 30)

My labels have the following shapes:
(1040,)
(116,)
(290,)

All of my label arrays have the following unique values: [1, 2, 3, 4, 5]

I hit the following error in the finetune() function:

ValueError: Error when checking target: expected softmax to have shape (None, 5) but got array with shape (116, 6)

I find this odd because the example seems to work on the benchmark data. I'm curious why my data is being reshaped in an unexpected way. If you could point me in the right direction for how to troubleshoot this I'd really appreciate it.

Thanks!

Here is the error report:

Extended vocabulary for embedding layer from 50000 to 50556 tokens.
Loading weights for bi_lstm_0
Loading weights for bi_lstm_1
Loading weights for attlayer
Ignoring weights for softmax
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
input_1 (InputLayer)            (None, 30)           0
__________________________________________________________________________________________________
embedding (Embedding)           (None, 30, 256)      12942336    input_1[0][0]
__________________________________________________________________________________________________
activation_1 (Activation)       (None, 30, 256)      0           embedding[0][0]
__________________________________________________________________________________________________
embed_drop (SpatialDropout1D)   (None, 30, 256)      0           activation_1[0][0]
__________________________________________________________________________________________________
bi_lstm_0 (Bidirectional)       (None, 30, 1024)     3149824     embed_drop[0][0]
__________________________________________________________________________________________________
bi_lstm_1 (Bidirectional)       (None, 30, 1024)     6295552     bi_lstm_0[0][0]
__________________________________________________________________________________________________
concatenate_1 (Concatenate)     (None, 30, 2304)     0           bi_lstm_1[0][0]
                                                                 bi_lstm_0[0][0]
                                                                 embed_drop[0][0]
__________________________________________________________________________________________________
attlayer (AttentionWeightedAver (None, 2304)         2304        concatenate_1[0][0]
__________________________________________________________________________________________________
dropout_1 (Dropout)             (None, 2304)         0           attlayer[0][0]
__________________________________________________________________________________________________
softmax (Dense)                 (None, 5)            11525       dropout_1[0][0]
==================================================================================================
Total params: 22,401,541
Trainable params: 22,401,541
Non-trainable params: 0
__________________________________________________________________________________________________
Method:  chain-thaw
Metric:  acc
Classes: 5
Training..
WARNING:tensorflow:From .../deepmoji/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py:1340: calling reduce_mean (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
Finetuning softmax
Traceback (most recent call last):
  File "finetune_android_chain-thaw.py", line 76, in <module>
    data['batch_size'], method='chain-thaw')
  File ".../deepmoji/finetuning.py", line 377, in finetune
    evaluate=metric, verbose=verbose)
  File "...deepmoji/finetuning.py", line 546, in chain_thaw
    batch_size=batch_size, verbose=verbose)
  File ".../deepmoji/finetuning.py", line 626, in train_by_chain_thaw
    callbacks=callbacks, verbose=(verbose >= 2))
  File ".../deepmoji/lib/python2.7/site-packages/keras/legacy/interfaces.py", line 87, in wrapper
    return func(*args, **kwargs)
  File ".../deepmoji/lib/python2.7/site-packages/keras/engine/training.py", line 2013, in fit_generator
    val_x, val_y, val_sample_weight)
  File ".../deepmoji/lib/python2.7/site-packages/keras/engine/training.py", line 1413, in _standardize_user_data
    exception_prefix='target')
  File ".../deepmoji/lib/python2.7/site-packages/keras/engine/training.py", line 154, in _standardize_input_data
    str(array.shape))
ValueError: Error when checking target: expected softmax to have shape (None, 5) but got array with shape (116, 6)
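
In case it helps, here is a minimal reproduction of what I suspect is happening (my assumption: the labels get one-hot encoded up to the maximum label value, as keras.utils.to_categorical does, so 1-indexed labels with a maximum of 5 produce 6 columns):

import numpy as np
from keras.utils import to_categorical

labels = np.array([1, 2, 3, 4, 5])       # 1-indexed labels, 5 classes
print(to_categorical(labels).shape)      # (5, 6): column 0 is never used
print(to_categorical(labels - 1).shape)  # (5, 5): matches nb_classes=5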

Return More Emoji

Is it possible to return a wider set of emoji than the 64 emoji defined here?

I would like to modify DeepMoji to score text against a wider set of emoji, using the same Twitter data used for the model.

Emoji Codes

Thank you for your great work! What is the mapping between the emoji (HTML) codes and the values in the output?

(.env) [loretoparisi@:mbploreto examples]$ cat test_sentences.csv 
Text,Top5%,Emoji_1,Emoji_2,Emoji_3,Emoji_4,Emoji_5,Pct_1,Pct_2,Pct_3,Pct_4,Pct_5
I love mom's cooking,0.66769218631088734,36,4,8,16,47,0.490533,0.0876578,0.0309252,0.0296329,0.0289432
I love how you never reply back..,0.39060430228710175,1,19,55,25,46,0.13978,0.0825157,0.0627883,0.0541912,0.0513291
I love cruising with my homies,0.5413312129676342,31,6,30,15,13,0.339657,0.0660581,0.0570588,0.040674,0.0378832
I love messing with yo mind!!,0.48666666820645332,54,44,9,50,49,0.172363,0.118436,0.0796059,0.0637103,0.0525513
I love you and now you're just gone..,0.67333512753248215,46,5,27,35,34,0.391165,0.110334,0.0734624,0.0529587,0.0454147
This is shit,0.31180432066321373,55,32,27,1,37,0.0700932,0.0639694,0.0601157,0.0595257,0.0581003
This is the shit,0.37477066367864609,48,11,6,31,9,0.108907,0.0965946,0.0648208,0.0565014,0.0479466
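
To clarify what I'm after, here is a sketch of the lookup I want to do (EMOJI_MAP below is a hypothetical two-entry stand-in; the real, ordered 64-entry mapping is exactly what I'm asking for):

# Hypothetical mapping from output index to emoji code; NOT the repo's
# actual table, just an illustration of the desired lookup.
EMOJI_MAP = {36: u'\U0001F60D', 4: u'\U0001F602'}

row = {'Emoji_1': '36', 'Pct_1': '0.490533'}  # one row of test_sentences.csv
idx = int(row['Emoji_1'])
print(u'{} {}'.format(EMOJI_MAP.get(idx, u'?'), row['Pct_1']))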

Deploying the model on mobile

Hello team!
I was wondering how we can go about deploying the DeepMoji model on mobile. The optimized size is around 22 MB, but for client-side deployment we need a model size of about 3-4 MB. Any tips on how to go about this, or on compressing the model?
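
One direction I'm considering (my own assumption; it requires a much newer TensorFlow than this repo targets, roughly 1.14+): post-training weight quantization with TF Lite, which typically cuts float32 weight size by about 4x. DeepMoji's custom attention layer may complicate the conversion, so this is only the general shape of the approach:

import tensorflow as tf  # assumes TF >= 1.14, not this repo's TF 1.3

# Sketch: convert a fully saved Keras model (architecture + weights) to a
# quantized TFLite flatbuffer.
converter = tf.lite.TFLiteConverter.from_keras_model_file('deepmoji_model.h5')
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()
with open('deepmoji.tflite', 'wb') as f:
    f.write(tflite_model)
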
Thanks in advance!

problem with saving the intermediate weights in `class_avg_tune`

Hi,

I find it difficult to save intermediate weights while calling class_avg_tune_trainable. I wish to keep track of the weights over training rather than only having the last iteration's weights. Is that possible in the current implementation? I looked everywhere and it seems it's not yet implemented; am I missing or overlooking something? I am currently trying to pretrain my own model on a different dataset, labeled for 10 classes rather than the emojis, and then apply the transfer methods in DeepMoji.
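
For context, what I'm after is something like Keras' standard checkpoint callback (a sketch of plain Keras; `model`, X_train, and y_train are placeholders, and the open question is how to pass such a callback through class_avg_tune_trainable):

from keras.callbacks import ModelCheckpoint

# Standard Keras pattern: save the weights after every epoch instead of
# keeping only the final ones.
checkpoint = ModelCheckpoint('weights.{epoch:02d}.hdf5', save_weights_only=True)
model.fit(X_train, y_train, epochs=10, callbacks=[checkpoint])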

Thank you!

ValueError: Dimension 1 in both shapes must be equal, but are 64 and 1.

Hi,
I have finetuned the model with my data using finetune_dataset.py. When I use the finetuned model to run scripts/score_texts_emojis.py, I get the following error.

Traceback (most recent call last):
  File "emo_run.py", line 82, in <module>
    model = deepmoji_emojis(maxlen, PRETRAINED_PATH)
  File "C:\Users\ntelkunte\AppData\Local\Programs\Python\Python36\DeepMoji\deepmoji\model_def.py", line 58, in deepmoji_emojis
    model.load_weights(weight_path, by_name=False)
:
:
ValueError: Dimension 1 in both shapes must be equal, but are 64 and 1. Shapes are [2304,64] and [2304,1]. for 'Assign_14' (op: 'Assign') with input shapes: [2304,64], [2304,1].
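
If it helps to narrow this down, my guess (an assumption, not verified) is that deepmoji_emojis always rebuilds the 64-way pretraining head, while the fine-tuned weights have a single output unit, hence the [2304,64] vs [2304,1] mismatch. Something along these lines might match the shapes (deepmoji_architecture's exact signature is unverified):

from deepmoji.model_def import deepmoji_architecture

# Sketch: rebuild the fine-tuned architecture (1 output unit) rather than
# the 64-emoji pretraining head, then load the fine-tuned weights into it.
# nb_tokens and maxlen must match the values used during fine-tuning.
model = deepmoji_architecture(nb_classes=1, nb_tokens=50000, maxlen=30)
model.load_weights('finetuned_weights.hdf5')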

Python 3 support?

Python 2.7 will be deprecated by the end of the year; are there plans to update the model to support Python 3 anytime soon?

Message length limits?

From the site demo (the server is down as of this issue's date), it appears that text input beyond 140 characters cannot be analyzed. Can someone confirm this? Thanks.

Fine-tuning to predict emotion labels

Hi,
I have been trying to use DeepMoji to predict emotion labels for a given text. I have 7 labels I would like to classify:

{'0': 'neutral', '1': 'angry', '2': 'disgusted', '3': 'afraid', '4': 'joyful', '5': 'sad', '6': 'surprised'}

For that purpose, I have used the principles from finetune_youtube_last.py to fine-tune the model for this task with nb_classes=7 and the accuracy measure. The fine-tuning worked nicely and it reported an accuracy of about 85% in the end. The training data was split 0.7, 0.1, 0.2 for training, test and validation (default params).

I have used the DailyDialogues dataset with the following emotion distribution:

{ "neutral": 72143, "joyful": 11182, "surprised": 1600, "afraid": 146, "disgusted": 303, "sad": 969, "angry": 827}

I am fully aware that the unbalanced distribution of the emotions might yield poorer results.

Now, I have built the classifier using the deepmoji_transfer model passing it the nb_classes=7 and the path to my fine-tuned model. I am trying to predict the emotion label similar to what is done in examples/score_texts_emojis.py

tokenized, _, _ = st.tokenize_sentences(TEST_SENTENCES)
model = deepmoji_transfer(maxlen, PRETRAINED_PATH, nb_classes=7)
prob = model.predict(tokenized)[0]

Given the message 'I'm sorry I'm so late . I had a really bad day .' with the true label '5: sad', the model predicts the following prob:

prob = [ 0.09569884  0.0398518  -0.04161     0.00866034  0.06039366 -0.07945964
 -0.07649311]

This yields the following sorted predictions (using top_elements(prob, 7)):

sorted_idx = [0 4 1 3 2 6 5]
sorted_labels = ['neutral', 'joyful', 'angry', 'afraid', 'disgusted', 'surprised', 'sad']

So the predicted label would be 'neutral' here.

I have several questions/problems with this classifier that you might help me with:

  • What do the negative probabilities in the prob array mean, e.g. the value -0.04161 for the label '2': 'disgusted'?
  • Do you see any issues in my procedure for predicting the label from the code above?
  • I observe non-deterministic behaviour: every time I reload the model, it yields different predictions. For example, when running the prediction several times for the sentence above while re-loading the model from scratch each time, I get a different prediction almost every time. Running the prediction several times within the same model instance always produces the same predictions (OK). See the check after this list.
  • When I validate the model on my test set, I get a precision of about 13.4%, compared to the accuracy of 85% during fine-tuning. How can this be? I don't think I'm overfitting the training data, since when I run my validation on the training data I get a precision of just about 2.6%.
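
For reference, this is the minimal check behind the non-determinism point (a sketch reusing the names from my snippet above):

import numpy as np
from deepmoji.model_def import deepmoji_transfer  # same import as my script above

# If two freshly built models disagree on the same tokenized input, some
# weights must be randomly (re)initialized at load time.
m1 = deepmoji_transfer(maxlen, PRETRAINED_PATH, nb_classes=7)
m2 = deepmoji_transfer(maxlen, PRETRAINED_PATH, nb_classes=7)
p1 = m1.predict(tokenized)[0]
p2 = m2.predict(tokenized)[0]
print(np.allclose(p1, p2))  # prints False for me almost every time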

Your support is highly appreciated! Thanks in advance for the efforts.

Typo DATASET vs. DATASETS?

Is the constant singular when it should be plural? Or should it be dset?

flake8 testing of https://github.com/bfelbo/DeepMoji

$ flake8 . --count --select=E901,E999,F821,F822,F823 --show-source --statistics

./scripts/analyze_all_results.py:16:75: F821 undefined name 'DATASET'
    RESULT_PATHS = glob.glob('{}/{}_{}_*_results.txt'.format(RESULTS_DIR, DATASET, METHOD))
                                                                          ^

./scripts/analyze_all_results.py:30:32: F821 undefined name 'DATASET'
    print('Dataset: {}'.format(DATASET))
                               ^

Error when attempting to use either Theano or TensorFlow

Hello,

So I have been trying to solve these problems for a while. I am running Python 2.7 64-bit on Windows. When I run the nose tests with TensorFlow, I get "No module named Tensorflow", along with an error message suggesting that my TensorFlow version differs from the one the test expects (specifically, some file reading version six rather than version 5), but I couldn't figure that one out.

Anyway, I am now attempting to use Theano, and have reinstalled multiple times, but the nose tests keep giving me "ImportError: No module named Theano", so clearly the problem is the same. Do I need to install the directory somewhere specific?
If it makes any difference, I am downloading the zip files instead of cloning.
Any ideas? I am going to try installing this on a Mac on another computer to see if that works, but I would really prefer to get it running on my own machine.

index error with SCv1 and SCv2-GEN

Hello, I really appreciate your amazing work and am trying to replicate it. The error I get with SCv1 and SCv2-GEN is below, and I have no clue what causes it, since I had no problem with the other datasets. Is there anything I need to edit additionally? Thank you in advance!

File "/deepmoji/class_avg_finetuning.py", line 103, in class_avg_finetune
.format(expected_shape, ls.shape[1]))
IndexError: tuple index out of range

Test Failed: IOError ([Errno 2] No such file or directory: '../model/vocabulary.json')

======================================================================
ERROR: Failure: IOError ([Errno 2] No such file or directory: '../model/vocabulary.json')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/nose/loader.py", line 418, in loadTestsFromName
    addr.filename, addr.module)
  File "/usr/local/lib/python2.7/dist-packages/nose/importer.py", line 47, in importFromPath
    return self.importFromDir(dir_path, fqname)
  File "/usr/local/lib/python2.7/dist-packages/nose/importer.py", line 94, in importFromDir
    mod = load_module(part_fqname, fh, filename, desc)
  File "/content/DeepMoji/tests/test_sentence_tokenizer.py", line 26, in <module>
    with open('../model/vocabulary.json', 'r') as f:
IOError: [Errno 2] No such file or directory: '../model/vocabulary.json'

----------------------------------------------------------------------
Ran 25 tests in 33.304s

FAILED (errors=1)

Clarification

How were the Kaggle and YouTube comments classified: were they cut off at 140 characters, or was transfer learning applied with fewer but longer examples?

How did you relate data labels to emojis

I have been trying to predict emojis using your data and your model as a reference, but I am unable to relate the labels in your data to emojis. After training on each dataset and getting a model for each one, how do I relate all those models to get a single combined model that predicts among all 64 emoji codes?

Training data

I found that you used a dataset of 56.6 billion tweets as raw data, so I was wondering how you got so many tweets. Did you use a crawler, the Twitter API, or something else?

Attention weights

Do you have a preferred/recommended way of exposing the attention weights at inference-time for visualizing the words/timesteps that were most heavily weighted for the prediction with this Keras implementation?
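
In case a sketch helps frame the question: the pattern I'm hoping for is roughly the following (an unverified assumption on my part; I believe the attention layer has a return_attention-style flag, but I haven't confirmed the exact name or that the model builders expose it):

from deepmoji.model_def import deepmoji_emojis

# Sketch (unverified): if the builder exposes the attention layer's
# return_attention flag, the model could emit per-timestep attention
# weights alongside the class probabilities.
# `tokenized` would be the output of SentenceTokenizer, as in
# examples/score_texts_emojis.py.
model = deepmoji_emojis(maxlen=30, weight_path='model/deepmoji_weights.hdf5',
                        return_attention=True)
probs, att_weights = model.predict(tokenized)  # att_weights: (batch, timesteps)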

How are emojis handled? How are they encoded?

Hi https://github.com/bfelbo
Sorry for pestering you! I am currently building a similar emoji-prediction model for the Hindi (India) language. I couldn't understand how you treat the emojis gathered from a tweet when training the model. If X_train contains the sentences from the tweet data, then what is Y_train? How are the emojis handled: are they one-hot encoded, or has some other methodology been adopted?
Hoping for a response as I am a bit stuck here. I have written scripts for building a vocabulary from the corpus I gathered.
Thanks a ton!!

Building similar model for Hindi(India) language.

I am trying to build a similar model for Hindi (Devanagari). Will the DeepMoji pipeline, from building the vocabulary to tokenizing, stay the same, or could we instead use pre-trained word embeddings such as those provided by fastText and ELMo? I am curious why building your own vocabulary appealed as a better way to target the problem than using word embeddings.
I had the notion that once we have converted the words to their respective numeric representations, the model works accordingly, but your team specifically went for building dataset-specific vocabularies, then tokenizing, and finally training. What does the model do when an out-of-vocabulary (OOV) word appears in a test case?
Sorry for so many doubts. I am just beginning to understand emotion analysis.
Thanks!!

Errors importing data from a csv file with Unicode decoding: Any Ideas?

Hey everyone, I finally got past some errors encoding my CSV data to Unicode so that it can be scored. I am using the score_texts_emojis file in examples to attempt this.

Now I am getting problems decoding bytes within my file, and I was wondering what the solution might be and whether anyone else has run into this. Here is what my code looks like:

import csv

TEST_SENTENCES = []
with open('Cleaned_Data3.csv', 'rU') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        TEST_SENTENCES.append(row["Tweet"])
    try:
        [x.encode('utf-8') for x in TEST_SENTENCES]
    except:
        for rows in TEST_SENTENCES:  # attempt to fix the problem
            rows = unicode(rows, errors='replace')  # note: doesn't update the list
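
For reference, the variant I am trying now (a Python 2 sketch, assuming the file is UTF-8 encoded):

import csv

TEST_SENTENCES = []
with open('Cleaned_Data3.csv', 'rU') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        # csv yields byte strings in Python 2; decode explicitly and
        # substitute undecodable bytes instead of raising.
        TEST_SENTENCES.append(row['Tweet'].decode('utf-8', 'replace'))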

code update

Hey there 🎀,

the paper behind this repository looks amazing, and I would love the possibility to experiment with it hands-on. Are you still planning to release some code and pre-trained models here?

Many thanks 🤝,
Marcel

Problem with F1 Metric in Fine Tuning

Hi,

Thanks for the nice and useful repository!

Currently, using the F1 metric instead of the accuracy does not work when fine tuning. For that to work, the global variables might have to be changed at this line ('weighted'->'weighted_f1'):

FINETUNING_METRICS = ['acc', 'weighted']

Also, the fine-tuning code has to be modified at this line as there is no 'weighted_f1' average option for f1_score:

average='weighted_f1')
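
For clarity, a minimal example of the valid scikit-learn call (standard sklearn API; 'weighted' is the supported average option):

from sklearn.metrics import f1_score

y_true = [0, 1, 2, 2, 1]
y_pred = [0, 2, 2, 2, 0]
print(f1_score(y_true, y_pred, average='weighted'))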

Struggling to run the examples

Hi, awesome work.

I'm having a hard time getting one of the examples to run (score_texts_emojis.py). It's throwing an error when compiling the regular expression in the tokenizer.py module:

RE_PATTERN = re.compile(ur'|'.join(IGNORED) + ur'|(' + ur'|'.join(TOKENS) + ur')',
                        re.UNICODE)

File "C:\Users\W\Downloads\DeepMoji-master\DeepMoji-master\deepmoji\tokenizer.py", line 139, in
re.UNICODE)
File "C:\Python27\lib\re.py", line 194, in compile
return _compile(pattern, flags)
File "C:\Python27\lib\re.py", line 251, in _compile
raise error, v # invalid expression
sre_constants.error: bad character range

Here's the resulting regular expression before it attempts to compile it:

(u'\s+|((?:https?://|www\.)(?:[a-zA-Z]|[0-9]|[$-@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+|\b[a-zA-Z0-9.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+\b|[a-zA-Z]+[-][a-zA-Z]+|#[a-zA-Z0-9]+|@[a-zA-Z0-9_]+|(?:<+/?3+)+|\-\\-|x\x|\^\\^|o\.o|o\o|\(\:|\)\:|\)\;|\(\;|\>\:\-?D+|\>\:\-?d+|\>\:\-?p+|\>\:\-?P+|\>\:\-?v+|\>\:\-?\)+|\>\:\-?o+|\>\:\-?O+|\>\:\-?\(+|\>\:\-?3+|\>\:\-?\/+|\>\:\-?\|+|\>\:\-?\\+|\>\:\,?D+|\>\:\,?d+|\>\:\,?p+|\>\:\,?P+|\>\:\,?v+|\>\:\,?\)+|\>\:\,?o+|\>\:\,?O+|\>\:\,?\(+|\>\:\,?3+|\>\:\,?\/+|\>\:\,?\|+|\>\:\,?\\+|\>\:\^?D+|\>\:\^?d+|\>\:\^?p+|\>\:\^?P+|\>\:\^?v+|\>\:\^?\)+|\>\:\^?o+|\>\:\^?O+|\>\:\^?\(+|\>\:\^?3+|\>\:\^?\/+|\>\:\^?\|+|\>\:\^?\\+|\>\:\'?D+|\>\:\'?d+|\>\:\'?p+|\>\:\'?P+|\>\:\'?v+|\>\:\'?\)+|\>\:\'?o+|\>\:\'?O+|\>\:\'?\(+|\>\:\'?3+|\>\:\'?\/+|\>\:\'?\|+|\>\:\'?\\+|\>\:\"?D+|\>\:\"?d+|\>\:\"?p+|\>\:\"?P+|\>\:\"?v+|\>\:\"?\)+|\>\:\"?o+|\>\:\"?O+|\>\:\"?\(+|\>\:\"?3+|\>\:\"?\/+|\>\:\"?\|+|\>\:\"?\\+|\:\-?D+|\:\-?d+|\:\-?p+|\:\-?P+|\:\-?v+|\:\-?\)+|\:\-?o+|\:\-?O+|\:\-?\(+|\:\-?3+|\:\-?\/+|\:\-?\|+|\:\-?\\+|\:\,?D+|\:\,?d+|\:\,?p+|\:\,?P+|\:\,?v+|\:\,?\)+|\:\,?o+|\:\,?O+|\:\,?\(+|\:\,?3+|\:\,?\/+|\:\,?\|+|\:\,?\\+|\:\^?D+|\:\^?d+|\:\^?p+|\:\^?P+|\:\^?v+|\:\^?\)+|\:\^?o+|\:\^?O+|\:\^?\(+|\:\^?3+|\:\^?\/+|\:\^?\|+|\:\^?\\+|\:\'?D+|\:\'?d+|\:\'?p+|\:\'?P+|\:\'?v+|\:\'?\)+|\:\'?o+|\:\'?O+|\:\'?\(+|\:\'?3+|\:\'?\/+|\:\'?\|+|\:\'?\\+|\:\"?D+|\:\"?d+|\:\"?p+|\:\"?P+|\:\"?v+|\:\"?\)+|\:\"?o+|\:\"?O+|\:\"?\(+|\:\"?3+|\:\"?\/+|\:\"?\|+|\:\"?\\+|\=\-?D+|\=\-?d+|\=\-?p+|\=\-?P+|\=\-?v+|\=\-?\)+|\=\-?o+|\=\-?O+|\=\-?\(+|\=\-?3+|\=\-?\/+|\=\-?\|+|\=\-?\\+|\=\,?D+|\=\,?d+|\=\,?p+|\=\,?P+|\=\,?v+|\=\,?\)+|\=\,?o+|\=\,?O+|\=\,?\(+|\=\,?3+|\=\,?\/+|\=\,?\|+|\=\,?\\+|\=\^?D+|\=\^?d+|\=\^?p+|\=\^?P+|\=\^?v+|\=\^?\)+|\=\^?o+|\=\^?O+|\=\^?\(+|\=\^?3+|\=\^?\/+|\=\^?\|+|\=\^?\\+|\=\'?D+|\=\'?d+|\=\'?p+|\=\'?P+|\=\'?v+|\=\'?\)+|\=\'?o+|\=\'?O+|\=\'?\(+|\=\'?3+|\=\'?\/+|\=\'?\|+|\=\'?\\+|\=\"?D+|\=\"?d+|\=\"?p+|\=\"?P+|\=\"?v+|\=\"?\)+|\=\"?o+|\=\"?O+|\=\"?\(+|\=\"?3+|\=\"?\/+|\=\"?\|+|\=\"?\\+|\;\-?D+|\;\-?d+|\;\-?p+|\;\-?P+|\;\-?v+|\;\-?\)+|\;\-?o+|\;\-?O+|\;\-?\(+|\;\-?3+|\;\-?\/+|\;\-?\|+|\;\-?\\+|\;\,?D+|\;\,?d+|\;\,?p+|\;\,?P+|\;\,?v+|\;\,?\)+|\;\,?o+|\;\,?O+|\;\,?\(+|\;\,?3+|\;\,?\/+|\;\,?\|+|\;\,?\\+|\;\^?D+|\;\^?d+|\;\^?p+|\;\^?P+|\;\^?v+|\;\^?\)+|\;\^?o+|\;\^?O+|\;\^?\(+|\;\^?3+|\;\^?\/+|\;\^?\|+|\;\^?\\+|\;\'?D+|\;\'?d+|\;\'?p+|\;\'?P+|\;\'?v+|\;\'?\)+|\;\'?o+|\;\'?O+|\;\'?\(+|\;\'?3+|\;\'?\/+|\;\'?\|+|\;\'?\\+|\;\"?D+|\;\"?d+|\;\"?p+|\;\"?P+|\;\"?v+|\;\"?\)+|\;\"?o+|\;\"?O+|\;\"?\(+|\;\"?3+|\;\"?\/+|\;\"?\|+|\;\"?\\+|[a-zA-Z]+\'[a-zA-Z]+|(?i)Mr\.|(?i)Ms\.|(?i)Mrs\.|(?i)Dr\.|(?i)Prof\.|\b(?<!\.)(?:[A-Za-z]\.){2,}|[0-9]+|[a-zA-Z]+|\(+|\)+|\<+|\!+|\?+|\.+|\,+|\/+|\\+|\'+|\\+|\"+|\-+|\+|\=+|\\+|\\+|\\xa7+|\|+|\\xb4+|\\u02c7+|\\xb0+|\[+|\]+|\<+|\>+|\{+|\}+|\~+|\$+|\^+|\&+|\*+|\;+|\:+|\%+|\++|\\+|x+|a+|3+|\\u20ac+|\`+|#+(?=#[a-zA-Z0-9]+)|@+(?=@[a-zA-Z0-9_]+)|#+|@+|[\U0001f300-\U0001f64f\U0001f680-\U0001f6ff\n\u2600-\u26ff\u2700-\u27bf]|.)'

Thanks

How to start training deepmoji on a new language corpus?

Hi Team!
Great work first of all!
I have gone through the code and implementation of DeepMoji. Sorry if I sound like a bit of a noob, but could you please explain how we feed data to DeepMoji? What kind of file/format does it expect, and what steps need to be taken to start training from scratch? Kindly help!

how to return impact associated to each word per sentence

Hi @bfelbo , and first of all thanks for sharing your great work.

I have a dataset whose domain is a little different from Twitter. I have a couple of questions and would really appreciate your help.

To start, I fine-tuned on my dataset and got the accuracy. However, what is important for me is to be able to find the impact of the words in each sentence (the same highlighting that you have in the demo).
For example:

"This disease is very dangerous"
Not only do I want the negative label, but also the weight associated with "dangerous".

I saw this PR (#8); is that what I need? If so, could you please give some information on what I need to do to get what I want?

I changed the attention_weight param in the attlayer script to True, but nothing happened in the output.

Again thanks so much for the great work!

Curious about the training set

I came across this question when reading the paper, and I cannot find the answer in the code in this repository:

As I read the paper, the pretraining data is 1.6 billion tweets, each labeled with an emoji. When splitting the dataset, you sampled 10K tweets per emoji type for both the validation set and the test set, so each set has 640K samples. You then upsampled the remaining data so that each emoji type has the same number of samples. Does this mean you simply repeated the tweets of infrequent emojis? In that case, your training set would grow to tens of billions of samples, instead of 1.6 billion. Did I understand it correctly?

From what I have learned, repeated samples can lead to poor improvement with minibatches, so I'm curious whether I understood your method correctly (see the sketch below of the scheme I imagine). Thanks in advance!
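
A toy version of the upsampling scheme I have in mind (my own sketch, not code from this repo):

import numpy as np

rng = np.random.RandomState(0)
counts = {'emoji_a': 1000, 'emoji_b': 10}  # toy class sizes
target = max(counts.values())

# Sample indices with replacement up to the majority-class size, i.e.
# tweets of infrequent emojis get repeated.
upsampled = {emoji: rng.choice(np.arange(n), size=target, replace=True)
             for emoji, n in counts.items()}
print({k: len(v) for k, v in upsampled.items()})  # both classes now 1000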

Emotional Impact Data

The online demo highlights words based on their emotional impact. Is there a way to obtain numerical values for, or an ordering of, the impact of the words in the input?

Error when running encode_texts.py

After running encode_texts.py, I see the output in the terminal, but I get this error before the output file is created.

Traceback (most recent call last):
  File "score_texts_emojis.py", line 20490, in <module>
    t_tokens = tokenized[i]
IndexError: index 20407 is out of bounds for axis 0 with size 20407
