emoji-sentiment-dataset's Introduction

Motivation

Following the success of DeepMoji and TorchMoji (1, 2), we want to use Twitter as an open source of self-annotated data to build a balanced, multi-language, "in-the-wild" sentiment dataset for testing the quality of NLP models and word/sub-word tokenization techniques.

Dataset

Name	Sample size	Word vocab	Ngram vocab	Family	Alphabet	L1 speakers (millions)
Korean (ko)	198,561	516,021	1,862,406	Koreanic	Hangul	77
Arabic (ar)	199,993	287,578	1,428,286	Afro-Asiatic	Arabic alphabet	300
Turkish (tr)	199,993	203,657	687,284	Turkic	Latin	80
Russian (ru)	241,117	172,653	812,315	Indo-European	Cyrillic	150
Spanish, Castilian (es)	299,995	117,629	498,977	Indo-European	Latin	480
Indonesian (id)	199,357	100,272	458,047	Austronesian	Latin	43
French (fr)	299,995	99,631	476,360	Indo-European	Latin	77
German (de)	184,109	99,213	516,005	Indo-European	Latin	90
English (en)	299,995	95,666	523,046	Indo-European	Latin	400
Italian (it)	210,703	95,604	398,091	Indo-European	Latin	69
Thai (th)	349,995	73,425	558,911	Tai–Kadai	Thai script	30

Downloads

Curated/pre-processed/balanced dataset - 540 MB;
Raw dataset - 2.4 GB;

Methodology

Download and process tweet archives from Archive Team;
Filter out Twitter-specific content (retweets, hashtags, citations, etc.);
Predict the language with FastText and keep items with high confidence (80-90%+);
Select tweets that:
- Contain one of the 64 emojis used in TorchMoji / DeepMoji;
- Do not contain other emojis;
- Have only one block of consecutive emojis;
- Have only one type of emoji per tweet;
Pre-process and balance the dataset;
TODO
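
The language-confidence and emoji-selection steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' actual processing script: `TARGET_EMOJIS` is a hypothetical four-emoji stand-in for the 64 DeepMoji / TorchMoji emojis, the emoji detection is a rough approximation based on two Unicode ranges, and `toy_predict` is a made-up placeholder for a real language identifier such as a FastText model. The 0.8 threshold is taken from the lower end of the 80-90%+ range mentioned above.

```python
import re

# Hypothetical stand-in: the real list has 64 emojis and is not reproduced here.
TARGET_EMOJIS = {"😂", "😍", "😭", "😊"}

# Rough approximation of "any emoji" via two common code point ranges (an assumption).
EMOJI_CLASS = "[\U0001F300-\U0001FAFF\u2600-\u27BF]"
EMOJI_RE = re.compile(EMOJI_CLASS)
EMOJI_BLOCK_RE = re.compile(EMOJI_CLASS + "+")


def confident_language(text, predict, min_conf=0.8):
    """Return the predicted language code, or None if confidence is too low.

    `predict` is any callable returning (language_code, probability).
    """
    lang, prob = predict(text)
    return lang if prob >= min_conf else None


def select_emoji(text):
    """Return the tweet's single emoji label if it passes all rules, else None."""
    found = EMOJI_RE.findall(text)
    kinds = set(found)
    # Exactly one type of emoji, and it must be one of the target emojis.
    if len(kinds) != 1 or not kinds <= TARGET_EMOJIS:
        return None
    # Only one block of consecutive emojis is allowed.
    if len(EMOJI_BLOCK_RE.findall(text)) != 1:
        return None
    return found[0]


def toy_predict(text):
    # Made-up predictor, used only to keep the example runnable.
    return ("en", 0.95) if "the" in text.lower() else ("en", 0.50)


tweets = ["the best day 😂😂", "😂 the end 😂", "the party 🎉", "ok 😍"]
kept = [
    (t, select_emoji(t))
    for t in tweets
    if confident_language(t, toy_predict) and select_emoji(t)
]
print(kept)
```

Passing the predictor in as a callable keeps the sketch independent of any particular language-identification model; the emoji label returned by `select_emoji` would serve as the self-annotated sentiment label for the tweet.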

License

Dual license: CC BY-NC for non-commercial use; commercial usage is available after agreement with the dataset authors.

emoji-sentiment-dataset's Issues

'Feather file footer incomplete'

Hi, thank you for your contribution!
When I load the data with feather.read_dataframe() in Python, I get the error message "Feather file footer incomplete". I'm not very familiar with feather files, and I'm not sure whether I missed something or whether something is wrong with the file. Could you please check the file format or provide some more details on how to load it? Thanks a lot!

process_tweets.py not found

Hi,

Hope you are all well!

Files referenced in the README are not available, for example:
https://github.com/snakers4/emoji_sentiment/blob/master/src/process_tweets.py

Is there any way to get them online?

Cheers,
X
