Following the success of DeepMoji and TorchMoji (1, 2), we would like to use Twitter as an open source of self-annotated data to build a balanced, multi-language, "in-the-wild" sentiment dataset for testing the quality of NLP models and word/sub-word tokenization techniques.
Name | Sample size | Word vocab | N-gram vocab | Family | Alphabet | L1 speakers (millions) |
---|---|---|---|---|---|---|
Korean (ko) | 198,561 | 516,021 | 1,862,406 | Koreanic | Hangul | 77 |
Arabic (ar) | 199,993 | 287,578 | 1,428,286 | Afro-Asiatic | Arabic alphabet | 300 |
Turkish (tr) | 199,993 | 203,657 | 687,284 | Turkic | Latin | 80 |
Russian (ru) | 241,117 | 172,653 | 812,315 | Indo-European | Cyrillic | 150 |
Spanish, Castilian (es) | 299,995 | 117,629 | 498,977 | Indo-European | Latin | 480 |
Indonesian (id) | 199,357 | 100,272 | 458,047 | Austronesian | Latin | 43 |
French (fr) | 299,995 | 99,631 | 476,360 | Indo-European | Latin | 77 |
German (de) | 184,109 | 99,213 | 516,005 | Indo-European | Latin | 90 |
English (en) | 299,995 | 95,666 | 523,046 | Indo-European | Latin | 400 |
Italian (it) | 210,703 | 95,604 | 398,091 | Indo-European | Latin | 69 |
Thai (th) | 349,995 | 73,425 | 558,911 | Tai–Kadai | Thai script | 30 |
- Download and process tweet archives from Archive Team;
- Filter out Twitter-specific content (retweets, hashtags, citations, etc.);
- Predict each tweet's language with fastText and keep only high-confidence predictions (80-90%+);
- Select tweets that:
  - Contain one of the 64 emojis used in TorchMoji / DeepMoji;
  - Do not contain any other emojis;
  - Have only one block of consecutive emojis;
  - Contain only one type of emoji;
- Dataset pre-processing and balancing;
- TODO
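
The tweet-selection rules above can be sketched roughly as follows. This is a minimal illustration, not the code used to build the dataset: `ALLOWED_EMOJIS` is a small hypothetical stand-in for the 64 DeepMoji / TorchMoji emojis, `select_tweet` is an illustrative name, and the regular expression only approximates the Unicode emoji ranges. It also assumes that emojis separated by whitespace count as separate blocks.

```python
import re

# Assumption: a small hypothetical subset standing in for the 64
# DeepMoji / TorchMoji emojis; the real list is not reproduced here.
ALLOWED_EMOJIS = {"😂", "😍", "😭", "😊", "💕"}

# Rough approximation of emoji code points (not an exhaustive definition).
# Each match is one block of consecutive emoji characters.
EMOJI_BLOCK = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]+")


def select_tweet(text):
    """Return the tweet's single emoji label, or None if it is rejected."""
    blocks = EMOJI_BLOCK.findall(text)
    # Rule: exactly one block of consecutive emojis.
    if len(blocks) != 1:
        return None
    # Rule: only one type of emoji in that block.
    kinds = set(blocks[0])
    if len(kinds) != 1:
        return None
    # Rules: the emoji is in the allowed set, so no other emojis are present.
    emoji = kinds.pop()
    return emoji if emoji in ALLOWED_EMOJIS else None


if __name__ == "__main__":
    for tweet in ["so happy 😍😍😍", "😂 lol 😂", "mixed 😂😍", "launch 🚀"]:
        print(repr(tweet), "->", select_tweet(tweet))
```

The returned emoji can then serve as the self-annotated label for the tweet, with the emoji block stripped from the text before training.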
Dual-licensed: CC BY-NC for non-commercial use; commercial use is available by agreement with the dataset authors.