
Dataset balancing (emoji_sentiment) · open · 5 comments

Islanna commented on May 30, 2024
Dataset balancing


Comments (5)

snakers4 commented on May 30, 2024

stats.xlsx

@Islanna
updated file for article


Islanna commented on May 30, 2024

Decided to balance the full dataset according to the emoji and ngram distribution in the Russian subset.

Languages

Lang  Full 2018 dataset size
en    3918333
ja    2603697
ar    1416102
es    1237730
pt    869292 
th    620532 
ko    493476 
fr    349677 
tr    302217 
tl    129997 
id    109838 
it    86488  
de    85671  
ru    84824  

Emoji merging

exclude_emojis = ['πŸ™Œ','πŸ‘Š','🎢','πŸ’','βœ‹','🎧','πŸ”«','πŸ™…','πŸ‘€','πŸ’―']

merge_dict = {
    'πŸ’•':'😍',
    '❀':'😍',
    'πŸ’™':'😍',
    'β™₯':'😍',
    'πŸ’œ':'😍',
    'πŸ’–':'😍',
    'πŸ’Ÿ':'😍',
    '😘':'😍',
    'πŸ˜‰':'😏',
    '😒':'😭',
    '😁':'😊',
    'πŸ˜„':'😊',
    '😌':'😊',
    '☺':'😊',
    'πŸ‘Œ':'πŸ‘',
    'πŸ‘':'πŸ‘',
    'πŸ’ͺ':'πŸ‘',
    '✨':'πŸ‘',
    '✌':'πŸ‘',
    'πŸ˜‹':'😜',
    '😐':'πŸ˜‘',
    'πŸ˜’':'πŸ˜‘',
    'πŸ˜•':'πŸ˜‘',
    '😠':'😑',
    'πŸ’€':'😑',
    '😀':'😑',
    '😈':'😑',
    '😩':'πŸ˜”',
    '😞':'πŸ˜”',
    'πŸ˜ͺ':'πŸ˜”',
    '😷':'πŸ˜”',
    '😴':'πŸ˜”',
    'πŸ™ˆ':'πŸ˜…',
    'πŸ™Š':'πŸ˜…',
    '😳':'πŸ˜…',
    '😫':'😣',  
    'πŸ˜“':'😣',
    'πŸ˜–':'😣',
    '😬':'😣',
    'πŸ™':'😣'
}
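For reference, a minimal sketch of how these two mappings could be applied to the tweet labels (pandas and the 'emoji' column name are assumptions, not the actual pipeline code):

import pandas as pd

def relabel_emojis(df: pd.DataFrame) -> pd.DataFrame:
    # drop tweets whose emoji class is excluded outright
    df = df[~df['emoji'].isin(exclude_emojis)].copy()
    # collapse classes according to merge_dict; unmapped emojis keep their own label
    df['emoji'] = df['emoji'].map(lambda e: merge_dict.get(e, e))
    return df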

Emoji distribution

Distribution in the Russian 2018 dataset (~85k tweets)

{'πŸ˜‚': 21529,
 '😍': 17369,
 '😊': 8777,
 'πŸ‘': 6195,
 '😏': 5559,
 '😭': 4556,
 'πŸ˜…': 4336,
 'πŸ˜‘': 2542,
 'πŸ’”': 2481,
 '😣': 2065,
 'πŸ˜”': 1924,
 '😑': 1884,
 '😎': 1782,
 '😜': 1454}

Probably, we can further merge the classes 😜 and 😏, and 😎 and 😊.

Vocabulary distribution

Stratified sample: a random sample from each language's dataset with the same emoji distribution as in the Russian subset, capped at 100k tweets. Word and ngram vocabs are calculated on the stratified sample.
Words are the processed text (no numbers or punctuation) split on spaces.

Lang  Stratified sample size  Word vocab size  Ngram vocab size
en    96544                   50779            303809
ja    96036                   224019           5380959
ar    96544                   156335           751645
es    96544                   63004            304999
pt    96544                   45251            240593
th    94232                   186816           935840
ko    93859                   281594           2105074
fr    95571                   53860            286240
tr    95135                   122695           469954
tl    80123                   52671            255014
id    76675                   58604            296513
it    82705                   57282            269276
de    79234                   56615            332311
ru    84824                   93778            492304
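A minimal sketch of the stratification described above, assuming a pandas DataFrame with 'lang' and 'emoji' columns (column names and the helper itself are illustrative, not the actual pipeline):

import pandas as pd

MAX_SIZE = 100_000

def stratified_sample(df: pd.DataFrame, lang: str, ru_dist: pd.Series) -> pd.DataFrame:
    # ru_dist: emoji shares in the Russian subset, e.g.
    # ru_dist = df[df['lang'] == 'ru']['emoji'].value_counts(normalize=True)
    lang_df = df[df['lang'] == lang]
    total = min(MAX_SIZE, len(lang_df))
    parts = []
    for emoji, share in ru_dist.items():
        pool = lang_df[lang_df['emoji'] == emoji]
        n = min(len(pool), int(round(share * total)))
        parts.append(pool.sample(n, random_state=42))
    return pd.concat(parts, ignore_index=True)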

Cover

Extract the top N% most frequent chars/ngrams/words and check what share of the full dataset / stratified sample they cover, for N in [10%, ..., 90%].
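A sketch of the cover computation (the frequency Counter is assumed to be built beforehand from either the full dataset or the stratified sample):

from collections import Counter

def cover(freqs: Counter, top_pct: float) -> float:
    # share of all token occurrences explained by the top_pct most frequent tokens
    counts = [n for _, n in freqs.most_common()]
    k = max(1, int(len(counts) * top_pct))
    return sum(counts[:k]) / sum(counts)

# coverage curve for N = 10% .. 90%
# curve = [cover(word_freqs, n / 100) for n in range(10, 100, 10)]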

Chars

Most of the unpopular chars come from other languages: English letters in the Russian dataset, for example. Maybe these extra characters should be removed.
[plot: full_chars]
[plot: sample_chars]

Japanese and Korean chars look much more like ngrams.

Ngrams

Only for the sample. Calculation for the full dataset is time-consuming.
[plot: sample_ngrams]

Words

[plot: full_words]

[plot: sample_words]


Islanna commented on May 30, 2024

Nonstandard languages

  • ko - found the jamotools library for splitting Hangul syllables into jamo chars. No dramatic changes: the word vocabulary is the same size (obviously), the ngram vocabulary is 2x smaller but still >1 mln, and the plot of the ngram distribution looks much better (see the sketch after this list).

[plot: koupd]

  • ar - removed all short vowels and other symbols (harakat, tashkeel?) that interfere. Only 4% of the whole dataset changed; word and ngram vocabs are pretty much the same.
  • th - found out that I had accidentally dropped some necessary symbols during preprocessing: r'\W+' also matches some Thai accents. Changed the preprocessing to keep only Thai and English chars and added proper tokenization from pythainlp. Final word vocab size is ~20k, with ~180k for the ngram vocabulary.
  • tr - an agglutinative language, which probably explains why its word vocab is larger than Russian while the ngram vocab is comparable. No special tokenization tools.
  • ja - removed from the final data.
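The sketch mentioned above, showing how jamotools and pythainlp could be wired in (these are the libraries' standard entry points; the exact preprocessing code may differ):

import jamotools
from pythainlp.tokenize import word_tokenize

def split_korean(text: str) -> str:
    # decompose Hangul syllables into jamo, so char ngrams behave more like in other languages
    return jamotools.split_syllables(text)

def tokenize_thai(text: str) -> list:
    # Thai is written without spaces between words, so a dictionary-based tokenizer is needed
    return word_tokenize(text)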

Russian normalized dataset

Normalized Russian data: the word vocab size has decreased 4x, from ~90k to ~25k.
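The comment does not say which normalizer was used; a minimal sketch assuming lemmatization with pymorphy2 (an assumption, not necessarily the actual tool):

import pymorphy2

morph = pymorphy2.MorphAnalyzer()

def normalize_ru(text: str) -> str:
    # replace each word with its lemma, so inflected forms collapse into one vocab entry
    return ' '.join(morph.parse(w)[0].normal_form for w in text.split())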


Islanna commented on May 30, 2024

Final dataset distribution

Languages: English, Arabic, Spanish, Thai, Korean, French, Turkish, Indonesian, Italian, German, Russian.
Removed Japanese and Tagalog.

Path to the balanced file: nvme/islanna/emoji_sentiment/data/fin_tweets.feather

Path to the file without balancing for languages above: nvme/islanna/emoji_sentiment/data/twitter_proc_full.feather

Emoji distribution

Similar to the Russian merged distribution, but may differ a little:

'πŸ˜‚': 0.25,
'😍': 0.23,
'😊': 0.13,
'😏': 0.08,
'😭': 0.07,
'πŸ‘': 0.07,
'πŸ˜…': 0.05,
'πŸ˜‘': 0.04,
'πŸ˜”': 0.03,
'😣': 0.03,
'😑': 0.02

The smallest class, '😑', has only 3.8k tweets in Indonesian; in Russian it's ~6k.

Vocabs

Preprocessing

Vocabs contain only Latin chars plus the symbols of the particular language. Korean and Thai were processed separately from the rest.

Regular expressions for removing extra chars:

lang_unuse = {'en':'[^a-zA-Z]',
             'ar':'[^\u0600-\u06FFa-zA-Z]', #\u0621-\u064A maybe
             'es':'[^a-zA-ZáéíóúüñÁÉÍÓÚÜÑ]',
             'th':'[^\u0E00-\u0E7Fa-zA-Z]',
             'ko':'[^\uAC00-\uD7A3\u1100-\u11FF\u3130-\u318Fa-zA-Z]',
             'fr':'[^a-zA-ZÀ-ÿ]',
             'tr':'[^a-zA-ZğşöçĞŞÖÇıIiİuUüÜ]',
             'id':'[^a-zA-Z]',
             'it':'[^a-zA-Z]',
             'de':'[^a-zA-ZÀ-ÿ]',
             'ru':'[^a-zA-Zа-яА-ЯЁё]'}
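A sketch of how these patterns could be applied (the function name is illustrative):

import re

def clean_text(text: str, lang: str) -> str:
    # replace everything outside the allowed alphabet with spaces, then collapse whitespace
    text = re.sub(lang_unuse[lang], ' ', text)
    return re.sub(r'\s+', ' ', text).strip().lower()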

Balancing

Sizes in final dataset:

Lang  Size    Word vocab  Ngram vocab
en    299995  95640       522879
ar    199993  253023      1127338
es    299995  117597      498495
th    349995  46542       331081
ko    198561  515949      1859535
fr    299995  99587       475570
tr    199993  201967      671532
id    199357  100246      457841
it    210703  95578       397849
de    184109  99169       515266
ru    241117  172594      810772

All languages are different. For example, the Thai dataset would have to be 10 times larger than the Russian one to reach the same ngram vocabulary size, while Korean and Arabic would have to be 2-4 times smaller. I suppose the only way to keep a real balance is to cut the ngram vocabulary for the difficult languages before model training.

Ngram vocab cut
[plot: ngram vocab]

Word vocab cut
[plot: word vocab]
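A sketch of the proposed cut: keep only the top-k most frequent ngrams per language and map everything else to an unknown token (the value of k and the token name are assumptions):

from collections import Counter

def cut_vocab(freqs: Counter, k: int = 300_000) -> set:
    # keep only the k most frequent ngrams of a language
    return {ng for ng, _ in freqs.most_common(k)}

def encode(ngrams: list, vocab: set, unk: str = '<unk>') -> list:
    # rare ngrams are replaced by the unknown token before model training
    return [ng if ng in vocab else unk for ng in ngrams]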


snakers4 commented on May 30, 2024

@Islanna
Some formatting ideas for pasting the data into an article, for easier storytelling.
Also, describing what you did with the words would help.

stats.xlsx

