Comments (5)
@Islanna
Updated the file for the article.
from emoji_sentiment.
Decided to balance the full dataset according to the emoji and ngram distribution in the Russian subset.
Languages
Lang | Full 2018 dataset size |
---|---|
en | 3918333 |
ja | 2603697 |
ar | 1416102 |
es | 1237730 |
pt | 869292 |
th | 620532 |
ko | 493476 |
fr | 349677 |
tr | 302217 |
tl | 129997 |
id | 109838 |
it | 86488 |
de | 85671 |
ru | 84824 |
Emoji merging
exclude_emojis = ['π','π','πΆ','π','β','π§','π«','π','π','π―']
merge_dict = {
'π':'π',
'β€':'π',
'π':'π',
'β₯':'π',
'π':'π',
'π':'π',
'π':'π',
'π':'π',
'π':'π',
'π’':'π',
'π':'π',
'π':'π',
'π':'π',
'βΊ':'π',
'π':'π',
'π':'π',
'πͺ':'π',
'β¨':'π',
'β':'π',
'π':'π',
'π':'π',
'π':'π',
'π':'π',
'π ':'π‘',
'π':'π‘',
'π€':'π‘',
'π':'π‘',
'π©':'π',
'π':'π',
'πͺ':'π',
'π·':'π',
'π΄':'π',
'π':'π',
'π':'π',
'π³':'π',
'π«':'π£',
'π':'π£',
'π':'π£',
'π¬':'π£',
'π':'π£'
}
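Since the emoji characters above are garbled in this rendering, here is a minimal sketch of the merging step with placeholder emojis. The dict contents below are hypothetical; only the relabeling logic follows the description (drop excluded emojis, map the rest through merge_dict, leaving unmapped emojis unchanged):

```python
# Hypothetical stand-ins for the real exclude_emojis / merge_dict above.
exclude_emojis = {'🙈', '🤔'}          # classes dropped entirely
merge_dict = {'😂': '😄', '❤': '😍'}  # rare emoji -> canonical class

def relabel(rows):
    """rows: list of (text, emoji). Drop excluded emojis, merge the rest."""
    out = []
    for text, emoji in rows:
        if emoji in exclude_emojis:
            continue
        out.append((text, merge_dict.get(emoji, emoji)))
    return out

rows = [('ha', '😂'), ('hm', '🤔'), ('love', '❤'), ('wow', '😄')]
print(relabel(rows))  # [('ha', '😄'), ('love', '😍'), ('wow', '😄')]
```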
Emoji distribution
Distribution in the Russian 2018 dataset (~85k tweets)
{'π': 21529,
'π': 17369,
'π': 8777,
'π': 6195,
'π': 5559,
'π': 4556,
'π': 4336,
'π': 2542,
'π': 2481,
'π£': 2065,
'π': 1924,
'π‘': 1884,
'π': 1782,
'π': 1454}
We can probably merge classes further: π and π , π and π
Vocabulary distribution
Stratified sample: a random sample from each language's dataset with the same emoji distribution as in the Russian subset, capped at 100k tweets. Word and ngram vocabs are calculated on the stratified sample.
Words: processed text (no numbers or punctuation) split on spaces.
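A stratified sample like this can be drawn by bucketing tweets per emoji and sampling each class in proportion to the target (Russian) distribution. The helper below is a stdlib sketch under those assumptions, not the author's code; it scales the total down so that no class is over-requested:

```python
import random
from collections import defaultdict

def stratified_sample(rows, target_dist, max_size=100_000, seed=0):
    """rows: list of (text, emoji). target_dist: emoji -> probability
    (e.g. the Russian emoji distribution). Returns a random sample whose
    emoji proportions match target_dist, capped at max_size rows."""
    rng = random.Random(seed)
    by_emoji = defaultdict(list)
    for text, emoji in rows:
        by_emoji[emoji].append((text, emoji))
    # Largest feasible total such that every class has enough tweets.
    total = min(max_size,
                min(int(len(by_emoji[e]) / p) for e, p in target_dist.items()))
    sample = []
    for e, p in target_dist.items():
        sample.extend(rng.sample(by_emoji[e], int(total * p)))
    return sample
```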
Lang | Stratified sample size | Len word vocab | Len ngram vocab |
---|---|---|---|
en | 96544 | 50779 | 303809 |
ja | 96036 | 224019 | 5380959 |
ar | 96544 | 156335 | 751645 |
es | 96544 | 63004 | 304999 |
pt | 96544 | 45251 | 240593 |
th | 94232 | 186816 | 935840 |
ko | 93859 | 281594 | 2105074 |
fr | 95571 | 53860 | 286240 |
tr | 95135 | 122695 | 469954 |
tl | 80123 | 52671 | 255014 |
id | 76675 | 58604 | 296513 |
it | 82705 | 57282 | 269276 |
de | 79234 | 56615 | 332311 |
ru | 84824 | 93778 | 492304 |
Coverage
Extract the top N% of chars/ngrams/words by frequency and check how much of the full dataset / stratified sample they cover, for N in [10%, ..., 90%].
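The coverage check described above can be sketched as follows (a stdlib illustration, not the author's code): sort the vocabulary by frequency, keep the top N%, and measure what fraction of all token occurrences those items account for.

```python
from collections import Counter

def coverage(tokens, top_pct):
    """Fraction of all token occurrences covered by the most
    frequent top_pct of vocabulary items."""
    counts = Counter(tokens)
    freqs = [c for _, c in counts.most_common()]
    k = max(1, int(len(freqs) * top_pct))
    return sum(freqs[:k]) / sum(freqs)

tokens = ['a'] * 50 + ['b'] * 30 + ['c'] * 15 + ['d'] * 5
print(coverage(tokens, 0.25))  # 0.5: the single top item 'a' covers 50%
```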
Chars
Most of the unpopular chars come from other languages: English letters in the Russian dataset, for example. These extra characters should probably be removed.
Japanese and Korean chars behave much more like ngrams.
Ngrams
Computed only for the stratified sample; the calculation for the full dataset is too time-consuming.
Words
Nonstandard languages
- ko - found a library, jamotools, to split Hangul syllables into jamo. No dramatic changes: the word vocabulary is the same size (obviously) and the ngram vocabulary is 2x smaller, but still >1mln. The plot of the ngram distribution looks much better, though.
- ar - removed all short vowels and other diacritics (harakat/tashkeel) that interfere. Only 4% of the whole dataset changed. Word and ngram vocabs are pretty much the same.
- th - found out that I had accidentally dropped some necessary symbols during preprocessing: r'\W+' also matches some Thai accents. Changed the preprocessing to keep only Thai and English chars, and added proper tokenization from pythainlp. Final vocab sizes: ~20k words and ~180k ngrams.
- tr - an agglutinative language, which probably explains why its word vocab is larger than the Russian one while its ngram vocab is comparable. No special tokenization tools.
- ja - removed from the final data.
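For Korean, the syllable-to-jamo split mentioned above was done with jamotools; the snippet below is an equivalent pure-Unicode sketch (not the library's code) using the standard Hangul decomposition arithmetic for precomposed syllables U+AC00..U+D7A3:

```python
# Standard Hangul decomposition: syllable = lead * 588 + vowel * 28 + tail.
LEADS  = list('ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ')
VOWELS = list('ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ')
TAILS  = [''] + list('ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ')

def split_syllables(text):
    """Decompose each precomposed Hangul syllable into its jamo;
    pass all other characters through unchanged."""
    out = []
    for ch in text:
        i = ord(ch) - 0xAC00
        if 0 <= i <= 0xD7A3 - 0xAC00:
            out.append(LEADS[i // 588] + VOWELS[i % 588 // 28] + TAILS[i % 28])
        else:
            out.append(ch)
    return ''.join(out)

print(split_syllables('한국'))  # 'ㅎㅏㄴㄱㅜㄱ'
```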
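For Arabic, stripping the harakat can be done with a single character-class substitution. This is a sketch of the idea, not the exact preprocessing used; the range U+064B-U+0652 covers fathatan through sukun, U+0670 is the superscript alef, and U+0640 the tatweel (kashida):

```python
import re

HARAKAT = re.compile('[\u064B-\u0652\u0670\u0640]')

def strip_harakat(text):
    """Remove Arabic short-vowel marks and related diacritics."""
    return HARAKAT.sub('', text)

print(strip_harakat('مُحَمَّد'))  # 'محمد'
```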
Russian normalized dataset
Normalized Russian data: the vocab size decreased by a factor of 4, from ~90k to ~25k.
Final dataset distribution
Languages: English, Arabic, Spanish, Thai, Korean, French, Turkish, Indonesian, Italian, German, Russian.
Removed Japanese and Tagalog.
Path to the balanced file: nvme/islanna/emoji_sentiment/data/fin_tweets.feather
Path to the file without balancing for languages above: nvme/islanna/emoji_sentiment/data/twitter_proc_full.feather
Emoji distribution
Similar to the merged Russian distribution, but may differ slightly:
'π': 0.25,
'π': 0.23,
'π': 0.13,
'π': 0.08,
'π': 0.07,
'π': 0.07,
'π': 0.05,
'π': 0.04,
'π': 0.03,
'π£': 0.03,
'π‘': 0.02
The smallest class 'π‘' has only 3.8k tweets in Indonesian; in Russian it has ~6k.
Vocabs
Preprocessing
Vocabs contain only Latin chars and symbols from the particular language. Korean and Thai were processed separately from the rest.
Regular expressions for removing extra chars:
lang_unuse = {'en':'[^a-zA-Z]',
'ar':'[^\u0600-\u06FFa-zA-Z]', #\u0621-\u064A maybe
'es':'[^a-zA-ZáéíóúüñÁÉÍÓÚÜÑ]',
'th':'[^\u0E00-\u0E7Fa-zA-Z]',
'ko':'[^\uAC00-\uD7A3\u1100-\u11FF\u3130-\u318Fa-zA-Z]',
'fr':'[^a-zA-ZÀ-ÿ]',
'tr':'[^a-zA-ZğşöçĞŞÖÇıIiİuUüÜ]',
'id':'[^a-zA-Z]',
'it':'[^a-zA-Z]',
'de':'[^a-zA-ZÀ-ÿ]',
'ru':'[^a-zA-Zа-яА-ЯёЁ]'}
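The cleaning step with these regexes presumably replaces every disallowed character and then collapses whitespace. A minimal sketch of that idea, with only two of the language patterns shown:

```python
import re

# Two of the per-language character classes from the table above.
lang_unuse = {'en': '[^a-zA-Z]',
              'ru': '[^a-zA-Zа-яА-ЯёЁ]'}

def clean(text, lang):
    """Replace disallowed chars with spaces, then collapse whitespace."""
    text = re.sub(lang_unuse[lang], ' ', text)
    return re.sub(r'\s+', ' ', text).strip()

print(clean('привет!!! 123 world', 'ru'))  # 'привет world'
```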
Balancing
Sizes in the final dataset:
Lang | Lang size | Word vocab | Ngram vocab |
---|---|---|---|
en | 299995 | 95640 | 522879 |
ar | 199993 | 253023 | 1127338 |
es | 299995 | 117597 | 498495 |
th | 349995 | 46542 | 331081 |
ko | 198561 | 515949 | 1859535 |
fr | 299995 | 99587 | 475570 |
tr | 199993 | 201967 | 671532 |
id | 199357 | 100246 | 457841 |
it | 210703 | 95578 | 397849 |
de | 184109 | 99169 | 515266 |
ru | 241117 | 172594 | 810772 |
All languages differ. For example, the Thai dataset would have to be 10 times larger than the Russian one to reach the same ngram vocabulary size, while Korean and Arabic would have to be 2-4 times smaller. The only way to keep a real balance is probably to cut the ngram vocabulary for the difficult languages before model training.
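Cutting the ngram vocabulary can be as simple as keeping only the k most frequent ngrams and mapping everything else to an out-of-vocabulary bucket at training time. A stdlib sketch of that truncation (illustrative, not the project's code):

```python
from collections import Counter

def truncate_vocab(ngram_counts, max_size):
    """Keep only the max_size most frequent ngrams; the rest are
    treated as out-of-vocabulary downstream."""
    return {g for g, _ in Counter(ngram_counts).most_common(max_size)}

counts = {'ab': 10, 'bc': 7, 'cd': 3, 'de': 1}
print(sorted(truncate_vocab(counts, 2)))  # ['ab', 'bc']
```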
@Islanna
Some formatting ideas for pasting the data into the article, for easier storytelling.
Describing what you did with the words would also help.