emoji-sentiment-dataset's Introduction

Motivation

Following the success of DeepMoji and TorchMoji (1, 2), we want to use Twitter as an open source of self-annotated data to build a balanced, multilingual, "in-the-wild" sentiment dataset for testing the quality of NLP models and word/sub-word tokenization techniques.

Dataset

| Name | Sample size | Word vocab | Ngram vocab | Family | Alphabet | L1 speakers (millions) |
|---|---|---|---|---|---|---|
| Korean (ko) | 198,561 | 516,021 | 1,862,406 | Koreanic | Hangul | 77 |
| Arabic (ar) | 199,993 | 287,578 | 1,428,286 | Afro-Asiatic | Arabic alphabet | 300 |
| Turkish (tr) | 199,993 | 203,657 | 687,284 | Turkic | Latin | 80 |
| Russian (ru) | 241,117 | 172,653 | 812,315 | Indo-European | Cyrillic | 150 |
| Spanish, Castilian (es) | 299,995 | 117,629 | 498,977 | Indo-European | Latin | 480 |
| Indonesian (id) | 199,357 | 100,272 | 458,047 | Austronesian | Latin | 43 |
| French (fr) | 299,995 | 99,631 | 476,360 | Indo-European | Latin | 77 |
| German (de) | 184,109 | 99,213 | 516,005 | Indo-European | Latin | 90 |
| English (en) | 299,995 | 95,666 | 523,046 | Indo-European | Latin | 400 |
| Italian (it) | 210,703 | 95,604 | 398,091 | Indo-European | Latin | 69 |
| Thai (th) | 349,995 | 73,425 | 558,911 | Tai–Kadai | Thai script | 30 |

Downloads

  • Curated/pre-processed/balanced dataset - 540 MB;
  • Raw dataset - 2.4 GB.

Methodology

  • Download and process tweet archives from Archive Team;
  • Filter out Twitter-specific content (re-tweets, hashtags, citations, etc.);
  • Predict the language with FastText and keep items with high confidence (80-90%+);
  • Select tweets that:
    • Contain one of the 64 emojis used in TorchMoji / DeepMoji;
    • Do not contain other emojis;
    • Have only one block of consecutive emojis;
    • Contain only one type of emoji;
  • Pre-process and balance the dataset;
  • TODO
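
The emoji selection rules above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' actual pipeline: `ALLOWED` is a hypothetical five-emoji stand-in for the 64 TorchMoji / DeepMoji emojis, the Unicode ranges in `EMOJI_RE` are only a rough approximation of "any emoji", and `emoji_label` is a made-up helper name.

```python
import re

# Hypothetical stand-in for the 64 TorchMoji / DeepMoji emojis (assumption).
ALLOWED = {"😂", "😍", "😭", "😊", "💕"}

# Rough approximation of "any emoji" via two Unicode ranges (assumption):
# it ignores variation selectors, ZWJ sequences and skin-tone modifiers.
EMOJI_CLASS = "[\U0001F300-\U0001FAFF\u2600-\u27BF]"
EMOJI_RE = re.compile(EMOJI_CLASS)
EMOJI_BLOCK_RE = re.compile(EMOJI_CLASS + "+")


def emoji_label(text):
    """Return the tweet's single emoji label, or None if the tweet is rejected."""
    found = EMOJI_RE.findall(text)
    if not found:
        return None  # no emoji at all
    kinds = set(found)
    if len(kinds) != 1:
        return None  # more than one type of emoji
    (emoji,) = kinds
    if emoji not in ALLOWED:
        return None  # not one of the allowed emojis
    if len(EMOJI_BLOCK_RE.findall(text)) != 1:
        return None  # emojis are split across several blocks
    return emoji


if __name__ == "__main__":
    for tweet in ["so funny 😂😂😂", "😂 wow 😂", "😂😍", "no emoji here"]:
        print(repr(tweet), "->", emoji_label(tweet))
```

A tweet that passes the filter yields its emoji as the sentiment label, and the emoji block itself would then be stripped from the text before training.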

License

Dual license: CC BY-NC, with commercial use available by agreement with the dataset authors.

emoji-sentiment-dataset's People

Contributors

snakers4

emoji-sentiment-dataset's Issues

'Feather file footer incomplete'

Hi, thank you for your contribution!
When I load the data with feather.read_dataframe() in Python, I get the error message "Feather file footer incomplete". I'm not very familiar with feather files, so I'm not sure whether I missed something or whether something is wrong with the file. Could you please check the format or provide some more details on how to load it? Thanks a lot!
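
One possible cause, which is an assumption and not something confirmed in this thread, is a truncated or interrupted download. A quick way to check is to look at the file's magic bytes: Feather V1 files start and end with `FEA1`, and Feather V2 (Arrow IPC) files start and end with `ARROW1`. The sketch below uses only the standard library; `feather_file_status` is a hypothetical helper name and `dataset.feather` is a placeholder path.

```python
import os


def feather_file_status(path):
    """Classify a file by its leading and trailing magic bytes."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        head = f.read(6)
        f.seek(max(size - 6, 0))
        tail = f.read(6)
    if head[:4] == b"FEA1":
        # Feather V1: the footer must also end with the 4-byte magic.
        ok = size >= 8 and tail[-4:] == b"FEA1"
        return "feather-v1" if ok else "truncated"
    if head == b"ARROW1":
        # Feather V2 / Arrow IPC file: the 6-byte magic closes the file.
        ok = size >= 12 and tail == b"ARROW1"
        return "feather-v2" if ok else "truncated"
    return "not-feather"


if __name__ == "__main__":
    path = "dataset.feather"  # placeholder path (assumption)
    if os.path.exists(path):
        print(feather_file_status(path))
```

If the result is "truncated", re-downloading the file and comparing its size with the sizes listed under Downloads would be a reasonable next step; if it is "not-feather", the file may still be compressed or be in a different format.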
