MeLiSA: Mercado Libre for Sentiment Analysis

** NOTE: THIS FILE IS UNDER CONSTRUCTION **

This is the main repository of the MeLiSA dataset, which is designed for research on text classification in Latin American Spanish and Portuguese. The dataset can be loaded with the 🤗 Datasets library in a few lines of Python:

from datasets import load_dataset

dataset = load_dataset("lpsc-fiuba/melisa", "all_languages")
# you can use any of the following config names as a second argument: 
# "all_languages", "es", "pt"

For more details on how to access the data, visit our Hugging Face organization page. That page also hosts models fine-tuned on this dataset.

The code used to download the data is also provided in this repository.

Dataset Description

We provide a dataset of Mercado Libre product reviews for Spanish and Portuguese text classification. The dataset contains reviews in these two languages collected between August 2020 and January 2021. Each record contains the review content and title, the star rating, the country where the review was published, and the product category (arts, technology, etc.). The corpus is roughly balanced across stars, so each star rating accounts for approximately 20% of the reviews in each language.

Stars   Spanish                           Portuguese
        Train     Validation   Test       Train     Validation   Test
1       88,425    4,052        5,000      50,801    4,052        5,000
2       88,397    4,052        5,000      50,782    4,052        5,000
3       88,435    4,052        5,000      50,797    4,052        5,000
4       88,449    4,052        5,000      50,794    4,052        5,000
5       88,402    4,052        5,000      50,781    4,052        5,000
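The balance claim can be checked directly from the training counts in the table. The snippet below is only an arithmetic sanity check; the dictionary layout and the function name are illustrative and not part of the dataset.

```python
# Training-set counts per star rating, copied from the table above.
TRAIN_COUNTS = {
    "es": {1: 88_425, 2: 88_397, 3: 88_435, 4: 88_449, 5: 88_402},
    "pt": {1: 50_801, 2: 50_782, 3: 50_797, 4: 50_794, 5: 50_781},
}


def star_shares(counts):
    """Return the fraction of samples carrying each star rating."""
    total = sum(counts.values())
    return {star: n / total for star, n in counts.items()}


# Total number of training samples per language.
totals = {lang: sum(counts.values()) for lang, counts in TRAIN_COUNTS.items()}

# Share of each star rating per language; each should be close to 0.20.
shares = {lang: star_shares(counts) for lang, counts in TRAIN_COUNTS.items()}
```

The per-language totals come out to 442,108 (Spanish) and 253,955 (Portuguese), and every share is within a fraction of a percentage point of 20%.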

The table shows the number of samples per star rating in each split. In total, there are 442,108 training samples in Spanish and 253,955 in Portuguese. We limited the number of reviews per product to 30 and performed a ranked inclusion of the downloaded reviews to favor those with rich semantic content. This ranking prioritized the length of the review content and its valorization (the difference between likes and dislikes). For more details on this process, see (CITATION).
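The per-product cap and ranking could look roughly like the sketch below. It is not the authors' implementation: the field names (`product_id`, `content`, `likes`, `dislikes`) are hypothetical, and the way the two criteria are combined (content length first, valorization as a tie-breaker) is an assumption, since the text only says that both were prioritized.

```python
from collections import defaultdict

MAX_REVIEWS_PER_PRODUCT = 30  # cap stated in the text


def select_reviews(reviews, limit=MAX_REVIEWS_PER_PRODUCT):
    """Keep at most `limit` reviews per product, preferring the richest ones.

    Assumed ranking: longer content first, then higher valorization
    (likes minus dislikes). All field names are hypothetical.
    """
    by_product = defaultdict(list)
    for review in reviews:
        by_product[review["product_id"]].append(review)

    selected = []
    for product_reviews in by_product.values():
        ranked = sorted(
            product_reviews,
            key=lambda r: (len(r["content"]), r["likes"] - r["dislikes"]),
            reverse=True,
        )
        selected.extend(ranked[:limit])
    return selected
```

A weighted score would be an equally plausible way to combine the two criteria; the lexicographic ordering is used here only because it is the simplest to read.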

Reviews in Spanish were obtained from seven Latin American countries (Argentina, Colombia, Peru, Uruguay, Chile, Venezuela and Mexico), and Portuguese reviews were extracted from Brazil. To match each language with its respective country, we applied a language detection algorithm based on the works of Joulin et al. (2016a and 2016b) to the review text and removed reviews that were not written in the expected language.
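The filtering step can be sketched as follows, using the country codes listed under Data Fields. The `detect_language` argument stands in for whatever language identifier is used; `toy_detector` is a deliberately naive placeholder for demonstration and is not the detector used to build the dataset.

```python
# Expected language for each country code (codes as listed under Data Fields).
EXPECTED_LANGUAGE = {
    "MLA": "es", "MCO": "es", "MPE": "es", "MLU": "es",
    "MLC": "es", "MLV": "es", "MLM": "es",
    "MLB": "pt",
}


def filter_by_language(reviews, detect_language):
    """Keep only reviews whose detected language matches their country.

    `detect_language` is any callable mapping a text to "es" or "pt".
    """
    kept = []
    for review in reviews:
        expected = EXPECTED_LANGUAGE[review["country"]]
        if detect_language(review["review_content"]) == expected:
            kept.append(review)
    return kept


def toy_detector(text):
    """Placeholder only: flags Portuguese by a few characteristic words."""
    portuguese_markers = {"não", "muito", "bom", "ótimo"}
    words = set(text.lower().split())
    return "pt" if words & portuguese_markers else "es"


sample = [
    {"country": "MLB", "review_content": "Produto muito bom"},
    {"country": "MLA", "review_content": "Muy buen producto"},
    {"country": "MLA", "review_content": "Não gostei"},
]
filtered = filter_by_language(sample, toy_detector)
```

In this toy run the third review is dropped, because it comes from a Spanish-speaking country but is detected as Portuguese.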

Data Fields

  • country: The string identifier of the country. One of the following: MLA (Argentina), MCO (Colombia), MPE (Peru), MLU (Uruguay), MLC (Chile), MLV (Venezuela), MLM (Mexico) or MLB (Brazil).
  • category: String representation of the product's category. One of the following:
    • Hogar / Casa
    • Tecnología y electrónica / Tecnologia e electronica
    • Salud, ropa y cuidado personal / Saúde, roupas e cuidado pessoal
    • Arte y entretenimiento / Arte e Entretenimiento
    • Alimentos y Bebidas / Alimentos e Bebidas
  • review_content: The text content of the review.
  • review_title: The text title of the review.
  • review_rate: An integer between 1 and 5 indicating the number of stars.
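As an illustration of the schema above, the following sketch checks a single record against the documented fields. The function name and the sample record are made up for this example; the category value is not validated because the exact strings stored in the dataset are not specified here.

```python
# Country codes and names as documented in the field list above.
COUNTRY_NAMES = {
    "MLA": "Argentina", "MCO": "Colombia", "MPE": "Peru", "MLU": "Uruguay",
    "MLC": "Chile", "MLV": "Venezuela", "MLM": "Mexico", "MLB": "Brazil",
}


def validate_record(record):
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    if record.get("country") not in COUNTRY_NAMES:
        problems.append("unknown country code")
    for field in ("category", "review_content", "review_title"):
        if not isinstance(record.get(field), str):
            problems.append(f"{field} must be a string")
    rate = record.get("review_rate")
    if not isinstance(rate, int) or not 1 <= rate <= 5:
        problems.append("review_rate must be an integer between 1 and 5")
    return problems


# Made-up record, used only to exercise the validator.
example = {
    "country": "MLA",
    "category": "Hogar / Casa",
    "review_content": "Muy buen producto, llegó a tiempo.",
    "review_title": "Excelente",
    "review_rate": 5,
}
example_problems = validate_record(example)
```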

Data Splits

Each language configuration comes with its own train, validation, and test splits. The all_languages configuration is simply a concatenation of the corresponding splits across all languages. That is, the train split for all_languages is the concatenation of the train splits of each language, and likewise for validation and test.
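The relationship between the configurations can be sketched with plain Python lists. The rows below are toy placeholders, and the order in which languages are concatenated is an assumption; with the 🤗 Datasets library the same idea would typically be expressed with `concatenate_datasets`.

```python
SPLIT_NAMES = ("train", "validation", "test")


def build_all_languages(per_language):
    """Concatenate each split across languages.

    `per_language` maps a language code to a dict of split name -> rows.
    Languages are concatenated in sorted order (an assumption).
    """
    return {
        split: [
            row
            for lang in sorted(per_language)
            for row in per_language[lang][split]
        ]
        for split in SPLIT_NAMES
    }


# Toy data standing in for the real "es" and "pt" configurations.
toy = {
    "es": {"train": ["es-1", "es-2"], "validation": ["es-3"], "test": ["es-4"]},
    "pt": {"train": ["pt-1"], "validation": ["pt-2"], "test": ["pt-3"]},
}
all_languages = build_all_languages(toy)
```

Each split of the combined configuration therefore has as many rows as the per-language splits added together.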

Personal and Sensitive Information

Mercado Libre reviews are submitted by users with the knowledge and intention that they will be public. The reviewer IDs included in this dataset are anonymized, meaning that they are disassociated from the original user profiles. However, these fields would likely be easy to de-anonymize given the public and identifying nature of free-form text responses.

Discussion of Biases

The data included here are from unverified consumers. Some percentage of these reviews may be fake or contain misleading or offensive language.

Other Known Limitations

The dataset is constructed so that the distribution of star ratings is roughly balanced. This has some advantages for classification, but to achieve this balance some types of language may be over- or underrepresented relative to the original distribution of reviews. [More Information Needed]

Additional Information

Published by Lautaro Estienne, Matías Vera and Leonardo Rey Vega. Managed by the Signal Processing in Communications Laboratory of the Electronics Department at the School of Engineering of the University of Buenos Aires (UBA).

Licensing Information

[More Information Needed]

Citation Information

Please cite the following paper if you found this dataset useful:

(CITATION) [More Information Needed]
