franciellevargas / hatebr

HateBR is the first large-scale expert annotated dataset of Brazilian Instagram comments for hate speech and offensive language detection on the web and social media.

natural-language-processing text-classification machine-learning dataset hatespeech-detection brazilian-portuguese

hatebr's Introduction

HateBR - Offensive Language and Hate Speech Dataset in Brazilian Portuguese


HateBR is the first large-scale, expert-annotated dataset of Brazilian Instagram comments for abusive language detection on the web and social media. The comments were collected from the Instagram accounts of Brazilian politicians and manually annotated by specialists. The corpus comprises 7,000 documents annotated in three layers: a binary classification (offensive versus non-offensive comments), an offensiveness level (highly, moderately, or slightly offensive), and nine hate speech targets (xenophobia, racism, homophobia, sexism, religious intolerance, partyism, apology for the ************, antisemitism, and fatphobia). Each comment was annotated by three different annotators, and high inter-annotator agreement was achieved. Baseline experiments reached an F1-score of 85%, outperforming the baselines reported for existing Portuguese-language datasets. We hope this expert-annotated dataset will foster research on hate speech detection in Natural Language Processing.


This repository contains the corpus and the best models presented in the paper (see the "CITING" section). The HateBr.csv file has four columns, described below:

  • 1st column: the Instagram comment.
  • 2nd column: offensive language classification (offensive versus non-offensive).
  • 3rd column: offensiveness level (highly, moderately, or slightly offensive).
  • 4th column: hate speech classification into nine targets: antisemitism, apology for the ************, fatphobia, homophobia, partyism, racism, religious intolerance, sexism, and xenophobia. Comments that are offensive but contain no hate speech have their own label in this column.

The following table describes in detail the labels for each proposed layer of annotation:

Offensive Language

class                          label   total
offensive                      1       3,500
non-offensive                  0       3,500
Total                                  7,000

Offensiveness Levels

class                          label   total
highly                         3         778
moderately                     2       1,044
slightly                       1       1,678
non-offensive                  0       3,500
Total                                  7,000

Hate Speech

class                          label   total
antisemitism                   1           2
apology for the ************   2          32
fatphobia                      3          27
homophobia                     4          17
partyism                       5         496
racism                         6           8
religious intolerance          7          47
sexism                         8          97
xenophobia                     9           1
offensive & non-hate speech    -1      2,773
non-offensive                  0       3,500
Total                                  7,000
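
The label codes in the table can be turned into readable class names with a small lookup. The following is only a sketch: the function names are hypothetical, it assumes each comment carries exactly one integer per layer, and the consistency rule (label 0 appears in all three layers or in none) is inferred from the matching totals of 3,500 rather than stated by the authors.

```python
# Label maps transcribed from the annotation table above.
OFFENSIVE = {0: "non-offensive", 1: "offensive"}
LEVEL = {0: "non-offensive", 1: "slightly", 2: "moderately", 3: "highly"}
HATE_TARGET = {
    -1: "offensive & non-hate speech",
    0: "non-offensive",
    1: "antisemitism",
    2: "apology for the ************",  # label text kept as printed in the table
    3: "fatphobia",
    4: "homophobia",
    5: "partyism",
    6: "racism",
    7: "religious intolerance",
    8: "sexism",
    9: "xenophobia",
}


def decode_labels(offensive, level, target):
    """Map the three integer labels of one comment to their class names."""
    return {
        "offensive_language": OFFENSIVE[offensive],
        "offensiveness_level": LEVEL[level],
        "hate_speech": HATE_TARGET[target],
    }


def is_consistent(offensive, level, target):
    """Assumed cross-layer rule: label 0 appears in all layers or in none."""
    if offensive == 0:
        return level == 0 and target == 0
    if offensive == 1:
        return level in (1, 2, 3) and target in HATE_TARGET and target != 0
    return False


# A highly offensive comment whose hate speech target is partyism.
print(decode_labels(1, 3, 5))
```

A check like `is_consistent` can be run over every row after loading the CSV to catch rows whose layers disagree before training on them.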

We also provide baseline machine learning results for both tasks: offensive language detection and hate speech detection. The best models are available here as .pkl files. File names follow the pattern [classification (offensive or hate)]_[representation (ngram or tfidf)]_[algorithm (nb, svm, mlp, or lr)]. For example, offensive_tfidf_svm.pkl is the offensive language detection model that uses a tf-idf representation and the support vector machine algorithm.
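
The naming convention can be checked mechanically before loading a file. This is a minimal sketch, not part of the repository: `parse_model_filename` is a hypothetical helper, and it only accepts the component values listed in the paragraph above.

```python
from pathlib import Path

# Allowed values for each part of the file name, as listed in the README.
TASKS = {"offensive", "hate"}
REPRESENTATIONS = {"ngram", "tfidf"}
ALGORITHMS = {"nb", "svm", "mlp", "lr"}


def parse_model_filename(filename):
    """Split a model file name into task, representation, and algorithm."""
    parts = Path(filename).stem.split("_")
    if len(parts) != 3:
        raise ValueError(f"expected three '_'-separated parts: {filename!r}")
    task, representation, algorithm = parts
    if (
        task not in TASKS
        or representation not in REPRESENTATIONS
        or algorithm not in ALGORITHMS
    ):
        raise ValueError(f"unknown component in file name: {filename!r}")
    return {
        "task": task,
        "representation": representation,
        "algorithm": algorithm,
    }


print(parse_model_filename("offensive_tfidf_svm.pkl"))
```

Parsing the name only identifies which model a file holds. Producing predictions also requires text features built the same way as during training.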


CITING

Vargas, F., Carvalho, I., Góes, F. R., Pardo, T. A. S., Benevenuto, F. (2022). HateBR: A Large Expert Annotated Corpus of Brazilian Instagram Comments for Offensive Language and Hate Speech Detection. Proceedings of the 13th International Conference on Language Resources and Evaluation (LREC 2022), pp. 7174–7183. Marseille, France. https://aclanthology.org/2022.lrec-1.777/


Vargas, F., Carvalho, I., Pardo, T. A. S., Benevenuto, F. (2024). Context-Aware and Expert Data Resources for Brazilian Portuguese Hate Speech Detection. Natural Language Processing Journal, Cambridge Core, pp. 1–23. To appear.


BIBTEX

@inproceedings{vargas-etal-2022-hatebr,
    title = "{H}ate{BR}: A Large Expert Annotated Corpus of {B}razilian {I}nstagram Comments for Offensive Language and Hate Speech Detection",
    author = "Vargas, Francielle and Carvalho, Isabelle and Rodrigues de G{\'o}es, Fabiana and Pardo, Thiago and Benevenuto, Fabr{\'\i}cio",
    booktitle = "Proceedings of the 13th Language Resources and Evaluation Conference",
    year = "2022",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2022.lrec-1.777",
    pages = "7174--7183",
}



hatebr's People

Contributors

franciellevargas, isabellecarvalho

hatebr's Issues

Missing TF-IDF featurizer needed to use the models

Hello!

I am interested in using the models in this repository for a study that applies explainability techniques to them. For that, I would like to use the models in a way that is as close as possible to what the paper reports.

I noticed that the TF-IDF transformer fitted to the dataset is not in the repository (as far as I could find). I believe that without it, it is impossible to reconstruct the models that I understand you intended to share: they must have been fitted on a specific TF-IDF representation, and without it the model inputs cannot be built in exactly the same way as when the models were evaluated.

Questions:

  1. Could you share the transformer mentioned above? If it was implemented with sklearn, I believe it can be shared the same way as the models. I would be happy to help with this, if you are interested.
  2. If not, could you share the preprocessing details (beyond those in the paper)? More specifically:
    • The text preprocessing pipeline used, if any (e.g. lowercasing, removal of repeated tokens from sequences, removal of special characters such as !, ?, ...).
    • Whether the preprocessing and the TF-IDF features were extracted from the whole dataset or only from the training set, and whether the training and test sets can be recovered.

Thank you for the effort of putting this dataset together.
