Portuguese Language Models and Word Embeddings

This repository was designed primarily to assess the quality of the Portuguese ELMo representations made available through the AllenNLP library, comparing them with the language models and word embeddings currently available for Portuguese.

This source code reproduces the experiments reported in our paper, "Portuguese Language Models and Word Embeddings: Evaluating on Semantic Similarity Tasks". It evaluates all word embeddings from nathanshartmann/portuguese_word_embeddings on the semantic textual similarity tasks of the ASSIN datasets and compares them with the results achieved by ELMo and BERT. Some of our tests also concatenate ELMo representations with word embeddings from that repository.
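As a rough illustration of what concatenating ELMo with static word embeddings can look like, the sketch below joins the two vectors token by token, averages the result into a sentence vector, and compares two sentences by cosine similarity. The function names, the mean pooling step, and the cosine comparison are assumptions made for this example; the repository's own pipeline may build sentence representations differently.

```python
import math
from typing import List, Sequence

Vector = List[float]


def concat_token_vectors(contextual: Sequence[Vector],
                         static: Sequence[Vector]) -> List[Vector]:
    """Concatenate, token by token, a contextual vector with a static one."""
    if len(contextual) != len(static):
        raise ValueError("both inputs must cover the same tokens")
    return [list(c) + list(s) for c, s in zip(contextual, static)]


def mean_pool(token_vectors: Sequence[Vector]) -> Vector:
    """Average token vectors into one sentence vector (hypothetical pooling choice)."""
    if not token_vectors:
        raise ValueError("cannot pool an empty sentence")
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(v[i] for v in token_vectors) / n for i in range(dim)]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy 2-dimensional "ELMo" vectors and 1-dimensional "word2vec" vectors.
sentence_a = mean_pool(concat_token_vectors([[1.0, 0.0], [0.0, 1.0]], [[0.5], [0.5]]))
sentence_b = mean_pool(concat_token_vectors([[1.0, 1.0]], [[0.5]]))
similarity = cosine(sentence_a, sentence_b)
```

The concatenated vector has as many dimensions as the two inputs combined, so a downstream regressor sees both the contextual and the static signal.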

Benchmarks

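Our full benchmarks are available under reports/evaluation.csv. The most relevant benchmarks for the semantic textual similarity task are reproduced below, where PCC is the Pearson correlation coefficient (higher is better) and MSE is the mean squared error (lower is better).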

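| Dataset | Model | Embedding | Architecture | Dimensions | PCC | MSE |
|---|---|---|---|---|---|---|
| ASSIN 1 (pt-BR) | ELMo - wiki (reduced) | - | - | - | 0.62 | 0.47 |
| ASSIN 1 (pt-BR) | ELMo - wiki (reduced) | word2vec | CBOW | 1000 | 0.62 | 0.47 |
| ASSIN 1 (pt-BR) | portuguese-BERT | - | - | - | 0.53 | 0.55 |
| ASSIN 1 (pt-BR) | BERT-multilingual (cased) | - | - | - | 0.51 | 1.94 |
| ASSIN 1 (pt-PT) | ELMo - wiki (reduced) | - | - | - | 0.63 | 0.73 |
| ASSIN 1 (pt-PT) | ELMo - wiki (reduced) | word2vec | CBOW | 1000 | 0.64 | 0.73 |
| ASSIN 1 (pt-PT) | portuguese-BERT | - | - | - | 0.53 | 0.88 |
| ASSIN 1 (pt-PT) | BERT-multilingual (cased) | - | - | - | 0.52 | 0.90 |
| ASSIN 2 | ELMo - wiki (reduced) | - | - | - | 0.57 | 1.94 |
| ASSIN 2 | ELMo - wiki (reduced) | word2vec | CBOW | 1000 | 0.59 | 1.88 |
| ASSIN 2 | portuguese-BERT | - | - | - | 0.64 | 1.69 |
| ASSIN 2 | BERT-multilingual | - | - | - | 0.51 | 1.94 |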
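For readers unfamiliar with the two metric columns, the sketch below shows how a Pearson correlation coefficient (PCC) and a mean squared error (MSE) can be computed between gold similarity scores and predicted ones. It uses only the standard library and toy values; the function names are illustrative and are not taken from this repository's evaluation code.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (dev_x * dev_y)


def mse(gold: Sequence[float], pred: Sequence[float]) -> float:
    """Mean squared error between gold and predicted scores."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("need two non-empty sequences of equal length")
    return sum((g - p) ** 2 for g, p in zip(gold, pred)) / len(gold)


# Toy gold and predicted similarity scores for four sentence pairs.
gold_scores = [1.0, 2.5, 4.0, 5.0]
predicted_scores = [1.5, 2.0, 3.5, 4.5]
pcc_value = pearson(gold_scores, predicted_scores)
mse_value = mse(gold_scores, predicted_scores)
```

The two metrics can disagree: a model whose predictions are well ordered but systematically shifted scores a high PCC and a high MSE at the same time.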

In our benchmarks, the ELMo model labelled as wiki is the first public Portuguese ELMo model that was made available through the AllenNLP library website. Since then it has been replaced on the website by wiki (reduced).

The BRWAC model was trained on the brWaC corpus. The wiki (reduced) model was trained on the same dataset as wiki, after words occurring fewer than four times were removed from it.
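The frequency cut-off described above can be sketched as follows. This is a minimal example that assumes a tokenized corpus held in memory and simply drops rare tokens; whether the original preprocessing dropped such words or replaced them with a placeholder token is not stated here, and the function name is hypothetical.

```python
from collections import Counter
from typing import List


def drop_rare_words(sentences: List[List[str]], min_count: int = 4) -> List[List[str]]:
    """Remove every token that occurs fewer than `min_count` times in the corpus."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return [
        [token for token in sentence if counts[token] >= min_count]
        for sentence in sentences
    ]


# "gato" occurs four times and is kept; "raro" occurs once and is dropped.
corpus = [["gato", "raro"], ["gato"], ["gato"], ["gato"]]
reduced = drop_rare_words(corpus)
```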

Installation

Assuming you have installed Docker and nvidia-docker, the command below will reproduce all test results in this repository.

sudo bash scripts/quickstart.sh

Running this command builds the ruanchaves/elmo:2.0 Docker image if it does not exist yet, and downloads all NILC embeddings to the embeddings/NILC folder if they are not already there.

If you would also like to run BERT, extract your TensorFlow checkpoint files into the folder embeddings/bert/portuguese. The model must be provided as a checkpoint that bert-as-service can read, so you may have to rename some of the files to comply. Then move sentence_similarity/bert.yaml to settings/bert.yaml and regenerate scripts/quickstart.sh by running python generate_start.py.
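Before starting a run, it can help to check that the checkpoint folder contains the files bert-as-service looks for. The sketch below assumes that tool's default file names (bert_config.json, vocab.txt, and a checkpoint prefix of bert_model.ckpt); treat these names as assumptions and adjust them if your setup overrides the defaults. The helper name is hypothetical and is not part of this repository.

```python
from pathlib import Path
from typing import List

# Assumed bert-as-service defaults; change these if your server is configured otherwise.
EXPECTED_FILES = ("bert_config.json", "vocab.txt")
CHECKPOINT_PREFIX = "bert_model.ckpt"


def missing_bert_files(model_dir: str) -> List[str]:
    """Return the expected file names (or checkpoint pattern) not found in `model_dir`."""
    root = Path(model_dir)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    # A TensorFlow checkpoint is split across several files sharing one prefix.
    if not any(root.glob(CHECKPOINT_PREFIX + "*")):
        missing.append(CHECKPOINT_PREFIX + "*")
    return missing


# An empty result means nothing needs to be renamed.
problems = missing_bert_files("embeddings/bert/portuguese")
```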

Your results will be stored in the folder sentence_similarity/results by default.

Citation

@inproceedings{rodrigues_propor2020,
  author = {Ruan Chaves Rodrigues and Jéssica Rodrigues da Silva and Pedro Vitor Quinta de Castro and Nádia Félix Felipe da Silva and Anderson da Silva Soares},
  title = {Portuguese Language Models and Word Embeddings: Evaluating on Semantic Similarity Tasks},
  editor = {Paulo Quaresma and Renata Vieira and Sandra Aluísio and Helena Moniz and Fernando Batista and Teresa Gonçalves},
  booktitle = {Computational Processing of the Portuguese Language},
  note = {14th International Conference, PROPOR 2020, Évora, Portugal, March 2–4, 2020, Proceedings},
  publisher = {Springer International Publishing},
  address = {Springer Nature Switzerland AG},
  doi = {10.1007/978-3-030-41505-1},
  year = {2020}
}
