lirondos / lazaro


An observatory of anglicism usage in the Spanish press

Home Page: http://observatoriolazaro.es/

License: Other

Python 96.97% Shell 3.03%
crf-model corpus spanish-newswire anglicisms borrowings bilstm-crf mbert spanish linguistics


Observatorio Lázaro

This is the code repository of the Observatorio Lázaro website, an observatory of anglicism usage in the Spanish press. The purpose of this project is to apply a data-driven approach to the study of anglicisms (i.e., unadapted lexical borrowings from English) in Spanish newspapers. Every day, Observatorio Lázaro collects the latest news published in 22 Spanish news sources, analyzes it, and extracts the anglicisms used in that day's articles.

The core of the project is a machine learning model that extracts unadapted lexical borrowings (especially English lexical borrowings, or anglicisms) from Spanish articles. The model is a BiLSTM-CRF fed with word and subword embeddings. More information on the model can be found in the paper Detecting Unassimilated Borrowings in Spanish: An Annotated Corpus and Approaches to Modeling. More information on the motivation behind the project can be found in the About section of the Observatorio Lázaro website.
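Since the model is a sequence labeler, its per-token output has to be grouped into borrowing spans before anything can be counted. The sketch below shows one common way to do that, assuming the tags use a BIO scheme with an `ENG` label (for example `B-ENG`, `I-ENG`, `O`). The function name `extract_borrowings`, the tag names, and the sample sentence are illustrative assumptions, not the repository's actual code or output.

```python
from typing import List, Optional, Tuple


def extract_borrowings(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (span_text, label) pairs.

    Hypothetical helper: assumes tags such as "B-ENG", "I-ENG" and "O".
    """
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new span, closing any open one.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens = [token]
            current_label = tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_label:
            # An I- tag with a matching label extends the open span.
            current_tokens.append(token)
        else:
            # "O", or an I- tag with no matching open span, closes the span.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None

    if current_tokens:
        spans.append((" ".join(current_tokens), current_label))
    return spans


# Invented example sentence with hand-written tags (not real model output).
tokens = ["El", "streaming", "y", "las", "fake", "news", "dominan", "el", "debate"]
tags = ["O", "B-ENG", "O", "O", "B-ENG", "I-ENG", "O", "O", "O"]
print(extract_borrowings(tokens, tags))
```

In this sketch a stray `I-` tag with no open span is dropped rather than promoted to a new span. That is a design choice and other decoders handle it differently.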

The name of this project, Lázaro, is an homage to Spanish philologist Fernando Lázaro Carreter, whose columns admonishing against the use of anglicisms in the Spanish press became extremely popular during the 1980s and 1990s.

Observatorio Lázaro website

The output of Observatorio Lázaro, along with graphs, visualizations, and aggregated information on each anglicism it has registered, can be seen on the Observatorio Lázaro website.

Python library and models

Previous versions

A previous version of the Observatorio ran on a CRF model fed with handcrafted features and tracked 8 Spanish newspapers.
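To give a sense of what "handcrafted features" means for a CRF sequence labeler, here is a minimal sketch of a per-token feature function of the kind commonly used in such models. The function name `token_features` and the specific features (lowercased form, suffix, casing, neighbouring words) are assumptions chosen for illustration. They are not the feature set used by the earlier Lázaro model.

```python
from typing import Dict, List, Union

FeatureValue = Union[str, bool, float]


def token_features(tokens: List[str], i: int) -> Dict[str, FeatureValue]:
    """Build an illustrative feature dict for the token at position i.

    Hypothetical feature set, not the one used by the project.
    """
    token = tokens[i]
    features: Dict[str, FeatureValue] = {
        "bias": 1.0,
        "lower": token.lower(),
        "suffix3": token[-3:].lower(),  # endings like "-ing" can hint at English
        "is_title": token.istitle(),
        "is_upper": token.isupper(),
        "has_digit": any(ch.isdigit() for ch in token),
    }
    # Context features: the neighbouring words, or sentence-boundary flags.
    if i > 0:
        features["prev_lower"] = tokens[i - 1].lower()
    else:
        features["BOS"] = True
    if i < len(tokens) - 1:
        features["next_lower"] = tokens[i + 1].lower()
    else:
        features["EOS"] = True
    return features


sentence = ["El", "streaming", "crece"]
print(token_features(sentence, 1))
```

A CRF toolkit would typically receive one such dict per token, with each sentence passed as a list of dicts alongside its list of tags.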

Citation

If you use the Observatory, please cite the following references:

@inproceedings{alvarez-mellado-lignos-2022-detecting,
    title = "Detecting Unassimilated Borrowings in {S}panish: {A}n Annotated Corpus and Approaches to Modeling",
    author = "{\'A}lvarez-Mellado, Elena  and
      Lignos, Constantine",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-long.268",
    pages = "3868--3888",
    abstract = "This work presents a new resource for borrowing identification and analyzes the performance and errors of several models on this task. We introduce a new annotated corpus of Spanish newswire rich in unassimilated lexical borrowings{---}words from one language that are introduced into another without orthographic adaptation{---}and use it to evaluate how several sequence labeling models (CRF, BiLSTM-CRF, and Transformer-based models) perform. The corpus contains 370,000 tokens and is larger, more borrowing-dense, OOV-rich, and topic-varied than previous corpora available for this task. Our results show that a BiLSTM-CRF model fed with subword embeddings along with either Transformer-based embeddings pretrained on codeswitched data or a combination of contextualized word embeddings outperforms results obtained by a multilingual BERT-based model.",
}
@mastersthesis{ÁlvarezMelladoElena2020LAEo,
title = {Lázaro: An Extractor of Emergent Anglicisms in Spanish Newswire},
abstract = {The use of lexical borrowings from English (often called anglicisms) in the Spanish press evokes great interest, both in the Hispanic linguistics community and among the general public. Anglicism usage in Spanish language has been previously studied within the field of corpus linguistics. Prior work has traditionally relied on manual inspection of corpora, with the limitations that implies. This thesis proposes a model for automatic extraction of unadapted anglicisms in Spanish newswire. This thesis introduces: (1) an annotated corpus of 21,570 newspaper headlines (325,665 tokens) written in European Spanish annotated with unadapted anglicisms and (2) two sequence labeling models to perform automatic extraction of unadapted anglicisms: a conditional random field model with handcrafted features and a BiLSTM-CRF model with word and character embeddings. The best results are obtained by the CRF model, with an F1 score of 89.60 on the development set and 87.82 on the test set. Finally, a practical application of the CRF model is presented: an automatic pipeline that performs daily extraction of anglicisms from the main national newspapers of Spain.},
author = {Álvarez Mellado, Elena},
keywords = {anglicism detection;lexical borrowing;Spanish newswire},
language = {eng},
school = {Brandeis University, Graduate School of Arts and Sciences},
year = {2020},
}
