Coder Social home page Coder Social logo

albarji / sepln-spanish-se-constructions Goto Github PK

View Code? Open in Web Editor NEW
0.0 3.0 1.0 422 KB

Public repository for the paper "Classifying Spanish se constructions: from bag of words to language models" presented at SEPLN journal 2020

License: MIT License

Jupyter Notebook 100.00%

sepln-spanish-se-constructions's Introduction

 .----------------.  .----------------. 
| .--------------. || .--------------. |
| |    _______   | || |  _________   | |
| |   /  ___  |  | || | |_   ___  |  | |
| |  |  (__ \_|  | || |   | |_  \_|  | |
| |   '.___`-.   | || |   |  _|  _   | |
| |  |`\____) |  | || |  _| |___/ |  | |
| |  |_______.'  | || | |_________|  | |
| |              | || |              | |
| '--------------' || '--------------' |
 '----------------'  '----------------' 

Classifying Spanish se constructions: from bag of words to language models

This repository contains a reduced version of the SE-corpus, created for the paper Classifying Spanish se constructions: from bag of words to language models by Nuria Aldama and Álvaro Barbero, published at the SEPLN journal 2021 march edition. A notebook to reproduce the paper experiments is also included.

SE-corpus reduced version

The SE-corpus reduced version (from now on SE-corpus) is composed of 2140 sentences that come from CORPES XXI. The sentence selection procedure starts picking up, from the whole CORPES XXI, every sentence that contains the word se and belongs to the news, leisure and daily life domain in the European Spanish variant. From the output of this retrieval query, 3000 sentences (complete SE-corpus) are randomly selected. Through the last filter, those sentences having more than one instance of se are eliminated. Summing-up, the corpus used to carry out this research is composed of 2140 sentences containing a single instance of se. The corpus is representative of the news, leisure and daily life domain in the European Spanish variant because it maintains real usage distribution of se constructions.

The annotation process is carried out following the next annotation criteria (examples follow)

  • se-mark: cases of valency reduction (examples 6-9).
  • expl: pure pronominal predicates or emphatic contexts (examples 4–5).
  • iobj: se as indirect object of the main predicate (examples 1 and 3).
  • obj: se as direct object of the main predicate (example 2).

Corpus examples

(1)
Se      lo     dije         a  Juan ayer      .
Him-DAT it-ACC tell-PST.1SG to Juan yesterday .
‘I told it to Juan yesterday.’

(2)
Juan se          peina        .
Juan himself-ACC comb-PRS.3SG .
‘Juan combs himself.’

(3)	
Ellos se       envían       cartas.
They  them-DAT send-PRS.3PL letters.
‘They send letters to each other.’
    
(4)
Juan se  desmayó       de repente .
Juan him faint-PST.3SG suddenly   .
‘Juan suddenly fainted.’

(5)	
Juan (se) comió       un bocadillo.
Juan him  eat-PST.3SG a  sandwich.
‘Juan ate a sandwich.’
    
(6)	
El  jarrón se  rompió        .
The vase   -   break-PST.3SG .
‘The vase broke.’

(7)	
La  camisa se lava         muy	bien .
The shirt  -  wash-PRS.3SG very well .
‘The shirt washes very well.’

(8)	
Se buscan           camareros  .
-  look.for-PRS.3PL bartenders .
‘Bartenders are required.’

(9)	
Se come        bien aquí .
-  eat-PRS.3SG well here .
‘It's a good place to eat in here.'

Corpus download

The corpus is available as three files:

They are tab-separated files with header and one row per text. The columns are:

  • id: unique identifier of the text.
  • text
  • se_tag: "se" usage, among se-mark, expl, obj, iobj.

Notebook

The experiments can be reproduced by using the provided multiclass_classification_SE.ipynb notebook. We recommend to use a Google Colab instance with GPU.

Citing us

If you use this corpus or the notebook in this repository, please cite us as

Aldama-García, N., Barbero, A. (2021). Classifying Spanish se constructions: from bag of words to language models. Procesamiento del Lenguaje Natural 66, 153-164. ISSN 1989-7553. DOI 10.26342/2021-66-13.

Acknowledgements

We would like to thank Cristina Sánchez López and Amaya Mendikoetxea Pelayo for their help and dedication in the development of this study.

The authors acknowledge financial support from PID2019-106827GB-I00 / AEI / 10.13039/501100011033 and from the European Regional Development Fund and from the Spanish Ministry of Economy, Industry, and Competitiveness - State Research Agency, project TIN2016-76406-P (AEI/FEDER, UE).

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.