fellipefrancocouto / multinomial-classification-enem-2019

multinomial-classification-enem-2019

Problem Definition
The Exame Nacional do Ensino Médio (ENEM) is a Brazilian national standardized test through which students can earn a place at universities in Brazil and abroad (Inep, 2016). With millions of examinees from different social backgrounds, this paper uses the socio-economic data gathered for the 2019 edition of the exam to predict which social class a given applicant belongs to (A to E, following the methodology explained by Carneiro (2021) and used by IBGE). The microdata can be retrieved from https://www.gov.br/inep/pt-br/acesso-a-informacao/dados-abertos/microdados/enem (Inep, 2020). In summary, 24 questions ask for specific information about goods, education, or work (e.g., the number of cars a family has, if any; the father’s level of education; the type of job the mother does), and the objective of the algorithm is to use all of this data to classify an applicant’s social stratum among the five possibilities.
Solution Specification
The algorithm chosen for this multiclass classification task was Multinomial Logistic Regression (hereafter MLR). Several reasons favored this solution: it does not require assuming normality or homoscedasticity (Starkweather & Moske, 2011), and it offers reasonable accuracy for a “simple” model that copes reasonably well with unbalanced data (there are far more members of classes A and B). The main reason for choosing it, however, was the computational constraint. Given the size of the explored dataset (1 million entries), more complex solutions that might have reached higher accuracy were attempted but could not be run with the available resources. An MLR satisfied the computational constraint while still allowing the largest possible amount of data to be explored. Even with that choice, the complete code took roughly 15 hours to run.
More specifically, the MLR used the lbfgs solver (which can handle the multinomial loss), an L2 penalty (which adds the squared magnitude of the coefficients to the loss function to limit the overfitting expected in this high-dimensional dataset), and an inverse regularization parameter of 0.5 (a choice explained by the cross-validation process, justified in the section below).
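To make the role of the L2 penalty and the inverse regularization parameter concrete, the following is a minimal pure-Python sketch of the quantity such a model minimizes. It is not the notebook's code: the function names are hypothetical, and the scaling (penalty of 0.5 times the squared coefficients, data term multiplied by C) is an assumed convention that the source does not spell out.

```python
import math


def softmax(scores):
    """Turn one row of class scores into probabilities (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def penalized_loss(weights, biases, rows, labels, C=0.5):
    """Assumed objective: 0.5 * ||W||^2 + C * (sum of cross-entropy terms).

    weights[k] is the coefficient vector of class k, biases[k] its intercept
    (intercepts are left unpenalized here), and labels holds class indices.
    """
    data_term = 0.0
    for x, y in zip(rows, labels):
        scores = [
            sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
            for w, b in zip(weights, biases)
        ]
        data_term -= math.log(softmax(scores)[y])
    penalty = 0.5 * sum(w_j ** 2 for w in weights for w_j in w)
    return penalty + C * data_term
```

Under this convention a smaller C gives the penalty relatively more weight, so C = 0.5 regularizes more strongly than C = 1. A solver such as lbfgs would search for the weights and biases that minimize this value.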
Testing and Analysis
The first step in building the algorithm was loading and cleaning the data. The columns of interest (questions Q001 to Q025) were checked for missing data across all 5,095,270 entries, and none was found. All columns were then encoded as numerical values. The target column (with income information) was converted to the IBGE social classification for that year (expressed as the number of minimum wages a household earns), then separated from the other questions and renamed. Because of computational constraints, a subset of 1 million entries was randomly selected from the dataset. This subset was then split into a training set (75%) and a test set (the remaining 25%).
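The preparation steps described above can be sketched with the standard library alone. Everything here is illustrative rather than the notebook's actual code: the function names are hypothetical, the ordinal encoding is only one possible way to turn answers into numbers, and the class floors (in minimum wages) are assumed values that should be checked against Carneiro (2021) before use.

```python
import random

# ASSUMPTION: lower bounds of classes A-D in minimum wages; anything at or
# below the last floor falls into the lowest class. Verify before relying on it.
CLASS_FLOORS = [("A", 20.0), ("B", 10.0), ("C", 4.0), ("D", 2.0)]


def social_class(minimum_wages, floors=CLASS_FLOORS, lowest="E"):
    """Return the first class whose floor the household income exceeds."""
    for label, floor in floors:
        if minimum_wages > floor:
            return label
    return lowest


def encode_column(values):
    """Map each distinct answer to an integer code, in sorted order."""
    codes = {answer: code for code, answer in enumerate(sorted(set(values)))}
    return [codes[v] for v in values]


def sample_and_split(rows, sample_size, train_fraction=0.75, seed=0):
    """Draw a random subset of rows, then cut it into train and test parts."""
    rng = random.Random(seed)
    subset = rng.sample(rows, min(sample_size, len(rows)))  # random order
    cut = int(len(subset) * train_fraction)
    return subset[:cut], subset[cut:]
```

With the figures from the text, a `sample_size` of 1,000,000 and a `train_fraction` of 0.75 would yield 750,000 training rows and 250,000 test rows.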
References
Carneiro, T. R. A. (2021, December 10). Faixas Salariais x Classe Social—Qual a sua classe social? A vida é feita de Desconto. https://thiagorodrigo.com.br/artigo/faixas-salariais-classe-social-abep-ibge/
Inep. (2016, November 7). Sobre o Enem—Inep. https://web.archive.org/web/20161107012729/http://portal.inep.gov.br/web/enem/sobre-o-enem
Inep. (2020, November 17). Enem. Instituto Nacional de Estudos e Pesquisas Educacionais Anísio Teixeira | Inep. https://www.gov.br/inep/pt-br/acesso-a-informacao/dados-abertos/microdados/enem
Starkweather, J., & Moske, K. (2011). Multinomial Logistic Regression. https://it.unt.edu/sites/default/files/mlr_jds_aug2011.pdf
