ml-project-2-fmz

Project description:

The aim of this project was to define a machine learning pipeline that identifies whether a job advertisement is green or not. The database used contains one million job vacancies published in Switzerland. Job ads are divided into a title and a content field, can be written in different languages, and are unlabeled. The first step of the proposed pipeline is data preprocessing, which consists of splitting the ads by language, cleaning the entire database, and tokenizing the English data. Then, considering the English data only, we divided it into train and validation subsets for titles and contents separately, and into a test subset containing both. A manual or fast labeling process based on the United Nations Sustainable Development Goals is then applied. Finally, Ridge logistic regression, convolutional neural networks (CNN) and Bidirectional Encoder Representations from Transformers (BERT) were implemented. The best results are obtained using Ridge logistic regression.
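To illustrate the fast labeling step mentioned above, here is a minimal sketch of keyword-based labeling. It is not the code from our notebooks: the keyword list, the function names `build_pattern` and `fast_label`, and the rule "an ad is green if at least one keyword occurs" are all assumptions made for the example. The actual word list is the one referenced under Datasets.

```python
import re

# Hypothetical keywords for illustration only; the real list is the one
# stored in the Labeling dataset and is not reproduced here.
GREEN_KEYWORDS = ["renewable energy", "sustainability", "recycling", "climate"]


def build_pattern(keywords):
    """Compile one case-insensitive regex matching any keyword as a whole word or phrase."""
    escaped = (re.escape(keyword) for keyword in keywords)
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)


def fast_label(text, pattern):
    """Return 1 (green) if any keyword occurs in the text, else 0 (assumed rule)."""
    return 1 if pattern.search(text) else 0


PATTERN = build_pattern(GREEN_KEYWORDS)

# Made-up example ads, since the real data is confidential.
ads = [
    "Engineer for Renewable Energy projects",
    "Senior accountant",
    "Recycling plant operator",
]
labels = [fast_label(ad, PATTERN) for ad in ads]
```

Because the pattern uses word boundaries, a keyword only matches as a whole word, so inflected forms would need to be listed explicitly or handled by stemming during preprocessing.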

Our final pipeline:

Organisation of repository:

Please note that we ran our code on Colab to have access to a free GPU. To reproduce our results, some directory paths may need to be changed. We also suggest using a GPU to shorten the running time.

Also note that we have signed an NDA stating that the data used is confidential. As a result, our repository does not contain any database. Our notebooks do, however, contain some printed output (at most two lines per cell, as agreed with our hosting lab). If you want access to more details, please contact [email protected] from the IIPP lab directly.

Code
- Cleaning and Labeling: folder that contains the notebook we used to clean and fast-label our data.
- Ridgdelogreg: code to implement Ridge logistic regression for our classification of both the title and the content.
- BERT: code to implement BERT for our classification of both the title and the content.
- CNN: code to implement CNN for our classification of both the title and the content.
- Final Binary: code to join the results of the classification of both title and content into a single binary classification.
Datasets
- Labeling: contains the list of words we based our fast labeling on.
README.md
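
As an illustration of what the Final Binary step listed above does, the sketch below merges per-ad title and content predictions into one binary label. The combination rule shown here (an ad is green if either classifier predicts green) is an assumption made for the example, and `combine_predictions` is a hypothetical name; the notebook in the repository may use a different rule.

```python
def combine_predictions(title_preds, content_preds):
    """Merge two lists of 0/1 predictions into one list of 0/1 labels.

    Assumed rule: an ad is labeled green (1) if the title classifier
    or the content classifier predicts green.
    """
    if len(title_preds) != len(content_preds):
        raise ValueError("title and content predictions must have the same length")
    return [int(t == 1 or c == 1) for t, c in zip(title_preds, content_preds)]


# Made-up predictions for four ads.
title_preds = [1, 0, 0, 1]
content_preds = [0, 0, 1, 1]
final_labels = combine_predictions(title_preds, content_preds)
```

An OR rule favors recall; requiring both classifiers to agree would favor precision instead.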
