
identifyinggreenjobs's Introduction

ml-project-2-fmz

Project description:

The aim of this project was to define a machine learning pipeline that identifies whether a job advertisement is green or not. The database used contains one million job vacancies published in Switzerland. Job ads are divided into title and content, can be written in different languages, and are unlabeled. The first step of the proposed pipeline is data preprocessing, composed of language splitting, cleaning of the entire database, and tokenization of the English data. Then, considering the English data only, we divided it into train and validation subsets for titles and contents separately, and into a test subset containing both. A manual or fast labeling process is applied, based on the United Nations Sustainable Development Goals. Finally, Ridge logistic regression, convolutional neural networks (CNN), and Bidirectional Encoder Representations from Transformers (BERT) were implemented. The best results are obtained using Ridge logistic regression.
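
To make the fast labeling step more concrete, here is a minimal sketch of keyword-based labeling. It is an illustration only, not the project's actual code: the keyword set, the `tokenize` and `fast_label` helpers, and the "any keyword present means green" rule are all assumptions, and the real word list (kept in the Labeling folder) is not reproduced here.

```python
import re

# Hypothetical keywords for illustration; the project's real list is not shown here.
GREEN_KEYWORDS = {"renewable", "sustainability", "recycling", "solar"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def fast_label(text, keywords=GREEN_KEYWORDS):
    """Return 1 if any token is a green keyword, else 0 (assumed rule)."""
    return int(any(token in keywords for token in tokenize(text)))


# Toy job titles, invented for this example.
titles = ["Solar Panel Installer", "Senior Accountant"]
labels = [fast_label(title) for title in titles]
```

A rule like this is cheap to apply to a large unlabeled corpus, but it only produces noisy labels, which is presumably why a manual labeling option exists alongside it.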

Our final pipeline: (pipeline diagram, image not included)

Organisation of repository:

Please note that we ran our code on Colab to have access to a free GPU. To reproduce our results, some directory paths may need to be changed. We also suggest using a GPU in order to shorten the running time.

Also note that we have signed an NDA stating that the data used is confidential. As a result, our repository does not contain any database. Our notebooks, however, contain some printed outputs (at most two lines per cell, as agreed with our hosting lab). For more details, please contact [email protected] from the IIPP lab directly.

  • Code
    • Cleaning and Labeling: folder that contains the notebook we used to clean and fast label our data.
    • Ridgdelogreg: code to implement Ridge logistic regression for our classification of both the title and the content.
    • BERT: code to implement BERT for our classification of both the title and the content.
    • CNN: code to implement CNN for our classification of both the title and the content.
    • Final Binary: code to join the results of the classification of both title and content to a single binary classification.
  • Datasets
    • Labeling: contains the list of words we based our fast labeling on.
  • README.md
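
The joining step performed by the Final Binary code can be pictured with the short sketch below. The README does not state how the two predictions are merged, so the logical OR used here, along with the `combine_predictions` name and the toy predictions, is purely an assumption for illustration.

```python
def combine_predictions(title_pred, content_pred):
    """Join title and content labels (0 or 1) into one binary label.

    Assumed rule: the ad counts as green if either classifier says so.
    """
    return int(bool(title_pred) or bool(content_pred))


# Toy predictions for three ads, invented for this example.
title_preds = [1, 0, 0]
content_preds = [0, 0, 1]
final = [combine_predictions(t, c) for t, c in zip(title_preds, content_preds)]
```

An OR rule favors recall; requiring both classifiers to agree (an AND rule) would favor precision instead.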

identifyinggreenjobs's People

Contributors

zinebag
