andrii0yerko / imdb-sentiment-with-vowpal-wabbit


An attempt to apply Vowpal Wabbit to the classic task of IMDB review sentiment classification, plus a web API for the model.


IMDB sentiment with Vowpal Wabbit

An attempt to apply Vowpal Wabbit to the classic task of IMDB review sentiment classification (Kaggle). Training was performed on the IMDb Largest Review Dataset, which originally comes as 7 GB of JSON files.

Result achieved on the test set (public leaderboard): 0.9925 ROC AUC

A simple API for deploying VW sentiment models is also included.

Notebooks

  • Raw data preprocessing – Kaggle kernel, nbviewer

    Preprocessing consisted of JSON parsing, creating labels for binary sentiment, removing standard stop words and non-words, stemming, and saving the result in VW format.

  • Model training – Kaggle kernel, nbviewer

    Trained an SVM for binary classification, with the following hyperparameter tuning. Different hash bit sizes, n-gram orders and their combinations were compared; the best setting was --bit_precision=28 with --ngram=2 --ngram=3. L1/L2 regularization was also tried, but it was redundant for such a sparse feature space and did not improve the result.
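
The preprocessing step listed above ends with reviews written in VW's text input format (`label | feature feature ...`). The following is only a minimal sketch of that last conversion, not the notebook's actual code: the `to_vw_line` helper and the tiny `STOP_WORDS` set are hypothetical, stemming is left out, and the label is passed in as a boolean because the labelling rule is not described here.

```python
import re

# Hypothetical, deliberately tiny stop-word list (an assumption);
# the notebook uses a standard stop-word list instead.
STOP_WORDS = {"a", "an", "the", "is", "was", "it", "this", "and", "of", "to"}


def to_vw_line(text: str, positive: bool) -> str:
    """Turn one review into a line of VW text input.

    Keeping only alphabetic tokens drops digits and punctuation,
    including "|" and ":", which have a special meaning in VW input.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Binary labels for VW's logistic and hinge losses are 1 and -1.
    label = "1" if positive else "-1"
    return f"{label} | {' '.join(tokens)}"


if __name__ == "__main__":
    print(to_vw_line("This movie was GREAT: 10/10!", positive=True))
```

Each line produced this way is one training example; the n-grams mentioned above are generated by VW itself at training time, so they do not need to be written into the file.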

Api server

The API is built with Flask and Docker and is based on the sklearn wrapper of vowpal-wabbit.

Note that the API assumes probability prediction: the sigmoid function is applied to the model's linear output.

For the API to work correctly, train your models with logistic loss; with any other loss the sigmoid output is not a valid probability.
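
To make the note above concrete, here is a small sketch of how a raw linear score can be mapped to the two probabilities the API returns. The function names are hypothetical and this is not the application's code; it only illustrates the sigmoid step under the assumption that the score comes from a model trained with logistic loss.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def to_probabilities(raw_score: float) -> dict:
    """Map a raw linear score to positive/negative class probabilities."""
    p = sigmoid(raw_score)
    return {"positive": p, "negative": 1.0 - p}


if __name__ == "__main__":
    print(to_probabilities(0.0))
    print(to_probabilities(2.5))
```

With a model trained on, say, hinge loss, the raw score is a margin rather than a log-odds value, so the numbers produced this way would still lie in (0, 1) but could not be read as calibrated probabilities.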

Preparing models:

Before deployment, Vowpal Wabbit models must be converted to the format used by the application. This can be done with the create_vw_pipeline function from application.create_pipeline.

def create_vw_pipeline(vw_model_path, output_path=None, tag='0.0', comment=None):
    '''
    Creates and serializes application pipeline from the Vowpal Wabbit model file.

    Parameters
    ----------
    vw_model_path : str
        Path to file of vowpal wabbit saved model
    output_path : str, optional
        Path where pipeline will be saved
        Extension of the output file will be .jl
        Default is 'models/pipeline-v{tag}'
    tag : str, optional
        The version of the outputting pipeline
        Default is '0.0'
    comment : str, optional
        Any additional information, that will be added to resulting file
        Default is None
        
    Produced file is a joblib dump of a dictionary in the following format
    {
        'pipeline': pipeline,  # sklearn.pipeline.Pipeline
                               # containing preprocessing transformers
                               # and the classifier created from the vw file
        'tag': tag,
        'comment': comment
    }
    '''

Note that the created pipelines must be placed in the ./models folder (the default location) and have a .jl extension.
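
As an illustration of the conversion step, the sketch below wraps a call to create_vw_pipeline with the documented keyword arguments. The `build_pipeline` and `default_pipeline_path` helpers and the `model.vw` file name are hypothetical, and the resulting location `models/pipeline-v{tag}.jl` is an assumption derived from the docstring's stated default and extension.

```python
from pathlib import Path


def default_pipeline_path(tag: str = "0.0") -> Path:
    """Where the pipeline is assumed to land when output_path is omitted."""
    return Path("models") / f"pipeline-v{tag}.jl"


def build_pipeline(vw_model_path: str, tag: str = "0.0", comment=None) -> Path:
    """Convert a saved VW model and report the assumed output location."""
    # Imported lazily so this sketch can be loaded outside the project tree.
    from application.create_pipeline import create_vw_pipeline

    create_vw_pipeline(vw_model_path, tag=tag, comment=comment)
    return default_pipeline_path(tag)


if __name__ == "__main__":
    # "model.vw" is a placeholder for a model saved by Vowpal Wabbit.
    print(build_pipeline("model.vw", tag="1.0", comment="bigrams + trigrams"))
```

Leaving output_path unset keeps the file in the models folder, which is where the application expects to find it.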

Docker image:

Create all the needed pipelines, then build the image from the Dockerfile.

The image also includes a Vowpal Wabbit installation, which can be used to train new models inside the running container.

Endpoints:

All pipelines are loaded when the application starts, and the following endpoints are created for each of them:

  • GET /api/v{tag}/

    Response: 200 OK, {"info": additional information (comment), "version": tag}

  • POST /api/v{tag}/predict

    Request body: {"text": "some text"}

    Response: 200 OK, {"positive": probability, "negative": probability}

  • POST /api/v{tag}/weight

    Request body: {"text": "some text"}

    Response: 200 OK, {"word1": weight, "word2": weight, "n gram1": weight, ...}
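
The endpoints above could be called from Python roughly as follows. This is a sketch using only the standard library: the base URL (host and port) is an assumption about how the container is exposed, the helper names are hypothetical, and sending a JSON Content-Type header is assumed to be what the server expects.

```python
import json
from urllib.request import Request, urlopen

# Assumption: the server is reachable locally on Flask's default port.
BASE_URL = "http://localhost:5000"


def build_request(tag: str, endpoint: str, text: str) -> Request:
    """Build a POST request for /api/v{tag}/predict or /api/v{tag}/weight."""
    url = f"{BASE_URL}/api/v{tag}/{endpoint}"
    body = json.dumps({"text": text}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return Request(url, data=body, headers=headers, method="POST")


def call(tag: str, endpoint: str, text: str) -> dict:
    """Send the request and decode the JSON response."""
    with urlopen(build_request(tag, endpoint, text)) as response:
        return json.load(response)


if __name__ == "__main__":
    # Requires a running server with a pipeline tagged 0.0.
    print(call("0.0", "predict", "A surprisingly good movie"))
    print(call("0.0", "weight", "A surprisingly good movie"))
```

Building the request separately from sending it makes it possible to inspect the URL and body without a running server.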
