bloomtech-labs / groa-ds

Open-source movie recommendation engine

License: MIT License

Python 83.66% Jupyter Notebook 16.21% Dockerfile 0.01% Batchfile 0.02% Shell 0.03% PowerShell 0.02% HTML 0.05%
labs21 labs23 labs19

groa-ds's Introduction

You can check out the live demo of Gróa here.

Contributors

Labs19

Michael Gospodinoff Gabe flomo Jeff Rowe Coop Williams Eric Smith

Labs 21

Niki Bhatt Riley Jones

Labs 23

Doina Langille Ben de Vera Erik Cowley Chase Goldfeld Benjamin Bishop

Project Overview

Trello Board: We use Trello as a quick wireframe tracker through the first stages of development. As the product moves past releases 1.1 and 1.2, we will transition away from Trello and into the git ecosystem entirely.

Product Canvas: This Notion document is a good resource if you want to learn more about our motivations for creating this product and the general direction of its development.

Project Description

  • We trained Word2Vec on users' positive ratings histories to create a user-based collaborative filtering recommender. The algorithm embeds over 97,000 movie IDs into a 100-dimensional vector space according to their co-occurrence in users' positive ratings histories. Each movie's ID is the key for its vector, which can be retrieved from the model and compared with any other vector in that space by cosine similarity. To provide recommendations for a new user's watch history, we average the vectors of the user's chosen "good movies" and find the top-n most cosine-similar vectors in the model. We can improve the recommendations by subtracting a "bad movies" vector from the "good movies" vector before inference. Models trained in this way can be tested by treating a user's watchlist (unwatched movies saved for later) as a validation set.

  • The above model fulfills most requirements for a general-purpose movie recommender system. However, it is unable to make riskier recommendations for movies that a majority of reviewers do not enjoy (cult movies). To satisfy users who seek underrated movies, we also trained Doc2Vec on user review histories to create a review-based collaborative filtering model. This model does not recommend movies directly; it finds reviewers who write similarly to a new user. We then query the review database for positive reviews from these reviewers in two cases: movies whose ratings count is 1k-10k (hidden gems), and movies that the reviewer rated 3 stars above the average.
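The recommendation step described in the first bullet can be sketched as follows. This is a minimal illustration in plain Python, not the project's code: the `recommend` function, the `toy` dictionary, and the 2-dimensional vectors are hypothetical stand-ins for the trained model's 100-dimensional movie vectors. In practice, gensim's `KeyedVectors.most_similar(positive=..., negative=..., topn=...)` offers a similar query, though it normalizes vectors before averaging.

```python
import math

def mean_vector(vectors):
    """Element-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def recommend(embeddings, good_ids, bad_ids=(), top_n=10):
    """Rank unseen movies by cosine similarity to the user's taste vector."""
    # Taste vector: average of the "good movies" vectors ...
    taste = mean_vector([embeddings[m] for m in good_ids])
    # ... minus the average of the "bad movies" vectors, if any were given.
    if bad_ids:
        bad = mean_vector([embeddings[m] for m in bad_ids])
        taste = [g - b for g, b in zip(taste, bad)]
    # Skip movies the user has already rated.
    seen = set(good_ids) | set(bad_ids)
    scored = [
        (movie_id, cosine(taste, vec))
        for movie_id, vec in embeddings.items()
        if movie_id not in seen
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Toy embeddings (hypothetical); real vectors would come from the trained model.
toy = {
    "m1": [1.0, 0.0],
    "m2": [0.9, 0.1],
    "m3": [0.0, 1.0],
    "m4": [-1.0, 0.0],
}
recs = recommend(toy, good_ids=["m1"], bad_ids=["m3"], top_n=2)
```

Here `m2` ranks first because it points in nearly the same direction as the liked movie `m1` and away from the disliked movie `m3`.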

The lightning-fast inference of the Word2Vec/Doc2Vec algorithms allows us to incorporate user feedback into progressively updated recommendations. If the user approves or disapproves of a movie, its corresponding vector is added to, or subtracted from, the user's overall taste vector. Weighting these feedback vectors by a factor such as 1.2 increases the influence of that feedback on the user's taste vector, and this factor can be tuned to change the effective "learning rate" of the re-recommendation process.
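A minimal sketch of this feedback update, assuming the taste vector and movie vectors are plain lists of floats. The function name `apply_feedback` and the toy values are hypothetical; only the add/subtract rule and the example weight of 1.2 come from the description above.

```python
def apply_feedback(taste, movie_vec, approved, weight=1.2):
    """Return a new taste vector nudged toward (or away from) a movie.

    `weight` acts as the effective "learning rate" of the feedback.
    """
    sign = 1.0 if approved else -1.0
    return [t + sign * weight * m for t, m in zip(taste, movie_vec)]

# Toy 2-dimensional example (hypothetical values).
taste = [0.5, 0.5]
taste = apply_feedback(taste, [1.0, 0.0], approved=True)   # user liked this movie
taste = apply_feedback(taste, [0.0, 1.0], approved=False)  # user disliked this one
```

After each update, the recommendations can be recomputed by running the same cosine-similarity query against the new taste vector.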

Tech Stack

AWS:

  • EC2
  • S3 Bucket
  • RDS
  • Elastic Beanstalk

Machine Learning:

Data Collection/Manipulation:

Predictions

Based on the user's movie ratings and reviews, we provide recommendations for movies that they have never before considered watching. We do this by vectorizing the user's Letterboxd or IMDb reviews and finding cosine-similar matches among 22 GB of movie reviews. Results can be filtered to remove movies the user has already watched, as long as they provide their data exported from one of those sites.

Contributing

When contributing to this repository, please first discuss the change you wish to make via issue, email, or any other method with the owners of this repository before making a change.

Please note we have a code of conduct. Please follow it in all your interactions with the project.

Issue/Bug Request

If you are having an issue with the existing project code, please submit a bug report under the following guidelines:

  • Check first to see if your issue has already been reported.
  • Check to see if the issue has recently been fixed by attempting to reproduce the issue using the latest master branch in the repository.
  • Create a live example of the problem.
  • Submit a detailed bug report including your environment & browser, steps to reproduce the issue, actual and expected outcomes, where you believe the issue is originating from, and any potential solutions you have considered.

Feature Requests

We would love to hear from you about new features which would improve this app and further the aims of our project. Please provide as much detail and information as possible to show us why you think your new feature should be implemented.

Pull Requests

If you have developed a patch, bug fix, or new feature that would improve this app, please submit a pull request. It is best to communicate your ideas with the developers first before investing a great deal of time into a pull request to ensure that it will mesh smoothly with the project.

Remember that this project is licensed under the MIT license, and by submitting a pull request, you agree that your work will be, too.

Pull Request Guidelines

  • Ensure any install or build dependencies are removed before the end of the layer when doing a build.
  • Update the README.md with details of changes to the interface, including new plist variables, exposed ports, useful file locations and container parameters.
  • Ensure that your code conforms to our existing code conventions and test coverage.
  • Include the relevant issue number, if applicable.
  • You may merge the Pull Request in once you have the sign-off of two other developers, or if you do not have permission to do that, you may request the second reviewer to merge it for you.

Attribution

These contribution guidelines have been adapted from this good-Contributing.md-template.

Documentation

See Backend Documentation for details on the backend of our project.

See Front End Documentation for details on the front end of our project.

groa-ds's People

Contributors

aufeld, bendevera, benjamin1118, cmgospod, coopwilliams, doinalangille, five-hundred-eleven, gabe-flomo, levi-huynh, malexmad, moviedatascience, nikibhatt, rileythejones, rowebyrowe

groa-ds's Issues

Data Format for Ingestion Is Not Uniform

Expected Behavior

Data sent to the endpoint should be accepted and processed into a format that the model can ingest.

Current Behavior

The processor.py file accepts CSV data and converts it to a pandas dataframe.

The model only accepts a single nested numpy array.

Therefore, if you send the SageMaker container a numpy array, it throws an error because it only accepts CSV data; if you send CSV data, the model throws an error because it only accepts a nested array.

Additionally, the inference method currently raises an error because its import is tied directly to a TensorFlow-based model.

Possible Solution

Reconfigure processor.py so that it accepts CSV data but converts that data into a nested numpy array.

Additionally, we must make use of the SageMaker SDK's Predictor class.
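The conversion proposed above might look like the following sketch. It assumes that "nested numpy array" means a 2-D array with one row per record, and that the CSV payload is headerless and entirely numeric. The function name `csv_to_nested_array` is hypothetical and is not taken from processor.py.

```python
import io

import numpy as np

def csv_to_nested_array(payload: str) -> np.ndarray:
    """Parse a headerless, all-numeric CSV payload into a 2-D float array.

    ndmin=2 keeps a single-row payload as shape (1, n_features)
    instead of collapsing it to a 1-D array.
    """
    return np.loadtxt(io.StringIO(payload), delimiter=",", ndmin=2)

# Hypothetical payloads: a two-record batch and a single record.
batch = csv_to_nested_array("0.1,0.2,0.3\n0.4,0.5,0.6\n")
single = csv_to_nested_array("0.1,0.2,0.3\n")
```

Keeping single-record payloads two-dimensional means the model always receives the same nesting regardless of batch size.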

Steps to Reproduce

  1. Deploy endpoint
  2. Using the fulldeploy notebook, run all cells
  3. Observe type error from accepting a csv

Context (Environment)

This is a core functionality blocker.

Update Sagemaker Container from Python 2 to Python 3

Expected Behavior

Docker container for Sagemaker launches and runs without error in Python 3.

Current Behavior

Currently, if the Dockerfile is changed to use Python 3, parts of the Flask infrastructure throw errors; the affected areas have not yet been identified.

Possible Solution

Adjust the Flask infrastructure so that it is completely compatible with Python 3.

Steps to Reproduce

  1. Change the Dockerfile from Python 2 to Python 3
  2. Launch the container within a SageMaker instance
  3. Observe the failed health check for the endpoint in CloudWatch

Context (Environment)

This is not a core functionality blocker. Currently, all the inference and hosting we need from the container and Flask app work under Python 2.

However, if we want to start using the Docker container to train our model(s), we will need a Python 3 container.
