bloomtech-labs / betterreads-ds

License: MIT License

Python 7.09% Jupyter Notebook 92.84% Dockerfile 0.07%

labs23 labs21

betterreads-ds's Introduction

Readrr Recommendations API

Visit the Readrr web application
Primary API Documentation
Labs 21 Search API Documentation

DS teams

Labs 23

Developer	Github	LinkedIn	Portfolio
Patrick Wolf			🤷
Ryan Zernach			💼
Michael Rowland			💼
Jose Marquez			💼

Labs 21

Developer	Github	LinkedIn	Portfolio
Claudia Chajon			🤷
Enrique Collado			🤷
Dylan Nason			🤷
Kumar Veeravel			🤷

Project Overview

Deployed Front End

Trello Board

Product Canvas

The aim of this project is to provide a clean, uncluttered user interface that allows a user to track books, in a similar fashion to something like Goodreads. More details can be found in the product vision document (PVD accessible only to team members).

The core DS role on this project is to provide recommendations. If there are other DS utilities that you would like to add, communicate with Web, UI and iOS in order to get UI design input on the feature and identify the necessary data.

Tech Stack

Currently, the application uses a simple nearest-neighbors-based search engine which funnels title matches into a system that references a cosine similarity matrix. The matrix is built from a combination of collaborative and content-based recommendation approaches. This method provides the best recommendations we have encountered to date. Unfortunately, the current data is limited to fewer than 10k books; to prevent empty recommendations where data for a book is non-existent, the hybrid engine falls back to a purely description-based recommendation wherever necessary. This means that there are two recommendation engines working together to provide a seamless experience.

The hybrid engine is an aggregation of cosine similarities from a collaborative filtering method and a content-based one that uses descriptions. The standalone content-based system, by contrast, uses a combination of spaCy for tokenization, TF-IDF for vectorization and a scikit-learn nearest-neighbors model to find the closest matches to the book in question.

All of these techniques are served to Web and iOS through a Flask application, with a gunicorn HTTP server, deployed inside of a Docker container to AWS Elastic Beanstalk.
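As a rough illustration of the two engines described above, the sketch below scores books with an average of collaborative and content similarity and falls back to description similarity alone when rating data is missing. It is not the project's code: the function names, the plain-list vectors, and the choice of a simple average as the "aggregation" are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def hybrid_scores(title, rating_vecs, desc_vecs):
    """Score every other book against `title`.

    Assumption: the hybrid score is the mean of the two similarities.
    When either book has no rating vector, only the description
    similarity is used (the fallback engine).
    """
    scores = {}
    for other, other_desc in desc_vecs.items():
        if other == title:
            continue
        content = cosine(desc_vecs[title], other_desc)
        if title in rating_vecs and other in rating_vecs:
            collab = cosine(rating_vecs[title], rating_vecs[other])
            scores[other] = (collab + content) / 2
        else:
            scores[other] = content
    return scores


def recommend(title, rating_vecs, desc_vecs, k=2):
    """Return the k highest-scoring titles for `title`."""
    scores = hybrid_scores(title, rating_vecs, desc_vecs)
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

In a real deployment the description vectors would come from the TF-IDF step and the rating vectors from the user-rating matrix; here they are just short lists so the logic is easy to follow.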

Data Sources

10k Books, 6m Ratings
Book Crossing Dataset (Mostly used for publishers to populate database)

Python Notebooks

Collaborative Filtering

Description Based Recommendations

Hybrid Model and Title Search

Connecting to the web API

Details on how to connect to the Web API are located at the top of this document.

Connecting to the DS API

Currently, the domain for the data science API is dsapi.readrr.app.

The account used for the Postman collection is the [email protected] account (sign in using Google). See your TL or SL for login credentials (you can also get login credentials for AWS, which the API is deployed on).

Contributing

When contributing to this repository, please first discuss the change you wish to make via issue, email, or any other method with the owners of this repository before making a change.

Please note we have a code of conduct. Please follow it in all your interactions with the project.

Issue/Bug Request

If you are having an issue with the existing project code, please submit a bug report under the following guidelines:

Check first to see if your issue has already been reported.
Check to see if the issue has recently been fixed by attempting to reproduce the issue using the latest master branch in the repository.
Create a live example of the problem.
Submit a detailed bug report including your environment & browser, steps to reproduce the issue, actual and expected outcomes, where you believe the issue is originating from, and any potential solutions you have considered.

Feature Requests

We would love to hear from you about new features which would improve this app and further the aims of our project. Please provide as much detail and information as possible to show us why you think your new feature should be implemented.

Pull Requests

If you have developed a patch, bug fix, or new feature that would improve this app, please submit a pull request. It is best to communicate your ideas with the developers first before investing a great deal of time into a pull request to ensure that it will mesh smoothly with the project.

Remember that this project is licensed under the MIT license, and by submitting a pull request, you agree that your work will be, too.

Pull Request Guidelines

Ensure any install or build dependencies are removed before the end of the layer when doing a build.
Update the README.md with details of changes to the interface, including new plist variables, exposed ports, useful file locations and container parameters.
Ensure that your code conforms to our existing code conventions and test coverage.
Include the relevant issue number, if applicable.
You may merge the Pull Request in once you have the sign-off of two other developers, or if you do not have permission to do that, you may request the second reviewer to merge it for you.

Attribution

These contribution guidelines have been adapted from this good-Contributing.md-template.

More info on using badges here

betterreads-ds's People

Contributors

Stargazers

Watchers

Forkers

mvkumar14 patrickjwolf jose-marquez89

betterreads-ds's Issues

No error handling for null google books data

When a Google Books API call returns no books, the 'items' key is missing from the JSON response. This causes a KeyError in recommender.py. Error handling is needed here:

    def db_insert(self, isbn=None):
        api = GBWrapper()
        if isbn is not None:
            google_books_response = api.search(isbn)
        else:
            google_books_response = api.search(self.googleId)

        # INSERTS GB_QUERY INTO DATABASE
        logging.debug("GETTING API DATA...")
        # add error handling for the line below:
        api_data = get_value(google_books_response['items'][0])
        if isbn is not None:
            gid = api_data[0]
            details = retrieve_details(google_books_response)
        else:
            gid = None
            details = None
        # execute_queries(api_data, self.conn, self.cursor)
        execute_queries(api_data, self.conn)
        return gid, details
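
One way the missing-key case could be handled is sketched below. The helper name `first_volume` is hypothetical and not part of the repository; the idea is simply to read 'items' defensively and let the caller decide what to do when nothing comes back.

```python
import logging


def first_volume(google_books_response):
    """Return the first volume in a Google Books response, or None if there is none.

    Hypothetical helper: a response with no matches has no 'items' key,
    so .get() is used instead of indexing.
    """
    items = google_books_response.get("items") or []
    if not items:
        logging.warning("Google Books returned no items")
        return None
    return items[0]
```

Inside db_insert, the result could be checked before calling get_value, returning (None, None) or raising a handled error when it is None.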

New York Times Endpoint Output

Some of the keys/property names from the NYT endpoint (for example googleid, pagecount) differ from those in the main recommendations output (googleId, pageCount; see below). This causes an issue for iOS. The NYT endpoint needs to be changed to match the recommendations output.
Recommendations Output:

{
        "authors": [
          "George Orwell"
        ],
        "averageRating": 4,
        "categories": [
          "Fiction"
        ],
        "description": "Portrays life in a future time when a totalitarian government watches over all citizens and directs all activities",
        "googleId": "_NBZPgAACAAJ",
        "industryIdentifiers": [
          {
            "identifier": "143527704X",
            "type": "ISBN"
          }
        ],
        "isEbook": false,
        "language": "en",
        "pageCount": 294,
        "publishedDate": "1983",
        "publisher": "Paw Prints",
        "smallThumbnail": "http://books.google.com/books/content?id=_NBZPgAACAAJ&printsec=frontcover&img=1&zoom=5&source=gbs_api",
        "textSnippet": "Portrays life in a future time when a totalitarian government watches over all citizens and directs all activities",
        "thumbnail": "http://books.google.com/books/content?id=_NBZPgAACAAJ&printsec=frontcover&img=1&zoom=1&source=gbs_api",
        "title": "1984",
        "webReaderLink": "http://play.google.com/books/reader?id=_NBZPgAACAAJ&hl=&printsec=frontcover&source=gbs_api"
      }

NYT Output:

{
      "authors": [
        "'John Grisham'"
      ],
      "averagerating": 3.0,
      "categories": [
        "'Fiction'"
      ],
      "description": "#1 New York Times bestselling author John Grisham returns to Camino Island in this irresistible page-turner that's as refreshing as an island breeze. In Camino Winds, mystery and intrigue once again catch up with novelist Mercer Mann, proving that the suspense never rests--even in paradise.",
      "googleid": "4gBqygEACAAJ",
      "isbn": "0385545932",
      "isebook": false,
      "lang": "en",
      "maturityrating": "NOT_MATURE",
      "pagecount": 304,
      "publisheddate": "2020-04-28",
      "publisher": "Doubleday",
      "ratingscount": 1,
      "smallthumbnail": "http://books.google.com/books/content?id=4gBqygEACAAJ&printsec=frontcover&img=1&zoom=5&source=gbs_api",
      "subtitle": null,
      "textsnippet": "In Camino Winds, mystery and intrigue once again catch up with novelist Mercer Mann, proving that the suspense never rests--even in paradise.",
      "thumbnail": "http://books.google.com/books/content?id=4gBqygEACAAJ&printsec=frontcover&img=1&zoom=1&source=gbs_api",
      "title": "Camino Winds",
      "webreaderlink": "http://play.google.com/books/reader?id=4gBqygEACAAJ&hl=&printsec=frontcover&source=gbs_api"
    }
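
A minimal sketch of how the NYT payload could be reshaped to match the recommendations output shown above. The mapping is inferred only from the two samples in this issue; the function name is hypothetical, keys with no counterpart in the recommendations sample (maturityrating, ratingscount, subtitle) are passed through unchanged, and stripping the extra single quotes from authors and categories is an assumption about the desired result.

```python
# Lowercase NYT keys mapped to the camelCase keys used by the recommendations output.
KEY_MAP = {
    "averagerating": "averageRating",
    "googleid": "googleId",
    "isebook": "isEbook",
    "lang": "language",
    "pagecount": "pageCount",
    "publisheddate": "publishedDate",
    "smallthumbnail": "smallThumbnail",
    "textsnippet": "textSnippet",
    "webreaderlink": "webReaderLink",
}


def normalize_nyt_book(book):
    """Return a copy of an NYT book dict using the recommendations key names."""
    out = {}
    for key, value in book.items():
        if key == "isbn":
            # The recommendations sample nests the ISBN in industryIdentifiers.
            out["industryIdentifiers"] = [{"identifier": value, "type": "ISBN"}]
        elif key in ("authors", "categories"):
            # Drop the stray single quotes seen in the NYT sample.
            out[key] = [item.strip("'") for item in value]
        else:
            out[KEY_MAP.get(key, key)] = value
    return out
```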

Google Books Search Query

https://github.com/Lambda-School-Labs/betterreads-ds/blob/235415c7684ea23f9a25e55690f9b12886c720ab/readrr_api/route_tools/gb_search.py#L30

I ran into an issue: when I declare the search method while instantiating this class, the method is appended to the end of the request instead of being placed before the search term.

For example:

gb = GBWrapper(method='isbn')
gb.search('9781524763138')

The request as it is currently generated: https://www.googleapis.com/books/v1/volumes?q=9781524763138isbn
The correct request format: https://www.googleapis.com/books/v1/volumes?q=isbn:9781984801258
This follows the Google Books API Documentation here.
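
The expected query format can be sketched as follows. This is an illustration rather than the GBWrapper implementation: `build_search_url` and `BASE_URL` are names made up for the example, and the only behavior shown is placing the method keyword and a colon in front of the search term.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.googleapis.com/books/v1/volumes"


def build_search_url(term, method=None):
    """Build a volumes search URL, prefixing the term with 'method:' when given."""
    query = f"{method}:{term}" if method else term
    # safe=":" keeps the colon readable instead of encoding it as %3A.
    return f"{BASE_URL}?{urlencode({'q': query}, safe=':')}"
```

For example, build_search_url('9781524763138', method='isbn') yields a URL ending in ?q=isbn:9781524763138.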

[Labs 23 READ THIS FIRST] Current Deployment Status

The DS API is currently deployed from a personal heroku account. A team heroku account has been set up, but code has not been deployed from there yet, due to access permissions regarding

You can:

Keep the current setup and explore a docker deployment
OR
Set up with the team Heroku account deploying from this (betterreads-ds) repo. This requires setting up a GitHub account for the team [email protected] account and giving that account admin privileges to this repo, OR having a TL or SL deploy the app using their admin privileges, OR designating one team member to deploy with admin privileges from their personal GitHub. Work this out with your TL/SL.

The advantage of the second option is that you can quickly upload and test new code. I (personally) don't know enough about Docker to say how long a Docker deployment will take, but the Docker solution might be better for the long term. (See issue #8)

To access the betterreadslabs21 Heroku account you MUST sign in using the Google sign-in, and not an e-mail/password combination. You can either deploy to the existing dyno from a personal repo, or you can request admin access to the Lambda-School-Labs/betterreads-ds repo using the [email protected] account, and then deploy from the heroku_deployment branch.

Note that one of the environment variables (called config variables on Heroku) that must be set for the app to deploy properly is GOOGLE_KEY. The current dyno in the Heroku account already has this variable set (with an API key associated with the betterreadslabs21 Gmail account), but if you deploy to a new environment you must set GOOGLE_KEY there as well.
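
A small sketch of how the app could fail fast when GOOGLE_KEY is missing in a new environment. The function name and the error message are illustrative assumptions; only the variable name GOOGLE_KEY comes from this issue.

```python
import os


def get_google_key(environ=os.environ):
    """Return the Google API key, raising a clear error if it is not configured."""
    key = environ.get("GOOGLE_KEY")
    if not key:
        raise RuntimeError(
            "GOOGLE_KEY is not set; add it as a config variable before deploying"
        )
    return key
```

Calling this once at startup surfaces a missing key immediately rather than on the first Google Books request.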

Google API keys can be found:
https://console.cloud.google.com/apis/dashboard (signing in with the gmail account)

Metric to evaluate recommendations

There is currently no metric to evaluate the quality of recommendations, so you will have to get a sense for how "good" the recommendations are based on how people react to the recommendations.

We need to set up a framework for A/B testing models with users. We want the framework to be ready when the app has users, so iteration can be done quickly.
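
One possible building block for such a framework is a stable assignment of users to model variants, sketched below. The variant names and the hashing approach are assumptions for illustration; nothing like this exists in the repository yet.

```python
import hashlib


def assign_variant(user_id, variants=("hybrid", "description_only")):
    """Deterministically map a user to one model variant.

    Hashing the user id means the same user always sees the same model,
    without storing any assignment state.
    """
    digest = hashlib.sha256(str(user_id).encode("utf-8")).hexdigest()
    return variants[int(digest, 16) % len(variants)]
```

The API could log the assigned variant alongside whatever feedback signal the app collects, so the variants can be compared once there are users.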

Refactoring: Recommendation Engine

The recommendation engine should be isolated to its own file and class.

This would eliminate data processing from within the API (routes) file, and make our code generally more readable. This would also mean development and iteration on the recommendation engine would be contained to the recommendation file.

When we serve our recommendations, we would simply call the recommendation class with our input data and return our recommendations.

https://github.com/Lambda-School-Labs/betterreads-ds/blob/2496f95424cedcce8512c7d476017ab818698944/readrr_api/routes/recommendations.py#L37
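
The shape of the proposed refactor might look like the sketch below. The class name, the constructor argument, and the precomputed similarity dictionary are hypothetical and stand in for whatever data the real engine loads.

```python
class Recommender:
    """Owns all recommendation logic so that route handlers stay thin."""

    def __init__(self, similarity):
        # similarity: {title: {other_title: score}}, assumed to be precomputed.
        self.similarity = similarity

    def recommend(self, title, k=5):
        """Return up to k titles most similar to `title` (empty list if unknown)."""
        neighbors = self.similarity.get(title, {})
        ranked = sorted(neighbors, key=neighbors.get, reverse=True)
        return ranked[:k]
```

A route would then only parse the request, call recommend on a shared Recommender instance, and serialize the result.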

Git Secrets

Currently, the master branch reads environment variables from a local .env file. However, if you know where to look, deep enough in the commit history, there are API keys exposed. These are the ones I could find after a quick search, but there are likely many more:

We didn't have time to clean this up. One tool that was mentioned by the Labs staff was Git Secrets, which may be useful in solving this issue.

Open Library Data Cleaning + Upload to AWS RDS

The OpenLibrary data is largely unusable for the following reason: the works and editions contain entries that are not informative / are garbage data. This is caused by bots creating duplicate 'works' and 'editions' entries in the original OpenLibrary data that have mostly null values except for small changes, such as a different publisher or description.

However, there is a solution to this issue. The editions data entries contain the text array labeled 'Works' which has a link/key to the work that the edition is supposed to be associated with. You can use this information to eliminate "garbage" entries in the works table.
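
The filtering idea above can be sketched as follows. The record layout is an assumption: each edition is taken to carry a 'works' list of work keys and each work a 'key' field, and the function names are made up for the example.

```python
def referenced_work_keys(editions):
    """Collect every work key that at least one edition points to."""
    keys = set()
    for edition in editions:
        keys.update(edition.get("works", []))
    return keys


def drop_unreferenced_works(works, editions):
    """Keep only works that some edition references; the rest are treated as garbage."""
    keep = referenced_work_keys(editions)
    return [work for work in works if work["key"] in keep]
```

At database scale the same filter would more likely be expressed as a join between the editions and works tables, but the logic is the same.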

See this branch for more details, and some of the work that has been done with the OpenLibrary API and AWS RDS database: https://github.com/Lambda-School-Labs/betterreads-ds/tree/database-management/Database-management

Here is the trello card associated with this issue: https://trello.com/c/RMeNuIes

Deployment on Docker

The team had to switch from deployment on AWS to deployment on Heroku in the last few days of the project due to CORS issues. The current AWS solution launches a Flask app. These CORS issues may be resolved by using an AWS Elastic Beanstalk instance that launches a Docker container. Look into the Docker solution for deploying the recommendation model.

Here are the options:

Heroku
Pros: CORS is functioning properly
Cons: Filesize issue (data must be uploaded to the central repository before recommendation models can function). Also admin privileges to Lambda-School-Labs/betterreads-ds are required to deploy from the heroku_deployment branch

AWS Flask
Pros: You don't have to worry about filesize
Cons: CORS doesn't work

AWS Docker
This seems to be the best solution. You can upload large files if needed, and in the long term code can be refactored to access the central database.

Staging Server

The DS team needs a staging server.

Testing, Deployment, CI/CD

Testing

There is very little (if any) testing coverage for our codebase. Outside of running locally, we have no way of checking code before pushing to master. Obviously this is less than ideal, and is something that needs to be addressed.

Outside of the Python standard library, Ryan Herr suggested experimenting with the following libraries:

Deployment

We do not have a way for the code that is deployed to AWS to be updated once we merge changes into the master branch on Github.

Ideally these two processes, testing and deployment, could be done automatically, together. My understanding is this is the idea behind the continuous integration, continuous deployment practices (CI/CD). More info on this here.

I am not too familiar with how to implement CI/CD, but it should be possible (via GitHub Actions, AWS CodePipeline, and/or other AWS services).

The frontend team has something similar implemented inside their repository, which may be a useful resource in solving this issue.

https certificate

The main server needs a properly configured https certificate.

Further Development Opportunities

Beyond the issues already posted, there are a number of "stretch goal" type ideas we had:

Recommendation performance. Reduce recommendation latency. There are currently several seconds between a call to our API and the results displaying on web and iOS. I would guess there are a number of performance gains to be made by restructuring how our code runs.
FastAPI. Ryan Herr suggests using FastAPI, however, we deployed via Flask primarily due to our familiarity with the framework. FastAPI has several advantages (documentation, speed) which will likely help with the performance aspect mentioned above.
CodeClimate. This was required for the Web team, but optional for DS. CodeClimate provides good information on code test coverage, maintainability, styling, etc.

Feel free to reach out to me on Slack for any clarification!

Improve Recommendations

Right now recommendations are hardcoded into the Heroku endpoint. See the deployment issue () for more details on why we switched to Heroku in the last few days of the project, as well as why we have a hardcoded endpoint.

Setting aside deployment issues, recommendations can improve in two distinct ways:

Allow the model to take in any book, and output a list of recommendations based on:
a) the top 10,000 books, and/or
b) every book (dependent on status of OpenLibrary Data)
These two options (a and b) for improving the model in this way can be worked on concurrently, or sequentially.
Add data to the model to "improve" recommendations. Possible sources of data are listed here: https://trello.com/c/sxs8x9zJ. Check out existing branches for some starter code, and feel free to extend the branches, or create your own to add to the AWS database.

The trello card associated with this issue can be found here: https://trello.com/c/2XJQPkRi

Evaluating Model Performance

One of the biggest challenges we faced throughout our Labs experience was finding a way to best evaluate our model's performance.

If our model had taken a pure collaborative filtering approach, which is based solely on user reviews, we would have been limited to only the 10,000 books in the Goodbooks dataset. Requested books, more often than not, fell outside this dataset, so no recommendations were generated. We considered this approach too narrow.

However, one advantage of recommendations based on user reviews is that they turn our modelling approach into a regression problem, where you can use a train-test split or cross-validation and where standard evaluation metrics (MAE, RMSE) are applicable. It may be possible to find additional user review data beyond the Goodbooks dataset (see here and here).
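
To make the regression framing concrete, here is a minimal sketch of a hold-out evaluation with MAE and RMSE. The (user, book, rating) tuple layout and the global-mean baseline are assumptions chosen to keep the example short; a real evaluation would plug in the collaborative filtering model's predictions instead.

```python
import math
import random


def train_test_split(ratings, test_fraction=0.2, seed=0):
    """Shuffle (user, book, rating) tuples and hold out a fraction for testing."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def evaluate_global_mean(ratings):
    """Score a baseline that predicts the training-set mean rating for every pair."""
    train, test = train_test_split(ratings)
    mean = sum(r for _, _, r in train) / len(train)
    actual = [r for _, _, r in test]
    predicted = [mean] * len(test)
    return mae(actual, predicted), rmse(actual, predicted)
```

Any candidate model should at least beat this baseline on both metrics before it is worth testing with users.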

Setting aside user review data, it wasn't clear to us at first how to generate these evaluation metrics, given we do not have a clear indicator of "success" (or dependent y-variable). As we found out, this is the main challenge of unsupervised learning.

We opted to survey the other members of our team for books they had read, then manually generate recommendations, and then create surveys for feedback on different model iterations. This was very time-intensive and does not scale well. Creating an automated way to evaluate the recommendations and generate feedback would be the first thing I would do if I were starting Labs over again. I experimented briefly with a rough HTML form to do this, but ran out of time to implement it (see below). Another suggested idea for generating feedback was a "Tinder style" approach, where users swipe right or left.