bridgecrew-perf6 / wholefoods-datascraping-project-deployment

This project forked from youssefsultan/wholefoods-datascraping-project-deployment

The data is scraped from the Whole Foods featured on-sale web page. The project features EDA, a comparison of Amazon Prime and non-Prime membership discounts on sale products, and app deployment to show live insights.

License: MIT License

Python 4.57% Jupyter Notebook 95.43%

wholefoods-datascraping-project-deployment's Introduction

Live Whole Foods 'On-Sale' Product Insights and Recommendation System Web Application

All scraped items are 'on-sale/discounted' only. If an item is not on sale for regular customers or Prime members, it will not be in the queried dataset.

The point of this app is to help Whole Foods shoppers make better purchasing decisions at their local store, improving their shopping experience and saving them money, using specifics that are not on the website.

How this app's data is collected:

  • A user inputs their zipcode
  • A script scrapes unstructured product data from each category on the Whole Foods website pertaining to the user's inputted zipcode/store and then structures all of the data in a DataFrame (similar to an Excel spreadsheet)
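
The structuring step above can be sketched as follows. This is a minimal, dependency-free illustration rather than the project's scraper: the raw records, the `parse_price` and `structure` helpers, and the "2 for $4" label format are all assumptions made for the example. Only the column names come from the dataset description.

```python
import re

# Hypothetical raw records, shaped like text a scraper might pull from product tiles.
raw_items = [
    {"company": " Acme Organics ", "product": "Dark Chocolate Bar",
     "regular": "$3.99", "sale": "$3.49", "prime": "$2.99", "category": "snacks"},
    {"company": "Pasta Co", "product": "Penne Rigate",
     "regular": "$2.50", "sale": "2 for $4", "prime": "$1.79", "category": "pantry"},
]

def parse_price(text):
    """Convert a price label such as '$3.99' or '2 for $4' to a per-unit float."""
    multi = re.fullmatch(r"\s*(\d+)\s+for\s+\$?(\d+(?:\.\d+)?)\s*", text)
    if multi:
        count, total = int(multi.group(1)), float(multi.group(2))
        return round(total / count, 2)
    single = re.search(r"\d+(?:\.\d+)?", text)
    # Labels with no number at all (assumed possible) become None.
    return float(single.group()) if single else None

def structure(items):
    """Return one cleaned row (dict) per scraped item."""
    rows = []
    for item in items:
        row = {key: item[key].strip() for key in ("company", "product", "category")}
        for key in ("regular", "sale", "prime"):
            row[key] = parse_price(item[key])
        rows.append(row)
    return rows

rows = structure(raw_items)
```

A list of dicts like `rows` can be passed directly to `pandas.DataFrame(rows)` to get the tabular form the app works with.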

What this app does:

  • It shows many graphs of the queried data of all 'on-sale/discounted' items from the user's store or other users' stores to understand how much of each product/category is on sale

  • It generates a shopping cart of 'on-sale/discounted' items based on user keyword input ('chocolate', 'pasta', ...) and a selected optimization parameter ('random', 'price' or 'discount')

  • It recommends products to the user based on Instacart customer data, using a collaborative filtering approach and the user's generated shopping cart
  • For more information on the intuition behind the recommendation system click here or view the blog post
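
One way the keyword-driven shopping cart could work is sketched below. The `build_cart` function and the sample rows are hypothetical, and the reading of the optimization parameter ('price' picks the lowest sale price, 'discount' picks the largest sale discount, 'random' picks any match) is an assumption, not a description of the app's actual logic.

```python
import random

# Hypothetical on-sale rows using the dataset's column names.
items = [
    {"product": "Dark Chocolate Bar", "sale": 3.49, "sale_discount": 12.5},
    {"product": "Milk Chocolate Chips", "sale": 2.99, "sale_discount": 25.0},
    {"product": "Penne Pasta", "sale": 1.99, "sale_discount": 20.0},
    {"product": "Fusilli Pasta", "sale": 2.49, "sale_discount": 10.0},
    {"product": "Sparkling Water", "sale": 0.99, "sale_discount": 30.0},
]

def build_cart(items, keywords, optimize="price", rng=None):
    """Pick one matching item per keyword according to the optimization mode."""
    rng = rng or random.Random()
    cart = []
    for keyword in keywords:
        matches = [i for i in items if keyword.lower() in i["product"].lower()]
        if not matches:
            continue  # no on-sale item matches this keyword
        if optimize == "price":
            choice = min(matches, key=lambda i: i["sale"])
        elif optimize == "discount":
            choice = max(matches, key=lambda i: i["sale_discount"])
        elif optimize == "random":
            choice = rng.choice(matches)
        else:
            raise ValueError(f"unknown optimization parameter: {optimize}")
        cart.append(choice)
    return cart

cart = build_cart(items, ["chocolate", "pasta"], optimize="price")
```

Keywords with no on-sale match are skipped rather than raising, so a cart can be shorter than the keyword list.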

Extra app features:

  • Search the queried dataset based on keywords for anything specific
  • Download the queried dataset as a CSV
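
The two extra features can be illustrated with the standard library alone. This is only a sketch under assumptions: `search` and `to_csv` are hypothetical helper names, and matching on the `product` and `company` columns is a guess at what "keywords" are compared against.

```python
import csv
import io

def search(rows, keyword):
    """Return rows whose product or company contains the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [r for r in rows
            if needle in r["product"].lower() or needle in r["company"].lower()]

def to_csv(rows):
    """Serialize rows to CSV text with a header taken from the first row's keys."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Hypothetical rows for illustration.
rows = [
    {"company": "Acme Organics", "product": "Dark Chocolate Bar", "sale": 3.49},
    {"company": "Pasta Co", "product": "Penne Rigate", "sale": 1.99},
]
hits = search(rows, "chocolate")
csv_text = to_csv(hits)
```

The resulting text is what a download button would hand to the browser as the file contents.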

Dataset information:

  • All user-queried data is wrangled and cleaned to handle edge cases such as website element changes, product title mismatches and other errors that can arise when scraping product information. It is then structured with the following columns (any column marked with * is a created feature):
feature                    dtype description
---------------------------------------------------------------------------
company                   object [product company name]
product                   object [product title]
regular                  float64 [regular product price]
sale                     float64 [on-sale product price] 
prime                    float64 [on-sale product price for prime members]
category                  object [Whole Foods category]
sale_discount            float64 [sale discount percentage] *
prime_discount           float64 [prime discount percentage] *
prime_sale_difference    float64 [prime discount - sale discount] *
discount_bins             object [discount bins, e.g. 0% Off to 10% Off] *
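
The created (*) columns can plausibly be derived from the three price columns, as in the sketch below. The exact formulas are not given in this README, so treat these as assumptions: the percentage is taken relative to the regular price, bins are 10 points wide, and the bin is computed from the Prime discount. The helper names are hypothetical.

```python
def percent_off(regular, discounted):
    """Percentage discount of `discounted` relative to `regular`."""
    return round((regular - discounted) / regular * 100, 2)

def discount_bin(percent, width=10):
    """Label the bin containing `percent`, e.g. 12.5 -> '10% Off to 20% Off'."""
    lower = int(percent // width) * width
    return f"{lower}% Off to {lower + width}% Off"

def add_features(row):
    """Return a copy of `row` with the four created (*) columns filled in."""
    out = dict(row)
    out["sale_discount"] = percent_off(row["regular"], row["sale"])
    out["prime_discount"] = percent_off(row["regular"], row["prime"])
    out["prime_sale_difference"] = round(out["prime_discount"] - out["sale_discount"], 2)
    # Assumption: bins are based on the Prime discount; the source does not say which.
    out["discount_bins"] = discount_bin(out["prime_discount"])
    return out

row = add_features({"regular": 4.00, "sale": 3.50, "prime": 3.00})
```

For this example row the sale discount is 12.5%, the Prime discount is 25%, and their difference is 12.5 percentage points.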

Recommendation system using collaborative filtering:

  • Recommendations are driven by parsing products into categories

    • rule-based data parsing/cleaning/lemmatization
    • word embedding (parsing) using a spaCy pre-trained model
    • designing the taxonomy (categories) from scratch to have a unique signature (1400 avg items per data set --> 99 categories)
    • all of which is automated and preprocessed in a custom transformer built on scikit-learn's BaseEstimator & TransformerMixin
  • Instacart's public datasets of 3M customer orders and related tables are joined and reshaped to match the taxonomy designed above (99 categories)

  • Apriori algorithm is applied to the designed dataset

  • Recommendations on the app are provided to the user based on association rules of Instacart customer data as well as the input of the user

  • Recommendations are based on a random selection of a category within the top 10 confidence values. Confidence measures the percentage of times an item in category B is purchased, given that an item in category A was purchased. Selecting at random reduces bias, because a category is not chosen solely for having the highest confidence.

image
*If a user has chocolate in their shopping cart, each item_B category within the top 10 confidence values has an equal chance of supplying the recommended product.
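
The confidence measure and the random top-10 selection described above can be sketched as follows. This is not the project's pipeline: it computes pairwise confidence directly from toy orders instead of running the full Apriori algorithm (which also prunes itemsets by minimum support), and the orders, function names and category labels are hypothetical.

```python
import random
from collections import Counter
from itertools import permutations

# Hypothetical orders, each reduced to the set of categories it contains.
orders = [
    {"chocolate", "milk", "cookies"},
    {"chocolate", "milk"},
    {"chocolate", "coffee"},
    {"milk", "cookies"},
]

def confidence_table(orders):
    """Map (A, B) -> confidence(A -> B) = count(A and B) / count(A)."""
    single = Counter()
    pair = Counter()
    for order in orders:
        single.update(order)
        pair.update(permutations(sorted(order), 2))
    return {(a, b): n / single[a] for (a, b), n in pair.items()}

def recommend_category(conf, item_a, top_n=10, rng=None):
    """Pick uniformly at random among the top_n categories by confidence for item_a."""
    rng = rng or random.Random()
    ranked = sorted(
        ((b, c) for (a, b), c in conf.items() if a == item_a),
        key=lambda entry: entry[1],
        reverse=True,
    )
    candidates = [b for b, _ in ranked[:top_n]]
    return rng.choice(candidates) if candidates else None

conf = confidence_table(orders)
pick = recommend_category(conf, "chocolate", rng=random.Random(0))
```

Here chocolate appears in three orders and alongside milk in two of them, so confidence(chocolate -> milk) is 2/3. Because the draw is uniform over the top candidates, milk is no more likely to be picked than cookies or coffee.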

This project is deployed via Streamlit, which uses a Debian-based Linux image on its cloud. A big thanks to them for making their platform easy to use for data scientists like myself.

wholefoods-datascraping-project-deployment's People

Contributors

ianyu-gbi, youssefsultan
