
kescardoso / datasetbucket


A dataset bucket with a machine learning bias auditor. Built with Python-Flask, MaterializeCSS and the Kaggle API.

Home Page: https://datasetbucket.herokuapp.com

License: MIT License

Python 53.37% CSS 1.92% JavaScript 0.87% HTML 43.84%
python python3 flask flask-application mongodb mongodb-database machine-learning unbiased

datasetbucket's Introduction

DATASET BUCKET & BIAS AUDITOR


Check out our project on Heroku!

About

A dataset bucket and machine learning bias auditor 📈: a fully responsive web app built in Python with Flask, the MaterializeCSS UI grid system, and the Kaggle API.

Based on a CRUD (Create, Read, Update and Delete) database system that generates, stores and displays dataset structures.

You can find and read reports in a wiki-style list of information about datasets covering population and demographic subjects.

👩 👳🏾‍♂️ 👱🏻‍♀️ 🧔🏾 👩🏼‍🦰 👨🏿‍🦳

Motivation

The whole world is data-driven.

However, data can often be misleading, inaccurate, or unrepresentative. When biased data is used in analytics or ML models, it not only produces inaccurate results but can also have disastrous implications for minority groups and classes.

To confront this dangerous problem, we built a web app that analyzes a dataset for bias and suggests changes you can make to improve the quality of your dataset. 📊

Technologies used

  • MongoDB - a document database (stores data in JSON-like documents) with a horizontal, scale-out architecture that can support huge volumes of both data and traffic.
  • Materialize - a modern front-end framework (responsive and mobile-first, similar to Bootstrap) that helps developers build a stylish and responsive application.
  • Python - an interpreted, high-level, general-purpose programming language, well suited to database-backed projects.
  • Pip - a package manager for Python, that allows developers to install and manage additional libraries and dependencies that are not distributed as part of the standard library.
  • Flask - a Python framework that depends on the Jinja template engine and the Werkzeug WSGI toolkit.
  • Heroku - used for the app deployment, Heroku is a platform as a service (PaaS) that enables developers to build, run, and operate applications entirely in the cloud.
  • Git - a version control system for source code management; it allows tracking file changes and coordinating work on those files among multiple people and machines.
  • GitHub - a code hosting platform for version control and collaboration. It lets developers work remotely and together on projects from anywhere.

We used many Python libraries to build this project. Learn more about them in LIBRARIES.md.

App Walkthrough


1. OPEN THE APP

Head over to our app deployed on Heroku.

You will see a WELCOME screen with basic instructions for getting started and a summary of what you can expect from the app.

2. DATATAGS

In the data tags tab, you can find various tags associated with the reports uploaded on the app.


By clicking the view button on any of the available tags, you can see the dataset, analytical reports, and other information about the TAG.


3. DATASETS

The datasets tab lists all the datasets for which an analytical report was generated. It is presented as a collapsible accordion-style list, so you can click on any dataset you wish to explore and all the information related to that dataset will be displayed.

You can view the author, the development status, and the associated tags, and download the analytical report.

4. REGISTER / LOG IN

The register tab takes you to the registration page, where you can create an account on this app.

If you are already a user of our app, head to the log in page.

💟 As a registered user of our app, you can upload new datasets and generate reports for them.

5. ANALYSE

After logging into the app, go to the analyze tab. You will see a field for entering the Kaggle URL. After adding a valid Kaggle dataset URL, click on GET ANALYSIS REPORT.
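As an illustration of what "a valid Kaggle dataset URL" could mean, here is a minimal sketch that extracts the owner/dataset reference from a URL. It is not the project's actual validation code. The function name kaggle_dataset_ref is hypothetical, and the two URL shapes it accepts are an assumption.

```python
from urllib.parse import urlparse


def kaggle_dataset_ref(url: str) -> str:
    """Return the "owner/dataset" reference contained in a Kaggle dataset URL.

    Assumes URLs shaped like https://www.kaggle.com/datasets/<owner>/<dataset>
    or the older https://www.kaggle.com/<owner>/<dataset>.
    """
    parsed = urlparse(url.strip())
    host = parsed.netloc.lower()
    if host not in ("kaggle.com", "www.kaggle.com"):
        raise ValueError(f"not a Kaggle URL: {url!r}")

    # Keep only the non-empty path segments.
    parts = [p for p in parsed.path.split("/") if p]
    if parts and parts[0] == "datasets":
        parts = parts[1:]  # drop the "datasets" prefix used by newer URLs
    if len(parts) < 2:
        raise ValueError(f"no owner/dataset found in: {url!r}")
    return f"{parts[0]}/{parts[1]}"
```

The returned reference has the form the Kaggle command line tool expects, for example in kaggle datasets download -d owner/dataset.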

Currently we only accept .json, .csv, .png, .jpeg and .jpg files for analysis.
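The extension filter described above could be sketched as follows. This is an assumption about how such a check might look, not the app's own code. The names ACCEPTED and file_kind, and the grouping into "json", "csv" and "image", are hypothetical.

```python
from pathlib import Path

# Extensions listed in the walkthrough; the grouping into kinds is an assumption.
ACCEPTED = {
    ".json": "json",
    ".csv": "csv",
    ".png": "image",
    ".jpeg": "image",
    ".jpg": "image",
}


def file_kind(filename: str):
    """Return "json", "csv" or "image" for accepted files, otherwise None."""
    # Only the final suffix counts, compared case-insensitively.
    return ACCEPTED.get(Path(filename).suffix.lower())
```

A caller could skip every file in a downloaded dataset for which file_kind returns None.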

You will see a progress bar until the report is generated. Once it stops, the report is downloaded automatically as report.pdf.

IMAGES: .png, .jpeg, .jpg

[Screenshot: sample report for an image dataset]

CSV FILES: .csv

[Screenshot: sample report for a CSV dataset]

The report contains the details of your dataset and the improvements that are possible.

📩 Project Installation and Local Deployment

Prerequisites

Install python3 and pip3 on your machine

Create an account on Kaggle

Installation

  1. Download or clone this project into your local workspace

  2. Create a virtual environment using the command: python3 -m venv venv

    After running this command, the folders will be automatically set up on your workspace.

  3. Activate the virtual environment using the command:

    source venv/bin/activate : macOS/Linux

    .\venv\Scripts\activate : Windows

  4. Install Flask using the command:

    pip3 install Flask

    and all the required dependencies with:

    pip3 install -r requirements.txt

Local Deployment

  1. Create an env.py file to keep your sensitive data secret.

  2. Open env.py and enter the following:

    import os
    
    os.environ.setdefault("SECRET_KEY", "secret_key_here")
    os.environ.setdefault("MONGO_URI", "value_from mongoDB_here")
    os.environ.setdefault("MONGO_DBNAME", "value_from mongoDB_here")
  3. Wire up Kaggle

    The Kaggle API allows developers to download datasets directly from the terminal.

    1. Make sure you have a Kaggle account.

    2. Follow these steps to download the kaggle.json file, which holds the credentials the API needs:

      • Go to the Account tab in your Kaggle profile and scroll to the API section.

      • Click Create new API Token. This will download the kaggle.json file; move it to the ~/.kaggle/ folder.

      For more complete and detailed instructions on how to use the Kaggle API, visit Kaggle's documentation.

      If you are on macOS/linux, and you need help getting to your ~/.kaggle/ folder, follow these instructions: Kaggle installation on macOS/Linux

  4. Wire up MongoDB and its functionalities with Flask, by installing flask-pymongo and dnspython. Use:

    pip3 install flask-pymongo

    pip3 install dnspython

  5. Run app.py in debug mode as a Flask application, or use the following command:

    python3 -m flask run

    to see the project at your local http address.
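Before running the app, it can help to confirm that the settings from steps 2 and 3 are in place. The following is a minimal sketch of such a check and is not part of the project. The function name missing_settings is hypothetical, and the default ~/.kaggle location follows the instructions above.

```python
import os
from pathlib import Path

# Variable names taken from the env.py example above.
REQUIRED_VARS = ("SECRET_KEY", "MONGO_URI", "MONGO_DBNAME")


def missing_settings(environ=os.environ, kaggle_dir=None):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = [
        f"environment variable {name} is not set"
        for name in REQUIRED_VARS
        if not environ.get(name)
    ]
    # Fall back to the default Kaggle credentials folder in the home directory.
    kaggle_dir = Path(kaggle_dir) if kaggle_dir else Path.home() / ".kaggle"
    if not (kaggle_dir / "kaggle.json").is_file():
        problems.append(f"kaggle.json not found in {kaggle_dir}")
    return problems
```

Calling missing_settings() after env.py has been imported, and printing each returned entry, would show what is still missing.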
    

Heroku Deployment

  1. Fulfil all Heroku requirements by checking and freezing your dependencies and by creating a Procfile:

    You can run these two commands to do so:

    pip3 freeze > requirements.txt

    echo web: python app.py > Procfile

    You may need to install gunicorn. For a good tutorial, check this YouTube tutorial.

    Commit and push your changes.

  2. Create a new app from your Heroku dashboard.

  3. Add your environment variables to Heroku by going to Settings, then Config Vars, and entering the sensitive information from your MongoDB and your env.py file:

    import os
    
    os.environ.setdefault("SECRET_KEY", "your_secret_key_here")
    os.environ.setdefault("MONGO_URI", "value_from mongoDB_here")
    os.environ.setdefault("MONGO_DBNAME", "value_from mongoDB_here")
    
  4. From your Heroku dashboard, wire up your Heroku app to your git repository, by going to:

    • Deploy > Deployment Method > click: Connect to GitHub

    • Search your repo from the dropdown, and connect

    • Choose a branch to deploy your changes

    • Deploy your branch and view the app
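Heroku assigns the port for a web process through the PORT environment variable, so a Procfile that runs python app.py only works if app.py binds to that port. Below is a minimal sketch of reading those settings, assuming app.py starts the server with Flask's app.run(). The helper name server_settings and the IP and DEBUG variable names are hypothetical.

```python
import os


def server_settings(environ=os.environ):
    """Build keyword arguments for Flask's app.run() from the environment."""
    return {
        # Listen on all interfaces unless an IP is given (hypothetical variable).
        "host": environ.get("IP", "0.0.0.0"),
        # Heroku provides PORT; 5000 is a local fallback.
        "port": int(environ.get("PORT", "5000")),
        # Debug stays off unless DEBUG is explicitly "true" (hypothetical variable).
        "debug": environ.get("DEBUG", "").lower() == "true",
    }
```

In app.py this could be used as app.run(**server_settings()), which keeps debug mode off in the deployed version unless it is switched on explicitly.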

References

🔸 If you want to test run the project on your local computer, follow the guidelines in the installation guide.

🔸 If you wish to contribute to the existing project, follow the guidelines in CONTRIBUTION.md.

Challenges

  • Encountering bugs when deploying on Heroku
  • Accounting for different formatting of different datasets and types of files (JSON and CSV)
  • Ensuring that the app works for both macOS and Windows
  • Learning about lots of new technologies and languages for our team, including Python, Flask, HTML, CSS, JavaScript, Matplotlib, file parsing, and data analysis
  • Working in different time zones

Lessons / Takeaways

  • Plan more in advance instead of diving into code headfirst
  • Deploy on Heroku earlier
  • Spend less time getting the program to run on different file types, and more time on deeper analysis of one specific file type

Accomplishments / Contributions

  • Overall, we are proud to have completed a tough project and developed a functional web app that effectively parses files, handles multiple types of datasets, and generates PDFs!
  • What are we proud to accomplish / what did we work on?
    • "Gaining a better understanding of Python, and having a first real dive into data analytics; it changed how I see machine learning forever!" -Kes
    • "Learning to use python better and parsing with pandas database, integrating back + front end, and deploying on heroku!" -Elizabeth
    • "Working with images and extracting useful info out of it to generate reports!" -Sakshi
    • "Parsing files, generating histograms, and connecting the analysis to the PDFs was super exciting for me!" -Will

Next Steps

  • Implement more advanced metrics and recommendations for dataset analysis
  • Allow users to upload their own datasets in addition to datasets on kaggle
  • Work on accepting more file types, such as .txt

Contributors

⭐ Elizabeth Crouther

⭐ Kes Cardoso

⭐ Sakshi Gupta

⭐ William Yang

Navigate

➡ CONTRIBUTION.md

➡ LIBRARIES.md

License

MIT LICENSE 2021

Thank You! ✨


datasetbucket's Issues

Fix Category Bug

Categories:
after installing the location select functionality, bugs in category selection appeared

  • previous multi selection not returning from db on edit view
  • string and list not rendering properly (again!)

Fix PDF title

Sometimes, the PDF title isn't correct. We need to check that it is getting passed the correct string. It is possible this might be fixed when we fix the issue of the old datasets not being downloaded.

generate a pdf of the report

The results from the analysis will be available to download as a pdf, in addition to being displayed in the html.

Add dictionary of country demographics

Add population demographics of each country to be used in the analysis of the data.
We want to compare the demographics in the data to the demographics in a given population, to check if the data is representative of the whole.
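A minimal sketch of the comparison described in this issue, assuming the reference demographics are available as a mapping from group to population share. The function name representation_gap is hypothetical, and any figures passed to it are the caller's own; none are supplied here.

```python
from collections import Counter


def representation_gap(samples, reference_shares):
    """Return observed share minus reference share for every group.

    A positive value means the group is over-represented in the data,
    a negative value means it is under-represented.
    """
    counts = Counter(samples)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no samples to analyse")
    # Cover groups that appear in either the data or the reference.
    groups = set(counts) | set(reference_shares)
    return {
        group: counts.get(group, 0) / total - reference_shares.get(group, 0.0)
        for group in groups
    }
```

Groups that appear in the reference but not in the data get a negative gap equal to their full population share, which makes missing groups easy to spot.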

Link field with jinja tags (for pdf reports retrieval)

Feature to be implemented as an alternative:

Figure out a way to add a link input to Add New Dataset, to include pdf reports from a cloud drop box.
Especially in case the JSON file handling doesn't integrate well into the project

Improve PDF analysis

Currently, the analysis and recommendations generated by the program are lacking. For CSV files, it only calculates basic metrics (like mean and variance) and generates a few histograms. For JSON files, the program doesn't give much analysis either.
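For reference, the basic CSV metrics mentioned here (mean and variance) can be computed with the standard library alone. This is an illustrative sketch, not the project's analysis code. The function name numeric_summary is hypothetical, and using population variance is an assumption.

```python
import csv
import io
import statistics


def numeric_summary(csv_text):
    """Return {column: (mean, variance)} for columns whose values are all numeric."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    summary = {}
    if not rows:
        return summary
    for column in rows[0]:
        try:
            values = [float(row[column]) for row in rows]
        except (TypeError, ValueError):
            continue  # skip non-numeric or incomplete columns
        # Population variance; swap in statistics.variance for the sample version.
        summary[column] = (statistics.fmean(values), statistics.pvariance(values))
    return summary
```

Columns that contain text or missing values are skipped entirely, which is one place a richer analysis could add counts or warnings.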

We need people to improve on the PDF analysis/recommendations!

Add 2 more acceptable formats to JSON files

We want to support three different JSON formats:

  • {'root': {dict}}
  • {'content': {}, 'annotation':[{'labels':{}},{'points':{}}], 'extras':{}} (currently the only accepted format)
  • {'keyword':{}, 'keyword':{}, 'keyword':{}}
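One way to tell the three shapes apart after parsing is sketched below. The function name detect_json_format and the labels "root", "annotated" and "keyed" are hypothetical, and the rules are an assumption based only on the shapes listed in this issue.

```python
def detect_json_format(data):
    """Classify a parsed JSON object as "root", "annotated", "keyed" or None."""
    if not isinstance(data, dict) or not data:
        return None
    # A single "root" key wrapping a dict.
    if set(data) == {"root"} and isinstance(data["root"], dict):
        return "root"
    # The currently accepted format: content, annotation and extras keys.
    if {"content", "annotation", "extras"} <= set(data):
        return "annotated"
    # Any other object whose values are all dicts.
    if all(isinstance(value, dict) for value in data.values()):
        return "keyed"
    return None
```

The parser could then dispatch on the returned label and reject files for which it is None.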

Better UX/UI/Design rules

Add:

  • Footer
  • About/instructions page
  • Credits

Improve user-friendly and aesthetic features:

  • color key
  • buttons
  • typography

Setup File Uploading with Cloud Storage

On add_dataset, the file is only successfully uploaded in local deployment.
After deploying our app to Heroku, a fix is needed: either store uploaded files in MongoDB as base64 (b64) strings or use cloud storage such as S3 to handle user uploads in the deployed version (ultimately necessary).
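The base64 option could look roughly like this. It is a sketch under assumptions, not the project's code: the field names filename, size and data_b64 are hypothetical, and the resulting dict would still have to be inserted into a collection by the app.

```python
import base64


def file_to_document(filename, payload: bytes):
    """Wrap raw file bytes in a JSON-friendly dict with base64 text."""
    return {
        "filename": filename,
        "size": len(payload),  # size of the original bytes, not the encoded text
        "data_b64": base64.b64encode(payload).decode("ascii"),
    }


def document_to_bytes(document):
    """Recover the original bytes from a document built by file_to_document."""
    return base64.b64decode(document["data_b64"])
```

Base64 text is about a third larger than the raw bytes, and a single MongoDB document is limited to 16 MB, so this route only suits small files; larger uploads would need cloud storage such as S3.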

Fix image analysis

Image analysis files need their paths updated. Images in a dataset are not currently being analyzed or reported.

Delete .vscode file

@kescardoso I want to delete the .vscode folder and .DS_Store in main, but I don't want to break the deployment. Can we delete it safely?

Analysis Page

Created a page with a text input form, to analyze datasets from a Kaggle command or url.

The page is not yet programmatically functional; it needs further development to link the HTML form to the bias auditor app.

Link to the deployed version: https://datasetbucket.herokuapp.com/analyse_data

Categories Query

Create the categories Query and management sessions:

  • List categories with links
  • Create Add New and Edit category forms

Improve PDF analytics

  • Break down number of samples for each value
  • Plot histograms/diagrams representing distributions for each feature
  • Other analytics improvements
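The first item, breaking down the number of samples for each value, could be sketched as follows for CSV input. The function name value_counts is hypothetical, and all values are treated as plain strings.

```python
import csv
import io
from collections import Counter


def value_counts(csv_text):
    """Return {column: Counter of values} for every column in the CSV text."""
    counts = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        for column, value in row.items():
            if column is None:
                continue  # ignore extra cells beyond the header
            counts.setdefault(column, Counter())[value] += 1
    return counts
```

The resulting counters could feed the histograms mentioned in the second item.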

Add about page

Add about page to explain bias to the user and explain how we will analyze their data.

More info/functionalities for user profile

Develop user profile with more functionalities (not a priority, but cool if we have time)

  • hyperlink author on datasets.html
  • NavBar with active user session tag
  • User info (name, position, photo, links, etc.)
  • User tasks, activities, contributions, stats....

Some things left to figure out and learn 🤔
