sfu-db / dataprep

Open-source low-code data preparation library in Python. Collect, clean, and visualize your data in Python with a few lines of code.

Home Page: http://dataprep.ai

License: MIT License

Python 60.38% HTML 9.30% JavaScript 26.79% Vue 1.00% CSS 2.50% Just 0.03%
dataprep data-science datapreparation dataconnector eda exploratory-data-analysis data-exploration connector cleaning datacleaning

dataprep's People

Contributors

andywangsfu, atol, bowen0729, dependabot[bot], devinllu, dovahcrow, dylanzxc, elsie4ever, eutialia, fatbuddy, jinglinpeng, juandavidospina, jwa345, kla55, lakshay-sethi, noirtree, pallavibharadwaj, peiwangdb, peshotan, qidanrui, ryanwdale, sahmad11, samplertechreport, shub970, wangxiaoying, waterpine, yixuy, yuzhenmao, yxie66, zshandy

dataprep's Issues

eda.plot: empty bins in histogram

Currently the histogram keeps bins even when they are empty. For example, run the following code:
import pandas as pd
from dataprep.eda import *
df = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
plot(df)
The result is:
Screen Shot 2020-03-12 at 2 19 01 PM

Pclass and survived have only 3 and 2 distinct values, respectively. Since we show 10 bins by default, many of the bins for Pclass and survived are empty. Could we find a better way to visualize the histogram in this case? It may help to look at how other plotting libraries handle this issue.
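
One possible direction, sketched here in plain Python rather than with dataprep's own code: cap the number of bins at the number of distinct values in the column, so a column with 3 distinct values never gets 10 bins. The function name choose_num_bins and the sample values are made up for illustration; a real fix might instead draw a bar chart for such low-cardinality columns.

```python
def choose_num_bins(values, default_bins=10):
    """Return a bin count that never exceeds the number of distinct values.

    Hypothetical helper, not part of dataprep.
    """
    distinct = {v for v in values if v is not None}
    if not distinct:
        return 0
    return min(default_bins, len(distinct))


# Made-up samples that mimic the Titanic columns mentioned above.
pclass = [1, 3, 3, 2, 1, 3, 2, 3]      # 3 distinct values
survived = [0, 1, 1, 0, 0, 1, 0, 1]    # 2 distinct values
fare = [7.25, 71.28, 7.92, 53.1, 8.05, 8.46,
        51.86, 21.07, 11.13, 30.07, 16.7, 26.55]  # 12 distinct values

print(choose_num_bins(pclass))
print(choose_num_bins(survived))
print(choose_num_bins(fare))
```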

plot(df, x, y): ngroups does not work

I would like to increase the number of groups in the box plot of plot(df, "suicides", "country"), but found that setting ngroups = 20 does not work (see below).

Screen Shot 2019-12-20 at 12 58 03 PM

Runtime warning while using plot_correlation() on Kaggle notebooks

I was using dataprep.ai in my notebooks on Kaggle and found that the plot_correlation() function throws a runtime warning while plotting. I think the function should handle this internally (for example with try/except) so that no runtime warning is shown, because it looks odd.
Thank you!
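
A note on the mechanism: by default Python warnings are not exceptions, so try/except does not intercept them; the standard library tool for this is warnings.catch_warnings. The sketch below is not dataprep code. noisy_correlation is a stand-in that only simulates a numeric routine emitting a RuntimeWarning, and quiet_call is a hypothetical wrapper name.

```python
import warnings


def noisy_correlation(xs, ys):
    """Stand-in for a numeric routine that emits a RuntimeWarning."""
    warnings.warn("invalid value encountered in divide", RuntimeWarning)
    return float("nan")


def quiet_call(func, *args, **kwargs):
    """Call func while ignoring RuntimeWarning only; other warnings still surface."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=RuntimeWarning)
        return func(*args, **kwargs)


# Record every warning that escapes the wrapper to check that none does.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = quiet_call(noisy_correlation, [1, 1, 1], [2, 3, 4])

print(len(caught))
```

Silencing the warning hides the symptom only; whether the underlying computation should be changed is a separate question.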

data_connector: add Author table to DBLP?

I suggest adding an Author table to the DBLP connector. This can solve the name ambiguity issue.

Suppose I would like to find all the papers published by Guoliang Li (Tsinghua). I can first query the Author table to find all the people whose name is Guoliang Li. https://dblp.org/search/author/api?q=Guoliang$_Li$

There are six people and the second one is the person I am looking for:

Screen Shot 2019-12-22 at 11 49 57 AM

Once #50 is supported, I can get all Guoliang Li (Tsinghua)'s publications through https://dblp.org/search/publ/api?q=author%3AGuoliang_Li_0001%3A
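
The two-step lookup above can be sketched as URL construction with the standard library. This is only an illustration, not the connector's implementation: the helper names are made up, and the author search here is a plain keyword query rather than the exact-match `$` form used in the link above. No request is sent.

```python
from urllib.parse import urlencode

DBLP_BASE = "https://dblp.org/search"


def author_search_url(name):
    """Step 1: URL that searches the author endpoint for a name (hypothetical helper)."""
    return f"{DBLP_BASE}/author/api?{urlencode({'q': name})}"


def publication_search_url(author_key):
    """Step 2: URL that lists publications for one disambiguated author key."""
    return f"{DBLP_BASE}/publ/api?{urlencode({'q': f'author:{author_key}:'})}"


print(author_search_url("Guoliang Li"))
print(publication_search_url("Guoliang_Li_0001"))
```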

eda.plot_missing: error when passing column

I'm getting this error when running the following code:

import pandas as pd
from dataprep.eda import plot_missing
df = pd.read_csv("https://s3-us-west-2.amazonaws.com/dataprep.dsl/datasets/suicide-rate.csv")
plot_missing(df, "suicides")

Screen Shot 2020-04-02 at 12 26 49 PM
@Waterpine @jinglinpeng is anyone else getting this error? I'm using dask version 2.9.1

plot(df): x-ticks need to be optimized for numerical attributes

I was running plot(df) on the example data and found that the x-ticks of many numerical attributes are not carefully set. Below I compare the x-ticks of Tableau (left) and dataprep (right). Tableau clearly looks better. This is not a high priority, but it is worth considering in a future release.

Screen Shot 2019-12-18 at 10 05 45 PM
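
For reference, a common way to get rounder tick labels is the "nice numbers" approach (steps of 1, 2, or 5 times a power of ten). The sketch below is a generic version of that idea in plain Python; it is not what Tableau or dataprep does, the function names are made up, and it assumes hi > lo.

```python
import math


def nice_number(x, round_result):
    """Snap a positive x to 1, 2, 5, or 10 times a power of ten."""
    exponent = math.floor(math.log10(x))
    fraction = x / 10 ** exponent
    if round_result:
        if fraction < 1.5:
            nice = 1
        elif fraction < 3:
            nice = 2
        elif fraction < 7:
            nice = 5
        else:
            nice = 10
    else:
        if fraction <= 1:
            nice = 1
        elif fraction <= 2:
            nice = 2
        elif fraction <= 5:
            nice = 5
        else:
            nice = 10
    return nice * 10 ** exponent


def nice_ticks(lo, hi, max_ticks=5):
    """Return evenly spaced, round tick positions covering [lo, hi]."""
    span = nice_number(hi - lo, round_result=False)
    step = nice_number(span / (max_ticks - 1), round_result=True)
    start = math.floor(lo / step) * step
    stop = math.ceil(hi / step) * step
    count = int(round((stop - start) / step)) + 1
    return [start + i * step for i in range(count)]


print(nice_ticks(0, 97))
print(nice_ticks(3, 1987))
```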

plot: error in x ticks for some datasets

For some datasets, the x ticks produced by plot may be wrong. Please try the following code:
import pandas as pd
from dataprep.eda import *
df = pd.read_csv('https://www.openml.org/data/get_csv/52236/phpAyyBys', na_values = ['?', 'nan'])
plot(df)
The result is as follows:
Screen Shot 2020-01-30 at 4 14 34 PM

plot_missing(df, x): several bugs

I found a few bugs when running plot_missing(df, 'HDI_for_year').

Bug 1. Would be better to let the user know not all countries are displayed. Please work with @brandonlockhart to check how to add ngroups here and also take a look at #42 .

Screen Shot 2019-12-20 at 9 46 51 PM

Bug 2. Orange bars are not displayed on the sex and age tabs.

Screen Shot 2019-12-20 at 9 48 07 PM

Bug 3. No bars are displayed on the country_year and gdp_for_year tabs.

Screen Shot 2019-12-20 at 9 51 04 PM

EDA plot function

Goal: Plot function includes plot(df), plot(df, x="x") and plot(df, x="x", y="y")

Step 1: create intermediates
Step 2: plot graphs based on intermediates
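
The two steps can be sketched as follows. This is a toy illustration of the split between computing intermediates and rendering them, not the planned implementation: the Intermediate class, the function names, and the text "rendering" are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Intermediate:
    """Plot-ready summary of a column, independent of any plotting backend."""
    kind: str
    data: dict


def compute_intermediate(column):
    """Step 1: reduce raw values to the counts a bar chart needs."""
    return Intermediate(kind="bar", data=dict(Counter(column)))


def render(intermediate):
    """Step 2: draw from the intermediate only (here as text, largest bar first)."""
    lines = [f"[{intermediate.kind}]"]
    ordered = sorted(intermediate.data.items(), key=lambda kv: (-kv[1], str(kv[0])))
    for label, count in ordered:
        lines.append(f"{label}: {'#' * count}")
    return "\n".join(lines)


inter = compute_intermediate(["a", "b", "a", "c", "a", "b"])
chart = render(inter)
print(chart)
```

Keeping the two steps separate means the intermediates can be tested without drawing anything.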

Combine two functions

I find that plot(df, x, y) and plot_correlation(df, x, y) have similar outputs. Why not combine them? Then we could just use plot(df, x, y) to analyze the data.
Screen Shot 2020-04-02 at 9 32 53 PM

data_connector: API issuing strategy expression

Design a mechanism to support fluent API queries, i.e. to fetch results efficiently given the network conditions, the websites' constraints, etc. (A follow-up from previous meetings.)

plot_missing: documentation for num_bins and num_cols

Please check below a screenshot of the documentation for plot_missing.

Screen Shot 2019-12-20 at 1 42 37 PM

There are a few issues:

  1. It should be "num_bins" rather than "bins_num". The description ("The number of rows in the figure") is also a bit confusing.
  2. In plot(df), this parameter is called bins. Please work with @brandonlockhart and make it consistent. I would suggest calling it bins, which is consistent with pandas.DataFrame.hist.
  3. Maybe it's better to use ncols instead of num_cols, because plot(df) has a parameter called ngroups, which is short for num_groups.

support of time series

This issue is about the rough idea to support time series in dataprep.eda.

Essentially, datetime can be regarded as a numeric type: it can be converted to a timestamp (float) via datetime.timestamp() or pd.to_numeric(). Hence, we could do the following as initial support for time series.

  1. Identify the columns with datetime64 type in the dataframe.
  2. plot(df) & plot(df, x): handle a time series column like a numeric column, which can be binned. When showing the ticks of a time series column, show the datetime string via a function like datetime.strftime(). An example output is https://pandas.pydata.org/pandas-docs/version/0.13/visualization.html
  3. plot(df, x, y): When x is a datetime column and y is a numeric column, replace the scatter plot with a line chart, which shows how y changes with x. For all other cases, apply the processing from step 2.
  4. plot_correlation: we could ignore the datetime column as pandas does, or transform the datetime column to a numeric column via pd.to_numeric() and then apply the original processing.
  5. plot_missing: apply processing similar to step 2.
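
Step 2 can be sketched with the standard library alone: convert each datetime to a float timestamp, bin the floats like any numeric column, and format the bin edges back into date strings for the ticks. The function name bin_datetimes and the sample dates are made up; timezone-aware UTC values are used so the timestamps do not depend on the local machine.

```python
from datetime import datetime, timezone


def bin_datetimes(stamps, num_bins):
    """Histogram datetimes via their float timestamps; return counts and tick labels."""
    seconds = [s.timestamp() for s in stamps]
    lo, hi = min(seconds), max(seconds)
    width = (hi - lo) / num_bins or 1.0  # avoid a zero width when all values are equal
    counts = [0] * num_bins
    for value in seconds:
        # The maximum value falls into the last bin instead of a bin of its own.
        index = min(int((value - lo) / width), num_bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(num_bins + 1)]
    labels = [
        datetime.fromtimestamp(e, tz=timezone.utc).strftime("%Y-%m-%d")
        for e in edges
    ]
    return counts, labels


stamps = [
    datetime(2020, 1, 1, tzinfo=timezone.utc),
    datetime(2020, 1, 2, tzinfo=timezone.utc),
    datetime(2020, 1, 3, tzinfo=timezone.utc),
    datetime(2020, 1, 5, tzinfo=timezone.utc),
]
counts, labels = bin_datetimes(stamps, num_bins=2)
print(counts, labels)
```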

data_connector: dblp schema

Please check the dblp schema below.

Screen Shot 2019-12-21 at 1 53 52 PM

  1. I suggest using authors as a column name since it has more than one author.

  2. I suggest using pages as a column name since it has more than one page.

  3. Why is the data type of the venue column a list? Should it be string?

No module named 'toml'

I tried to upgrade dataprep to the latest version: pip install -U dataprep, but got the following error:

Screen Shot 2020-03-29 at 9 07 11 PM

Design and implement an all-in-one report

Given a dataset, run all (or many) of the plot functions, and output the visualizations into a nicely formatted html file. This will be similar to pandas-profiling, the main differences being a larger variety of interaction plots and the tooltip. The current plan is to not include descriptive statistics.

  • Create a low-fidelity mockup
  • Get feedback about the mockup from the DataPrep team
  • Implement the report
  • Test

plot_missing(df): bugs in tooltips

I have a dataset collected from Yelp using the data_connector API. The dataset has 20 rows. The first 10 values in the address3 column are shown below.

Screen Shot 2019-12-20 at 2 51 37 PM

Bug 1. Please check the tooltip below.

  • missing% is larger than 100%?
  • loc should be 5 rather than 4~5.

Screen Shot 2019-12-20 at 2 52 44 PM

Bug 2. Please check the tooltip below.

  • Should the location of the first row be 1 or 0? In a pandas DataFrame, iloc starts from 0.

Screen Shot 2019-12-20 at 3 02 55 PM

Bug 3. Should we consider empty strings as missing values? If so, the address3 column should have many more missing values.
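
To make Bug 3 concrete, here is a small sketch of how the missing percentage changes depending on whether empty strings count as missing. It is plain Python rather than dataprep code; the helper names and the sample column are invented. Computed this way the percentage also stays within 0 to 100, which relates to Bug 1.

```python
import math


def is_missing(value, treat_empty_as_missing=True):
    """Decide whether one cell counts as missing (hypothetical rule)."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    if treat_empty_as_missing and isinstance(value, str) and value.strip() == "":
        return True
    return False


def missing_percent(values, **kwargs):
    """Share of missing cells in percent; always between 0 and 100."""
    if not values:
        return 0.0
    missing = sum(is_missing(v, **kwargs) for v in values)
    return 100.0 * missing / len(values)


# Invented sample: 3 None values, 6 empty strings, 1 real value.
address3 = ["", None, "", "Suite 5", "", None, "", "", None, ""]

print(missing_percent(address3))
print(missing_percent(address3, treat_empty_as_missing=False))
```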

plot(df, x, y): make it possible and easy for users to set ngroups

When running plot(df, "country", "generation"), I got the following plots. It seems that it is impossible to adjust ngroups (i.e., top 5, top 20, top 70) for each plot.

Screen Shot 2019-12-20 at 1 13 06 PM

Screen Shot 2019-12-20 at 1 13 14 PM

Screen Shot 2019-12-20 at 1 13 26 PM

To make it easy to set ngroups, I have one proposal.

  • First, we make the three plots have the same ngroups by default (e.g., ngroups = 10).
  • Second, if a user wants to change ngroups, she only needs to change one parameter and then it will be applied to all three plots.
  • Third, if ngroups is very large, then we should increase the plot width/height automatically so that a user can view the whole plot by scrolling the vertical and horizontal bars. See the plot below for an example.

Screen Shot 2019-12-20 at 1 25 06 PM
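
The proposal can be sketched as two small helpers: one ngroups value selects the top groups for every plot and reports how many groups exist in total (so the user can be told when some are hidden), and the plot width grows with the number of groups shown. The names top_groups and plot_width and the pixel constants are assumptions made for this illustration.

```python
from collections import Counter


def top_groups(labels, ngroups=10):
    """Return the ngroups most frequent groups and the total number of groups."""
    counts = Counter(labels)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ordered[:ngroups], len(counts)


def plot_width(shown_groups, pixels_per_group=30, min_width=300):
    """Grow the plot with the number of groups so labels do not overlap."""
    return max(min_width, shown_groups * pixels_per_group)


shown, total = top_groups(["A", "B", "A", "C", "B", "A"], ngroups=2)
print(f"showing {len(shown)} of {total} groups: {shown}")
print(plot_width(len(shown)), plot_width(70))
```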

dataprep.eda: add case study for 'Housing Price'

The task is to add a case study that uses dataprep.eda to simplify the EDA for the 'Housing Price' task.

House Price:
Data: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/
EDA: https://www.kaggle.com/mgmarques/houses-prices-complete-solution

The case study has 2 purposes:

  1. Help us figure out if the current API design is complete and easy to use.
  2. Educate the users on how to use our tool to finish popular tasks.

TODO:

  • Create Jupyter Notebook.
  • Add the use case into the documentation.

data_connector: pagination design

I'm working on the design of the pagination feature of data connector. Here are some plans and existing problems. Thoughts are very welcome.

Plan:

  • Implementation of limit specification: the user can specify a limit to control the maximum number of returned results

  • Implementation of fetching all results of a query: with the help of the offset parameter of each API (since_id for Twitter, which may need further modification)

Problems:

  1. How to find a general way to represent parameters in the query() function
  2. How to deal with the specific way of Twitter API in terms of pagination
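
As a starting point for discussion, offset-based pagination with a user-facing limit could look like the sketch below. It is not the data connector's code: fetch_all, the fetch_page callback, and its offset/count parameters are hypothetical, and fake_fetch_page stands in for a real API call. Cursor-style APIs such as Twitter's since_id would need a token passed from page to page instead of a numeric offset.

```python
def fetch_all(fetch_page, limit, page_size=20):
    """Collect up to `limit` records by requesting consecutive pages."""
    results = []
    offset = 0
    while len(results) < limit:
        wanted = min(page_size, limit - len(results))
        page = fetch_page(offset=offset, count=wanted)
        if not page:  # the source has no more records
            break
        results.extend(page)
        offset += len(page)
    return results


# Stand-in for a remote API that holds 45 records.
RECORDS = list(range(45))


def fake_fetch_page(offset, count):
    return RECORDS[offset:offset + count]


print(len(fetch_all(fake_fetch_page, limit=30)))
print(len(fetch_all(fake_fetch_page, limit=100)))
```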

plot_missing(df, x, y): several bugs

I found a few bugs after running plot_missing(df, x, y).

Bug 1. DropMissing should be orange and Origin should be blue. Also, the PDF curve looks strange to me. Please double-check whether it is correct.

Screen Shot 2019-12-20 at 11 10 41 PM

Bug 2. The two CDF curves overlap, which looks strange to me.
Screen Shot 2019-12-20 at 11 17 50 PM

Bug 3. Please make the box plot consistent with the one generated by plot(df, numerical_x). Also, the color scheme of the box plot looks strange to me.

Screen Shot 2019-12-20 at 11 18 55 PM

plot(df, x): histogram shows incorrect values

When running plot(df, 'year' 31) on the example dataset (suicide-rate.csv), I got the following histogram, which shows 2015: 936 and 2016: 904. However, the correct values should be 2015: 744 and 2016: 160.

Screen Shot 2019-12-18 at 11 21 31 PM

Conda Installation of the dataprep AI is not supported

Conda Installation for the data prep AI is not supported.

$ conda install dataprep
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.

PackagesNotFoundError: The following packages are not available from current channels:

  • dataprep

Current channels:

To search for alternate channels that may provide the conda package you're
looking for, navigate to

https://anaconda.org

and use the search bar at the top of the page.

fix KDE plot

The KDE plot of Dataprep is bad and needs to be fixed:
Screen Shot 2020-04-10 at 4 59 35 PM

eda.plot_missing: error when changing column type

I tried the Titanic training data, which can be downloaded from https://www.kaggle.com/c/titanic/data.

The following code will raise an error:
import pandas as pd
from dataprep.eda import *
df = pd.read_csv('titanic/train.csv')
df['PassengerId'] = df['PassengerId'].astype("object")
plot_missing(df, 'Age')

The error information is as follows:
Screen Shot 2020-03-12 at 2 07 30 PM

However, if we do not change the column type of 'PassengerId', i.e., if we remove df['PassengerId'] = df['PassengerId'].astype("object"), the code runs successfully.

dataprep.eda: correlation for categorical column

For a classification task, we need to understand how other columns are related to the label column, which is a categorical column. However, the current plot_correlation only supports correlation between numeric columns. We need to think about how to present the correlation of categorical columns (possibly, but not necessarily, through the plot_correlation function).

eda.plot_correlation: handle categorical column

I'm considering whether we should handle categorical variables in plot_correlation. One use case of plot_correlation is plot_correlation(df, x = label) to rank the features that are correlated with the label. For this scenario, it would be important to have a uniform way to measure the correlation for both categorical and continuous variables.

My idea is to add one measure to handle categorical variables, such as Cramér's V (based on the chi-square test) or the Uncertainty Coefficient (based on mutual information). For a continuous variable, we make bins and treat it as a categorical variable.

This requires adding one more tab to the current output of plot_correlation(df) and plot_correlation(df, x), which shows the Cramér's V or Uncertainty Coefficient for all columns. Please let me know your opinions. @dovahcrow @jnwang @Waterpine @brandonlockhart
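
For readers unfamiliar with the measure, here is a minimal sketch of Cramér's V for two categorical sequences, written in plain Python and without bias correction. It only illustrates the formula V = sqrt(chi2 / (n * (min(r, c) - 1))); it is not a proposal for the actual implementation, and the sample data is invented.

```python
import math
from collections import Counter


def cramers_v(xs, ys):
    """Cramér's V between two equally long categorical sequences (no bias correction)."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    x_totals = Counter(xs)
    y_totals = Counter(ys)
    chi2 = 0.0
    for x, x_count in x_totals.items():
        for y, y_count in y_totals.items():
            expected = x_count * y_count / n
            observed = joint.get((x, y), 0)
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(x_totals), len(y_totals)) - 1
    if k == 0:  # one variable is constant, so no association can be measured
        return 0.0
    return math.sqrt(chi2 / (n * k))


dependent = cramers_v(["a", "a", "b", "b"], ["u", "u", "v", "v"])
independent = cramers_v(["a", "a", "b", "b"], ["u", "v", "u", "v"])
print(dependent, independent)
```

A continuous column would first be binned, as suggested above, and the bin labels passed in as one of the sequences.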

data_connector: issue using API parameters without template variables

Support for templates was added in this PR.
When template variables are not specified in the API request, the template value still contains the string around it and is not "empty". This always results in a key conflicting with to_key warning and returns empty results.

Example:
If first_name and last_name are template variables that are not mentioned in the request, and to_key is q and is specified in the following manner:
df = dc.query("publication", q="Journal Articles")
The request contains the template value <Template memory:7f4f6c36c2d0> author:_: together with the above warning. Instead of returning the publications of type "Journal Articles", it returns an empty data frame.

plot(df) and plot_correlation(df) fail when data has 'list' columns

When running plot(df) and plot_correlation(df) on the following dataframe, since the author column is a list, both plot and plot_correlation failed.

For plot(), the reported error is TypeError: unhashable type: 'list'

For plot_correlation(df), the reported error is AssertionError: No numerical columns found

Screen Shot 2019-12-21 at 1 36 30 PM
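
The TypeError comes from Python itself: lists cannot be hashed, so any step that counts distinct values fails on them. A possible workaround, sketched below with invented rows and hypothetical helper names, is to detect such columns up front and convert list cells to tuples (or skip the column with a clear message).

```python
def make_hashable(value):
    """Convert (possibly nested) lists to tuples so the value can be counted."""
    if isinstance(value, list):
        return tuple(make_hashable(v) for v in value)
    return value


def list_columns(rows):
    """Names of columns that contain at least one list cell."""
    names = set()
    for row in rows:
        for name, value in row.items():
            if isinstance(value, list):
                names.add(name)
    return sorted(names)


# Invented rows shaped like the DBLP result, where author is a list.
rows = [
    {"title": "Paper A", "author": ["X", "Y"]},
    {"title": "Paper B", "author": ["X", "Y"]},
]

print(list_columns(rows))
distinct_authors = {make_hashable(row["author"]) for row in rows}
print(distinct_authors)
```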

data_connector: automate testing configuration

The task is to make every PR automatically trigger the tests of the impacted modules. This can be achieved with GitHub Actions.

  • design the test workflow for data-connector

  • implement the workflow

  • code review

  • PR & release

Extend data-connector for more websites

  • look for more frequently used websites besides the currently supported ones (e.g. yelp) and make a list

  • learn how to write a data-connector config for a new website

  • implement support for one more website (the rest of the list would be supported in the future)

  • PR & code review

plot_correlation: handle missing values

It looks like plot_correlation(df, x) and plot_correlation(df, x, y) cannot handle missing values. Could you please take a look? @Waterpine

The code is as follows:
import pandas as pd
from dataprep.eda import plot_correlation
df = pd.read_csv('https://www.openml.org/data/get_csv/9/dataset_9_autos.arff', na_values = ['?'])
plot_correlation(df, 'price')
plot_correlation(df, 'price', 'bore')

The running result is as follows:
Screen Shot 2020-01-21 at 5 18 19 PM
Screen Shot 2020-01-21 at 5 18 42 PM
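
One common way to handle this is pairwise deletion: drop every row where either value is missing, then compute the correlation on the remaining pairs. The sketch below shows the idea for Pearson correlation in plain Python. It is an assumption about a possible fix, not the library's behavior, and the sample values are invented.

```python
import math


def _is_missing(value):
    return value is None or (isinstance(value, float) and math.isnan(value))


def pearson_pairwise(xs, ys):
    """Pearson correlation over the pairs where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys)
             if not (_is_missing(x) or _is_missing(y))]
    n = len(pairs)
    if n < 2:
        return float("nan")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:  # a constant column has no defined correlation
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


nan = float("nan")
price = [1.0, 2.0, nan, 4.0, 5.0]
bore = [2.0, 4.0, 6.0, None, 10.0]
r = pearson_pairwise(price, bore)
print(r)
```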
