sfu-db / dataprep

Open-source low-code data preparation library in Python. Collect, clean, and visualize your data in Python with a few lines of code.

Home Page: http://dataprep.ai

License: MIT License

Python 60.38% HTML 9.30% JavaScript 26.79% Vue 1.00% CSS 2.50% Just 0.03%

dataprep data-science datapreparation dataconnector eda exploratory-data-analysis data-exploration connector cleaning datacleaning


dataprep's Issues

plot(df) error

eda.plot: empty bins in histogram

Currently, the histogram keeps bins even when they are empty. For example, run the following code:
import pandas as pd
from dataprep.eda import *
df = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
plot(df)
The result is:

Actually, Pclass and survived have only 3 and 2 distinct values, respectively. Since we show 10 bins by default, many bins are empty for Pclass and survived. Could we find a better way to visualize the histogram in this case? Maybe take a look at how other plotting libraries handle this issue.
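One possible direction, as a minimal sketch rather than dataprep's actual behavior: when a numeric column has no more distinct values than the requested number of bins, count each value directly (a bar chart) instead of building equal-width bins. The function name `histogram_or_counts` and the sample data are hypothetical.

```python
from collections import Counter

def histogram_or_counts(values, bins=10):
    """Return ("bar", per-value counts) for few distinct values, else ("hist", equal-width bins)."""
    distinct = sorted(set(values))
    if len(distinct) <= bins:
        # Few distinct values: one bar per value, so no empty bins can appear.
        counts = Counter(values)
        return "bar", [(v, counts[v]) for v in distinct]
    lo, hi = distinct[0], distinct[-1]
    width = (hi - lo) / bins
    hist = [0] * bins
    for v in values:
        # Clamp so the maximum value falls into the last bin.
        hist[min(int((v - lo) / width), bins - 1)] += 1
    return "hist", [(lo + i * width, hist[i]) for i in range(bins)]

pclass = [1, 3, 3, 2, 3, 1, 3]  # hypothetical sample with 3 distinct values
kind, data = histogram_or_counts(pclass)
```

With this rule, a column like Pclass would get three bars instead of ten mostly empty bins.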

eda.plot: JavaScript output is disabled in JupyterLab

The plot cannot be shown, and there is a warning, "JavaScript output is disabled in JupyterLab", when I import and invoke eda.plot in JupyterLab (Jupyter Notebook does not have this issue).

eda.plot_correlation: plot_correlation is not efficient

Currently, plot_correlation is slower than seaborn.

I don't know the reason; maybe we should check the code and make the function more efficient.

dataprep.eda: add case study for 'Titanic'

The task is to add a case study (in a Jupyter notebook) that uses dataprep.eda for an ML task.

plot(df, x, y): ngroups does not work

I would like to increase the number of groups in the box plot of plot(df, "suicides", "country"), but found that setting ngroups = 20 does not work (see below).

Runtime warning while using plot_correlation() on Kaggle notebooks

I was using dataprep.ai in my notebooks on Kaggle and found that the plot_correlation() function throws a runtime warning while plotting. I think the function should handle this internally (for example with try and except) so that it doesn't surface runtime warnings, because they look odd.
Thank you!
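For context, Python warnings are not exceptions by default, so try/except alone would not silence them; a warnings filter is the usual tool. Below is a minimal sketch of that idea. `noisy_correlation` is a hypothetical stand-in for whatever numeric routine emits the warning, and `quiet` is an illustrative helper, not part of dataprep.

```python
import warnings

def noisy_correlation():
    # Hypothetical stand-in for a numeric routine that emits a RuntimeWarning.
    warnings.warn("invalid value encountered in divide", RuntimeWarning)
    return float("nan")

def quiet(func):
    """Call func with RuntimeWarning suppressed, restoring the filters afterwards."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=RuntimeWarning)
        return func()

result = quiet(noisy_correlation)
```

The filter is scoped to the `with` block, so warnings raised elsewhere in the notebook are unaffected.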

data_connector: add Author table to DBLP?

I suggest adding an Author table to the DBLP connector. This can solve the name ambiguity issue.

Suppose I would like to find all the papers published by Guoliang Li (Tsinghua). I can first query the Author table to find all the people whose name is Guoliang Li. https://dblp.org/search/author/api?q=Guoliang$_Li$

There are six people, and the second one is the person I am looking for:

Once #50 is supported, I can get all Guoliang Li (Tsinghua)'s publications through https://dblp.org/search/publ/api?q=author%3AGuoliang_Li_0001%3A

eda.plot_missing: error when passing column

I'm getting this error when running the following code:

import pandas as pd
from dataprep.eda import *
df = pd.read_csv("https://s3-us-west-2.amazonaws.com/dataprep.dsl/datasets/suicide-rate.csv")
plot_missing(df, "suicides")

@Waterpine @jinglinpeng is anyone else getting this error? I'm using dask version 2.9.1

dataprep plot() doesn't show plots on the Google Colab interface.

I used dataprep on Google Colab for some of my EDA work. The library runs fine on Google Colab, but the plots do not show up.

plot(df): xtics need to be optimized for numerical attributes

I ran plot(df) on the example data and found that the x ticks of many numerical attributes are not carefully chosen. Below I compare the x ticks of Tableau (left) and dataprep (right). Tableau clearly looks better. This is not a high priority, but it is worth considering in a future release.

data_connector: Fetch all publications of one specific conference

Suppose a user wants to fetch all publications of one specific conference (e.g., SIGMOD Conference). Dataprep.data_connector cannot meet her needs. For example, the following paper will be returned.

A user can get all publications of SIGMOD Conference through this API: https://dblp.org/search/publ/api?q=venue%3ASIGMOD_Conference%3A

Please consider supporting this feature.
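As a rough illustration of how such a field-restricted query URL can be assembled, here is a minimal sketch. It only reproduces the URL pattern quoted above (field name, colon, value with underscores, trailing colon, percent-encoded); the helper name `dblp_field_query` is hypothetical and no request is sent.

```python
from urllib.parse import quote

DBLP_PUBL_API = "https://dblp.org/search/publ/api"

def dblp_field_query(field, value):
    """Build a URL of the form ...?q=<field>%3A<value_with_underscores>%3A."""
    term = f"{field}:{value.replace(' ', '_')}:"
    # safe="" forces ':' to be percent-encoded as %3A; '_' is never encoded.
    return f"{DBLP_PUBL_API}?q={quote(term, safe='')}"

url = dblp_field_query("venue", "SIGMOD Conference")
```

The same helper with the field set to "author" would produce the author-restricted form of the query.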

plot: error in x ticks for some dataset

For some datasets, the x ticks of plot may be wrong. Please try the following code:
import pandas as pd
from dataprep.eda import *
df = pd.read_csv('https://www.openml.org/data/get_csv/52236/phpAyyBys', na_values = ['?', 'nan'])
plot(df)
The result is as follows:

plot_missing(df, x): several bugs

I found a few bugs when running plot_missing(df, 'HDI_for_year').

Bug 1. It would be better to let the user know that not all countries are displayed. Please work with @brandonlockhart to check how to add ngroups here, and also take a look at #42.

Bug 2. Orange bars are not displayed on the sex and age tabs.

Bug 3. No bars are displayed on the country_year and gdp_for_year tabs.

EDA plot function

Goal: Plot function includes plot(df), plot(df, x="x") and plot(df, x="x", y="y")

Step 1: create intermediates
Step 2: plot graphs based on intermediates
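The two-step split can be sketched as follows. This is only an illustration of the idea under assumed names (`Intermediate`, `compute`, `render`), with text bars standing in for a real chart; it is not the library's implementation.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Intermediate:
    kind: str   # which chart the data is meant for
    data: dict  # the small summary the chart needs

def compute(values):
    """Step 1: reduce raw values to an intermediate summary."""
    return Intermediate(kind="bar", data=dict(Counter(values)))

def render(intermediate):
    """Step 2: draw from the intermediate only (text bars instead of a chart)."""
    rows = sorted(intermediate.data.items())
    return "\n".join(f"{label}: {'#' * count}" for label, count in rows)

chart = render(compute(["a", "b", "a"]))
```

Keeping the rendering step independent of the raw data means the intermediates can be computed once and plotted in different ways.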

data_connector (Spotify): missing cols in the Artist and Album tables

Is there any reason for not including external_urls, images for the Artist table and for not including popularity, album_type, copyrights, external_urls for the Album table?

Also, in the Album table, can an album have more than one artist?

Combine two function together

I find that plot(df, x, y) and plot_correlation(df, x, y) have similar outputs. Why not combine them? Then we could use just plot(df, x, y) to analyze the data.

data_connector: API issuing strategy expression

Design a mechanism to support fluent API queries, i.e., one that fetches results efficiently given network conditions, website constraints, etc. (A retrospective from previous meetings.)

plot_missing: documentation for num_bins and num_cols

Please check below a screenshot of the documentation for plot_missing.

There are a few issues:

It should be "num_bins" rather than "bins_num". The description ("The number of rows in the figure") is also a bit confusing.
In plot(df), this parameter is called bins. Please work with @brandonlockhart and make it consistent. I would suggest calling it bins, which is consistent with pandas.DataFrame.hist.
Maybe it's better to use ncols instead of num_cols, because plot(df) has a parameter called ngroups, which is short for num_groups.

support of time series

This issue sketches a rough idea for supporting time series in dataprep.eda.

Essentially, a datetime can be regarded as a numeric type: it can be converted to a timestamp (float) via datetime.timestamp() or pd.to_numeric(). Hence, we could do the following as initial support for time series.

1. Identify the columns with datetime64 type in the dataframe.
2. plot(df) & plot(df, x): handle a time series column like a numeric column, which can be binned. When showing the ticks of a time series column, show the datetime string via a function like datetime.strftime(). An example output is https://pandas.pydata.org/pandas-docs/version/0.13/visualization.html
3. plot(df, x, y): when x is a datetime column and y is a numeric column, replace the scatter plot with a line chart, which shows how y changes with x. For all other cases, apply the processing in step 2.
4. plot_correlation: we could ignore the datetime column as pandas does, or transform the datetime column to numeric via pd.to_numeric() and then apply the original processing.
5. plot_missing: apply processing similar to step 2.
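The binning idea for datetime columns can be sketched as below: convert each value to a float timestamp, bin it like any numeric column, and format the bin edges back into strings for the ticks. The function name `bin_datetimes`, its parameters, and the sample dates are assumptions made for illustration; timezone-aware UTC values are used so the result does not depend on the local timezone.

```python
from datetime import datetime, timezone

def bin_datetimes(stamps, bins=4, fmt="%Y-%m-%d"):
    """Bin datetimes as numeric timestamps; label each bin by its left edge."""
    seconds = [s.timestamp() for s in stamps]
    lo, hi = min(seconds), max(seconds)
    width = (hi - lo) / bins or 1.0  # avoid a zero width when all values are equal
    counts = [0] * bins
    for sec in seconds:
        counts[min(int((sec - lo) / width), bins - 1)] += 1
    labels = [
        datetime.fromtimestamp(lo + i * width, tz=timezone.utc).strftime(fmt)
        for i in range(bins)
    ]
    return list(zip(labels, counts))

stamps = [datetime(2020, 1, d, tzinfo=timezone.utc) for d in (1, 2, 3, 9)]
result = bin_datetimes(stamps, bins=4)
```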

data_connector: dblp schema

Please check the dblp schema below.

I suggest using authors as a column name since it has more than one author.
I suggest using pages as a column name since it has more than one page.
Why is the data type of the venue column a list? Should it be string?

No module named 'toml'

I tried to upgrade dataprep to the latest version: pip install -U dataprep, but got the following error:

Design and implement an all-in-one report

Given a dataset, run all (or many) of the plot functions, and output the visualizations into a nicely formatted html file. This will be similar to pandas-profiling, the main differences being a larger variety of interaction plots and the tooltip. The current plan is to not include descriptive statistics.

Create a low-fidelity mockup
Get feedback about the mockup from the DataPrep team
Implement the report
Test

plot_missing(df): bugs in tooltips

I collected a dataset from Yelp using the data_connector API. The dataset has 20 rows. The first 10 values in the address3 column are shown below.

Bug 1. Please check the tooltip below.

missing% is larger than 100%?
loc should be 5 rather than 4~5.

Bug 2. Please check the tooltip below.

Should the location of the first row be 1 or 0? In a pandas DataFrame, iloc starts from 0.

Bug 3. Should we consider empty strings as missing values? If so, the address3 column should have many more missing values.
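To make the question in Bug 3 concrete, here is a minimal sketch of a missing-value rate with an option to treat empty strings as missing. The names `is_missing`, `missing_percent`, and `empty_as_missing`, as well as the sample column, are hypothetical. Since the rate is a count divided by the column length, it cannot exceed 100%.

```python
import math

def is_missing(value, empty_as_missing=True):
    """Treat None and NaN as missing; optionally treat blank strings as missing too."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return empty_as_missing and isinstance(value, str) and value.strip() == ""

def missing_percent(column, empty_as_missing=True):
    """Percentage of missing entries, always between 0 and 100."""
    if not column:
        return 0.0
    missing = sum(is_missing(v, empty_as_missing) for v in column)
    return 100.0 * missing / len(column)

address3 = ["", None, "Suite 5", "", float("nan")]  # hypothetical values
with_empty = missing_percent(address3)
without_empty = missing_percent(address3, empty_as_missing=False)
```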

plot(df, x, y): make it possible and easy for users to set ngroups

When running plot(df, "country", "generation"), I got the following plots. It seems that it is impossible to adjust ngroups (i.e., top 5, top 20, top 70) for each plot.

To make it easy to set ngroups, I have one proposal.

First, we make the three plots have the same ngroups by default (e.g., ngroups = 10).
Second, if a user wants to change ngroups, she only needs to change one parameter and then it will be applied to all three plots.
Third, if ngroups is very large, then we should increase the plot width/height automatically so that a user can view the whole plot by scrolling the vertical and horizontal bars. See the plot below for an example.
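The proposal could look roughly like this: a single ngroups value selects the top groups for every plot, and the plot width grows once ngroups exceeds what fits comfortably. All names and pixel values here (`top_groups`, `plot_width`, 450, 25) are made up for illustration.

```python
from collections import Counter

def top_groups(values, ngroups=10):
    """Labels of the ngroups most frequent groups, shared by all plots."""
    return [label for label, _ in Counter(values).most_common(ngroups)]

def plot_width(ngroups, base=450, per_group=25, visible=10):
    """Grow the width linearly for every group beyond the first `visible`."""
    return base + per_group * max(0, ngroups - visible)

labels = top_groups(["a", "b", "a", "c", "b", "a"], ngroups=2)
width = plot_width(70)
```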

dataprep.eda: add case study for 'Housing Price'

The task is to add a case study that uses dataprep.eda to simplify EDA for the 'Housing Price' task.

House Price:
Data: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/
EDA: https://www.kaggle.com/mgmarques/houses-prices-complete-solution

The case study has 2 purposes:

Help us figure out if the current API design is complete and easy to use.
Educate the users on how to use our tool to finish popular tasks.

TODO:

Create Jupyter Notebook.
Add the use case into the documentation.

data_connector: pagination design

I'm working on the design of the pagination feature of data connector. Here are some plans and existing problems. Thoughts are very welcome.

Plan:

Limit specification: the user can specify a limit to control the maximum number of returned results
Fetching all results for a query: with the help of the offset parameter of each API (since_id for Twitter, which may need further modification)

Problems:

How to find a general way to represent parameters in the query() function
How to deal with the Twitter API's specific approach to pagination
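A minimal sketch of the limit-plus-offset plan, assuming an API that accepts an offset and a page size. `fetch_page`, its `offset` and `count` parameters, and `fake_api` are hypothetical; a cursor-style API such as Twitter's since_id would need a different loop.

```python
def fetch_all(fetch_page, limit, page_size=50):
    """Collect up to `limit` records by requesting consecutive pages."""
    results = []
    offset = 0
    while len(results) < limit:
        want = min(page_size, limit - len(results))
        page = fetch_page(offset=offset, count=want)
        if not page:
            break  # the API has no more results
        results.extend(page)
        offset += len(page)
    return results

def fake_api(offset, count, _data=list(range(120))):
    # Stand-in for a real web API holding 120 records.
    return _data[offset:offset + count]

first_70 = fetch_all(fake_api, limit=70)
everything = fetch_all(fake_api, limit=500)
```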

Check version of dataprep

I would like to know which version of dataprep I have installed. Can we support this?
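Independent of any library support, the installed version of a package can be read from its packaging metadata with the standard library (Python 3.8+). A minimal sketch, where `installed_version` is just an illustrative wrapper:

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

print(installed_version("dataprep"))
```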

plot_missing(df, x, y): several bugs

I found a few bugs after running plot_missing(df, x, y).

Bug 1. DropMissing should be orange and Origin should be blue. Also, the PDF curve looks strange to me. Please double-check whether it is correct.

Bug 2. The two CDF curves overlap, which looks strange to me.

Bug 3. Please make the box plot consistent with the one generated by plot(df, numerical_x). Also, the color scheme of the box plot looks strange to me.

eda.plot: box plot x-axis label is not clear

Currently, when we use plot(df, x, y, bins) with too many bins, the box plot's x-axis labels become unreadable.

plot(df, x): histogram shows incorrect values

When running plot(df, 'year' 31) on the example dataset (suicide-rate.csv), I got the following histogram, which shows 2015: 936 and 2016: 904. However, the correct values should be 2015: 744 and 2016: 160.

Conda Installation of the dataprep AI is not supported

Installing dataprep via conda is not supported.

$ conda install dataprep
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.

PackagesNotFoundError: The following packages are not available from current channels:

dataprep

Current channels:

To search for alternate channels that may provide the conda package you're
looking for, navigate to

https://anaconda.org

and use the search bar at the top of the page.

fix KDE plot

The KDE plot of Dataprep is bad and needs to be fixed:

dataprep.eda: report implementation

Implement the Report enhancement as specified in #74.

plot_correlation: why does it not work for the columns with missing values?

For the columns with missing values, plot_correlation(df) returns NaN as their correlation values (see below).

It would be better to take a look at how pandas.DataFrame.corr overcomes this limitation.
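pandas.DataFrame.corr excludes NA/null values when computing each pairwise correlation. A pure-Python sketch of that pairwise-complete idea follows; it is an illustration only, not the code of pandas or dataprep, and the function names are hypothetical.

```python
import math

def _is_nan(v):
    return v is None or (isinstance(v, float) and math.isnan(v))

def pearson_pairwise(xs, ys):
    """Pearson correlation over the rows where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if not _is_nan(x) and not _is_nan(y)]
    n = len(pairs)
    if n < 2:
        return float("nan")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant column
    return cov / math.sqrt(var_x * var_y)

r = pearson_pairwise([1.0, 2.0, float("nan"), 4.0], [2.0, 4.0, 6.0, 8.0])
```

Here the row containing NaN is dropped for this pair only, so the remaining three rows still yield a finite correlation.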

dataprep.eda: plot(df,x,y) with categorical variables bar chart problems

Some labels are overlapping in the nested bar chart:

I think we should include the count in the tooltip of the stacked bar chart, since this plot could be deceiving if there is a small number of observations in a group.

eda.plot_missing: error when change column type

I tried the Titanic training data, which can be downloaded from https://www.kaggle.com/c/titanic/data.

The following code will raise an error:
import pandas as pd
from dataprep.eda import *
df = pd.read_csv('titanic/train.csv')
df['PassengerId'] = df['PassengerId'].astype("object")
plot_missing(df, 'Age')

The error information is as follows:

However, if we do not change the column type of 'PassengerId', i.e., remove df['PassengerId'] = df['PassengerId'].astype("object"), the code runs successfully.

dataprep.eda: correlation for categorical column

For a classification task, we need to understand how other columns relate to the label column, which is categorical. However, plot_correlation currently supports only numeric columns, so we need to think about how correlation with a categorical column could be presented (possibly, but not necessarily, via the plot_correlation function).

Add documentation for data_connector

eda.plot_correlation: handle categorical column

I'm considering whether we should handle categorical variables in plot_correlation. One use case of plot_correlation is plot_correlation(df, x = label) to rank the features that are correlated with the label. For this scenario, it would be important to have a uniform way to measure the correlation for both categorical and continuous variables.

My idea is to add one measure to handle categorical variables, such as Cramer's V (based on the chi-square test) or the Uncertainty Coefficient (based on mutual information). For a continuous variable, we bin it and treat it as a categorical variable.

This requires adding one more tab to the current output of plot_correlation(df) and plot_correlation(df, x), which shows the Cramer's V or Uncertainty Coefficient for all columns. Please let me know your opinions. @dovahcrow @jnwang @Waterpine @brandonlockhart
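For reference, Cramer's V for two categorical columns is sqrt(chi2 / (n * min(r - 1, c - 1))), where chi2 is the chi-square statistic of the r-by-c contingency table and n is the number of rows. A minimal sketch without bias correction; the function name `cramers_v` is an assumption, not an existing dataprep function.

```python
import math
from collections import Counter

def cramers_v(xs, ys):
    """Cramer's V between two equally long sequences of categorical values."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    row = Counter(xs)
    col = Counter(ys)
    k = min(len(row), len(col)) - 1
    if n == 0 or k == 0:
        return float("nan")  # undefined when a column has a single category
    chi2 = 0.0
    for r, r_count in row.items():
        for c, c_count in col.items():
            expected = r_count * c_count / n
            observed = joint.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * k))

perfect = cramers_v(["a", "a", "b", "b"], ["x", "x", "y", "y"])
independent = cramers_v(["a", "a", "b", "b"], ["x", "y", "x", "y"])
```

A continuous column would first be binned into categories, as suggested above, before being passed to such a function.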

data_connector: issue using API parameters without template variables

Support for templates was added in this PR.
When template variables are not specified in the API request, the template value still contains the surrounding string and is not "empty". This always triggers the key conflicting with to_key warning and returns empty results.

Example:
If first_name and last_name are template variables that are not mentioned in the request, and to_key is q and is specified as follows:
df = dc.query("publication", q="Journal Articles")
The request contains the template value <Template memory:7f4f6c36c2d0> author:_: along with the above warning. Instead of returning the publications of type "Journal Articles", it returns an empty data frame.

Add docstring for data_connector

plot(df) and plot_correlation(df) fail when data has 'list' columns

When running plot(df) and plot_correlation(df) on the following dataframe, both plot and plot_correlation fail because the author column contains lists.

For plot(df), the reported error is TypeError: unhashable type: 'list'

For plot_correlation(df), the reported error is AssertionError: No numerical columns found
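The TypeError arises because lists cannot be hashed, which operations such as counting distinct values require. One possible workaround, sketched here under assumed names (`make_hashable`, `sanitize_column`), is to convert list cells to tuples before plotting.

```python
def make_hashable(value):
    """Recursively turn lists into tuples; leave other values unchanged."""
    if isinstance(value, list):
        return tuple(make_hashable(v) for v in value)
    return value

def sanitize_column(column):
    return [make_hashable(v) for v in column]

authors = [["A. Author", "B. Author"], ["C. Author"], ["C. Author"]]  # hypothetical
clean = sanitize_column(authors)
distinct = len(set(clean))  # works now; set(authors) would raise TypeError
```

With pandas, the same function could be applied to a column through Series.map.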

data_connector: automate testing configuration

The task is to make every PR automatically trigger the tests of the impacted modules. This can be achieved with GitHub Actions.

design the test workflow for data-connector
implement the workflow
code review
PR & release

data_connector: Fetch all publications of one specific author

Suppose a user wants to fetch all publications of one specific author (e.g., Jian Pei). Dataprep.data_connector cannot meet her needs. For example, the first paper is not written by Jian Pei, but it was returned since the author list contains the keywords Jian and Pei.

A user can get all publications of Jian Pei through this API: https://dblp.org/search/publ/api?q=author%3AJian_Pei%3A

Please consider supporting this feature.

data_connector: schema.json, adding description attribute for field definition?

Adding a description for the parameters will help users understand how to specify values for each parameter. For example, the format of the longitude in the Yelp.businesses table, or the maximum number of results that a user can expect (if we incorporate a limit parameter in the future).

Design and write tutorial notebook for data_connector

Extend data-connector for more websites

Look for more frequently used websites besides the ones we currently support (e.g., Yelp) and make a list
Learn how to write a data-connector config for a new website
Implement support for one more website (the rest of the list will be supported in the future)
PR & code review

plot_correlation: handle missing values

It looks like plot_correlation(df, x) and plot_correlation(df, x, y) cannot handle missing values. Could you please take a look? @Waterpine

The code is as follows:
import pandas as pd
from dataprep.eda import *
df = pd.read_csv('https://www.openml.org/data/get_csv/9/dataset_9_autos.arff', na_values = ['?'])
plot_correlation(df, 'price')
plot_correlation(df, 'price', 'bore')

The running result is as follows:

eda.plot: add pairplot

I think seaborn's pairplot function is very useful; it gives a reasonable idea of the relationships between variables: https://seaborn.pydata.org/generated/seaborn.pairplot.html
However, with dataprep, we have to write for loops.
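The loop in question would look roughly like this sketch. `pairwise_plots` is a hypothetical helper; `plot_fn` is expected to be a function with the plot(df, x, y) form, and a stub is used here so the snippet runs without dataprep installed.

```python
from itertools import combinations

def pairwise_plots(df, columns, plot_fn):
    """Call plot_fn(df, x, y) once for every unordered pair of columns."""
    return {(x, y): plot_fn(df, x, y) for x, y in combinations(columns, 2)}

def stub_plot(df, x, y):
    # Stand-in for plot(df, x, y); returns a label instead of drawing.
    return f"{x} vs {y}"

plots = pairwise_plots(None, ["age", "fare", "pclass"], stub_plot)
```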