autoviml / autoviz

Automatically Visualize any dataset, any size with a single line of code. Created by Ram Seshadri. Collaborators Welcome. Permission Granted upon Request.

License: Apache License 2.0

Python 100.00%

visualization python3 python xgboost automl scikit-learn machine-learning tableau automl-algorithms tpot

autoviz's Introduction

Join our elite team of contributors!

👋 Welcome to the AutoViML Fan Club Page!
We just hit 3300 stars collectively for all AutoViML libraries on Github!!

AutoViML creates innovative Open Source libraries to make data scientists' and machine learning engineers' lives easier and more productive!

Our innovative libraries so far:

🤝 AutoViz Automatically Visualizes any dataset, any size with a single line of code. Now with Bokeh and Holoviews it can make your charts and dashboards interactive!
🤝 Auto_ViML Automatically builds multiple ML models with a single line of code. Uses scikit-learn, XGBoost and CatBoost.
🤝 Auto_TS Automatically builds ARIMA, SARIMAX, VAR, FB Prophet and XGBoost Models on Time Series data sets with a Single Line of Code. Now updated with DASK to handle millions of rows.
🤝 Featurewiz Uses advanced feature engineering strategies and selects the best features from your data set fast with a single line of code. Now updated with DASK to handle millions of rows.
🤝 Deep_AutoViML Builds tensorflow keras models and pipelines for any data set, any size with text, image and tabular data, with a single line of code.
🤝 lazytransform Automatically transforms all categorical, date-time and NLP variables to numeric in a single line of code, for any data set, any size.
🤝 pandas_dq Automatically finds and fixes data quality issues in your dataset with a single line of code, for pandas.

Feb-2024: Added "Auto Encoders" for automatic feature extraction to featurewiz library for #feature-extraction

On Feb 8, 2024, we released a major update to our popular "featurewiz" library that will transform your input into a latent space with a dimension of latent_dim. This lower dimension (similar to PCA) will enable you to extract the best patterns in your data for the toughest imbalanced class and multi-class problems. Try it and let us know!
how to use autoencoders in featurewiz

April-2023: Released a major new python library "pandas_dq" #data_quality #dataengineering

On April 2, 2023, we released a major new Python library called "pandas_dq" that will automatically find and fix data quality issues in your train and test dataframes in a single line of code, for any data set, any size.

April-2022: Released a major new python library "lazytransform" #featureengineering #featureselection

On April 3, 2022, we released a major new Python library called "lazytransform" that will automatically transform all categorical, date-time and NLP variables to numeric in a single line of code, for any data set, any size.

Jan-2022: Major upgrade to featurewiz: you can now perform feature selection thru fit and transform #MLOps #featureselection

As of version 0.0.90, featurewiz has a scikit-learn compatible feature selection transformer called FeatureWiz. You can use it to perform fit and predict. You will get a Scikit-Learn Transformer object that you can add to other data pipelines in MLOps to select the top variables from your dataset.

Dec-23-2021 Update: AutoViz now does Wordclouds! #autoviz #wordcloud

AutoViz can now create Wordclouds automatically for your NLP variables in data. It detects NLP variables automatically and creates wordclouds for them.

Dec 21, 2021: AutoViz now runs on Docker containers as part of MLOps pipelines. Check out Orchest.io

We are excited to announce that AutoViz and Deep_AutoViML are now available as containerized applications on Docker. This means that you can build data pipelines using a fantastic tool like orchest.io to build MLOps pipelines visually. Here are two sample pipelines we have created:

AutoViz pipeline: https://lnkd.in/g5uC-z66 Deep_AutoViML pipeline: https://lnkd.in/gdnWTqCG

You can find more examples and a wonderful video on orchest's web site

Dec-17-2021 AutoViz now uses HoloViews to display dashboards with Bokeh and save them as Dynamic HTML for web serving #HTML #Bokeh #Holoviews

Now you can use AutoViz to create Interactive Bokeh charts and dashboards (see below) either in Jupyter Notebooks or in the browser. Use chart_format as follows:

chart_format='bokeh': interactive Bokeh dashboards are plotted in Jupyter Notebooks.
chart_format='server': dashboards will pop up for each kind of chart on your web browser.
chart_format='html': interactive Bokeh charts will be silently saved as Dynamic HTML files under the AutoViz_Plots directory.
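
As a rough sketch of how these options might be passed, the helper below builds the keyword arguments for an AV.AutoViz call. The names INTERACTIVE_FORMATS and interactive_kwargs are illustrative assumptions and not part of AutoViz; the filename, dfte, depVar and chart_format arguments are the ones that appear in the AutoViz calls quoted in the issues on this page.

```python
# Hypothetical helper: INTERACTIVE_FORMATS and interactive_kwargs are
# illustrative names, not part of the AutoViz API.
INTERACTIVE_FORMATS = {
    "bokeh": "dashboards are plotted inside the Jupyter Notebook",
    "server": "a dashboard opens in the web browser for each kind of chart",
    "html": "charts are saved as HTML files under AutoViz_Plots",
}


def interactive_kwargs(df, chart_format, target=""):
    """Build keyword arguments for AV.AutoViz with an interactive format."""
    if chart_format not in INTERACTIVE_FORMATS:
        raise ValueError(f"unsupported interactive chart_format: {chart_format!r}")
    return {
        "filename": "",      # empty because the data comes from a DataFrame
        "dfte": df,
        "depVar": target,    # empty string means no target variable
        "chart_format": chart_format,
    }


# Usage, assuming autoviz is installed and df is a pandas DataFrame:
#   from autoviz.AutoViz_Class import AutoViz_Class
#   AV = AutoViz_Class()
#   AV.AutoViz(**interactive_kwargs(df, "html"))
```

Rejecting unknown values up front is a design choice of this sketch; it only covers the three interactive formats listed above.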


autoviz's Issues

`html` format file output

Most of the other EDA tools support html format file output.
Do you have any plans to support html format file output function, taking the output path of the file as an argument?
Thank you.

DataFrame

Dear all,
I couldn't figure out how to pass a dataframe (instead of a csv file) to AV.AutoViz.
Could somebody please give me a short hint?

thanks in advance.
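
For reference, other issues on this page pass the DataFrame through the dfte argument and leave filename as an empty string. A minimal sketch of that pattern follows; the wrapper name visualize_dataframe is hypothetical, and av is assumed to be an AutoViz_Class instance created by the caller.

```python
def visualize_dataframe(av, df, target=""):
    """Run AutoViz on an in-memory DataFrame instead of a CSV file.

    av     : an AutoViz_Class instance (assumed to be created by the caller)
    df     : the pandas DataFrame to visualize
    target : name of the dependent variable, or "" for none
    """
    # filename stays empty; the data is handed over through dfte
    return av.AutoViz(filename="", dfte=df, depVar=target)


# Usage, assuming autoviz and pandas are installed:
#   import pandas as pd
#   from autoviz.AutoViz_Class import AutoViz_Class
#   df = pd.read_csv("Iris.csv")
#   AV = AutoViz_Class()
#   visualize_dataframe(AV, df, target="Species")
```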

incorrect categorical variable assignment

I have a dataframe that has 32 columns of types object and float32.
One of the float32 columns is treated as categorical by AutoViz.
How can I exclude that column from being treated as categorical?

some variables in data removed automatically

Hi,
I gave an input CSV containing 20 variables. While preprocessing, it removed all the important columns. May I know the reason?
Note: the removed columns contain full data without null values.

Unable to Hide Plots

Setting verbose=2 does not hide the plots from being shown in either python script or python notebook.

Module autoviz not found

ModuleNotFoundError: No module named 'autoviz'

Let me know if I am doing anything wrong
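
One common cause of this error is that the package was installed for a different Python interpreter than the one running the code. The sketch below uses only the standard library to check whether autoviz is importable from the current interpreter and to print a matching install command. The helper names module_available and install_command are illustrative.

```python
import importlib.util
import sys


def module_available(name: str) -> bool:
    """Return True if the top-level module can be found by this interpreter."""
    return importlib.util.find_spec(name) is not None


def install_command(package: str) -> str:
    """Build a pip command tied to the interpreter that is running now."""
    return f"{sys.executable} -m pip install {package}"


if not module_available("autoviz"):
    print("autoviz is not installed for this interpreter. Try:")
    print(install_command("autoviz"))
```

Running pip through sys.executable avoids installing into a different environment than the one that raised the error.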

Categories barplot with target same color

Hi!
I found that in the category bar plots with target color splitting, the color is the same for both target values. I uploaded an example here: test AutoViz on adult data notebook.

By the way, great job with this package !
Nathan.

How do we see output using a script file in a terminal?

Hi AutoViML,

Firstly, congratulations and thanks for this wonderful package.
This works perfectly fine with Jupyter notebooks, but how do I use it from an IDE, let's say Spyder?

Thanks in advance.
Mohit

AutoViz misidentifies my dependent variable as a categorical variable, which is in fact a continuous variable.

My dependent variable is the Loneliness scale score, within the range of 1 to 4.

When I run the basic code of Autoviz() below, I do not get any results regarding my dependent variable.
report_AV = AV.AutoViz('', dfte=data)

When I run the code containing the depVar argument below, I get results that appear to treat my dependent variable as a categorical variable. This makes the results useless for my research.
report_AV = AV.AutoViz('', dfte=data, depVar='Loneliness')

Here are some examples that I get from the above code.

I've checked with the datatypes of my dataframe, and my dependent variable column's datatype is float64.

Is there any way to solve this issue?

Verbose = 2 does not save images

Running verbose = 2 does not save the images anywhere!
Can you please see this issue?

Error thrown while running Autoviz in python file

I have a file called Autoviz.py with the following lines of code

from autoviz.AutoViz_Class import AutoViz_Class

AV = AutoViz_Class()

filename = "Iris.csv"
sep = ","
dft = AV.AutoViz(
    filename,
    sep=",",
    depVar="",
    dfte=None,
    header=0,
    verbose=2,
    lowess=False,
    chart_format="png",
)

Now when I run this Python file from the terminal with python Autoviz.py, I get the following error.

(saana) E:\on_nine_ai\testing data>python Autoviz.py
Traceback (most recent call last):
  File "Autoviz.py", line 1, in <module>
    from autoviz.AutoViz_Class import AutoViz_Class
  File "C:\Users\LENOVO\anaconda3\envs\saana\lib\site-packages\autoviz\__init__.py", line 2, in <module>
    from autoviz.AutoViz_Class import AutoViz_Class
  File "C:\Users\LENOVO\anaconda3\envs\saana\lib\site-packages\autoviz\AutoViz_Class.py", line 61, in <module>
    from autoviz.AutoViz_Holo import AutoViz_Holo
  File "C:\Users\LENOVO\anaconda3\envs\saana\lib\site-packages\autoviz\AutoViz_Holo.py", line 31, in <module>
    get_ipython().magic('matplotlib inline')
NameError: name 'get_ipython' is not defined

But this does not happen if I run the same lines of code in a python notebook. I understand that when verbose is set to 0 or 1, the plots are generated interactively in python notebooks. But when I set verbose to 2 and run a python file, I expect a folder to be created and all the result images to be stored inside that. Please help me out with this.
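
The traceback shows AutoViz_Holo.py calling get_ipython() at import time, a name that only exists inside IPython or Jupyter. A minimal sketch of how such a call could be guarded is shown here. It is not the library's actual fix, and the helper names are hypothetical.

```python
def running_in_ipython() -> bool:
    """Return True only when the IPython-provided get_ipython name exists."""
    try:
        get_ipython  # noqa: F821 - defined only by IPython/Jupyter
    except NameError:
        return False
    return True


def enable_inline_plots() -> bool:
    """Run the '%matplotlib inline' magic only when a shell is available."""
    if not running_in_ipython():
        return False  # plain 'python script.py': skip the magic entirely
    get_ipython().run_line_magic("matplotlib", "inline")  # noqa: F821
    return True


print("inline plotting enabled:", enable_inline_plots())
```

With a guard like this, importing the module from a plain script would not raise NameError; whether images are then saved to disk depends on the rest of the library.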

Project logo [help wanted]

If anyone with design sensibilities sees this: we are open to changing the project logo.

We like the pandas logo for example https://github.com/pandas-dev/pandas

No plot visible in local Jupyter

from autoviz.AutoViz_Class import AutoViz_Class
%matplotlib
AV = AutoViz_Class()
df = AV.AutoViz(filename='',dfte=train,depVar='Species',verbose=1)

Using matplotlib backend: Qt5Agg
Shape of your Data Set loaded: (150, 5)
############## C L A S S I F Y I N G V A R I A B L E S ####################
Classifying variables in data set...
Number of Numeric Columns = 4
Number of Integer-Categorical Columns = 0
Number of String-Categorical Columns = 0
Number of Factor-Categorical Columns = 0
Number of String-Boolean Columns = 0
Number of Numeric-Boolean Columns = 0
Number of Discrete String Columns = 0
Number of NLP String Columns = 0
Number of Date Time Columns = 0
Number of ID Columns = 0
Number of Columns to Delete = 0
4 Predictors classified...
No variables removed since no ID or low-information variables found in data set

################ Multi_Classification VISUALIZATION Started #####################
Data Set Shape: 150 rows, 5 cols
Data Set columns info:

SepalLengthCm: 0 nulls, 35 unique vals, most common: {5.0: 10, 5.1: 9}
SepalWidthCm: 0 nulls, 23 unique vals, most common: {3.0: 26, 2.8: 14}
PetalLengthCm: 0 nulls, 43 unique vals, most common: {1.5: 14, 1.4: 12}
PetalWidthCm: 0 nulls, 22 unique vals, most common: {0.2: 28, 1.3: 13}
Species: 0 nulls, 3 unique vals, most common: {'Iris-setosa': 50, 'Iris-versicolor': 50}

Columns to delete:
' []'
Boolean variables %s
' []'
Categorical variables %s
' []'
Continuous variables %s
" ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']"
Discrete string variables %s
' []'
Date and time variables %s
' []'
ID variables %s
' []'
Target variable %s
' Species'
Total Number of Scatter Plots = 10
No categorical or boolean vars in data set. Hence no pivot plots...
No categorical or numeric vars in data set. Hence no bar charts.
Time to run AutoViz = 2 seconds

###################### AUTO VISUALIZATION Completed ########################

but no plot.

In Kaggle it was working fine

https://www.kaggle.com/gauravduttakiit/multi-classification-problem-iris/notebook

The title part at the top of the output image is cut off

When using verbose=2 to output an svg or png file, the title at the top of the image is cut off.
There seems to be a problem with the height value setting, please check.

JSON file example

Is there an example where I could use a JSON file with AutoViz? I seem to be running into errors. It would be good to have a sample file that works.

AutoViz couldn't process 100k records. Any solutions?

AutoViz couldn't process 100k records with 16 features. Any solutions?

chart_format="server" is not working!

Dear,
When I set chart_format="server", the following error occurs:

~\anaconda3\lib\site-packages\autoviz\AutoViz_Holo.py in draw_scatters_hv(dfin, nums, chart_format, problem_type, dep, classes, lowess, mk_dir, verbose)
410 if chart_format in ['server', 'bokeh_server', 'bokeh-server']:
411 #server = pn.serve(hv_all, start=True, show=True)
--> 412 hv_all.show()
413 elif chart_format == 'html':
414 save_html_data(hv_all, chart_format, plot_name, mk_dir)

AttributeError: 'str' object has no attribute 'show'

In addition, the scatterplot in the html output seems unnecessary because the package already outputs pair_scatters! Furthermore, I am not sure whether it is a package error or a problem with my PC, but the scatterplot in the HTML output file is blank!
autoviz_test_server.zip

HTML and BOKEH do not output everything!

AutoViz works well for chart_format svg, but BOKEH and HTML do not work well for every dataset. When running, I encounter:

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
3361 return self._engine.get_loc(casted_key)
3362 except KeyError as err:
-> 3363 raise KeyError(key) from err
3364
3365 if is_scalar(key) and isna(key) and not self.hasnans:

KeyError: ''

After searching the internet, it seems the problem is with pandas!
Sorry, but I do not know how to fix that!
Kind regards,

autoviz_test.zip

JupyterLab/Pandas Dataframe/Bokeh leads to: KeyError: "[''] not in index"

First off- thank you for creating this repository and for the latest Jupyter integration. I'm incredibly excited to use it, thank you for all the hard work!

Brief Description

Given .csv file test.csv:

name,some_string,some_boolean,some_number,some_amt
Kerry Bullock,RFH63GSB6XC,Yes,7,$92.87
Anika Stokes,BYU27VYT1LW,No,65,$48.20
Constance Jensen,KBF13GYN3FV,No,5,$14.28
Malcolm Alvarez,UUK28QNF8BG,No,90,$27.33
Clarke Hanson,KKT63JHC7KC,No,9,$28.52
David Ford,EDC73WSO8PU,No,2,$94.31
Abbot Combs,RRN89HGS1RT,Yes,71,$89.90

(test.csv)

When this code is run in Jupyter Lab:

import pandas as pd
from autoviz.AutoViz_Class import AutoViz_Class

df = pd.read_csv('test.csv')

AV = AutoViz_Class()  # instantiate the class before calling AutoViz
AV.AutoViz(
    filename="",
    dfte=df,
    depVar='',
    verbose=0,
    lowess=False,
    chart_format="bokeh",
)

One bokeh chart is generated and two stack traces are displayed. The first one says:

...

~/anaconda3/envs/reporting/lib/python3.9/site-packages/autoviz/AutoViz_Holo.py in select_widget(each_cat)
    531                 width_size=15
    532                 #######  This is where you plot the histogram of categorical variable input as each_cat ####
--> 533                 conti_df = dft[[dep,each_cat]].groupby(each_cat).mean().reset_index()
    534                 row_ticks = dft[dep].unique().tolist()
    535                 color_list = next(colors)

...

KeyError: "[''] not in index"

AutoViz_holo.py, line 533

Then a chart is displayed, followed by the second stack trace:

...

~/anaconda3/envs/reporting/lib/python3.9/site-packages/autoviz/AutoViz_Holo.py in AutoViz_Holo(filename, sep, depVar, dfte, header, verbose, lowess, chart_format, max_rows_analyzed, max_cols_analyzed)
    192         ls_objects.append(drawobj42)
    193     else:
--> 194         drawobj41 = dfin[dep].hvplot(kind='bar', color='r', title='Histogram of Target variable').opts(
    195                         height=height_size,width=width_size,color='lightgreen', xrotation=70)
    196         drawobj42 = dfin[dep].hvplot(kind='kde', color='g', title='KDE Plot of Target variable').opts(

...

KeyError: ''

AutoViz_holo.py, line 194

In both cases it looks like the code is expecting dep to not be an empty string, and is failing when trying to use the empty string to select a column in the DataFrame.

Detail of the expected change(s) in behaviour

At first glance it looks like some additional checks of dep would help, but it also looks like the cats variable may have an empty string in it which may be the cause of the first stack trace. I'd need to do a deeper dive to get a clearer idea.
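
A rough sketch of the kind of check described above, assuming the fix is simply to leave the target column out when dep is empty or not a real column. The function names are hypothetical and this is not the code in AutoViz_Holo.py.

```python
def usable_target(dep, columns):
    """Return dep if it names an existing column, otherwise None."""
    if dep is None or dep == "":
        return None
    return dep if dep in columns else None


def columns_for_category_plot(dep, each_cat, columns):
    """Select the columns to index with, skipping an empty or unknown target."""
    target = usable_target(dep, columns)
    if target is None:
        return [each_cat]
    return [target, each_cat]


columns = ["name", "some_string", "some_boolean", "some_number", "some_amt"]
# depVar='' as in the report: only the categorical column is selected,
# so the DataFrame is never indexed with the empty string.
print(columns_for_category_plot("", "some_boolean", columns))
print(columns_for_category_plot("some_number", "some_boolean", columns))
```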

addition of example of how to install autoviz to workstation

this is to write an example of how to install autoviz to your user workstation

Could not draw ...

I'm trying to run autoviz on my pandas DataFrame and, oddly, sometimes it works and sometimes it does not, displaying the following:

Shape of your Data Set loaded: (68, 7)
############## C L A S S I F Y I N G  V A R I A B L E S  ####################
Classifying variables in data set...
    6 Predictors classified...
        No variables removed since no ID or low-information variables found in data set

################ Multi_Classification VISUALIZATION Started #####################
Total Number of Scatter Plots = 10
Could not draw Distribution Plots
Could not draw Pivot Charts against Dependent Variable
Time to run AutoViz = 3 seconds 

 ###################### AUTO VISUALIZATION Completed ########################

As far as I can tell, nothing changes between the times when it works and when it does not. What can trigger this kind of error?

Image size is too large error. Autoviz creating enormous image sizes

I tried to use Autoviz on the following dataset: https://www.kaggle.com/c/house-prices-advanced-regression-techniques

Using the following code to call Autoviz:
dftc = AV.AutoViz('../input/house-prices-advanced-regression-techniques/train.csv', depVar='SalePrice', verbose=0, chart_format='bokeh')

It was unable to display the charts and gave the following error:

KeyError Traceback (most recent call last)
/tmp/ipykernel_133/4057966485.py in <module>
----> 1 dftc = AV.AutoViz('../input/house-prices-advanced-regression-techniques/train.csv', verbose=0, chart_format='bokeh')

/opt/conda/lib/python3.7/site-packages/autoviz/AutoViz_Class.py in AutoViz(self, filename, sep, depVar, dfte, header, verbose, lowess, chart_format, max_rows_analyzed, max_cols_analyzed, save_plot_dir)
238 dft = AutoViz_Holo(filename, sep, depVar, dfte, header, verbose,
239 lowess,chart_format,max_rows_analyzed,
--> 240 max_cols_analyzed, save_plot_dir)
241 else:
242 dft = self.AutoViz_Main(filename, sep, depVar, dfte, header, verbose,

/opt/conda/lib/python3.7/site-packages/autoviz/AutoViz_Holo.py in AutoViz_Holo(filename, sep, depVar, dfte, header, verbose, lowess, chart_format, max_rows_analyzed, max_cols_analyzed, save_plot_dir)
193 ls_objects.append(drawobj6)
194 if len(date_vars) > 0:
--> 195 drawobj7 = draw_date_vars_hv(dfin,dep,date_vars, nums, chart_format, problem_type, mk_dir, verbose)
196 ls_objects.append(drawobj7)
197 if len(nums) > 0 and len(cats) > 0:

/opt/conda/lib/python3.7/site-packages/autoviz/AutoViz_Holo.py in draw_date_vars_hv(df, dep, datevars, num_vars, chart_format, modeltype, mk_dir, verbose)
940 if modeltype == 'Regression' or dep == None or dep == '':
941 kind = 'line'
--> 942 hv_plot = dft[num_vars+[dep]].hvplot( height=400, width=600,kind=kind,
943 title='Time Series Plot of all Numeric variables and Target').opts(legend_position='top_left')
944 hv_panel = pn.Row(pn.WidgetBox( kind), hv_plot)

/opt/conda/lib/python3.7/site-packages/pandas/core/frame.py in __getitem__(self, key)
3462 if is_iterator(key):
3463 key = list(key)
-> 3464 indexer = self.loc._get_listlike_indexer(key, axis=1)[1]
3465
3466 # take() does not accept boolean indexers

/opt/conda/lib/python3.7/site-packages/pandas/core/indexing.py in _get_listlike_indexer(self, key, axis)
1312 keyarr, indexer, new_indexer = ax._reindex_non_unique(keyarr)
1313
-> 1314 self._validate_read_indexer(keyarr, indexer, axis)
1315
1316 if needs_i8_conversion(ax.dtype) or isinstance(

/opt/conda/lib/python3.7/site-packages/pandas/core/indexing.py in _validate_read_indexer(self, key, indexer, axis)
1375
1376 not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
-> 1377 raise KeyError(f"{not_found} not in index")
1378
1379

KeyError: "[''] not in index"

Error in callback <function install_repl_displayhook.<locals>.post_execute at 0x7f01919304d0> (for post_execute):

ValueError Traceback (most recent call last)
/opt/conda/lib/python3.7/site-packages/matplotlib/pyplot.py in post_execute()
136 def post_execute():
137 if matplotlib.is_interactive():
--> 138 draw_all()
139
140 try: # IPython >= 2

/opt/conda/lib/python3.7/site-packages/matplotlib/_pylab_helpers.py in draw_all(cls, force)
135 for manager in cls.get_all_fig_managers():
136 if force or manager.canvas.figure.stale:
--> 137 manager.canvas.draw_idle()
138
139

/opt/conda/lib/python3.7/site-packages/matplotlib/backend_bases.py in draw_idle(self, *args, **kwargs)
2058 if not self._is_idle_drawing:
2059 with self._idle_draw_cntx():
-> 2060 self.draw(*args, **kwargs)
2061
2062 @property

/opt/conda/lib/python3.7/site-packages/matplotlib/backends/backend_agg.py in draw(self)
429 def draw(self):
430 # docstring inherited
--> 431 self.renderer = self.get_renderer(cleared=True)
432 # Acquire a lock on the shared font cache.
433 with RendererAgg.lock, \

/opt/conda/lib/python3.7/site-packages/matplotlib/backends/backend_agg.py in get_renderer(self, cleared)
445 and getattr(self, "_lastKey", None) == key)
446 if not reuse_renderer:
--> 447 self.renderer = RendererAgg(w, h, self.figure.dpi)
448 self._lastKey = key
449 elif cleared:

/opt/conda/lib/python3.7/site-packages/matplotlib/backends/backend_agg.py in __init__(self, width, height, dpi)
91 self.width = width
92 self.height = height
---> 93 self._renderer = _RendererAgg(int(width), int(height), dpi)
94 self._filter_renderers = []
95

ValueError: Image size of 2000x81750 pixels is too large. It must be less than 2^16 in each direction.
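
The final error comes from Matplotlib's Agg renderer, which, per the message, needs each dimension to stay below 2^16 = 65536 pixels; the requested height of 81750 pixels exceeds that. Below is a small sketch of the arithmetic, with hypothetical helper names, that checks a canvas size and caps a figure height for a given DPI.

```python
MAX_PIXELS = 2 ** 16  # limit quoted in the error message above


def fits_agg_canvas(width_px: int, height_px: int) -> bool:
    """Check both dimensions against the renderer's pixel limit."""
    return width_px < MAX_PIXELS and height_px < MAX_PIXELS


def capped_height_inches(requested_inches: float, dpi: float) -> float:
    """Return the requested height, reduced if it would exceed the limit."""
    limit_inches = (MAX_PIXELS - 1) / dpi
    return min(requested_inches, limit_inches)


print(fits_agg_canvas(2000, 81750))      # False: 81750 is above the limit
print(capped_height_inches(1000, 100))   # height reduced to fit at 100 DPI
```

Splitting a very tall grid of subplots across several figures is another option; the cap above only illustrates where the limit sits.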

lightgbm instead of xgboost

Is it possible to use lightgbm instead of xgboost?
It is much faster and lighter.

DataFrame as input

Hey just wondering if you're thinking about the ability to just pass a dataframe to AutoViz instead of the file.

I can help by creating a PR for it

Misplaced graph x ylabel

Hi Ram,
I have tried this package and found out a potential bug.
When I ran AV.AutoViz('', ',', 'target', df), the x and y labels of each graph were swapped (the x label was placed where the y label should be, and vice versa). I tried two datasets and it happened with both. Please look into this and see whether it is a bug or I just did something wrong. Thanks!
Jeff

exporting the report

Similar projects to AutoViz are Sweetviz and Pandas Profiling.

They can export the report as an HTML file.
I wonder if this library also has this function?

Save plots as png images.

Can we save all visualizations generated by Autoviz as png image files in the current working directory?

AutoViz Crashes with the Error

When I try to apply AutoViz to analyze the data of one of the Kaggle competitions (namely, https://www.kaggle.com/c/lish-moa/data), it crashes.

Below is the error output I get

Imported AutoViz_Class version: 0.0.68. Call using: 
    from autoviz.AutoViz_Class import AutoViz_Class
    AV = AutoViz_Class()
    AutoViz(filename, sep=',', depVar='', dfte=None, header=0, verbose=0,
                            lowess=False,chart_format='svg',max_rows_analyzed=150000,max_cols_analyzed=30)
            
To remove previous versions, perform 'pip uninstall autoviz'
Shape of your Data Set: (21948, 876)
Classifying variables in data set...
    875 Predictors classified...
        This does not include the Target column(s)
    2 variables removed since they were ID or low-information variables
    List of variables removed: ['sig_id', 'cp_type']
Since Number of Rows in data 21948 exceeds maximum, randomly sampling 2500 rows for EDA...
872 numeric variables in data exceeds limit, taking top 40 variables
Number of numeric variables = 872
    Number of variables removed due to high correlation = 227 
    Adding 1 categorical variables to reduced numeric variables  of 645
Selected No. of variables = 646 
Finding Important Features...
Not able to read or load file. Please check your inputs and try again...

My code to reproduce the problem is provided in https://gist.github.com/gvyshnya/7644fd77567051203ad96d95fbc7ef2a

I run that code on my local machine (not in a Kaggle kernel). The above-mentioned code expects the data files from the competition to be placed in data subfolder (relative to the folder where you place the python script with the code).

Below are the key details about my OS and Python Environment

Windows 10
Python 3.7 in Anaconda
AutoViz_Class version: 0.0.68

The trace from pd.show_versions(as_json=False) on my machine is provided below, just in case

INSTALLED VERSIONS
------------------
commit           : 2a7d3326dee660824a8433ffd01065f8ac37f7d6
python           : 3.7.0.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
Version          : 10.0.18362
machine          : AMD64
processor        : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : None.None

pandas           : 1.1.2
numpy            : 1.19.2
pytz             : 2018.5
dateutil         : 2.7.3
pip              : 20.1
setuptools       : 49.2.0
Cython           : 0.28.5
pytest           : 5.3.2
hypothesis       : None
sphinx           : 1.7.9
blosc            : None
feather          : None
xlsxwriter       : 1.1.0
lxml.etree       : 4.2.5
html5lib         : 1.0.1
pymysql          : None
psycopg2         : None
jinja2           : 2.11.1
IPython          : 6.5.0
pandas_datareader: None
bs4              : 4.6.3
bottleneck       : 1.2.1
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : 3.1.2
numexpr          : 2.6.8
odfpy            : None
openpyxl         : 2.5.6
pandas_gbq       : 0.12.0
pyarrow          : None
pytables         : None
pyxlsb           : None
s3fs             : None
scipy            : 1.4.1
sqlalchemy       : 1.2.11
tables           : 3.4.4
tabulate         : 0.8.2
xarray           : None
xlrd             : 1.1.0
xlwt             : 1.3.0
numba            : 0.48.0

"Not able to read or load file. Please check your inputs and try again..."

Hello Ram,
When I run the code on my dataset, dft = av.AutoViz('', sep, target, df),
I get this error:
"Not able to read or load file. Please check your inputs and try again..."

What could the issue be?

Normed Histogram plot with negative y value?

Hi, all the plots I get have negative y values. How should I interpret this?

I think the following code generates the plots.
sns.distplot(dft.loc[dft[dep]==target_var][each_conti],bins=binsize, ax= ax1,
label=target_var, hist=False, kde=True,
color=color2)
legend_flag += 1

Data Viz for training data after making the split

We should explore data after making a train-test split to avoid data leakage.
How can I supply a data frame (training data only) to df.Autoviz() function? I tried supplying dataframe and leaving filename as an empty string but it's not giving me charts.

My Code:

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('https://raw.githubusercontent.com/arora123/Data/master/WA_Fn-UseC_-Telco-Customer-Churn.csv')

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state =1)

!pip install autoviz
# To import AutoViz_Class from autoviz-AutoViz_Class

from autoviz.AutoViz_Class import AutoViz_Class 

#To initialize class
av = AutoViz_Class()

av.AutoViz('', sep=',', depVar='Churn', dfte=pd.DataFrame(x, y), 
           header=1, verbose=1, lowess=False, chart_format='svg',)

Output

Shape of your Data Set: (7043, 20)
############## C L A S S I F Y I N G V A R I A B L E S ####################
Classifying variables in data set...
Not able to read or load file. Please check your inputs and try again...
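
In the snippet above, x and y are never defined, and pd.DataFrame(x, y) uses y as the row index rather than adding it as a column, so the frame handed to AutoViz may not contain a Churn column at all. Whether that is what triggers the message is an assumption. A minimal sketch, assuming the goal is to visualize only the training rows, is to rejoin the training features with the training target and pass that frame as dfte. The small frame below is a made-up stand-in for the Telco churn data, and rebuild_training_frame is a hypothetical helper.

```python
import pandas as pd


def rebuild_training_frame(x_train: pd.DataFrame, y_train: pd.Series) -> pd.DataFrame:
    """Join features and target back together, aligned on the row index."""
    return pd.concat([x_train, y_train], axis=1)


# Made-up stand-in for the Telco churn data
df = pd.DataFrame({
    "tenure": [1, 34, 2, 45],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30],
    "Churn": ["No", "No", "Yes", "No"],
})
x = df.drop(columns=["Churn"])
y = df["Churn"]

# Stand-in for train_test_split: keep the first three rows as training data
x_train, y_train = x.iloc[:3], y.iloc[:3]
train_df = rebuild_training_frame(x_train, y_train)
print(train_df)

# Then, assuming autoviz is installed:
#   av.AutoViz('', sep=',', depVar='Churn', dfte=train_df,
#              header=0, verbose=1, lowess=False, chart_format='svg')
```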

IndexError: list index out of range

Hi. We have recently integrated AutoViz into PyCaret and I think I found an edge case here that needs to be fixed in AutoViz. This problem only happens when the dataset only has 1 numeric feature. My guess is that it needs at least 2 variables for the scatter plot. The expected fix will basically involve some kind of exception handling.

To reproduce the error:

pip install pycaret

from pycaret.datasets import get_data
data = get_data('cancer')

from pycaret.classification import *
s = setup(data, target = 'Class', session_id = 123, silent = True)

eda()

IndexError Traceback (most recent call last)
in
----> 1 eda()

~\pycaret\pycaret\classification.py in eda(data, target, display_format, **kwargs)
2946 None
2947 """
-> 2948 return pycaret.internal.tabular.eda(
2949 data=data, target=target, display_format=display_format, **kwargs
2950 )

~\pycaret\pycaret\internal\tabular.py in eda(data, target, display_format, **kwargs)
10397
10398 AV = AutoViz_Class()

10399 AV.AutoViz(
10400 filename="", dfte=data, depVar=target, chart_format=display_format, **kwargs
10401 )

~\anaconda3\envs\pycaret-dev\lib\site-packages\autoviz\AutoViz_Class.py in AutoViz(self, filename, sep, depVar, dfte, header, verbose, lowess, chart_format, max_rows_analyzed, max_cols_analyzed, save_plot_dir)
236 ####################################################################################
237 if chart_format.lower() in ['bokeh','server','bokeh_server','bokeh-server', 'html']:
--> 238 dft = AutoViz_Holo(filename, sep, depVar, dfte, header, verbose,
239 lowess,chart_format,max_rows_analyzed,
240 max_cols_analyzed, save_plot_dir)

~\anaconda3\envs\pycaret-dev\lib\site-packages\autoviz\AutoViz_Holo.py in AutoViz_Holo(filename, sep, depVar, dfte, header, verbose, lowess, chart_format, max_rows_analyzed, max_cols_analyzed, save_plot_dir)
175 ### You can draw pair scatters only if there are 2 or more numeric variables ####
176 if len(nums) >= 2:
--> 177 drawobj2 = draw_pair_scatters_hv(dfin, nums, problem_type, chart_format, dep,
178 classes, lowess, mk_dir, verbose)
179 ls_objects.append(drawobj2)

~\anaconda3\envs\pycaret-dev\lib\site-packages\autoviz\AutoViz_Holo.py in draw_pair_scatters_hv(dfin, nums, problem_type, chart_format, dep, classes, lowess, mk_dir, verbose)
521 quantileable = [x for x in nums if len(dft[x].unique()) > 20]
522
--> 523 x = pnw.Select(name='X-Axis', value=quantileable[0], options=quantileable)
524 y = pnw.Select(name='Y-Axis', value=quantileable[1], options=quantileable)
525 size = pnw.Select(name='Size', value='None', options=['None'] + quantileable)

IndexError: list index out of range

The code inside PyCaret that integrates AutoViz is as follows:

    from autoviz.AutoViz_Class import AutoViz_Class

    AV = AutoViz_Class()
    AV.AutoViz(
        filename="", dfte=data, depVar=target, chart_format=display_format, **kwargs
    )
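
A minimal sketch of the kind of guard the report suggests, assuming the fix is to skip the pair-scatter widgets when fewer than two columns pass the "more than 20 unique values" filter seen in the traceback. The function default_axes and its unique_counts argument are hypothetical, not AutoViz code.

```python
def default_axes(nums, unique_counts, min_unique=20):
    """Pick default X and Y columns, or None when a pair plot is impossible.

    nums          : names of the numeric columns
    unique_counts : mapping of column name to its number of unique values
    """
    quantileable = [col for col in nums if unique_counts[col] > min_unique]
    if len(quantileable) < 2:
        return None  # the caller should skip the pair-scatter plot
    return quantileable[0], quantileable[1]


print(default_axes(["age"], {"age": 44}))
print(default_axes(["age", "bmi"], {"age": 44, "bmi": 120}))
```

Note that the existing len(nums) >= 2 check in the traceback counts all numeric columns, while the list that is indexed only keeps those with more than 20 unique values, so the two can disagree.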

Frequency distribution plot of target column is wrongly interpreted

Hi, when I run AutoViz on the steel classification dataset, the frequency distribution plot for the target column is displayed incorrectly.

AutoViz is recognizing int numbers as categorical vars, not numerical

When plotting charts, my team and I noticed that numerical variables were being plotted as categorical variables. For example, "age" data was plotted as categories, as if it were the same type as the "card category" information.
We worked around this by converting the "age" column from int64 to float, which seemed to work: it was then treated correctly as a numerical variable.
We used this dataset: https://www.kaggle.com/sakshigoyal7/credit-card-customers
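
The workaround described above can be sketched in a few lines of pandas. The rows and column names below are made up for illustration and are not the Kaggle schema. Only the cast itself reflects what the report describes.

```python
import pandas as pd

# Made-up sample rows; "age" and "card_category" are illustrative column names.
df = pd.DataFrame(
    {
        "age": [45, 49, 51, 40],
        "card_category": ["Blue", "Gold", "Blue", "Silver"],
    }
)

# Cast the integer column to float before handing the frame to AutoViz,
# so that it is not classified as an integer-categorical variable.
df["age"] = df["age"].astype("float64")
print(df.dtypes)
```

The frame can then be passed through the `dfte` argument, as in the other reports on this page.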

Installation instructions and sample code not working

You need from autoviz import ... in the sample code. Preferably you should give a sample that can be just copy&pasted and run, and provide pictures of how it looks, so that one could evaluate whether to install this instead of the many other plotting libraries.

The dependencies are extremely heavy. Is it absolutely necessary to install Jupyter? Something inside also depends on sklearn, which was not included in pip deps.

As for CSV reading; if you are not able to autodetect/guess separators and date formats, do not bother "including" it in your library. It is just two lines of code to first load the data with pandas and then use another library for plotting, and in most cases one needs to do something in between anyway (data preprocessing).

An ideal plotting library would have an API like this:

from fictionalplot import Figure  # if possible, keep it to just one simple import

fig = Figure()   # Internally holds graphics context, Qt window, websocket to browser or whatever
fig.plot(df)  # display the graph and return instantly, try to auto-guess suitable format based on df

If using a Qt window, spawn a new process that does not terminate when the Python program ends, and that is automatically shared by all figures of all running programs (don't block execution of the program like Matplotlib does). If using Notebook/browser, you don't need separate process because browser already does that.

For true interactive plots (e.g. receive user input on scaling changes to recalculate new data in Python), use async/await to avoid blocking Python from executing while waiting for user input (but stay away from import asyncio which is utter crap -- instead use trio if you must).

Good luck with your plotting library. We could certainly use some good options (I am not entirely happy with either Matplotlib or Plotly, and everything else is just bad).

encoding issue

I am trying to use AutoViz on a large data set with a shape of (1362132, 83)
THIS READS THE DATA SET

df = pd.read_csv("./Desktop/mgh_multi_gifts_desc_joined_tb_CSV.csv", error_bad_lines=False, engine='python', sep=",", encoding='cp1252')

THIS IS MY NEXT STEP

encoding = "cp1252"
error_bad_lines = False
engine = 'python'
sep = ','
target = 'gift_amount'
datapath = './Desktop/'
filename = 'mgh_multi_gifts_desc_joined_tb_CSV.csv'
df = pd.read_csv(datapath + filename, sep=sep, index_col=None, error_bad_lines=error_bad_lines, engine=engine, encoding=encoding)

WHEN TRYING TO EXECUTE THIS NEXT STEP
dft = AV.AutoViz(datapath+filename, sep=sep, depVar=target, dfte=None, header=0, verbose=0, lowess=False,chart_format='svg',max_rows_analyzed=1500,max_cols_analyzed=30)

GETTING THE FOLLOWING MESSAGE

File encoding decoder utf-8 does not work for this file
File encoding decoder iso-8859-11 does not work for this file
File encoding decoder cpl252 does not work for this file
File encoding decoder latin1 does not work for this file
None of the decoders work...
Not able to read or load file. Please check your inputs and try again...

NOT SURE WHAT TO DO NEXT. ANY HELP WOULD BE MUCH APPRECIATED.

Rows limit

Is there a way to overcome the row limit? I want AutoViz to go through the whole data frame, regardless of how many rows it has.

Use black formatter

Would you welcome a PR adding black formatting to the project? https://github.com/psf/black

AutoViz not working with scikit-learn >= 0.24 on large datasets

Starting from version 0.24, scikit-learn raises an error (instead of a warning) when random_state is passed to KFold or StratifiedKFold without setting shuffle to True.
When AutoViz is used with a large dataset, the KFold in the function find_top_features_xgb, defined as kf = KFold(n_splits=n_splits, random_state=33), raises a ValueError, and the whole auto visualization terminates with the message "Not able to read or load file. Please check your inputs and try again...".
If the intent is to shuffle the data in the KFold, shuffle=True should be passed explicitly, because otherwise the data is not shuffled; if the intent is not to shuffle the data, the random_state parameter should be removed.

A simple dataset to use to reproduce the issue can be found on Kaggle at this URL.
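
The two options described in this report can be sketched as follows. This is an illustration of the scikit-learn call only, not the actual patch to `find_top_features_xgb`. The ten sample rows are made up.

```python
from sklearn.model_selection import KFold

n_splits = 5
samples = [[i] for i in range(10)]  # made-up data: ten single-feature rows

# Option 1: shuffle reproducibly. random_state is only accepted with shuffle=True.
shuffled = KFold(n_splits=n_splits, shuffle=True, random_state=33)

# Option 2: keep the original row order. random_state is left out entirely.
ordered = KFold(n_splits=n_splits)

for train_idx, test_idx in ordered.split(samples):
    print("test rows:", list(test_idx))
```

Which option is appropriate depends on whether the original code intended to shuffle. The report leaves that question open.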

Set output results path

Hi,

How can I control the output file path?

Thanks,
Boris

Is it possible to have the input as dataframe instead of file?

If the input could be a DataFrame, AutoViz could be used in real time, and the data could come from any source or file type.

Read CSV file with different encodings

Hi. I'm trying to use the library with a CSV file that uses "ISO-8859-1" encoding, and the log says:

pandas ascii encoder does not work for this file. Continuing...
pandas utf-8 encoder does not work for this file. Continuing...
pandas iso-8859-1 encoder does not work for this file. Continuing...

After checking the source code, I found that there is a bug in the AutoViz_Utils.py file:

Here there is a for loop to try with different encodings but, as it can be seen, the encoding parameter of the pd.read_csv function is always set to None.

Please, check this, maybe I'm missing something.
Thanks in advance.
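
For a loop like the one described above to have any effect, each candidate encoding has to be passed through on every iteration rather than `None`. A minimal standard-library sketch of that idea follows. `first_working_encoding` is a hypothetical helper, not AutoViz code. The name it returns is meant to be handed on to `pd.read_csv(path, encoding=...)`.

```python
import os
import tempfile
from typing import Iterable, Optional


def first_working_encoding(
    path: str,
    candidates: Iterable[str] = ("ascii", "utf-8", "iso-8859-1"),
) -> Optional[str]:
    """Return the first candidate codec that decodes the whole file, else None."""
    with open(path, "rb") as handle:
        raw = handle.read()
    for encoding in candidates:
        try:
            # The loop variable is actually used here, unlike in the reported bug.
            raw.decode(encoding)
        except UnicodeDecodeError:
            continue
        return encoding
    return None


# Demo: a small CSV written in ISO-8859-1 that contains non-ASCII characters.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write("name,city\nJosé,Málaga\n".encode("iso-8859-1"))
    demo_path = tmp.name

detected = first_working_encoding(demo_path)
os.remove(demo_path)
print(detected)
```

Because ISO-8859-1 maps every byte to a character, it never fails to decode, so it belongs at the end of the candidate list as a last resort. The sketch reads the whole file into memory, which may not suit very large files.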

[Minor] AutoViz Crashes on the analysis of a dataset without any significant variables

If AV is fed a dataset in which it does not find any significant variable to analyze (vs. the specified target variable), it crashes.

The code to reproduce the issue is provided in https://gist.github.com/gvyshnya/c53321dbe947cc55fec91ccf6ae07294

The environment to reproduce the problem is the same as indicated in #26

The expected behaviour would be to finish the analysis session gracefully, with an informative message to the user and without a crash.

[suggestion] colab example notebooks

Include some colab notebooks in the examples so lazy people (ahem) can just click and open them to see it work.

[bug] problem with time series charts

Here is minimal reproducible example with google colab:

The date-time column is not recognized when the input is a file:

!pip install autoviz
from autoviz.AutoViz_Class import AutoViz_Class

import pandas as pd

AV = AutoViz_Class()

df = pd.DataFrame({'time': ['2020-01-15', '2020-02-15', '2020-03-15', '2020-04-15', '2020-05-15'], 'values': [1.0,2.5,3.2,4.2,5.6]})
df['time'] = pd.to_datetime(df['time'])
df.to_csv('ts.csv', index=False)

dft = AV.AutoViz("ts.csv", verbose=2)

Shape of your Data Set loaded: (5, 2)
############## C L A S S I F Y I N G  V A R I A B L E S  ####################
Classifying variables in data set...
Data Set Shape: 5 rows, 2 cols
Data Set columns info:
* time: 0 nulls, 5 unique vals, most common: {'2020-05-15': 1, '2020-03-15': 1}
* values: 0 nulls, 5 unique vals, most common: {3.2: 1, 5.6: 1}
--------------------------------------------------------------------
    Numeric Columns: ['values']
    Integer-Categorical Columns: []
    String-Categorical Columns: []
    Factor-Categorical Columns: []
    String-Boolean Columns: []
    Numeric-Boolean Columns: []
    Discrete String Columns: []
    NLP text Columns: []
    Date Time Columns: []
    ID Columns: ['time']
    Columns that will not be considered in modeling: []
    2 Predictors classified...
        This does not include the Target column(s)
        1 variables removed since they were ID or low-information variables
    List of variables removed: ['time']
No categorical or numeric vars in data set. Hence no bar charts.
Time to run AutoViz (in seconds) = 0.562

When input is dataframe - chart is not generated, but date time column is recognized:

dft = AV.AutoViz("", dfte=df, verbose=2)
Shape of your Data Set loaded: (5, 2)
############## C L A S S I F Y I N G  V A R I A B L E S  ####################
Classifying variables in data set...
Data Set Shape: 5 rows, 2 cols
Data Set columns info:
* time: 0 nulls, 5 unique vals, most common: {Timestamp('2020-05-15 00:00:00'): 1, Timestamp('2020-04-15 00:00:00'): 1}
* values: 0 nulls, 5 unique vals, most common: {3.2: 1, 5.6: 1}
--------------------------------------------------------------------
    Numeric Columns: ['values']
    Integer-Categorical Columns: []
    String-Categorical Columns: []
    Factor-Categorical Columns: []
    String-Boolean Columns: []
    Numeric-Boolean Columns: []
    Discrete String Columns: []
    NLP text Columns: []
    Date Time Columns: ['time']
    ID Columns: []
    Columns that will not be considered in modeling: []
    2 Predictors classified...
        This does not include the Target column(s)
        No variables removed since no ID or low-information variables found in data set
Could not draw Date Vars
No categorical or numeric vars in data set. Hence no bar charts.
Time to run AutoViz (in seconds) = 0.408

Expected result: chart with date on x-axis, and value on y-axis.

Dependency Installation Versioning

At a workshop, we encountered a couple of issues with dependency versions. The following versions and procedure allow the demo notebook to run.

# in a brand new conda env
conda install jupyter pandas=0.23 matplotlib=3.0.2 seaborn=0.9 xlrd=1.2.0 scikit-learn
pip install xgboost

Not able to read or load file. Please check your inputs and try again...

pandas ascii encoder does not work for this file. Continuing...
pandas utf-8 encoder does not work for this file. Continuing...
pandas iso-8859-1 encoder does not work for this file. Continuing...
pandas cp1252 encoder does not work for this file. Continuing...
pandas latin1 encoder does not work for this file. Continuing...
Not able to read or load file. Please check your inputs and try again...

Hello everyone, this is my first time using AutoViz. After reading the guide, I tried to read a CSV dataset and I am getting this error. Is there any way to fix this, or should I make any changes to the CSV file before using AutoViz? Thanks in advance.

Bokeh Option Unavailable

Per the documentation, when setting chart_format to "bokeh," an interactive dashboard should be created in the Jupyter Notebook.

However, I am receiving the following error:

ValueError: Format 'bokeh' is not supported (supported formats: eps, jpeg, jpg, pdf, pgf, png, ps, raw, rgba, svg, svgz, tif, tiff)

Suggested Updates for Wordcloud

1. Updating Stopwords List

Currently, the stop words are defined as a hard-coded list, and it is missing a few stop words such as "themselves".

def return_stop_words():
    STOP_WORDS = ['it', "this", "that", "to", 'its', 'am', 'is', 'are', 'was', 'were', 'a',
                'an', 'the', 'and', 'or', 'of', 'at', 'by', 'for', 'with', 'about', 'between',
                 'into','above', 'below', 'from', 'up', 'down', 'in', 'out', 'on', 'over',
                  'under', 'again', 'further', 'then', 'once', 'all', 'any', 'both', 'each',
                   'few', 'more', 'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so',
                    'than', 'too', 'very', 's', 't', 'can', 'just', 'd', 'll', 'm', 'o', 're',
                    've', 'y', 'ain', 'ma']
    add_words = ["s", "m",'you', 'not',  'get', 'no', 'via', 'one', 'still', 'us', 'u','hey','hi','oh','jeez',
                'the', 'a', 'in', 'to', 'of', 'i', 'and', 'is', 'for', 'on', 'it', 'got','aww','awww',
                'not', 'my', 'that', 'by', 'with', 'are', 'at', 'this', 'from', 'be', 'have', 'was',
                '', ' ', 'say', 's', 'u', 'ap', 'afp', '...', 'n', '\\']
    stop_words = list(set(STOP_WORDS+add_words))
    return sorted(stop_words)

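Wouldn't it be better to use the NLTK stop words list?

from nltk.corpus import stopwords

# Example language list (an assumption); the corpus must first be fetched with nltk.download("stopwords")
langs = ["english"]
stop_words = set()
for lang in langs:
    # Collect the stop words of every requested language into one set
    stop_words.update(stopwords.words(lang))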

Copied from: https://gist.github.com/sebleier/554280

2. Lemmatization before plotting

I think it would be better to lemmatize the data before plotting. Words like "reads" and "reading" would then count as the same word, which would give us a better word cloud.

maximum recursion depth exceeded in comparison

I have a CSV file with 40,000 records and I was trying to run AutoViz on this data:

from autoviz.AutoViz_Class import AutoViz_Class
AV = AutoViz_Class()
filename = 'Cleaned_InnerJoinedDataframe.csv'
df1 = AV.AutoViz(filename)

But this fails with the following error:
~\AppData\Local\Continuum\anaconda3\lib\site-packages\matplotlib\font_manager.py in findfont(self, prop, fontext, directory, fallback_to_default, rebuild_if_missing)
1236 return self._findfont_cached(
1237 prop, fontext, directory, fallback_to_default, rebuild_if_missing,
-> 1238 rc_params)
1239
1240 @lru_cache()

RecursionError: maximum recursion depth exceeded while calling a Python object