ydataai / ydata-profiling Goto Github PK

1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.

Home Page: https://docs.profiling.ydata.ai

License: MIT License

Python 92.94% HTML 3.88% Makefile 0.21% Batchfile 0.14% CSS 0.87% Jupyter Notebook 1.73% JavaScript 0.18% Dockerfile 0.04%

pandas-profiling pandas-dataframe statistics jupyter-notebook exploration data-science python pandas machine-learning deep-learning

ydata-profiling's Issues

Low Memory option?

Hey Jos,

We're trying to run the profiler on a small EC2 instance (large files -measured in gigs), we're wondering if there is anything we can do to stop getting the memory errors during the profiling.

Any suggestions would be appreciated.

Also, are you planning on modifications to this library in the near future? What's your thoughts on "next steps" for the profiler?

Thanks and best...Rich Murnane

AttributeError: module 'pandas_profiling' has no attribute 'formatters'

Not sure what's up with that.

Booleans are treated as numeric instead of categorical

A column containing True or False will be summarized like a numeric column and trigger numeric warnings such as was probed successfully has 651907 / 94.4% zeros . Here's an example http://i.imgur.com/OaCPJ86.png . Boolean should be summarized like a categorial variable.

Adding plots for dates attribuets

Man what you have done with pandas-profiling is amazing it saves me hours a week !!
I think it would be great to have an additional plot for date attributes .

thx

Duplicated functions?

Hi Jos,

It seems to me there are (almost) duplicated functions in base.py with describe having inner functions that are almost identical to the outer functions. I've chopped those off, as seen here:

dpavlic@20df9e6

The tests all run and everything still generates as usual. Am I missing something, or can these go?

Numeric with low Distinct count should be treated as "Categorical"

I guess <15 is ok, because mean, min, max and zeros are kind of useless

ValueError when generating profile from dataframe

I have a pandas dataframe, when I try to generated report from df.head(), everything works fine. But when I try to generate report for the whole dataframe, I get this error:

ValueError                                Traceback (most recent call last)
<ipython-input-54-b5ad826014b5> in <module>()
----> 1 pandas_profiling.ProfileReport(df)

./python3.5/site-packages/pandas_profiling/__init__.py in __init__(self, df, **kwargs)
     15         sample = kwargs.get('sample', df.head())
     16 
---> 17         description_set = describe(df, **kwargs)
     18 
     19         self.html = to_html(sample,

./python3.5/site-packages/pandas_profiling/base.py in describe(df, bins, correlation_overrides, pool_size, **kwargs)
    307 
    308         confusion_matrix=pd.crosstab(data1,data2)
--> 309         if confusion_matrix.values.diagonal().sum() == len(df):
    310             ldesc[name1] = pd.Series(['RECODED', name2], index=['type', 'correlation_var'])
    311 

ValueError: diag requires an array of at least two dimensions

HDFStore chunksize

Is it possible to profile a large dataset in HDF5 file store by iterating over chunks? Something perhaps like the following:

pandas_profiling.ProfileReport(store.select('table',chunksize=10000))

Thanks,
C

Profiling one variable

Hi,
very great job !!!!!!! I have a problem

I tried to profile just one variable by using this

pandas_profiling.ProfileReport(pd.DataFrame(training['num_var4'], columns=['num_var4']))

The problem is in the toggle details, I just have the Statistics but not the Histogram, Common Values, Extreme Values.

How to have them?

Your package is very great !

regards

Customization Options

Just tried out your package for a small project at work. I'm already a big fan of it as it accomplishes a great deal of what I would do in the pre-modeling phase with a single command. It also looks much nicer with the HTML format.

Are you planning on adding some customization options? For example, it would be great to set the number of bins in the histograms. I work with some highly skewed data and getting a good visual tends to require a large amount of bins.

Additionally, crosstabs and scatter plots for bivariate descriptive analysis would be awesome. Maybe not by default, but the ability to specificy them using tuples of column names would be mighty useful.

Incorrect calculation for % unique for variables with missing values

We've found that the % unique calculation sometimes shows up as >100%, and the it appears to be in cases where there are a small number of unique values and "missing" is one of them. It looks like that "missing" is counted in the numerator but not in the denominator of the calculation.

For example, we have a variable with 2 possible values ("Y" and "missing") that shows 200.0% unique (e.g. 2/1).

Another variable with 3 possible values ("Y", "N", "missing") shows 150.0% unique (e.g. 3/2).

unorderable types: str() > int()

I get the above error message. Works fine if I exclude the object columns.

Change Requests

Hi Jos,

As previously sent, here is the list of things I think would make this profiling tool even better.

pandas_profiling:

In dataset info area of the report, add filename and file size

at the bottom of the report, add a simple output grid/table listing of the variables in the order they came, and their variable type. Something like df.dtypes

for categorical/text data, add min/max/average length of the strings (this is helpful for people who will build database tables on the data)

not sure how you'd do it, but it'd be good to know how many records are exact duplicates of one another, counts and %

for categorical/text data, it'd be great to illustrate the regex expression for strings, an example would be which have number data in strings - e.g. phone number regex expression is like ((\(\d{3}\) ?)|(\d{3}-))?\d{3}-\d{4}  this regex thing is clearly likely much more difficult, just figured I’d throw it out there…

non-ascii problems in Python 2.7

In trying to run pandas-profiling on a dataset I get errors indicating it's choking on Unicode characters. Full traceback below:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-11-e6ab17740d9a> in <module>()
----> 1 pandas_profiling.ProfileReport(responses)

/Users/brian/.virtualenvs/narnia/lib/python2.7/site-packages/pandas_profiling/__init__.pyc in __init__(self, df, **kwargs)
     18 
     19         self.html = to_html(sample,
---> 20                             description_set)
     21 
     22         self.description_set = description_set

/Users/brian/.virtualenvs/narnia/lib/python2.7/site-packages/pandas_profiling/base.pyc in to_html(sample, stats_object)
    458         if row['type'] == 'CAT':
    459             formatted_values['minifreqtable'] = freq_table(stats_object['freq'][idx], n_obs,
--> 460                                                            templates.template('mini_freq_table'), templates.template('mini_freq_table_row'), 3)
    461 
    462             if row['distinct_count'] > 50:

/Users/brian/.virtualenvs/narnia/lib/python2.7/site-packages/pandas_profiling/base.pyc in freq_table(freqtable, n, table_template, row_template, max_number_to_print)
    413 
    414         for label, freq in six.iteritems(freqtable.iloc[0:max_number_to_print]):
--> 415             freq_rows_html += _format_row(freq, label, max_freq, row_template, n)
    416 
    417         if freq_other > min_freq:

/Users/brian/.virtualenvs/narnia/lib/python2.7/site-packages/pandas_profiling/base.pyc in _format_row(freq, label, max_freq, row_template, n, extra_class)
    391                                        extra_class=extra_class,
    392                                        label_in_bar=label_in_bar,
--> 393                                        label_after_bar=label_after_bar)
    394 
    395     def freq_table(freqtable, n, table_template, row_template, max_number_to_print):

/Users/brian/.virtualenvs/narnia/lib/python2.7/site-packages/jinja2/environment.pyc in render(self, *args, **kwargs)
   1006         except Exception:
   1007             exc_info = sys.exc_info()
-> 1008         return self.environment.handle_exception(exc_info, True)
   1009 
   1010     def render_async(self, *args, **kwargs):

/Users/brian/.virtualenvs/narnia/lib/python2.7/site-packages/jinja2/environment.pyc in handle_exception(self, exc_info, rendered, source_hint)
    778             self.exception_handler(traceback)
    779         exc_type, exc_value, tb = traceback.standard_exc_info
--> 780         reraise(exc_type, exc_value, tb)
    781 
    782     def join_path(self, template, parent):

/Users/brian/.virtualenvs/narnia/lib/python2.7/site-packages/pandas_profiling/templates/mini_freq_table_row.html in top-level template code()
      6             {{ label_in_bar }}
      7         </div>
----> 8         {{ label_after_bar }}
      9     </td>
     10 </tr>

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 24: ordinal not in range(128)

AttributeError: 'module' object has no attribute 'api'

On running import pandas_profiling and then pandas_profiling.ProfileReport(df)

An AttributeError is returned as shown:

I have tried my best to understand this error but it doesn't make any sense to me. I have installed as recommended on README and have followed the standard debugging steps but can't get this to work.

Any help would be greatly appreciated!

AttributeError: module 'six' has no attribute 'viewkeys'

Hi.

In [14]: pandas_profiling.ProfileReport(df)

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-14-b5ad826014b5> in <module>()
----> 1 pandas_profiling.ProfileReport(df)

/usr/local/lib/python3.5/dist-packages/pandas_profiling/__init__.py in __init__(self, df, **kwargs)
     19 
     20         self.html = to_html(sample,
---> 21                             description_set)
     22 
     23         self.description_set = description_set

/usr/local/lib/python3.5/dist-packages/pandas_profiling/base.py in to_html(sample, stats_object)
    389             formatted_values[col] = fmt(value, col)
    390 
--> 391         for col in set(row.index) & six.viewkeys(row_formatters):
    392             row_classes[col] = row_formatters[col](row[col])
    393             if row_classes[col] == "alert" and col in templates.messages:

AttributeError: module 'six' has no attribute 'viewkeys'

python 3.5

MatplotlibDeprecationWarning (set_axis_bgcolor)

I've just updated matplotlib and all the other libs. And this is what I get.

My code:

rep = pandas_profiling.ProfileReport(df)
rep.to_file('pandas_profiling_report.html')

Warning:

/usr/local/lib/python3.6/site-packages/pandas_profiling/base.py:133: MatplotlibDeprecationWarning: The set_axis_bgcolor function was deprecated in version 2.0. Use set_facecolor instead.
  plot.set_axis_bgcolor("w")

The report looks fine.

Data quality "summary score" - possible enhancement

Would anybody see merit in a "data quality summary score"? Meaning a number between 0-100 about the usability of each variable. A normally distributed variable with no missing values would be a 100 and depending on the problems (outliers, skew, missing data, ...) in the data the score lowers.

I have seen this a couple of times in commercial solutions. Any ideas?

pandas-profiling not installable with Python 3.6

When trying to install pandas-profiling with a current Anaconda version, I get an error. This is because I am already on Python 3.6 and conda info shows:
`
conda info pandas-profiling=1.3.0
Fetching package metadata .............

pandas-profiling 1.3.0 py27_0

file name : pandas-profiling-1.3.0-py27_0.tar.bz2
name : pandas-profiling
version : 1.3.0
build string: py27_0
build number: 0
channel : conda-forge
size : 28 KB
arch : x86_64
has_prefix : False
license : MIT
md5 : a7a2a6ef13981cfb2224fd788059a3fa
noarch : None
platform : linux
requires : ()
subdir : linux-64
url : https://conda.anaconda.org/conda-forge/linux-64/pandas-profiling-1.3.0-py27_0.tar.bz2
dependencies:
jinja2 >=2.8
matplotlib >=1.4
pandas >=0.16
python 2.7*
six >=1.9

pandas-profiling 1.3.0 py34_0

file name : pandas-profiling-1.3.0-py34_0.tar.bz2
name : pandas-profiling
version : 1.3.0
build string: py34_0
build number: 0
channel : conda-forge
size : 28 KB
arch : x86_64
has_prefix : False
license : MIT
md5 : 2a2116f78d6be8513c014a7b9dec6720
noarch : None
platform : linux
requires : ()
subdir : linux-64
url : https://conda.anaconda.org/conda-forge/linux-64/pandas-profiling-1.3.0-py34_0.tar.bz2
dependencies:
jinja2 >=2.8
matplotlib >=1.4
pandas >=0.16
python 3.4*
six >=1.9

pandas-profiling 1.3.0 py35_0

file name : pandas-profiling-1.3.0-py35_0.tar.bz2
name : pandas-profiling
version : 1.3.0
build string: py35_0
build number: 0
channel : conda-forge
size : 223 KB
arch : x86_64
has_prefix : False
license : MIT
md5 : a00d9fab0150a53d8c6c234bf9cb5d22
noarch : None
platform : linux
requires : ()
subdir : linux-64
url : https://conda.anaconda.org/conda-forge/linux-64/pandas-profiling-1.3.0-py35_0.tar.bz2
dependencies:
jinja2 >=2.8
matplotlib >=1.4
pandas >=0.16
python 3.5*
six >=1.9
`
Could you please provide a version for 3.6? Thanks!

Enhancement: add binary variable type

If the number of distinct values of a numerical value is 2, the current report is overkill, and even confusing. A simpler layout would be better.

TypeError: numpy boolean subtract, the `-` operator, is deprecated, use the bitwise_xor, the `^` operator, or the logical_xor function instead.

    stats['range'] = stats['max'] - stats['min']
TypeError: numpy boolean subtract, the `-` operator, is deprecated, use the bitwise_xor, the `^` operator, or the logical_xor function instead.

I got this error

IndexError

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\reshape\reshape.py in _make_selectors(self)
149 selector = self.sorted_labels[-1] + stride * comp_index + self.lift
150 mask = np.zeros(np.prod(self.full_shape), dtype=bool)
--> 151 mask.put(selector, True)
152
153 if mask.sum() < len(self.index):

IndexError: index 1366591616 is out of bounds for axis 0 with size 1366589656

Difficulty getting access to latest `get_rejected_variables()` functionality

I installed pandas-profiling using:

pip install pandas-profiling

This gave me pandas-profiling 1.0.0a2, but the corresponding __init__.py file did not contain the get_rejected_variables() functionality.

I then cloned the git repo, and tried to install using:

python setup.py install

This gave me the zipped egg file in ~/anaconda3/envs/python2/lib/python2.7/site-packages, and I got an error that python could not find pandas_profiling.mplstyle because the pandas_profiling directory did not exist.

Finally, I extracted the egg file and move pandas_profiling into ~/anaconda3/envs/python2/lib/python2.7/site-packages, and got the new functionality.

Might be something I missed. Just letting you know.

Add badges for "highly skewed", "zeros"

I enjoy the badges on the report and am happy to submit a PR for this if it isn't being worked on elsewhere.

Gnome session dies on pandas_profiling.ProfileReport(df)

Just ran in ipython locally on my machine and the whole gnome session died - couldnt bring it back until full machine restart.
does that make any sense?

OverflowError: signed integer is greater than maximum

Code:

import pandas as pd
import numpy as np
import pandas_profiling

df = pd.read_csv('/tmp/a.t',sep='\t')
df.replace(to_replace=["\\N"," "],value=np.nan,inplace=True)
pfr = pandas_profiling.ProfileReport(df.head())
pfr.to_file("/tmp/example.html")

Stack trace:

  File "<ipython-input-22-5d56f8a2a91a>", line 9, in <module>
    pfr = pandas_profiling.ProfileReport(df)

  File "/home/user/.local/lib/python3.5/site-packages/pandas_profiling/__init__.py", line 17, in __init__
    description_set = describe(df, **kwargs)

  File "/home/user/.local/lib/python3.5/site-packages/pandas_profiling/base.py", line 284, in describe
    ldesc = {col: s for col, s in pool.map(local_multiprocess_func, df.iteritems())}

  File "/usr/lib/python3.5/multiprocessing/pool.py", line 260, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()

  File "/usr/lib/python3.5/multiprocessing/pool.py", line 608, in get
    raise self._value

OverflowError: signed integer is greater than maximum

/home/user/.local/lib/python3.5/site-packages/pandas_profiling/base.py:223: RuntimeWarning: divide by zero encountered in long_scalars
  'p_unique': distinct_count / count}

Example of the data (tab separated):

ts	b	c	d	e	f	g	h	i	j	k
1234576601963	63992632	320	97123492385	4	325	value234	520938	HASH	\N	const
1234572654839	28444632	319	96761692385	4	325	value1	520938	HASH	\N	const
1234573590399	58412632	320	96830092385	4	325	value1	520938	HASH	\N	const
1234573550737	63312632	320	96822892385	4	325	value1	520938	HASH	\N	const
1234575246094	74492632	320	96903892385	4	325	value234	520938	HASH	\N	const
1234576285832	64342632	320	97067692385	4	325	value234	520938	HASH	\N	const
1234567951818	35632	319	96759642785	4	310	value917	520938	HASH	\N	const
1234574017225	90822632	320	96872692385	4	325	value1	520938	HASH	\N	const
1234575810946	70392632	320	96992692385	4	325	value234	520938	HASH	\N	const

Key Error: 20

KeyError Traceback (most recent call last)
in ()
----> 1 profile = pandas_profiling.ProfileReport(selectFeatureData)
2 profile.to_file(outputfile="allData.html")

/home/amol/anaconda/lib/python2.7/site-packages/pandas_profiling/init.pyc in init(self, df)
15 description_set = describe(df)
16 self.html = to_html(df.head(),
---> 17 description_set)
18
19 def to_file(self, outputfile=DEFAULT_OUTPUTFILE):

/home/amol/anaconda/lib/python2.7/site-packages/pandas_profiling/base.pyc in to_html(sample_df, stats_object)
313 templates.mini_freq_table, templates.mini_freq_table_row, 3)
314 formatted_values['freqtable'] = freq_table(stats_object['freq'][idx], n_obs,
--> 315 templates.freq_table, templates.freq_table_row, 20)
316 if row['distinct_count'] > 50:
317 messages.append(templates.messages['HIGH_CARDINALITY'].format(formatted_values, varname = formatters.fmt_varname(idx)))

/home/amol/anaconda/lib/python2.7/site-packages/pandas_profiling/base.pyc in freq_table(freqtable, n, table_template, row_template, max_number_of_items_in_table)
252 freq_rows_html = u''
253
--> 254 freq_other = sum(freqtable[max_number_of_items_in_table:])
255 freq_missing = n - sum(freqtable)
256 max_freq = max(freqtable.values[0], freq_other, freq_missing)

/home/amol/anaconda/lib/python2.7/site-packages/pandas/core/series.pyc in getitem(self, key)
595 key = check_bool_indexer(self.index, key)
596
--> 597 return self._get_with(key)
598
599 def _get_with(self, key):

/home/amol/anaconda/lib/python2.7/site-packages/pandas/core/series.pyc in _get_with(self, key)
600 # other: fancy integer or otherwise
601 if isinstance(key, slice):
--> 602 indexer = self.index._convert_slice_indexer(key, kind='getitem')
603 return self._get_values(indexer)
604 elif isinstance(key, ABCDataFrame):

/home/amol/anaconda/lib/python2.7/site-packages/pandas/core/index.pyc in _convert_slice_indexer(self, key, kind)
3871
3872 # translate to locations
-> 3873 return self.slice_indexer(key.start, key.stop, key.step)
3874
3875 def get_value(self, series, key):

/home/amol/anaconda/lib/python2.7/site-packages/pandas/core/index.pyc in slice_indexer(self, start, end, step, kind)
2568 This function assumes that the data is sorted, so use at your own peril
2569 """
-> 2570 start_slice, end_slice = self.slice_locs(start, end, step=step, kind=kind)
2571
2572 # return a slice

/home/amol/anaconda/lib/python2.7/site-packages/pandas/core/index.pyc in slice_locs(self, start, end, step, kind)
2712 start_slice = None
2713 if start is not None:
-> 2714 start_slice = self.get_slice_bound(start, 'left', kind)
2715 if start_slice is None:
2716 start_slice = 0

/home/amol/anaconda/lib/python2.7/site-packages/pandas/core/index.pyc in get_slice_bound(self, label, side, kind)
2660 except ValueError:
2661 # raise the original KeyError
-> 2662 raise err
2663
2664 if isinstance(slc, np.ndarray):

KeyError: 20.0

Add "target" variable to ProfileReport and then add more graphs

https://seaborn.pydata.org/generated/seaborn.countplot.html is very useful if target is categorical. so it'd be nice to plot it if necessary

ValueError: The first argument of bincount must be non-negative

Testing pandas-profiling with the df I am currently working with, and getting

/usr/local/lib/python2.7/site-packages/numpy/lib/function_base.pyc in histogram(a, bins, range, normed, weights, density)
    247                 n.imag += np.bincount(indices, weights=tmp_w.imag, minlength=bins)
    248             else:
--> 249                 n += np.bincount(indices, weights=tmp_w, minlength=bins).astype(ntype)
    250 
    251         # We now compute the bin edges since these are returned
ValueError: The first argument of bincount must be non-negative

The dataframe I'm using:

print df.head()

0         1   2   3         4    5    6            7    8    9         10  \
0   1  0.697663   1   1  0.000005  1.0  inf     2.307568  inf  inf  0.000005   
1   1  0.983510   1   1  0.000008  1.0  inf    59.642170  inf  inf  0.000008   
2   1  1.000000   1   1  0.000000  1.0  inf          inf  inf  inf  0.000000   
3   1  0.999195   1   1  0.000004  1.0  inf  1241.660000  inf  inf  0.000004   
4   1  1.000000   1   1  0.000064  1.0  inf          inf  inf  inf  0.000064   

    11  
0  inf  
1  inf  
2  inf  
3  inf  
4  inf

JupyterLab Support

With JupyterLabs set to enter beta this fall and its 1.0 release towards the new year, I have begun to move towards working in JupyterLabs. This profiling library is one of my favorites to use when getting started with a new data source, so I wanted to be sure I could use it within JupyterLabs before switching. Unfortunately, it does look like there are formatting issues:

After pandas update: module 'pandas' has no attribute 'api'

When

from pandas_profiling import ProfileReport 
ProfileReport(df)

I get
Traceback (most recent call last): File "/usr/lib/python3.5/multiprocessing/pool.py", line 119, in worker result = (True, func(*args, **kwds)) File "/usr/lib/python3.5/multiprocessing/pool.py", line 44, in mapstar return list(map(*args)) File "/usr/local/lib/python3.5/dist-packages/pandas_profiling/base.py", line 247, in multiprocess_func return x[0], describe_1d(x[1], **kwargs) File "/usr/local/lib/python3.5/dist-packages/pandas_profiling/base.py", line 232, in describe_1d vartype = get_vartype(data) File "/usr/local/lib/python3.5/dist-packages/pandas_profiling/base.py", line 46, in get_vartype elif pd.api.types.is_numeric_dtype(data): AttributeError: module 'pandas' has no attribute 'api' """
pandas-0.20.3
pandas_profiling is from pypl

Properly handle columns with mixed numeric and categorical values

You might want to handle them as categorical or define parameter to solve the following problem:

  File "/home/user/.local/lib/python3.5/site-packages/pandas_profiling/__init__.py", line 17, in __init__
    description_set = describe(df, **kwargs)
  File "/home/user/.local/lib/python3.5/site-packages/pandas_profiling/base.py", line 308, in describe
    confusion_matrix=pd.crosstab(data1,data2)
  File "/home/user/.local/lib/python3.5/site-packages/pandas-0.20.3-py3.5-linux-x86_64.egg/pandas/core/reshape/pivot.py", line 493, in crosstab
    aggfunc=len, margins=margins, dropna=dropna)
  File "/home/user/.local/lib/python3.5/site-packages/pandas-0.20.3-py3.5-linux-x86_64.egg/pandas/core/reshape/pivot.py", line 160, in pivot_table
    table = table.sort_index(axis=1)
  File "/home/user/.local/lib/python3.5/site-packages/pandas-0.20.3-py3.5-linux-x86_64.egg/pandas/core/frame.py", line 3243, in sort_index
    labels = labels._sort_levels_monotonic()
  File "/home/user/.local/lib/python3.5/site-packages/pandas-0.20.3-py3.5-linux-x86_64.egg/pandas/core/indexes/multi.py", line 1241, in _sort_levels_monotonic
    indexer = lev.argsort()
  File "/home/user/.local/lib/python3.5/site-packages/pandas-0.20.3-py3.5-linux-x86_64.egg/pandas/core/indexes/base.py", line 2091, in argsort
    return result.argsort(*args, **kwargs)
TypeError: unorderable types: str() < int()

Confusing name

Hi,

When I think of the verb 'Profiling' I think of compatible projects like 'line_profiler', 'memory_profiler' or performance profiling. Given that pandas has a similar function called 'describe()' would it make sense to choose another name for this project?

John

Create documentation as project grows

I suggest readthedocs.io

memory & CPU eating

Then I try to generate several pandas_profilings for big tables it generates a lot of processes. Perhaps some of the threads are not closing properly. And become zombie processes and eating CPU and RAM.
Python 2, Ubuntu 14.04, Jupyter Notebook 4.

When processing large data files about 800MB errors out

Thank you so much for this great program which helps me a lot, but When I try to process a file of 800mb it errors out saying memory error how to process this huge files. Please guide me in this.

Can't import in a Jupyter server extension

I have an extension where I want to load pandas-profiling but I get an error on the import:

  File "/Users/gram/anaconda/envs/jupyter42/lib/python2.7/site-packages/pandas_profiling/__init__.py", line 4, in <module>
    from .base import describe, to_html
  File "/Users/gram/anaconda/envs/jupyter42/lib/python2.7/site-packages/pandas_profiling/base.py", line 20, in <module>
    from matplotlib import pyplot as plt
  File "/Users/gram/anaconda/envs/jupyter42/lib/python2.7/site-packages/matplotlib/pyplot.py", line 114, in <module>
    _backend_mod, new_figure_manager, draw_if_interactive, _show = pylab_setup()
  File "/Users/gram/anaconda/envs/jupyter42/lib/python2.7/site-packages/matplotlib/backends/__init__.py", line 32, in pylab_setup
    globals(),locals(),[backend_name],0)
  File "/Users/gram/anaconda/envs/jupyter42/lib/python2.7/site-packages/matplotlib/backends/backend_macosx.py", line 24, in <module>
    from matplotlib.backends import _macosx
RuntimeError: Python is not installed as a framework. The Mac OS X backend will not be able to function correctly if Python is not installed as a framework. See the Python documentation for more information on installing Python as a framework on Mac OS X. Please either reinstall Python as a framework, or try one of the other backends. If you are Working with Matplotlib in a virtual enviroment see 'Working with Matplotlib in Virtual environments' in the Matplotlib FAQ

Code refactoring

Hello,

I would like to perform a code refactoring in order to split the big base.py module in several modules with separated concerns:

plot: Dedicated to plot features (histograms at this time)
describe: Dedicated to compute the variables description
report: Dedicated to generate the report
base: Containing only remaining common features

Additional modules will be ket as is.

formatters: Containing formatting utilities
templates: Jinja templating helpers and definition

It's just a first step, then I would like to split the big report generation, but this is another story ;-)

I would like your approval before starting it, since it will change a lot of things.

Many thanks for your answer.
Best,
Romain

Getting more done in GitHub with ZenHub

Hola! @JtCloud has created a ZenHub account for the JosPolfliet organization. ZenHub is the only project management tool integrated natively in GitHub – created specifically for fast-moving, software-driven teams.

How do I use ZenHub?

To get set up with ZenHub, all you have to do is download the browser extension and log in with your GitHub account. Once you do, you’ll get access to ZenHub’s complete feature-set immediately.

What can ZenHub do?

ZenHub adds a series of enhancements directly inside the GitHub UI:

Real-time, customizable task boards for GitHub issues;
Multi-Repository burndown charts, estimates, and velocity tracking based on GitHub Milestones;
Personal to-do lists and task prioritization;
Time-saving shortcuts – like a quick repo switcher, a “Move issue” button, and much more.

Add ZenHub to GitHub

Still curious? See more ZenHub features or read user reviews. This issue was written by your friendly ZenHub bot, posted by request from @JtCloud.

Duplicate (and highly correlated) categoricals

Duplicated categorical variables are not flagged:

Perhaps it would be worthwhile to detect these types of exact duplicates, or maybe also perform a correlation analysis on categoricals?

Passing `check_correlation` kwarg to ProfileReport fails

I have run across uses for this where correlation isn't useful. In trying to disable it, the histogram function fails when the kwarg is given.

STR

load/create a dataframe
profile pandas_profiling.ProfileReport(data, check_correlation=False)

Stacktrace

TypeError Traceback (most recent call last)
in ()
----> 1 pandas_profiling.ProfileReport(data, check_correlation=False)

/Users/taylor.miller/anaconda/lib/python3.6/site-packages/pandas_profiling/init.py in init(self, df, **kwargs)
15 sample = kwargs.get('sample', df.head())
16
---> 17 description_set = describe(df, **kwargs)
18
19 self.html = to_html(sample,

/Users/taylor.miller/anaconda/lib/python3.6/site-packages/pandas_profiling/base.py in describe(df, bins, correlation_overrides, pool_size, **kwargs)
282 pool = multiprocessing.Pool(pool_size)
283 local_multiprocess_func = partial(multiprocess_func, **kwargs)
--> 284 ldesc = {col: s for col, s in pool.map(local_multiprocess_func, df.iteritems())}
285 pool.close()
286

/Users/taylor.miller/anaconda/lib/python3.6/multiprocessing/pool.py in map(self, func, iterable, chunksize)
258 in a list that is returned.
259 '''
--> 260 return self._map_async(func, iterable, mapstar, chunksize).get()
261
262 def starmap(self, func, iterable, chunksize=None):

/Users/taylor.miller/anaconda/lib/python3.6/multiprocessing/pool.py in get(self, timeout)
606 return self._value
607 else:
--> 608 raise self._value
609
610 def _set(self, i, obj):

TypeError: _plot_histogram() got an unexpected keyword argument 'check_correlation'

IndexError: index 709544305 is out of bounds for axis 0 with size 708899956

This code

import pandas_profiling 
profile = pandas_profiling.ProfileReport(df) # Note that len(df) = 3888656 X 48 Columns
profile.to_file(outputfile="Baseline_Profile.html")

Generates this error:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-32-c53084b4e5f3> in <module>()
      1 import pandas_profiling
----> 2 profile = pandas_profiling.ProfileReport(df)
      3 profile.to_file(outputfile="Baseline_Profile.html")

c:\users\redacted\appdata\local\continuum\anaconda3\lib\site-packages\pandas_profiling\__init__.py in __init__(self, df, **kwargs)
     15         sample = kwargs.get('sample', df.head())
     16 
---> 17         description_set = describe(df, **kwargs)
     18 
     19         self.html = to_html(sample,

c:\users\redacted\appdata\local\continuum\anaconda3\lib\site-packages\pandas_profiling\base.py in describe(df, bins, correlation_overrides, pool_size, **kwargs)
    306             continue
    307 
--> 308         confusion_matrix=pd.crosstab(data1,data2)
    309         if confusion_matrix.values.diagonal().sum() == len(df):
    310             ldesc[name1] = pd.Series(['RECODED', name2], index=['type', 'correlation_var'])

c:\users\redacted\appdata\local\continuum\anaconda3\lib\site-packages\pandas\tools\pivot.py in crosstab(index, columns, values, rownames, colnames, aggfunc, margins, dropna, normalize)
    480         df['__dummy__'] = 0
    481         table = df.pivot_table('__dummy__', index=rownames, columns=colnames,
--> 482                                aggfunc=len, margins=margins, dropna=dropna)
    483         table = table.fillna(0).astype(np.int64)
    484 

c:\users\redacted\appdata\local\continuum\anaconda3\lib\site-packages\pandas\tools\pivot.py in pivot_table(data, values, index, columns, aggfunc, fill_value, margins, dropna, margins_name)
    131         to_unstack = [agged.index.names[i] or i
    132                       for i in range(len(index), len(keys))]
--> 133         table = agged.unstack(to_unstack)
    134 
    135     if not dropna:

c:\users\redacted\appdata\local\continuum\anaconda3\lib\site-packages\pandas\core\frame.py in unstack(self, level, fill_value)
   4034         """
   4035         from pandas.core.reshape import unstack
-> 4036         return unstack(self, level, fill_value)
   4037 
   4038     # ----------------------------------------------------------------------

c:\users\redacted\appdata\local\continuum\anaconda3\lib\site-packages\pandas\core\reshape.py in unstack(obj, level, fill_value)
    402 def unstack(obj, level, fill_value=None):
    403     if isinstance(level, (tuple, list)):
--> 404         return _unstack_multiple(obj, level)
    405 
    406     if isinstance(obj, DataFrame):

c:\users\redacted\appdata\local\continuum\anaconda3\lib\site-packages\pandas\core\reshape.py in _unstack_multiple(data, clocs)
    297         dummy.index = dummy_index
    298 
--> 299         unstacked = dummy.unstack('__placeholder__')
    300         if isinstance(unstacked, Series):
    301             unstcols = unstacked.index

c:\users\redacted\appdata\local\continuum\anaconda3\lib\site-packages\pandas\core\frame.py in unstack(self, level, fill_value)
   4034         """
   4035         from pandas.core.reshape import unstack
-> 4036         return unstack(self, level, fill_value)
   4037 
   4038     # ----------------------------------------------------------------------

c:\users\redacted\appdata\local\continuum\anaconda3\lib\site-packages\pandas\core\reshape.py in unstack(obj, level, fill_value)
    406     if isinstance(obj, DataFrame):
    407         if isinstance(obj.index, MultiIndex):
--> 408             return _unstack_frame(obj, level, fill_value=fill_value)
    409         else:
    410             return obj.T.stack(dropna=False)

c:\users\redacted\appdata\local\continuum\anaconda3\lib\site-packages\pandas\core\reshape.py in _unstack_frame(obj, level, fill_value)
    449         unstacker = _Unstacker(obj.values, obj.index, level=level,
    450                                value_columns=obj.columns,
--> 451                                fill_value=fill_value)
    452         return unstacker.get_result()
    453 

c:\users\redacted\appdata\local\continuum\anaconda3\lib\site-packages\pandas\core\reshape.py in __init__(self, values, index, level, value_columns, fill_value)
    101 
    102         self._make_sorted_values_labels()
--> 103         self._make_selectors()
    104 
    105     def _make_sorted_values_labels(self):

c:\users\redacted\appdata\local\continuum\anaconda3\lib\site-packages\pandas\core\reshape.py in _make_selectors(self)
    136         selector = self.sorted_labels[-1] + stride * comp_index + self.lift
    137         mask = np.zeros(np.prod(self.full_shape), dtype=bool)
--> 138         mask.put(selector, True)
    139 
    140         if mask.sum() < len(self.index):

IndexError: index 709544305 is out of bounds for axis 0 with size 708899956

Warnings column names should be links to details in Variables section

Benefits

This would expedite navigation through large reports.
I've seen a few click them with expectation of this behavior, so it seems natural.

Will submit a PR if you agree @JosPolfliet

This call to matplotlib.use() has no effect because the backend has already

/home/flash1/work/software/python/anaconda2/lib/python2.7/site-packages/pandas_profiling/base.py:20: UserWarning:
This call to matplotlib.use() has no effect because the backend has already
been chosen; matplotlib.use() must be called before pylab, matplotlib.pyplot,
or matplotlib.backends is imported for the first time.

The backend was originally set to 'module://ipykernel.pylab.backend_inline' by the following code:
File "/home/flash1/work/software/python/anaconda2/lib/python2.7/runpy.py", line 174, in _run_module_as_main
"main", fname, loader, pkg_name)
File "/home/flash1/work/software/python/anaconda2/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/home/flash1/work/software/python/anaconda2/lib/python2.7/site-packages/ipykernel_launcher.py", line 16, in
app.launch_new_instance()
File "/home/flash1/work/software/python/anaconda2/lib/python2.7/site-packages/traitlets/config/application.py", line 658, in launch_instance
app.start()
File "/home/flash1/work/software/python/anaconda2/lib/python2.7/site-packages/ipykernel/kernelapp.py", line 477, in start
ioloop.IOLoop.instance().start()
File "/home/flash1/work/software/python/anaconda2/lib/python2.7/site-packages/zmq/eventloop/ioloop.py", line 177, in start
super(ZMQIOLoop, self).start()
File "/home/flash1/work/software/python/anaconda2/lib/python2.7/site-packages/tornado/ioloop.py", line 888, in start
handler_func(fd_obj, events)
File "/home/flash1/work/software/python/anaconda2/lib/python2.7/site-packages/tornado/stack_context.py", line 277, in null_wrapper
return fn(*args, **kwargs)
File "/home/flash1/work/software/python/anaconda2/lib/python2.7/site-packages/zmq/eventloop/zmqstream.py", line 440, in _handle_events
self._handle_recv()
File "/home/flash1/work/software/python/anaconda2/lib/python2.7/site-packages/zmq/eventloop/zmqstream.py", line 472, in _handle_recv
self._run_callback(callback, msg)
File "/home/flash1/work/software/python/anaconda2/lib/python2.7/site-packages/zmq/eventloop/zmqstream.py", line 414, in _run_callback
callback(*args, **kwargs)
File "/home/flash1/work/software/python/anaconda2/lib/python2.7/site-packages/tornado/stack_context.py", line 277, in null_wrapper
return fn(*args, **kwargs)
File "/home/flash1/work/software/python/anaconda2/lib/python2.7/site-packages/ipykernel/kernelbase.py", line 283, in dispatcher
return self.dispatch_shell(stream, msg)
File "/home/flash1/work/software/python/anaconda2/lib/python2.7/site-packages/ipykernel/kernelbase.py", line 235, in dispatch_shell
handler(stream, idents, msg)
File "/home/flash1/work/software/python/anaconda2/lib/python2.7/site-packages/ipykernel/kernelbase.py", line 399, in execute_request
user_expressions, allow_stdin)
File "/home/flash1/work/software/python/anaconda2/lib/python2.7/site-packages/ipykernel/ipkernel.py", line 196, in do_execute
res = shell.run_cell(code, store_history=store_history, silent=silent)
File "/home/flash1/work/software/python/anaconda2/lib/python2.7/site-packages/ipykernel/zmqshell.py", line 533, in run_cell
return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
File "/home/flash1/work/software/python/anaconda2/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 2718, in run_cell
interactivity=interactivity, compiler=compiler, result=result)
File "/home/flash1/work/software/python/anaconda2/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 2822, in run_ast_nodes
if self.run_code(code, result):
File "/home/flash1/work/software/python/anaconda2/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 2882, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "", line 8, in
import matplotlib.pyplot as plt
File "/home/flash1/work/software/python/anaconda2/lib/python2.7/site-packages/matplotlib/pyplot.py", line 69, in
from matplotlib.backends import pylab_setup
File "/home/flash1/work/software/python/anaconda2/lib/python2.7/site-packages/matplotlib/backends/init.py", line 14, in
line for line in traceback.format_stack()

matplotlib.use('Agg')

Histogram bins Argument Ignored

It seems that passing bins for plotting histograms is being ignored. The call:

pandas_profiling.ProfileReport(removals_df, bins="auto")

places bins in kwargs in the constructor (file init.py), which is passed in the call to the function describe (implemented in file base.py). Since the argument bins isn't part of kwargs in this function, it's never passed to other functions. In addition, calls to functions histogram or mini_histogram aren't passing a bins argument.

Error while running- AttributeError: 'Series' object has no attribute 'memory_usage'

I have this dataset, all values are numerical (int/float).

Not Sure why i am getting this error.

pandas version - 0.17.0
matplotlib version - 1.5.0

Not seeing anything, just a stream of warnings in the console

Sorry if this is a stupid mistake, but I'm running this on a small slice of a CSV (1000 rows in my slice, ~450 columns) and I don't see anything. The console is full of a stream of warnings like below. The python process seems to freeze as if it's loading something, but then nothing ever happens.

I'm on a Mac using python3 installed via homebrew. Is there anything in particular I need to install or be doing in order to see the output?

The process has forked and you cannot use this CoreFoundation functionality safely. You MUST exec().
Break on __THE_PROCESS_HAS_FORKED_AND_YOU_CANNOT_USE_THIS_COREFOUNDATION_FUNCTIONALITY___YOU_MUST_EXEC__() to debug.
The process has forked and you cannot use this CoreFoundation functionality safely. You MUST exec().
Break on __THE_PROCESS_HAS_FORKED_AND_YOU_CANNOT_USE_THIS_COREFOUNDATION_FUNCTIONALITY___YOU_MUST_EXEC__() to debug.
The process has forked and you cannot use this CoreFoundation functionality safely. You MUST exec().
Break on __THE_PROCESS_HAS_FORKED_AND_YOU_CANNOT_USE_THIS_COREFOUNDATION_FUNCTIONALITY___YOU_MUST_EXEC__() to debug.
The process has forked and you cannot use this CoreFoundation functionality safely. You MUST exec().
Break on __THE_PROCESS_HAS_FORKED_AND_YOU_CANNOT_USE_THIS_COREFOUNDATION_FUNCTIONALITY___YOU_MUST_EXEC__() to debug.
The process has forked and you cannot use this CoreFoundation functionality safely. You MUST exec().
Break on __THE_PROCESS_HAS_FORKED_AND_YOU_CANNOT_USE_THIS_COREFOUNDATION_FUNCTIONALITY___YOU_MUST_EXEC__() to debug.
The process has forked and you cannot use this CoreFoundation functionality safely. You MUST exec().
Break on __THE_PROCESS_HAS_FORKED_AND_YOU_CANNOT_USE_THIS_COREFOUNDATION_FUNCTIONALITY___YOU_MUST_EXEC__() to debug.
The process has forked and you cannot use this CoreFoundation functionality safely. You MUST exec().
Break on __THE_PROCESS_HAS_FORKED_AND_YOU_CANNOT_USE_THIS_COREFOUNDATION_FUNCTIONALITY___YOU_MUST_EXEC__() to debug.
The process has forked and you cannot use this CoreFoundation functionality safely. You MUST exec().
Break on __THE_PROCESS_HAS_FORKED_AND_YOU_CANNOT_USE_THIS_COREFOUNDATION_FUNCTIONALITY___YOU_MUST_EXEC__() to debug.
The process has forked and you cannot use this CoreFoundation functionality safely. You MUST exec().
Break on __THE_PROCESS_HAS_FORKED_AND_YOU_CANNOT_USE_THIS_COREFOUNDATION_FUNCTIONALITY___YOU_MUST_EXEC__() to debug.
/Users/micah/Library/Python/3.6/lib/python/site-packages/pandas_profiling/base.py:223: RuntimeWarning: divide by zero encountered in long_scalars
  'p_unique': distinct_count / count}
The process has forked and you cannot use this CoreFoundation functionality safely. You MUST exec().
Break on __THE_PROCESS_HAS_FORKED_AND_YOU_CANNOT_USE_THIS_COREFOUNDATION_FUNCTIONALITY___YOU_MUST_EXEC__() to debug.
The process has forked and you cannot use this CoreFoundation functionality safely. You MUST exec().
Break on __THE_PROCESS_HAS_FORKED_AND_YOU_CANNOT_USE_THIS_COREFOUNDATION_FUNCTIONALITY___YOU_MUST_EXEC__() to debug.
The process has forked and you cannot use this CoreFoundation functionality safely. You MUST exec().
Break on __THE_PROCESS_HAS_FORKED_AND_YOU_CANNOT_USE_THIS_COREFOUNDATION_FUNCTIONALITY___YOU_MUST_EXEC__() to debug.

Ploting a response variable on the histograms

Hey,

Great job with pandas-profiling I love it.
I think it would be great to have an extra parameter to specify a response column.
Plotting the average response for every bin of the histograms (for each variables) would allow to see obvious trends/correlations and would be useful for any regression problem (might be more tricky for classification where the response are discrete).
What do you think ?

Thanks!

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

This error comes up as follows (Python 3.6):

File "C:\Users\me\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas_profiling\__init__.py", line 17, in __init__
description_set = describe(df, **kwargs)
File "C:\Users\me\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas_profiling\base.py", line 294, in describe
if correlation_overrides and x in correlation_overrides:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

I believe the problem is in line 294 of base.py:

 if correlation_overrides and x in correlation_overrides:

I think it should be changed to the following:

 if correlation_overrides is not None and x in correlation_overrides:

Same goes for line 307:

 if correlation_overrides is not None and name1 in correlation_overrides:

Clashing text with very long variable names

Long variable names seem to cause a bit of a problem as seen here:

I am by no means very familiar with bootstrap, but it would seem to me that making the variable take the full grid, and then offsetting the actual stats rows would do the trick:

master...dpavlic:fix_longname

This would mean the variable names would sit on its own with the relevant stats underneath, like so: