Warn about NaNs in data

When a DataFrame or Series is created, warn about NaN values.
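
A minimal sketch of what such a check could look like. `warn_if_nan` is a hypothetical helper, not part of dovpanda, and the sketch assumes the check runs on the object right after it is created.

```python
import warnings

import numpy as np
import pandas as pd


def warn_if_nan(obj):
    """Warn when a freshly created DataFrame or Series contains NaN values.

    Hypothetical helper for illustration; not part of dovpanda.
    """
    # isna() marks missing cells; summing the boolean array counts them.
    n_missing = int(obj.isna().to_numpy().sum())
    if n_missing:
        warnings.warn(
            f"{n_missing} NaN value(s) found in the new {type(obj).__name__}"
        )
    return n_missing


df = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, np.nan]})
warn_if_nan(df)
```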

Display doesn't work properly on Jupyter Lab

Brief Description

The display is broken in Jupyter Lab.
For example, the [x] button doesn't work, and the layout is a bit off (at least compared with the screenshots in the documentation).

System Information

Ubuntu 18.04.3 LTS on WSL (1).
Python 3.6.8
Jupyter Lab Version: 1.2.3
Browser: Google Chrome Version 78.0.3904.97 (Official Build) (64-bit) on Windows 10 1903

Minimally Reproducible Code

The behavior isn't wrong, only the display, so the code isn't really relevant. The only imports were dovpanda, re, and pandas.

Bug - object has no attribute 'shape'

Brief Description

When the value of a key in a Series is of type dictionary, accessing the key in the following way raises an exception:
s['dictionary']

System Information

python 3.6
ubuntu 18.04LTS

Minimally Reproducible Code

import pandas as pd
import dovpanda

x = pd.DataFrame({'dictionary': [{100: 1, 200: 2}]})
s = x.iloc[0]
s['dictionary']

whereas this works fine:

x=pd.DataFrame({'dictionary':[{100: 1, 200: 2}]})
s = x.iloc[0]
s.dictionary

Error Messages

~/anaconda3/lib/python3.6/site-packages/dovpanda/core.py in suggest_at_iat(res, arguments)
    186 def suggest_at_iat(res, arguments):
    187     self = arguments.get('self')
--> 188     shp = res.shape
    189     if res.ndim < 1:  # Sometimes specific slicing will return value
    190         return

AttributeError: 'dict' object has no attribute 'shape'
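
The traceback shows `res.shape` being read before the hook checks what kind of object came back. One possible guard is sketched below. `is_array_like` and `suggest_at_iat_safe` are hypothetical names, and the real hook in dovpanda/core.py may be structured differently.

```python
def is_array_like(res):
    """Return True only for results that expose both `shape` and `ndim`.

    Hypothetical guard; the real hook in dovpanda/core.py may differ.
    """
    return hasattr(res, "shape") and hasattr(res, "ndim")


def suggest_at_iat_safe(res):
    # Dicts, strings and other plain values have no shape: nothing to suggest.
    if not is_array_like(res) or res.ndim < 1:
        return None
    return res.shape


print(suggest_at_iat_safe({100: 1, 200: 2}))
```

Checking for the attributes first means a dict, a string or any other plain value stored in a cell simply skips the hint instead of raising.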

Find a way to hook a post hook for DataFrame constructor

Brief Description

Add an option for a new kind of hint, for the exceptions pandas raises.
If an exception is thrown, try to analyze it and explain it extensively.

SyntaxError when importing dovpanda

Brief Description

After installing the package, when I import dovpanda I get a SyntaxError

System Information

Windows OS

IPython

Python version (required):
3.5.2

Minimally Reproducible Code

import dovpanda

Error Messages



Traceback (most recent call last):
  File "C:\Program Files\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2881, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-4-d1c921b0113c>", line 1, in <module>
    import dovpanda
  File "C:\Program Files\JetBrains\PyCharm 2018.2.3\helpers\pydev\_pydev_bundle\pydev_import_hook.py", line 20, in do_import
    module = self._system_import(name, *args, **kwargs)
  File "C:\Program Files\Anaconda3\lib\site-packages\dovpanda\__init__.py", line 6, in <module>
    from dovpanda import tips
  File "C:\Program Files\JetBrains\PyCharm 2018.2.3\helpers\pydev\_pydev_bundle\pydev_import_hook.py", line 20, in do_import
    module = self._system_import(name, *args, **kwargs)
  File "C:\Program Files\JetBrains\PyCharm 2018.2.3\helpers\pydev\_pydev_bundle\pydev_import_hook.py", line 20, in do_import
    module = self._system_import(name, *args, **kwargs)
  File "C:\Program Files\Anaconda3\lib\site-packages\dovpanda\tips.py", line 48
    html = f'''
        <div class="alert alert-warning" role="alert">
          {self.html}
          <button type="button" class="close" data-dismiss="alert" aria-label="Close">
            <span aria-hidden="true">&times;</span>
          </button>
        </div>
                  <p>
            Source: <a href="{self.ref_url}" target="_blank">{self.ref_name}</a>
          </p>
        '''
          ^
SyntaxError: invalid syntax
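
The traceback points into an f-string literal. f-strings were added in Python 3.6 (PEP 498), so Python 3.5.2 rejects the file at parse time. A sketch of an equivalent that also parses on 3.5 follows. `TEMPLATE`, `render_tip` and the example URL are made up for illustration and are not dovpanda's code.

```python
TEMPLATE = '''
<div class="alert alert-warning" role="alert">
  {html}
</div>
<p>
  Source: <a href="{ref_url}" target="_blank">{ref_name}</a>
</p>
'''


def render_tip(html, ref_url, ref_name):
    # str.format is accepted by Python 3.5, unlike an f-string literal.
    return TEMPLATE.format(html=html, ref_url=ref_url, ref_name=ref_name)


print(render_tip("Use .map with a dict", "https://example.com", "example"))
```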

change print method

Create Ledger.tell and replace the print function.
Allow the following outputs:

print
logging
custom
html

Use context manager to mute hooks

with dovpanda.mute():
    pd.concat([df1, df2])
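
A rough sketch of how such a context manager could be built with contextlib. The `Ledger` class here is a hypothetical stand-in that models only a mute flag; it is not dovpanda's actual ledger.

```python
from contextlib import contextmanager


class Ledger:
    """Hypothetical stand-in for dovpanda's ledger; only the mute flag is modeled."""

    def __init__(self):
        self.muted = False
        self.messages = []

    def tell(self, message):
        # Hints are dropped while the ledger is muted.
        if not self.muted:
            self.messages.append(message)

    @contextmanager
    def mute(self):
        previous = self.muted
        self.muted = True
        try:
            yield self
        finally:
            # Restore the earlier state even if the body raises.
            self.muted = previous


ledger = Ledger()
with ledger.mute():
    ledger.tell("suppressed hint")
ledger.tell("visible hint")
print(ledger.messages)
```

Saving and restoring the previous flag, rather than always resetting to False, keeps nested `with` blocks well behaved.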

Check if data in series is string but in dateformat

Basic Functionality

After a Series or DataFrame is created, and after a new column is added, check whether the data looks like datetimes but is stored with a different dtype.

Hooks Upon

Series / Dataframe creation functions
assign
insert
_setitem

Hook Type

per-hook

Design

Check whether the data inserted into the structure has a date format (with the help of python-dateutil). If the dtype is not a datetime, suggest using pandas.to_datetime().
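
A sketch of the detection step. Note that it swaps in `pandas.to_datetime(..., errors="coerce")` instead of calling python-dateutil directly. `looks_like_dates` and its `min_share` threshold are hypothetical.

```python
import pandas as pd
from pandas.api import types as ptypes


def looks_like_dates(series, min_share=0.9):
    """Guess whether a non-datetime Series holds date strings.

    Hypothetical helper; `min_share` is an illustrative threshold.
    """
    if series.empty:
        return False
    # Real datetimes and plain numbers need no hint.
    if ptypes.is_datetime64_any_dtype(series) or ptypes.is_numeric_dtype(series):
        return False
    # Unparseable entries become NaT instead of raising.
    parsed = pd.to_datetime(series, errors="coerce")
    return bool(parsed.notna().mean() >= min_share)


print(looks_like_dates(pd.Series(["2019-01-01", "2019-02-15"])))
print(looks_like_dates(pd.Series(["apple", "banana"])))
```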

After importing dovpanda, pandas read_csv() doesn't work.

Brief Description

I'm trying to concatenate CSV files whose names start with a keyword (the example uses A). With dovpanda imported, the CSV files don't seem to be found. I ran the "glob(os.path.join(keyword + '*.csv'))" line by itself, and it returns the correct list of CSV files to concatenate.

System Information

Windows 10
Jupyter Notebook
Python 3.6.5 :: Anaconda, Inc.

Minimally Reproducible Code

import pandas as pd
import dovpanda

keyword = str('A')

df = pd.concat(map(pd.read_csv, glob(os.path.join(keyword + '*.csv'))))
display (df)

df.to_csv(path_or_buf=(f"Files {today}.csv"), index=False, encoding='ascii')

Error Messages

SAD PANDA

I'm so sorry, but I crashed on wrong_concat_axis hooks on concat with error descriptor 'union' of 'set' object needs an argument
But you can change that!
Please Report a bug
Line 5: df = pd.concat(map(pd.read_csv, glob(os.path.join(keyword + '*.csv'))))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-13-221aac6881a7> in <module>()
      3 #CRM File merge
      4 keyword = str('A')
----> 5 df = pd.concat(map(pd.read_csv, glob(os.path.join(keyword + '*.csv'))))
      6 
      7 myList = ['enrollment_id','roster_record_id','user_id','email','ssn','identify_as',

~\Anaconda3\lib\site-packages\dovpanda\base.py in run(*args, **kwargs)
    154             arguments = self._get_arguments(f, *args, **kwargs)
    155             self.run_hints(pres, arguments)
--> 156             ret = f(*args, **kwargs)
    157             self.run_hints(posts, ret, arguments)
    158             return ret

~\Anaconda3\lib\site-packages\pandas\core\reshape\concat.py in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    226                        keys=keys, levels=levels, names=names,
    227                        verify_integrity=verify_integrity,
--> 228                        copy=copy, sort=sort)
    229     return op.get_result()
    230 

~\Anaconda3\lib\site-packages\pandas\core\reshape\concat.py in __init__(self, objs, axis, join, join_axes, keys, levels, names, ignore_index, verify_integrity, copy, sort)
    260 
    261         if len(objs) == 0:
--> 262             raise ValueError('No objects to concatenate')
    263 
    264         if keys is None:

ValueError: No objects to concatenate
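
A possible explanation, not confirmed anywhere in this report: `map(...)` returns a one-shot iterator, so if a pre-hook walks over `objs` before pandas does, pandas is left with nothing to consume. The sketch below reproduces that pattern with plain Python stand-ins (`pre_hook`, `real_call` and `run_with_hook` are hypothetical) and shows one way around it.

```python
def pre_hook(objs):
    # Stand-in for a hook that inspects every object before the real call.
    return [len(o) for o in objs]


def real_call(objs):
    # Stand-in for pd.concat: fails when nothing is left to consume.
    items = list(objs)
    if not items:
        raise ValueError("No objects to concatenate")
    return sum(items, [])


def run_with_hook(objs):
    # Materialize once so the hook and the real call see the same items.
    objs = list(objs)
    pre_hook(objs)
    return real_call(objs)


lazy = map(list, ["ab", "cd"])
print(run_with_hook(lazy))
```

On the user side, wrapping the `map(...)` in `list(...)` before passing it to `pd.concat` would sidestep the problem in the same way.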

[HINT] When saving a dataframe to text format, suggest zipping

Basic Functionality

If the saved file is large, suggest compressing it "on the fly", like in tip 10 here

Hooks Upon

Write methods that produce text formats (JSON, CSV)

Hook Type

post

Design

use functionality of get size similar to #58

Advise against .apply on dataframe

Basic Functionality

On calling df.apply, explain that apply on a DataFrame is not much faster than a loop.

Hooks Upon

DataFrame.apply()

Hook Type

pre

Design

Straightforward; no need to use args.

Tips Mechanism

Tips that can be shown at analysis time, without pandas needing to be loaded.

example:

>>> dovpanda.tip()
you can call .map with a dict

find a way to make hook for properties

i.e. df.shape

Dev check mechanism

give option such as dovpanda.add_checks() which will add more hooks for dev checks

If calling `values` on a result of shape (1,1), suggest using `.at[]`

add line/cell magic

Brief Description

Inside a notebook, provide the option to make dovpanda work only on selected cells with a cell magic.

example:

%%dovpanda
df = pd.concat((df1,df2))
df.iloc[0,0]

%dovpanda df = pd.concat((df1,df2))

After dovpanda import, pandas read_csv no longer works with URL argument

Brief Description

After importing dovpanda, pd.read_csv(url, ...) no longer works, generating a file not found error.

System Information

Notebook / IPython / Python Console

Any/all

Python version (required):

3.7.3

Minimally Reproducible Code

import pandas as pd
df = pd.read_csv("http://www.stat.ufl.edu/~winner/data/pgalpga2008.dat")
print(df.head(3)) # works
import dovpanda
df = pd.read_csv("http://www.stat.ufl.edu/~winner/data/pgalpga2008.dat") # error

Error Messages

<ipython-input-4-91539aa51e2d> in <module>
      1 import pandas as pd
----> 2 df = pd.read_csv("http://www.stat.ufl.edu/~winner/data/pgalpga2008.dat")
      3 print(df.head(3))

~/anaconda3/envs/tf2/lib/python3.7/site-packages/dovpanda/base.py in run(*args, **kwargs)
    157                 for pre in pres:
    158                     if self.similar <= pre.stop_nudge:
--> 159                         pre.replacement(arguments)
    160                 ret = f(*args, **kwargs)
    161                 for post in posts:

~/anaconda3/envs/tf2/lib/python3.7/site-packages/dovpanda/core.py in check_csv_size(arguments)
    108 def check_csv_size(arguments):
    109     filename = arguments.get('filepath_or_buffer')
--> 110     if os.path.getsize(filename) > config.MAX_CSV_SIZE:
    111         ledger.tell('File size is very large and may take time to load. '
    112                     'If you would like to avoid format issues before the complete file loads, '

~/anaconda3/envs/tf2/lib/python3.7/genericpath.py in getsize(filename)
     48 def getsize(filename):
     49     """Return the size of a file, reported by os.stat()."""
---> 50     return os.stat(filename).st_size
     51 
     52 

FileNotFoundError: [Errno 2] No such file or directory: 'http://www.stat.ufl.edu/~winner/data/pgalpga2008.dat'
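
The traceback shows `os.path.getsize` being called on whatever was passed as `filepath_or_buffer`, which fails for URLs. A minimal sketch of a guard follows. `local_file_size` is a hypothetical helper and not the project's actual fix.

```python
import os


def local_file_size(filepath_or_buffer):
    """Return the size in bytes for an existing local path, else None.

    Hypothetical helper: URLs, buffers and missing files are skipped
    instead of being passed to os.path.getsize.
    """
    if not isinstance(filepath_or_buffer, (str, os.PathLike)):
        return None  # file-like object such as StringIO
    if not os.path.isfile(filepath_or_buffer):
        return None  # URL or non-existent path
    return os.path.getsize(filepath_or_buffer)


print(local_file_size("http://www.stat.ufl.edu/~winner/data/pgalpga2008.dat"))
```

A size check in `check_csv_size` could then run only when this returns a number.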

[HINT] Reset_index with drop=True

Basic Functionality

When the user calls reset_index, try to work out whether the index is meaningful. If not, suggest reset_index with drop=True.

Hooks Upon

reset_index

Hook Type

pre

Design

if name_of_index is [non-important-names-list]: hint

Check that Pandas installed

Brief Description

Currently there is a check for whether pandas is imported. For more helpful messages, dovpanda could also check whether pandas is installed at all and whether its version is supported.

This could use pkg_resources, specifically pkg_resources.require.

problem in handling empty dataframes

I'm loading and preprocessing a legitimate dataframe, using a long function, with calls to pymongo and mysql (will try spending the time to make a minimal example if issue is unclear).

I have this line:
df2 = pd.concat([v for v in some_cache_df.values()]) if some_cache_df else pd.DataFrame()

It raises an IndexError in line 45 of /dovpanda/core.py, where there's a call to cols = {df.shape[1] for df in objs}

[HINT] suggest clip

Basic Functionality

if series > scalar, and inside condition, and setitem...
might be complex to implement.

Hooks Upon

ge and setitem maybe more

Hook Type

pre

Design

Error at suggest_at_iat function

Brief Description

When using this method, the hint throws an error.

self.data.groupby(by="Class").count().idxmax()[0]

What that line does is find the most repeated class (column) in the dataframe. (probably there is a better way)

System Information

Windows 10 pro
Notebook
Python version : 3.7.4

Minimally Reproducible Code

import pandas as pd
import dovpanda

a = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],"Class": ['a', 'a', 'b', 'a', 'b', 'c']})

a.groupby(by="Class").count().idxmax()[0]

Error Messages

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-5-fe782dc356aa> in <module>
      1 a = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],"Class": ['a', 'a', 'b', 'a', 'b', 'c']})
      2 
----> 3 a.groupby(by="Class").count().idxmax()[0]

c:\users\marti\appdata\local\programs\python\python37\lib\site-packages\dovpanda\base.py in run(*args, **kwargs)
    161                 for post in posts:
    162                     if self.similar <= post.stop_nudge:
--> 163                         post.replacement(ret, arguments)
    164             return ret
    165 

c:\users\marti\appdata\local\programs\python\python37\lib\site-packages\dovpanda\core.py in suggest_at_iat(res, arguments)
    186 def suggest_at_iat(res, arguments):
    187     self = arguments.get('self')
--> 188     shp = res.shape
    189     if res.ndim < 1:  # Sometimes specific slicing will return value
    190         return

AttributeError: 'str' object has no attribute 'shape'

Only hook when called from user's code

Don't run hooks when pandas calls its own functions internally.

Probable design: run the check at the start of a replacement function, and return early if the call did not come from the user's code.
Future functionality: make it a decorator with "levels".

breaks pandas concat

Brief Description

dovpanda appears to be incompatible with pandas' concat method.

System Information

Linux 4.4.0 via Windows Subsystem for Linux
also reproduces on native Linux 3.10.0 (CentOS distro)
Python version 3.7.3

Minimally Reproducible Code

import pandas as pd
import dovpanda

s1 = pd.Series(['a', 'b'])
s2 = pd.Series(['c', 'd'])
pd.concat([s1, s2])

Error Messages

IndexError: tuple index out of range
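
A likely cause, judging from the `{df.shape[1] for df in objs}` comprehension quoted in the empty-dataframes report: a Series has a one-element shape, so `shape[1]` raises exactly this IndexError. The sketch below shows a tolerant variant. `column_counts` is a hypothetical name, and the sketch treats a Series as a single column.

```python
import pandas as pd


def column_counts(objs):
    """Collect column counts, treating a Series as a single column.

    Hypothetical replacement for `{df.shape[1] for df in objs}`.
    """
    return {obj.shape[1] if obj.ndim > 1 else 1 for obj in objs}


s1 = pd.Series(["a", "b"])
s2 = pd.Series(["c", "d"])
print(s1.shape)                 # (2,): there is no shape[1]
print(column_counts([s1, s2]))  # {1}
```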

Check for duplicate indices after more operations

Basic Functionality

Same as the post hook for concat, but for more combining operations such as join, merge, etc.

Hooks Upon

Functions that combine DataFrames

Hook Type

Design

Maybe requires only adding a list to an existing decorator.

warning on series == val

import dovpanda in console error - NameError: name 'display' is not defined

Brief Description

I am getting the following error when importing dovpanda from inside the console:
NameError: name 'display' is not defined

System Information

WIN10-64bit

Python version (required): 3.6.7
Anaconda env:

Minimally Reproducible Code

import pandas as pd
import numpy as np
import dovpanda

Error Messages

Traceback (most recent call last):
  File "", line 1, in <module>
  File "C:\DevPrograms\Anaconda2\envs\test\lib\site-packages\dovpanda\__init__.py", line 7, in <module>
    from dovpanda.core import ledger
  File "C:\DevPrograms\Anaconda2\envs\test\lib\site-packages\dovpanda\core.py", line 7, in <module>
    ledger = Ledger()
  File "C:\DevPrograms\Anaconda2\envs\test\lib\site-packages\dovpanda\base.py", line 119, in __init__
    self.teller = _Teller()
  File "C:\DevPrograms\Anaconda2\envs\test\lib\site-packages\dovpanda\base.py", line 34, in __init__
    self.set_output('display')
  File "C:\DevPrograms\Anaconda2\envs\test\lib\site-packages\dovpanda\base.py", line 103, in set_output
    self.output = display
NameError: name 'display' is not defined
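
`display` is only injected as a global inside IPython and notebook sessions, so a plain console never defines it. One way to fall back to `print` is sketched here. `pick_output` is a hypothetical helper, and dovpanda's `_Teller` may resolve its output differently.

```python
def pick_output():
    """Return IPython's display inside an IPython session, else print.

    Hypothetical helper; dovpanda's actual _Teller may work differently.
    """
    try:
        from IPython import get_ipython
        from IPython.display import display
    except ImportError:
        return print  # IPython is not installed at all
    # get_ipython() returns None when no interactive shell is running.
    return display if get_ipython() is not None else print


output = pick_output()
output("hint text")
```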

[HINT] suggest elegant pipes

Basic Functionality

If 3 or more methods are chained on the same line, suggest a multiline layout:

(df
 .f()
 .g()
 .h()
)

Hooks Upon

(almost) everything

Hook Type

Design

Date object checks on more methods

Basic Functionality

same as #37 but loop over all columns in a dataframe.

Hooks Upon

DF_CREATION and maybe more.

Hook Type

post

Design

for col in res.columns:
    #do similar to #37

Strip html before console display

Don't crash on dovpanda errors

Brief Description

If a hook misbehaves, tell the user but let them continue working.
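
One way to get this behavior is to wrap every hook so that its own exceptions turn into warnings. A minimal sketch, assuming hooks are plain callables; `never_crash` and `bad_hook` are hypothetical names.

```python
import functools
import warnings


def never_crash(hook):
    """Wrap a hook so its own errors become warnings instead of exceptions.

    Hypothetical decorator for illustration only.
    """
    @functools.wraps(hook)
    def wrapper(*args, **kwargs):
        try:
            return hook(*args, **kwargs)
        except Exception as exc:  # deliberately broad: hooks must not break user code
            warnings.warn(f"dovpanda hook {hook.__name__!r} failed: {exc!r}")
            return None
    return wrapper


@never_crash
def bad_hook(res):
    return res.shape  # fails for plain Python objects


bad_hook({"a": 1})  # warns, does not raise
```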

unhashable

[HINT] Don't append iterative

Basic Functionality

If the user loops over many DataFrames and appends each new one to a big existing one, tell them to collect the DataFrames in a list and call concat once.

Hooks Upon

append.

Hook Type

pre

Design

Use ledger.memory and see if in recent history the append is called from the same line - meaning it's in a loop.
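
For reference, the pattern the hint would recommend looks roughly like this. The data is made up. Note that `DataFrame.append` was removed in pandas 2.0, so on current versions the list-then-concat form is the only option.

```python
import pandas as pd

chunks = [pd.DataFrame({"x": [i, i + 1]}) for i in range(3)]

# Collect the pieces first, then pay the copying cost once.
combined = pd.concat(chunks, ignore_index=True)
print(combined.shape)  # (6, 1)
```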

Show what command caused hint

I'm running a long piece of code (imported to a notebook), and though I see a useful hint (I'm doing series comparison wrong), I can't find what is causing it (maybe the hint is wrong? I can't tell).

I'd like some hint (haha) as to what command or line is causing the hint to pop

concat

If the objects have the same number of rows but different column names, check that the correct axis (horizontal) is used.

Add a hint in case of large csv files to add nrows=10, when a call to 'read_csv' is being executed

Basic Functionality

When loading large CSV files (100K lines at least), we should recommend that the user add
nrows=10, so they can check that the types and formatting of the CSV file fit their needs.

Hooks Upon

read_csv

Hook Type

pre-hook

Design

Error when extracting date cell from Dataframe

Brief Description

When trying to extract one cell using iloc, dovpanda runs into an error. Sometimes, despite the error, this suggestion is made:
The shape of the returned series from slicing is (1,) Which suggests you are interested in the value and not in a new series. Try instead: series.iat[row, col]

System Information

OS: Ubuntu 18.04
Python: 3.7.5
Jupyter Lab: 1.14

Minimally Reproducible Code

df = pd.DataFrame([pd.Timestamp('01-01-2019')])
df.iloc[0][0]

Error Messages

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-25-dc9af601df62> in <module>
      1 df = pd.DataFrame([pd.Timestamp('01-01-2019')])
----> 2 df.iloc[0][0]

~/anaconda3/envs/x/lib/python3.7/site-packages/dovpanda/base.py in run(*args, **kwargs)
    161                 for post in posts:
    162                     if self.similar <= post.stop_nudge:
--> 163                         post.replacement(ret, arguments)
    164             return ret
    165 

~/anaconda3/envs/x/lib/python3.7/site-packages/dovpanda/core.py in suggest_at_iat(res, arguments)
    186 def suggest_at_iat(res, arguments):
    187     self = arguments.get('self')
--> 188     shp = res.shape
    189     if res.ndim < 1:  # Sometimes specific slicing will return value
    190         return

AttributeError: 'Timestamp' object has no attribute 'shape'

Have the option to add several original functions for one hook

Brief Description

Give function foo the option to hook more than one method.

implementations:

Beautiful but hard

@ledger.add_hint('bar')
@ledger.add_hint('baz')
def foo():
    ...

Easier but not as nice

@ledger.add_hint(['bar','baz'])
def foo():
    ...
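
Both forms can be supported by a single decorator, as long as it returns the original function unchanged. A sketch with a hypothetical `Ledger` that only models hint registration:

```python
from collections import defaultdict


class Ledger:
    """Hypothetical registry; only hint registration is modeled."""

    def __init__(self):
        self.hints = defaultdict(list)

    def add_hint(self, originals):
        # Accept a single name or a list of names.
        if isinstance(originals, str):
            originals = [originals]

        def decorator(func):
            for name in originals:
                self.hints[name].append(func)
            return func  # returning func unchanged keeps stacking possible
        return decorator


ledger = Ledger()


@ledger.add_hint(["bar", "baz"])
def foo():
    ...


@ledger.add_hint("qux")
@ledger.add_hint("quux")
def spam():
    ...
```

Stacking works here because each decorator hands back the undecorated function, so the next decorator registers the same object.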

suggest CategoryDType

if

col.dtype is object
col.nunique() < threshold

then
suggest converting to Categorical
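
A sketch of the check and of the conversion the hint would suggest. `suggest_category` and its default `threshold` are hypothetical, and the sample data is made up.

```python
import pandas as pd
from pandas.api import types as ptypes


def suggest_category(col, threshold=10):
    """Return True when a text column has few distinct values.

    Hypothetical check; `threshold` is an illustrative default.
    """
    is_text = ptypes.is_object_dtype(col) or ptypes.is_string_dtype(col)
    return bool(is_text and col.nunique() < threshold)


colors = pd.Series(["red", "blue", "red", "blue"] * 250)
if suggest_category(colors):
    compact = colors.astype("category")
    # Compare the memory footprint before and after the conversion.
    print(colors.memory_usage(deep=True), compact.memory_usage(deep=True))
```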

split with expand

series.str.split(',',expand=True)

Hint on "inplace=False" happens when in plot/show

As I understand, the dov hints when a line calls a function with "inplace=False" (explicitly or implicitly by default) but there's no assignment.

However, sometimes functions are called in order to show the results in an output cell, or a plot.

For example, I might want to call
pd.Series(data=model.feature_importances_, index=features).sort_values(ascending=False)
In order to inspect the top features, even without assignment.

Possible workarounds/solutions:

On the user side - Assign, then show/plot. That might even be good practice that I'm avoiding.
On the dov side - Can it know that a line is about to send output? or to plot? If not, maybe leech on the default, and suppress the hint when inplace=False is explicit?

concat

If the objects have the same column names, check that the correct axis (vertical) is used.

[HINT] hint when df=df.dropna(inplace=True) is causing df=None

Basic Functionality

df.dropna(inplace=False) returns the modified dataframe. When changing to inplace=True, it's easy to forget to remove the "df= " at the beginning - so it's a useful hint...

Hooks Upon

Hook Type

pre-hook

Design

I don't know... maybe just notice that the df is assigned None?
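
For context, the pitfall itself is easy to reproduce. The data is made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0]})

result = df.dropna(inplace=True)  # modifies df and returns None
print(result)   # None
print(len(df))  # 2
```

Writing `df = df.dropna(inplace=True)` therefore rebinds `df` to None, which is the case the hint would need to detect.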

Ledger Memory/Cache

Brief Description

Give the ledger a memory/cache so it can remember the last X pandas operations.
example:
series1 == series2 should hint the user to use series1.equals(series2), unless inside .loc where it is expected.

Will require hooking on every pandas function (Maybe just DataFrame) in order to add to memory.
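
A bounded history is easy to sketch with `collections.deque`. `LedgerMemory` and its method names are hypothetical, and the entries are assumed to be (function name, line number) pairs.

```python
from collections import deque


class LedgerMemory:
    """Hypothetical bounded history of recent pandas calls."""

    def __init__(self, maxlen=32):
        # deque drops the oldest entry automatically once maxlen is reached.
        self._calls = deque(maxlen=maxlen)

    def remember(self, func_name, lineno):
        self._calls.append((func_name, lineno))

    def repeated_from_same_line(self, func_name, lineno, min_repeats=3):
        # Several calls from one source line suggest the call sits in a loop.
        return self._calls.count((func_name, lineno)) >= min_repeats


memory = LedgerMemory(maxlen=4)
for _ in range(3):
    memory.remember("append", 10)
print(memory.repeated_from_same_line("append", 10))
```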

add Hint class

class Hook:
    def __init__(self, hook_function, hook_type='pre', hook_level=None):
        # hook_level=None stands in for the default level
        pass

    def __repr__(self):
        ...

    def _repr_html_(self):
        ...

Consider combining with Teller class

[HINT] Correlated columns

Basic Functionality

After a DataFrame is created, if any pair of columns is correlated above a threshold, notify the user about them.

Hooks Upon

Every function that results in the creation of a DataFrame

Hook Type

Post

Design

Calculate the pairwise correlation.
Need to think about the overhead of time and memory with big datasets.
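
A sketch of the pairwise check. `correlated_pairs` and its `threshold` default are hypothetical. The correlation matrix grows quadratically with the number of columns, which is where the overhead concern comes from.

```python
import pandas as pd


def correlated_pairs(df, threshold=0.9):
    """Return (col_a, col_b, corr) for numeric column pairs above `threshold`.

    Hypothetical helper; `threshold` is an illustrative default.
    """
    corr = df.select_dtypes("number").corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        # Only look at columns after `a` so each pair is reported once.
        for b in cols[i + 1:]:
            if corr.loc[a, b] > threshold:
                pairs.append((a, b, float(corr.loc[a, b])))
    return pairs


df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 4, 6, 8], "z": [4, 1, 3, 2]})
print(correlated_pairs(df))
```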

Initial Update

The bot created this issue to inform you that pyup.io has been set up on this repo.
Once you have closed it, the bot will open pull requests for updates as soon as they are available.

equality check between dataframes

If the user checks df1 == df2, suggest that they may mean df1.equals(df2).
