dovpanda-dev / dovpanda
Directions overlay for working with pandas in an analysis environment
License: BSD 3-Clause "New" or "Revised" License
When a DataFrame or Series is created, warn about NaN values.
Display is messed up on Jupyter Lab.
An example -- [x] button doesn't work, and display is a bit off (at least relative to the screens in the documentation).
Ubuntu 18.04.3 LTS on WSL (1).
Python 3.6.8
Jupyter Lab Version: 1.2.3
Browser: Google Chrome Version 78.0.3904.97 (Official Build) (64-bit) on Windows 10 1903
The behavior isn't wrong, just the display, so the code is not really relevant. The only imports were dovpanda, re, and pandas.
When the value of a key in a Series is of type dictionary, accessing the key in the following way raises an exception:
s['dictionary']
python 3.6
ubuntu 18.04LTS
x=pd.DataFrame({'dictionary':[{100: 1, 200: 2}]})
s = x.iloc[0]
s['dictionary']
when this works fine:
x=pd.DataFrame({'dictionary':[{100: 1, 200: 2}]})
s = x.iloc[0]
s.dictionary
~/anaconda3/lib/python3.6/site-packages/dovpanda/core.py in suggest_at_iat(res, arguments)
186 def suggest_at_iat(res, arguments):
187 self = arguments.get('self')
--> 188 shp = res.shape
189 if res.ndim < 1: # Sometimes specific slicing will return value
190 return
AttributeError: 'dict' object has no attribute 'shape'
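The traceback shows the post-hook reading `res.shape` on a plain dict. Below is a minimal sketch of a guard, assuming the hook only needs to act on results that expose `shape` and `ndim`. The names `looks_like_pandas_result` and `suggest_at_iat_guarded` are hypothetical and not part of dovpanda.

```python
def looks_like_pandas_result(res):
    """Return True only if `res` exposes array-like `shape` and `ndim` attributes."""
    shape = getattr(res, "shape", None)
    ndim = getattr(res, "ndim", None)
    return isinstance(shape, tuple) and isinstance(ndim, int) and ndim >= 1


def suggest_at_iat_guarded(res, arguments):
    """Hypothetical hook body: bail out early on scalars, dicts, strings, etc."""
    if not looks_like_pandas_result(res):
        return None
    # Only reached for Series/DataFrame-like results.
    return res.shape
```

With such a guard, values like `{100: 1, 200: 2}`, `'a'`, or a `Timestamp` would pass through the hook untouched instead of raising.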
Design option: one ledger per package
Add an option for a new kind of hint, for the exceptions that pandas raises.
If an exception is thrown, try to analyze it and explain it extensively.
After installing the package, when I import dovpanda I get a SyntaxError
Windows OS
IPython
import dovpanda
Traceback (most recent call last):
File "C:\Program Files\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2881, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-4-d1c921b0113c>", line 1, in <module>
import dovpanda
File "C:\Program Files\JetBrains\PyCharm 2018.2.3\helpers\pydev\_pydev_bundle\pydev_import_hook.py", line 20, in do_import
module = self._system_import(name, *args, **kwargs)
File "C:\Program Files\Anaconda3\lib\site-packages\dovpanda\__init__.py", line 6, in <module>
from dovpanda import tips
File "C:\Program Files\JetBrains\PyCharm 2018.2.3\helpers\pydev\_pydev_bundle\pydev_import_hook.py", line 20, in do_import
module = self._system_import(name, *args, **kwargs)
File "C:\Program Files\JetBrains\PyCharm 2018.2.3\helpers\pydev\_pydev_bundle\pydev_import_hook.py", line 20, in do_import
module = self._system_import(name, *args, **kwargs)
File "C:\Program Files\Anaconda3\lib\site-packages\dovpanda\tips.py", line 48
html = f'''
<div class="alert alert-warning" role="alert">
{self.html}
<button type="button" class="close" data-dismiss="alert" aria-label="Close">
<span aria-hidden="true">×</span>
</button>
</div>
<p>
Source: <a href="{self.ref_url}" target="_blank">{self.ref_name}</a>
</p>
'''
^
SyntaxError: invalid syntax
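f-strings were added in Python 3.6, so a SyntaxError pointing at an `f'''` literal is what an older interpreter would produce. The report does not state the Python version, so that diagnosis is an assumption. The sketch below renders the same template with `str.format`, which also works on older versions; `TEMPLATE` and `render_tip` are hypothetical names.

```python
import sys

# Same markup as the f-string in tips.py, with named placeholders instead.
TEMPLATE = '''
<div class="alert alert-warning" role="alert">
  {html}
  <button type="button" class="close" data-dismiss="alert" aria-label="Close">
    <span aria-hidden="true">&times;</span>
  </button>
</div>
<p>
  Source: <a href="{ref_url}" target="_blank">{ref_name}</a>
</p>
'''


def render_tip(html, ref_url, ref_name):
    """Fill the template without relying on f-string syntax."""
    return TEMPLATE.format(html=html, ref_url=ref_url, ref_name=ref_name)


# True on interpreters that can parse f-strings at all.
SUPPORTS_FSTRINGS = sys.version_info >= (3, 6)
```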
create Ledger.tell
and change the print function.
allow
with dovpanda.mute():
    pd.concat([df1, df2])
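One way the proposed `mute()` could work is a context manager that flips a flag on the ledger and restores it afterwards. This is only a sketch: the `Ledger` class here is a stand-in with a `muted` attribute that is assumed for illustration, not dovpanda's actual implementation.

```python
from contextlib import contextmanager


class Ledger:
    """Stand-in for dovpanda's ledger; only the parts needed for muting."""

    def __init__(self):
        self.muted = False
        self.messages = []

    def tell(self, message):
        # Drop the hint silently while muted.
        if not self.muted:
            self.messages.append(message)


ledger = Ledger()


@contextmanager
def mute():
    """Silence hints inside the `with` block, then restore the previous state."""
    previous = ledger.muted
    ledger.muted = True
    try:
        yield
    finally:
        ledger.muted = previous
```

Restoring the previous value in `finally` keeps nested `with mute():` blocks and exceptions inside the block from leaving the ledger muted.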
After a Series or DataFrame is created, and after a new column is added, check whether the data looks like datetimes but is stored in a different dtype.
per-hook
Check whether the data inserted into the structure has a date structure (with the help of python-dateutil). If the dtype is not a datetime dtype, suggest using pandas.to_datetime().
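A rough sketch of the detection step follows. To stay within the standard library it uses `datetime.fromisoformat` as a stand-in for python-dateutil, so it only recognizes ISO-style strings. The function name and the 0.9 threshold are assumptions made for illustration.

```python
from datetime import datetime


def looks_like_datetime(values, threshold=0.9):
    """Return True if at least `threshold` of the string values parse as ISO dates."""
    strings = [v for v in values if isinstance(v, str)]
    if not strings:
        return False
    parsed = 0
    for text in strings:
        try:
            datetime.fromisoformat(text)
            parsed += 1
        except ValueError:
            pass  # not an ISO date; count it as a miss
    return parsed / len(strings) >= threshold
```

A hook could call this on a sample of an object-dtype column and suggest `pandas.to_datetime()` when it returns True.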
I'm trying to concatenate csv files that start with a keyword (the example uses A). With dovpanda, the csv files don't seem to be found. I checked the "glob(os.path.join(keyword + '*.csv'))" line by itself, and it brings up a list of the correct csv files to concatenate.
Windows 10
Jupyter Notebook
Python 3.6.5 :: Anaconda, Inc.
import pandas as pd
import dovpanda
keyword = str('A')
df = pd.concat(map(pd.read_csv, glob(os.path.join(keyword + '*.csv'))))
display (df)
df.to_csv(path_or_buf=(f"Files {today}.csv"), index=False, encoding='ascii')
SAD PANDA
I'm so sorry, but I crashed on wrong_concat_axis hooks on concat with error descriptor 'union' of 'set' object needs an argument
But you can change that!
Please Report a bug
Line 5: df = pd.concat(map(pd.read_csv, glob(os.path.join(keyword + '*.csv'))))
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-13-221aac6881a7> in <module>()
3 #CRM File merge
4 keyword = str('A')
----> 5 df = pd.concat(map(pd.read_csv, glob(os.path.join(keyword + '*.csv'))))
6
7 myList = ['enrollment_id','roster_record_id','user_id','email','ssn','identify_as',
~\Anaconda3\lib\site-packages\dovpanda\base.py in run(*args, **kwargs)
154 arguments = self._get_arguments(f, *args, **kwargs)
155 self.run_hints(pres, arguments)
--> 156 ret = f(*args, **kwargs)
157 self.run_hints(posts, ret, arguments)
158 return ret
~\Anaconda3\lib\site-packages\pandas\core\reshape\concat.py in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, sort, copy)
226 keys=keys, levels=levels, names=names,
227 verify_integrity=verify_integrity,
--> 228 copy=copy, sort=sort)
229 return op.get_result()
230
~\Anaconda3\lib\site-packages\pandas\core\reshape\concat.py in __init__(self, objs, axis, join, join_axes, keys, levels, names, ignore_index, verify_integrity, copy, sort)
260
261 if len(objs) == 0:
--> 262 raise ValueError('No objects to concatenate')
263
264 if keys is None:
ValueError: No objects to concatenate
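One plausible reading of the two errors above, inferred from the messages and not confirmed against dovpanda's source: `map(...)` returns a one-shot iterator, so if a pre-hook iterates over it, nothing is left when `pd.concat` runs. In addition, calling `set.union` unbound with no arguments raises a TypeError. The sketch below shows both pitfalls being avoided; the helper names are hypothetical.

```python
def materialize(objs):
    """Turn a possibly one-shot iterable into something that can be read repeatedly."""
    if isinstance(objs, (list, tuple, dict)):
        return objs  # already re-readable; dicts are kept so their keys survive
    return list(objs)


def union_of_columns(column_sets):
    """Union any number of sets, including zero, without raising."""
    # set().union(*[]) is fine, whereas set.union(*[]) raises TypeError.
    return set().union(*column_sets)
```

A wrapper would then pass the materialized list both to the hooks and to the wrapped pandas function, so that neither sees an exhausted iterator.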
If the saved file size is big, suggest zipping "on the fly", as in tip 10 of the linked tips.
write methods that are text (json, csv)
post
use functionality of get size similar to #58
When df.apply is called, explain that apply on a DataFrame is not much better than a loop.
DataFrame.apply()
pre
straight forward, no need to use args
For analysis time without need for pandas to be loaded.
example:
>>> dovpanda.tip()
you can call .map with a dict
i.e. df.shape
give option such as dovpanda.add_checks()
which will add more hooks for dev checks
Inside a notebook have the option to make dovpanda only work on selected cells with cellmagic
example:
%%dovpanda
df = pd.concat((df1,df2))
df.iloc[0,0]
%dovpanda df = pd.concat((df1,df2))
After importing dovpanda, pd.read_csv(url, ...) no longer works, generating a file not found error.
Any/all
3.7.3
import pandas as pd
df = pd.read_csv("http://www.stat.ufl.edu/~winner/data/pgalpga2008.dat")
print(df.head(3)) # works
import dovpanda
df = pd.read_csv("http://www.stat.ufl.edu/~winner/data/pgalpga2008.dat") # error
<ipython-input-4-91539aa51e2d> in <module>
1 import pandas as pd
----> 2 df = pd.read_csv("http://www.stat.ufl.edu/~winner/data/pgalpga2008.dat")
3 print(df.head(3))
~/anaconda3/envs/tf2/lib/python3.7/site-packages/dovpanda/base.py in run(*args, **kwargs)
157 for pre in pres:
158 if self.similar <= pre.stop_nudge:
--> 159 pre.replacement(arguments)
160 ret = f(*args, **kwargs)
161 for post in posts:
~/anaconda3/envs/tf2/lib/python3.7/site-packages/dovpanda/core.py in check_csv_size(arguments)
108 def check_csv_size(arguments):
109 filename = arguments.get('filepath_or_buffer')
--> 110 if os.path.getsize(filename) > config.MAX_CSV_SIZE:
111 ledger.tell('File size is very large and may take time to load. '
112 'If you would like to avoid format issues before the complete file loads, '
~/anaconda3/envs/tf2/lib/python3.7/genericpath.py in getsize(filename)
48 def getsize(filename):
49 """Return the size of a file, reported by os.stat()."""
---> 50 return os.stat(filename).st_size
51
52
FileNotFoundError: [Errno 2] No such file or directory: 'http://www.stat.ufl.edu/~winner/data/pgalpga2008.dat'
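The traceback shows `os.path.getsize` being called on a URL. Below is a minimal sketch of a size check that skips anything that is not a local file; `local_file_size` is a hypothetical helper, not dovpanda's actual fix.

```python
import os


def local_file_size(filepath_or_buffer):
    """Return the size in bytes of a local file, or None for URLs, buffers, and missing paths."""
    if not isinstance(filepath_or_buffer, (str, bytes, os.PathLike)):
        return None  # file-like object or other buffer
    if not os.path.isfile(filepath_or_buffer):
        return None  # URL or nonexistent path
    return os.path.getsize(filepath_or_buffer)
```

The hook would then compare against `config.MAX_CSV_SIZE` only when the result is not None.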
When the user calls reset_index, try to determine whether the index is meaningful. If not, suggest reset_index with drop=True.
reset_index
pre
if name_of_index is [non-important-names-list]: hint
Currently there is a check for whether pandas has been imported. For better verbosity, it could also check whether pandas is installed at all and whether its version is supported.
can use pkg_resources
and specifically pkg_resources.require
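A sketch of such a check follows. Note that it swaps in the standard library's `importlib.metadata` (Python 3.8+) in place of the `pkg_resources` approach suggested above. The helper names are hypothetical, and the minimum version a caller passes in would be chosen by the project.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_tuple(version):
    """Parse the leading numeric components of a version string, e.g. '1.5.3' -> (1, 5, 3)."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break  # stop at the first non-numeric component such as '0rc1'
        parts.append(int(piece))
    return tuple(parts)


def is_supported(package, minimum):
    """Return True if `package` is installed with a version of at least `minimum`."""
    version = installed_version(package)
    return version is not None and version_tuple(version) >= minimum
```

For example, `is_supported("pandas", (0, 25))` would distinguish "not installed" and "too old" from a usable installation, where `(0, 25)` is only a placeholder threshold.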
I'm loading and preprocessing a legitimate dataframe, using a long function, with calls to pymongo and mysql (will try spending the time to make a minimal example if issue is unclear).
I have this line:
df2 = pd.concat([v for v in some_cache_df.values()]) if some_cache_df else pd.DataFrame()
It raises an IndexError
in line 45 of /dovpanda/core.py, where there's a call to cols = {df.shape[1] for df in objs}
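`shape[1]` does not exist for one-dimensional objects such as a Series, whose shape is a one-element tuple, which would explain an IndexError on that line. Below is a sketch of a more defensive version of the set comprehension; `column_counts` is a hypothetical name.

```python
def column_counts(objs):
    """Collect the number of columns of each 2-D object, skipping 1-D ones such as Series."""
    counts = set()
    for obj in objs:
        shape = getattr(obj, "shape", ())
        if len(shape) > 1:
            counts.add(shape[1])
    return counts
```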
if series > scalar, and inside condition, and setitem...
might be complex to implement.
ge and setitem maybe more
pre
When using this method, the hint throws an error.
self.data.groupby(by="Class").count().idxmax()[0]
What that line does is find the most repeated class (column) in the dataframe. (probably there is a better way)
import pandas as pd
import dovpanda
a = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],"Class": ['a', 'a', 'b', 'a', 'b', 'c']})
a.groupby(by="Class").count().idxmax()[0]
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-5-fe782dc356aa> in <module>
1 a = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],"Class": ['a', 'a', 'b', 'a', 'b', 'c']})
2
----> 3 a.groupby(by="Class").count().idxmax()[0]
c:\users\marti\appdata\local\programs\python\python37\lib\site-packages\dovpanda\base.py in run(*args, **kwargs)
161 for post in posts:
162 if self.similar <= post.stop_nudge:
--> 163 post.replacement(ret, arguments)
164 return ret
165
c:\users\marti\appdata\local\programs\python\python37\lib\site-packages\dovpanda\core.py in suggest_at_iat(res, arguments)
186 def suggest_at_iat(res, arguments):
187 self = arguments.get('self')
--> 188 shp = res.shape
189 if res.ndim < 1: # Sometimes specific slicing will return value
190 return
AttributeError: 'str' object has no attribute 'shape'
Don't run hooks when pandas calls relative to itself.
probable design: run inside at the start of a replacement function, and make it return if not from user's code.
future functionality: Make as a decorator with "levels"
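A sketch of how a replacement function might tell whether it was triggered from pandas' own code: inspect the calling frame's module name. It relies on `sys._getframe`, a CPython implementation detail, and `called_from_package` is a hypothetical helper.

```python
import sys


def called_from_package(package, depth=1):
    """Return True if the code `depth` frames above this function lives in `package`.

    depth=1 is the direct caller of this function.
    """
    frame = sys._getframe(depth)
    module_name = frame.f_globals.get("__name__", "")
    return module_name == package or module_name.startswith(package + ".")
```

Inside a wrapper, a larger `depth` would be needed to skip dovpanda's own frames before reaching the code that called the pandas method.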
dovpanda appears to be incompatible with pandas' concat method.
Linux 4.4.0 via Windows Subsystem for Linux
also reproduces on native Linux 3.10.0 (CentOS distro)
Python version 3.7.3
import pandas as pd
import dovpanda
s1 = pd.Series(['a', 'b'])
s2 = pd.Series(['c', 'd'])
pd.concat([s1, s2])
IndexError: tuple index out of range
Same as the post hook for concat, but for more combining operations such as join, merge, etc.
Functions that combine dataframes
Maybe requires only adding a list to an existing decorator.
I am getting the following error when importing dovpanda from inside the console :
NameError: name 'display' is not defined
WIN10-64bit
import pandas as pd
import numpy as np
import dovpanda
Traceback (most recent call last):
File "", line 1, in
File "C:\DevPrograms\Anaconda2\envs\test\lib\site-packages\dovpanda\__init__.py", line 7, in
from dovpanda.core import ledger
File "C:\DevPrograms\Anaconda2\envs\test\lib\site-packages\dovpanda\core.py", line 7, in
ledger = Ledger()
File "C:\DevPrograms\Anaconda2\envs\test\lib\site-packages\dovpanda\base.py", line 119, in __init__
self.teller = _Teller()
File "C:\DevPrograms\Anaconda2\envs\test\lib\site-packages\dovpanda\base.py", line 34, in __init__
self.set_output('display')
File "C:\DevPrograms\Anaconda2\envs\test\lib\site-packages\dovpanda\base.py", line 103, in set_output
self.output = display
NameError: name 'display' is not defined
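The bare name `display` is typically only available without an import inside IPython or Jupyter sessions, which would explain the NameError in a plain console. Below is a sketch of a fallback, assuming `print` is an acceptable output outside notebooks; `pick_output` is a hypothetical name.

```python
def pick_output():
    """Return IPython's display function if IPython is installed, else the built-in print."""
    try:
        from IPython.display import display
        return display
    except ImportError:
        return print
```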
if 3 or more are called in the same line, suggest multiline
(df
.f()
.g()
.h()
)
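One way to detect long chains is to parse the source line with the standard `ast` module and count nested method calls. This is a sketch under the assumption that the hook has access to the source text; the function names are hypothetical.

```python
import ast


def chain_length(node):
    """Count consecutive method calls ending at `node`, e.g. df.f().g() -> 2."""
    length = 0
    while isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
        length += 1
        node = node.func.value  # step to the object the method was called on
    return length


def longest_chain(source):
    """Return the longest method-call chain found anywhere in `source`."""
    tree = ast.parse(source)
    return max((chain_length(n) for n in ast.walk(tree)), default=0)
```

A hint could fire when `longest_chain(line) >= 3` and the expression sits on a single physical line.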
(almost) everything
same as #37 but loop over all columns in a dataframe.
DF_CREATION
and maybe more.
post
for col in res.columns:
#do similar to #37
if a hook is behaving badly, tell that to the user but let them continue working.
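A minimal sketch of that behavior: wrap each hook call, report the failure, and carry on. The name `run_hook_safely` and the message format are assumptions, not dovpanda's actual code.

```python
def run_hook_safely(hook, *args, report=print):
    """Run `hook`; on failure report the problem and return None instead of raising."""
    try:
        return hook(*args)
    except Exception as exc:  # a hint must never break the user's own code
        report("Hook {!r} failed with {}: {}".format(
            hook.__name__, type(exc).__name__, exc))
        return None
```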
I would like to propose...
If the user loops over many dfs and appends each new one to a big existing one, tell them to collect the dfs in a list and call concat once.
append
.
pre
Use ledger.memory
and see if in recent history the append is called from the same line - meaning it's in a loop.
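A sketch of what such a memory might look like: a bounded history of call sites, plus a check for the same site repeating. `CallMemory`, `probably_in_loop`, and the threshold of 3 are hypothetical and do not describe the real `ledger.memory`.

```python
from collections import deque


class CallMemory:
    """Remember the most recent (method, filename, lineno) call sites."""

    def __init__(self, size=10):
        self.calls = deque(maxlen=size)  # old entries fall off automatically

    def record(self, method, filename, lineno):
        self.calls.append((method, filename, lineno))

    def repeats(self, method, filename, lineno):
        """Count how often this exact call site appears in recent history."""
        return sum(1 for call in self.calls if call == (method, filename, lineno))


def probably_in_loop(memory, method, filename, lineno, threshold=3):
    """Guess that a call sits in a loop if the same line was seen `threshold` times."""
    return memory.repeats(method, filename, lineno) >= threshold
```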
I'm running a long piece of code (imported to a notebook), and though I see a useful hint (I'm doing series comparison wrong), I can't find what is causing it (maybe the hint is wrong? I can't tell).
I'd like some hint (haha) as to what command or line is causing the hint to pop
if same number of rows, and different col names, check that correct axis (horizontal)
When loading large csv files (100K lines at least), we should recommend that the user add nrows=10, so they can check that the types and formatting of the csv file fit their needs.
read_csv
pre-hook
When trying to extract one cell using iloc, dovpanda runs into an error. Sometimes, despite the error, this suggestion is made:
The shape of the returned series from slicing is (1,) Which suggests you are interested in the value and not in a new series. Try instead: series.iat[row, col]
df = pd.DataFrame([pd.Timestamp('01-01-2019')])
df.iloc[0][0]
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-25-dc9af601df62> in <module>
1 df = pd.DataFrame([pd.Timestamp('01-01-2019')])
----> 2 df.iloc[0][0]
~/anaconda3/envs/x/lib/python3.7/site-packages/dovpanda/base.py in run(*args, **kwargs)
161 for post in posts:
162 if self.similar <= post.stop_nudge:
--> 163 post.replacement(ret, arguments)
164 return ret
165
~/anaconda3/envs/x/lib/python3.7/site-packages/dovpanda/core.py in suggest_at_iat(res, arguments)
186 def suggest_at_iat(res, arguments):
187 self = arguments.get('self')
--> 188 shp = res.shape
189 if res.ndim < 1: # Sometimes specific slicing will return value
190 return
AttributeError: 'Timestamp' object has no attribute 'shape'
Give a function foo the option to hook more than one method.
@ledger.add_hint('bar')
@ledger.add_hint('baz')
def foo():
...
@ledger.add_hint(['bar','baz'])
def foo():
...
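Both spellings can be supported by one decorator factory, provided it accepts either a string or a list and returns the function unchanged. The `Ledger` below is a hypothetical registry written for illustration, not dovpanda's class.

```python
from collections import defaultdict


class Ledger:
    """Hypothetical registry mapping method names to hint functions."""

    def __init__(self):
        self.hints = defaultdict(list)

    def add_hint(self, method_names):
        # Accept a single name or a list of names.
        if isinstance(method_names, str):
            method_names = [method_names]

        def decorator(func):
            for name in method_names:
                self.hints[name].append(func)
            return func  # return the function unchanged so decorators can stack

        return decorator


ledger = Ledger()


@ledger.add_hint(['bar', 'baz'])
def foo():
    return 'hint'


@ledger.add_hint('qux')
@ledger.add_hint('quux')
def stacked():
    return 'stacked hint'
```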
if
then
suggest converting to Categorical
series.str.split(',',expand=True)
As I understand, the dov hints when a line calls a function with "inplace=False" (explicitly or implicitly by default) but there's no assignment.
However, sometimes functions are called in order to show the results in an output cell, or a plot.
For example, I might want to call
pd.Series(data=model.feature_importances_, index=features).sort_values(ascending=False)
In order to inspect the top features, even without assignment.
Possible workarounds/solutions:
if same column names check that correct axis (vertical)
df.dropna(inplace=False) returns the modified dataframe. When changing to inplace=True, it's easy to forget to remove the "df= " at the beginning - so it's a useful hint...
pre-hook
I don't know... maybe just notice that the df is assigned None?
Have the ledger have a memory/cache so it can remember X last pandas operations.
example:
series1 == series2 should hint the user to use series1.equals(series2), unless inside .loc, where it is expected.
DataFrame
) in order to add to memory.
class Hook:
    def __init__(self, hook_function, hook_type='pre', hook_level=default):
        pass
    def __repr__(self): ...
    def _repr_html_(self): ...
Consider combining with Teller
class
After a DataFrame is created, if there are pairs of columns whose correlation is above a threshold, notify the user about them.
Every function that results in the creation of a DataFrame
Post
Calculate the pairwise correlation.
Need to think about the overhead of time and memory with big datasets.
The bot created this issue to inform you that pyup.io has been set up on this repo.
Once you have closed it, the bot will open pull requests for updates as soon as they are available.
if user checks df1 == df2
suggest maybe they mean df1.equals(df2)
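A short illustration of why the two differ, assuming pandas is installed: `==` compares element by element and returns a DataFrame of booleans (with NaN never equal to NaN), while `.equals` returns a single boolean and treats NaNs in the same position as equal.

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1.0, None]})
df2 = pd.DataFrame({"a": [1.0, None]})

# Element-wise comparison: a DataFrame of booleans, NaN == NaN is False.
elementwise = df1 == df2

# Whole-object comparison: one boolean, NaNs in matching positions count as equal.
same = df1.equals(df2)
```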