wearepal / data-science-types
Mypy stubs, i.e., type information, for numpy, pandas and matplotlib
License: Apache License 2.0
Hi,
I'm using pyright in combination with the stubs provided by this repo. I'm getting a problem when I check the following code:
import pandas as pd
d = {'a': [1, 2]}
df = pd.DataFrame(data=d)
df_merged: pd.DataFrame = df.merge(right=df)
By running pyright against this code I get the following errors:
3:19 - error: Argument of type "Dict[_str, Series[TypeVar('_DType')] | Index[TypeVar('_T')] | ndarray[TypeVar('_DType')] | Sequence[int]]" cannot be assigned to parameter "data" of type "Series[TypeVar('_DType')] | Index[TypeVar('_T')] | ndarray[TypeVar('_DType')] | Sequence[TypeVar('_T_co')] | DataFrame | Dict[_str, Series[TypeVar('_DType')] | Index[TypeVar('_T')] | ndarray[TypeVar('_DType')] | Sequence[TypeVar('_T_co')]] | None" in function "__init__"
Type "Dict[_str, Series[TypeVar('_DType')] | Index[TypeVar('_T')] | ndarray[TypeVar('_DType')] | Sequence[int]]" cannot be assigned to type "Series[TypeVar('_DType')] | Index[TypeVar('_T')] | ndarray[TypeVar('_DType')] | Sequence[TypeVar('_T_co')] | DataFrame | Dict[_str, Series[TypeVar('_DType')] | Index[TypeVar('_T')] | ndarray[TypeVar('_DType')] | Sequence[TypeVar('_T_co')]] | None"
"Dict[_str, Series[TypeVar('_DType')] | Index[TypeVar('_T')] | ndarray[TypeVar('_DType')] | Sequence[int]]" is incompatible with "Series[TypeVar('_DType')]"
"Dict[_str, Series[TypeVar('_DType')] | Index[TypeVar('_T')] | ndarray[TypeVar('_DType')] | Sequence[int]]" is incompatible with "Index[TypeVar('_T')]"
"Dict[_str, Series[TypeVar('_DType')] | Index[TypeVar('_T')] | ndarray[TypeVar('_DType')] | Sequence[int]]" is incompatible with "ndarray[TypeVar('_DType')]"
"Dict[_str, Series[TypeVar('_DType')] | Index[TypeVar('_T')] | ndarray[TypeVar('_DType')] | Sequence[int]]" is incompatible with "Sequence[TypeVar('_T_co')]"
"Dict[_str, Series[TypeVar('_DType')] | Index[TypeVar('_T')] | ndarray[TypeVar('_DType')] | Sequence[int]]" is incompatible with "DataFrame"
Cannot assign to "None"
TypeVar "_VT" is invariant
... (reportGeneralTypeIssues)
4:13 - error: "df.merge(right=df)" has type "Series[Unknown]" and is not callable (reportGeneralTypeIssues)
4:1 - error: Type of "df_merged" is unknown (reportUnknownVariableType)
Ignoring error 3:19 (which is not entirely clear to me either), I would like to focus on error 4:13. Why is the function merge said to have type Series[Unknown]?
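A likely cause (my reading of the situation, not verified against the stubs' source) is a `__getattr__` fallback on `DataFrame` that returns `Series`, used to model column access. Any method missing from the stub then resolves through that fallback. A minimal sketch of the mechanism:

```python
class Series:
    """Stand-in for the stubs' Series class."""

class DataFrame:
    # Pandas stubs often model column access (df.some_column) with this fallback:
    def __getattr__(self, name: str) -> Series:
        return Series()

df = DataFrame()
col = df.some_column  # resolved via __getattr__, typed as Series
# If `merge` is absent from the stub, a checker resolves df.merge the same
# way, sees a Series, and reports "Series[Unknown] ... is not callable".
```

If that is what is happening, adding a proper `merge` signature to the stub would shadow the fallback and fix the error.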
To Reproduce
install data-science-types stubs:
pip install data-science-types
install pyright:
sudo npm install -g pyright
run the file with the code above:
pyright test.py
I've already posted the issue on pyright and was advised to submit an issue here, because pyright is behaving in accordance with the information provided by the stubs.
I have these two calls:
master_df[DF_VAR_COLUMN] = np.empty(shape=master_df.shape[0], dtype=str)
master_df[DF_VAR_IDX_COLUMN] = np.empty(shape=master_df.shape[0], dtype=int)
With the numpy stubs in place, mypy does not like this, claiming that:
eval.py:809: error: Value of type variable "_DType" of "empty" cannot be "str"
eval.py:810: error: Value of type variable "_DType" of "empty" cannot be "int"
But these are legitimate values, AFAICT. I'll try to see about a PR.
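In the meantime, a workaround that may satisfy the stubs (an assumption on my part: the `_DType` TypeVar presumably ranges over numpy scalar types only) is to spell the dtypes with numpy's scalar types, which are equivalent at runtime:

```python
import numpy as np

n = 4
# dtype=str and dtype=int work at runtime, but if the stubs constrain _DType
# to numpy scalar types, these spellings may type-check where the builtins don't:
col_str = np.empty(shape=n, dtype=np.str_)   # equivalent to dtype=str
col_int = np.empty(shape=n, dtype=np.int64)  # equivalent to dtype=int on most platforms
```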
pandas.to_datetime is not in the type stubs
Also pandas.Timestamp is missing
The concat method for joining multiple DataFrames appears to be missing several arguments, such as join, keys, levels, and more.
Compare to the Pandas docs:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html
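For reference, a hedged sketch of what a fuller stub entry could look like; the parameter names come from the pandas docs, while the types here are my assumptions, not a finished design:

```python
from typing import Mapping, Optional, Sequence, Union

class DataFrame:
    """Stand-in for the stubs' DataFrame class."""

def concat(
    objs: Union[Sequence[DataFrame], Mapping[str, DataFrame]],
    axis: int = 0,
    join: str = "outer",
    ignore_index: bool = False,
    keys: Optional[Sequence[object]] = None,
    levels: Optional[Sequence[Sequence[object]]] = None,
    names: Optional[Sequence[str]] = None,
    sort: bool = False,
) -> DataFrame: ...
```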
Script used:
import numpy as np
arr: np.ndarray[np.float32] = np.array([0, 1, np.inf], dtype=np.float32)
print(np.isfinite(arr))
tests/pandas_test.py line 92 fails on Pandas 1.2
Extracting the relevant code
import pandas as pd
df: pd.DataFrame = pd.DataFrame(
    [[1.0, 2.0], [4.0, 5.0], [7.0, 8.0]],
    index=["cobra", "viper", "sidewinder"],
    columns=["max_speed", "shield"],
)
s: "pd.Series[float]" = df["shield"].copy()
df.iloc[0] = s
Results in
ValueError: could not broadcast input array from shape (3) into shape (2)
This runs fine on Pandas 1.1.5
I think I've found a bug in the pyplot.pyi generation scripts.
I cloned this repository's master branch and set up a virtualenv from Python 3.7.3 (virtualenv -p python3 .venv). Running pip install -e . inside the virtual environment generated a pyplot.pyi with missing commas at line 229 and below.
222 def plot(
223 x: Data,
224 y: Data,
225 fmt: Optional[str] = ...,
226 *,
227 scalex: bool = ...,
228 scaley: bool = ...,
229 agg_filter: Callable[[_NumericArray, int], _NumericArray] = ... # <-- comma missing here and at end of lines below
230 alpha: Optional[float] = ...
231 animated: Optional[bool] = ...
232 antialiased: Optional[bool] = ...
233 aa: Optional[bool] = ..., #alias of antialiased
234 clip_box: Optional[Bbox] = ...
235 clip_on: Optional[bool] = ...
236 clip_path: Optional[Callable[[Path, Transform], None]] = ...
237 color: Optional[str] = ...
238 c: Optional[str] = ...
239 contains: Optional[Callable[[Artist, MouseEvent], Tuple[bool, dict]]] = ...
240 dash_capstyle: Optional[Literal['butt', 'round', 'projecting']] = ...
241 dash_jointstyle: Optional[Literal['miter', 'round', 'bevel']] = ...
242 dashes: Optional[[Sequence[float], Tuple[None, None]]] = ...
243 drawstyle: Literal['default', 'steps', 'steps-pre', 'steps-mid', 'steps-post'] = ...
244 ds: Literal['default', 'steps', 'steps-pre', 'steps-mid', 'steps-post'] = ...
245 figure: Optional[Figure] = ...
246 fillstyle: Literal['full', 'left', 'right', 'bottom', 'top', 'none'] = ...
There are many types that are unions of List and np.ndarray and Series. These should probably all be transformed to use Sequence instead (which would also cover legitimate uses of Tuple).
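A small illustration of why `Sequence` is the better choice (generic code, not the stubs' actual definitions): a parameter typed `Sequence` accepts lists and tuples alike, whereas `List` rejects tuples:

```python
from typing import List, Sequence

def mean_seq(xs: Sequence[float]) -> float:
    return sum(xs) / len(xs)

def mean_list(xs: List[float]) -> float:
    return sum(xs) / len(xs)

mean_seq([1.0, 2.0, 3.0])   # OK
mean_seq((1.0, 2.0, 3.0))   # OK: tuples are Sequences
# mean_list((1.0, 2.0, 3.0))  # mypy error: expected List[float], got a tuple
```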
API reference: pandas.isna and pandas.Index.isna
There is something about numpy dtypes and stubs that I don't understand, and it is keeping me from fixing some stubs. I hope someone can correct me.
After extending the type stubs for DataFrame's __init__ and astype as follows:
class DataFrame:
    def __init__(
        self,
        data: Optional[Union[_ListLike, DataFrame, Dict[_str, _np.ndarray]]] = ...,
        columns: Optional[_ListLike] = ...,
        index: Optional[_ListLike] = ...,
        dtype: Optional[_np.dtype] = ...,
    ): ...
    ...
    def astype(self, dtype: Union[_str, Dict[str, _np.dtype]], copy: bool = True, errors: _ErrorType = 'raise') -> DataFrame: ...
I have the following which does not type-check properly:
query_df = pd.DataFrame(
    columns=[
        TEMPERATURE_COL,
        OD_COL,
        "od_log",
        "media",
        "gate",
        "input",
        "mean_log_gfp_live",
        "mean_log_gfp_",
    ],
    dtype=np.float64,
)
with the error eval.py:210: error: Argument "dtype" to "DataFrame" has incompatible type "Type[float64]"; expected "Optional[dtype]"
and this also:
query_df = query_df.astype(dtype={"input": np.str_, "gate": np.str_}, copy=False)
mypy3: Dict entry 0 has incompatible type "str": "Type[str_]"; expected "str": "dtype"
and mypy3: Dict entry 1 has incompatible type "str": "Type[str_]"; expected "str": "dtype"
I looked at numpy_stubs/__init__.pyi, and it looks like np.float64 and np.str_ are both defined there:
class floating(number, float): ...
class float64(floating): ...
...
class str_(dtype, str): ...
but it seems like mypy is seeing the actual values from numpy instead of the values from the numpy stubs.
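I think what is happening (my reading, not verified against mypy internals) is not that mypy sees the wrong values, but that `np.float64` and `np.str_` are scalar *types*, i.e. statically `Type[float64]`, while the stub asks for a `dtype` *instance*. numpy coerces one to the other at runtime, which the annotation doesn't reflect:

```python
import numpy as np

# np.float64 is a scalar type (a class), not a dtype instance:
assert isinstance(np.float64, type)
assert not isinstance(np.float64, np.dtype)

# numpy coerces the class to a dtype at runtime, which is why
# dtype=np.float64 works even though its static type is Type[float64]:
assert np.dtype(np.float64) == np.float64
```

So widening the stub's `dtype` parameters to also accept the scalar types (something like `Type[_np.generic]`) may be the fix, if I'm reading this right.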
This PR: wearepal/EthicML#236 and this PR: wearepal/EthicML#246 added the use of a lot of functions to EthicML that aren't in data-science-types yet.
Running mypy with data-science-types on the following
import pandas as pd
df = pd.DataFrame({'a': [1]})
df.to_pickle('output.pkl')
Produces an error:
error: "Series[Any]" not callable
I would expect it to pass, since DataFrame.to_pickle exists.
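For what it's worth, a minimal stub entry along these lines might fix it; the parameter set below is my assumption based on the pandas docs, not the repo's code:

```python
# Hedged sketch of a to_pickle stub entry for DataFrame:
class DataFrame:
    def to_pickle(self, path: str, compression: str = ..., protocol: int = ...) -> None: ...
```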
I am new to type-hinting, but I thought I'd give it a go :)
I noticed that quite a few types are missing:
https://numpy.org/doc/stable/user/basics.types.html
https://github.com/predictive-analytics-lab/data-science-types/blob/master/numpy-stubs/__init__.pyi#L43
Is there a reason for these values missing (except that it is a lot of work to migrate all at once)? I might be able to free up some time to add these in a PR if you are interested.
matplotlib.pyplot.close has None as a default argument, however, the stub does not specify None.
https://matplotlib.org/api/_as_gen/matplotlib.pyplot.close.html#matplotlib.pyplot.close
Shouldn't line 217 in https://github.com/predictive-analytics-lab/data-science-types/blob/master/matplotlib-stubs/pyplot.pyi.in
be changed from the first to the second?
def close(fig: Union[Figure, Literal["all"]]) -> None: ...
def close(fig: Union[Figure, Literal["all"], None]) -> None: ...
First of all, thanks so much for doing this typing library. I use nptyping, and we were trying out your data-science-types. Here are the types that are missing. I'm no .pyi expert, but happy to help, so here is what is not working in our project but should all exist, looking at the .pyi files:
I can see why the errors occur: index is typed as just an array, but it actually has the name property.
Pandas.DataFrame.index.name
Pandas.DataFrame.columns.name
Panda.read_hdf, read_html, read_excel, to_hdf
Pandas dataframe can't accept assignment
Pandas.columns can't be assigned
Pandas.dropna missing
Pandas.to_replace missing
Pandas.replace
Pandas.startswith
Pandas.string.startswith
Pandas dataframe cannot be used as a left operand
Numpy.ones_like
Numpy.einsum
numpy.array doesn't handle the pass of a dataframe as an input (which works btw)
numpy.any not available
Would you consider removing your numpy types and letting nptyping handle that? There does not seem to be a good way to make conflicting .pyi files interoperate.
Hi,
in my opinion it's universally considered a best practice (and pythonic) to save memory by using a generator rather than a list, e.g. when concatenating dataframes. Sadly, for now that raises an error in mypy, since concat only accepts Union[Sequence[DataFrame], Mapping[str, DataFrame]]. Notice that generators are not considered sequences.
I would very much appreciate it if somebody could add iterables to the accepted input types of pandas.concat.
Thank you for your time.
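To illustrate the distinction (a generic sketch, not pandas code): generators satisfy `Iterable` but not `Sequence`, so widening `concat`'s first parameter to `Iterable[DataFrame]` would accept them:

```python
from typing import Iterable, Sequence

def takes_sequence(objs: Sequence[int]) -> int:
    return sum(objs)

def takes_iterable(objs: Iterable[int]) -> int:
    return sum(objs)

total = takes_iterable(i for i in range(4))  # OK: a generator is an Iterable
# takes_sequence(i for i in range(4))  # mypy error: a generator is not a Sequence
```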
Looks like the stubs only allow inplace=True.
@overload
def reset_index(self, drop: bool = ...) -> DataFrame: ...
@overload
def reset_index(self, inplace: Literal[True], drop: bool = ...) -> None: ...
inplace=False is indeed the default behavior, and there is no need to specify it, but it should still be allowed.
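A hedged sketch of overloads that would also accept an explicit inplace=False (making inplace keyword-only here is my choice for the sketch, not necessarily what the stubs should do):

```python
from typing import Literal, Optional, overload

class DataFrame:
    @overload
    def reset_index(self, drop: bool = ..., *, inplace: Literal[False] = ...) -> "DataFrame": ...
    @overload
    def reset_index(self, drop: bool = ..., *, inplace: Literal[True]) -> None: ...
    def reset_index(self, drop: bool = False, *, inplace: bool = False) -> Optional["DataFrame"]:
        # Toy implementation so the sketch runs; the real stub bodies are `...`.
        return None if inplace else self
```

With `Literal[False]` as the default in the first overload, both `reset_index()` and `reset_index(inplace=False)` type-check and return DataFrame, while `reset_index(inplace=True)` still returns None.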
Hi! Cool project, and great to finally be able to not just ignore missing imports in mypy.ini!
I've just been getting started today trying out the library with some pre-existing code that heavily uses pandas. I know it's a work-in-progress so was not too surprised to get a few errors.
Some of these were just missing bits of functionality that would, I think, be straightforward additions: things like pd.date_range, pd.to_datetime, pd.tseries and the like. I'm hoping to find some time to contribute, since I see you encourage it. :)
I was a bit less sure about the results I got for assignments of the form df.loc[:, "column_name"] = ..., for which mypy threw this:
error: Invalid index type "Tuple[slice, str]" for "_LocIndexerFrame"; expected type "Tuple[Union[str, str_], Union[str, str_]]"
This was solvable with a minor refactor, but I had to just type: ignore this one:
error: No overload variant of "__getitem__" of "_LocIndexerFrame" matches argument type "slice"
which appeared when doing df.loc["2018-10":, "column_name"] or df.loc[datetime_object:, "column_name"] (using the date-time index slicing functionality).
I wondered whether supporting slices is unfeasible, or something you'd hope to include? .loc's behaviour is pretty complex, so I appreciate that type-annotating it fully would be painful!
A related issue was this mypy error:
error: Invalid index type "Tuple[Series[bool], str]" for "_LocIndexerFrame"; expected type "Tuple[Union[Union[Series[bool], ndarray[bool_], List[bool]], List[str]], Union[Union[Series[bool], ndarray[bool_], List[bool]], List[str]]]"
This one comes about from df.loc[boolean_series, "column_name"], and my examination of the expected type showed I could refactor to df.loc[boolean_series, ["column_name"]] to get the same functionality. It looks to me as though, as implemented, the type annotations allow you to pass either two collections or two labels to .loc, but not a mixture?
Just want to check what's in-scope for this project before slinging PRs around!
The pandas stubs are missing pandas.to_numeric.
I would like to do a PR, but I'm not really sure where to start or how to write proper type hints for this, as I've only just started learning about Python typing in the last few days. Any help would be much appreciated.
Top level function documented here: https://numpy.org/doc/stable/reference/generated/numpy.savetxt.html
This is here as a reminder to look at this after the NeurIPS deadline.
There is currently no type information for pandas.DataFrame.transpose.
data-science-types==0.2.12
mypy==0.770
typing==3.6.4
Code to reproduce:
import numpy as np
from typing import List, NamedTuple
class Ok(NamedTuple):
    x: List[int]

class Problem(NamedTuple):
    x: np.ndarray[np.int64]  # TypeError: 'type' object is not subscriptable
It seems there's a missing stub for the pandas module (or at least I can't find it). In any case, this code:
pd.to_datetime(...)
Throws
error: Module has no attribute "to_datetime"
There is currently type information for read_csv, read_feather, and read_sql, but no information for read_json.
I forked the repo and ran the tests with ./check_all.sh; this resulted in 152 errors found in 4 files. How do I get started?
I think there is a bug in the ndarray type hint.
In the case of an np.array that has one row with only integers (type int or int64) and one row with at least one float (type float), an error is produced: the rows have different inferred types, so a conflict like "np.ndarray[float] != np.ndarray[int]" occurs.
# bug.py
import numpy as np
arr = np.array([[4.2, 2, 3.5], [12, 3, 6]])
and run mypy:
$ mypy bug.py
bug.py:3: error: Argument 1 to "array" has incompatible type "List[object]"; expected "Union[List[bool], List[List[bool]], List[List[List[bool]]], List[List[List[List[bool]]]]]"
1.19.2 / 3.8.5 / 0.2.20
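A workaround that seems to satisfy the stubs (my assumption: writing both rows as literal floats lets the checker infer List[List[float]] instead of joining the two differently-typed rows to List[object]):

```python
import numpy as np

# Both rows are lists of floats, so mypy infers List[List[float]] rather
# than List[object]; the runtime result is identical to the mixed version:
arr = np.array([[4.2, 2.0, 3.5], [12.0, 3.0, 6.0]])
```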
Should there be a type that captures everything that can be given as a dtype spec? I believe this is Union[_str, Type[_np.dtype]], but I could be wrong. If we can identify a reasonable type for this, it might make a lot of typing smoother and more consistent.
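At runtime, numpy's `np.dtype` constructor accepts strings, scalar types, and dtype instances alike, so one candidate alias (an assumption of mine, not a settled design) would be:

```python
from typing import Type, Union
import numpy as np

# Candidate alias for "anything accepted as a dtype spec":
DtypeSpec = Union[str, np.dtype, Type[np.generic]]

# All three spellings name the same dtype at runtime:
a = np.dtype("float64")
b = np.dtype(np.float64)
c = np.dtype(a)
```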
Are you writing these by hand, or in some other way?
Script used:
import pandas as pd
df: pd.DataFrame = pd.DataFrame([[1, 2], [1, 4]], columns=["a", "b"], index=["c", "d"])
df.drop_duplicates(subset=["a"], inplace=True)
print(df)
Numpy has finally merged the stubs from numpy-stubs into the main numpy project.
numpy/numpy-stubs#88
numpy/numpy#16515
Will the numpy stubs in this project be removed when numpy 1.20.0 is released?
Example :
def upper_cased_header(df: pd.DataFrame) -> pd.DataFrame:
    df.columns = [header.upper() for header in df.columns]
    return df
mypy will return
error: Unexpected keyword argument "inplace" for "fillna" of "Series"
Script used:
import pandas as pd
x: pd.DataFrame = pd.read_hdf("your_hdf_here.hdf")
I'm working on a PR for this.
pandas.Index takes a few optional parameters in the init after data, such as dtype, copy, name and tupleize_cols.
The current type stubs only have data:
https://github.com/predictive-analytics-lab/data-science-types/blob/3990a8f876a6e36afa53cc044b77d0448a5c468c/pandas-stubs/core/indexes/base.pyi#L19
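A hedged sketch of the fuller signature; the parameter names come from the pandas docs, while the types here are my guesses:

```python
from typing import Optional, Sequence

class Index:
    def __init__(
        self,
        data: Optional[Sequence[object]] = None,
        dtype: Optional[object] = None,
        copy: bool = False,
        name: Optional[object] = None,
        tupleize_cols: bool = True,
    ) -> None: ...
```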
In VSCode with pyright, I'm trying to call savefig with a BytesIO object where the fname would be. I'm ending up with:
Argument of type "BytesIO" cannot be assigned to parameter "fname" of type "str | Path" in function "savefig"
Type "BytesIO" cannot be assigned to type "str | Path"
"BytesIO" is incompatible with "str"
"BytesIO" is incompatible with "Path" (reportGeneralTypeIssues)
Since BytesIO (and also, I think, files opened in wb mode? I'm not sure about that part) is definitely a valid target, I'd like to PR the type in. Would making the fname type Union[str, Path, BytesIO] be sufficient? Or would you prefer more types that could technically fit into the fname slot?
I should add: this is great work. A co-worker of mine and I were looking around for numpy typings, and this repo offers such an improvement over the default typing.
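One option (an assumption on my part, not the project's decided design) that covers both BytesIO and files opened in binary mode without enumerating concrete classes is `IO[bytes]`:

```python
from io import BytesIO
from pathlib import Path
from typing import IO, Union

# IO[bytes] covers BytesIO as well as files opened in "wb" mode:
FnameType = Union[str, Path, IO[bytes]]

def savefig(fname: FnameType) -> None:
    """Toy stand-in with the proposed parameter type."""

savefig("plot.png")
savefig(Path("plot.png"))
savefig(BytesIO())
```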
If you build a DataFrame with header=None, the axes are [RangeIndex(start=0, stop=4, step=1), RangeIndex(start=0, stop=9, step=1)], so you can't access elements with df.at[123, 'xyz']; you need to use df.at[123, 456] instead.
According to https://github.com/predictive-analytics-lab/data-science-types/blob/7dab8238df9e93d00be6d683d8efabbdf95fc958/pandas-stubs/core/indexing.pyi#L88, however, _AtIndexerFrame currently only allows the second index to be a _StrLike.
Ideally this would be a runtime check, but as I think that is not possible with mypy, there should at least not be a false positive.
Since PEP 589 in Python 3.8, we can create a class that inherits from TypedDict and specify which type every key in a dictionary takes, but such classes are not included in the DataFrame data parameter's type, so mypy raises an error.
I fixed this locally by adding an alias:
_TypedDictLike = TypedDict
and adding it to the DataFrame data parameter's type:
data: Optional[Union[_ListLike, DataFrame, Dict[_str, _ListLike], _TypedDictLike]]
I would appreciate it if you could implement this, or I could push a branch with it.
Thank you for your time!
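For what it's worth (a sketch under my own assumptions, not the stubs' code): per PEP 589, every TypedDict is consistent with `Mapping[str, object]`, so accepting a `Mapping` in the data parameter may be a more general fix than a `_TypedDictLike` alias:

```python
from typing import List, Mapping, TypedDict

class Data(TypedDict):
    a: List[int]
    b: List[int]

def frame_like(data: Mapping[str, object]) -> int:
    # Toy stand-in for a DataFrame-style constructor parameter.
    return len(data)

d: Data = {"a": [1, 2], "b": [3, 4]}
n = frame_like(d)  # TypedDicts are consistent with Mapping[str, object]
```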
https://matplotlib.org/api/_as_gen/matplotlib.pyplot.gca.html?highlight=gca#matplotlib.pyplot.gca
https://matplotlib.org/api/_as_gen/matplotlib.pyplot.vlines.html?highlight=vlines#matplotlib.pyplot.vlines
https://matplotlib.org/api/_as_gen/matplotlib.pyplot.hlines.html?highlight=hlines#matplotlib.pyplot.hlines
Three missing stubs from matplotlib.pyplot are gca, hlines, and vlines.
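Hedged sketches of the three signatures; the parameter names come from the matplotlib docs, while the types here are my assumptions:

```python
from typing import Optional, Sequence, Union

class Axes: ...
class LineCollection: ...

def gca() -> Axes: ...

def hlines(
    y: Union[float, Sequence[float]],
    xmin: float,
    xmax: float,
    colors: Optional[str] = ...,
    linestyles: str = ...,
    label: str = ...,
) -> LineCollection: ...

def vlines(
    x: Union[float, Sequence[float]],
    ymin: float,
    ymax: float,
    colors: Optional[str] = ...,
    linestyles: str = ...,
    label: str = ...,
) -> LineCollection: ...
```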
I am in the process of fleshing out a few pyi files with the definitions from Pandas.
My normal process for python development is to create a virtual environment on the root level of each project (to keep code segregated), like so:
python -m venv venv && . venv/bin/activate && pip install --upgrade pip && pip install -e .[dev]
After updating the pyi files and adding tests, everything looks okay right up to the end of check_all.sh. When it runs the line && mypy tests \, it finds a LOT (>900 on my machine) of errors from packages in the venv folder. Sample output:
venv/lib/python3.8/site-packages/mypy/typeshed/stdlib/3/typing.pyi:675: error: Return type becomes "Union[bool, Any]" due to an unfollowed import
venv/lib/python3.8/site-packages/mypy/typeshed/stdlib/3/tkinter/commondialog.pyi:7: error: Function is missing a type annotation for one or more arguments
venv/lib/python3.8/site-packages/mypy/typeshed/stdlib/3/tkinter/commondialog.pyi:8: error: Function is missing a type annotation for one or more arguments
venv/lib/python3.8/site-packages/mypy/typeshed/stdlib/3/_thread.pyi:43: error: Function is missing a type annotation for one or more arguments
venv/lib/python3.8/site-packages/packaging/_typing.py:34: error: Statement is unreachable
Similar lines to those continue for many more lines.
I did notice that if I deleted no_silence_site_packages = True, this goes away, but I'm not sure of the intention behind that setting, so I didn't want to delete it and cause downstream issues.
This is of course silly, but it's also fun.
Script used:
import pandas as pd
df: pd.DataFrame = pd.DataFrame([[1, 2], [3, 4]], columns=["a", "b"], index=["c", "d"])
grouped = df.groupby("a")["b"]
grouped_list = grouped.apply(list)
print(df)
print(grouped)
print(grouped_list)
print(grouped.groups)
print(grouped.get_group(1))
I'm running this code:
import pandas
d = {"c": [1,2,3], "d": [4,5,6]}
df = pandas.DataFrame(data=d)
I was expecting no errors. However, I get this message:
Argument of type "Dict[str, List[int]]" cannot be assigned to parameter "data" of type "Series[TypeVar('_DType')] | Index[TypeVar('_T')] | ndarray[TypeVar('_DType')] | Sequence[TypeVar('_T_co')] | DataFrame | Dict[_str, Series[TypeVar('_DType')] | Index[TypeVar('_T')] | ndarray[TypeVar('_DType')] | Sequence[TypeVar('_T_co')]] | None" in function "__init__"
Type "Dict[str, List[int]]" cannot be assigned to type "Series[TypeVar('_DType')] | Index[TypeVar('_T')] | ndarray[TypeVar('_DType')] | Sequence[TypeVar('_T_co')] | DataFrame | Dict[_str, Series[TypeVar('_DType')] | Index[TypeVar('_T')] | ndarray[TypeVar('_DType')] | Sequence[TypeVar('_T_co')]] | None"
"Dict[str, List[int]]" is incompatible with "Series[TypeVar('_DType')]"
"Dict[str, List[int]]" is incompatible with "Index[TypeVar('_T')]"
"Dict[str, List[int]]" is incompatible with "ndarray[TypeVar('_DType')]"
"Dict[str, List[int]]" is incompatible with "Sequence[TypeVar('_T_co')]"
"Dict[str, List[int]]" is incompatible with "DataFrame"
Cannot assign to "None"
TypeVar "_VT" is invariant
I was expecting that Dict[str, List[int]] is compatible with Dict[_str, Sequence[TypeVar('_T_co')]], which is listed among the possible types of data. Probably I am missing what TypeVar('_T_co') means.
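The `TypeVar "_VT" is invariant` note in the error is the key: `Dict` is invariant in its value type, so `Dict[str, List[int]]` is not a `Dict[str, Sequence[int]]`, while a `Mapping`, which is covariant in its value type, would accept it. A small demonstration (generic code, not the stubs'):

```python
from typing import Dict, List, Mapping, Sequence

d: Dict[str, List[int]] = {"c": [1, 2, 3], "d": [4, 5, 6]}

def needs_dict(m: Dict[str, Sequence[int]]) -> int:
    return len(m)

def needs_mapping(m: Mapping[str, Sequence[int]]) -> int:
    return len(m)

# needs_dict(d)       # mypy error: Dict's value type is invariant
n = needs_mapping(d)  # OK: Mapping is covariant in its value type
```

So changing the stub's `Dict[_str, ...]` to `Mapping[_str, ...]` may be the fix here.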
MWE:
wget https://pypi.io/packages/source/d/data-science-types/data-science-types-0.2.21.tar.gz
tar -xvf data-science-types-0.2.21.tar.gz
tree
and you can see that gen_pyi.py is not there. However, setup.py calls it, so it fails.
We should probably use this: https://github.com/typeddjango/pytest-mypy-plugins .
One particular disadvantage of the current way of testing is that you can't do "negative tests" by which I mean you can't specify that something should throw an error.
Is there a workaround for this when using the where function in numpy?
error: No overload variant of "where" matches argument types "Any", "int", "int"
Many numpy functions accept not only an _ArrayLike (List or ndarray) argument but also a simple primitive value.
This is currently often not allowed by the type hints.
Two examples of code that results in errors when checked with mypy:
myarr = np.array(1.0)
-> No overload variant of "array" matches argument type "float"
np.append(myarr, 1.0)
-> Argument 2 to "append" has incompatible type "float"; expected "Union[Array[Any], Sequence[Any]]"
This could be fixed by using Union[_ArrayLike, _DType] instead of just _ArrayLike.
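A small sketch of the idea (the alias names mirror the issue text; the actual stub names may differ):

```python
from typing import Sequence, Union

_ArrayLike = Sequence[float]              # stand-in for the stubs' array-like union
ScalarOrArray = Union[_ArrayLike, float]  # the proposed widening

def total(x: ScalarOrArray) -> float:
    # Accepts a bare scalar as well as a sequence, mirroring numpy's behaviour:
    return float(x) if isinstance(x, (int, float)) else float(sum(x))

total(1.0)         # OK with the widened type
total([1.0, 2.0])  # still OK
```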
Because this might be done for a lot of functions, I have refrained from partially changing the code where I know numpy accepts primitive types and submitting a pull request. I think this is better implemented on a general scale with more overview than I currently have.
(sorry if I open a lot of issues, I have tried to add this to my project and report every thing that fails)