wearepal / data-science-types

Mypy stubs, i.e., type information, for numpy, pandas and matplotlib

License: Apache License 2.0

Python 99.84% Shell 0.16%
python mypy-stubs numpy pandas matplotlib stubs mypy type-stubs

data-science-types's Issues

Error on merge function of pandas data frames

Hi,
I'm using pyright in combination with the stubs provided by this repo. I'm getting a problem when I check the following code:

import pandas as pd

d = {'a': [1, 2]}
df = pd.DataFrame(data=d)

df_merged: pd.DataFrame = df.merge(right=df)

By running pyright against this code I get the following errors:

 3:19 - error: Argument of type "Dict[_str, Series[TypeVar('_DType')] | Index[TypeVar('_T')] | ndarray[TypeVar('_DType')] | Sequence[int]]" cannot be assigned to parameter "data" of type "Series[TypeVar('_DType')] | Index[TypeVar('_T')] | ndarray[TypeVar('_DType')] | Sequence[TypeVar('_T_co')] | DataFrame | Dict[_str, Series[TypeVar('_DType')] | Index[TypeVar('_T')] | ndarray[TypeVar('_DType')] | Sequence[TypeVar('_T_co')]] | None" in function "__init__"
  Type "Dict[_str, Series[TypeVar('_DType')] | Index[TypeVar('_T')] | ndarray[TypeVar('_DType')] | Sequence[int]]" cannot be assigned to type "Series[TypeVar('_DType')] | Index[TypeVar('_T')] | ndarray[TypeVar('_DType')] | Sequence[TypeVar('_T_co')] | DataFrame | Dict[_str, Series[TypeVar('_DType')] | Index[TypeVar('_T')] | ndarray[TypeVar('_DType')] | Sequence[TypeVar('_T_co')]] | None"
    "Dict[_str, Series[TypeVar('_DType')] | Index[TypeVar('_T')] | ndarray[TypeVar('_DType')] | Sequence[int]]" is incompatible with "Series[TypeVar('_DType')]"
    "Dict[_str, Series[TypeVar('_DType')] | Index[TypeVar('_T')] | ndarray[TypeVar('_DType')] | Sequence[int]]" is incompatible with "Index[TypeVar('_T')]"
    "Dict[_str, Series[TypeVar('_DType')] | Index[TypeVar('_T')] | ndarray[TypeVar('_DType')] | Sequence[int]]" is incompatible with "ndarray[TypeVar('_DType')]"
    "Dict[_str, Series[TypeVar('_DType')] | Index[TypeVar('_T')] | ndarray[TypeVar('_DType')] | Sequence[int]]" is incompatible with "Sequence[TypeVar('_T_co')]"
    "Dict[_str, Series[TypeVar('_DType')] | Index[TypeVar('_T')] | ndarray[TypeVar('_DType')] | Sequence[int]]" is incompatible with "DataFrame"
    Cannot assign to "None"
      TypeVar "_VT" is invariant
... (reportGeneralTypeIssues)
  4:13 - error: "df.merge(right=df)" has type "Series[Unknown]" and is not callable (reportGeneralTypeIssues)
  4:1 - error: Type of "df_merged" is unknown (reportUnknownVariableType)

Setting aside error 3:19 (which is not clear to me either), I would like to focus on error 4:13: why is merge said to have type Series[Unknown]?

To Reproduce
install data-science-types stubs:
pip install data-science-types

install pyright:
sudo npm install -g pyright

run the file with the code above:
pyright test.py

I've already posted the issue on pyright and was advised to submit an issue here instead, because pyright is behaving in accordance with the information provided by the stubs.
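Speculatively, the fix would be to give merge an explicit DataFrame return annotation so pyright does not resolve the call to a Series-typed fallback. A minimal sketch with a stand-in class (illustrative names and parameters, not the repo's actual stub):

```python
# Hypothetical stub sketch: DataFrame.merge annotated to return DataFrame.
from typing import Optional, Sequence, Union

class DataFrame:  # minimal stand-in for the stub class
    def merge(
        self,
        right: "DataFrame",
        how: str = ...,
        on: Optional[Union[str, Sequence[str]]] = ...,
    ) -> "DataFrame": ...
```

With a signature like this, `df.merge(right=df)` would type-check as DataFrame and error 4:13 should disappear.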

False positives on `np.empty`

I have these two calls:

    master_df[DF_VAR_COLUMN] = np.empty(shape=master_df.shape[0], dtype=str)
    master_df[DF_VAR_IDX_COLUMN] = np.empty(shape=master_df.shape[0], dtype=int)

With the numpy stubs in place, mypy does not like this, claiming that:

eval.py:809: error: Value of type variable "_DType" of "empty" cannot be "str"
eval.py:810: error: Value of type variable "_DType" of "empty" cannot be "int"

But these are legitimate values, AFAICT. I'll try to see about a PR.
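The error message suggests _DType is a TypeVar constrained to numpy scalar types, so plain str and int can never bind it. Assuming that diagnosis is right, one possible fix is to widen the dtype parameter to also accept builtin type objects; a sketch with stand-in classes (not the actual numpy stubs):

```python
from typing import Any, Type, TypeVar, Union

class float64: ...  # stand-ins for numpy scalar types
class int64: ...

# A TypeVar constrained like this can never bind to str or int ...
_DType = TypeVar("_DType", float64, int64)

# ... so the parameter itself must be widened to accept builtin types too.
def empty(shape: int, dtype: Union[Type[_DType], Type[str], Type[int]] = ...) -> Any: ...
```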

Numpy has no method 'isfinite'

Script used:

import numpy as np

arr: np.ndarray[np.float32] = np.array([0, 1, np.inf], dtype=np.float32)

print(np.isfinite(arr))

test_frame_iloc fails on Pandas 1.2

tests/pandas_test.py line 92 fails on Pandas 1.2

Extracting the relevant code

import pandas as pd
df: pd.DataFrame = pd.DataFrame(
    [[1.0, 2.0], [4.0, 5.0], [7.0, 8.0]],
    index=["cobra", "viper", "sidewinder"],
    columns=["max_speed", "shield"],
)
s: "pd.Series[float]" = df["shield"].copy()
df.iloc[0] = s

Results in

ValueError: could not broadcast input array from shape (3) into shape (2)

This runs fine on Pandas 1.1.5

Missing commas in generated pyplot.pyi

I think I've found a bug in the pyplot.pyi generation scripts.

I cloned this repository master branch, and set up a virtualenv from python 3.7.3 (virtualenv -p python3 .venv).

Running pip install -e . inside the virtual environment generated a pyplot.pyi with missing commas at line 229 and below.

   222	def plot(
   223	    x: Data,
   224	    y: Data,
   225	    fmt: Optional[str] = ...,
   226	    *,
   227	    scalex: bool = ...,
   228	    scaley: bool = ...,
   229	    agg_filter: Callable[[_NumericArray, int], _NumericArray] = ...    # <-- comma missing here and at end of lines below
   230	    alpha: Optional[float] = ...
   231	    animated: Optional[bool] = ...
   232	    antialiased: Optional[bool] = ...
   233	    aa: Optional[bool] = ..., #alias of antialiased
   234	    clip_box: Optional[Bbox] = ...
   235	    clip_on: Optional[bool] = ...
   236	    clip_path: Optional[Callable[[Path, Transform], None]] = ...
   237	    color: Optional[str] = ...
   238	    c: Optional[str] = ...
   239	    contains: Optional[Callable[[Artist, MouseEvent], Tuple[bool, dict]]] = ...
   240	    dash_capstyle: Optional[Literal['butt', 'round', 'projecting']] = ...
   241	    dash_jointstyle: Optional[Literal['miter', 'round', 'bevel']] = ...
   242	    dashes: Optional[[Sequence[float], Tuple[None, None]]] = ...
   243	    drawstyle: Literal['default', 'steps', 'steps-pre', 'steps-mid', 'steps-post'] = ...
   244	    ds: Literal['default', 'steps', 'steps-pre', 'steps-mid', 'steps-post'] = ...
   245	    figure: Optional[Figure] = ...
   246	    fillstyle: Literal['full', 'left', 'right', 'bottom', 'top', 'none'] = ...

Generalize many types to Sequence

There are many types that are unions of List and np.ndarray and Series. These should probably all be transformed to use Sequence instead (which also would cover legitimate uses of Tuple).
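A quick runtime check of what Sequence covers (note: if I recall correctly, np.ndarray and pd.Series are not registered as Sequence subtypes, so they would still need to appear alongside Sequence in the unions):

```python
from collections.abc import Sequence

# list and tuple both count as Sequence, so a Sequence annotation also
# covers the legitimate Tuple uses mentioned above.
assert isinstance([1, 2, 3], Sequence)
assert isinstance((1, 2, 3), Sequence)
# an iterator is not a Sequence
assert not isinstance(iter([1, 2]), Sequence)
```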

Problems with dtypes

There is something about numpy dtypes and stubs that I don't understand that is keeping me from fixing some stubs. I hope someone can correct me.

After extending the type stubs for DataFrame's __init__ and astype as follows:

class DataFrame:
    def __init__(
        self,
        data: Optional[Union[_ListLike, DataFrame, Dict[_str, _np.ndarray]]] = ...,
        columns: Optional[_ListLike] = ...,
        index: Optional[_ListLike] = ...,
        dtype: Optional[_np.dtype] = ...,
    ): ...
...
    def astype(self, dtype: Union[_str, Dict[str, _np.dtype]], copy: bool=True, errors: _ErrorType = 'raise') -> DataFrame: ...

I have the following which does not type-check properly:

    query_df = pd.DataFrame(
        columns=[
            TEMPERATURE_COL,
            OD_COL,
            "od_log",
            "media",
            "gate",
            "input",
            "mean_log_gfp_live",
            "mean_log_gfp_",
        ],
        dtype=np.float64,
    )

with the error eval.py:210: error: Argument "dtype" to "DataFrame" has incompatible type "Type[float64]"; expected "Optional[dtype]"

and this also:

    query_df = query_df.astype(dtype={"input": np.str_, "gate": np.str_}, copy=False)

mypy3: Dict entry 0 has incompatible type "str": "Type[str_]"; expected "str": "dtype" and mypy3: Dict entry 1 has incompatible type "str": "Type[str_]"; expected "str": "dtype"

I look at numpy_stubs/__init__.pyi, and it looks like np.float64 and np.str_ are both defined there:

class floating(number, float): ...
class float64(floating): ...
 ...
class str_(dtype, str): ...

but it seems like mypy is seeing the actual values from numpy instead of the values from the numpy stubs.

matplotlib.pyplot.close optional argument

matplotlib.pyplot.close has None as a default argument, however, the stub does not specify None.

https://matplotlib.org/api/_as_gen/matplotlib.pyplot.close.html#matplotlib.pyplot.close

Shouldn't line 217 in https://github.com/predictive-analytics-lab/data-science-types/blob/master/matplotlib-stubs/pyplot.pyi.in
be changed from the first to the second?

def close(fig: Union[Figure, Literal["all"]]) -> None: ...
def close(fig: Union[Figure, Literal["all"], None]) -> None: ...

Missing Pandas types: dataframe.index.names, columns.names, read_*, assignments, starts with

First of all, thanks so much for doing this typing library. I use nptyping and we were trying your data-science-types. Here are the types that are missing. I'm no .pyi expert, but happy to help. In our project, here is what is not working but should all exist, judging by the .pyi files:

I can see why the errors occur: index is typed as just an array, but it actually has the name property.

Pandas.DataFrame.index.name
Pandas.DataFrame.columns.name
Pandas.read_hdf, read_html, read_excel, to_hdf
Pandas dataframe can't accept assignment
Pandas.columns can't be assigned
Pandas.dropna missing
Pandas.to_replace missing
Pandas.replace
Pandas.startwith
Pandas.string.startswith
Pandas dataframe cannot be used as a left operand

Numpy problems

Numpy.ones_like
Numpy.einsum
numpy.array doesn't handle being passed a dataframe as input (which works at runtime, btw)
numpy.any not available

Would you consider removing your numpy types and letting nptyping handle that? There does not seem to be a good way to make conflicting .pyi files interoperate.

add iterable to possible input types of pandas.concat

Hi,
in my opinion it's universally considered a best practice (and pythonic) to save memory by using a generator rather than a list, e.g. when concatenating dataframes. Sadly, for now that raises an error in mypy, since concat only accepts Union[Sequence[DataFrame], Mapping[str, DataFrame]]. Notice that generators are not considered sequences.

I would very much appreciate it if somebody could add iterables to the accepted input types of pandas.concat.

Thank you for your time.
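For reference, a generator satisfies Iterable but not Sequence, which is exactly why the current annotation rejects the idiom:

```python
from collections.abc import Iterable, Sequence

frames = (x * x for x in range(3))  # stand-in for a generator of DataFrames
assert isinstance(frames, Iterable)
assert not isinstance(frames, Sequence)
```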

DataFrame.reset_index does not allow explicit inplace=False

Looks like the stubs only allow passing inplace explicitly when it is True.

    @overload
    def reset_index(self, drop: bool = ...) -> DataFrame: ...
    @overload
    def reset_index(self, inplace: Literal[True], drop: bool = ...) -> None: ...

inplace=False is indeed the default behavior, so there is no need to specify it, but it should still be allowed.
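One way to express this is to admit inplace: Literal[False] in the DataFrame-returning overload. A runnable sketch with a toy stand-in class (not the actual stub):

```python
from typing import Literal, Optional, overload

class DataFrame:  # toy stand-in, mirroring only the inplace behavior
    @overload
    def reset_index(self, drop: bool = ..., inplace: Literal[False] = ...) -> "DataFrame": ...
    @overload
    def reset_index(self, drop: bool = ..., *, inplace: Literal[True]) -> None: ...
    def reset_index(self, drop: bool = False, inplace: bool = False) -> Optional["DataFrame"]:
        # toy runtime behavior mirroring pandas: None when inplace=True
        return None if inplace else self

df = DataFrame()
assert df.reset_index(inplace=False) is df  # explicit False now accepted
assert df.reset_index(inplace=True) is None
```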

slicing with pandas .loc

Hi! Cool project, and great to finally be able to not just ignore missing imports in mypy.ini!

I've just been getting started today trying out the library with some pre-existing code that heavily uses pandas. I know it's a work-in-progress so was not too surprised to get a few errors.

Some of these were just missing bits of functionality that would, I think, be straightforward additions - things like pd.date_range, pd.to_datetime, pd.tseries and the like. I'm hoping to find some time to contribute, since I see you encourage it. :)

I was a bit less sure about the results I got for this pattern: df.loc[:, "column_name"] = , for which mypy threw this:

error: Invalid index type "Tuple[slice, str]" for "_LocIndexerFrame"; expected type "Tuple[Union[str, str_], Union[str, str_]]"

This was solvable with a minor refactor, but I had to just type: ignore this one:

error: No overload variant of "__getitem__" of "_LocIndexerFrame" matches argument type "slice"

which appeared when doing df.loc["2018-10":, "column_name"]/df.loc[datetime_object:, "column_name"] (using the date-time index slicing functionality)

I wondered whether supporting slices is unfeasible or something you'd hope to include? .loc's behaviour is pretty complex so I appreciate type-annotating fully would be painful!

A related issue was this mypy error:

error: Invalid index type "Tuple[Series[bool], str]" for "_LocIndexerFrame"; expected type "Tuple[Union[Union[Series[bool], ndarray[bool_], List[bool]], List[str]], Union[Union[Series[bool], ndarray[bool_], List[bool]], List[str]]]"

This one comes about from df.loc[boolean_series, "column_name"] and my examination of the expected type showed me I could refactor to df.loc[boolean_series, ["column_name"]] to get the same functionality - looks to me as though, as implemented, the type annotations allow you to either pass two collections to .loc or two labels, but not a mixture?

Just want to check what's in-scope for this project before slinging PRs around!

Missing pandas.to_numeric

The pandas stubs are missing pandas.to_numeric.

I would like to do a PR but I'm not really sure where to start or how to write proper type hints for this, as I've only just started learning about python typing for the last few days. Any help would be much appreciated.

Error when typing Numpy arrays within NamedTuples

data-science-types==0.2.12
mypy==0.770
typing==3.6.4

Code to reproduce:

import numpy as np
from typing import List, NamedTuple

class Ok(NamedTuple):
    x: List[int]

class Problem(NamedTuple):
    x: np.ndarray[np.int64]  # TypeError: 'type' object is not subscriptable
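Assuming the TypeError comes from evaluating the subscript at class-creation time, quoting the annotation (or using `from __future__ import annotations`, PEP 563) should work around it, since the annotation is then never evaluated at runtime. Demonstrated with a stand-in class, as np.ndarray is what rejects `[]` at runtime:

```python
from typing import NamedTuple

class FakeArray:  # stand-in for np.ndarray: FakeArray[int] raises TypeError at runtime
    pass

class NoLongerAProblem(NamedTuple):
    x: "FakeArray[int]"  # quoted, so never evaluated at runtime: no TypeError

nt = NoLongerAProblem(x=[1, 2, 3])
assert nt.x == [1, 2, 3]
```

In the original example this would be `x: "np.ndarray[np.int64]"`; mypy still sees the full type through the stubs.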

Module has no attribute "to_datetime"

It seems there's a missing stub for the pandas module (or I at least can't find it). In any case this code:

pd.to_datetime(...)

Throws
error: Module has no attribute "to_datetime"

Tests failing on forking

I forked the repo and ran the tests (./check_all.sh); this resulted in 152 errors found in 4 files. How do I get started?

bug in ndarray: incompatible type error

I think there is a bug in the ndarray type hint. If an np.array has one row containing only integers (int or int64) and another row containing at least one float, the rows infer different element types, so a conflict like "np.ndarray[float] != np.ndarray[int]" occurs.

Reproducing code example:

# bug.py
import numpy as np 
arr = np.array([[4.2, 2, 3.5], [12, 3, 6]])   

and run mypy:

$ mypy bug.py 

Error message:

bug.py:3: error: Argument 1 to "array" has incompatible type "List[object]"; expected "Union[List[bool], List[List[bool]], List[List[List[bool]]], List[List[List[List[bool]]]]]"

NumPy/Python/data-science-types versions information:

1.19.2 / 3.8.5 / 0.2.20

_DtypeSpec type

Should there be a type that captures the type of thing that can be given as a dtype spec? I believe that this is Union[_str, Type[_np.dtype]], but I could be wrong. If we can identify a reasonable type for this, that might make a lot of typing smoother and more consistent.

pyplot savefig type too narrow

In VSCode with pyright, I'm trying to call savefig with a BytesIO object where the fname would be. I'm ending up with:

Argument of type "BytesIO" cannot be assigned to parameter "fname" of type "str | Path" in function "savefig"
  Type "BytesIO" cannot be assigned to type "str | Path"
    "BytesIO" is incompatible with "str"
    "BytesIO" is incompatible with "Path"Pyright (reportGeneralTypeIssues)

Since BytesIO (also I think files opened in wb mode? I'm not sure about that part) is definitely a valid target, I'd like to PR the type in. Would making the fname type Union[str, Path, BytesIO] be sufficient? Or would you prefer more types that could technically fit into the fname slot?

I should add: this is great work. A co-worker of mine and I were looking around for numpy typings and this repo offers such an improvement over the default typing

[pandas] `_AtIndexerFrame` should support indexing by [int, int]

If you build a DataFrame with header=None, the axes are [RangeIndex(start=0, stop=4, step=1), RangeIndex(start=0, stop=9, step=1)], so you can't access elements with df.at[123, 'xyz']; you need to use df.at[123, 456] instead.

According to https://github.com/predictive-analytics-lab/data-science-types/blob/7dab8238df9e93d00be6d683d8efabbdf95fc958/pandas-stubs/core/indexing.pyi#L88, however, _AtIndexerFrame currently only allows the second index to be a "_StrLike".

Ideally this would be a runtime check, but as I think that is not possible with mypy, there should at least not be a false positive.

Add TypedDict to DataFrame class Optional

Since PEP 589 (Python 3.8), we can create a class that inherits from TypedDict and specify which type every key takes in a dictionary, but this is not included in the DataFrame data parameter's union, so mypy raises an error.

I fixed locally by adding a variable:

_TypedDictLike = TypedDict

and passing it to the DataFrame class optional:

data: Optional[Union[_ListLike, DataFrame, Dict[_str, _ListLike], _TypedDictLike]]

I would appreciate it if you could implement this, or I could push a branch with it.

Thank you for your time!

check_all.sh fails when using a project level virtual environment

I am in the process of fleshing out a few pyi files with the definitions from Pandas.

My normal process for python development is to create a virtual environment on the root level of each project (to keep code segregated), like so:

python -m venv venv && . venv/bin/activate && pip install --upgrade pip && pip install -e .[dev]

After updating the pyi files and adding tests, everything looks okay right up to the end of check_all.sh. When it runs the line && mypy tests \, it finds a LOT (> 900 on my machine) of errors from packages in the venv folder. Sample output:

venv/lib/python3.8/site-packages/mypy/typeshed/stdlib/3/typing.pyi:675: error: Return type becomes "Union[bool, Any]" due to an unfollowed import
venv/lib/python3.8/site-packages/mypy/typeshed/stdlib/3/tkinter/commondialog.pyi:7: error: Function is missing a type annotation for one or more arguments
venv/lib/python3.8/site-packages/mypy/typeshed/stdlib/3/tkinter/commondialog.pyi:8: error: Function is missing a type annotation for one or more arguments
venv/lib/python3.8/site-packages/mypy/typeshed/stdlib/3/_thread.pyi:43: error: Function is missing a type annotation for one or more arguments
venv/lib/python3.8/site-packages/packaging/_typing.py:34: error: Statement is unreachable

Similar lines to those continue for many more lines.

I did notice that if I delete no_silence_site_packages = True the errors go away, but I'm not sure of the intention behind that setting, so I didn't want to remove it and cause downstream issues.

Pandas 'SeriesGroupBy' has no method 'apply', 'groups', or 'get_group'

Script used:

import pandas as pd

df: pd.DataFrame = pd.DataFrame([[1, 2], [3, 4]], columns=["a", "b"], index=["c", "d"])

grouped = df.groupby("a")["b"]
grouped_list = grouped.apply(list)

print(df)
print(grouped)
print(grouped_list)
print(grouped.groups)
print(grouped.get_group(1))

DataFrame __init__ complains if data is a dictionary

I'm running this code:

import pandas

d = {"c": [1,2,3], "d": [4,5,6]}
df = pandas.DataFrame(data=d)

I was expecting no errors. However, I get this message:

Argument of type "Dict[str, List[int]]" cannot be assigned to parameter "data" of type "Series[TypeVar('_DType')] | Index[TypeVar('_T')] | ndarray[TypeVar('_DType')] | Sequence[TypeVar('_T_co')] | DataFrame | Dict[_str, Series[TypeVar('_DType')] | Index[TypeVar('_T')] | ndarray[TypeVar('_DType')] | Sequence[TypeVar('_T_co')]] | None" in function "__init__"
  Type "Dict[str, List[int]]" cannot be assigned to type "Series[TypeVar('_DType')] | Index[TypeVar('_T')] | ndarray[TypeVar('_DType')] | Sequence[TypeVar('_T_co')] | DataFrame | Dict[_str, Series[TypeVar('_DType')] | Index[TypeVar('_T')] | ndarray[TypeVar('_DType')] | Sequence[TypeVar('_T_co')]] | None"
    "Dict[str, List[int]]" is incompatible with "Series[TypeVar('_DType')]"
    "Dict[str, List[int]]" is incompatible with "Index[TypeVar('_T')]"
    "Dict[str, List[int]]" is incompatible with "ndarray[TypeVar('_DType')]"
    "Dict[str, List[int]]" is incompatible with "Sequence[TypeVar('_T_co')]"
    "Dict[str, List[int]]" is incompatible with "DataFrame"
    Cannot assign to "None"
      TypeVar "_VT" is invariant

I was expecting Dict[str, List[int]] to be compatible with Dict[_str, Sequence[TypeVar('_T_co')]], which is listed among the possible types of data. Probably I am missing what TypeVar('_T_co') means.
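The likely culprit is invariance: Dict is mutable, so type checkers treat its value type as invariant, and Dict[str, List[int]] is not a subtype of a Dict whose values are a wider union. A read-only Mapping is covariant in its value type, so widening the stub's data parameter to Mapping would accept this call; a runnable sketch (not the actual stub):

```python
from typing import Dict, List, Mapping, Sequence

# Mapping is covariant in its value type (it is read-only), so a
# Dict[str, List[int]] is accepted where Mapping[str, Sequence[int]] is
# expected -- unlike the invariant Dict used in the current stub.
def total_length(data: Mapping[str, Sequence[int]]) -> int:
    return sum(len(v) for v in data.values())

d: Dict[str, List[int]] = {"c": [1, 2, 3], "d": [4, 5, 6]}
assert total_length(d) == 6
```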

gen_pyi.py not in tar ball from PyPI

MWE:

wget https://pypi.io/packages/source/d/data-science-types/data-science-types-0.2.21.tar.gz
tar -xvf data-science-types-0.2.21.tar.gz
tree

and you can see that gen_pyi.py is not there. However, setup.py calls it, so installation fails.

No overload variant of "where"

Is there a workaround for this when using the where function in numpy? error: No overload variant of "where" matches argument types "Any", "int", "int"

Missing Union with primitive (_DType) types in numpy functions

Many numpy functions accept not only an _ArrayLike (List or ndarray) argument but also a simple primitive value.
The type hints currently often disallow this.

Two examples of code that results in errors when checked with mypy:

  • myarr = np.array(1.0) -> No overload variant of "array" matches argument type "float"
  • np.append(myarr, 1.0) -> Argument 2 to "append" has incompatible type "float"; expected "Union[Array[Any], Sequence[Any]]"

This could be fixed by using a Union[_ArrayLike, _DType] instead of just _ArrayLike.
Because this might be done for a lot of functions, I have refrained from partially changing the code where I know numpy accepts primitive types and submitting a pull request. I think this is better implemented on a general scale with more overview than I currently have.
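A runnable illustration of the suggested widening, with toy names (not the repo's code):

```python
from typing import List, TypeVar, Union

_DType = TypeVar("_DType", int, float)
_ArrayLike = Union[List[int], List[float]]  # simplified stand-in

def as_array(data: Union[_ArrayLike, _DType]) -> list:
    # toy runtime behavior: wrap bare scalars, pass array-likes through
    return data if isinstance(data, list) else [data]

assert as_array(1.0) == [1.0]            # scalar now accepted
assert as_array([1.0, 2.0]) == [1.0, 2.0]
```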
