data-science-types's Introduction

Mypy type stubs for NumPy, pandas, and Matplotlib

Join the chat at https://gitter.im/data-science-types/community

⚠️ this project has mostly stopped development ⚠️

The pandas team and the numpy team are both in the process of integrating type stubs into their codebases, and we don't see the point of competing with them.


This is a PEP-561-compliant stub-only package which provides type information for matplotlib, numpy, and pandas. Once it is installed, the mypy type checker (as well as pytype and PyCharm) can recognize the types in these packages.

NOTE: This is a work in progress

Many functions are already typed, but a lot is still missing (NumPy and pandas are huge libraries). Chances are, you will see a message from Mypy claiming that a function does not exist when it does exist. If you encounter missing functions, we would be delighted for you to send a PR. If you are unsure of how to type a function, we can discuss it.
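
For orientation, a stub is just the function's signature with "..." as the body. A hypothetical entry in numpy-stubs/__init__.pyi might look like the following (illustrative only; it assumes the ndarray class and the _DType type variable already defined in the stubs):

from typing import Optional

def cumsum(a: ndarray[_DType], axis: Optional[int] = ...) -> ndarray[_DType]: ...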

Installing

You can get this package from PyPI:

pip install data-science-types

To get the most up-to-date version, install it directly from GitHub:

pip install git+https://github.com/predictive-analytics-lab/data-science-types

Or clone the repository somewhere and run pip install -e . from inside it.

Examples

These are the kinds of things that can be checked:

Array creation

import numpy as np

arr1: np.ndarray[np.int64] = np.array([3, 7, 39, -3])  # OK
arr2: np.ndarray[np.int32] = np.array([3, 7, 39, -3])  # Type error
arr3: np.ndarray[np.int32] = np.array([3, 7, 39, -3], dtype=np.int32)  # OK
arr4: np.ndarray[float] = np.array([3, 7, 39, -3], dtype=float)  # Type error: the type parameter of ndarray cannot be just "float"
arr5: np.ndarray[np.float64] = np.array([3, 7, 39, -3], dtype=float)  # OK

Operations

import numpy as np

arr1: np.ndarray[np.int64] = np.array([3, 7, 39, -3])
arr2: np.ndarray[np.int64] = np.array([4, 12, 9, -1])

result1: np.ndarray[np.int64] = np.divide(arr1, arr2)  # Type error
result2: np.ndarray[np.float64] = np.divide(arr1, arr2)  # OK

compare: np.ndarray[np.bool_] = (arr1 == arr2)

Reductions

import numpy as np

arr: np.ndarray[np.float64] = np.array([[1.3, 0.7], [-43.0, 5.6]])

sum1: int = np.sum(arr)  # Type error
sum2: np.float64 = np.sum(arr)  # OK
sum3: float = np.sum(arr)  # Also OK: np.float64 is a subclass of float
sum4: np.ndarray[np.float64] = np.sum(arr, axis=0)  # OK

# the same works with np.max, np.min and np.prod

Philosophy

The goal is not to recreate the APIs exactly. The main goal is to have useful checks on our code. Often the actual APIs in the libraries are more permissive than the type signatures in our stubs, but this is (usually) a feature and not a bug.

Contributing

We always welcome contributions. All pull requests are subject to CI checks. We check for compliance with Mypy and that the file formatting conforms to our Black specification.

You can install these dev dependencies via

pip install -e '.[dev]'

This also installs NumPy, pandas, and Matplotlib so that the tests can be run.

Running CI locally (recommended)

We include a script for running the CI checks that are triggered when a PR is opened. To test these out locally, you need to install the type stubs in your environment. Typically, you would do this with

pip install -e .

Then use the check_all.sh script to run all tests:

./check_all.sh

Below we describe how to run the various checks individually, but check_all.sh should be easier to use.

Checking compliance with Mypy

The settings for Mypy are specified in the mypy.ini file in the repository. Just running

mypy tests

from the base directory should take these settings into account. We enforce 0 Mypy errors.

Formatting with Black

We use Black to format the stub files. First, install black and then run

black .

from the base directory.

Pytest

python -m pytest -vv tests/

Flake8

flake8 *-stubs

License

Apache 2.0

data-science-types's People

Contributors

adimyth, bradley-butcher, clouds56, dnaaun, dvarrazzo, edwardjross, eganjs, fabiencelier, hvlot, ickc, jeremiq, jmargeta, krassowski, loganamcnichols, maarten-vd-sande, melentye, mylesbartlett, nicoddemus, olliethomas, patriktrelsmo-izettle, pmav99, rpgoldman, sarar-1, skydetulliov, sukuldhoka, thecleric, tmke8, wwuck, zhsimon, zsimoncentene


data-science-types's Issues

Tests failing on forking

I forked the repo and ran the tests with ./check_all.sh, which resulted in 152 errors found in 4 files. How do I get started?

Problems with dtypes

There is something about numpy dtypes and the stubs that I don't understand, and it is keeping me from fixing some of them. I hope someone can correct me.

After extending the type stubs for DataFrame's __init__ and astype as follows:

class DataFrame:
    def __init__(
        self,
        data: Optional[Union[_ListLike, DataFrame, Dict[_str, _np.ndarray]]] = ...,
        columns: Optional[_ListLike] = ...,
        index: Optional[_ListLike] = ...,
        dtype: Optional[_np.dtype] = ...,
    ): ...
...
    def astype(self, dtype: Union[_str, Dict[str, _np.dtype]], copy: bool=True, errors: _ErrorType = 'raise') -> DataFrame: ...

I have the following which does not type-check properly:

    query_df = pd.DataFrame(
        columns=[
            TEMPERATURE_COL,
            OD_COL,
            "od_log",
            "media",
            "gate",
            "input",
            "mean_log_gfp_live",
            "mean_log_gfp_",
        ],
        dtype=np.float64,
    )

with the error eval.py:210: error: Argument "dtype" to "DataFrame" has incompatible type "Type[float64]"; expected "Optional[dtype]"

and this also:

    query_df = query_df.astype(dtype={"input": np.str_, "gate": np.str_}, copy=False)

mypy3: Dict entry 0 has incompatible type "str": "Type[str_]"; expected str?: "dtype" and mypy3: Dict entry 1 has incompatible type "str": "Type[str_]"; expected str?: "dtype"

I look at numpy_stubs/__init__.pyi, and it looks like np.float64 and np.str_ are both defined there:

class floating(number, float): ...
class float64(floating): ...
 ...
class str_(dtype, str): ...

but it seems like mypy is seeing the actual values from numpy instead of the values from the numpy stubs.
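
One possible widening, borrowing the Union[_str, Type[_np.dtype]] spelling that the "_DtypeSpec type" issue further down proposes (a sketch; whether Type[_np.dtype] really covers Type[float64] depends on how the stubs declare the scalar classes):

class DataFrame:
    def __init__(
        self,
        data: Optional[Union[_ListLike, DataFrame, Dict[_str, _np.ndarray]]] = ...,
        columns: Optional[_ListLike] = ...,
        index: Optional[_ListLike] = ...,
        dtype: Optional[Union[_str, Type[_np.dtype]]] = ...,  # widened from Optional[_np.dtype]
    ): ...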

slicing with pandas .loc

Hi! Cool project, and great to finally be able to not just ignore missing imports in mypy.ini!

I've just been getting started today trying out the library with some pre-existing code that heavily uses pandas. I know it's a work-in-progress so was not too surprised to get a few errors.

Some of these were just missing bits of functionality that would, I think, be straightforward additions - things like pd.date_range, pd.to_datetime, pd.tseries and the like. I'm hoping to find some time to contribute, since I see you encourage it. :)

I was a bit less sure about the results I got for this pattern: df.loc[:, "column_name"] = , for which mypy threw this:

error: Invalid index type "Tuple[slice, str]" for "_LocIndexerFrame"; expected type "Tuple[Union[str, str_], Union[str, str_]]"

This was solvable with a minor refactor, but I had to just type: ignore this one:

error: No overload variant of "__getitem__" of "_LocIndexerFrame" matches argument type "slice"

which appeared when doing df.loc["2018-10":, "column_name"]/df.loc[datetime_object:, "column_name"] (using the date-time index slicing functionality)

I wondered whether supporting slices is infeasible or something you'd hope to include? .loc's behaviour is pretty complex, so I appreciate that type-annotating it fully would be painful!

A related issue was this mypy error:

error: Invalid index type "Tuple[Series[bool], str]" for "_LocIndexerFrame"; expected type "Tuple[Union[Union[Series[bool], ndarray[bool_], List[bool]], List[str]], Union[Union[Series[bool], ndarray[bool_], List[bool]], List[str]]]"

This one comes about from df.loc[boolean_series, "column_name"], and my examination of the expected type showed I could refactor to df.loc[boolean_series, ["column_name"]] to get the same functionality. It looks as though, as implemented, the type annotations let you pass either two collections or two labels to .loc, but not a mixture?

Just want to check what's in-scope for this project before slinging PRs around!
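
For concreteness, a minimal sketch of the kind of extra signatures that would cover the patterns above, assuming the _LocIndexerFrame, _StrLike, and Series names from the existing stubs (the Any for the assigned value is a simplification):

# hypothetical additions inside _LocIndexerFrame, alongside the existing overloads
@overload
def __getitem__(self, idx: Tuple[slice, _StrLike]) -> Series: ...
def __setitem__(self, idx: Tuple[slice, _StrLike], value: Any) -> None: ...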

matplotlib.pyplot.close optional argument

matplotlib.pyplot.close has None as its default argument; however, the stub does not allow None.

https://matplotlib.org/api/_as_gen/matplotlib.pyplot.close.html#matplotlib.pyplot.close

Shouldn't line 217 in https://github.com/predictive-analytics-lab/data-science-types/blob/master/matplotlib-stubs/pyplot.pyi.in
be changed from the first to the second?

def close(fig: Union[Figure, Literal["all"]]) -> None: ...
def close(fig: Union[Figure, Literal["all"], None]) -> None: ...

Missing pandas.to_numeric

The pandas stubs are missing pandas.to_numeric.

I would like to do a PR, but I'm not really sure where to start or how to write proper type hints for this, as I've only started learning about Python typing in the last few days. Any help would be much appreciated.
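
For what it's worth, a rough sketch of what the stub could look like, reusing the _ListLike and _ErrorType aliases that appear elsewhere in these stubs (the return types are a simplification; the real function can also return scalars):

@overload
def to_numeric(arg: Series, errors: _ErrorType = ..., downcast: Optional[_str] = ...) -> Series: ...
@overload
def to_numeric(arg: _ListLike, errors: _ErrorType = ..., downcast: Optional[_str] = ...) -> _np.ndarray: ...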

False positives on `np.empty`

I have these two calls:

    master_df[DF_VAR_COLUMN] = np.empty(shape=master_df.shape[0], dtype=str)
    master_df[DF_VAR_IDX_COLUMN] = np.empty(shape=master_df.shape[0], dtype=int)

With the numpy stubs in place, mypy does not like this, claiming that:

eval.py:809: error: Value of type variable "_DType" of "empty" cannot be "str"
eval.py:810: error: Value of type variable "_DType" of "empty" cannot be "int"

But these are legitimate values, AFAICT. I'll try to see about a PR.

Missing Pandas types: dataframe.index.names, columns.names, read_*, assignments, starts with

First of all, thanks so much for doing this typing library. I use nptyping, and we were trying out your data-science-types. Here are the types that are missing. I'm no .pyi expert, but I'm happy to help, so here is what is not working in our project but should all exist, judging from the .pyi files.

I can see why the errors occur: index is typed as just an array, but it actually has the name property.

Pandas.DataFrame.index.name
Pandas.DataFrame.columns.name
Panda.read_hdf, read_html, read_excel, to_hdf
Pandas dataframe can't accept assignment
Pandas.columns can't be assigned
Pandas.dropna missing
Pandas.to_replace missing
Pandas.replace
Pandas.startwith
Pandas.string.startswith
Pandas dataframe cannot be used as a left operand

Numpy problems

Numpy.ones_like=
Numpy.einsum
numpy.array doesn't handle the pass of a dataframe as an input (which works btw)
numpy.any not available

Would you consider removing your numpy types and letting nptyping handle that? There does not seem to be a good way to make conflicting .pyi files interoperate.

Dataframe.reset_index only allows inplace=True

Looks like the stubs only allow an explicit inplace=True.

    @overload
    def reset_index(self, drop: bool = ...) -> DataFrame: ...
    @overload
    def reset_index(self, inplace: Literal[True], drop: bool = ...) -> None: ...

inplace=False is indeed the default behavior, and there is no need to specify it, but it should still be allowed.
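
A sketch of how the first overload could be widened so that an explicit inplace=False type-checks (a common stub pattern; treat the parameter order as an assumption):

@overload
def reset_index(self, drop: bool = ..., inplace: Literal[False] = ...) -> DataFrame: ...
@overload
def reset_index(self, inplace: Literal[True], drop: bool = ...) -> None: ...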

Numpy has no method 'isfinite'

Script used:

import numpy as np

arr: np.ndarray[np.float32] = np.array([0, 1, np.inf], dtype=np.float32)

print(np.isfinite(arr))
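
Presumably the fix is a one-line stub along these lines (a sketch assuming the ndarray and bool_ classes declared in numpy-stubs; the real numpy.isfinite also accepts scalars):

def isfinite(x: ndarray[Any]) -> ndarray[bool_]: ...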

gen_pyi.py not in tarball from PyPI

MWE:

wget https://pypi.io/packages/source/d/data-science-types/data-science-types-0.2.21.tar.gz
tar -xvf data-science-types-0.2.21.tar.gz
tree

and you can see that gen_pyi.py is not there. However, setup.py calls it, so installation fails.

[pandas] `_AtIndexerFrame` should support indexing by [int, int]

If you build a DataFrame with header=None, the axes are [RangeIndex(start=0, stop=4, step=1), RangeIndex(start=0, stop=9, step=1)], so you can't access elements with df.at[123, 'xyz']; you need to use df.at[123, 456] instead.

According to https://github.com/predictive-analytics-lab/data-science-types/blob/7dab8238df9e93d00be6d683d8efabbdf95fc958/pandas-stubs/core/indexing.pyi#L88, however, _AtIndexerFrame currently only allows the second index to be a _StrLike.

Ideally this would be a runtime check, but as I think that is not possible with mypy, there should at least not be a false positive.

DataFrame __init__ complains if data is a dictionary

I'm running this code:

import pandas

d = {"c": [1,2,3], "d": [4,5,6]}
df = pandas.DataFrame(data=d)

I was expecting no errors. However, I get this message:

Argument of type "Dict[str, List[int]]" cannot be assigned to parameter "data" of type "Series[TypeVar('_DType')] | Index[TypeVar('_T')] | ndarray[TypeVar('_DType')] | Sequence[TypeVar('_T_co')] | DataFrame | Dict[_str, Series[TypeVar('_DType')] | Index[TypeVar('_T')] | ndarray[TypeVar('_DType')] | Sequence[TypeVar('_T_co')]] | None" in function "__init__"
  Type "Dict[str, List[int]]" cannot be assigned to type "Series[TypeVar('_DType')] | Index[TypeVar('_T')] | ndarray[TypeVar('_DType')] | Sequence[TypeVar('_T_co')] | DataFrame | Dict[_str, Series[TypeVar('_DType')] | Index[TypeVar('_T')] | ndarray[TypeVar('_DType')] | Sequence[TypeVar('_T_co')]] | None"
    "Dict[str, List[int]]" is incompatible with "Series[TypeVar('_DType')]"
    "Dict[str, List[int]]" is incompatible with "Index[TypeVar('_T')]"
    "Dict[str, List[int]]" is incompatible with "ndarray[TypeVar('_DType')]"
    "Dict[str, List[int]]" is incompatible with "Sequence[TypeVar('_T_co')]"
    "Dict[str, List[int]]" is incompatible with "DataFrame"
    Cannot assign to "None"
      TypeVar "_VT" is invariant

I was expecting that Dict[str, List[int]] is compatible with Dict[_str, Sequence[TypeVar('_T_co')]] which is listed in the possible types of data. Probably I am missing what TypeVar('_T_co') means.
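
The last point is likely the key (my reading, not stated in the thread): Dict is invariant in its value type, so Dict[str, List[int]] is not a Dict[str, Sequence[int]], even though List[int] is a Sequence[int]. A minimal demonstration:

from typing import Dict, List, Sequence

d: Dict[str, List[int]] = {"c": [1, 2, 3], "d": [4, 5, 6]}
s: Dict[str, Sequence[int]] = d  # error: Dict is invariant in its value type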

Module has no attribute "to_datetime"

It seems there's a missing stub for the pandas module (or at least I can't find it). In any case, this code:

pd.to_datetime(...)

Throws
error: Module has no attribute "to_datetime"

Missing commas in generated pyplot.pyi

I think I've found a bug in the pyplot.pyi generation scripts.

I cloned this repository's master branch and set up a virtualenv from Python 3.7.3 (virtualenv -p python3 .venv).

Running pip install -e . inside the virtual environment, generated a pyplot.pyi with missing commas at line 229 and below.

   222	def plot(
   223	    x: Data,
   224	    y: Data,
   225	    fmt: Optional[str] = ...,
   226	    *,
   227	    scalex: bool = ...,
   228	    scaley: bool = ...,
   229	    agg_filter: Callable[[_NumericArray, int], _NumericArray] = ...    # <-- comma missing here and at end of lines below
   230	    alpha: Optional[float] = ...
   231	    animated: Optional[bool] = ...
   232	    antialiased: Optional[bool] = ...
   233	    aa: Optional[bool] = ..., #alias of antialiased
   234	    clip_box: Optional[Bbox] = ...
   235	    clip_on: Optional[bool] = ...
   236	    clip_path: Optional[Callable[[Path, Transform], None]] = ...
   237	    color: Optional[str] = ...
   238	    c: Optional[str] = ...
   239	    contains: Optional[Callable[[Artist, MouseEvent], Tuple[bool, dict]]] = ...
   240	    dash_capstyle: Optional[Literal['butt', 'round', 'projecting']] = ...
   241	    dash_jointstyle: Optional[Literal['miter', 'round', 'bevel']] = ...
   242	    dashes: Optional[[Sequence[float], Tuple[None, None]]] = ...
   243	    drawstyle: Literal['default', 'steps', 'steps-pre', 'steps-mid', 'steps-post'] = ...
   244	    ds: Literal['default', 'steps', 'steps-pre', 'steps-mid', 'steps-post'] = ...
   245	    figure: Optional[Figure] = ...
   246	    fillstyle: Literal['full', 'left', 'right', 'bottom', 'top', 'none'] = ...

test_frame_iloc fails on Pandas 1.2

tests/pandas_test.py line 92 fails on Pandas 1.2

Extracting the relevant code

import pandas as pd
df: pd.DataFrame = pd.DataFrame(
    [[1.0, 2.0], [4.0, 5.0], [7.0, 8.0]],
    index=["cobra", "viper", "sidewinder"],
    columns=["max_speed", "shield"],
)
s: "pd.Series[float]" = df["shield"].copy()
df.iloc[0] = s

Results in

ValueError: could not broadcast input array from shape (3) into shape (2)

This runs fine on Pandas 1.1.5

add iterable to possible input types of pandas.concat

Hi,
in my opinion it's widely considered a best practice (and Pythonic) to save memory by using a generator rather than a list, e.g. when concatenating dataframes. Sadly, for now that raises an error in mypy, since concat only accepts Union[Sequence[DataFrame], Mapping[str, DataFrame]]. Note that generators are not considered Sequences.

I would very much appreciate it if somebody could add iterables to the accepted input types of pandas.concat.

Thank you for your time.
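
A sketch of the proposed widening, with all other parameters omitted (Iterable covers generators as well as lists and tuples):

from typing import Iterable, Mapping, Union

def concat(objs: Union[Iterable[DataFrame], Mapping[_str, DataFrame]]) -> DataFrame: ...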

check_all.sh fails when using a project level virtual environment

I am in the process of fleshing out a few pyi files with the definitions from Pandas.

My normal process for python development is to create a virtual environment on the root level of each project (to keep code segregated), like so:

python -m venv venv && . venv/bin/activate && pip install --upgrade pip && pip install -e .[dev]

After updating the pyi files and adding tests, everything looks okay right up to the end of check_all.sh. When it runs the line && mypy tests \, mypy finds a LOT (> 900 on my machine) of errors from packages in the venv folder. Sample output:

venv/lib/python3.8/site-packages/mypy/typeshed/stdlib/3/typing.pyi:675: error: Return type becomes "Union[bool, Any]" due to an unfollowed import
venv/lib/python3.8/site-packages/mypy/typeshed/stdlib/3/tkinter/commondialog.pyi:7: error: Function is missing a type annotation for one or more arguments
venv/lib/python3.8/site-packages/mypy/typeshed/stdlib/3/tkinter/commondialog.pyi:8: error: Function is missing a type annotation for one or more arguments
venv/lib/python3.8/site-packages/mypy/typeshed/stdlib/3/_thread.pyi:43: error: Function is missing a type annotation for one or more arguments
venv/lib/python3.8/site-packages/packaging/_typing.py:34: error: Statement is unreachable

Similar lines to those continue for many more lines.

I did notice that if I delete no_silence_site_packages = True this goes away, but I'm not sure of the intention behind that setting, so I didn't want to remove it and cause downstream issues.

Missing Union with primitive (_DType) types in numpy functions

Many numpy functions accept not only an _ArrayLike (List or ndarray) argument but also a simple primitive value.
The type hints currently often disallow this.

Two examples of code that results in errors when checked with mypy:

  • myarr = np.array(1.0) -> No overload variant of "array" matches argument type "float"
  • np.append(myarr, 1.0) -> Argument 2 to "append" has incompatible type "float"; expected "Union[Array[Any], Sequence[Any]]"

This could be fixed by using Union[_ArrayLike, _DType] instead of just _ArrayLike.
Because this might need to be done for a lot of functions, I have refrained from piecemeal changes to the places where I know numpy accepts primitive types and from submitting a pull request; I think this is better implemented on a general scale, with more overview than I currently have.
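
To illustrate the shape of the change on a single function, the append stub might go from the first line to the second (a sketch using the aliases named above):

def append(arr: _ArrayLike, values: _ArrayLike, axis: Optional[int] = ...) -> ndarray: ...
def append(arr: _ArrayLike, values: Union[_ArrayLike, _DType], axis: Optional[int] = ...) -> ndarray: ...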

Error when typing Numpy arrays within NamedTuples

data-science-types==0.2.12
mypy==0.770
typing==3.6.4

Code to reproduce:

import numpy as np
from typing import List, NamedTuple

class Ok(NamedTuple):
    x: List[int]

class Problem(NamedTuple):
    x: np.ndarray[np.int64]  # TypeError: 'type' object is not subscriptable
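
A common workaround (my suggestion, not from the original report) is to defer annotation evaluation via PEP 563, so the subscript is never evaluated at runtime while mypy still checks it:

from __future__ import annotations  # annotations stay as strings at runtime

import numpy as np
from typing import NamedTuple

class NoLongerAProblem(NamedTuple):
    x: np.ndarray[np.int64]  # no runtime TypeError; mypy still checks this

Quoting the annotation, x: "np.ndarray[np.int64]", has the same effect without the future import.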

pyplot savefig type too narrow

In VSCode with pyright, I'm trying to call savefig with a BytesIO object where the fname would be. I'm ending up with:

Argument of type "BytesIO" cannot be assigned to parameter "fname" of type "str | Path" in function "savefig"
  Type "BytesIO" cannot be assigned to type "str | Path"
    "BytesIO" is incompatible with "str"
    "BytesIO" is incompatible with "Path"Pyright (reportGeneralTypeIssues)

Since BytesIO (also I think files opened in wb mode? I'm not sure about that part) is definitely a valid target, I'd like to PR the type in. Would making the fname type Union[str, Path, BytesIO] be sufficient? Or would you prefer more types that could technically fit into the fname slot?

I should add: this is great work. A co-worker of mine and I were looking around for numpy typings, and this repo offers such an improvement over the default typing.
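
A sketch of the widened signature (fname only; IO[bytes] is my suggestion, and since BytesIO is itself an IO[bytes], it would also cover files opened in binary mode, which the report mentions):

from pathlib import Path
from typing import IO, Union

def savefig(fname: Union[str, Path, IO[bytes]]) -> None: ...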

Generalize many types to Sequence

There are many types that are unions of List and np.ndarray and Series. These should probably all be transformed to use Sequence instead (which also would cover legitimate uses of Tuple).

No overload variant of "where"

Is there a workaround for this when using the where function in numpy? error: No overload variant of "where" matches argument types "Any", "int", "int"

bug in ndarray: incompatible type error

I think there is a bug in the ndarray type hint.
If an np.array has one row containing only integers (type int or int64) and another row with at least one float, it produces an error: the rows have different element types, so a conflict like "np.ndarray[float] != np.ndarray[int]" occurs.

Reproducing code example:

# bug.py
import numpy as np 
arr = np.array([[4.2, 2, 3.5], [12, 3, 6]])   

and run mypy:

$ mypy bug.py 

Error message:

bug.py:3: error: Argument 1 to "array" has incompatible type "List[object]"; expected "Union[List[bool], List[List[bool]], List[List[List[bool]]], List[List[List[List[bool]]]]]"

NumPy/Python/data-science-types versions information:

1.19.2 / 3.8.5 / 0.2.20

_DtypeSpec type

Should there be a type that captures the type of thing that can be given as a dtype spec? I believe that this is Union[_str, Type[_np.dtype]], but I could be wrong. If we can identify a reasonable type for this, that might make a lot of typing smoother and more consistent.
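
A sketch of what that could look like, applied to the astype signature quoted in the dtypes issue above (the exact Union is an assumption):

_DtypeSpec = Union[_str, Type[_np.dtype]]

def astype(self, dtype: Union[_DtypeSpec, Dict[_str, _DtypeSpec]], copy: bool = ..., errors: _ErrorType = ...) -> DataFrame: ...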

Add TypedDict to DataFrame class Optional

Since PEP 589 in Python 3.8, we can create a class that inherits from TypedDict and specify which type every key of a dictionary takes, but TypedDicts are not included in the DataFrame data parameter's Optional union, so mypy raises an error.

I fixed this locally by adding an alias:

_TypedDictLike = TypedDict

and adding it to the DataFrame data parameter's union:

data: Optional[Union[_ListLike, DataFrame, Dict[_str, _ListLike], _TypedDictLike]]

I would appreciate it if you could implement this, or I could push a branch with it.

Thank you for your time!

Error on merge function of pandas data frames

Hi,
I'm using pyright in combination with the stubs provided by this repo. I'm getting a problem when I check the following code:

import pandas as pd

d = {'a': [1, 2]}
df = pd.DataFrame(data=d)

df_merged: pd.DataFrame = df.merge(right=df)

By running pyright against this code I get the following errors:

 3:19 - error: Argument of type "Dict[_str, Series[TypeVar('_DType')] | Index[TypeVar('_T')] | ndarray[TypeVar('_DType')] | Sequence[int]]" cannot be assigned to parameter "data" of type "Series[TypeVar('_DType')] | Index[TypeVar('_T')] | ndarray[TypeVar('_DType')] | Sequence[TypeVar('_T_co')] | DataFrame | Dict[_str, Series[TypeVar('_DType')] | Index[TypeVar('_T')] | ndarray[TypeVar('_DType')] | Sequence[TypeVar('_T_co')]] | None" in function "__init__"
  Type "Dict[_str, Series[TypeVar('_DType')] | Index[TypeVar('_T')] | ndarray[TypeVar('_DType')] | Sequence[int]]" cannot be assigned to type "Series[TypeVar('_DType')] | Index[TypeVar('_T')] | ndarray[TypeVar('_DType')] | Sequence[TypeVar('_T_co')] | DataFrame | Dict[_str, Series[TypeVar('_DType')] | Index[TypeVar('_T')] | ndarray[TypeVar('_DType')] | Sequence[TypeVar('_T_co')]] | None"
    "Dict[_str, Series[TypeVar('_DType')] | Index[TypeVar('_T')] | ndarray[TypeVar('_DType')] | Sequence[int]]" is incompatible with "Series[TypeVar('_DType')]"
    "Dict[_str, Series[TypeVar('_DType')] | Index[TypeVar('_T')] | ndarray[TypeVar('_DType')] | Sequence[int]]" is incompatible with "Index[TypeVar('_T')]"
    "Dict[_str, Series[TypeVar('_DType')] | Index[TypeVar('_T')] | ndarray[TypeVar('_DType')] | Sequence[int]]" is incompatible with "ndarray[TypeVar('_DType')]"
    "Dict[_str, Series[TypeVar('_DType')] | Index[TypeVar('_T')] | ndarray[TypeVar('_DType')] | Sequence[int]]" is incompatible with "Sequence[TypeVar('_T_co')]"
    "Dict[_str, Series[TypeVar('_DType')] | Index[TypeVar('_T')] | ndarray[TypeVar('_DType')] | Sequence[int]]" is incompatible with "DataFrame"
    Cannot assign to "None"
      TypeVar "_VT" is invariant
... (reportGeneralTypeIssues)
  4:13 - error: "df.merge(right=df)" has type "Series[Unknown]" and is not callable (reportGeneralTypeIssues)
  4:1 - error: Type of "df_merged" is unknown (reportUnknownVariableType)

Ignoring error 3:19 (which is not clear to me either), I would like to focus on error 4:13. Why is the merge function said to have type Series[Unknown]?

To Reproduce
install data-science-types stubs:
pip install data-science-types

install pyright:
sudo npm install -g pyright

run the file with the code above:
pyright test.py

I've already posted the issue on pyright and was advised to submit an issue here, because pyright is behaving in accordance with the information provided by the stubs.

Pandas 'SeriesGroupBy' has no method 'apply', 'groups', or 'get_group'

Script used:

import pandas as pd

df: pd.DataFrame = pd.DataFrame([[1, 2], [3, 4]], columns=["a", "b"], index=["c", "d"])

grouped = df.groupby("a")["b"]
grouped_list = grouped.apply(list)

print(df)
print(grouped)
print(grouped_list)
print(grouped.groups)
print(grouped.get_group(1))
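
A rough sketch of the missing members (the return types are assumptions; apply in particular is hard to type precisely):

from typing import Any, Callable, Dict

class SeriesGroupBy:
    @property
    def groups(self) -> Dict[Any, Any]: ...
    def apply(self, func: Callable[..., Any]) -> Series: ...
    def get_group(self, name: Any) -> Series: ...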
