The pyspark-stubs's discuss from zero323

Review changes in core pyspark in PySpark 2.4

Params should return self type.

Right now pyspark.ml.Params return corresponding mixin type. While technically speaking correct it is not very useful in practice.

Improve createDataFrame annotations

Right now createDataFrame are rather crude. It would be great refine these, to check dependencies between different type of arguments:

RDD[Literal] or List[Literal] requires schema string or DataType.
Schema (DDL string or DataType) is exclusive with samplingRatio.
verifySchema is meaningful only if schema has been provided.
Input should be Literal or Tuple / List of Literals, Tuples, List[Literal], Dict[Literal, Literal].

Consider making MLReadable / MLReader generic in terms of loaded object

Review changes in pyspark.ml in Spark 2.4

Inconsistent type of instance argument pyspark.ml.util.JavaMLWriter.init

Is Param should be JavaMLWriteable.

Review pyspark.ml.evaluation changes in Spark <= 2.3

Add static annotations for pyspark.ml.classification.{LinearSVC, LinearSVCModel}

Improve tests

To ensure correctness of the annotation we should validate these against actual code.

A good starting point would be to run mypy against all PySpark examples.

Review Column annotations

Currently there is a number of problems with pyspark.sql.column annotations. Some are related to the Mypy behavior:

other, like bitwise* to the vague upstream semantics (should we allow Any if the only literal type acceptable on runtime is int?).

Define more precise type for a DStream.transform* function

Following methods:

DStream.transform
DStream.transformWith
DStream.foreachRDD

depend on func providing __code__.co_argcount so Callable is not a good type bound.

Review Callable usage

At least in some cases we can replace Callable with more precise Protocols.

Add static annotations for pyspark.ml.feature.{BucketedRandomProjectionsLSH, BucketedRandomProjectionLSHModel}

Refine type of ResultIterable.data

ResultIterable.data currently we use typing.Iterable, while in fact we need an equivalent of Interection[Iterable, Sized]. This depends on python/typing#213

BlockMatrix initializer should take Matrix not MatrixEntry

pyspark-stubs/third_party/3/pyspark/mllib/linalg/distributed.pyi

Line 81 in 07b1c57

    
           def __init__(self, blocks: RDD[Tuple[Tuple[int, int], MatrixEntry]], rowsPerBlock: int, colsPerBlock: int, numRows: int = ..., numCols: int = ...) -> None: ...

Review changes in pyspark.ml in 3.0.0

Review changes in pyspark.mllib in 3.0.0

Provide more precise UDF annotations

Currently udfs (especially vectorized ones) have rather crude annotations, which don't really capture actual relationships between arguments.

This can be improved by using literal types and protocols.

Related to #125, #137

Review changes in pyspark.sql in Spark 2.4

Add static annotations for DataFrame.mapInPandas

Since usptream API is still under discussion (SPARK-28264), and improved UDF annotations are still work in progress (#142), let's keep quasi-dynamic (4a1da21) annotation for now, and revisit this later.

Review pyspark.ml.feature changes in Spark <= 2.3

Add static annotations for pyspark.ml.image

Update variable annotation to new style

We have a bunch of old style

foo = ....  # type: bar

annotations. Since Python 3.6 has been around for a while, and there is no indication that this project will be merged with main Spark repository, we could update these to new style, i.e.

foo: bar = ...

Must not defer during final iteration with new mypy analyzer

While using new mypy analyzer tests fail with

AssertionError: Must not defer during final iteration

Possibly related to python/mypy#7129

Interestingly it works in the second run, when MyPy cache exists.

For now we can freeze mypy version to e7ddba113d69055387996df33ceaace52b8c2c97 and revisit this later.

Repeated ParaMap definitions are incompatible with new mypy analyzer

Right now we define ParamMap in multiple places. It not only violates DRY, but is also incompatible with new mypy analyzer:

Cannot assign multiple types to name "ParamMap" without an explicit "Type[...]" annotation

Review changes in pyspark.streaming in Spark 2.4

Some functions take only Column type as argument

Hi,

I noticed while trying to use the upper function that if the argument passed is a String we got the following error:

py4j.protocol.Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.upper. Trace:
py4j.Py4JException: Method upper([class java.lang.String]) does not exist

It makes sense as only upper(e: Column) is implemented in Scala

Some methods (like floor) do have both Column and String parameter overload, but some (like upper) don't, so the stubs for those methods should indicate Column type instead of ColumnOrName.

HasWeightCol is missing from OneVsRestParams

Applies to 2.3, 2.4, and master

Add support for decorator UDFs

Currently we support only direct application of udf (and pandas_udf). In other words this will type check:

from pyspark.sql.functions import col, udf
from pyspark.sql.column import Column

def f(x: int) -> int:
    return x

f_ = udf(f, "str")
f_(col("foo"))

but this, won't

from pyspark.sql.functions import col, udf
from pyspark.sql.column import Column

@udf("str")
def g(x: int) -> int:
    return x

g(col("foo"))

foo.py:4: error: Argument 1 to "udf" has incompatible type "str"; expected "Callable[..., Any]"
foo.py:4: error: Argument 1 to "__call__" of "UserDefinedFunctionLike" has incompatible type "Callable[[int], int]"; expected "Union[Column, str]"
foo.py:8: error: "Column" not callable

I guess we can address that providing overloaded variant (not tested):

@overload
def udf(f: Callable[..., Any], returnType: DataTypeOrString = ...) -> Callable[..., Column]: ...
@overload
def udf(f: DataTypeOrString = ...) -> Callable[[Callable[..., Any]], Callable[..., Column]]: ...

Related to #142, #143

Blocked by python/mypy#7243

Review changes in pyspark.mllib in Spark 2.4

Review changes in core pyspark in 3.0.0

Add static annotations for pyspark.ml.feature.{LSHParams, LSHModel}

Add static annotations for pyspark.ml.feature.{MinHashLSH, MinHashLSHModel}

Add 'back compatibility' imports to spark package

As of v2.4.0, pyspark has some 'back compatibility' imports in the spark package that are missing in the stub:
https://github.com/apache/spark/blob/v2.4.0/python/pyspark/__init__.py#L114

Can we add them?

Review changes in pyspark.sql in 3.0.0

Add Python 3.7 to Travis profile

Refine RDD.toDF annotation

Right now we simply allow RDD[Tuple], but it is not very precise. A more precise annotations will require an extended literal definition.

Related to #115

Review changes in pyspark.ml.base in Spark < 2.3

Row annotations should be fixed

Current annotations follow the code with its __new__ applications. This is not useful in practice and should be fixed.

Possibly like a1dffe9

Consider testing against pytype

Right now we test only against Mypy. It might worth considering testing against pytype as well.

That however might be to much overhead. Keeping up with mypy alone is time consuming, and another type checker might make things even worse.

pyspark.ml.param.shared.TypeVar(T) conflicts with other TypeVar definitions

In shared we define T Typevar
Similarly we define T in different modules, as placeholders for Params types. For example in regression

Because pyspark.ml.param.shared are star imported this creates a conflict in the latest mypy builds (with new semantic analyzer enabled).

... error: Cannot redefine 'T' as a type variable

See python/mypy#7185

Review changes in pyspark.ml.classification in Spark <= 2.3

Enable new semantic analyzer once it is fixed

See #152

Consider using Protocols to replace NumberOrArray

Review changes in pyspark.streaming in 3.0.0

HasMaxIter should return Type[self] not HasMaxIter

Generic self

While simple cases work pretty well:

from pyspark import SparkContext

(SparkContext
    .getOrCreate()
    .textFile("README.md")
    .flatMap(str.split)
    .map(lambda x: x + 1))

Pair RDD annotations are a bit unusable:

from pyspark import SparkContext

pairs = (SparkContext
    .getOrCreate()
    .textFile("README.md")
    .flatMap(str.split)
    .map(lambda x: (x.lower(), 1)))

from operator import add

pairs.reduceByKey(add).first()[0].upper()

main.py:11: error: object has no attribute "upper"

without explicit type annotations:

key = pairs.reduceByKey(add).first()[0]  # type: str
key.upper()

It is also possible to pass incompatible objects:

def add(x: str, y: str) -> str:
    return x + y

pairs.reduceByKey(add)  # type checks

It could be a problem with current annotations or a Mypy issue:

This feature is experimental. Checking code with type annotations for self arguments is still not fully implemented. Mypy may disallow valid code or allow unsafe code.

zero323 / pyspark-stubs Goto Github PK

pyspark-stubs's Issues

Recommend Projects

Recommend Topics

Recommend Org