zero323 / pyspark-stubs

Apache (Py)Spark type annotations (stub files).

License: Apache License 2.0

Language: Python 100.00%

Topics: apache-spark, python, python-3, stub-files, type-annotations, pyspark, mypy, pep484

pyspark-stubs's Issues

Params should return self type.

Right now pyspark.ml.Params methods return the corresponding mixin type. While technically correct, this is not very useful in practice.
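
A minimal sketch of what that could look like, using a TypeVar bound to Params so that setters return the concrete subclass (the class and method names here are illustrative, not the actual ml stubs):

from typing import TypeVar

P = TypeVar("P", bound="Params")

class Params: ...

class HasInputCol(Params):
    # Returning the TypeVar rather than the mixin keeps the concrete
    # estimator / transformer type at the call site.
    def setInputCol(self: P, value: str) -> P: ...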

Improve createDataFrame annotations

Right now the createDataFrame annotations are rather crude. It would be great to refine these to check the dependencies between different types of arguments (see the sketch after this list):

  • RDD[Literal] or List[Literal] requires a schema (DDL string or DataType).
  • Schema (DDL string or DataType) is exclusive with samplingRatio.
  • verifySchema is meaningful only if a schema has been provided.
  • Input should be a Literal, or a Tuple / List of Literals, Tuples, List[Literal], or Dict[Literal, Literal].
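
A stub-style (.pyi) sketch of how @overload could encode the first two constraints (simplified, not the actual stubs):

from typing import Any, List, Optional, Union, overload

from pyspark.rdd import RDD
from pyspark.sql.dataframe import DataFrame
from pyspark.sql.types import DataType

class SparkSession:
    @overload
    def createDataFrame(
        self,
        data: Union[RDD[Any], List[Any]],
        schema: Union[str, DataType],  # schema given: verifySchema allowed
        verifySchema: bool = ...,
    ) -> DataFrame: ...
    @overload
    def createDataFrame(
        self,
        data: Union[RDD[Any], List[Any]],
        samplingRatio: Optional[float] = ...,  # no schema: samplingRatio allowed
    ) -> DataFrame: ...

A call that passes both schema and samplingRatio matches neither overload, which encodes the exclusivity constraint.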

Improve tests

To ensure the correctness of the annotations, we should validate them against actual code.

A good starting point would be to run mypy against all PySpark examples.
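
For example, a minimal pytest-style sketch (the examples path is an assumption, adjust to the local checkout):

import subprocess
import sys

def test_examples_type_check() -> None:
    # Run mypy over the bundled PySpark examples and fail the test on
    # any reported error; the path below is illustrative.
    result = subprocess.run(
        [sys.executable, "-m", "mypy", "spark/examples/src/main/python"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
    )
    assert result.returncode == 0, result.stdout.decode()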

Review Column annotations

Currently there are a number of problems with the pyspark.sql.column annotations. Some are related to Mypy behavior; others, like the bitwise* methods, to vague upstream semantics (should we allow Any if the only literal type acceptable at runtime is int?).
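
For illustration, the stricter option would look something like this (a sketch, not the current stubs):

from typing import Union

class Column:
    # Stricter variant: accept only Column or int, since int is the only
    # literal type that works at runtime.
    def bitwiseOR(self, other: Union["Column", int]) -> "Column": ...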

Provide more precise UDF annotations

Currently UDFs (especially vectorized ones) have rather crude annotations, which don't really capture the actual relationships between arguments.

This can be improved by using literal types and protocols.

Related to #125, #137
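
One way the protocol half of that could look (a sketch; typing_extensions assumed for pre-3.8 Pythons, and the signature is simplified relative to the real stubs):

from typing import Union

from typing_extensions import Protocol

from pyspark.sql.column import Column

ColumnOrName = Union[Column, str]

class UserDefinedFunctionLike(Protocol):
    # Applying the wrapped UDF to columns (or column names) yields a Column.
    def __call__(self, *cols: ColumnOrName) -> Column: ...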

Update variable annotation to new style

We have a bunch of old style

foo = ...  # type: bar

annotations. Since Python 3.6 has been around for a while, and there is no indication that this project will be merged into the main Spark repository, we could update these to the new style, i.e.

foo: bar = ...

Some functions take only Column type as argument

Hi,

I noticed, while trying to use the upper function, that if the argument passed is a str we get the following error:

py4j.protocol.Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.upper. Trace:
py4j.Py4JException: Method upper([class java.lang.String]) does not exist

It makes sense, as only upper(e: Column) is implemented in Scala.

Some methods (like floor) have both Column and String overloads, but some (like upper) don't, so the stubs for those methods should indicate Column instead of ColumnOrName.
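
In stub terms the fix would look roughly like this (simplified):

from typing import Union

from pyspark.sql.column import Column

ColumnOrName = Union[Column, str]

# upper has no String overload in Scala, so Column only:
def upper(col: Column) -> Column: ...

# floor accepts both a Column and a column name:
def floor(col: ColumnOrName) -> Column: ...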

Add support for decorator UDFs

Currently we support only direct application of udf (and pandas_udf). In other words, this will type check:

from pyspark.sql.functions import col, udf
from pyspark.sql.column import Column

def f(x: int) -> int:
    return x

f_ = udf(f, "string")
f_(col("foo"))

but this won't:

from pyspark.sql.functions import col, udf
from pyspark.sql.column import Column

@udf("string")
def g(x: int) -> int:
    return x

g(col("foo"))

foo.py:4: error: Argument 1 to "udf" has incompatible type "str"; expected "Callable[..., Any]"
foo.py:4: error: Argument 1 to "__call__" of "UserDefinedFunctionLike" has incompatible type "Callable[[int], int]"; expected "Union[Column, str]"
foo.py:8: error: "Column" not callable

I guess we can address that by providing overloaded variants (not tested):

from typing import Any, Callable, overload

@overload
def udf(f: Callable[..., Any], returnType: DataTypeOrString = ...) -> Callable[..., Column]: ...
@overload
def udf(f: DataTypeOrString = ...) -> Callable[[Callable[..., Any]], Callable[..., Column]]: ...

Related to #142, #143

Blocked by python/mypy#7243

Refine RDD.toDF annotation

Right now we simply allow RDD[Tuple], but it is not very precise. More precise annotations will require an extended literal definition.

Related to #115
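
A hedged sketch of what that extended definition might look like (the aliases are illustrative):

from typing import Tuple, Union

# A possible extended "literal" alias; the real definition would need
# to cover more types (dates, decimals, nested rows, ...).
LiteralType = Union[bool, int, float, str]
LiteralRow = Tuple[LiteralType, ...]

toDF could then be constrained to RDD[LiteralRow] instead of any RDD[Tuple].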

Consider testing against pytype

Right now we test only against Mypy. It might be worth considering testing against pytype as well.

That, however, might be too much overhead. Keeping up with mypy alone is time consuming, and another type checker might make things even worse.

Generic self

While simple cases work pretty well:

from pyspark import SparkContext

(SparkContext
    .getOrCreate()
    .textFile("README.md")
    .flatMap(str.split)
    .map(lambda x: x + "!"))

Pair RDD annotations, however, are rather unusable:

from pyspark import SparkContext

pairs = (SparkContext
    .getOrCreate()
    .textFile("README.md")
    .flatMap(str.split)
    .map(lambda x: (x.lower(), 1)))
from operator import add

pairs.reduceByKey(add).first()[0].upper()

main.py:11: error: object has no attribute "upper"

unless explicit type annotations are provided:

key = pairs.reduceByKey(add).first()[0]  # type: str
key.upper()

It is also possible to pass incompatible objects:

def add(x: str, y: str) -> str:
    return x + y

pairs.reduceByKey(add)  # type checks, even though the values are ints

It could be a problem with the current annotations or a Mypy issue; the Mypy docs warn:

This feature is experimental. Checking code with type annotations for self arguments is still not fully implemented. Mypy may disallow valid code or allow unsafe code.
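
For context, the pair RDD stubs lean on that experimental feature; roughly (a simplified sketch, not the exact stubs):

from typing import Callable, Generic, Tuple, TypeVar

T = TypeVar("T")
K = TypeVar("K")
V = TypeVar("V")

class RDD(Generic[T]):
    # Annotating self restricts reduceByKey to pair RDDs; this is the
    # experimental generic self feature the warning above refers to.
    def reduceByKey(
        self: "RDD[Tuple[K, V]]",
        func: Callable[[V, V], V],
    ) -> "RDD[Tuple[K, V]]": ...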

Simplify test matrix

Right now we explicitly test the annotations against Python 3.5, 3.6 and 3.7. However, that is a bit wasteful, as we already run mypy against multiple targets with different --python-version values.

It seems that we could speed up the tests and reduce the load on Travis by keeping only the latest Python version.
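
In other words, something like this would run once on the newest interpreter while still covering every target (a sketch; the stubs path is an assumption):

import subprocess

# One interpreter, several --python-version targets:
for version in ("3.5", "3.6", "3.7"):
    subprocess.run(
        ["mypy", "--python-version", version, "third_party/3"],
        check=True,
    )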
