zero323 / pyspark-stubs

Apache (Py)Spark type annotations (stub files).

License: Apache License 2.0
Right now methods on the `pyspark.ml.Params` mixins return the corresponding mixin type. While technically speaking correct, it is not very useful in practice.
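One possible refinement, sketched below under the assumption that setters live on the generated `shared` mixins: annotate `self` with a `TypeVar` so that chained calls preserve the concrete class. The fragment is illustrative, not the actual stub:

```python
from typing import TypeVar

from pyspark.ml.param import Param, Params

T = TypeVar("T")

# Hypothetical .pyi fragment:
class HasMaxIter(Params):
    maxIter: Param
    # Returning T instead of "HasMaxIter" means that e.g.
    # LogisticRegression().setMaxIter(10) keeps its precise type
    # instead of degrading to the mixin.
    def setMaxIter(self: T, value: int) -> T: ...
```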
Right now `createDataFrame` annotations are rather crude. It would be great to refine these, to check dependencies between the different types of arguments (see the sketch after this list):

- `RDD[Literal]` or `List[Literal]` requires a schema (string or `DataType`).
- `samplingRatio` and `verifySchema` are meaningful only if a schema has been provided.

(`Literal` here meaning an atomic value, or a `Tuple` / `List` of `Literals`, `Tuples`, `List[Literal]`, `Dict[Literal, Literal]`.)
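A hedged sketch of what such dependent signatures could look like, with `Literal` reduced to `Any` as a placeholder (the real definition would need the extended literal type):

```python
from typing import Any, List, Union, overload

from pyspark.rdd import RDD
from pyspark.sql.dataframe import DataFrame
from pyspark.sql.types import DataType

Literal = Any  # placeholder for the extended literal definition

class SparkSession:
    # Bare literals are only valid together with an explicit schema,
    # and samplingRatio / verifySchema only make sense alongside it.
    @overload
    def createDataFrame(
        self,
        data: Union[RDD[Literal], List[Literal]],
        schema: Union[str, DataType],
        samplingRatio: float = ...,
        verifySchema: bool = ...,
    ) -> DataFrame: ...
    # Structured rows (Rows, dicts, tuples) may omit the schema,
    # in which case the extra knobs are rejected.
    @overload
    def createDataFrame(self, data: Union[RDD[Any], List[Any]]) -> DataFrame: ...
```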
Is `Param` supposed to be `JavaMLWriteable`?
To ensure correctness of the annotations, we should validate these against actual code. A good starting point would be to run mypy against all PySpark examples.
Currently there are a number of problems with the `pyspark.sql.column` annotations. Some are related to Mypy behavior; others, like the `bitwise*` methods, to vague upstream semantics (should we allow `Any` if the only literal type acceptable at runtime is `int`?).
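To make the trade-off concrete, here are the two candidate annotations for one of the `bitwise*` methods (a sketch; neither is claimed to be the current stub):

```python
from typing import Any, Union

class ColumnPermissive:
    # Mirrors the dynamically-typed runtime: accepts anything,
    # deferring failures to the JVM.
    def bitwiseOR(self, other: Any) -> "ColumnPermissive": ...

class ColumnStrict:
    # Accepts only what actually succeeds at runtime: another
    # Column or an int literal.
    def bitwiseOR(self, other: Union["ColumnStrict", int]) -> "ColumnStrict": ...
```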
The following methods:

- `DStream.transform`
- `DStream.transformWith`
- `DStream.foreachRDD`

depend on `func` providing `__code__.co_argcount`, so `Callable` is not a good type bound. At least in some cases we can replace `Callable` with more precise `Protocols`, as sketched below.
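A minimal sketch of such a protocol (the `RDDFunction` name is made up for illustration):

```python
from types import CodeType
from typing import Any, Protocol  # Protocol needs Python 3.8+ or typing_extensions

class RDDFunction(Protocol):
    # DStream inspects func.__code__.co_argcount at runtime to decide
    # whether to call func(rdd) or func(time, rdd); a plain Callable
    # bound cannot guarantee that __code__ exists.
    __code__: CodeType

    def __call__(self, *args: Any) -> Any: ...
```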
`ResultIterable.data`: currently we use `typing.Iterable`, while in fact we need an equivalent of `Intersection[Iterable, Sized]`. This depends on python/typing#213.
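Until intersection types land, a structural protocol can stand in for them (a sketch; `SizedIterable` is a hypothetical name):

```python
from typing import Iterator, Protocol, TypeVar

T_co = TypeVar("T_co", covariant=True)

class SizedIterable(Protocol[T_co]):
    # Structurally equivalent to Intersection[Iterable[T_co], Sized]:
    # anything you can iterate over and call len() on.
    def __iter__(self) -> Iterator[T_co]: ...
    def __len__(self) -> int: ...
```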
Since the upstream API is still under discussion (SPARK-28264), and improved UDF annotations are still work in progress (#142), let's keep the quasi-dynamic (4a1da21) annotations for now and revisit this later.
We have a bunch of old-style annotations:

```python
foo = ...  # type: bar
```

Since Python 3.6 has been around for a while, and there is no indication that this project will be merged into the main Spark repository, we could update these to the new style, i.e.

```python
foo: bar = ...
```
When using the new mypy semantic analyzer, tests fail with:

```
AssertionError: Must not defer during final iteration
```

Possibly related to python/mypy#7129. Interestingly, it works on the second run, when the MyPy cache exists. For now we can pin mypy to e7ddba113d69055387996df33ceaace52b8c2c97 and revisit this later.
Right now we define `ParamMap` in multiple places. This not only violates DRY, but is also incompatible with the new mypy analyzer:

```
Cannot assign multiple types to name "ParamMap" without an explicit "Type[...]" annotation
```
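A sketch of one way to deduplicate it, with the module name being hypothetical:

```python
# pyspark/ml/_typing.pyi (hypothetical shared module)
from typing import Any, Dict

from pyspark.ml.param import Param

ParamMap = Dict[Param, Any]

# Consumers then import the alias instead of redefining it:
# from pyspark.ml._typing import ParamMap
```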
Hi,

I noticed while trying to use the `upper` function that if the argument passed is a string, we get the following error:

```
py4j.protocol.Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.upper. Trace:
py4j.Py4JException: Method upper([class java.lang.String]) does not exist
```

This makes sense, as only `upper(e: Column)` is implemented in Scala. Some methods (like `floor`) do have both `Column` and `String` parameter overloads, but some (like `upper`) don't, so the stubs for those methods should use the `Column` type instead of `ColumnOrName`.

Applies to 2.3, 2.4, and master.
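A sketch of the suggested signatures in `pyspark/sql/functions.pyi` (simplified; `ColumnOrName` is the `Union[Column, str]` alias used by the stubs):

```python
from typing import Union

from pyspark.sql.column import Column

ColumnOrName = Union[Column, str]

# floor(e: Column) and floor(columnName: String) both exist in Scala,
# so the union is fine here:
def floor(col: ColumnOrName) -> Column: ...

# upper only has the upper(e: Column) variant, so plain strings
# should be rejected by the stub:
def upper(col: Column) -> Column: ...
```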
Currently we support only direct application of `udf` (and `pandas_udf`). In other words, this will type check:
```python
from pyspark.sql.functions import col, udf
from pyspark.sql.column import Column

def f(x: int) -> int:
    return x

f_ = udf(f, "str")
f_(col("foo"))
```
but this won't:
```python
from pyspark.sql.functions import col, udf
from pyspark.sql.column import Column

@udf("str")
def g(x: int) -> int:
    return x

g(col("foo"))
```

```
foo.py:4: error: Argument 1 to "udf" has incompatible type "str"; expected "Callable[..., Any]"
foo.py:4: error: Argument 1 to "__call__" of "UserDefinedFunctionLike" has incompatible type "Callable[[int], int]"; expected "Union[Column, str]"
foo.py:8: error: "Column" not callable
```
I guess we can address that by providing overloaded variants (not tested):
```python
@overload
def udf(f: Callable[..., Any], returnType: DataTypeOrString = ...) -> Callable[..., Column]: ...
@overload
def udf(f: DataTypeOrString = ...) -> Callable[[Callable[..., Any]], Callable[..., Column]]: ...
```
Blocked by python/mypy#7243
As of v2.4.0, pyspark has some 'back compatibility' imports in the `pyspark` package that are missing in the stubs:

https://github.com/apache/spark/blob/v2.4.0/python/pyspark/__init__.py#L114

Can we add them?
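If I read the linked line correctly, mirroring it in the stub would amount to something like the following (assuming the re-exports in question are the `pyspark.sql` ones; please double-check against the 2.4 sources):

```python
# pyspark/__init__.pyi — back-compatibility re-exports mirroring
# python/pyspark/__init__.py in the v2.4.0 tree (assumed, see link)
from pyspark.sql import SQLContext, HiveContext, Row
```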
Right now we simply allow `RDD[Tuple]`, but it is not very precise. More precise annotations would require an extended literal definition.

Related to #115
Current annotations follow the code with its `__new__` applications. This is not useful in practice and should be fixed, possibly like a1dffe9.
Right now we test only against Mypy. It might be worth considering testing against pytype as well. That, however, might be too much overhead: keeping up with mypy alone is time-consuming, and another type checker might make things even worse.
We define the `TypeVar` `T` in different modules, as a placeholder for `Params` types, for example in `shared` and in `regression`. Because the contents of `pyspark.ml.param.shared` are star-imported, this creates a conflict in the latest mypy builds (with the new semantic analyzer enabled):

```
... error: Cannot redefine 'T' as a type variable
```
See python/mypy#7185
See #152
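A minimal reproduction, assuming both stubs define their own `T` (paths abbreviated, two files shown in one block):

```python
# pyspark/ml/param/shared.pyi
from typing import TypeVar

T = TypeVar("T")

# pyspark/ml/regression.pyi
from typing import TypeVar

from pyspark.ml.param.shared import *  # star import re-exports shared's T

T = TypeVar("T")  # new analyzer: Cannot redefine 'T' as a type variable
```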
While simple cases work pretty well:

```python
from pyspark import SparkContext

(SparkContext
    .getOrCreate()
    .textFile("README.md")
    .flatMap(str.split)
    .map(lambda x: x + "1"))
```
Pair RDD annotations are a bit unusable:

```python
from pyspark import SparkContext

pairs = (SparkContext
    .getOrCreate()
    .textFile("README.md")
    .flatMap(str.split)
    .map(lambda x: (x.lower(), 1)))

from operator import add

pairs.reduceByKey(add).first()[0].upper()
```

```
main.py:11: error: object has no attribute "upper"
```
The error above occurs unless explicit type annotations are added:

```python
key = pairs.reduceByKey(add).first()[0]  # type: str
key.upper()
```
It is also possible to pass incompatible objects:

```python
def add(x: str, y: str) -> str:
    return x + y

pairs.reduceByKey(add)  # type checks
```

It could be a problem with the current annotations or a Mypy issue:

> This feature is experimental. Checking code with type annotations for self arguments is still not fully implemented. Mypy may disallow valid code or allow unsafe code.
Right now we explicitly test annotations against 3.5, 3.6, and 3.7. However, that is a bit wasteful, as we already run mypy against multiple targets with different `--python-version` values. It seems that we could speed up the tests and reduce the load on Travis by keeping only the latest Python version.