
PySpark Stubs


A collection of Apache Spark stub files. These files were generated by stubgen and manually edited to include accurate type hints.

Tests and configuration files were originally contributed to the Typeshed project. Please refer to its contributors list and license for details.

Important

This project has been merged with the main Apache Spark repository (SPARK-32714). All further development for Spark 3.1 and onwards will be continued there.

For Spark 2.4 and 3.0, development of this package will continue until their official deprecation.

  • If your problem is specific to Spark 2.4 or 3.0, feel free to create an issue or open a pull request here.
  • Otherwise, please check the official Spark JIRA and contributing guidelines. If you create a JIRA ticket or Spark PR related to type hints, please ping me with [~zero323] or @zero323 respectively. Thanks in advance.

Motivation

  • Static error detection (see SPARK-20631). With the stubs in place, a type checker can flag incorrect PySpark API usage before the code is run; a minimal sketch follows below.

  • Improved autocompletion in editors and notebooks.
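For illustration, a hypothetical snippet of the kind of mistake the stubs let Mypy catch statically (the exact error wording depends on the Mypy version):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# DataFrame.show expects an int for the number of rows; with the stubs
# installed, Mypy reports the mismatch below without running any Spark job.
spark.range(10).show("all")  # error: incompatible type "str"; expected "int"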

Installation and usage

Please note that the guidelines for distributing type information are still a work in progress (PEP 561 - Distributing and Packaging Type Information). Currently, the installation script overlays existing Spark installations (pyi stub files are copied next to their py counterparts in the PySpark installation directory). If this approach is not acceptable, you can add the stub files to the search path manually.

According to PEP 484:

Third-party stub packages can use any location for stub storage. Type checkers should search for them using PYTHONPATH.

Moreover:

A default fallback directory that is always checked is shared/typehints/python3.5/ (or 3.6, etc.)

Please check usage before proceeding.

The package is available on PyPI:

pip install pyspark-stubs

and conda-forge:

conda install -c conda-forge pyspark-stubs

Depending on your environment, you might also need a type checker, like Mypy or Pytype [1], and an autocompletion tool, like Jedi.

Editor                                   | Type checking | Autocompletion | Notes
-----------------------------------------|---------------|----------------|------------------------
Atom                                     | yes [2]       | yes [3]        | Through plugins.
IPython / Jupyter Notebook               | yes [4]       |                |
PyCharm                                  | yes           | yes            |
PyDev                                    | yes [5]       | ?              |
VIM / Neovim                             | yes [6]       | yes [7]        | Through plugins.
Visual Studio Code                       | yes [8]       | yes [9]        | Completion with plugin.
Environment independent / other editors  | yes [10]      | yes [11]       | Through Mypy and Jedi.

This package is tested against the MyPy development branch and, in rare cases (primarily when it depends on important upstream bugfixes), may not be compatible with the preceding MyPy release.

PySpark Version Compatibility

Package versions follow PySpark versions, with the exception of maintenance releases - i.e. pyspark-stubs==2.3.0 should be compatible with pyspark>=2.3.0,<2.4.0. Maintenance releases (post1, post2, ..., postN) are reserved for internal annotation updates.

API Coverage:

As of release 2.4.0, most of the public API is covered. For details, please check the API coverage document.

See also

Disclaimer

Apache Spark, Spark, PySpark, Apache, and the Spark logo are trademarks of The Apache Software Foundation. This project is not owned, endorsed, or sponsored by The Apache Software Foundation.

Footnotes


  1. Not supported or tested.

  2. Requires atom-mypy or equivalent.

  3. Requires autocomplete-python-jedi or equivalent.

  4. It is possible to use magics to type check directly in the notebook. In general though, you'll have to export the whole notebook to a .py file and run the type checker on the result.

  5. Requires PyDev 7.0.3 or later.

  6. Using vim-mypy, syntastic or Neomake.

  7. With jedi-vim.

  8. With Mypy linter.

  9. With Python extension for Visual Studio Code.

  10. Just use your favorite checker directly, optionally combined with a tool like entr.

  11. See Jedi editor plugins list.

Contributors

bosscolo, braamling, carylee, charlietsai, chehsunliu, guangie88, harpaj, jhereth, oliverw1, pgrz, radeklat, sproshev, tpvasconcelos, utkarshgupta137, yilin-sai, zero323, zpencerq


pyspark-stubs's Issues

Improve createDataFrame annotations

Right now the createDataFrame annotations are rather crude. It would be great to refine these to check dependencies between the different types of arguments (see the sketch after this list):

  • RDD[Literal] or List[Literal] requires schema string or DataType.
  • Schema (DDL string or DataType) is exclusive with samplingRatio.
  • verifySchema is meaningful only if schema has been provided.
  • Input should be Literal or Tuple / List of Literals, Tuples, List[Literal], Dict[Literal, Literal].
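A rough, untested .pyi-style sketch of how overloads could encode the schema / samplingRatio exclusivity (simplified for illustration; RDD inputs and the finer Literal constraints are omitted):

from typing import Any, Iterable, Optional, Union, overload

from pyspark.sql.dataframe import DataFrame
from pyspark.sql.types import DataType

class SparkSession:
    @overload
    def createDataFrame(
        self,
        data: Iterable[Any],
        schema: Union[DataType, str],
        *,
        verifySchema: bool = ...,  # meaningful only when a schema is given
    ) -> DataFrame: ...
    @overload
    def createDataFrame(
        self,
        data: Iterable[Any],
        samplingRatio: Optional[float] = ...,
    ) -> DataFrame: ...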

Generic self

While simple cases work pretty well:

from pyspark import SparkContext

(SparkContext
    .getOrCreate()
    .textFile("README.md")
    .flatMap(str.split)
    .map(lambda x: x + 1))

Pair RDD annotations are a bit unusable:

from pyspark import SparkContext

pairs = (SparkContext
    .getOrCreate()
    .textFile("README.md")
    .flatMap(str.split)
    .map(lambda x: (x.lower(), 1)))
from operator import add

pairs.reduceByKey(add).first()[0].upper()
main.py:11: error: object has no attribute "upper"

so explicit type annotations are needed as a workaround:

key = pairs.reduceByKey(add).first()[0]  # type: str
key.upper()

It is also possible to pass incompatible objects:

def add(x: str, y: str) -> str:
    return x + y

pairs.reduceByKey(add)  # type checks 

It could be a problem with the current annotations or a Mypy issue. The Mypy documentation notes:

This feature is experimental. Checking code with type annotations for self arguments is still not fully implemented. Mypy may disallow valid code or allow unsafe code.
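For context, the pair-RDD methods in the stubs rely on self-typed annotations roughly like the following (paraphrased sketch, not the exact stub):

from typing import Callable, Generic, Optional, Tuple, TypeVar

T = TypeVar("T")
K = TypeVar("K")
V = TypeVar("V")

class RDD(Generic[T]):
    # Only applicable to RDDs whose elements are (key, value) tuples;
    # Mypy's support for this kind of self-type constraint is still experimental.
    def reduceByKey(
        self: "RDD[Tuple[K, V]]",
        func: Callable[[V, V], V],
        numPartitions: Optional[int] = ...,
    ) -> "RDD[Tuple[K, V]]": ...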

Provide more precise UDF annotations

Currently UDFs (especially vectorized ones) have rather crude annotations, which don't really capture the actual relationships between arguments.

This can be improved by using literal types and protocols.
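A very rough, hypothetical illustration of the idea (the names and eval-type values below are simplified and are not the actual stub API):

from typing import Any, Callable

from typing_extensions import Literal, Protocol

from pyspark.sql.column import Column

class VectorizedUDF(Protocol):
    # The wrapped function is applied to columns and yields a Column expression.
    def __call__(self, *cols: Column) -> Column: ...

def pandas_udf(
    f: Callable[..., Any],
    returnType: str,
    # Pin functionType to the pandas UDF eval types instead of accepting any int.
    functionType: Literal[200, 201, 202] = ...,  # SCALAR, GROUPED_MAP, GROUPED_AGG
) -> VectorizedUDF: ...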

Related to #125, #137

Simplify test matrix

Right now we explicitly test annotations against Python 3.5, 3.6 and 3.7. However, that is a bit wasteful, as we already run mypy against multiple targets with different --python-version values.

It seems that we could speed up the tests and reduce the load on Travis by keeping only the latest Python version.

Refine RDD.toDF annotation

Right now we simply allow RDD[Tuple], but that is not very precise. More precise annotations will require an extended literal definition.

Related to #115

Add support for decorator UDFs

Currently we support only direct application of udf (and pandas_udf). In other words, this will type check:

from pyspark.sql.functions import col, udf
from pyspark.sql.column import Column

def f(x: int) -> int:
    return x

f_ = udf(f, "str")
f_(col("foo"))

but this won't:

from pyspark.sql.functions import col, udf
from pyspark.sql.column import Column

@udf("str")
def g(x: int) -> int:
    return x

g(col("foo"))

foo.py:4: error: Argument 1 to "udf" has incompatible type "str"; expected "Callable[..., Any]"
foo.py:4: error: Argument 1 to "__call__" of "UserDefinedFunctionLike" has incompatible type "Callable[[int], int]"; expected "Union[Column, str]"
foo.py:8: error: "Column" not callable

I guess we can address that by providing an overloaded variant (not tested):

# In the .pyi stub; assumes the module's existing imports of overload, Callable,
# Any, Column and the DataTypeOrString alias.
@overload
def udf(f: Callable[..., Any], returnType: DataTypeOrString = ...) -> Callable[..., Column]: ...
@overload
def udf(f: DataTypeOrString = ...) -> Callable[[Callable[..., Any]], Callable[..., Column]]: ...

Related to #142, #143

Blocked by python/mypy#7243

Some functions take only Column type as argument

Hi,

I noticed, while trying to use the upper function, that if the argument passed is a string we get the following error:

py4j.protocol.Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.upper. Trace:
py4j.Py4JException: Method upper([class java.lang.String]) does not exist

It makes sense, as only upper(e: Column) is implemented in Scala.

Some methods (like floor) do have both Column and String parameter overloads, but some (like upper) don't, so the stubs for those methods should indicate the Column type instead of ColumnOrName; a reproduction and the corresponding stub change are sketched below.
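For example (hypothetical reproduction; the Py4J error above is what the second call produces at runtime on the affected Spark versions):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, upper

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",)], ["name"])

df.select(upper(col("name")))  # fine: upper receives a Column
df.select(upper("name"))       # currently accepted by the stubs, fails at runtime

With a signature along the lines of def upper(col: Column) -> Column: ... in the stub, the second call would be rejected by the type checker instead of failing with a Py4JError.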

Update variable annotation to new style

We have a bunch of old style

foo = ....  # type: bar

annotations. Since Python 3.6 has been around for a while, and at the time there was no indication that this project would be merged with the main Spark repository, we could update these to the new style, i.e.

foo: bar = ...

Consider testing against pytype

Right now we test only against Mypy. It might be worth considering testing against pytype as well.

That, however, might be too much overhead. Keeping up with mypy alone is time consuming, and another type checker might make things even worse.

Review Column annotations

Currently there are a number of problems with the pyspark.sql.column annotations. Some are related to Mypy behavior; others, like the bitwise* methods, to vague upstream semantics (should we allow Any if the only literal type acceptable at runtime is int?).

Params should return self type.

Right now pyspark.ml.Params methods return the corresponding mixin type. While technically correct, this is not very useful in practice; a sketch of the alternative is shown below.
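A minimal .pyi-style sketch (with assumed, simplified names) of returning the concrete subclass instead of the mixin, using a self-typed TypeVar:

from typing import TypeVar

P = TypeVar("P", bound="Params")

class Params: ...

class HasInputCol(Params):
    # Returning P means that, e.g., Tokenizer().setInputCol("text") is typed
    # as Tokenizer rather than as the HasInputCol mixin.
    def setInputCol(self: P, value: str) -> P: ...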

Improve tests

To ensure the correctness of the annotations, we should validate them against actual code.

A good starting point would be to run mypy against all PySpark examples, as sketched below.
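One possible shape for such a check, using Mypy's Python API (the example path and flags below are assumptions):

import os

from mypy import api


def test_pyspark_examples_type_check() -> None:
    # Assumes SPARK_HOME points at a Spark distribution that ships the
    # Python examples under examples/src/main/python.
    examples = os.path.join(
        os.environ["SPARK_HOME"], "examples", "src", "main", "python"
    )
    stdout, stderr, exit_code = api.run(["--ignore-missing-imports", examples])
    assert exit_code == 0, stdout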
