Comments (4)
Partially addressed by #119
from pyspark-stubs.
@zero323 According to the documentation, it is possible to create a DataFrame from a proper schema and data given as a list, e.g. List[tuple].
I have prepared an example that mypy rejects but pyspark accepts, and it works properly, at least in my opinion 😅
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

# `spark` is an existing SparkSession
data = [('Alice', 1)]
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])
# Creates the DataFrame, but the declared "IntegerType" is lost
spark.createDataFrame(data, ['name', 'age'])
# Creates the DataFrame correctly, but `mypy` reports an incompatible argument type
spark.createDataFrame(data, schema)
The exact error produced by mypy is as follows:
error: Argument 1 to "createDataFrame" of "SparkSession" has incompatible type "List[Tuple[str, int]]"; expected "Union[RDD[Union[datetime, date, bool, int, float, str, Decimal]], List[Union[datetime, date, bool, int, float, str, Decimal]]]"
Could you tell me whether your issue also covers this case?
In the project where we use your library, we create DataFrames manually, mostly for unit tests, because it is easier to keep the test data as a simple Python list.
I can also prepare a pull request to resolve my specific problem, but I know it may not cover everything you have in mind. That's why I would like to discuss it first 😉
The versions which I have used for this test:
Python 3.5.1
pyspark==2.4.3
pyspark-stubs==2.4.0.post5
@redlickigrzegorz That's something I'd consider a bug, rather than a target of this particular enhancement proposal.
There are at least two problems here - the overload that has been partially matched is too wide, and there is no overload that is intended to match this case.
If you want to work on that I'd suggest three things:
- Narrowing down the overload that has been matched here by changing schema to AtomicType. With your failing case, we want to see something like: error: No overload variant of "createDataFrame" of "SparkSession" matches argument types "List[Tuple[str, int]]", "StructType" ...
- Adding another overload that targets this specific case: (Union[RDD[Union[List,Tuple]], Iterable[Union[List,Tuple]]], StructType)
- Adding a new data-driven test case that confirms that all definitions work (that's probably the most time-consuming part here).
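The first two suggestions could be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual pyspark-stubs definitions: RDD, AtomicType, StructType and DataFrame are hypothetical stand-in classes so the snippet runs without pyspark, and create_data_frame is a hypothetical free function standing in for SparkSession.createDataFrame.

```python
from datetime import date, datetime
from decimal import Decimal
from typing import Generic, Iterable, List, Tuple, TypeVar, Union, overload

T = TypeVar("T")


# Hypothetical stand-ins for the pyspark classes (assumptions, not the real API).
class RDD(Generic[T]):
    pass


class AtomicType:
    pass


class StructType:
    def __init__(self, names: List[str]) -> None:
        self.names = names


class DataFrame:
    def __init__(self, rows, schema) -> None:
        self.rows = rows
        self.schema = schema


# The primitive values listed in the mypy error message from the thread.
Primitive = Union[datetime, date, bool, int, float, str, Decimal]
# Row-like values (lists or tuples), as in the suggested new overload.
RowLike = Union[List, Tuple]


# Narrowed overload: primitive values are accepted only with an atomic schema.
@overload
def create_data_frame(
    data: Union[RDD[Primitive], Iterable[Primitive]], schema: AtomicType
) -> DataFrame: ...


# New overload: rows given as lists or tuples, paired with a StructType.
@overload
def create_data_frame(
    data: Union[RDD[RowLike], Iterable[RowLike]], schema: StructType
) -> DataFrame: ...


def create_data_frame(data, schema):
    # The runtime behaviour is irrelevant to the typing question; just wrap the inputs.
    return DataFrame(data, schema)


# The failing case from the thread now has an overload intended to match it.
df = create_data_frame([("Alice", 1)], StructType(["name", "age"]))
```

With overloads shaped like this, a type checker would be expected to accept the List[Tuple[str, int]] plus StructType call through the second overload, while the first no longer partially matches a StructType schema.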