Comments (11)
Great! Thanks for the thorough explanation. I'll doubtless have more questions for you as I work on this.
from spark-connect-rs.
Let’s update the README with the “done/open” status and add any missing sections as well. Then let’s create one issue per core class that should, and feasibly can, be implemented.
Some things, like UDFs, would not be feasible because of how the remote Spark cluster deserializes and runs UDFs. Things like “toPandas” might become “toPolars” instead.
I’m still on the fence about translating an Arrow RecordBatch into the Spark “row” representation when using collect. I’m open to suggestions!
Hi @sjrusso8, can I take this issue? I would like more details about what "reviewing and matching" the documentation involves. For example, let's take DataFrame. I see in the PySpark docs that DataFrame already has a set of existing docs. Do you want the Rust docs to be word for word the same?
Thanks for reaching out! You can take this on if you want.
Here is what I was thinking; it's mostly just two parts. First, update the existing README and Rust docs so that they roughly match the Spark docs that you linked. As you mentioned for the DataFrame object, each of the DataFrame methods should have some documentation similar to the existing Spark docs. For widely used methods like select, filter, sort, etc., we should probably provide a small example as well. Updating the README would just mean marking the various sections open or closed correctly, and even adding new sections if you think it makes sense.
Second, identify where parts of the core classes are missing so that we can make a longer issue tracker and start building out those areas. For instance, DataFrame currently does not implement methods like approxQuantile, checkpoint, observe, etc. Classes like DataFrameNaFunctions and DataFrameStatFunctions are also not implemented.
This issue will be a way to create a cleaner roadmap of work to be completed :) I can also help with anything on this. I have slowly been building a list of gaps, too.
starting now!
@sjrusso8 do you want me to create new issue trackers for every pyspark core class?
Follow-up question: I am seeing unresolved paths in rust-analyzer, such as spark::relation::RelType. Looking into the spark subdirectory, those paths do indeed seem to be missing. How do I find them?
Did you refresh the git submodule? That is probably caused by the build step under “/core”, which points to the Spark Connect protobuf located in the submodule. Make sure the submodule is checked out at the tag for 3.5.1 and not “main”.
I checked out version 3.5.1. The git submodule is up to date. Taking a look at the build step now.
yep, works now!
quick bump here @sjrusso8