sjrusso8 / spark-connect-rs

Apache Spark Connect Client for Rust

Home Page: https://docs.rs/spark-connect-rs

License: Apache License 2.0

Languages: Rust 99.22%, Shell 0.78%
Topics: grpc-client, spark, spark-connect, spark-sql

spark-connect-rs's Introduction

Apache Spark Connect Client for Rust

This project houses the experimental client for Spark Connect for Apache Spark written in Rust.

Current State of the Project

Currently, the Spark Connect client for Rust is highly experimental and should not be used in any production setting. It is currently a "proof of concept" to identify the methods of interacting with a Spark cluster from Rust.

The spark-connect-rs crate aims to provide an entrypoint to Spark Connect and to offer DataFrame API interactions similar to the existing Spark APIs.

Project Layout

├── core       <- core implementation in Rust
│   └─ spark   <- git submodule for apache/spark
├── rust       <- shim for 'spark-connect-rs' from core
├── examples   <- examples of using different aspects of the crate
├── datasets   <- sample files from the main spark repo

A future state would be to have additional bindings for other languages alongside the top-level rust folder.

Getting Started

This section explains how to run Spark Connect for Rust locally, starting from scratch.

Step 1: Install rust via rustup: https://www.rust-lang.org/tools/install

Step 2: Ensure you have cmake and protobuf installed on your machine

Step 3: Run the following commands to clone the repo

git clone https://github.com/sjrusso8/spark-connect-rs.git
git submodule update --init --recursive

cargo build

Step 4: Set up the Spark Driver on localhost, either by downloading Spark or with Docker.

With local spark:

  1. Download a Spark distribution (3.5.1 recommended) and unzip the package.

  2. Set your SPARK_HOME environment variable to the location where Spark was extracted.

  3. Start the Spark Connect server with the following command (make sure to use a package version that matches your Spark distribution):

$ $SPARK_HOME/sbin/start-connect-server.sh --packages "org.apache.spark:spark-connect_2.12:3.5.1,io.delta:delta-spark_2.12:3.0.0" \
      --conf "spark.driver.extraJavaOptions=-Divy.cache.dir=/tmp -Divy.home=/tmp" \
      --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
      --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"

With docker:

  1. Start the Spark Connect server by leveraging the docker-compose.yml created in this repo. This will start a Spark Connect server running on port 15002:
$ docker compose up --build -d

Step 5: Run an example from the repo under /examples
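
If you would rather start from a fresh crate, a minimal client program looks roughly like the following (adapted from the example further down in the issues section; the connection string targets the server from Step 4, and exact method signatures may differ between crate versions):

use spark_connect_rs::{SparkSession, SparkSessionBuilder};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Connect to the Spark Connect server started in Step 4 (port 15002).
    let spark: SparkSession =
        SparkSessionBuilder::remote("sc://127.0.0.1:15002/;user_id=example_rs")
            .build()
            .await?;

    // Build a small DataFrame on the cluster and pull the result back to the client.
    spark
        .clone()
        .range(None, 10, 1, Some(1))
        .collect()
        .await
        .unwrap();

    Ok(())
}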

Features

The following section outlines some of the larger functionality that is not yet working with this Spark Connect implementation.

  • done TLS authentication & Databricks compatibility via the feature flag feature = 'tls'
  • open StreamingQueryManager
  • open UDFs or any type of functionality that takes a closure (foreach, foreachBatch, etc.)

SparkSession

Spark Session type object and its implemented traits

SparkSession API Comment
active open
addArtifact(s) open
addTag done
clearTags done
copyFromLocalToFs open
createDataFrame partial Partial. Only works for RecordBatch
getActiveSessions open
getTags done
interruptAll done
interruptOperation done
interruptTag done
newSession open
range done
removeTag done
sql done
stop open
table done
catalog done Catalog
client done unstable developer api for testing only
conf done Conf
read done DataFrameReader
readStream done DataStreamReader
streams open Streams
udf open Udf - may not be possible
udtf open Udtf - may not be possible
version done

SparkSessionBuilder

SparkSessionBuilder API Comment
appName done
config done
master open
remote partial Validate using spark connection string

StreamingQueryManager

StreamingQueryManager API Comment
awaitAnyTermination open
get open
resetTerminated open
active open

StreamingQuery

StreamingQuery API Comment
awaitTermination done
exception done
explain done
processAllAvailable done
stop done
id done
isActive done
lastProgress done
name done
recentProgress done
runId done
status done

DataStreamReader

DataStreamReader API Comment
csv open
format done
json open
load done
option done
options done
orc open
parquet open
schema done
table open
text open

DataFrameReader

DataFrameReader API Comment
csv open
format done
json open
load done
option done
options done
orc open
parquet open
schema done
table done
text open

DataStreamWriter

Start a streaming job and return a StreamingQuery object to handle the stream operations.

DataStreamWriter API Comment
foreach open
foreachBatch open
format done
option done
options done
outputMode done Uses an Enum for OutputMode
partitionBy done
queryName done
start done
toTable done
trigger done Uses an Enum for TriggerMode

StreamingQueryListener

StreamingQueryListener API Comment
onQueryIdle open
onQueryProgress open
onQueryStarted open
onQueryTerminated open

UdfRegistration (may not be possible)

UDFRegistration API Comment
register open
registerJavaFunction open
registerJavaUDAF open

UdtfRegistration (may not be possible)

UDTFRegistration API Comment
register open

RuntimeConfig

RuntimeConfig API Comment
get done
isModifiable done
set done
unset done

Catalog

Catalog API Comment
cacheTable done
clearCache done
createExternalTable open
createTable open
currentCatalog done
currentDatabase done
databaseExists done
dropGlobalTempView done
dropTempView done
functionExists done
getDatabase done
getFunction done
getTable done
isCached done
listCatalogs done
listDatabases done
listFunctions done
listTables done
recoverPartitions done
refreshByPath done
refreshTable done
registerFunction open
setCurrentCatalog done
setCurrentDatabase done
tableExists done
uncacheTable done

DataFrame

Spark DataFrame type object and its implemented traits.

DataFrame API Comment
agg done
alias done
approxQuantile open
cache done
checkpoint open
coalesce done
colRegex done
collect done
columns done
corr done
count done
cov done
createGlobalTempView done
createOrReplaceGlobalTempView done
createOrReplaceTempView done
createTempView done
crossJoin done
crosstab done
cube done
describe done
distinct done
drop done
dropDuplicates done
dropDuplicatesWithinWatermark open Windowing functions are currently in progress
drop_duplicates done
dropna done
dtypes done
exceptAll done
explain done
fillna open
filter done
first done
foreach open
foreachPartition open
freqItems done
groupBy done
head done
hint done
inputFiles done
intersect done
intersectAll done
isEmpty done
isLocal open
isStreaming done
join done
limit done
localCheckpoint open
mapInPandas open TBD on this exact implementation
mapInArrow open TBD on this exact implementation
melt done
na open
observe open
offset done
orderBy done
persist done
printSchema done
randomSplit open
registerTempTable open
repartition done
repartitionByRange open
replace open
rollup done
sameSemantics done
sample done
sampleBy open
schema done
select done
selectExpr done
semanticHash done
show done
sort done
sortWithinPartitions done
sparkSession done
stat done
storageLevel done
subtract done
summary open
tail done
take done
to done
toDF done
toJSON partial Does not return an RDD but a long JSON formatted String
toLocalIterator open
toPandas to_polars & toPolars partial Convert to a polars::frame::DataFrame
new to_datafusion & toDataFusion done Convert to a datafusion::dataframe::DataFrame
transform done
union done
unionAll done
unionByName done
unpersist done
unpivot done
where done Use filter instead; where is a keyword in Rust
withColumn done
withColumns done
withColumnRenamed open
withColumnsRenamed done
withMetadata open
withWatermark open
write done
writeStream done
writeTo done

DataFrameWriter

Spark Connect should respect the format as long as your cluster supports the specified type and has the required jars

DataFrameWriter API Comment
bucketBy done
csv
format done
insertInto done
jdbc
json
mode done
option done
options done
orc
parquet
partitionBy
save done
saveAsTable done
sortBy done
text

DataFrameWriterV2

DataFrameWriterV2 API Comment
append done
create done
createOrReplace done
option done
options done
overwrite done
overwritePartitions done
partitionedBy done
replace done
tableProperty done
using done

Column

Spark Column type object and its implemented traits

Column API Comment
alias done
asc done
asc_nulls_first done
asc_nulls_last done
astype open
between open
cast done
contains done
desc done
desc_nulls_first done
desc_nulls_last done
dropFields done
endswith done
eqNullSafe open
getField open This is deprecated but will need to be implemented
getItem open This is deprecated but will need to be implemented
ilike done
isNotNull done
isNull done
isin done
like done
name done
otherwise open
over done Refer to Window for creating window specifications
rlike done
startswith done
substr done
when open
withField done
eq == done Rust does not like when you try to overload == and return something other than a bool. Column equality is currently implemented like col('name').eq(col('id')). Not the best, but it works for now
addition + done
subtraction - done
multiplication * done
division / done
OR | done
AND & done
XOR ^ done
Negate ~ done

Functions

Only a few of the functions are covered by unit tests.

Functions API Comment
abs done
acos done
acosh done
add_months done
aggregate open
approxCountDistinct open
approx_count_distinct done
array done
array_append done
array_compact done
array_contains open
array_distinct done
array_except done
array_insert open
array_intersect done
array_join open
array_max done
array_min done
array_position done
array_remove done
array_repeat done
array_sort open
array_union done
arrays_overlap open
arrays_zip done
asc done
asc_nulls_first done
asc_nulls_last done
ascii done
asin done
asinh done
assert_true open
atan done
atan2 done
atanh done
avg done
base64 done
bin done
bit_length done
bitwiseNOT open
bitwise_not done
broadcast open
bround open
bucket open
call_udf open
cbrt done
ceil done
coalesce done
col done
collect_list done
collect_set done
column done
concat done
concat_ws open
conv open
corr open
cos open
cosh open
cot open
count open
countDistinct open
count_distinct open
covar_pop done
covar_samp done
crc32 done
create_map done
csc done
cume_dist done
current_date done
current_timestamp done
date_add done
date_format open
date_sub done
date_trunc open
datediff done
dayofmonth done
dayofweek done
dayofyear done
days done
decode open
degrees done
dense_rank done
desc done
desc_nulls_first done
desc_nulls_last done
element_at open
encode open
exists open
exp done
explode done
explode_outer done
expm1 done
expr done
factorial done
filter open
first open
flatten done
floor done
forall open
format_number open
format_string open
from_csv open
from_json open
from_unixtime open
from_utc_timestamp open
functools open
get open
get_active_spark_context open
get_json_object open
greatest done
grouping done
grouping_id open
has_numpy open
hash done
hex done
hour done
hours done
hypot open
initcap done
inline done
inline_outer done
input_file_name done
inspect open
instr open
isnan done
isnull done
json_tuple open
kurtosis done
lag open
last open
last_day open
lead open
least done
length done
levenshtein open
lit done
localtimestamp done
locate open
log done
log10 done
log1p done
log2 done
lower done
lpad open
ltrim done
make_date open
map_concat done
map_contains_key open
map_entries done
map_filter open
map_from_arrays open
map_from_entries done
map_keys done
map_values done
map_zip_with open
max done
max_by open
md5 done
mean done
median done
min done
min_by open
minute done
mode open
monotonically_increasing_id done
month done
months done
months_between open
nanvl done
next_day open
np open
nth_value open
ntile done
octet_length done
overlay open
overload open
pandas_udf open
percent_rank done
percentile_approx open
pmod open
posexplode done
posexplode_outer done
pow done
product done
quarter done
radians done
raise_error open
rand done
randn done
rank done
regexp_extract open
regexp_replace open
repeat open
reverse done
rint done
round done
row_number done
rpad open
rtrim done
schema_of_csv open
schema_of_json open
sec done
second done
sentences open
sequence open
session_window open
sha1 done
sha2 open
shiftLeft open
shiftRight open
shiftRightUnsigned open
shiftleft open
shiftright open
shiftrightunsigned open
shuffle done
signum done
sin done
sinh done
size done
skewness done
slice open
sort_array open
soundex done
spark_partition_id done
split open
sqrt done
stddev done
stddev_pop done
stddev_samp done
struct open
substring open
substring_index open
sum done
sumDistinct open
sum_distinct open
sys open
tan done
tanh done
timestamp_seconds done
toDegrees open
toRadians open
to_csv open
to_date open
to_json open
to_str open
to_timestamp open
to_utc_timestamp open
transform open
transform_keys open
transform_values open
translate open
trim done
trunc open
try_remote_functions open
udf open
unbase64 done
unhex done
unix_timestamp open
unwrap_udt open
upper done
var_pop done
var_samp done
variance done
warnings open
weekofyear done
when open
window open
window_time open
xxhash64 done
year done
years done
zip_with open

Data Types

Data types are used for creating schemas and for casting columns to specific types

Data Type API Comment
ArrayType done
BinaryType done
BooleanType done
ByteType done
DateType done
DecimalType done
DoubleType done
FloatType done
IntegerType done
LongType done
MapType done
NullType done
ShortType done
StringType done
CharType done
VarcharType done
StructField done
StructType done
TimestampType done
TimestampNTZType done
DayTimeIntervalType done
YearMonthIntervalType done

Literal Types

Create Spark literal types from these Rust types. E.g., lit(1_i64) would be a LongType() in the schema.

An array can be created with lit([1_i16, 2_i16, 3_i16]), which results in an ArrayType(Short) since all the values of the slice can be translated into a literal type. A short sketch follows the table below.

Spark Literal Type Rust Type Status
Null open
Binary &[u8] done
Boolean bool done
Byte open
Short i16 done
Integer i32 done
Long i64 done
Float f32 done
Double f64 done
Decimal open
String &str / String done
Date chrono::NaiveDate done
Timestamp chrono::DateTime<Tz> done
TimestampNtz chrono::NaiveDateTime done
CalendarInterval open
YearMonthInterval open
DayTimeInterval open
Array slice / Vec done
Map Create with the function create_map done
Struct Create with the function struct_col or named_struct done
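
A short hedged sketch of creating literals from these Rust types (the import path for lit is an assumption; the resulting literal types follow the table above):

use spark_connect_rs::functions::lit;   // assumed module path

fn literal_examples() {
    let _long = lit(1_i64);                 // LongType
    let _flag = lit(true);                  // BooleanType
    let _arr = lit([1_i16, 2_i16, 3_i16]);  // ArrayType(Short)
}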

Window & WindowSpec

For ease of use it's recommended to use Window to create the WindowSpec.

Window API Comment
currentRow done
orderBy done
partitionBy done
rangeBetween done
rowsBetween done
unboundedFollowing done
unboundedPreceding done
WindowSpec.orderBy done
WindowSpec.partitionBy done
WindowSpec.rangeBetween done
WindowSpec.rowsBetween done

spark-connect-rs's People

Contributors

abrassel, hntd187, sjrusso8


spark-connect-rs's Issues

Bug: Endpoint uses https scheme when use_ssl is false

I already have a fix in mind for this issue, and I can make a PR if this issue is reviewed and approved, thanks!

Description

The current spark-connect-rs library configures all endpoints to use the https scheme. When the tls feature is enabled, a connection cannot be made to a server without TLS configured, for example when connecting to a Spark cluster set up at localhost.

Expected Behavior

When the tls feature is enabled in the spark-connect-rs crate, connections to servers with and without TLS configured should both be successful.

Proposed Fix

Default the endpoint scheme to http, and set it to https only when use_ssl=true is specified in the connection string.
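
A minimal sketch of that behavior, assuming the connection-string options are already parsed into a map (the builder internals here are illustrative, not the crate's actual code):

fn endpoint_scheme(conn_params: &std::collections::HashMap<String, String>) -> &'static str {
    // Only use https when the connection string explicitly sets use_ssl=true.
    match conn_params.get("use_ssl").map(String::as_str) {
        Some("true") => "https",
        _ => "http",
    }
}

With this, sc://localhost:15002 without use_ssl would resolve to an http endpoint even when the tls feature is compiled in.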

Related Issue

Epic: Spark 4.0 Connect Spec

Description

Spark 4.0 implements changes to the connect proto. We will need to analyze the spec and identify what has changed.

Additionally, we will need to support clients for both Spark 3.5 and Spark 4.0. There should be a feature flag on the client for 3_5 or 4_0.
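
One hedged sketch of how a build script could gate the vendored proto version behind Cargo features (the feature names, directory layout, and proto path are assumptions; depending on the tonic-build version the call may be compile_protos instead of compile):

// build.rs
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Cargo exposes enabled features to build scripts as CARGO_FEATURE_* env vars.
    let proto_dir = if std::env::var("CARGO_FEATURE_SPARK_4_0").is_ok() {
        "core/spark4_0"
    } else {
        "core/spark3_5"
    };

    tonic_build::configure()
        .build_server(false)
        .compile(
            &[format!("{proto_dir}/spark/connect/base.proto")],
            &[proto_dir],
        )?;

    Ok(())
}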

Check example datasets into source control so they're easier to run

The examples currently use paths that are for the Docker workflow.

It would be cool if the examples could also work with non-Docker setups (e.g. when I manually spin up Spark Connect on localhost).

Perhaps we can check all of those data files into this repo, so these examples can work out of the box with both Docker and a localhost Spark Connect server.

Help getting spark-connect-rs running locally

Commands I ran (I am a total n00b):

  • install rust via rustup
  • Created a new crate: cargo new spark-connect-gist
  • edited Cargo.toml to include the spark-connect-rs and tokio dependencies:
[dependencies]
spark-connect-rs = "0.0.1-beta.3"
tokio = { version = "1", features = ["full"] }

Here is the error message:

   Compiling spark-connect-gist v0.1.0 (/Users/matthew.powers/Documents/code/my_apps/spark-connect-gist)
    Finished dev [unoptimized + debuginfo] target(s) in 1.00s
     Running `target/debug/spark-connect-gist`
Error: tonic::transport::Error(Transport, hyper::Error(Connect, ConnectError("tcp connect error", Os { code: 61, kind: ConnectionRefused, message: "Connection refused" })))

Implement File Format Reader/Writer

Description

Implement the initial methods to read and write .csv, .json, .orc, .parquet, and .text.

Consider creating a ConfigOpts trait for each of those file formats and having a custom struct represent the options for each file type; a sketch of this idea follows the examples below.

Example with CSV

Create the options and modify the opts object. The object is then passed into the method.

let mut opts = CsvOptions::new();

opts.header = true;
opts.delimiter = b'|';

let df = spark.read().csv(path, opts);

An example of what the function signature might look like:

impl DataFrameReader {
    // ...
    pub fn csv<C: ConfigOpts>(self, path: &str, opts: Option<C>) -> DataFrame {
        // ...
    }
}
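
A hedged sketch of what the ConfigOpts idea might look like: each format gets an options struct that can flatten itself into the string key/value options Spark Connect expects (all names here are illustrative, not an existing API).

use std::collections::HashMap;

pub trait ConfigOpts {
    // Flatten the typed options into the key/value pairs passed to the reader/writer.
    fn to_options(&self) -> HashMap<String, String>;
}

#[derive(Default)]
pub struct CsvOptions {
    pub header: bool,
    pub delimiter: u8,
}

impl ConfigOpts for CsvOptions {
    fn to_options(&self) -> HashMap<String, String> {
        HashMap::from([
            ("header".to_string(), self.header.to_string()),
            ("delimiter".to_string(), (self.delimiter as char).to_string()),
        ])
    }
}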

Collect fails on large results

The collect() call panics when returning a large result. The arrow-ipc StreamReader is not parsing the data correctly.

Example:

use spark_connect_rs;

use spark_connect_rs::{SparkSession, SparkSessionBuilder};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut spark: SparkSession =
        SparkSessionBuilder::remote("sc://127.0.0.1:15002/;user_id=example_rs")
            .build()
            .await?;

    spark
        .clone()
        .range(None, 100000, 1, Some(1))
        .collect()
        .await
        .unwrap();

     Ok(())
}

This results in a panic

thread 'main' panicked at /home/sjrusso/Documents/code/projects/rust-projects/spark-connect-rs/src/session.rs:191:30:
called `Result::unwrap()` on an `Err` value: IpcError("Not expecting a schema when messages are read")
stack backtrace:
   0: rust_begin_unwind
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/panicking.rs:645:5
   1: core::panicking::panic_fmt
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/panicking.rs:72:14
   2: core::result::unwrap_failed
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/result.rs:1649:5
   3: core::result::Result<T,E>::unwrap
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/result.rs:1073:23
   4: spark_connect_rs::session::SparkSession::consume_plan::{{closure}}
             at ./src/session.rs:191:12
   5: spark_connect_rs::dataframe::DataFrame::collect::{{closure}}
             at ./src/dataframe.rs:99:14
   6: sql::main::{{closure}}
             at ./examples/sql.rs:21:10
   7: tokio::runtime::park::CachedParkThread::block_on::{{closure}}
             at /home/sjrusso/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/park.rs:282:63
   8: tokio::runtime::coop::with_budget
             at /home/sjrusso/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/coop.rs:107:5
   9: tokio::runtime::coop::budget
             at /home/sjrusso/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/coop.rs:73:5
  10: tokio::runtime::park::CachedParkThread::block_on
             at /home/sjrusso/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/park.rs:282:31
  11: tokio::runtime::context::blocking::BlockingRegionGuard::block_on
             at /home/sjrusso/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/context/blocking.rs:66:9
  12: tokio::runtime::scheduler::multi_thread::MultiThread::block_on::{{closure}}
             at /home/sjrusso/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/scheduler/multi_thread/mod.rs:87:13
  13: tokio::runtime::context::runtime::enter_runtime
             at /home/sjrusso/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/context/runtime.rs:65:16
  14: tokio::runtime::scheduler::multi_thread::MultiThread::block_on
             at /home/sjrusso/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/scheduler/multi_thread/mod.rs:86:9
  15: tokio::runtime::runtime::Runtime::block_on
             at /home/sjrusso/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/runtime.rs:349:45
  16: sql::main
             at ./examples/sql.rs:46:5
  17: core::ops::function::FnOnce::call_once
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

Feature: bindings for server side JS/TS using napi-rs

Description

Create bindings similar to the Rust ones but available in server-side JS (Node, Deno, Bun, ...). The SDK should closely resemble the Rust one and only deviate when necessary due to napi limitations or when something is unidiomatic in JS.

napi.rs seems to be a good crate to leverage and is relatively easy to use.

Early Experiment

The branch feat/napi contains a super quick pass at creating the bindings. The experiment only covers these areas:

  1. Create a remote SparkSession
  2. Create a dataframe with .sql
  3. Modify the dataframe with select, and filter
  4. Perform "action" with count()
  5. Perform "action" with show()

There is a lot of use of clone() and some not-so-great implementations to create a new empty dataframe to satisfy the napi requirements. The polars JS interop is a good example of how the bindings might function.

Feature: Position/Keyword Args with SQL

Description

Implement the ability to use positional/keyword args with sql. Because of the differences between Python and Rust, the function arguments need to be clearly defined.

The PySpark sql allows literals and DataFrames to be passed in one argument. However, Rust probably won't take too kindly to that input arg. If a user passes in a DataFrame, it will need to be handled with a SubqueryAlias; if it's a literal, it will be passed in as either a positional or a keyword argument.

We might want to only allow two input parameters, both of which are options. Something like this?

sql<T: ToLiteral>(self, sql_query: &str, col_args: Option<HashMap<String, T>>, df_args: Option<HashMap<String, DataFrame>>) -> DataFrame 

This could allow a user to do these variations.

spark.sql("SELECT * FROM table", None, None).await?;
let df = spark.range(...);

// create the hashmap

spark.sql("SELECT * FROM {df}" None, Some(df_hashmap)).await?;
let col = "name";

// create the hashmap

spark.sql("SELECT {col} FROM {df}", Some(col_hashmap), Some(df_hashmap)).await?;

Or should positional SQL be a completely different method altogether, like sql_params, so that a user doesn't need to fuss with adding None twice to all of their sql statements?

Feature: Investigate WASM/WASI targets

Description

Being able to compile the Rust bindings to wasm32-unknown-unknown and/or wasm32-wasi would be interesting. This could allow some interesting interactions between a browser and Spark. A wasm32-wasi target would allow Spark programs to run on any runtime.

Early Experiments

A feature flag under the core bindings for wasm already exists and compiles successfully to the targets mentioned above. The issue arises when trying to send an HTTP/1.1 request with grpc-web to the Spark Connect server, which only accepts normal HTTP/2 gRPC requests. There are ways of standing up a proxy server with Envoy to forward the gRPC browser request to the backend server, but that feels like a lot of effort to ask of the client.

The branch feat/wasm contains the early experiment of trying to run wasm with wasmtime. Issues arise with using async code in wasm. There is probably a way to code it correctly, but I don't have time to finish the experiment.

Remove Git Submodule and Copy the Proto into the Repo

Description

Git submodules are annoying, and the current submodule is only ever checked out at a specific release tag. It would be easier to just have a folder containing the copied protobufs.

I think it might look something like this in the repo:

├── core          <- core implementation in Rust
│   ├─ spark4_0   <- protobuf for spark 4.0.0
│   └─ spark3_5   <- protobuf for spark 3.5.1

Related to #61

#[allow(non_snake_case)]

Hi @sjrusso8, can you tell me the rationale behind exposing a camel-cased API?

Seems like we should be fine just using idiomatic rust snake casing.

On a separate note, do you have an IM channel of some kind to chat about this project? I would love to iterate more rapidly with you.

Epic: Implement Missing Spark Functions

Description

This issue will be the organizing issue for all of the remaining Spark functions that still need to be implemented as methods.

Based on the README, here is the list:

  • aggregate
  • array_contains
  • array_insert
  • array_join
  • array_sort
  • arrays_overlap
  • assert_true
  • broadcast
  • bround
  • bucket
  • concat_ws
  • conv
  • corr
  • cos
  • cosh
  • cot
  • count
  • count_distinct
  • date_format
  • date_trunc
  • decode
  • element_at
  • encode
  • exists
  • filter
  • first
  • forall
  • format_number
  • format_string
  • from_csv
  • from_json
  • from_unixtime
  • from_utc_timestamp
  • get
  • get_json_object
  • grouping_id
  • hypot
  • inspect
  • instr
  • json_tuple
  • lag
  • last_day
  • lead
  • levenshtein
  • locate
  • lpad
  • make_date
  • map_contains_keys
  • map_filter
  • map_from_arrays
  • map_from_entries
  • map_zip_with
  • max_by
  • min_by
  • mode
  • months_between
  • next_day
  • np
  • nth_value
  • octet_length
  • overlay
  • overload
  • percentile_approx
  • pmod
  • raise_error
  • regexp_extract
  • regexp_replace
  • repeat
  • rpad
  • schema_of_csv
  • schema_of_json
  • sentences
  • sequences
  • session_window
  • sha2
  • shiftLeft
  • shiftRight
  • shiftRightUnsigned
  • slice
  • sort_array
  • split
  • substring
  • sum_distinct
  • sys
  • to_degrees
  • to_radians
  • to_csv
  • to_date
  • to_json
  • to_str
  • to_timestamp
  • to_utc_timestamp
  • transform
  • transform_keys
  • transform_values
  • translate
  • trunc
  • unix_timestamp
  • when
  • window
  • window_time
  • zip_with

Write unit test(s) for Spark functions

Description

There are many functions for Spark, and most of them are created via a macro. However, not all of them have unit test coverage. Create additional unit tests based on similar test conditions from the existing Spark API test cases.

I have been mirroring the docstring tests from the PySpark API for reference.

Deadlock: concurrent cloned spark sessions

Hi, thanks a lot for this nice crate!

I'd like to report a deadlock issue when a spark session is cloned and used concurrently.

#46 demonstrates a possible workflow leading to a deadlock.

The gist is, every place where #[allow(clippy::await_holding_lock)] is used poses a possibility of a deadlock when a Spark session is cloned and used concurrently.

-> When a task is suspended while holding a lock, another task will wait for the lock to be released without yielding to the executor.
-> This is a very common "dangerous" asynchronous programming pattern that should always be avoided at all costs.

Therefore, I would suggest that we either remove the clone method from SparkSession or replace the RwLock with an asynchronous lock. What do you think?
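
For illustration, a minimal sketch of the async-lock alternative (the struct and lock contents are placeholders, not the crate's actual internals):

use std::sync::Arc;
use tokio::sync::RwLock;

// With std::sync::RwLock, holding the write guard across an .await point means a
// suspended task keeps the lock while another task blocks the executor thread
// waiting for it, which is the deadlock described above. tokio::sync::RwLock yields instead.
#[derive(Clone)]
struct SharedClient {
    inner: Arc<RwLock<u32>>, // placeholder for the real gRPC client state
}

impl SharedClient {
    async fn do_work(&self) {
        // `write().await` suspends without blocking the thread, so concurrently
        // cloned sessions cannot wedge the runtime on this lock.
        let mut guard = self.inner.write().await;
        *guard += 1;
    }
}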

Epic: Core Client Kernel

Description

Lots of other programming languages can access gRPC and Arrow, which means many different languages will be recreating the core client logic to handle the client requests and response handling.

The idea is that there could be one kernel client that all other programming languages use, and then each specific language is left to implement the specifics of the core Spark objects.

What does this involve?

  1. Isolate the existing client.rs into core and move all other Rust-specific implementations into rust
  2. Remove any dependencies in client.rs that are only for the rust library
  3. Update error handling in client.rs to create a new ConnectClientError error type (it currently leverages SparkError)
  4. lots more research and analysis into the feasibility of this kernel :)

Implement `DataFrameWriterV2`

Description

The initial DataFrameWriter is created, but there is also another way to write data via DataFrameWriterV2. This method has a slightly different implementation and leverages a different proto command message.

I think the methods should mirror the ones found in the Spark API guide, and a new writeTo method should be added onto the DataFrame.
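
For context, a hedged sketch of how that entrypoint might read in use (method names follow the DataFrameWriterV2 table earlier in this README; the exact signatures and async/Result shapes are assumptions):

use spark_connect_rs::dataframe::DataFrame;

async fn write_v2_example(df: DataFrame) -> Result<(), Box<dyn std::error::Error>> {
    // writeTo returns the V2 writer, which is then configured and executed.
    df.writeTo("spark_catalog.default.example_table")
        .using("delta")
        .createOrReplace()
        .await?;
    Ok(())
}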

Cleanup Documentation - Spark Core Classes

Description

The overall documentation needs to be reviewed and matched against the Spark Core Classes and Functions. For instance, the README should accurately reflect which functions and methods are currently implemented compared to the existing Spark API.

However, there are probably a few misses that are either currently implemented but marked as open, or were accidentally excluded. We might consider adding a few sections for other classes like StreamingQueryManager, DataFrameNaFunctions, DataFrameStatFunctions, etc.
