rajasekarv / vega
A new, arguably faster, implementation of Apache Spark from scratch in Rust
License: Apache License 2.0
Similarly to #51, do likewise for the Azure storage solution: implement a read/write connector. If possible, try to follow a common interface/pattern.
As Spark uses RDDs under the hood, would it be possible, and would it make sense, to use native_spark as the backend for the official Java Spark version?
After some talk we have decided to take a careful gradual approach to integrate async into the library.
Adding asynchronous computation is a large departure from the reference Spark implementation, and may change how we do certain things or what is possible (like certain optimizations that rely on stack allocation in our case) in ways that are not yet clear.
Therefore, it is preferred to take a gradual approach as we explore the design space and evolve the library. The original work can be seen at #67; some work done in that preliminary PR will be ported to the main branch, and further steps will be taken to make testing and comparing both versions easy while we experiment.
Meanwhile an async branch will be maintained and kept in sync with the master branch.
Hi
I read in your post (https://medium.com/@rajasekar3eg/fastspark-a-new-fast-native-implementation-of-spark-from-scratch-368373a29a5c) that you wanted to draw inspiration from pandas for implementing DataFrames (and their API). You could consider basing your implementation on Koalas (https://github.com/databricks/koalas).
regards
Hi,
I was wondering why you use capnp instead of a Rust library that would be easier to install?
I feel like the benefits don't outweigh the installation overhead.
This issue can be mentored for anyone who may want to help.
While this has improved, we still have a whole lot of unwrapping around. Since we are still in a very early phase, there is no need to go overboard on this, as many things will (probably) change several times. That said, while in some places panics and aborts should definitely happen when something goes wrong, better error handling is good to have for the sake of traceability, even if the whole application ends up crashing.
In particular, we should inventory exactly where the boundaries between executor-run code and driver-run code lie. Crashes in user code and executors should be caught, reported to, and gracefully handled by the driver, which should then follow a clear plan of action depending on the error (e.g. one action if the executor detects a problem in one of its threads while running code, another if it dies for some other reason, etc.).
The first task should be to inventory all the call sites where it is necessary to take action (only a fraction of all the unwraps, really) and then extend/modify methods to return proper Result types, which can then be used to shut down, signal drivers, clean up, etc.
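The pattern described above can be sketched roughly as follows. All names here (`ExecutorError`, `run_task`) are hypothetical illustrations, not the actual crate API: the point is replacing `unwrap()` at the executor/driver boundary with a shared error type the driver can act on.

```rust
// Hypothetical sketch: a shared error type instead of unwrap() at the
// executor/driver boundary, so the driver can pick a recovery strategy
// per error variant instead of aborting blindly.
use std::fmt;

#[derive(Debug, PartialEq)]
enum ExecutorError {
    TaskPanic(String),
    ConnectionLost(String),
}

impl fmt::Display for ExecutorError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ExecutorError::TaskPanic(msg) => write!(f, "task panicked: {}", msg),
            ExecutorError::ConnectionLost(addr) => write!(f, "lost connection to {}", addr),
        }
    }
}

// Instead of `fn run_task(...) -> T` with internal unwraps, return a Result.
fn run_task(input: i32) -> Result<i32, ExecutorError> {
    if input < 0 {
        return Err(ExecutorError::TaskPanic(format!("bad input {}", input)));
    }
    Ok(input * 2)
}

fn main() {
    match run_task(-1) {
        Ok(v) => println!("result: {}", v),
        // The driver logs and reacts instead of crashing without a trace.
        Err(e) => eprintln!("driver saw: {}", e),
    }
}
```

Even if the application still ends up shutting down, the driver gets a chance to clean up and report the failure first.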
This is a tracking issue for the roadmap to a potential 1.0 release of the core crate/package.
There are a couple of examples that won't work because the necessary data is not available:
We should either provide a small sample with the tests or change the tests with some fake data so those examples can be executed.
Directly using Kubernetes scheduling integrates nicely with cloud providers and also saves you code to maintain.
Everything is in the title; I understand that the project is young and needs time to get faster than Spark.
I'm just asking about the current state, out of curiosity.
related to tracking issue #55
Ctrl-C handling, proper destruction of resources in case of panic, and removal of the explicit drop-executor logic. Instead of cloning the Context as we do currently, create a single context, wrap it inside a reference count, and move the resource-destruction logic (deleting all temp files and closing all spawned processes) inside the Drop trait.
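The ownership model proposed above can be sketched like this. The `Context` type here is a stand-in, not the real one; the point is that with one `Arc`-wrapped context, cleanup runs exactly once in `Drop`, even on early exit or panic unwinding.

```rust
// Illustrative sketch: a single ref-counted Context whose Drop impl owns
// the cleanup logic, replacing explicit drop-executor calls.
use std::sync::Arc;

struct Context {
    temp_dir: String,
}

impl Drop for Context {
    fn drop(&mut self) {
        // The real implementation would delete temp files and terminate
        // spawned executor processes here.
        println!("cleaning up {}", self.temp_dir);
    }
}

fn main() {
    let ctx = Arc::new(Context { temp_dir: "/tmp/ns-session".into() });
    // Cheap handles instead of deep clones of the whole Context.
    let worker_handle = Arc::clone(&ctx);
    drop(worker_handle);
    // Drop::drop runs exactly once, when the last Arc goes away.
    drop(ctx);
}
```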
As discussed on Gitter, while developing union I found a problem where the application enters a deadlock while resolving the partitioning or computation of a DAG. The working branch is: https://github.com/iduartgomez/native_spark/tree/dev
The error is reproducible executing:
#[test]
fn test_error() {
    let sc = CONTEXT.clone();
    let join = || {
        let col1 = vec![
            (1, ("A".to_string(), "B".to_string())),
            (2, ("C".to_string(), "D".to_string())),
            (3, ("E".to_string(), "F".to_string())),
            (4, ("G".to_string(), "H".to_string())),
        ];
        let col1 = sc.parallelize(col1, 4);
        let col2 = vec![
            (1, "A1".to_string()),
            (1, "A2".to_string()),
            (2, "B1".to_string()),
            (2, "B2".to_string()),
            (3, "C1".to_string()),
            (3, "C2".to_string()),
        ];
        let col2 = sc.parallelize(col2, 4);
        col2.join(col1.clone(), 4)
    };
    let join1 = join();
    let join2 = join();
    let res = join1.union(join2).unwrap().collect().unwrap();
    assert_eq!(res.len(), 12);
}
Inside some executor there is a thread panic here:
let mut stream_r = std::io::BufReader::new(&mut stream);
let message_reader = serialize_packed::read_message(&mut stream_r, r).unwrap()
We currently have:
local_scheduler.rs
distributed_scheduler.rs
base_scheduler.rs
dag_scheduler.rs
The local and distributed schedulers still have some duplicated code (basically the event loop and run job), which could be factored into a common trait (or pulled inside the impl_common_scheduler_funcs macro, too).
Then we currently have the base_scheduler (or NativeScheduler trait), which should be merged with dag_scheduler and made clearer. Initially, NativeScheduler was created to hide the implementation from the public API (DAGScheduler can in theory be implemented by the user, but first a clear API should be found); NativeScheduler should implement DAGScheduler if we decide to go down this path. Then DAGScheduler should be required by the context, which would be generic over it (I guess).
Nothing too pressing, but we must do some cleanup around all this eventually.
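One possible factoring of the duplication described above is sketched below. The trait shape and method names are hypothetical (they only echo the real `NativeScheduler` name): the shared event-loop/run-job logic becomes a default method, while local and distributed schedulers implement only the parts that differ.

```rust
// Hypothetical sketch: shared scheduler logic as trait default methods,
// so local_scheduler.rs and distributed_scheduler.rs stop duplicating it.
trait NativeScheduler {
    // Only this differs between local and distributed execution.
    fn submit_task(&self, task_id: usize) -> String;

    // Shared "run job" logic, written once as a default method.
    fn run_job(&self, tasks: &[usize]) -> Vec<String> {
        tasks.iter().map(|&t| self.submit_task(t)).collect()
    }
}

struct LocalScheduler;
struct DistributedScheduler;

impl NativeScheduler for LocalScheduler {
    fn submit_task(&self, task_id: usize) -> String {
        format!("local:{}", task_id)
    }
}

impl NativeScheduler for DistributedScheduler {
    fn submit_task(&self, task_id: usize) -> String {
        format!("remote:{}", task_id)
    }
}

fn main() {
    let local = LocalScheduler;
    println!("{:?}", local.run_job(&[1, 2]));
}
```

The same effect can be had with the existing `impl_common_scheduler_funcs` macro; a trait just makes the shared surface explicit and testable.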
Hi, I just read about Datafusion:
https://github.com/apache/arrow/tree/master/rust/datafusion
Would the SQL query planning, etc. be helpful for native_spark?
ZippedRdd
related to tracking issue #55
Even if we are not publishing to crates.io yet, it would be nice to have the cargo doc documentation generated and uploaded to a branch here somewhere, so we can reference it in the documentation/readme.
Right now, when there is a panic inside an executor, the process is left open indefinitely (at least in local mode) and does not shut down; the only way to terminate it is by sending SIGKILL to the master.
OS: Linux
Architecture: x86_64
Replication: Just write "assert!(false)" inside a map function to be executed (MapRDD).
You can use the toolchain file to specify the nightly version.
See TiKV, another project using nightly Rust, for an example.
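For reference, a minimal sketch of what such a file looks like (the pinned date here is just an example taken from the report below): a `rust-toolchain` file at the crate root containing only the channel name, which rustup and cargo pick up automatically.

```text
nightly-2019-09-11
```

With that file in place, a plain `cargo build` uses the pinned nightly without needing `cargo +nightly-2019-09-11` on every invocation.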
I created a test_native_spark project and copied the code in make_rdd.rs to the project's main.rs.
Then I ran the project using cargo +nightly-2019-09-11 run and got some errors:
thread 'main' panicked at 'Unable to open the file: Os { code: 2, kind: NotFound, message: "No such file or directory" }', src/libcore/result.rs:1165:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace.
How can I solve this?
Can I join the collaboration? I am eager to join and I think I can contribute.
Right now the way we are doing configuration is a bit lacklustre: we are using clap to parse many of the configuration parameters, passing them as command line arguments. This creates a problem in user-created applications, where it will collide with their own command line arguments.
Similarly, this already collides with cargo's own optional parameters; for example, something like cargo test -- --test-threads=1 will fail.
We must provide a more elegant and ergonomic way to pass configuration parameters that does not collide with user (or generated, e.g. cargo) code. A first approach is to add/revamp the configuration file we are already using (hosts.conf) to include more configuration parameters, which we would eventually have to do anyway. Additionally, centralize all the environment variable configuration management (under env.rs) on initialization and document it, so the user can use those to set up any required parameters.
Also, for local execution and testing, many of the defaults could be provided (e.g. NS_LOCAL_IP) so they don't have to be supplied either by env variable or argument parameter (e.g. Spark itself assigns a free local IP if necessary when executing in local mode).
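The centralized env-based configuration with defaults could look roughly like this. Only `NS_LOCAL_IP` is an actual variable mentioned here; the function name and the loopback default are assumptions for illustration.

```rust
// Hypothetical sketch of centralized configuration (as in an env.rs module):
// each parameter is read from the environment once, with a local-mode default,
// so nothing has to be passed on the command line.
use std::env;

fn local_ip() -> String {
    // Default to loopback for local mode, similar to how Spark assigns a
    // local IP when one is not provided.
    env::var("NS_LOCAL_IP").unwrap_or_else(|_| "127.0.0.1".to_string())
}

fn main() {
    println!("using local ip: {}", local_ip());
}
```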
See, for example, how Elastic can be integrated with the original Apache Spark. Apart from Elastic, integration with Logstash might be useful too; see e.g. this example of setting up Kafka, Spark and Logstash.
Since there are Rust alternatives to both Logstash and Elastic, it might make sense to integrate with them too:
errorddeMacBook-Pro:native_spark d$ cargo build
Compiling native_spark v0.1.0 (/Users/d/Work/opensource/native_spark)
Compiling bincode v1.2.0
Compiling serde_closure v0.2.7
Compiling rustc_version v0.2.3
error: failed to run custom build command for native_spark v0.1.0 (/Users/d/Work/opensource/native_spark)
Caused by:
process didn't exit successfully: /Users/d/Work/opensource/native_spark/target/debug/build/native_spark-3382f7e3c05897a6/build-script-build
(exit code: 101)
--- stderr
thread 'main' panicked at 'capnpc compiling issue: Error { kind: Failed, description: "Error while trying to execute capnp compile: Failed: No such file or directory (os error 2). Please verify that version 0.5.2 or higher of the capnp executable is installed on your system. See https://capnproto.org/install.html" }', src/libcore/result.rs:1165:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace.
warning: build failed, waiting for other jobs to finish...
error: build failed
I believe it will be useful for other applications too.
Implement a connector to read/write from/to AWS S3. For inspiration, maybe look at the HDFS FS interface. If possible, try to come up with a common interface we could reuse for other cloud providers (and potentially any "fs-like" source).
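The "common fs-like interface" idea could be sketched as a trait that every backend (HDFS, S3, Azure) implements. The trait name, its methods, and the in-memory stand-in below are all hypothetical; a real S3 connector would wrap an S3 client behind the same surface.

```rust
// Hypothetical sketch of a shared storage-connector interface.
use std::cell::RefCell;
use std::collections::HashMap;
use std::io;

trait ObjectStore {
    fn read(&self, path: &str) -> io::Result<Vec<u8>>;
    fn write(&self, path: &str, data: &[u8]) -> io::Result<()>;
}

// In-memory stand-in; a real S3/Azure/HDFS connector would implement the
// same trait over its client library.
struct MemStore {
    objects: RefCell<HashMap<String, Vec<u8>>>,
}

impl ObjectStore for MemStore {
    fn read(&self, path: &str) -> io::Result<Vec<u8>> {
        self.objects
            .borrow()
            .get(path)
            .cloned()
            .ok_or_else(|| io::Error::new(io::ErrorKind::NotFound, path.to_string()))
    }

    fn write(&self, path: &str, data: &[u8]) -> io::Result<()> {
        self.objects.borrow_mut().insert(path.to_string(), data.to_vec());
        Ok(())
    }
}

fn main() -> io::Result<()> {
    let store = MemStore { objects: RefCell::new(HashMap::new()) };
    store.write("bucket/key", b"hello")?;
    assert_eq!(store.read("bucket/key")?, b"hello");
    Ok(())
}
```

RDD-level readers and writers could then be generic over `ObjectStore`, so adding a new cloud provider means implementing one trait rather than a new connector API.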
I downloaded the source and, trying to build with clippy, got quite a few lint warnings and, notably, some lint errors like: clone on double reference, drop on Copy types, etc.
If there is no special reason why clippy is not being run, I would advise running it and cleaning up the code (especially the error lints). I am up for doing a PR to clean it up if you wish.
Having command-by-command documentation, either within or linked from the readme, on how to set up a cluster and run an example would make this project more approachable.
Basically what's in the comments for #11 but extended to cover distributed mode.
By core RDD ops we mean those which, in the original Apache Spark, stem from SparkContext and/or the base RDD class and friends:
SC:
Non-goals for this tracking issue are any I/O related ops as we are tracking those elsewhere and doing things a little bit differently: