rajasekarv / vega
A new, arguably faster, implementation of Apache Spark from scratch in Rust
License: Apache License 2.0
Similarly to #51, do likewise for the Azure storage solution: implement a read/write connector. If possible, try to follow a common interface/pattern.
As Spark uses RDDs under the hood, would it be possible, and would it make sense, to use native_spark as the backend for the official Java Spark version?
After some talk we have decided to take a careful gradual approach to integrate async into the library.
Adding asynchronous computation is a large departure from the reference Spark implementation, and may change how we do certain things or what is possible (like certain optimizations that rely on stack allocation in our case) in ways that are not yet clear.
Therefore, it is preferred to take a gradual approach as we explore the design space and evolve the library. The original work can be seen at #67; some work done in that preliminary PR will be ported to the main branch, and further steps will be taken to make testing and comparing both versions easy while we experiment.
Meanwhile an async branch will be maintained and kept in sync with the master branch.
Hi
I read in your post (https://medium.com/@rajasekar3eg/fastspark-a-new-fast-native-implementation-of-spark-from-scratch-368373a29a5c) that you wanted to draw inspiration from pandas for implementing DataFrames (and their API). You could consider basing your implementation on Koalas (https://github.com/databricks/koalas).
regards
Hi,
I was wondering why you use capnp instead of a Rust library that would be easier to install?
I feel like the benefits don't outweigh the installation overhead.
This issue can be mentored for anyone who may want to help.
While this has improved, we still have a whole lot of unwrapping around. Since we are still in a very early phase, there is no need to go overboard on this, as many things will (probably) change several times. That said, while in some places panics and aborts should definitely happen when something goes wrong, better error handling is good to have for the sake of traceability, even if the whole application ends up crashing.
In particular, we should inventory exactly where the boundaries between executor-run code and driver-run code lie. Crashes in user code and executors should be caught, reported to, and gracefully handled by the driver, which should then follow a clear plan of action depending on the error (e.g. one action if the executor detects a problem in one of its threads while running code, another if it dies for some other reason, etc.).
The first task should be to inventory all the call sites where it is necessary to take action (only a fraction of all the unwraps, really) and then extend/modify methods to return proper Result types, which can then be used to shut down, signal drivers, clean up, etc.
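The pattern described above can be sketched roughly as follows. All names here (`ExecutorError`, `run_task`) are hypothetical illustrations, not the actual crate API: the point is replacing `unwrap()` at the executor/driver boundary with a shared error type the driver can act on.

```rust
// Hypothetical sketch: a shared error type instead of unwrap() at the
// executor/driver boundary, so the driver can pick a recovery strategy
// per error variant instead of aborting blindly.
use std::fmt;

#[derive(Debug, PartialEq)]
enum ExecutorError {
    TaskPanic(String),
    ConnectionLost(String),
}

impl fmt::Display for ExecutorError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ExecutorError::TaskPanic(msg) => write!(f, "task panicked: {}", msg),
            ExecutorError::ConnectionLost(addr) => write!(f, "lost connection to {}", addr),
        }
    }
}

// Instead of `fn run_task(...) -> T` with internal unwraps, return a Result.
fn run_task(input: i32) -> Result<i32, ExecutorError> {
    if input < 0 {
        return Err(ExecutorError::TaskPanic(format!("bad input {}", input)));
    }
    Ok(input * 2)
}

fn main() {
    match run_task(-1) {
        Ok(v) => println!("result: {}", v),
        // The driver logs and reacts instead of crashing without a trace.
        Err(e) => eprintln!("driver saw: {}", e),
    }
}
```

Even if the application still ends up shutting down, the driver gets a chance to clean up and report the failure first.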
This is a tracking issue for the roadmap to a potential 1.0 release of the core crate/package.
There are a couple of examples that won't work because the necessary data is not available:
We should either provide a small sample with the tests or change the tests with some fake data so those examples can be executed.
Directly using Kubernetes scheduling integrates nicely with cloud providers and also saves you code to maintain.
Everything is in the title; I understand that the project is young and needs time to get faster than Spark.
I'm just asking about the current state, out of curiosity.
related to tracking issue #55
Ctrl-C handling, proper destruction of resources in case of panic, and removal of the explicit drop-executor logic. Instead of cloning the Context as we do currently, create a single context, wrap it inside a reference count, and move the resource-destruction logic (deleting all temp files and closing all spawned processes) inside the Drop trait.
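The ownership model proposed above can be sketched like this. The `Context` type here is a stand-in, not the real one; the point is that with one `Arc`-wrapped context, cleanup runs exactly once in `Drop`, even on early exit or panic unwinding.

```rust
// Illustrative sketch: a single ref-counted Context whose Drop impl owns
// the cleanup logic, replacing explicit drop-executor calls.
use std::sync::Arc;

struct Context {
    temp_dir: String,
}

impl Drop for Context {
    fn drop(&mut self) {
        // The real implementation would delete temp files and terminate
        // spawned executor processes here.
        println!("cleaning up {}", self.temp_dir);
    }
}

fn main() {
    let ctx = Arc::new(Context { temp_dir: "/tmp/ns-session".into() });
    // Cheap handles instead of deep clones of the whole Context.
    let worker_handle = Arc::clone(&ctx);
    drop(worker_handle);
    // Drop::drop runs exactly once, when the last Arc goes away.
    drop(ctx);
}
```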
As discussed on Gitter, while developing union I found a problem where the application enters a deadlock while resolving the partitioning or computation of a DAG. The working branch is: https://github.com/iduartgomez/native_spark/tree/dev
The error is reproducible executing:
#[test]
fn test_error() {
    let sc = CONTEXT.clone();
    let join = || {
        let col1 = vec![
            (1, ("A".to_string(), "B".to_string())),
            (2, ("C".to_string(), "D".to_string())),
            (3, ("E".to_string(), "F".to_string())),
            (4, ("G".to_string(), "H".to_string())),
        ];
        let col1 = sc.parallelize(col1, 4);
        let col2 = vec![
            (1, "A1".to_string()),
            (1, "A2".to_string()),
            (2, "B1".to_string()),
            (2, "B2".to_string()),
            (3, "C1".to_string()),
            (3, "C2".to_string()),
        ];
        let col2 = sc.parallelize(col2, 4);
        col2.join(col1.clone(), 4)
    };
    let join1 = join();
    let join2 = join();
    let res = join1.union(join2).unwrap().collect().unwrap();
    assert_eq!(res.len(), 12);
}
Inside some executor there is a thread panic here:
let mut stream_r = std::io::BufReader::new(&mut stream);
let message_reader = serialize_packed::read_message(&mut stream_r, r).unwrap()
We currently have:
local_scheduler.rs
distributed_scheduler.rs
base_scheduler.rs
dag_scheduler.rs
The local and distributed schedulers still have some duplicated code (basically the event loop and run job), which could be factored into a common trait (or pulled inside the impl_common_scheduler_funcs macro, too).
Then we currently have the base_scheduler (or NativeScheduler trait), which should be merged with dag_scheduler and made clearer. Initially, NativeScheduler was created to hide the implementation from the public API (DAGScheduler can in theory be implemented by the user, but first a clear API should be found); NativeScheduler should implement DAGScheduler if we decide to go down this path. Then DAGScheduler should be required by the context, which would be generic over it (I guess).
Nothing too pressing, but we must do some cleanup around all this eventually.
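One possible factoring of the duplication described above is sketched below. The trait shape and method names are hypothetical (they only echo the real `NativeScheduler` name): the shared event-loop/run-job logic becomes a default method, while local and distributed schedulers implement only the parts that differ.

```rust
// Hypothetical sketch: shared scheduler logic as trait default methods,
// so local_scheduler.rs and distributed_scheduler.rs stop duplicating it.
trait NativeScheduler {
    // Only this differs between local and distributed execution.
    fn submit_task(&self, task_id: usize) -> String;

    // Shared "run job" logic, written once as a default method.
    fn run_job(&self, tasks: &[usize]) -> Vec<String> {
        tasks.iter().map(|&t| self.submit_task(t)).collect()
    }
}

struct LocalScheduler;
struct DistributedScheduler;

impl NativeScheduler for LocalScheduler {
    fn submit_task(&self, task_id: usize) -> String {
        format!("local:{}", task_id)
    }
}

impl NativeScheduler for DistributedScheduler {
    fn submit_task(&self, task_id: usize) -> String {
        format!("remote:{}", task_id)
    }
}

fn main() {
    let local = LocalScheduler;
    println!("{:?}", local.run_job(&[1, 2]));
}
```

The same effect can be had with the existing `impl_common_scheduler_funcs` macro; a trait just makes the shared surface explicit and testable.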
Hi, I just read about Datafusion:
https://github.com/apache/arrow/tree/master/rust/datafusion
Would the SQL query planning, etc. be helpful for native_spark?
ZippedRdd
related to tracking issue #55
Even if we are not publishing to crates.io yet, it would be nice to have the cargo doc documentation generated and uploaded to a branch here somewhere, so we can reference it in the documentation/readme.
Right now, when there is a panic inside an executor, the process is left open indefinitely (at least in local mode) and does not shut down; the only way to terminate it is by sending SIGKILL to the master.
OS: Linux
Architecture: x86_64
Replication: Just write "assert!(false)" inside a map function to be executed (MapRDD).
You can use the toolchain file to specify the nightly version.
See TiKV, another project using nightly Rust, for an example.
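For reference, a minimal sketch of what such a file looks like (the pinned date here is just an example taken from the report below): a `rust-toolchain` file at the crate root containing only the channel name, which rustup and cargo pick up automatically.

```text
nightly-2019-09-11
```

With that file in place, a plain `cargo build` uses the pinned nightly without needing `cargo +nightly-2019-09-11` on every invocation.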
I created a test_native_spark project and copied the code in make_rdd.rs to the project's main.rs.
Then I ran the project using cargo +nightly-2019-09-11 run and got some errors:
thread 'main' panicked at 'Unable to open the file: Os { code: 2, kind: NotFound, message: "No such file or directory" }', src/libcore/result.rs:1165:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace.
How can I solve this?
Can I join the collaboration? I am eager to join and I think I can contribute.
Right now the way we are doing configuration is a bit lacklustre: we are using clap to parse many of the configuration parameters, passing them as command line arguments. This creates a problem in user-created applications, where it will collide with their own command line arguments.
Similarly, this already collides with cargo's own optional parameters; for example, something like cargo test -- --test-threads=1 will fail.
We must provide a more elegant and ergonomic way to pass configuration parameters that does not collide with user (or generated, e.g. cargo) code. A first approach is to add/revamp the configuration file we are already using (hosts.conf) to include more configuration parameters, which we would eventually have to do anyway. Additionally, centralize all the environment variable configuration management (under env.rs) on initialization and document it, so the user can use those to set up any required parameters.
Also, for local execution and testing, many of the defaults could be provided (e.g. NS_LOCAL_IP) so they don't have to be supplied either by env variable or argument parameter (e.g. Spark itself assigns a free local IP if necessary when executing in local mode).
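The centralized env-based configuration with defaults could look roughly like this. Only `NS_LOCAL_IP` is an actual variable mentioned here; the function name and the loopback default are assumptions for illustration.

```rust
// Hypothetical sketch of centralized configuration (as in an env.rs module):
// each parameter is read from the environment once, with a local-mode default,
// so nothing has to be passed on the command line.
use std::env;

fn local_ip() -> String {
    // Default to loopback for local mode, similar to how Spark assigns a
    // local IP when one is not provided.
    env::var("NS_LOCAL_IP").unwrap_or_else(|_| "127.0.0.1".to_string())
}

fn main() {
    println!("using local ip: {}", local_ip());
}
```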
See, for example, how Elastic can be integrated with the original Apache Spark. Apart from Elastic, integration with Logstash might be useful too; see e.g. this example of setting up Kafka, Spark and Logstash.
Since there are Rust alternatives to both Logstash and Elastic, it might make sense to integrate with them too:
errorddeMacBook-Pro:native_spark d$ cargo build
Compiling native_spark v0.1.0 (/Users/d/Work/opensource/native_spark)
Compiling bincode v1.2.0
Compiling serde_closure v0.2.7
Compiling rustc_version v0.2.3
error: failed to run custom build command for native_spark v0.1.0 (/Users/d/Work/opensource/native_spark)
Caused by:
process didn't exit successfully: /Users/d/Work/opensource/native_spark/target/debug/build/native_spark-3382f7e3c05897a6/build-script-build
(exit code: 101)
--- stderr
thread 'main' panicked at 'capnpc compiling issue: Error { kind: Failed, description: "Error while trying to execute capnp compile: Failed: No such file or directory (os error 2). Please verify that version 0.5.2 or higher of the capnp executable is installed on your system. See https://capnproto.org/install.html" }', src/libcore/result.rs:1165:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace.
warning: build failed, waiting for other jobs to finish...
error: build failed
I believe it will be useful for other applications too.
Implement a connector to read/write from/to AWS S3. For inspiration, maybe look at the HDFS FS interface. If possible, try to come up with a common interface we could reuse for other cloud providers (and potentially any "fs-like" source).
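The "common fs-like interface" idea could be sketched as a trait that every backend (HDFS, S3, Azure) implements. The trait name, its methods, and the in-memory stand-in below are all hypothetical; a real S3 connector would wrap an S3 client behind the same surface.

```rust
// Hypothetical sketch of a shared storage-connector interface.
use std::cell::RefCell;
use std::collections::HashMap;
use std::io;

trait ObjectStore {
    fn read(&self, path: &str) -> io::Result<Vec<u8>>;
    fn write(&self, path: &str, data: &[u8]) -> io::Result<()>;
}

// In-memory stand-in; a real S3/Azure/HDFS connector would implement the
// same trait over its client library.
struct MemStore {
    objects: RefCell<HashMap<String, Vec<u8>>>,
}

impl ObjectStore for MemStore {
    fn read(&self, path: &str) -> io::Result<Vec<u8>> {
        self.objects
            .borrow()
            .get(path)
            .cloned()
            .ok_or_else(|| io::Error::new(io::ErrorKind::NotFound, path.to_string()))
    }

    fn write(&self, path: &str, data: &[u8]) -> io::Result<()> {
        self.objects.borrow_mut().insert(path.to_string(), data.to_vec());
        Ok(())
    }
}

fn main() -> io::Result<()> {
    let store = MemStore { objects: RefCell::new(HashMap::new()) };
    store.write("bucket/key", b"hello")?;
    assert_eq!(store.read("bucket/key")?, b"hello");
    Ok(())
}
```

RDD-level readers and writers could then be generic over `ObjectStore`, so adding a new cloud provider means implementing one trait rather than a new connector API.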
I downloaded the source and, trying to build with clippy, got quite a few lint warnings and, notably, some lint errors like: clone on double reference, drop on Copy types, etc.
If there is no special reason why clippy is not being run, I would advise running it and cleaning up the code (especially the error lints). I am up for doing a PR to clean it up if you wish.
Having command-by-command documentation, either within or linked from the readme, on how to set up a cluster and run an example would make this project more approachable.
Basically what's in the comments for #11 but extended to cover distributed mode.
By core RDD ops we mean those which, in the original Apache Spark, stem from SparkContext and/or the base RDD class and friends:
SC:
Non-goals for this tracking issue are any I/O related ops as we are tracking those elsewhere and doing things a little bit differently: