codingjaguar / spark

This project forked from apache/spark


Mirror of Apache Spark

License: Apache License 2.0

Shell 0.72% Batchfile 0.13% R 2.65% Makefile 0.04% C 0.01% Scala 78.11% Java 9.51% JavaScript 0.35% CSS 0.08% Python 8.37% Thrift 0.01% Groff 0.03%

spark's Introduction

Apache Spark

Spark is a fast and general cluster computing system for Big Data. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools including Spark SQL for SQL and DataFrames, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for stream processing.

http://spark.apache.org/

Online Documentation

You can find the latest Spark documentation, including a programming guide, on the project web page and project wiki. This README file only contains basic setup instructions.

Building Spark

Spark is built using Apache Maven. To build Spark and its example programs, run:

build/mvn -DskipTests clean package

(You do not need to do this if you downloaded a pre-built package.) More detailed documentation is available from the project site, at "Building Spark". For developing Spark using an IDE, see Eclipse and IntelliJ.

Interactive Scala Shell

The easiest way to start using Spark is through the Scala shell:

./bin/spark-shell

Try the following command, which should return 1000:

scala> sc.parallelize(1 to 1000).count()
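
As a small illustrative follow-up (not part of the official README), you can also chain transformations before an action; counting the even numbers in the same range should return 500:

scala> sc.parallelize(1 to 1000).filter(_ % 2 == 0).count()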

Interactive Python Shell

Alternatively, if you prefer Python, you can use the Python shell:

./bin/pyspark

And run the following command, which should also return 1000:

>>> sc.parallelize(range(1000)).count()

Example Programs

Spark also comes with several sample programs in the examples directory. To run one of them, use ./bin/run-example <class> [params]. For example:

./bin/run-example SparkPi

will run the Pi example locally.

You can set the MASTER environment variable when running examples to submit examples to a cluster. This can be a mesos:// or spark:// URL, "yarn" to run on YARN, "local" to run locally with one thread, or "local[N]" to run locally with N threads. You can also use an abbreviated class name if the class is in the examples package. For instance:

MASTER=spark://host:7077 ./bin/run-example SparkPi

Many of the example programs print usage help if no params are given.

Running Tests

Testing first requires building Spark. Once Spark is built, tests can be run using:

./dev/run-tests

Please see the guidance on how to run tests for a module, or individual tests.

A Note About Hadoop Versions

Spark uses the Hadoop core library to talk to HDFS and other Hadoop-supported storage systems. Because the protocols have changed in different versions of Hadoop, you must build Spark against the same version that your cluster runs.

Please refer to the build documentation at "Specifying the Hadoop Version" for detailed guidance on building for a particular distribution of Hadoop, including building for particular Hive and Hive Thriftserver distributions.

Configuration

Please refer to the Configuration Guide in the online documentation for an overview on how to configure Spark.

spark's People

Contributors

mateiz, rxin, pwendell, tdas, joshrosen, mengxr, liancheng, ankurdave, marmbrus, zsxwing, jegonzal, srowen, yhuai, scrapcodes, shivaram, cloud-fan, aarondav, viirya, andrewor14, chenghao-intel, kayousterhout, jkbradley, sarutak, yanboliang, jerryshao, sryza, holdenk, scwf, karenfeng, dennybritz

Watchers

scope.wind

spark's Issues

Dec 4 Cache Plan

Set-level matching for Filter

Match filter expressions regardless of the order of their predicates (see the sketch after the checklist below).

  • Code changes
  • Add test
  • Style/Test cases check
  • Jira
  • Pull request
  • Contact committer for auto test
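
As a rough illustration of the set-level idea, here is a minimal, self-contained sketch; it uses hypothetical helper names and works on predicate strings rather than the actual Catalyst expression trees. A conjunctive filter condition is split into a set of predicates, so two filters match regardless of the order in which their predicates were written:

object FilterSetMatching {
  // Naively split a conjunctive condition into its individual predicates.
  // The real implementation would operate on Catalyst Expression trees.
  def predicateSet(condition: String): Set[String] =
    condition.split("(?i)\\bAND\\b").map(_.trim).toSet

  // Two filters match when they contain the same predicates,
  // regardless of the order in which they appear.
  def sameFilter(a: String, b: String): Boolean =
    predicateSet(a) == predicateSet(b)

  def main(args: Array[String]): Unit = {
    val f1 = "age > 30 AND city = 'NYC'"
    val f2 = "city = 'NYC' AND age > 30"
    println(sameFilter(f1, f2)) // prints true
  }
}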

final stage

  1. Poster
  2. Reporter
  3. Baidu data evaluation
  4. Synthetic data evaluation
  5. Distributed data evaluation
  6. Set of filter expression: bug fix
  7. PR: set
  8. Projection set
  9. Filter merge
  10. PR: projection + filter

Meeting note 10/18

  1. generate a synthetic workload
  2. initial implementation:
    1. How to index the cache? Serialize the list of tuples [(column, predicate), …]
    2. How to store the cache items? Use the cache() provided by SparkSQL (see the sketch after this list)
    3. When executing a query, the cache planner asks the logical planner for all the tables, columns, and predicates applied to them, then passes a list of key-value pairs to the cache manager. The cache manager is responsible for inserting callbacks into Spark so that the intermediate results are materialized.
  3. next implementation:
    1. cache the joined tables
  4. Workflow analysis
    1. First use the cache planner to analyze the whole workflow
    2. The cache planner then starts the SparkSQL plans sequentially.
  5. RDD collector:
    1. The cache manager needs to collect data, possibly from Spark
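
A minimal sketch of the idea, assuming the Spark 1.x SQLContext API (sql, registerTempTable, cacheTable all exist there); the CacheKey structure and the table names are hypothetical illustrations, not existing SparkSQL components:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical cache key: a table plus the (column, predicate) pairs applied to it.
case class CacheKey(table: String, predicates: Set[(String, String)])

case class Person(name: String, age: Int)

object CacheManagerSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cache-sketch").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Register a small example table.
    sc.parallelize(Seq(Person("a", 25), Person("b", 40))).toDF().registerTempTable("people")

    // The key the cache manager would use to look up this intermediate result.
    val key = CacheKey("people", Set(("age", "> 30")))

    // Store the intermediate result using the cache() support provided by SparkSQL,
    // so it is materialized on first use and reused by later queries.
    sqlContext.sql("SELECT * FROM people WHERE age > 30").registerTempTable("people_age_gt_30")
    sqlContext.cacheTable("people_age_gt_30")

    println(sqlContext.table("people_age_gt_30").count()) // prints 1
  }
}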

phase 1 plan (Oct 18 - Oct 28)

Test preparation

Generate test cases for future use, and work through difficulties with DevOps.

  • Finish generating sample data
  • Investigate TPC-C and try to use its test cases
  • Write sample iterative test cases
  • Benchmark the original SparkSQL

Developing first prototype

  • Read the SparkSQL code and learn how to add a cache planner
  • Read the Spark code and learn how to collect data from Spark
  • Design the interfaces of CachePlanner and CacheManager
  • Code the prototype
  • Unit tests
  • Test cases on the benchmark

Yang's Plan:

  • Learn Scala in one hour
  • Finish the test preparation steps

Jiang's Plan

  • Read Spark & SparkSQL code

Cache replace strategy

Dilemma 1
For this dilemma:

  1. We cache table a, table b, and table a.join.b
  2. a and b are much smaller than a.join.b; which cache replacement strategy should be used?

Dilemma 2
For this dilemma:

  1. Table a.join.b is cached, while table a and table b are on disk
  2. Table a and table b can be filtered down substantially by the predicates, while a.join.b is very large
    Which should be chosen?

A solution to both of these dilemmas is a sophisticated cost estimator.
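
A rough, purely illustrative sketch of such an estimator follows; the Candidate fields, the numbers, and the greedy policy are hypothetical and not an existing SparkSQL component. It scores each cache candidate by the recomputation time it is expected to save per byte of memory, then keeps the highest-scoring candidates that fit the memory budget:

// Hypothetical cost model for choosing what to cache.
case class Candidate(name: String, sizeBytes: Long, recomputeCostMs: Long, hitFrequency: Double)

object CostEstimator {
  // Benefit per byte: expected recomputation time saved, divided by memory used.
  def score(c: Candidate): Double =
    (c.recomputeCostMs * c.hitFrequency) / c.sizeBytes.toDouble

  // Greedily keep the highest-scoring candidates that fit in the budget.
  def choose(candidates: Seq[Candidate], budgetBytes: Long): Seq[Candidate] = {
    var remaining = budgetBytes
    candidates.sortBy(c => -score(c)).filter { c =>
      val fits = c.sizeBytes <= remaining
      if (fits) remaining -= c.sizeBytes
      fits
    }
  }

  def main(args: Array[String]): Unit = {
    val a      = Candidate("a",        100L << 20,  2000, 0.9)
    val b      = Candidate("b",        120L << 20,  2500, 0.9)
    val aJoinB = Candidate("a.join.b", 900L << 20, 15000, 0.5)
    // With a 1 GB budget, the small base tables win over the large join result.
    choose(Seq(a, aJoinB, b), 1024L << 20).foreach(c => println(c.name))
  }
}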

Proposal of the project

  • Finish the draft version
  • Discussion with the professor on 9/30, 12:40 - 13:00
  • Post-meeting discussion
    • Timeline
    • Steps
    • Working hours
  • Finish the final version
