Coder Social home page Coder Social logo

spark-ceph-example's Introduction

Spark & Ceph

This repo contains a collection of the files for launching a Ceph cluster and learning how to integrate Ceph S3 with Spark.

Prerequisites:

Create a Ceph cluster

To create a VM using vagrant with VirtualBox. When the operation is completed, execute vagrant ssh <name> command to enter VM:

$ vagrant up
$ vagrant ssh ceph-aio

Spark s3a example

This guide is talking about Spark get data from S3, you can follow below steps to produce an environment used to testing.

First, go to /vagrant dir and run:

$ cd /vagrant
$ ./spark-initial.sh
$ jps
22194 SecondaryNameNode
22437 Jps
21350 ResourceManager
21469 NodeManager
21998 DataNode
21822 NameNode

And then put a data file to S3 using s3client:

$ ./create-s3-account.sh
$ source test-key.sh
$ ./s3client create files
$ cat <<EOF > words.txt
AA A CC AA DD AA EE AE AE
CC DD AA EE CC AA AA DD AA EE AE AE
EOF

$ ./s3client upload files words.txt /
Upload [words.txt] success ...

Now execute spark-shell to type the follow step:

$ spark-shell --master yarn
scala> val textFile = sc.textFile("s3a://files/words.txt")
textFile: org.apache.spark.rdd.RDD[String] = s3a://files/words.txt MapPartitionsRDD[1] at textFile at <console>:24

scala> val counts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[4] at reduceByKey at <console>:26

scala> counts.saveAsTextFile("s3a://files/output")
[Stage 0:>                                                          (0 + 2) / 2]

Use s3client command to list output files:

$ ./s3client list files
---------- [files] ----------
output/_SUCCESS     	0                   	2017-03-28T17:06:59.775Z
output/part-00000   	35                  	2017-03-28T17:06:58.413Z
output/part-00001   	6                   	2017-03-28T17:06:59.056Z
words.txt           	44                  	2017-03-28T17:04:33.141Z

You also can use hdfs command to list files and display content:

$ hadoop fs -ls s3a://files/output
Found 3 items
-rw-rw-rw-   1          0 2017-03-28 17:06 s3a://files/output/_SUCCESS
-rw-rw-rw-   1         35 2017-03-28 17:06 s3a://files/output/part-00000
-rw-rw-rw-   1          6 2017-03-28 17:06 s3a://files/output/part-00001

$ hadoop fs -cat s3a://files/output/part-00000
(DD,2)
(EE,2)
(AE,2)
(CC,3)
(AA,5)

spark-ceph-example's People

Contributors

kairen avatar

Stargazers

 avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar

Forkers

ys610zz

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.