
henridf / apache-spark-node

Node.js bindings for Apache Spark DataFrame APIs

Home Page: https://henridf.github.io/apache-spark-node

License: Apache License 2.0

Languages: JavaScript 97.31%, Scala 2.47%, Batchfile 0.13%, Shell 0.10%
Topics: data-frame, node, spark

apache-spark-node's People

Contributors

doron2402, henridf, tobilg


apache-spark-node's Issues

Adding external JARs

I was trying to access the spark-csv package, but was unsuccessful with either of the following:

a) Adding it to the java.classpath like

java.classpath.push("/tmp/jarCache/spark-csv_2.10-1.3.0.jar");

b) Specifying the --jar parameter in the args for 'NodeSparkSubmit'

Is there a suggested way to do this that I am missing?
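
For reference, this is roughly the invocation I expected to work (a sketch only; it assumes the args array passed to spark() is forwarded to spark-submit, as in the other usage examples here):

var spark = require('apache-spark-node');

// Hypothetical: pass --jars through the args array instead of pushing onto java.classpath
var context = spark(
    ["--jars", "/tmp/jarCache/spark-csv_2.10-1.3.0.jar"],
    process.env.ASSEMBLY_JAR
);
var sqlContext = context.sqlContext;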

unionAll not working?

Hi there; perhaps I'm misunderstanding how this should work, but shouldn't the code below simply create a union of the two data sets and output the count? Instead, I'm getting a java.lang.NullPointerException on the unionAll call.

var spark = require('apache-spark-node');
var context = spark([], process.env.ASSEMBLY_JAR);
var sqlContext = context.sqlContext;
var sqlFunctions = context.sqlFunctions;

var data1 = [
  {"name":"Michael"},
  {"name":"Andy", "age":30},
  {"name":"Justin", "age": 19}
];

var data2 = [
  {"name":"Fiona"},
  {"name":"Jenny", "age":30},
  {"name":"Sandra", "age":30},
  {"name":"Alexa", "age": 19}
];

var df1 = sqlContext.createDataFrame(data1);
var df2 = sqlContext.createDataFrame(data2);

console.log(df1.countSync());
console.log(df2.countSync());

var u = df1.unionAll(df2);
console.log(u.countSync());
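
I'm also not sure whether I should be calling a different variant of unionAll; for example (a sketch only, assuming the usual node-java convention of generated Sync and callback pairs, which I have not verified for this method):

// Assumed synchronous variant, following the countSync naming pattern
var u2 = df1.unionAllSync(df2);
console.log(u2.countSync());

// Assumed callback variant
df1.unionAll(df2, function(err, u3) {
    if (err) return console.error(err);
    console.log(u3.countSync());
});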

Querying a cassandra DB via spark

Hey there,

As the title says, I am trying to query an existing Cassandra DB from Node.js using your library. I am using a Spark cluster on a LAN.

Here's what I have done so far, using:

  • CentOS 7
  • node 4.4.4
  • [email protected]
  • spark 1.6.1
  • cassandra 2.2.5
  • spark-cassandra-connector 1.6.0-M1

From the root of my project:

ASSEMBLY_JAR=/usr/share/spark/lib/spark-assembly-1.6.1-hadoop2.6.0.jar node_modules/apache-spark-node/bin/spark-node \
--master spark://192.168.1.101:7077 --conf spark.cores.max=4 \
--jars /root/spark-cassandra-connector/spark-cassandra-connector/target/scala-2.10/spark-cassandra-connector-assembly-1.6.0-M1-36-g220aa37.jar

Once I have access to the command line, I tried:

spark-node> sqlContext.sql("Select count(*) from mykeyspace.mytable")

but of course I get:

Error: Error creating class
org.apache.spark.sql.AnalysisException: Table not found: `mykeyspace`.`mytable`;

I then tried to adapt a snippet of Scala I'd seen in a Stack Overflow post:

var df = sqlContext
  .read()
  .format("org.apache.spark.sql.cassandra")
  .option("table", "mytable")
  .option("keyspace", "mykeyspace")
  .load(null, function(err, res) { console.log(err); console.log(res) }) 

but all I get is:

Error: Error running instance method
java.lang.ClassNotFoundException: Failed to find data source: org.apache.spark.sql.cassandra. Please find packages at http://spark-packages.org

The problem surely comes from the fact that I don't understand half of how everything is linked together, which is why I'm asking for help here. All I need is a way to execute basic SQL queries (with only WHERE clauses) over one Cassandra table.

I reckon this project is no longer maintained, but as far as I can see it is the simplest solution available (solutions like eclairJS have far more functionality than I need, at the cost of increased complexity and perhaps lower performance), and it would fit my needs exactly.
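
In case it clarifies what I'm after, this is roughly the flow I hope to end up with once the connector is found (a sketch only; registerTempTable and sql are the underlying Spark 1.6 methods, and the exact sync/callback forms exposed by the bindings may differ):

sqlContext.read()
  .format("org.apache.spark.sql.cassandra")
  .option("keyspace", "mykeyspace")
  .option("table", "mytable")
  .load(null, function(err, df) {
    if (err) return console.error(err);
    // Register the frame as a temp table so plain SQL with WHERE clauses works against it
    df.registerTempTable("mytable");
    sqlContext.sql("SELECT * FROM mytable WHERE name = 'Alice'").show();
  });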

Add Dockerfile

To make it easier to get started (for those who don't have a local installation of Spark and/or Node.js).

Let's reactivate the project?

Hi @henridf! I really love heavy computing and have been working with it here in Brazil for about a year now. I'm tired of not being able to use my beloved Node.js with such a powerful engine as Spark, so I really want to help with this project, to reactivate it and help grow this ecosystem among JavaScript back-end developers!

Thanks in advance, and congratulations on starting this! Let's update the dependencies to the newest Node.js version and the latest Spark version too, to keep the project current!

Best regards,
Igor Franca

How to connect to a Spark Standalone Cluster?

I was trying to connect to a running standalone cluster by specifying either conf.setMaster("spark://192.168.200.180:7077") or supplying --master spark://192.168.200.180:7077 in the NodeSparkSubmit arguments.

Output:

15/12/21 14:43:08 INFO AppClient$ClientEndpoint: Connecting to master spark://192.168.200.180:7077...
15/12/21 14:43:28 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[appclient-registration-retry-thread,5,main]
java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.FutureTask@4f36718b rejected from java.util.concurrent.ThreadPoolExecutor@4e9f4716[Running, pool size = 1, active threads = 0, queued tasks = 0, completed tasks = 1]
    at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047)
    at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
    at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
    at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
    at org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1.apply(AppClient.scala:103)
    at org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1.apply(AppClient.scala:102)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
    at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
    at org.apache.spark.deploy.client.AppClient$ClientEndpoint.tryRegisterAllMasters(AppClient.scala:102)
    at org.apache.spark.deploy.client.AppClient$ClientEndpoint.org$apache$spark$deploy$client$AppClient$ClientEndpoint$$registerWithMaster(AppClient.scala:128)
    at org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anon$2$$anonfun$run$1.apply$mcV$sp(AppClient.scala:139)
    at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1130)
    at org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anon$2.run(AppClient.scala:131)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
15/12/21 14:43:28 INFO DiskBlockManager: Shutdown hook called
15/12/21 14:43:28 INFO ShutdownHookManager: Shutdown hook called

The Spark master is up; I can run:

$ ./spark-shell --master spark://192.168.200.180:7077 --num-executors=2 --total-executor-cores=4
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's repl log4j profile: org/apache/spark/log4j-defaults-repl.properties
To adjust logging level use sc.setLogLevel("INFO")
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.2
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_65)
Type in expressions to have them evaluated.
Type :help for more information.
15/12/21 14:49:05 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
Spark context available as sc.
15/12/21 14:49:07 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/12/21 14:49:07 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/12/21 14:49:09 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
15/12/21 14:49:09 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
15/12/21 14:49:10 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/12/21 14:49:10 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/12/21 14:49:11 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
SQL context available as sqlContext.

scala> 

Do you have any idea where this goes south?
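
For reference, this is how I'm passing the master programmatically (a sketch of my setup; it assumes the args array handed to spark() is forwarded to NodeSparkSubmit):

var spark = require('apache-spark-node');

// Assumed: the args array is forwarded to spark-submit / NodeSparkSubmit
var context = spark(
    ["--master", "spark://192.168.200.180:7077"],
    process.env.ASSEMBLY_JAR
);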

Question: Is it somehow possible to use a Context and execute statements in parallel?

I was trying to reuse an existing Context and run some statements in parallel.

It seems this is not possible due to the nature of node-java. I opened an issue at joeferner/node-java#290 but have not received a definitive answer yet.

As the Spark docs say,

multiple parallel jobs can run simultaneously if they were submitted from separate threads

which I didn't find a solution for, yet. Do you have any idea?

Failed to start spark-node with No Java runtime present error.

I encountered the problem described in the title.
As shown below, Java 1.8 is installed, but spark-node fails with a "No Java runtime present" error.

$ java -version
java version "1.8.0_77"
Java(TM) SE Runtime Environment (build 1.8.0_77-b03)
Java HotSpot(TM) 64-Bit Server VM (build 25.77-b03, mixed mode)
$ node_modules/apache-spark-node/bin/spark-node 
No Java runtime present, requesting install.

Additional information about my environment:

$ sw_vers
ProductName:    Mac OS X
ProductVersion: 10.11.4
BuildVersion:   15E65
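
One thing I will try (an untested guess: perhaps the node process is picking up the OS X stub java rather than the installed JDK) is pointing JAVA_HOME at the JDK before launching:

# Point JAVA_HOME at the installed JDK, then launch spark-node as before
export JAVA_HOME=$(/usr/libexec/java_home -v 1.8)
ASSEMBLY_JAR=/path/to/spark-assembly.jar node_modules/apache-spark-node/bin/spark-node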

Error when trying to start / Readme inaccurate

When I try to start, I get the following error:

ASSEMBLY_JAR=/Users/tobilg/Development/spark-1.5.2-bin-hadoop2.6/lib/spark-assembly-1.5.2-hadoop2.6.0.jar node_modules/apache-spark-node/bin/spark-node 
/Users/tobilg/WebstormProjects/Spark-Wrapper/node_modules/apache-spark-node/node_modules/java/lib/nodeJavaBridge.js:207
  var clazz = java.findClassSync(name); // TODO: change to Class.forName when classloader issue is resolved.
                   ^

Error: Could not create class org.apache.spark.deploy.NodeSparkSubmit
java.lang.NoClassDefFoundError: org/apache/spark/deploy/NodeSparkSubmit
Caused by: java.lang.ClassNotFoundException: org.apache.spark.deploy.NodeSparkSubmit
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)

    at Error (native)
    at Java.java.import (/Users/tobilg/WebstormProjects/Spark-Wrapper/node_modules/apache-spark-node/node_modules/java/lib/nodeJavaBridge.js:207:20)
    at spark_parseArgs (/Users/tobilg/WebstormProjects/Spark-Wrapper/node_modules/apache-spark-node/lib/spark.js:21:38)
    at Object.sqlContext (/Users/tobilg/WebstormProjects/Spark-Wrapper/node_modules/apache-spark-node/lib/spark.js:33:5)
    at Object.<anonymous> (/Users/tobilg/WebstormProjects/Spark-Wrapper/node_modules/apache-spark-node/bin/spark-node:23:26)
    at Module._compile (module.js:434:26)
    at Object.Module._extensions..js (module.js:452:10)
    at Module.load (module.js:355:32)
    at Function.Module._load (module.js:310:12)
    at Function.Module.runMain (module.js:475:10)

Furthermore, the https://github.com/henridf/apache-spark-node#running paragraph states that one should run bin/node-spark, which is non-existent.

Error with ASSEMBLY_JAR

Hello,

I have a cluster of three VMs with HDP 2.5 installed; slave1-virtual-machine is one of these three VMs.

My directory visuVelov is the directory where I ran:
npm install apache-spark-node

So, I execute the following in this directory:
env ASSEMBLY_JAR="/usr/hdp/2.5.0.0-1245/spark/lib/spark-assembly-1.6.2.2.5.0.0-1245-hadoop2.7.3.2.5.0.0-1245.jar" node_modules/apache-spark-node/bin/spark-node

But I get an error (see the attached screenshot).

I don't understand what is going wrong.

How can I fix this error?

Failure at installation

e:\Projects\apache-spark-node\node_modules\java>if not defined npm_config_node_gyp (node "D:\Program Files\nodejs\node_modules\npm\bin\node-gyp-bin\..\..\node_modules\node-gyp\bin\node-gyp.js" rebuild ) else (node rebuild )
gyp WARN install got an error, rolling back install
gyp ERR! configure error
gyp ERR! stack Error: incorrect header check
gyp ERR! stack at Zlib._handle.onerror (zlib.js:366:17)
gyp ERR! System Windows_NT 6.1.7601
gyp ERR! command "node" "D:\Program Files\nodejs\node_modules\npm\node_modules\node-gyp\bin\node-gyp.js" "rebuild"
gyp ERR! cwd e:\Projects\apache-spark-node\node_modules\java
gyp ERR! node -v v0.12.7
gyp ERR! node-gyp -v v2.0.1
gyp ERR! not ok
npm ERR! Windows_NT 6.1.7601
npm ERR! argv "D:\Program Files\nodejs\node.exe" "D:\Program Files\nodejs\node_modules\npm\bin\npm-cli.js" "install"
npm ERR! node v0.12.7
npm ERR! npm v2.11.3
npm ERR! code ELIFECYCLE

npm ERR! [email protected] install: node-gyp rebuild
npm ERR! Exit status 1
npm ERR!
npm ERR! Failed at the [email protected] install script 'node-gyp rebuild'.
npm ERR! This is most likely a problem with the java package,
npm ERR! not with npm itself.
npm ERR! Tell the author that this fails on your system:
npm ERR! node-gyp rebuild
npm ERR! You can get their info via:
npm ERR! npm owner ls java
npm ERR! There is likely additional logging output above.

npm ERR! Please include the following file with any support request:
npm ERR! e:\Projects\apache-spark-node\npm-debug.log

Async APIs

All of the API calls are currently synchronous, which is pretty unusual for a Node.js library. There needs to be a corresponding set of async APIs.

Following the Node convention, the current synchronous functions should be suffixed with Sync, and the new async ones should take the un-suffixed names.
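
For example (a sketch of the intended naming, using count; the async form shown follows the standard node-style callback convention and does not exist yet):

// Current: synchronous call, blocks until the job completes
var n = df.countSync();

// Proposed: async call with a node-style callback
df.count(function(err, n) {
    if (err) return console.error(err);
    console.log(n);
});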

Make spark-node ES7-aware

With #32 there's now support for promisified Spark actions, which seems like a natural match for the ES7 async/await syntax. This makes it possible to write code in a quasi-synchronous style that in fact executes asynchronously.

E.g.

spark-node> var df = await sqlContext.read().jsonPromised("./data/people.json")
true
spark-node> df.show()
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

false
spark-node> 

I created a solution using on-the-fly transpilation with babel.js, and I can open a new PR for this if you like.

Support concurrent jobs

Once we have an async API (#26), we should provide a way to launch jobs in different underlying java threads, so that users can run multiple jobs concurrently.

df.toJSON() not working

I'm trying to get a DataFrame as a JSON structure via df.toJSON(). The command completes, but the output is:

spark-node> var df = sqlContext.read().json("./data/people.json")
spark-node> df.toJSON()
nodeJava_org_apache_spark_rdd_MapPartitionsRDD {
  'org$apache$spark$rdd$RDD$$evidence$1': nodeJava_scala_reflect_ClassTag__anon_1 {} }
spark-node> 

Is it possible to get a DataFrame as JSON with the current version?
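
In other words, is there a way to get from the returned RDD to the actual JSON strings? Something like this is what I was hoping for (a sketch only; I'm assuming the RDD's collect method is exposed with the same Sync naming as the DataFrame methods, which I have not verified):

var df = sqlContext.read().json("./data/people.json");
var rdd = df.toJSON();          // RDD of JSON strings, one per row
var rows = rdd.collectSync();   // assumed name, following the countSync pattern
console.log(rows);              // hoped-for output: an array of JSON strings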

Streaming support from kafka ?

Hi,
I have a use case where my Spark code has to read streaming data from Kafka, perform some business logic, and write the results to HDFS. My entire business logic is currently in Node.js. Does this repo support this use case? I also see that the examples described use the interactive spark-node mode; along the same lines, how can I run my JavaScript code via spark-submit?

Make comments jsdoc compatible

Similar to what was done in #18 for lib/DataFrame.js in 9675dd3, but for the other API wrappers in lib/

  • DataFrame.js
  • DataFrameReader.js
  • GroupedData.js
  • functions.js
  • sqlContext.js
  • Column.js

Row.js doesn't need to be documented (at least yet) -- it is only used internally so far.
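
For reference, a minimal example of the target comment style (the function itself is hypothetical and is only there to carry the jsdoc tags):

/**
 * Counts how many rows in `rows` have the given `name`.
 *
 * @param {Object[]} rows - array of plain row objects (e.g. parsed JSON rows).
 * @param {string} name - name to match.
 * @returns {number} the number of matching rows.
 */
function countByName(rows, name) {
    return rows.filter(function (r) { return r.name === name; }).length;
}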

Problem with spark-node

Hi guys,

I installed the package via npm into my project and downloaded the Spark assembly, but when I try to run spark-node I get the following error:

Command:

ASSEMBLY_JAR=/Users/pola/spark-test/bin/spark-assembly-1.6.0-hadoop2.6.0.jar node_modules/apache-spark-node/bin/spark-node

Error:

/Users/pola/spark-test/node_modules/apache-spark-node/js/functions.js:3
true; if ("value" in descriptor) descriptor.writable = true; Object.defineProp
                                                                    ^
TypeError: Cannot redefine property: length
    at Function.defineProperty (native)
    at defineProperties (/Users/pola/spark-test/node_modules/apache-spark-node/js/functions.js:3:296)
    at /Users/pola/spark-test/node_modules/apache-spark-node/js/functions.js:3:495
    at /Users/pola/spark-test/node_modules/apache-spark-node/js/functions.js:33:5
    at Object.<anonymous> (/Users/pola/spark-test/node_modules/apache-spark-node/js/functions.js:2415:3)
    at Module._compile (module.js:456:26)
    at Object.Module._extensions..js (module.js:474:10)
    at Module.load (module.js:356:32)
    at Function.Module._load (module.js:312:12)
    at Module.require (module.js:364:17)

What am I missing?

I am using:

apache-spark-node: 0.3.3
node: v0.10.28

Thanks in advance,
PoLa
