
henridf / apache-spark-node

Node.js bindings for Apache Spark DataFrame APIs

Home Page: https://henridf.github.io/apache-spark-node

License: Apache License 2.0

Languages: JavaScript 97.31%, Scala 2.47%, Batchfile 0.13%, Shell 0.10%
Topics: data-frame, node, spark

apache-spark-node's People

Contributors

doron2402, henridf, tobilg


apache-spark-node's Issues

Adding external JARs

I was trying to access the spark-csv package, but was unsuccessful with either of the following:

a) Adding it to the java.classpath like

java.classpath.push("/tmp/jarCache/spark-csv_2.10-1.3.0.jar");

b) Specifying the --jar parameter in the args for 'NodeSparkSubmit'

Is there a suggested way to do this that I am missing?
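
For reference, this is roughly the invocation I expected to work (a sketch only; it assumes the args array passed to spark() is forwarded to spark-submit, as in the other usage examples here):

var spark = require('apache-spark-node');

// Hypothetical: pass --jars through the args array instead of pushing onto java.classpath
var context = spark(
    ["--jars", "/tmp/jarCache/spark-csv_2.10-1.3.0.jar"],
    process.env.ASSEMBLY_JAR
);
var sqlContext = context.sqlContext;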

unionAll not working?

Hi there; perhaps I'm misunderstanding how this should work, but shouldn't the code below simply create a union of the two data sets and output the count? Instead, I'm getting a java.lang.NullPointerException on the unionAll call.

var spark = require('apache-spark-node');
var context = spark([], process.env.ASSEMBLY_JAR);
var sqlContext = context.sqlContext;
var sqlFunctions = context.sqlFunctions;

var data1 = [
  {"name":"Michael"},
  {"name":"Andy", "age":30},
  {"name":"Justin", "age": 19}
];

var data2 = [
  {"name":"Fiona"},
  {"name":"Jenny", "age":30},
  {"name":"Sandra", "age":30},
  {"name":"Alexa", "age": 19}
];

var df1 = sqlContext.createDataFrame(data1);
var df2 = sqlContext.createDataFrame(data2);

console.log(df1.countSync());
console.log(df2.countSync());

var u = df1.unionAll(df2);
console.log(u.countSync());
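
I'm also not sure whether I should be calling a different variant of unionAll; for example (a sketch only, assuming the usual node-java convention of generated Sync and callback pairs, which I have not verified for this method):

// Assumed synchronous variant, following the countSync naming pattern
var u2 = df1.unionAllSync(df2);
console.log(u2.countSync());

// Assumed callback variant
df1.unionAll(df2, function(err, u3) {
    if (err) return console.error(err);
    console.log(u3.countSync());
});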

Querying a cassandra DB via spark

Hey there,

As the title says, I am trying to query an existing Cassandra DB from Node.js using your library. I am using a Spark cluster on a LAN.

Here's what I have done so far, using:

  • CentOS 7
  • node 4.4.4
  • [email protected]
  • spark 1.6.1
  • cassandra 2.2.5
  • spark-cassandra-connector 1.6.0-M1

From the root of my project:

ASSEMBLY_JAR=/usr/share/spark/lib/spark-assembly-1.6.1-hadoop2.6.0.jar node_modules/apache-spark-node/bin/spark-node \
--master spark://192.168.1.101:7077 --conf spark.cores.max=4 \
--jars /root/spark-cassandra-connector/spark-cassandra-connector/target/scala-2.10/spark-cassandra-connector-assembly-1.6.0-M1-36-g220aa37.jar

Once I have access to the command line, I tried:

spark-node> sqlContext.sql("Select count(*) from mykeyspace.mytable")

but of course I get:

Error: Error creating class
org.apache.spark.sql.AnalysisException: Table not found: `mykeyspace`.`mytable`;

I then tried to adapt a snippet of Scala I'd seen in a Stack Overflow post:

var df = sqlContext
  .read()
  .format("org.apache.spark.sql.cassandra")
  .option("table", "mytable")
  .option("keyspace", "mykeyspace")
  .load(null, function(err, res) { console.log(err); console.log(res) }) 

but all I get is:

Error: Error running instance method
java.lang.ClassNotFoundException: Failed to find data source: org.apache.spark.sql.cassandra. Please find packages at http://spark-packages.org

The problem surely comes from the fact that I don't understand half of how everything is linked together, which is why I'm asking for help here. All I need is a way to execute basic SQL queries (with only WHERE clauses) over one Cassandra table.

I reckon this project is no longer maintained, but as far as I can see it is the simplest solution available (solutions like eclairJS have far more functionality than I need, at the cost of increased complexity and perhaps lower performance), and it would fit my needs exactly.
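
In case it clarifies what I'm after, this is roughly the flow I hope to end up with once the connector is found (a sketch only; registerTempTable and sql are the underlying Spark 1.6 methods, and the exact sync/callback forms exposed by the bindings may differ):

sqlContext.read()
  .format("org.apache.spark.sql.cassandra")
  .option("keyspace", "mykeyspace")
  .option("table", "mytable")
  .load(null, function(err, df) {
    if (err) return console.error(err);
    // Register the frame as a temp table so plain SQL with WHERE clauses works against it
    df.registerTempTable("mytable");
    sqlContext.sql("SELECT * FROM mytable WHERE name = 'Alice'").show();
  });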

Add Dockerfile

To make it easier to get started (for those who don't have a local installation of Spark and/or Node.js).

Let's reactivate the project?

Hi @henridf! I really love heavy computing and have been working with it here in Brazil for about a year now. I'm tired of not being able to use my beloved Node.js with such a powerful engine as Spark, so I really want to help with this project, to reactivate it and help grow this ecosystem among JavaScript back-end developers!

Thanks in advance, and congratulations on starting this! Let's update the dependencies to the newest Node.js version and the latest Spark version too, to keep the project current!

Best regards,
Igor Franca

How to connect to a Spark Standalone Cluster?

I was trying to connect to a running standalone cluster by specifying either conf.setMaster("spark://192.168.200.180:7077") or supplying --master spark://192.168.200.180:7077 in the NodeSparkSubmit arguments.

Output:

15/12/21 14:43:08 INFO AppClient$ClientEndpoint: Connecting to master spark://192.168.200.180:7077...
15/12/21 14:43:28 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[appclient-registration-retry-thread,5,main]
java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.FutureTask@4f36718b rejected from java.util.concurrent.ThreadPoolExecutor@4e9f4716[Running, pool size = 1, active threads = 0, queued tasks = 0, completed tasks = 1]
    at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047)
    at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
    at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
    at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
    at org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1.apply(AppClient.scala:103)
    at org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1.apply(AppClient.scala:102)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
    at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
    at org.apache.spark.deploy.client.AppClient$ClientEndpoint.tryRegisterAllMasters(AppClient.scala:102)
    at org.apache.spark.deploy.client.AppClient$ClientEndpoint.org$apache$spark$deploy$client$AppClient$ClientEndpoint$$registerWithMaster(AppClient.scala:128)
    at org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anon$2$$anonfun$run$1.apply$mcV$sp(AppClient.scala:139)
    at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1130)
    at org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anon$2.run(AppClient.scala:131)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
15/12/21 14:43:28 INFO DiskBlockManager: Shutdown hook called
15/12/21 14:43:28 INFO ShutdownHookManager: Shutdown hook called

The Spark master is up; I can run:

$ ./spark-shell --master spark://192.168.200.180:7077 --num-executors=2 --total-executor-cores=4
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's repl log4j profile: org/apache/spark/log4j-defaults-repl.properties
To adjust logging level use sc.setLogLevel("INFO")
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.2
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_65)
Type in expressions to have them evaluated.
Type :help for more information.
15/12/21 14:49:05 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
Spark context available as sc.
15/12/21 14:49:07 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/12/21 14:49:07 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/12/21 14:49:09 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
15/12/21 14:49:09 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
15/12/21 14:49:10 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/12/21 14:49:10 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/12/21 14:49:11 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
SQL context available as sqlContext.

scala> 

Do you have any idea where this goes south?
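
For reference, this is how I'm passing the master programmatically (a sketch of my setup; it assumes the args array handed to spark() is forwarded to NodeSparkSubmit):

var spark = require('apache-spark-node');

// Assumed: the args array is forwarded to spark-submit / NodeSparkSubmit
var context = spark(
    ["--master", "spark://192.168.200.180:7077"],
    process.env.ASSEMBLY_JAR
);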

Question: Is it somehow possible to use a Context and execute statements in parallel?

I was trying to reuse an existing Context and run some statements in parallel.

It seems this is not possible due to the nature of node-java. I opened an issue at joeferner/node-java#290 but have not received a definitive answer yet.

As the Spark docs say,

multiple parallel jobs can run simultaneously if they were submitted from separate threads

which I didn't find a solution for, yet. Do you have any idea?

Failed to start spark-node with No Java runtime present error.

I encountered the problem described in the title.
As shown below, Java 1.8 is installed, but spark-node fails with a "No Java runtime present" error.

$ java -version
java version "1.8.0_77"
Java(TM) SE Runtime Environment (build 1.8.0_77-b03)
Java HotSpot(TM) 64-Bit Server VM (build 25.77-b03, mixed mode)
$ node_modules/apache-spark-node/bin/spark-node 
No Java runtime present, requesting install.

Additional information about my environment:

$ sw_vers
ProductName:    Mac OS X
ProductVersion: 10.11.4
BuildVersion:   15E65
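
One thing I will try (an untested guess: perhaps the node process is picking up the OS X stub java rather than the installed JDK) is pointing JAVA_HOME at the JDK before launching:

# Point JAVA_HOME at the installed JDK, then launch spark-node as before
export JAVA_HOME=$(/usr/libexec/java_home -v 1.8)
ASSEMBLY_JAR=/path/to/spark-assembly.jar node_modules/apache-spark-node/bin/spark-node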

Error when trying to start / Readme inaccurate

When I try to start, I get the following error:

ASSEMBLY_JAR=/Users/tobilg/Development/spark-1.5.2-bin-hadoop2.6/lib/spark-assembly-1.5.2-hadoop2.6.0.jar node_modules/apache-spark-node/bin/spark-node 
/Users/tobilg/WebstormProjects/Spark-Wrapper/node_modules/apache-spark-node/node_modules/java/lib/nodeJavaBridge.js:207
  var clazz = java.findClassSync(name); // TODO: change to Class.forName when classloader issue is resolved.
                   ^

Error: Could not create class org.apache.spark.deploy.NodeSparkSubmit
java.lang.NoClassDefFoundError: org/apache/spark/deploy/NodeSparkSubmit
Caused by: java.lang.ClassNotFoundException: org.apache.spark.deploy.NodeSparkSubmit
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)

    at Error (native)
    at Java.java.import (/Users/tobilg/WebstormProjects/Spark-Wrapper/node_modules/apache-spark-node/node_modules/java/lib/nodeJavaBridge.js:207:20)
    at spark_parseArgs (/Users/tobilg/WebstormProjects/Spark-Wrapper/node_modules/apache-spark-node/lib/spark.js:21:38)
    at Object.sqlContext (/Users/tobilg/WebstormProjects/Spark-Wrapper/node_modules/apache-spark-node/lib/spark.js:33:5)
    at Object.<anonymous> (/Users/tobilg/WebstormProjects/Spark-Wrapper/node_modules/apache-spark-node/bin/spark-node:23:26)
    at Module._compile (module.js:434:26)
    at Object.Module._extensions..js (module.js:452:10)
    at Module.load (module.js:355:32)
    at Function.Module._load (module.js:310:12)
    at Function.Module.runMain (module.js:475:10)

Furthermore, the https://github.com/henridf/apache-spark-node#running paragraph states that one should run bin/node-spark, which is non-existent.

Error with ASSEMBLY_JAR

Hello,

I have a cluster of three VMs with HDP 2.5 installed; slave1-virtual-machine is one of these three VMs.

My directory visuVelov is the directory where I ran:
npm install apache-spark-node

So, I execute the following in this directory:
env ASSEMBLY_JAR="/usr/hdp/2.5.0.0-1245/spark/lib/spark-assembly-1.6.2.2.5.0.0-1245-hadoop2.7.3.2.5.0.0-1245.jar" node_modules/apache-spark-node/bin/spark-node

But I get an error (see the attached screenshot).

I don't understand what is going wrong.

How can I fix this error?

Failure at installation

e:\Projects\apache-spark-node\node_modules\java>if not defined npm_config_node_gyp (node "D:\Program Files\nodejs\node_modules\npm\bin\node-gyp-bin\..\..\node_modules\node-gyp\bin\node-gyp.js" rebuild ) else (node rebuild )
gyp WARN install got an error, rolling back install
gyp ERR! configure error
gyp ERR! stack Error: incorrect header check
gyp ERR! stack at Zlib._handle.onerror (zlib.js:366:17)
gyp ERR! System Windows_NT 6.1.7601
gyp ERR! command "node" "D:\Program Files\nodejs\node_modules\npm\node_modules\node-gyp\bin\node-gyp.js" "rebuild"
gyp ERR! cwd e:\Projects\apache-spark-node\node_modules\java
gyp ERR! node -v v0.12.7
gyp ERR! node-gyp -v v2.0.1
gyp ERR! not ok
npm ERR! Windows_NT 6.1.7601
npm ERR! argv "D:\Program Files\nodejs\node.exe" "D:\Program Files\nodejs\node_modules\npm\bin\npm-cli.js" "install"
npm ERR! node v0.12.7
npm ERR! npm v2.11.3
npm ERR! code ELIFECYCLE

npm ERR! [email protected] install: node-gyp rebuild
npm ERR! Exit status 1
npm ERR!
npm ERR! Failed at the [email protected] install script 'node-gyp rebuild'.
npm ERR! This is most likely a problem with the java package,
npm ERR! not with npm itself.
npm ERR! Tell the author that this fails on your system:
npm ERR! node-gyp rebuild
npm ERR! You can get their info via:
npm ERR! npm owner ls java
npm ERR! There is likely additional logging output above.

npm ERR! Please include the following file with any support request:
npm ERR! e:\Projects\apache-spark-node\npm-debug.log

Async APIs

All of the API calls are currently synchronous, which is pretty unusual for a Node.js library. There needs to be a corresponding set of async APIs.

Following the Node convention, the current synchronous functions should be suffixed with Sync, and the new async ones should take the un-suffixed names.
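
For example (a sketch of the intended naming, using count; the async form shown follows the standard node-style callback convention and does not exist yet):

// Current: synchronous call, blocks until the job completes
var n = df.countSync();

// Proposed: async call with a node-style callback
df.count(function(err, n) {
    if (err) return console.error(err);
    console.log(n);
});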

Make spark-node ES7-aware

With #32 there's now support for promisified Spark actions, which seems like a natural match for the ES7 async/await syntax. This makes it possible to write code in a quasi-synchronous style that in fact executes asynchronously.

E.g.

spark-node> var df = await sqlContext.read().jsonPromised("./data/people.json")
true
spark-node> df.show()
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

false
spark-node> 

I created a solution using on-the-fly transpilation with babel.js, and I can open a new PR for this if you like.

Support concurrent jobs

Once we have an async API (#26), we should provide a way to launch jobs in different underlying java threads, so that users can run multiple jobs concurrently.

df.toJSON() not working

I'm trying to get a DataFrame as a JSON structure via df.toJSON(). The command completes, but the output is:

spark-node> var df = sqlContext.read().json("./data/people.json")
spark-node> df.toJSON()
nodeJava_org_apache_spark_rdd_MapPartitionsRDD {
  'org$apache$spark$rdd$RDD$$evidence$1': nodeJava_scala_reflect_ClassTag__anon_1 {} }
spark-node> 

Is it possible to get a DataFrame as JSON with the current version?
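
In other words, is there a way to get from the returned RDD to the actual JSON strings? Something like this is what I was hoping for (a sketch only; I'm assuming the RDD's collect method is exposed with the same Sync naming as the DataFrame methods, which I have not verified):

var df = sqlContext.read().json("./data/people.json");
var rdd = df.toJSON();          // RDD of JSON strings, one per row
var rows = rdd.collectSync();   // assumed name, following the countSync pattern
console.log(rows);              // hoped-for output: an array of JSON strings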

Streaming support from kafka ?

Hi,
I have a use case where my Spark code has to read streaming data from Kafka, perform some business logic, and write the results to HDFS. My entire business logic is currently in Node.js. Does this repo support this use case? I also see that the examples described use the interactive spark-node mode; along the same lines, how can I run my JavaScript code via spark-submit?

Make comments jsdoc compatible

Similar to what was done in #18 for lib/DataFrame.js in 9675dd3, but for the other API wrappers in lib/

  • DataFrame.js
  • DataFrameReader.js
  • GroupedData.js
  • functions.js
  • sqlContext.js
  • Column.js

Row.js doesn't need to be documented (at least yet) -- it is only used internally so far.
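
For reference, a minimal example of the target comment style (the function itself is hypothetical and is only there to carry the jsdoc tags):

/**
 * Counts how many rows in `rows` have the given `name`.
 *
 * @param {Object[]} rows - array of plain row objects (e.g. parsed JSON rows).
 * @param {string} name - name to match.
 * @returns {number} the number of matching rows.
 */
function countByName(rows, name) {
    return rows.filter(function (r) { return r.name === name; }).length;
}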

Problem with spark-node

Hi guys,

I installed the package via npm into my project and downloaded the Spark assembly, but when I try to run spark-node I get the following error:

Command:

ASSEMBLY_JAR=/Users/pola/spark-test/bin/spark-assembly-1.6.0-hadoop2.6.0.jar node_modules/apache-spark-node/bin/spark-node

Error:

/Users/pola/spark-test/node_modules/apache-spark-node/js/functions.js:3
true; if ("value" in descriptor) descriptor.writable = true; Object.defineProp
                                                                    ^
TypeError: Cannot redefine property: length
    at Function.defineProperty (native)
    at defineProperties (/Users/pola/spark-test/node_modules/apache-spark-node/js/functions.js:3:296)
    at /Users/pola/spark-test/node_modules/apache-spark-node/js/functions.js:3:495
    at /Users/pola/spark-test/node_modules/apache-spark-node/js/functions.js:33:5
    at Object.<anonymous> (/Users/pola/spark-test/node_modules/apache-spark-node/js/functions.js:2415:3)
    at Module._compile (module.js:456:26)
    at Object.Module._extensions..js (module.js:474:10)
    at Module.load (module.js:356:32)
    at Function.Module._load (module.js:312:12)
    at Module.require (module.js:364:17)

What am I missing?

I am using:

apache-spark-node: 0.3.3
node: v0.10.28

Thanks in advance,
PoLa
