henridf / apache-spark-node
Node.js bindings for Apache Spark DataFrame APIs
Home Page: https://henridf.github.io/apache-spark-node
License: Apache License 2.0
Beaker seems like the best candidate, given its support for JavaScript and its polyglot nature.
My guess/hope is that this should be pretty simple...
I just stumbled across Scala.js, and asked myself whether parts of its code could be used to provide missing JS objects like Seq or others.
Hello henridf,
Unable to install the module with Node version 8.1.4. Any help would be appreciated.
Thanks,
Piyush
I was trying to access the spark-csv package, but I was unsuccessful with either:
a) adding it to java.classpath, e.g.
java.classpath.push("/tmp/jarCache/spark-csv_2.10-1.3.0.jar");
b) specifying the --jars parameter in the args for NodeSparkSubmit.
Is there a suggested way to do this that I am missing?
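For reference, this is the kind of invocation I'd expect for approach (b), mirroring a plain spark-submit command line. This is a sketch only: the assembly path is a placeholder, and I'm assuming spark-node forwards spark-submit style flags to NodeSparkSubmit unchanged.

```shell
# Sketch only: assumes spark-node passes spark-submit style flags
# through to NodeSparkSubmit; paths are illustrative.
ASSEMBLY_JAR=/path/to/spark-assembly.jar \
  node_modules/apache-spark-node/bin/spark-node \
  --jars /tmp/jarCache/spark-csv_2.10-1.3.0.jar
```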
Hi there; perhaps I'm misunderstanding how this should work, but shouldn't the code below simply create a union of the two data sets and output the count? Instead I'm getting a java.lang.NullPointerException on the unionAll call.
var spark = require('apache-spark-node');
var context = spark([], process.env.ASSEMBLY_JAR);
var sqlContext = context.sqlContext;
var sqlFunctions = context.sqlFunctions;
var data1 = [
{"name":"Michael"},
{"name":"Andy", "age":30},
{"name":"Justin", "age": 19}
];
var data2 = [
{"name":"Fiona"},
{"name":"Jenny", "age":30},
{"name":"Sandra", "age":30},
{"name":"Alexa", "age": 19}
];
var df1 = sqlContext.createDataFrame(data1);
var df2 = sqlContext.createDataFrame(data2);
console.log(df1.countSync());
console.log(df2.countSync());
var u = df1.unionAll(df2);
console.log(u.countSync());
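One hypothesis worth checking (a guess on my part): Spark's unionAll requires both DataFrames to have identical schemas, and since data1 and data2 are inferred independently, the inferred schemas may differ. Printing both schemas would narrow this down; the method names below follow the bindings' Sync-suffix convention, but I haven't verified they are all exposed:

```javascript
// Diagnostic sketch: schemaSync()/treeStringSync() are assumed
// names following the bindings' Sync convention, unverified.
console.log(df1.schemaSync().treeStringSync());
console.log(df2.schemaSync().treeStringSync());
```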
Hey there,
As the title says, I am trying to query an existing Cassandra DB from Node.js using your library. I am using a Spark cluster on a LAN.
Here's what I have done so far. From the root of my project:
ASSEMBLY_JAR=/usr/share/spark/lib/spark-assembly-1.6.1-hadoop2.6.0.jar node_modules/apache-spark-node/bin/spark-node \
--master spark://192.168.1.101:7077 --conf spark.cores.max=4 \
--jars /root/spark-cassandra-connector/spark-cassandra-connector/target/scala-2.10/spark-cassandra-connector-assembly-1.6.0-M1-36-g220aa37.jar
Once I have access to the command line, I tried:
spark-node> sqlContext.sql("Select count(*) from mykeyspace.mytable")
but of course I get:
Error: Error creating class
org.apache.spark.sql.AnalysisException: Table not found: `mykeyspace`.`mytable`;
I then tried to adapt a Scala snippet I'd seen in a Stack Overflow post:
var df = sqlContext
.read()
.format("org.apache.spark.sql.cassandra")
.option("table", "mytable")
.option("keyspace", "mykeyspace")
.load(null, function(err, res) { console.log(err); console.log(res) })
but all I get is:
Error: Error running instance method
java.lang.ClassNotFoundException: Failed to find data source: org.apache.spark.sql.cassandra. Please find packages at http://spark-packages.org
The problem surely comes from the fact that I don't understand half of how everything fits together, which is why I'm here asking for help with this issue. All I need is a way to execute basic SQL queries (with only WHERE clauses) over one Cassandra table.
I reckon this project is no longer maintained, but as far as I can see it is the simplest solution I have found so far (solutions like EclairJS have far more functionality than I need, at the cost of increased complexity and perhaps lower performance), and it would fill my needs exactly.
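In case it helps others hitting the same wall: the ClassNotFoundException suggests the connector classes aren't reaching the driver's classpath, and the Cassandra connector additionally needs to know where the cluster lives via spark.cassandra.connection.host. A sketch of the invocation I would try next (host and jar path are from the setup above; I can't confirm this is sufficient):

```shell
# Sketch only: adds the connector's connection.host conf to the
# invocation from above; unverified that this alone fixes it.
ASSEMBLY_JAR=/usr/share/spark/lib/spark-assembly-1.6.1-hadoop2.6.0.jar \
  node_modules/apache-spark-node/bin/spark-node \
  --master spark://192.168.1.101:7077 \
  --conf spark.cores.max=4 \
  --conf spark.cassandra.connection.host=192.168.1.101 \
  --jars /root/spark-cassandra-connector/spark-cassandra-connector/target/scala-2.10/spark-cassandra-connector-assembly-1.6.0-M1-36-g220aa37.jar
```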
as per the discussion in #28 starting with this comment: #28 (comment)
To make it easy to get started (for those who don't have a local installation of Spark and/or node)
Hi @henridf! I really love heavy computing and have been working with it here in Brazil for about a year now, and I'm really tired of not being able to use my beloved Node.js with such a powerful engine as Spark. So I really want to help with this project, to reactivate it and try to grow this ecosystem among JavaScript back-end developers!
Thanks in advance, and congratulations on starting this! Let's update the dependencies to the newest Node.js version and the new Spark version too, to keep it current!
Best regards,
Igor Franca
I was trying to connect to a running standalone cluster by specifying either conf.setMaster("spark://192.168.200.180:7077")
or supplying --master spark://192.168.200.180:7077
in the NodeSparkSubmit
arguments.
Output:
15/12/21 14:43:08 INFO AppClient$ClientEndpoint: Connecting to master spark://192.168.200.180:7077...
15/12/21 14:43:28 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[appclient-registration-retry-thread,5,main]
java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.FutureTask@4f36718b rejected from java.util.concurrent.ThreadPoolExecutor@4e9f4716[Running, pool size = 1, active threads = 0, queued tasks = 0, completed tasks = 1]
at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047)
at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
at org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1.apply(AppClient.scala:103)
at org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1.apply(AppClient.scala:102)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at org.apache.spark.deploy.client.AppClient$ClientEndpoint.tryRegisterAllMasters(AppClient.scala:102)
at org.apache.spark.deploy.client.AppClient$ClientEndpoint.org$apache$spark$deploy$client$AppClient$ClientEndpoint$$registerWithMaster(AppClient.scala:128)
at org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anon$2$$anonfun$run$1.apply$mcV$sp(AppClient.scala:139)
at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1130)
at org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anon$2.run(AppClient.scala:131)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
15/12/21 14:43:28 INFO DiskBlockManager: Shutdown hook called
15/12/21 14:43:28 INFO ShutdownHookManager: Shutdown hook called
The Spark master is up; I can run:
$ ./spark-shell --master spark://192.168.200.180:7077 --num-executors=2 --total-executor-cores=4
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's repl log4j profile: org/apache/spark/log4j-defaults-repl.properties
To adjust logging level use sc.setLogLevel("INFO")
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 1.5.2
/_/
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_65)
Type in expressions to have them evaluated.
Type :help for more information.
15/12/21 14:49:05 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
Spark context available as sc.
15/12/21 14:49:07 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/12/21 14:49:07 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/12/21 14:49:09 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
15/12/21 14:49:09 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
15/12/21 14:49:10 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/12/21 14:49:10 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/12/21 14:49:11 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
SQL context available as sqlContext.
scala>
Do you have any idea where this goes south?
node-gyp rebuild produces many errors. Any ideas?
I was trying to reuse an existing context and run some statements in parallel.
It seems this is not possible due to the nature of node-java. I opened an issue at joeferner/node-java#290 but have not received a definitive answer yet.
As the Spark docs say,
multiple parallel jobs can run simultaneously if they were submitted from separate threads
which I haven't found a solution for yet. Do you have any idea?
I encountered the problem described in the title.
As shown below, Java 1.8 is installed, but spark-node fails with the error "No Java runtime present".
$ java -version
java version "1.8.0_77"
Java(TM) SE Runtime Environment (build 1.8.0_77-b03)
Java HotSpot(TM) 64-Bit Server VM (build 25.77-b03, mixed mode)
$ node_modules/apache-spark-node/bin/spark-node
No Java runtime present, requesting install.
Additional information about my environment:
$ sw_vers
ProductName: Mac OS X
ProductVersion: 10.11.4
BuildVersion: 15E65
jshint? eslint? https://github.com/babel/babel-eslint ?
Hello, is it possible to read data from MySQL?
Regards.
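Spark's built-in JDBC data source should in principle be reachable through the same read()/format()/option()/load() chain used elsewhere in these issues. A sketch, untested: the URL, table, and credentials are placeholders, and the MySQL JDBC driver jar would need to be on the classpath (e.g. via --jars):

```javascript
// Sketch only: Spark's generic JDBC source. All option values are
// placeholders, and mysql-connector-java must be on the classpath.
sqlContext.read()
    .format("jdbc")
    .option("url", "jdbc:mysql://localhost:3306/mydb")
    .option("dbtable", "mytable")
    .option("user", "myuser")
    .option("password", "mypassword")
    .load(null, function(err, df) {
        if (err) { return console.log(err); }
        df.show(); // print the first rows of the table
    });
```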
When I try to start, I get the following error:
ASSEMBLY_JAR=/Users/tobilg/Development/spark-1.5.2-bin-hadoop2.6/lib/spark-assembly-1.5.2-hadoop2.6.0.jar node_modules/apache-spark-node/bin/spark-node
/Users/tobilg/WebstormProjects/Spark-Wrapper/node_modules/apache-spark-node/node_modules/java/lib/nodeJavaBridge.js:207
var clazz = java.findClassSync(name); // TODO: change to Class.forName when classloader issue is resolved.
^
Error: Could not create class org.apache.spark.deploy.NodeSparkSubmit
java.lang.NoClassDefFoundError: org/apache/spark/deploy/NodeSparkSubmit
Caused by: java.lang.ClassNotFoundException: org.apache.spark.deploy.NodeSparkSubmit
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at Error (native)
at Java.java.import (/Users/tobilg/WebstormProjects/Spark-Wrapper/node_modules/apache-spark-node/node_modules/java/lib/nodeJavaBridge.js:207:20)
at spark_parseArgs (/Users/tobilg/WebstormProjects/Spark-Wrapper/node_modules/apache-spark-node/lib/spark.js:21:38)
at Object.sqlContext (/Users/tobilg/WebstormProjects/Spark-Wrapper/node_modules/apache-spark-node/lib/spark.js:33:5)
at Object.<anonymous> (/Users/tobilg/WebstormProjects/Spark-Wrapper/node_modules/apache-spark-node/bin/spark-node:23:26)
at Module._compile (module.js:434:26)
at Object.Module._extensions..js (module.js:452:10)
at Module.load (module.js:355:32)
at Function.Module._load (module.js:310:12)
at Function.Module.runMain (module.js:475:10)
Furthermore, the https://github.com/henridf/apache-spark-node#running paragraph states that one should run bin/node-spark, which is non-existent.
Hello,
I have a cluster of three VMs on which I installed HDP 2.5; slave1-virtual-machine is one of these three VMs.
My directory visuVelov is the directory where I ran:
npm install apache-spark-node
So I execute, in this directory:
env ASSEMBLY_JAR="/usr/hdp/2.5.0.0-1245/spark/lib/spark-assembly-1.6.2.2.5.0.0-1245-hadoop2.7.3.2.5.0.0-1245.jar" node_modules/apache-spark-node/bin/spark-node
I don't understand...
How can I fix this error?
Methods on Column can be used in JS programs (and are already exposed in some docs, such as the example in the comments of Functions.when()).
e:\Projects\apache-spark-node\node_modules\java>if not defined npm_config_node_gyp (node "D:\Program Files\nodejs\node_modules\npm\bin\node-gyp-bin....\node_modules\node-gyp\bin\node-gyp.js" rebuild ) else (node rebuild )
gyp WARN install got an error, rolling back install
gyp ERR! configure error
gyp ERR! stack Error: incorrect header check
gyp ERR! stack at Zlib._handle.onerror (zlib.js:366:17)
gyp ERR! System Windows_NT 6.1.7601
gyp ERR! command "node" "D:\Program Files\nodejs\node_modules\npm\node_modules\node-gyp\bin\node-gyp.js" "rebuild"
gyp ERR! cwd e:\Projects\apache-spark-node\node_modules\java
gyp ERR! node -v v0.12.7
gyp ERR! node-gyp -v v2.0.1
gyp ERR! not ok
npm ERR! Windows_NT 6.1.7601
npm ERR! argv "D:\Program Files\nodejs\node.exe" "D:\Program Files\nodejs\node_modules\npm\bin\npm-cli.js" "install"
npm ERR! node v0.12.7
npm ERR! npm v2.11.3
npm ERR! code ELIFECYCLE
npm ERR! [email protected] install: node-gyp rebuild
npm ERR! Exit status 1
npm ERR!
npm ERR! Failed at the [email protected] install script 'node-gyp rebuild'.
npm ERR! This is most likely a problem with the java package,
npm ERR! not with npm itself.
npm ERR! Tell the author that this fails on your system:
npm ERR! node-gyp rebuild
npm ERR! You can get their info via:
npm ERR! npm owner ls java
npm ERR! There is likely additional logging output above.
npm ERR! Please include the following file with any support request:
npm ERR! e:\Projects\apache-spark-node\npm-debug.log
probably Travis
All of the API calls are currently synchronous, which is pretty unusual for a nodejs library. There needs to be a corresponding async set of APIs.
Following the Node convention, the current synchronous functions should be suffixed with Sync, and the new async ones should take the unsuffixed names.
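Purely to illustrate the proposed naming scheme (the function bodies below are stand-ins using a plain array, not the actual bindings):

```javascript
// Illustration of the Sync/async naming convention only; a plain
// array stands in for a DataFrame here, not the real bindings.
function countSync(rows) {
  // blocking variant: returns the result directly
  return rows.length;
}

function count(rows, cb) {
  // async variant: same name without the suffix, Node-style callback
  setImmediate(function () {
    cb(null, rows.length);
  });
}

var people = [{name: "Michael"}, {name: "Andy", age: 30}];
console.log(countSync(people)); // → 2
count(people, function (err, n) {
  console.log("async count:", n);
});
```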
In the case of the Elasticsearch Spark plugin, the SparkContext is amended with additional methods, such as esRDD() and makeRDD().
I haven't yet been successful in using this plugin, because it's not really clear to me how to add the external classes in this case. With the Spark CSV plugin, this doesn't seem to be an issue.
Use jsdoc to build docs in some local directory and publish them somewhere.
With #32 there's now support for promisified Spark actions, which seems to me like a natural match for the ES7 async/await syntax. This makes it possible to write code in a quasi-synchronous style that in fact executes asynchronously.
E.g.
spark-node> var df = await sqlContext.read().jsonPromised("./data/people.json")
true
spark-node> df.show()
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
false
spark-node>
I created a solution with on-the-fly Babel transpilation and can open a new PR for this if you like.
Hi @henridf !
David from Particle here, we used to have "spark", but I think it makes sense for you guys to have it. Thoughts?
Thanks!
David
Once we have an async API (#26), we should provide a way to launch jobs in different underlying java threads, so that users can run multiple jobs concurrently.
I'm trying to get a DataFrame as a JSON structure via df.toJSON(). The command completes, but the output is:
spark-node> var df = sqlContext.read().json("./data/people.json")
spark-node> df.toJSON()
nodeJava_org_apache_spark_rdd_MapPartitionsRDD {
'org$apache$spark$rdd$RDD$$evidence$1': nodeJava_scala_reflect_ClassTag__anon_1 {} }
spark-node>
Is it possible to get a DataFrame as JSON with the current version?
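Since DataFrame.toJSON() returns an RDD of JSON strings on the Java side, the RDD presumably still needs to be collected before the strings become visible in JS. A guess at what that could look like (I haven't verified which RDD methods the node-java proxy exposes):

```javascript
// Assumption: the node-java proxy exposes the RDD's collect()
// with the usual Node-style callback; unverified.
df.toJSON().collect(function(err, rows) {
    if (err) { return console.log(err); }
    // each element should be one JSON string, e.g. {"name":"Michael"}
    console.log(rows);
});
```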
See title
Hi,
I have a use case where my Spark code has to read streaming data from Kafka, perform some business logic, and write the results to HDFS. My entire business logic is currently in Node.js. Does this repo support this use case? I also see that the examples described use the interactive spark-node mode; along similar lines, how can I run my JavaScript code via spark-submit?
Hi guys,
I installed via npm into my project and I downloaded the spark-assembly, but when I try to run spark-node I get the following error:
Command:
ASSEMBLY_JAR=/Users/pola/spark-test/bin/spark-assembly-1.6.0-hadoop2.6.0.jar node_modules/apache-spark-node/bin/spark-node
Error:
/Users/pola/spark-test/node_modules/apache-spark-node/js/functions.js:3
true; if ("value" in descriptor) descriptor.writable = true; Object.defineProp
^
TypeError: Cannot redefine property: length
at Function.defineProperty (native)
at defineProperties (/Users/pola/spark-test/node_modules/apache-spark-node/js/functions.js:3:296)
at /Users/pola/spark-test/node_modules/apache-spark-node/js/functions.js:3:495
at /Users/pola/spark-test/node_modules/apache-spark-node/js/functions.js:33:5
at Object.<anonymous> (/Users/pola/spark-test/node_modules/apache-spark-node/js/functions.js:2415:3)
at Module._compile (module.js:456:26)
at Object.Module._extensions..js (module.js:474:10)
at Module.load (module.js:356:32)
at Function.Module._load (module.js:312:12)
at Module.require (module.js:364:17)
What am I missing?
I am using:
apache-spark-node: 0.3.3
node: v0.10.28
Thanks in advance,
PoLa