
hydrospheredata / mist

326 stars · 41 watchers · 67 forks · 10.2 MB

Serverless proxy for Spark clusters

Home Page: http://hydrosphere.io/mist/

License: Apache License 2.0

Scala 95.32% Shell 0.80% Python 3.08% Java 0.04% Groovy 0.77%
apache-spark api big-data serverless

mist's Introduction


Hydrosphere Mist

Join the chat at https://gitter.im/Hydrospheredata/mist

Hydrosphere Mist is a serverless proxy for Spark clusters. Mist provides a new functional programming framework and deployment model for Spark applications.

Please see our quick start guide and documentation.

Features:

  • Spark Function as a Service. Deploy Spark functions rather than notebooks or scripts.
  • Spark Cluster and Session management. Fully managed Spark sessions backed by on-demand EMR, Hortonworks, Cloudera, DC/OS and vanilla Spark clusters.
  • Typesafe programming framework that clearly defines inputs and outputs of every Spark job.
  • REST HTTP & Messaging (MQTT, Kafka) API for Scala & Python Spark jobs.
  • Multi-cluster mode: seamless on-demand Spark cluster provisioning, autoscaling and termination (pending), i.e. a "cluster of Spark clusters".

Mist creates a unified API layer for building enterprise solutions and microservices on top of Spark functions; a minimal function is sketched below.
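
For illustration, a minimal Mist function modelled on the SimpleContext example used in the issues below (trait names vary between Mist versions, so treat this as a sketch rather than the current API):

import io.hydrosphere.mist.lib.{MistJob, SQLSupport}

object SimpleContext extends MistJob with SQLSupport {
  // `session` is the managed SparkSession provided by SQLSupport.
  def execute(digits: Seq[Int]): Map[String, Any] = {
    val sess = session          // bind to a stable val before importing implicits
    import sess.implicits._
    val df = digits.toDF("number")
    Map("result" -> df.map(_.getInt(0) * 3).collect())
  }
}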

Mist use cases

High Level Architecture


Contact

Please report bugs/problems to: https://github.com/Hydrospheredata/mist/issues.

http://hydrosphere.io/

LinkedIn

Facebook

Twitter

mist's People

Contributors

andreynenashev, anselmevignon, asaushkin, barbara615, blvp, denissimon, dos65, esmeneev, gitter-badger, icekhan13, ingvarch, kineticcookie, leonid133, maksimtereshin, mkf-simpson, mstolbov, peanig, pyct, spushkarev, stevencasey, tidylobster, werowe, zajs


mist's Issues

Http doesn't serve `ui` paths on scala 2.10

logs from DebugDirectives:

[INFO] [03/09/2017 15:09:48.468] [mist-akka.actor.default-dispatcher-2] [ActorSystem(mist)] Client ReST: Response for
  Request : HttpRequest(HttpMethod(GET),http://localhost:2004/ui/,List(Host: localhost:2004, Connection: keep-alive, Upgrade-Insecure-Requests: 1, User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/56.0.2924.76 Chrome/56.0.2924.76 Safari/537.36, Accept: text/html, application/xhtml+xml, application/xml;q=0.9, image/webp, */*;q=0.8, Accept-Encoding: gzip, deflate, sdch, br, Accept-Language: ru-RU, ru;q=0.8, en-US;q=0.6, en;q=0.4),HttpEntity.Strict(none/none,ByteString()),HttpProtocol(HTTP/1.1))
  Response: Rejected(List(TransformationRejection(<function1>), TransformationRejection(<function1>)))

starting mist locally

I have a problem executing Mist locally.

steps performed:

git clone this repository 
cd mist
git checkout v0.6.5
sbt -DsparkVersion=2.0.2 assembly 
./bin/mist start master
[INFO] [11/30/2016 11:14:15.847] [main] [akka.remote.Remoting] Starting remoting
[INFO] [11/30/2016 11:14:16.062] [main] [akka.remote.Remoting] Remoting started; listening on addresses :[akka.tcp://[email protected]:2551]
[INFO] [11/30/2016 11:14:16.082] [main] [akka.cluster.Cluster(akka://mist)] Cluster Node [akka.tcp://[email protected]:2551] - Starting up...
[INFO] [11/30/2016 11:14:16.194] [main] [akka.cluster.Cluster(akka://mist)] Cluster Node [akka.tcp://[email protected]:2551] - Registered cluster JMX MBean [akka:type=Cluster]
[INFO] [11/30/2016 11:14:16.194] [main] [akka.cluster.Cluster(akka://mist)] Cluster Node [akka.tcp://[email protected]:2551] - Started up successfully
[INFO] [11/30/2016 11:14:16.210] [mist-akka.actor.default-dispatcher-4] [akka.cluster.Cluster(akka://mist)] Cluster Node [akka.tcp://[email protected]:2551] - Metrics will be retreived from MBeans, and may be incorrect on some platforms. To increase metric accuracy add the 'sigar.jar' to the classpath and the appropriate platform-specific native libary to 'java.library.path'. Reason: java.lang.ClassNotFoundException: org.hyperic.sigar.Sigar
[INFO] [11/30/2016 11:14:16.216] [mist-akka.actor.default-dispatcher-4] [akka.cluster.Cluster(akka://mist)] Cluster Node [akka.tcp://[email protected]:2551] - Metrics collection has started successfully
[INFO] [11/30/2016 11:14:16.251] [mist-akka.actor.default-dispatcher-3] [akka.cluster.Cluster(akka://mist)] Cluster Node [akka.tcp://[email protected]:2551] - Node [akka.tcp://[email protected]:2551] is JOINING, roles []
16/11/30 11:14:16 INFO Master$: 0
16/11/30 11:14:16 INFO Master$: 2551
[INFO] [11/30/2016 11:14:16.277] [mist-akka.actor.default-dispatcher-3] [akka.cluster.Cluster(akka://mist)] Cluster Node [akka.tcp://[email protected]:2551] - Leader is moving node [akka.tcp://[email protected]:2551] to [Up]

But Mist is not available on the default localhost port 2003. After checking the configuration I found out that the default configuration and the Docker configuration differ.

curl --header "Content-Type: application/json" -X POST http://localhost:2004/api/simple-context --data '{"digits": [1, 2, 3, 4, 5, 6, 7, 8, 9, 0]}'
curl: (7) Failed to connect to localhost port 2003: Connection refused

Is there a minimal configuration available somewhere to get started quickly (locally)?
Can I specify a configuration on startup of mist?

Exception in thread "Thread-1" java.lang.ExceptionInInitializerError
	at io.hydrosphere.mist.master.WorkerManager$$anon$1.run(WorkerManager.scala:37)
Caused by: com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'runner'
	at com.typesafe.config.impl.SimpleConfig.findKeyOrNull(SimpleConfig.java:152)
	at com.typesafe.config.impl.SimpleConfig.findOrNull(SimpleConfig.java:170)
	at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:184)
	at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:189)
	at com.typesafe.config.impl.SimpleConfig.getString(SimpleConfig.java:246)
	at io.hydrosphere.mist.MistConfig$Workers$.<init>(MistConfig.scala:110)
	at io.hydrosphere.mist.MistConfig$Workers$.<clinit>(MistConfig.scala)

Add Artifactory as a repository to pull Mist jobs from

Currently we can pull and execute jobs from the local file system and from HDFS.
It would be great to support Artifactory, an industry standard for enterprise-grade deployments.

Basically it is required to implement the JobFile interface; a hypothetical sketch follows below.

Testing could be done the same way as it is done for HDFSJobFile: creating a Docker container with Artifactory and test jars/Python files.

@mozinrat - FYI
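
A hypothetical sketch of the download side of such a resolver (the real JobFile interface in this codebase may differ; class and method names here are illustrative only):

import java.io.File
import java.net.URL
import java.nio.file.{Files, Paths, StandardCopyOption}

// Hypothetical resolver: fetches a job artifact from an Artifactory HTTP
// endpoint into a local file, mirroring what HDFSJobFile does for HDFS.
class ArtifactoryJobFile(baseUrl: String, artifactPath: String) {
  def fetch(targetDir: String): File = {
    val url = new URL(s"$baseUrl/$artifactPath")
    val target = Paths.get(targetDir, Paths.get(artifactPath).getFileName.toString)
    val in = url.openStream()
    try Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING)
    finally in.close()
    target.toFile
  }
}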

is notebook part of mist?

Hi,
I am unable to find the notebook module in the repository. The Hydrosphere GitHub link for the notebook does not show anything related to it. Do you have any plans to release the notebook on GitHub?
Will it be a part of Mist?
Thanks

None of the URLs are loading when LoadComplete > Record user scenario is turned on

Hi,

I just installed the LoadComplete 4.50 tool for load testing of a web application (HTTP). I clicked on "Record User Scenario" and IE11 launched in private mode. The recording window with the red light turned on opened. I entered the URL in the address bar and pressed the Enter key. A "This page can't be displayed" error is displayed.

I noticed the same issue with other browsers, Chrome and Firefox. I tried different cache modes like "Clear Cache" and "Maintain Cache (Not recommended)" and still get the "This page can't be displayed" error.

I tried with "https://www.google.com/". The same "This page can't be displayed" error is shown.

Can anyone help me solve this issue?
My chrome version is 54.0.2840.99 m
My firefox version is 47.0.2

import session.implicits._ doesn't work

Spark Version: 2.0.2
Scala Version: 2.11.8

When I tried a simple example importing session.implicits._, I got the following error during the sbt build.

My code:

import io.hydrosphere.mist.lib.{MistJob, SQLSupport}
import org.apache.spark.sql._

object SimpleContext extends MistJob with SQLSupport{
  /** Contains implementation of spark job with ordinary [[org.apache.spark.SparkContext]]
    * Abstract method must be overridden
    *
    * based on https://github.com/Hydrospheredata/mist/blob/master/examples/src/main/scala/SimpleContext.scala
    *
    * @param digits user parameter digits
    * @return result of the job
    */
 import session.implicits._
 def execute(digits:Seq[Int]): Map[String, Any] = {
        val mydf = digits.toSeq.toDF("number")
        Map("result" -> mydf.map(x => x.getInt(0) * 3).collect())
  }
}

Error:

[info] Loading project definition from /home/user/mist/apache-spark-restAPI-example/project
[info] Set current project to sparkMist (in build file:/home/user/mist/apache-spark-restAPI-example/)
[info] Compiling 1 Scala source to /home/user/mist/apache-spark-restAPI-example/target/scala-2.11/classes...
[error] /home/user/mist/apache-spark-restAPI-example/src/main/scala/SimpleContext.scala:13: stable identifier required, but SimpleContext.this.session.implicits found.
[error]  import session.implicits._
[error]                 ^
[error] one error found
[error] (compile:compileIncremental) Compilation failed

Please help me with how I can import session.implicits._.

Thanks..!!
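
The error is a general Scala constraint rather than a Mist-specific bug: an import requires a stable identifier, and session here resolves through SimpleContext.this. A common workaround is to bind the session to a val first:

def execute(digits: Seq[Int]): Map[String, Any] = {
  val sess = session        // stable identifier
  import sess.implicits._   // now compiles
  val mydf = digits.toDF("number")
  Map("result" -> mydf.map(x => x.getInt(0) * 3).collect())
}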

Driver Fault Tolerance Implementation - Mist

@nphadke2 and I were working on adding driver fault tolerance to Apache Spark for batch processing, but Mist seems to have solved that problem. We had a few questions about how Mist does it.

  1. The README says "self healing after driver program failure"

    • a. What does this entail?

    • b. How does Mist recognize that a driver has failed? Does it use some sort of heartbeat to ensure that the driver is alive and running?

    • c. What happens after the driver program and/or driver machine fails? Does a new driver get instantiated, and if so, how is a machine selected to host the driver program? Does the new driver read the state which was recorded to get back up to speed or is the whole application restarted from the beginning?

    • d. If job state is persisted, does this mean that Mist periodically pulls or gets updates about job state from the driver and stores it locally?

    • e. Is the restart of the driver program made transparent to the workers? Any guidance on where in the source code this is implemented would be much appreciated.

  2. This document explains how you can have multiple Spark contexts running Spark jobs in parallel using the same group of worker machines.

    • a. Can you configure the second driver in such a way that it executes the same jobs as the first driver if the first driver fails?
    • b. Would this involve periodically sharing job state between driver one and driver two? I am asking because if this can be done, the turnaround time for continuing the job after the first failure could be significantly reduced.
  3. To test the fault tolerance of the utilities Mist provides, we simulated a long-running job by adding artificial sleep statements. During this sleep period, we printed out the IP addresses of the machine running the application using two alternate suggestions: https://stackoverflow.com/questions/166506/finding-local-ip-addresses-using-pythons-stdlib/166589#166589 and https://stackoverflow.com/questions/166506/finding-local-ip-addresses-using-pythons-stdlib/166520#166520. Yet we always get an IP address of 172.17.0.2, which seems to be the IP address of the Docker container.

    • a. Are we to conclude that the job is being run within the Docker container and not on a Spark worker?
    • b. While this application was sleeping, we killed the master to see how Mist reacts to the killing of the driver program. However, our print statements kept printing with the IP address listed above. This suggests, we believe, that the application/job is not being run on the Apache Spark cluster but in the Docker container. Are our conclusions correct?

Kill a MistJob using rest call

How to kill a running job using rest API call?

And a second thing (which is not related to this): is a new JVM started per SparkContext?

fast model evaluation

You mentioned that you are working on a high throughput API for mist.
Maybe https://github.com/combust-ml/mleap is helpful.

Synchronous real-time API for high throughput: we are working on adding model serving support for online queries with low latency.

Dependency not available for version 0.8.0 in mist Maven repo.

When I try to include the dependency for mist using

<dependency>
    <groupId>io.hydrosphere</groupId>
    <artifactId>mist</artifactId>
    <version>0.8.0</version>
</dependency>

it fails with unresolved dependency: io.hydrosphere#mist;0.8.0: not found.

Can anybody look into this? Or is there another way to download the dependency?

S3 Jobs Resolver

Add the ability to pull Spark jobs from S3.
See the existing implementations of MavenArtifactResolver & HDFSResolver; a hypothetical sketch follows below.
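
A hedged sketch of the download side, using the AWS SDK for Java (the resolve signature is illustrative; a real implementation should mirror MavenArtifactResolver / HDFSResolver):

import java.io.File
import java.nio.file.{Files, Paths, StandardCopyOption}
import com.amazonaws.services.s3.AmazonS3ClientBuilder

// Hypothetical S3 resolver: downloads s3://bucket/key into targetDir.
class S3Resolver(bucket: String, key: String) {
  def resolve(targetDir: String): File = {
    val s3 = AmazonS3ClientBuilder.defaultClient()
    val target = Paths.get(targetDir, Paths.get(key).getFileName.toString)
    val in = s3.getObject(bucket, key).getObjectContent
    try Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING)
    finally in.close()
    target.toFile
  }
}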

Java support for MistJobs

Currently there are Scala and Python APIs.
It would be great to support Java as well.

There are 2 types of Mist jobs:

  1. Heavy Spark jobs extended from MistJob
  2. Realtime MLMistJob

Java MistJob API sub-tasks and challenges:

  1. Implement a Java mirror of the MistJob trait and deal with multiple inheritance for Java
  2. Implement a Java version of the typed parameters mapper. See ExternalJar

Java MLMistJob API sub-tasks and challenges:

  1. Port Scala implicits from LocalTransformers into Java API
  2. Create Java API for LocalData

@mozinrat - FYI - any thoughts and contributions are welcome!

Can't get ML Example to Work

Hi, I'm following the examples from: https://github.com/Hydrospheredata/mist/blob/master/docs/use-cases/ml-realtime.md and I cannot get the job/routes to show up in the Mist HTTP UI. Here's what I'm doing:

  1. Starting the Docker container (from the project root)
  2. Building the examples with sbt "project mistLibSpark2" package (the build is successful)
  3. In ./configs/ I have the following:

jar_path = "./mist-lib-spark2/target/scala-2.11/mist-lib-spark2_2.11-0.11.0.jar" //I've also tried variations of ./ and /

genericModel = {
path = ${jar_path}
className = "DTreeClassificationJob$"
namespace = "foo"
}

And... nothing shows up in the Router, Jobs, Settings, etc. UI. I'm running on Windows, too.

Thank you!

Problem running mist on own Apache Spark Cluster

Hi,

I am trying to run a Mist Job on a custom Apache Spark cluster and I am getting the following error:

17/01/31 00:52:38 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[appclient-registration-retry-thread,5,main] java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.FutureTask@6b9689d rejected from java.util.concurrent.ThreadPoolExecutor@38f018f4[Running, pool size = 1, active threads = 0, queued tasks = 0, completed tasks = 1]
    at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047)
    at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
    at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
    at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
    at org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1.apply(AppClient.scala:96)
    at org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1.apply(AppClient.scala:95)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
    at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
    at org.apache.spark.deploy.client.AppClient$ClientEndpoint.tryRegisterAllMasters(AppClient.scala:95)
    at org.apache.spark.deploy.client.AppClient$ClientEndpoint.org$apache$spark$deploy$client$AppClient$ClientEndpoint$$registerWithMaster(AppClient.scala:121)
    at org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anon$2$$anonfun$run$1.apply$mcV$sp(AppClient.scala:132)
    at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1119)
    at org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anon$2.run(AppClient.scala:124)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

These are the steps I followed:

  1. Set up a small Apache Spark cluster of 3 machines using Spark version 2.1.0.
  2. Created an image using the Dockerfile after compiling Mist and specifying Spark version 2.1.0.
  3. Set the master URL in docker.conf to the IP and port of my Apache Spark cluster master.
  4. Created a sample application using the examples given in the Mist directory and created an endpoint using router.conf.
  5. Did the following POST request: curl --header "Content-Type: application/json" -X POST http://<IP:PORT>/api/log-search --data '{"digits":5}'

I would appreciate any help on this as I am trying to test whether Mist provides Driver fault tolerance for Spark Batch Processing jobs.

Please let me know if I can provide any more information.

Thank you

Check spark version compatibility on context initialization

If Mist is started on the wrong Spark version, the error is not clear:

Exception in thread "main" java.lang.NoSuchMethodError: scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
    at akka.remote.serialization.MiscMessageSerializer.<init>(MiscMessageSerializer.scala:71)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$2.apply(DynamicAccess.scala:78)
    at scala.util.Try$.apply(Try.scala:161)
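
A sketch of the kind of fail-fast check this issue asks for (illustrative only: assertSparkVersion and expectedVersion are hypothetical names, and a severe Scala binary mismatch may still fail before any check runs):

// Compare the Spark version Mist was built against with the version actually
// on the classpath, and fail with a readable message instead of an obscure
// NoSuchMethodError later on.
def assertSparkVersion(expectedVersion: String): Unit = {
  val actual = org.apache.spark.SPARK_VERSION
  if (!actual.startsWith(expectedVersion))
    sys.error(s"Mist was built for Spark $expectedVersion, but Spark $actual is on the classpath")
}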

question regarding session in examples

In the MLClassification example, line 9 uses the session object:

val training = session.createDataFrame(Seq(

Where can I find out more about this object? IntelliJ tells me that it is not available.

how to run

I am using Spark on a MacBook, installed via brew.

What is the problem? SPARK_HOME seems to be set:

sudo ./bin/mist start master
SPARK_HOME is not set
t201-034:mist geoHeil$ echo $SPARK_HOME
/usr/local/Cellar/apache-spark/2.0.2
t201-034:mist geoHeil$ sudo echo $SPARK_HOME
/usr/local/Cellar/apache-spark/2.0.2

Not able to detect custom (datasource) classpath in spark-sql with mist

I have a custom data source which I use in sqlContext.read.format("c##.###.bigdata.fileprocessing.sparkjobs.fileloader.DataSource"). This works well via spark-submit in local & yarn mode. But when I invoke the same job via Mist, it throws the following exception:

Failed to find data source: c.p.b.f.sparkjobs.fileloader.DataSource. Please find packages at http://spark-packages.org

Curl Command invoked:
curl --header "Content-Type: application/json" -X POST http://localhost:2004/api/fileProcess --data '{"path": "/root/fileprocessing/bigdata-fileprocessing-all.jar", "className": "c##.###.bigdata.fileprocessing.util.LoaderMistApp$", "parameters": {"configId": "FILE1"}, "namespace": "foo"}'

bigdata-fileprocessing-all.jar has the DataSource class.

I added the following in router.conf:
fileProcess = {
path = ${fp_path}
className = "c##.###.bigdata.fileprocessing.util.LoaderMistApp$"
namespace = "foo"
}

Mist Version: 0.10.0
Spark Version: 1.6.1

Error Stacktrace:
mist_stacktrace.txt

Introduce new rest api

We are now working on a new REST API. The main goal is to build a new UI on top of it and to make it comfortable to work with long-running jobs via HTTP.

Glossary changes:

  • route is now endpoint: an entity configured to run jobs (artifact path, className)
  • job is now the fact of an endpoint execution

Short overview:

  • Endpoints:

    • get all GET /v2/api/endpoints
    • get by id GET /v2/api/endpoints/{id}
    • start job POST /v2/api/endpoints/{id}
      DATA: job arguments { "arg-name": "arg-value", ... }
    • job history GET /v2/api/endpoints/{id}/jobs
  • Jobs:

    • get info by id GET /v2/api/jobs/{id}
  • Workers:

    • get all GET /v2/api/workers
    • stop DELETE /v2/api/workers/{id}

Dev branch: feature/new_api. A client-side sketch of these endpoints is shown below.
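
For illustration, a minimal Scala client for the endpoints above (a JDK-only sketch; the host, endpoint id and arguments are placeholders):

import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets

// Sketch: POST /v2/api/endpoints/{id} with job arguments as a JSON object,
// returning the raw response body.
def startJob(host: String, endpointId: String, argsJson: String): String = {
  val conn = new URL(s"$host/v2/api/endpoints/$endpointId")
    .openConnection().asInstanceOf[HttpURLConnection]
  conn.setRequestMethod("POST")
  conn.setDoOutput(true)
  conn.setRequestProperty("Content-Type", "application/json")
  conn.getOutputStream.write(argsJson.getBytes(StandardCharsets.UTF_8))
  scala.io.Source.fromInputStream(conn.getInputStream).mkString
}

// e.g. startJob("http://localhost:2004", "simple-context", """{"digits": [1, 2, 3]}""")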

spark.executor.cores setting has no effect while creating spark context

With the config below:

mist.context-defaults.spark-conf = {
spark.master = "spark://host:7077"
spark.executor.cores = 2
spark.executor.memory = "512m"
}

Even though the executor cores are set to 2, the job occupies all the available cores on the Spark cluster; it looks like this setting is not taking effect.
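
One general Spark standalone behaviour worth ruling out here (a hedged suggestion, not a confirmed diagnosis): spark.executor.cores only caps cores per executor, while the total number of cores a standalone application claims is governed by spark.cores.max; without it the application takes all available cores. Expressed as SparkConf settings:

import org.apache.spark.SparkConf

// spark.executor.cores caps cores per executor; spark.cores.max caps the
// total cores the application may take from a standalone cluster.
val conf = new SparkConf()
  .setMaster("spark://host:7077")
  .set("spark.executor.cores", "2")
  .set("spark.cores.max", "2") // without this, standalone mode grants all cores

// The equivalent keys can be set under mist.context-defaults.spark-conf.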

Mist throwing exception for drill queries

My jobs with any Drill query are not working with Mist, and I keep getting the following exception. The same job works fine with spark-submit.

17/06/15 05:59:04 WARN ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.lang.NoClassDefFoundError: oadd/org/apache/log4j/Logger
at oadd.org.apache.zookeeper.Login.(Login.java:44)
at oadd.org.apache.zookeeper.client.ZooKeeperSaslClient.createSaslClient(ZooKeeperSaslClient.java:226)
at oadd.org.apache.zookeeper.client.ZooKeeperSaslClient.(ZooKeeperSaslClient.java:131)
at oadd.org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:949)
at oadd.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1003)
17/06/15 05:59:05 WARN ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect

Here's a sample Drill query job I wrote which works fine with spark-submit:

import java.sql.DriverManager

def execute(): Map[String, Any] = {
  // Load the Drill JDBC driver and open a connection
  Class.forName("org.apache.drill.jdbc.Driver")
  val connection = DriverManager.getConnection("jdbc:drill:zk=localhost:5181/drill/demo_mapr_com-drillbits;schema=dfs", "root", "mapr")
  val statement = connection.createStatement()
  val query = "select * from dfs.tmp.`employees` limit 10"
  val resultSet = statement.executeQuery(query)
  var list: List[String] = List()

  while (resultSet.next()) {
    println(resultSet.getString(1))
    list = list ++ List(resultSet.getString(1))
  }
  Map("result" -> list)
}

Also attached is the response I get from the Mist API.

response.txt

Same Job submitted twice

@dos65
Hi,
When I send an API request to the Mist server from the Mist UI, it submits the same job twice (with the same ID) to the Spark cluster. I can see this in the Spark cluster UI, although the result is returned from the first job execution.

Below is the screenshot attached:
spark-ui

Please check this..!!
Thanks

Reconsider `streaming job` feature

The current state of jobs with StreamingSupport is that they support only the JSON format for publishing. They are also limited in broker and topic selection (they can use only the one configured on the Mist server).
This does not seem convenient for end-users: we cannot predict how they will want to publish events, nor should we force them to use only Mist's available configuration.

Maybe it would be nice to do a mist-lib-streaming with an MQTT/Kafka producer for Spark? A sketch of the idea follows below.
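
For illustration, the producer side of such a mist-lib-streaming could be as small as this sketch with the plain Kafka client (class and parameter names are hypothetical):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// A user-owned publisher: the job decides the broker, topic and format,
// instead of being limited to the single broker configured on the Mist server.
class JobEventPublisher(brokers: String, topic: String) {
  private val props = new Properties()
  props.put("bootstrap.servers", brokers)
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  private val producer = new KafkaProducer[String, String](props)

  def publish(payload: String): Unit =
    producer.send(new ProducerRecord[String, String](topic, payload))

  def close(): Unit = producer.close()
}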

@spushkarev @mkf-simpson ping

Logging mist API calls

How can I log all Mist API calls to a rolling log file? Right now they are printed to the console log.

Please help me..!!

UNRESOLVED DEPENDENCIES of io.hydrosphere#mist-api-spark2;0.11.0

Hi,

I am using Spark 2.0.2. I have included

libraryDependencies += "io.hydrosphere" % "mist-api-spark2" % "0.11.0"

and sbt couldn't find the dependency.

Error:

[warn] 	module not found: io.hydrosphere#mist-api-spark2;0.11.0
[warn] ==== local: tried
[warn]   /home/ravi/.ivy2/local/io.hydrosphere/mist-api-spark2/0.11.0/ivys/ivy.xml
[warn] ==== public: tried
[warn]   https://repo1.maven.org/maven2/io/hydrosphere/mist-api-spark2/0.11.0/mist-api-spark2-0.11.0.pom
[info] Resolving jline#jline;2.12.1 ...
[warn] 	::::::::::::::::::::::::::::::::::::::::::::::
[warn] 	::          UNRESOLVED DEPENDENCIES         ::
[warn] 	::::::::::::::::::::::::::::::::::::::::::::::
[warn] 	:: io.hydrosphere#mist-api-spark2;0.11.0: not found
[warn] 	::::::::::::::::::::::::::::::::::::::::::::::
[warn] 
[warn] 	Note: Unresolved dependencies path:
[warn] 		io.hydrosphere:mist-api-spark2:0.11.0 (/home/ravi/Work/build.sbt#L17-18)
[warn] 		  +- com.ravi:mistexample_2.11:0.1
sbt.ResolveException: unresolved dependency: io.hydrosphere#mist-api-spark2;0.11.0: not found
	at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:313)
	at sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:191)
	at sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:168)
	at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:156)
	at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:156)
	at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:133)
	at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:57)
	at sbt.IvySbt$$anon$4.call(Ivy.scala:65)
	at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:93)
	at xsbt.boot.Locks$GlobalLock.xsbt$boot$Locks$GlobalLock$$withChannelRetries$1(Locks.scala:78)
	at xsbt.boot.Locks$GlobalLock$$anonfun$withFileLock$1.apply(Locks.scala:97)
	at xsbt.boot.Using$.withResource(Using.scala:10)
	at xsbt.boot.Using$.apply(Using.scala:9)
	at xsbt.boot.Locks$GlobalLock.ignoringDeadlockAvoided(Locks.scala:58)
	at xsbt.boot.Locks$GlobalLock.withLock(Locks.scala:48)
	at xsbt.boot.Locks$.apply0(Locks.scala:31)
	at xsbt.boot.Locks$.apply(Locks.scala:28)
	at sbt.IvySbt.withDefaultLogger(Ivy.scala:65)
	at sbt.IvySbt.withIvy(Ivy.scala:128)
	at sbt.IvySbt.withIvy(Ivy.scala:125)
	at sbt.IvySbt$Module.withModule(Ivy.scala:156)
	at sbt.IvyActions$.updateEither(IvyActions.scala:168)
	at sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1481)
	at sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1477)
	at sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$121.apply(Defaults.scala:1512)
	at sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$121.apply(Defaults.scala:1510)
	at sbt.Tracked$$anonfun$lastOutput$1.apply(Tracked.scala:37)
	at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1515)
	at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1509)
	at sbt.Tracked$$anonfun$inputChanged$1.apply(Tracked.scala:60)
	at sbt.Classpaths$.cachedUpdate(Defaults.scala:1532)
	at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1459)
	at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1411)
	at scala.Function1$$anonfun$compose$1.apply(Function1.scala:47)
	at sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:40)
	at sbt.std.Transform$$anon$4.work(System.scala:63)
	at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:228)
	at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:228)
	at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:17)
	at sbt.Execute.work(Execute.scala:237)
	at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:228)
	at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:228)
	at sbt.ConcurrentRestrictions$$anon$4$$anonfun$1.apply(ConcurrentRestrictions.scala:159)
	at sbt.CompletionService$$anon$2.call(CompletionService.scala:28)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
[error] (*:update) sbt.ResolveException: unresolved dependency: io.hydrosphere#mist-api-spark2;0.11.0: not found

Do I need to add any resolvers to the build.sbt?

@dos65
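
If the artifact turns out to be cross-published with a Scala suffix, or hosted outside Maven Central, a build.sbt change along these lines may help (the resolver URL is a placeholder, not a real Hydrosphere repository):

// `%%` appends the Scala binary suffix (e.g. _2.11) to the artifact name;
// plain `%` looks for an artifact without a suffix, which is what failed above.
libraryDependencies += "io.hydrosphere" %% "mist-api-spark2" % "0.11.0"

// Placeholder repository, in case the artifact is not on Maven Central:
resolvers += "hydrosphere-releases" at "https://example.org/releases"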

Docs: update getting started pages

Currently that page is not so friendly for new people.
I think the page should contain the parts below:

  • Download and try to run a job from the examples
  • Explanation of basic API methods (run/cancel/check status)
  • How to write a new 'MistJob' and run it (maybe publish samples for different project types: sbt, maven, python)
  • Configuration explanation and examples for cluster mode

[question]

What differentiates you from simply using something like spark-jobserver?
Do you plan to integrate A/B testing similar to http://pipeline.io/ or seldon.io or h2o's Steam platform?

Running Mist on Mapr Yarn

To run Mist on YARN, I configured configs/default.conf:
mist.context-defaults.spark-conf = {
spark.master = "yarn-client"
}

But I get an exception:

Exception in thread "main" org.apache.spark.SparkException: Found both spark.driver.extraJavaOptions and SPARK_JAVA_OPTS. Use only the former

When I run plain spark-submit (without Mist), I get the following warning:
SPARK_JAVA_OPTS was detected (set to ' -Dhadoop.login=hybrid -Dmapr_sec_enabled=true').
This is deprecated in Spark 1.0+.

As I am using a MapR cluster, the configuration setup might be using SPARK_JAVA_OPTS internally, and changing this might have repercussions.

One way to resolve the issue is to avoid setting spark.driver.extraJavaOptions in the Mist flow. Can you point out where this setting is configured, or help out with an alternate solution?

Problem with examples

I try to follow along with the examples:

curl --header "Content-Type: application/json" -X POST http://localhost:2003/jobs --data '{"path": "./examples/target/scala-2.11/mist_examples_2.11-0.0.2.jar", "className": "SimpleSQLContext$", "parameters": {"file": "/path_to_mist/examples/resources/SimpleSQLContextData.json"}, "namespace": "fooB"}'

but get a class-not-found exception. What is wrong? (I am executing this in the provided Docker environment.)

In the documentation you say that "the example jar would be available in the docker container". But neither ./examples/target/scala-2.11/mist_examples_2.11-0.0.2.jar nor /path_to_mist/examples/resources/SimpleSQLContextData.json seems to be in the container.

How should I upload the jar? Is this somehow similar to spark-jobserver?

Separate logs between master/workers/job running

  • Find a way to separate logs produced by job runs from worker logs
  • Collect all logs on the master (and expose them via the REST API)
  • mist-lib: replace the current Publisher with a Logger that works like an ordinary logger and publishes its records via async publishers / websocket (see the sketch below)
  • Provide a fix for #172
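
A hypothetical shape for that Logger (illustrative names only; the real mist-lib types may differ):

// Behaves like an ordinary logger from the job's point of view, but forwards
// each record to an async publisher (REST/websocket) so the master can
// collect the logs and expose them.
trait JobLogger {
  def publish(level: String, message: String): Unit

  def info(message: String): Unit  = publish("INFO", message)
  def warn(message: String): Unit  = publish("WARN", message)
  def error(message: String): Unit = publish("ERROR", message)
}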

Wrong mapping of input parameters

Hi @dos65 ,
I found a bug in this library. When I send a few string parameters to the job, the Mist server job executor maps them incorrectly.

My code is as follows:

def execute(FromDate:String,ToDate:String,query:String,rows:Int,Separator:String): Map[String,Any] = {
	println("fromdate: " + FromDate)
	println("todate: " + ToDate)
	println("query: " + query)
	println("rows: " + rows)
	println("separator: " + Separator)
	
	return Map("success" -> true)
}

My input:

{
	"FromDate": "2017-05-17",
	"ToDate": "2017-05-18",
	"query": "select * from tablename",
	"rows": 5,
	"Separator": ";"
}

And the output in the mist log shows:

fromdate: 2017-05-17
todate: select * from tablename
query: 2017-05-18
rows: 5
separator: ;

It mapped correctly when I sent only 2 string parameters. Please look into this issue!

Spark settings for a namespace via REST/CLI

Does Mist support providing Spark config and other settings per namespace via a REST HTTP request or the CLI? From the documentation, we can provide settings for a namespace in the Mist config file; if a namespace setting doesn't exist in the config file then the default context settings are used. However, I want to provide these configurations dynamically via REST/CLI. Is that possible?

Working on my first "own" mist job

I want to write my own first Mist job to get more familiar with Mist. I set up a git repo, https://github.com/geoHeil/apache-spark-restAPI-example, which hopefully can serve as a sort of tutorial for Mist once my questions are answered ; )

I would like to be able to debug the Spark job / run it locally as well. https://github.com/geoHeil/apache-spark-restAPI-example/blob/master/src/main/scala/SimpleContext.scala shows the current status. For now, it is unclear to me:

  • whether Mist supports async collect / handles promises
  • how a SparkSession should be instantiated locally to debug / test-run the Mist job. Would you consider the workaround of defining override def runTheJOb(session: SparkSession, parameters: Map[String, Any]): Map[String, Any] correct / good style?
  • why you require an absolute path (or one relative to ./bin/mist) in
    curl --header "Content-Type: application/json" -X POST http://localhost:2003/jobs --data '{"path": "absolutePathToJar", "className": "SimpleContext$", "parameters": {"digits": [1, 2, 3, 4, 5, 6, 7, 8, 9, 0]}, "namespace": "foo"}' When I try to set the jar to ~/Downloads/my.jar it is no longer found
  • why it is mandatory to set a path when calling the API via REST; I thought that setting it in the config was enough:
simple-context = {
  path = "/my/absolute/path/target/scala-2.11/sparkmist_2.11-0.0.1.SNAPSHOT.jar"
  className = "SimpleContext$"
  namespace = "foo"
}

Am I missing something here? If I do not explicitly set a path in the request, I get the example path again:

curl --header "Content-Type: application/json" -X POST http://localhost:2003/api/simple-context --data '{"digits": [1, 2, 3, 4, 5, 6, 7, 8, 9, 0]}'
{"success":false,"payload":{},"errors":["io.hydrosphere.mist.jobs.JobFile$NotFoundException"],"request":{"path":"./examples/target/scala-2.11/mist_examples_2.11-0.0.2.jar","className":"SimpleContext$","namespace":"foo","parameters":{"digits":[1,2,3,4,5,6,7,8,9,0]}}}geoHeil:mistSample geoHeil$ 

support for maven/nexus artifacts and scheduling

  1. Need to ask if there is support for pulling jars from Maven/Nexus using the Spark packages convention
    groupid:artifactid:version
  2. Can you point me to documentation about how scheduling works when we are submitting jobs through different applications? Let's say an analysis job should run only after an ETL job is completed, and likewise an aggregation job only after the analysis job. How will resource allocation happen in that case?
  3. Spark context orchestration: can you point me to more documentation/examples?

Any plans for supporting Java also?

Regards, and lots of thanks for the awesome work.
