
hydrospheredata / mist

326 stars · 41 watchers · 67 forks · 10.2 MB

Serverless proxy for Spark clusters

Home Page: http://hydrosphere.io/mist/

License: Apache License 2.0

Scala 95.32% Shell 0.80% Python 3.08% Java 0.04% Groovy 0.77%
apache-spark api big-data serverless

mist's Introduction


Hydrosphere Mist

Join the chat at https://gitter.im/Hydrospheredata/mist

Hydrosphere Mist is a serverless proxy for Spark clusters. Mist provides a new functional programming framework and deployment model for Spark applications.

Please see our quick start guide and documentation.

Features:

  • Spark Function as a Service. Deploy Spark functions rather than notebooks or scripts.
  • Spark Cluster and Session management. Fully managed Spark sessions backed by on-demand EMR, Hortonworks, Cloudera, DC/OS and vanilla Spark clusters.
  • Typesafe programming framework that clearly defines inputs and outputs of every Spark job.
  • REST HTTP & Messaging (MQTT, Kafka) API for Scala & Python Spark jobs.
  • Multi-cluster mode: seamless on-demand Spark cluster provisioning, autoscaling and termination (pending), i.e. a "cluster of Spark clusters".

Mist creates a unified API layer for building enterprise solutions and microservices on top of Spark functions; a minimal function is sketched below.
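
For illustration, a minimal Mist function modelled on the SimpleContext example used in the issues below (trait names vary between Mist versions, so treat this as a sketch rather than the current API):

import io.hydrosphere.mist.lib.{MistJob, SQLSupport}

object SimpleContext extends MistJob with SQLSupport {
  // `session` is the managed SparkSession provided by SQLSupport.
  def execute(digits: Seq[Int]): Map[String, Any] = {
    val sess = session          // bind to a stable val before importing implicits
    import sess.implicits._
    val df = digits.toDF("number")
    Map("result" -> df.map(_.getInt(0) * 3).collect())
  }
}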

Mist use cases

High Level Architecture


Contact

Please report bugs/problems to: https://github.com/Hydrospheredata/mist/issues.

http://hydrosphere.io/

LinkedIn

Facebook

Twitter

mist's People

Contributors

andreynenashev, anselmevignon, asaushkin, barbara615, blvp, denissimon, dos65, esmeneev, gitter-badger, icekhan13, ingvarch, kineticcookie, leonid133, maksimtereshin, mkf-simpson, mstolbov, peanig, pyct, spushkarev, stevencasey, tidylobster, werowe, zajs


mist's Issues

Http doesn't serve `ui` paths on scala 2.10

logs from DebugDirectives:

[INFO] [03/09/2017 15:09:48.468] [mist-akka.actor.default-dispatcher-2] [ActorSystem(mist)] Client ReST: Response for
  Request : HttpRequest(HttpMethod(GET),http://localhost:2004/ui/,List(Host: localhost:2004, Connection: keep-alive, Upgrade-Insecure-Requests: 1, User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/56.0.2924.76 Chrome/56.0.2924.76 Safari/537.36, Accept: text/html, application/xhtml+xml, application/xml;q=0.9, image/webp, */*;q=0.8, Accept-Encoding: gzip, deflate, sdch, br, Accept-Language: ru-RU, ru;q=0.8, en-US;q=0.6, en;q=0.4),HttpEntity.Strict(none/none,ByteString()),HttpProtocol(HTTP/1.1))
  Response: Rejected(List(TransformationRejection(<function1>), TransformationRejection(<function1>)))

starting mist locally

I have a problem executing Mist locally.

steps performed:

git clone this repository 
cd mist
git checkout v0.6.5
sbt -DsparkVersion=2.0.2 assembly 
./bin/mist start master
[INFO] [11/30/2016 11:14:15.847] [main] [akka.remote.Remoting] Starting remoting
[INFO] [11/30/2016 11:14:16.062] [main] [akka.remote.Remoting] Remoting started; listening on addresses :[akka.tcp://[email protected]:2551]
[INFO] [11/30/2016 11:14:16.082] [main] [akka.cluster.Cluster(akka://mist)] Cluster Node [akka.tcp://[email protected]:2551] - Starting up...
[INFO] [11/30/2016 11:14:16.194] [main] [akka.cluster.Cluster(akka://mist)] Cluster Node [akka.tcp://[email protected]:2551] - Registered cluster JMX MBean [akka:type=Cluster]
[INFO] [11/30/2016 11:14:16.194] [main] [akka.cluster.Cluster(akka://mist)] Cluster Node [akka.tcp://[email protected]:2551] - Started up successfully
[INFO] [11/30/2016 11:14:16.210] [mist-akka.actor.default-dispatcher-4] [akka.cluster.Cluster(akka://mist)] Cluster Node [akka.tcp://[email protected]:2551] - Metrics will be retreived from MBeans, and may be incorrect on some platforms. To increase metric accuracy add the 'sigar.jar' to the classpath and the appropriate platform-specific native libary to 'java.library.path'. Reason: java.lang.ClassNotFoundException: org.hyperic.sigar.Sigar
[INFO] [11/30/2016 11:14:16.216] [mist-akka.actor.default-dispatcher-4] [akka.cluster.Cluster(akka://mist)] Cluster Node [akka.tcp://[email protected]:2551] - Metrics collection has started successfully
[INFO] [11/30/2016 11:14:16.251] [mist-akka.actor.default-dispatcher-3] [akka.cluster.Cluster(akka://mist)] Cluster Node [akka.tcp://[email protected]:2551] - Node [akka.tcp://[email protected]:2551] is JOINING, roles []
16/11/30 11:14:16 INFO Master$: 0
16/11/30 11:14:16 INFO Master$: 2551
[INFO] [11/30/2016 11:14:16.277] [mist-akka.actor.default-dispatcher-3] [akka.cluster.Cluster(akka://mist)] Cluster Node [akka.tcp://[email protected]:2551] - Leader is moving node [akka.tcp://[email protected]:2551] to [Up]

But Mist is not available on the default localhost port 2003. After checking the configuration I found out that the default configuration and the Docker configuration differ.

curl --header "Content-Type: application/json" -X POST http://localhost:2004/api/simple-context --data '{"digits": [1, 2, 3, 4, 5, 6, 7, 8, 9, 0]}'
curl: (7) Failed to connect to localhost port 2003: Connection refused

Is there a minimal configuration available somewhere to get started quickly (locally)?
Can I specify a configuration on startup of mist?

Exception in thread "Thread-1" java.lang.ExceptionInInitializerError
	at io.hydrosphere.mist.master.WorkerManager$$anon$1.run(WorkerManager.scala:37)
Caused by: com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'runner'
	at com.typesafe.config.impl.SimpleConfig.findKeyOrNull(SimpleConfig.java:152)
	at com.typesafe.config.impl.SimpleConfig.findOrNull(SimpleConfig.java:170)
	at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:184)
	at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:189)
	at com.typesafe.config.impl.SimpleConfig.getString(SimpleConfig.java:246)
	at io.hydrosphere.mist.MistConfig$Workers$.<init>(MistConfig.scala:110)
	at io.hydrosphere.mist.MistConfig$Workers$.<clinit>(MistConfig.scala)

Add Artifactory as a repository to pull Mist jobs from

Currently we can pull and execute jobs from the local file system and from HDFS.
It would be great to support Artifactory, an industry standard for enterprise-grade deployments.

Basically it is required to implement the JobFile interface; a hypothetical sketch follows below.

Testing could be done the same way as it is done for HDFSJobFile: creating a Docker container with Artifactory and test jars/Python files.

@mozinrat - FYI
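
A hypothetical sketch of the download side of such a resolver (the real JobFile interface in this codebase may differ; class and method names here are illustrative only):

import java.io.File
import java.net.URL
import java.nio.file.{Files, Paths, StandardCopyOption}

// Hypothetical resolver: fetches a job artifact from an Artifactory HTTP
// endpoint into a local file, mirroring what HDFSJobFile does for HDFS.
class ArtifactoryJobFile(baseUrl: String, artifactPath: String) {
  def fetch(targetDir: String): File = {
    val url = new URL(s"$baseUrl/$artifactPath")
    val target = Paths.get(targetDir, Paths.get(artifactPath).getFileName.toString)
    val in = url.openStream()
    try Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING)
    finally in.close()
    target.toFile
  }
}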

is notebook part of mist?

Hi,
I am unable to find the notebook module in the repository. The Hydrosphere GitHub link for the notebook does not show anything related to it. Do you have any plans to release the notebook on GitHub?
Will it be a part of Mist?
Thanks

None of the URLs are loading when LoadComplete > Record user scenario is turned on

Hi,

I just installed the LoadComplete 4.50 tool for load testing of a web application (HTTP). I clicked on "Record User Scenario" and IE11 launched in private mode. The recording window with the red light turned on opened. I entered the URL in the address bar and pressed the Enter key. A "This page can't be displayed" error is displayed.

I noticed the same issue with other browsers, Chrome and Firefox. I tried different cache modes like "Clear Cache" and "Maintain Cache (Not recommended)" and still get the "This page can't be displayed" error.

I tried with "https://www.google.com/". The same "This page can't be displayed" error is shown.

Can anyone help me solve this issue?
My chrome version is 54.0.2840.99 m
My firefox version is 47.0.2

import session.implicits._ doesn't work

Spark Version: 2.0.2
Scala Version: 2.11.8

When I tried a simple example importing session.implicits._, I got the following error during the sbt build.

My code:

import io.hydrosphere.mist.lib.{MistJob, SQLSupport}
import org.apache.spark.sql._

object SimpleContext extends MistJob with SQLSupport{
  /** Contains implementation of spark job with ordinary [[org.apache.spark.SparkContext]]
    * Abstract method must be overridden
    *
    * based on https://github.com/Hydrospheredata/mist/blob/master/examples/src/main/scala/SimpleContext.scala
    *
    * @param digits user parameter digits
    * @return result of the job
    */
 import session.implicits._
 def execute(digits:Seq[Int]): Map[String, Any] = {
        val mydf = digits.toSeq.toDF("number")
        Map("result" -> mydf.map(x => x.getInt(0) * 3).collect())
  }
}

Error:

[info] Loading project definition from /home/user/mist/apache-spark-restAPI-example/project
[info] Set current project to sparkMist (in build file:/home/user/mist/apache-spark-restAPI-example/)
[info] Compiling 1 Scala source to /home/user/mist/apache-spark-restAPI-example/target/scala-2.11/classes...
[error] /home/user/mist/apache-spark-restAPI-example/src/main/scala/SimpleContext.scala:13: stable identifier required, but SimpleContext.this.session.implicits found.
[error]  import session.implicits._
[error]                 ^
[error] one error found
[error] (compile:compileIncremental) Compilation failed

Please help me with how I can import session.implicits._.

Thanks..!!
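
The error is a general Scala constraint rather than a Mist-specific bug: an import requires a stable identifier, and session here resolves through SimpleContext.this. A common workaround is to bind the session to a val first:

def execute(digits: Seq[Int]): Map[String, Any] = {
  val sess = session        // stable identifier
  import sess.implicits._   // now compiles
  val mydf = digits.toDF("number")
  Map("result" -> mydf.map(x => x.getInt(0) * 3).collect())
}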

Driver Fault Tolerance Implementation - Mist

@nphadke2 and I were working on adding driver fault tolerance to Apache Spark for batch processing, but Mist seems to have solved that problem. We had a few questions about how Mist does it.

  1. The README says "self healing after driver program failure"

    • a. What does this entail?

    • b. How does Mist recognize that a driver has failed? Does it use some sort of heartbeat to ensure that the driver is alive and running?

    • c. What happens after the driver program and/or driver machine fails? Does a new driver get instantiated, and if so, how is a machine selected to host the driver program? Does the new driver read the state which was recorded to get back up to speed or is the whole application restarted from the beginning?

    • d. If job state is persisted, does this mean that Mist periodically pulls or gets updates about job state from the driver and stores it locally?

    • e. Is the restart of the driver program made transparent to the workers? Any guidance on where in the source code this is implemented would be much appreciated.

  2. This document explains how you can have multiple Spark contexts running Spark jobs in parallel using the same group of worker machines.

    • a. Can you configure the second driver in such a way that it executes the same jobs as the first driver if the first driver fails?
    • b. Would this involve periodically sharing job state between driver one and driver two? I am asking because if this can be done, the turnaround time for continuing the job after the first failure could be significantly reduced.
  3. To test the fault tolerance of the utilities Mist provides, we simulated a long-running job by adding artificial sleep statements. During this sleep period, we printed out the IP addresses of the machine running the application using two alternate suggestions: https://stackoverflow.com/questions/166506/finding-local-ip-addresses-using-pythons-stdlib/166589#166589 and https://stackoverflow.com/questions/166506/finding-local-ip-addresses-using-pythons-stdlib/166520#166520. Yet we always get an IP address of 172.17.0.2, which seems to be the IP address of the Docker container.

    • a. Are we to conclude that the job is being run within the Docker container and not on a Spark worker?
    • b. While this application was sleeping, we killed the master to see how Mist reacts to the killing of the driver program. However, our print statements kept printing with the IP address listed above. This suggests, we believe, that the application/job is not being run on the Apache Spark cluster but in the Docker container. Are our conclusions correct?

Kill a MistJob using rest call

How to kill a running job using rest API call?

And a second thing (which is not related to this): is a new JVM started per SparkContext?

fast model evaluation

You mentioned that you are working on a high throughput API for mist.
Maybe https://github.com/combust-ml/mleap is helpful.

Synchronous real-time API for high throughput: we are working on adding model serving support for online queries with low latency.

Dependency not available for version 0.8.0 in mist Maven repo.

When I try to include the dependency for mist using

<dependency>
    <groupId>io.hydrosphere</groupId>
    <artifactId>mist</artifactId>
    <version>0.8.0</version>
</dependency>

it fails with unresolved dependency: io.hydrosphere#mist;0.8.0: not found.

Can anybody look into this? Or is there another way to download the dependency?

S3 Jobs Resolver

Add the ability to pull Spark jobs from S3.
See the existing implementations of MavenArtifactResolver & HDFSResolver; a hypothetical sketch follows below.
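
A hedged sketch of the download side, using the AWS SDK for Java (the resolve signature is illustrative; a real implementation should mirror MavenArtifactResolver / HDFSResolver):

import java.io.File
import java.nio.file.{Files, Paths, StandardCopyOption}
import com.amazonaws.services.s3.AmazonS3ClientBuilder

// Hypothetical S3 resolver: downloads s3://bucket/key into targetDir.
class S3Resolver(bucket: String, key: String) {
  def resolve(targetDir: String): File = {
    val s3 = AmazonS3ClientBuilder.defaultClient()
    val target = Paths.get(targetDir, Paths.get(key).getFileName.toString)
    val in = s3.getObject(bucket, key).getObjectContent
    try Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING)
    finally in.close()
    target.toFile
  }
}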

Java support for MistJobs

Currently there are Scala and Python APIs.
It would be great to support Java as well.

There are 2 types of Mist jobs:

  1. Heavy Spark jobs extended from MistJob
  2. Realtime MLMistJob

Java MistJob API sub-tasks and challenges:

  1. Implement a Java mirror of the MistJob trait and deal with multiple inheritance for Java
  2. Implement a Java version of the typed parameters mapper. See ExternalJar

Java MLMistJob API sub-tasks and challenges:

  1. Port Scala implicits from LocalTransformers into Java API
  2. Create Java API for LocalData

@mozinrat - FYI - any thoughts and contributions are welcome!

Can't get ML Example to Work

Hi, I'm following the examples from: https://github.com/Hydrospheredata/mist/blob/master/docs/use-cases/ml-realtime.md and I cannot get the job/routes to show up in the Mist HTTP UI. Here's what I'm doing:

  1. Starting the Docker container (from the project root)
  2. Building the examples with sbt "project mistLibSpark2" package (the build is successful)
  3. In ./configs/ I have the following:

jar_path = "./mist-lib-spark2/target/scala-2.11/mist-lib-spark2_2.11-0.11.0.jar" //I've also tried variations of ./ and /

genericModel = {
path = ${jar_path}
className = "DTreeClassificationJob$"
namespace = "foo"
}

And... nothing shows up in the Router, Jobs, Settings, etc. UI. I'm running on Windows, too.

Thank you!

Problem running mist on own Apache Spark Cluster

Hi,

I am trying to run a Mist Job on a custom Apache Spark cluster and I am getting the following error:

17/01/31 00:52:38 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[appclient-registration-retry-thread,5,main] java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.FutureTask@6b9689d rejected from java.util.concurrent.ThreadPoolExecutor@38f018f4[Running, pool size = 1, active threads = 0, queued tasks = 0, completed tasks = 1]
    at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047)
    at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
    at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
    at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
    at org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1.apply(AppClient.scala:96)
    at org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1.apply(AppClient.scala:95)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
    at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
    at org.apache.spark.deploy.client.AppClient$ClientEndpoint.tryRegisterAllMasters(AppClient.scala:95)
    at org.apache.spark.deploy.client.AppClient$ClientEndpoint.org$apache$spark$deploy$client$AppClient$ClientEndpoint$$registerWithMaster(AppClient.scala:121)
    at org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anon$2$$anonfun$run$1.apply$mcV$sp(AppClient.scala:132)
    at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1119)
    at org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anon$2.run(AppClient.scala:124)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

These are the steps I followed:

  1. Set up a small Apache Spark cluster of 3 machines using Spark version 2.1.0.
  2. Created an image using the Dockerfile after compiling Mist and specifying Spark version 2.1.0.
  3. Set the master URL in docker.conf to the IP and port of my Apache Spark cluster master.
  4. Created a sample application using the examples given in the Mist directory and created an endpoint using router.conf.
  5. Did the following POST request: curl --header "Content-Type: application/json" -X POST http://<IP:PORT>/api/log-search --data '{"digits":5}'

I would appreciate any help on this as I am trying to test whether Mist provides Driver fault tolerance for Spark Batch Processing jobs.

Please let me know if I can provide any more information.

Thank you

Check spark version compatibility on context initialization

If Mist is started on the wrong Spark version, the error is not clear:

Exception in thread "main" java.lang.NoSuchMethodError: scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
    at akka.remote.serialization.MiscMessageSerializer.<init>(MiscMessageSerializer.scala:71)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$2.apply(DynamicAccess.scala:78)
    at scala.util.Try$.apply(Try.scala:161)
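
A sketch of the kind of fail-fast check this issue asks for (illustrative only: assertSparkVersion and expectedVersion are hypothetical names, and a severe Scala binary mismatch may still fail before any check runs):

// Compare the Spark version Mist was built against with the version actually
// on the classpath, and fail with a readable message instead of an obscure
// NoSuchMethodError later on.
def assertSparkVersion(expectedVersion: String): Unit = {
  val actual = org.apache.spark.SPARK_VERSION
  if (!actual.startsWith(expectedVersion))
    sys.error(s"Mist was built for Spark $expectedVersion, but Spark $actual is on the classpath")
}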

question regarding session in examples

In the MLClassification example, line 9 uses the session object:

val training = session.createDataFrame(Seq(

Where can I find out more about this object? IntelliJ tells me that it is not available.

how to run

I am using Spark on a MacBook, installed via brew.

What is the problem? SPARK_HOME seems to be set:

sudo ./bin/mist start master
SPARK_HOME is not set
t201-034:mist geoHeil$ echo $SPARK_HOME
/usr/local/Cellar/apache-spark/2.0.2
t201-034:mist geoHeil$ sudo echo $SPARK_HOME
/usr/local/Cellar/apache-spark/2.0.2

Not able to detect custom (datasource) classpath in spark-sql with mist

I have a custom data source which I use in sqlContext.read.format("c##.###.bigdata.fileprocessing.sparkjobs.fileloader.DataSource"). This works well via spark-submit in local & yarn mode. But when I invoke the same job via Mist, it throws the following exception:

Failed to find data source: c.p.b.f.sparkjobs.fileloader.DataSource. Please find packages at http://spark-packages.org

Curl Command invoked:
curl --header "Content-Type: application/json" -X POST http://localhost:2004/api/fileProcess --data '{"path": "/root/fileprocessing/bigdata-fileprocessing-all.jar", "className": "c##.###.bigdata.fileprocessing.util.LoaderMistApp$", "parameters": {"configId": "FILE1"}, "namespace": "foo"}'

bigdata-fileprocessing-all.jar has the DataSource class.

I added the following in router.conf:
fileProcess = {
path = ${fp_path}
className = "c##.###.bigdata.fileprocessing.util.LoaderMistApp$"
namespace = "foo"
}

Mist Version: 0.10.0
Spark Version: 1.6.1

Error Stacktrace:
mist_stacktrace.txt

Introduce new rest api

We are now working on a new REST API. The main goal is to build a new UI on top of it and to make it comfortable to work with long-running jobs via HTTP.

Glossary changes:

  • route is now endpoint: an entity configured to run jobs (artifact path, className)
  • job is now the fact of an endpoint execution

Short overview:

  • Endpoints:

    • get all GET /v2/api/endpoints
    • get by id GET /v2/api/endpoints/{id}
    • start job POST /v2/api/endpoints/{id}
      DATA: job arguments { "arg-name": "arg-value", ... }
    • job history GET /v2/api/endpoints/{id}/jobs
  • Jobs:

    • get info by id GET /v2/api/jobs/{id}
  • Workers:

    • get all GET /v2/api/workers
    • stop DELETE /v2/api/workers/{id}

Dev branch: feature/new_api. A client-side sketch of these endpoints is shown below.
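
For illustration, a minimal Scala client for the endpoints above (a JDK-only sketch; the host, endpoint id and arguments are placeholders):

import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets

// Sketch: POST /v2/api/endpoints/{id} with job arguments as a JSON object,
// returning the raw response body.
def startJob(host: String, endpointId: String, argsJson: String): String = {
  val conn = new URL(s"$host/v2/api/endpoints/$endpointId")
    .openConnection().asInstanceOf[HttpURLConnection]
  conn.setRequestMethod("POST")
  conn.setDoOutput(true)
  conn.setRequestProperty("Content-Type", "application/json")
  conn.getOutputStream.write(argsJson.getBytes(StandardCharsets.UTF_8))
  scala.io.Source.fromInputStream(conn.getInputStream).mkString
}

// e.g. startJob("http://localhost:2004", "simple-context", """{"digits": [1, 2, 3]}""")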

spark.executor.cores setting has no effect while creating spark context

With the config below:

mist.context-defaults.spark-conf = {
spark.master = "spark://host:7077"
spark.executor.cores = 2
spark.executor.memory = "512m"
}

Even though the executor cores are set to 2, the job occupies all the available cores on the Spark cluster; it looks like this setting is not taking effect.
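
One general Spark standalone behaviour worth ruling out here (a hedged suggestion, not a confirmed diagnosis): spark.executor.cores only caps cores per executor, while the total number of cores a standalone application claims is governed by spark.cores.max; without it the application takes all available cores. Expressed as SparkConf settings:

import org.apache.spark.SparkConf

// spark.executor.cores caps cores per executor; spark.cores.max caps the
// total cores the application may take from a standalone cluster.
val conf = new SparkConf()
  .setMaster("spark://host:7077")
  .set("spark.executor.cores", "2")
  .set("spark.cores.max", "2") // without this, standalone mode grants all cores

// The equivalent keys can be set under mist.context-defaults.spark-conf.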

Mist throwing exception for drill queries

My jobs with any Drill query are not working with Mist, and I keep getting the following exception. The same job works fine with spark-submit.

17/06/15 05:59:04 WARN ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.lang.NoClassDefFoundError: oadd/org/apache/log4j/Logger
at oadd.org.apache.zookeeper.Login.(Login.java:44)
at oadd.org.apache.zookeeper.client.ZooKeeperSaslClient.createSaslClient(ZooKeeperSaslClient.java:226)
at oadd.org.apache.zookeeper.client.ZooKeeperSaslClient.(ZooKeeperSaslClient.java:131)
at oadd.org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:949)
at oadd.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1003)
17/06/15 05:59:05 WARN ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect

Here's a sample Drill query job I wrote which works fine with spark-submit:

import java.sql.DriverManager

def execute(): Map[String, Any] = {
  // Load the Drill JDBC driver and open a connection
  Class.forName("org.apache.drill.jdbc.Driver")
  val connection = DriverManager.getConnection("jdbc:drill:zk=localhost:5181/drill/demo_mapr_com-drillbits;schema=dfs", "root", "mapr")
  val statement = connection.createStatement()
  val query = "select * from dfs.tmp.`employees` limit 10"
  val resultSet = statement.executeQuery(query)
  var list: List[String] = List()

  while (resultSet.next()) {
    println(resultSet.getString(1))
    list = list ++ List(resultSet.getString(1))
  }
  Map("result" -> list)
}

Also attached is the response I get from the Mist API.

response.txt

Same Job submitted twice

@dos65
Hi,
When I send an API request to the Mist server from the Mist UI, it submits the same job twice (with the same ID) to the Spark cluster. I can see this in the Spark cluster UI, although the result is returned from the first job execution.

Below is the screenshot attached:
spark-ui

Please check this..!!
Thanks

Reconsider `streaming job` feature

The current state of jobs with StreamingSupport is that they support only the JSON format for publishing. They are also limited in broker and topic selection (they can use only the one configured on the Mist server).
This does not seem convenient for end-users: we cannot predict how they will want to publish events, nor should we force them to use only Mist's available configuration.

Maybe it would be nice to do a mist-lib-streaming with an MQTT/Kafka producer for Spark? A sketch of the idea follows below.
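
For illustration, the producer side of such a mist-lib-streaming could be as small as this sketch with the plain Kafka client (class and parameter names are hypothetical):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// A user-owned publisher: the job decides the broker, topic and format,
// instead of being limited to the single broker configured on the Mist server.
class JobEventPublisher(brokers: String, topic: String) {
  private val props = new Properties()
  props.put("bootstrap.servers", brokers)
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  private val producer = new KafkaProducer[String, String](props)

  def publish(payload: String): Unit =
    producer.send(new ProducerRecord[String, String](topic, payload))

  def close(): Unit = producer.close()
}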

@spushkarev @mkf-simpson ping

Logging mist API calls

How can I log all Mist API calls to a rolling log file? Right now they are printed to the console log.

Please help me..!!

UNRESOLVED DEPENDENCIES of io.hydrosphere#mist-api-spark2;0.11.0

Hi,

I am using Spark 2.0.2. I have included

libraryDependencies += "io.hydrosphere" % "mist-api-spark2" % "0.11.0"

and sbt couldn't find the dependency.

Error:

[warn] 	module not found: io.hydrosphere#mist-api-spark2;0.11.0
[warn] ==== local: tried
[warn]   /home/ravi/.ivy2/local/io.hydrosphere/mist-api-spark2/0.11.0/ivys/ivy.xml
[warn] ==== public: tried
[warn]   https://repo1.maven.org/maven2/io/hydrosphere/mist-api-spark2/0.11.0/mist-api-spark2-0.11.0.pom
[info] Resolving jline#jline;2.12.1 ...
[warn] 	::::::::::::::::::::::::::::::::::::::::::::::
[warn] 	::          UNRESOLVED DEPENDENCIES         ::
[warn] 	::::::::::::::::::::::::::::::::::::::::::::::
[warn] 	:: io.hydrosphere#mist-api-spark2;0.11.0: not found
[warn] 	::::::::::::::::::::::::::::::::::::::::::::::
[warn] 
[warn] 	Note: Unresolved dependencies path:
[warn] 		io.hydrosphere:mist-api-spark2:0.11.0 (/home/ravi/Work/build.sbt#L17-18)
[warn] 		  +- com.ravi:mistexample_2.11:0.1
sbt.ResolveException: unresolved dependency: io.hydrosphere#mist-api-spark2;0.11.0: not found
	at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:313)
	at sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:191)
	at sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:168)
	at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:156)
	at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:156)
	at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:133)
	at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:57)
	at sbt.IvySbt$$anon$4.call(Ivy.scala:65)
	at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:93)
	at xsbt.boot.Locks$GlobalLock.xsbt$boot$Locks$GlobalLock$$withChannelRetries$1(Locks.scala:78)
	at xsbt.boot.Locks$GlobalLock$$anonfun$withFileLock$1.apply(Locks.scala:97)
	at xsbt.boot.Using$.withResource(Using.scala:10)
	at xsbt.boot.Using$.apply(Using.scala:9)
	at xsbt.boot.Locks$GlobalLock.ignoringDeadlockAvoided(Locks.scala:58)
	at xsbt.boot.Locks$GlobalLock.withLock(Locks.scala:48)
	at xsbt.boot.Locks$.apply0(Locks.scala:31)
	at xsbt.boot.Locks$.apply(Locks.scala:28)
	at sbt.IvySbt.withDefaultLogger(Ivy.scala:65)
	at sbt.IvySbt.withIvy(Ivy.scala:128)
	at sbt.IvySbt.withIvy(Ivy.scala:125)
	at sbt.IvySbt$Module.withModule(Ivy.scala:156)
	at sbt.IvyActions$.updateEither(IvyActions.scala:168)
	at sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1481)
	at sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1477)
	at sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$121.apply(Defaults.scala:1512)
	at sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$121.apply(Defaults.scala:1510)
	at sbt.Tracked$$anonfun$lastOutput$1.apply(Tracked.scala:37)
	at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1515)
	at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1509)
	at sbt.Tracked$$anonfun$inputChanged$1.apply(Tracked.scala:60)
	at sbt.Classpaths$.cachedUpdate(Defaults.scala:1532)
	at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1459)
	at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1411)
	at scala.Function1$$anonfun$compose$1.apply(Function1.scala:47)
	at sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:40)
	at sbt.std.Transform$$anon$4.work(System.scala:63)
	at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:228)
	at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:228)
	at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:17)
	at sbt.Execute.work(Execute.scala:237)
	at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:228)
	at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:228)
	at sbt.ConcurrentRestrictions$$anon$4$$anonfun$1.apply(ConcurrentRestrictions.scala:159)
	at sbt.CompletionService$$anon$2.call(CompletionService.scala:28)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
[error] (*:update) sbt.ResolveException: unresolved dependency: io.hydrosphere#mist-api-spark2;0.11.0: not found

Do I need to add any resolvers to the build.sbt?

@dos65
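
If the artifact turns out to be cross-published with a Scala suffix, or hosted outside Maven Central, a build.sbt change along these lines may help (the resolver URL is a placeholder, not a real Hydrosphere repository):

// `%%` appends the Scala binary suffix (e.g. _2.11) to the artifact name;
// plain `%` looks for an artifact without a suffix, which is what failed above.
libraryDependencies += "io.hydrosphere" %% "mist-api-spark2" % "0.11.0"

// Placeholder repository, in case the artifact is not on Maven Central:
resolvers += "hydrosphere-releases" at "https://example.org/releases"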

Docs: update getting started pages

Currently that page is not so friendly for new people.
I think the page should contain the parts below:

  • Download and try to run a job from the examples
  • Explanation of basic API methods (run/cancel/check status)
  • How to write a new 'MistJob' and run it (maybe publish samples for different project types: sbt, maven, python)
  • Configuration explanation and examples for cluster mode

[question]

What differentiates you from simply using something like spark-jobserver?
Do you plan to integrate A/B testing similar to http://pipeline.io/ or seldon.io or h2o's Steam platform?

Running Mist on Mapr Yarn

To run Mist on YARN, I configured configs/default.conf:
mist.context-defaults.spark-conf = {
spark.master = "yarn-client"
}

But I get an exception:

Exception in thread "main" org.apache.spark.SparkException: Found both spark.driver.extraJavaOptions and SPARK_JAVA_OPTS. Use only the former

When I run plain spark-submit (without Mist), I get the following warning:
SPARK_JAVA_OPTS was detected (set to ' -Dhadoop.login=hybrid -Dmapr_sec_enabled=true').
This is deprecated in Spark 1.0+.

As I am using a MapR cluster, the configuration setup might be using SPARK_JAVA_OPTS internally, and changing this might have repercussions.

One way to resolve the issue is to avoid setting spark.driver.extraJavaOptions in the Mist flow. Can you point out where this setting is configured, or help out with an alternate solution?

Problem with examples

I try to follow along with the examples:

curl --header "Content-Type: application/json" -X POST http://localhost:2003/jobs --data '{"path": "./examples/target/scala-2.11/mist_examples_2.11-0.0.2.jar", "className": "SimpleSQLContext$", "parameters": {"file": "/path_to_mist/examples/resources/SimpleSQLContextData.json"}, "namespace": "fooB"}'

but get a class-not-found exception. What is wrong? (I am executing this in the provided Docker environment.)

In the documentation you say that "the example jar would be available in the docker container". But neither ./examples/target/scala-2.11/mist_examples_2.11-0.0.2.jar nor /path_to_mist/examples/resources/SimpleSQLContextData.json seems to be in the container.

How should I upload the jar? Is this somehow similar to spark-jobserver?

Separate logs between master/workers/job running

  • Find a way to separate logs produced by job runs from worker logs
  • Collect all logs on the master (and expose them via the REST API)
  • mist-lib: replace the current Publisher with a Logger that works like an ordinary logger and publishes its records via async publishers / websocket (see the sketch below)
  • Provide a fix for #172
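
A hypothetical shape for that Logger (illustrative names only; the real mist-lib types may differ):

// Behaves like an ordinary logger from the job's point of view, but forwards
// each record to an async publisher (REST/websocket) so the master can
// collect the logs and expose them.
trait JobLogger {
  def publish(level: String, message: String): Unit

  def info(message: String): Unit  = publish("INFO", message)
  def warn(message: String): Unit  = publish("WARN", message)
  def error(message: String): Unit = publish("ERROR", message)
}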

Wrong mapping of input parameters

Hi @dos65 ,
I found a bug in this library. When I send a few string parameters to the job, the Mist server job executor maps them incorrectly.

My code is as follows:

def execute(FromDate:String,ToDate:String,query:String,rows:Int,Separator:String): Map[String,Any] = {
	println("fromdate: " + FromDate)
	println("todate: " + ToDate)
	println("query: " + query)
	println("rows: " + rows)
	println("separator: " + Separator)
	
	return Map("success" -> true)
}

My input:

{
	"FromDate": "2017-05-17",
	"ToDate": "2017-05-18",
	"query": "select * from tablename",
	"rows": 5,
	"Separator": ";"
}

And the output in the mist log shows:

fromdate: 2017-05-17
todate: select * from tablename
query: 2017-05-18
rows: 5
separator: ;

It mapped correctly when I sent only 2 string parameters. Please look into this issue!

Spark settings for a namespace via REST/CLI

Does Mist support providing Spark config and other settings per namespace via a REST HTTP request or the CLI? From the documentation, we can provide settings for a namespace in the Mist config file; if a namespace setting doesn't exist in the config file then the default context settings are used. However, I want to provide these configurations dynamically via REST/CLI. Is that possible?

Working on my first "own" mist job

I want to write my own first Mist job to get more familiar with Mist. I set up a git repo, https://github.com/geoHeil/apache-spark-restAPI-example, which hopefully can serve as a sort of tutorial for Mist once my questions are answered ; )

I would like to be able to debug the Spark job / run it locally as well. https://github.com/geoHeil/apache-spark-restAPI-example/blob/master/src/main/scala/SimpleContext.scala shows the current status. For now, it is unclear to me:

  • whether Mist supports async collect / handles promises
  • how a SparkSession should be instantiated locally to debug / test-run the Mist job. Would you consider the workaround of defining override def runTheJOb(session: SparkSession, parameters: Map[String, Any]): Map[String, Any] correct / good style?
  • why you require an absolute path (or one relative to ./bin/mist) in
    curl --header "Content-Type: application/json" -X POST http://localhost:2003/jobs --data '{"path": "absolutePathToJar", "className": "SimpleContext$", "parameters": {"digits": [1, 2, 3, 4, 5, 6, 7, 8, 9, 0]}, "namespace": "foo"}' When I try to set the jar to ~/Downloads/my.jar it is no longer found
  • why it is mandatory to set a path when calling the API via REST; I thought that setting it in the config was enough:
simple-context = {
  path = "/my/absolute/path/target/scala-2.11/sparkmist_2.11-0.0.1.SNAPSHOT.jar"
  className = "SimpleContext$"
  namespace = "foo"
}

Am I missing something here? If I do not explicitly set a path in the request, I get the example path again:

curl --header "Content-Type: application/json" -X POST http://localhost:2003/api/simple-context --data '{"digits": [1, 2, 3, 4, 5, 6, 7, 8, 9, 0]}'
{"success":false,"payload":{},"errors":["io.hydrosphere.mist.jobs.JobFile$NotFoundException"],"request":{"path":"./examples/target/scala-2.11/mist_examples_2.11-0.0.2.jar","className":"SimpleContext$","namespace":"foo","parameters":{"digits":[1,2,3,4,5,6,7,8,9,0]}}}geoHeil:mistSample geoHeil$ 

support for maven/nexus artifacts and scheduling

  1. Need to ask if there is support for pulling jars from Maven/Nexus using the Spark packages convention
    groupid:artifactid:version
  2. Can you point me to documentation about how scheduling works when we are submitting jobs through different applications? Let's say an analysis job should run only after an ETL job is completed, and likewise an aggregation job only after the analysis job. How will resource allocation happen in that case?
  3. Spark context orchestration: can you point me to more documentation/examples?

Any plans for supporting Java also?

Regards, and lots of thanks for the awesome work.
