
big-data-europe / docker-hadoop-spark-workbench


[EXPERIMENTAL] This repo includes deployment instructions for running HDFS/Spark inside docker containers. Also includes spark-notebook and HDFS FileBrowser.

Languages: Shell 42.50%, Makefile 57.50%

docker-hadoop-spark-workbench's People

Contributors: earthquakesan, lvnilesh, nkhare, solarmicrobe


docker-hadoop-spark-workbench's Issues

docker-compose scale spark-worker=3 & spark-submit

According to the tutorial you link in the main README.md of this repository, I first followed the instructions in the README.md and ran
docker-compose up -d
and it worked; but when I then try docker-compose scale spark-worker=3 I receive the error:
ERROR: for dockerhadoopsparkworkbench_spark-worker_2 driver failed programming external connectivity on endpoint dockerhadoopsparkworkbench_spark-worker_2 (079d21c97e12d288aea5246c5eb575f245161c330639fcea35c899056a2e8af2): Bind for 0.0.0.0:8081 failed: port is already allocated
and the same error for worker 3. This happens because the port is already used by the first worker started with docker-compose up -d. Is this an issue in the code, or am I doing something wrong? (I'm new to Docker, sorry.)

Moreover, when I use spark-submit:
/usr/local/spark-2.2.0-bin-hadoop2.7/bin/spark-submit --class uk.ac.ncl.NGS_SparkGATK.Pipeline --master local[*] NGS-SparkGATK.jar HelloWorld
it works, but if I use spark://spark-master:7077 as the master, as suggested in the tutorial referenced in the README.md, I receive the error "Failed to connect to master spark-master:7077".
Which address should I use to submit the Spark job?

I hope my explanation of the problem is clear.
I look forward to your kind response.
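A hedged note on the second question: spark://spark-master:7077 is a hostname that only resolves inside the compose network, so a spark-submit run directly on the host cannot reach it by that name. One workaround is to run spark-submit from a container attached to the same network. A minimal sketch, reusing the image, network, and jar names that appear elsewhere on this page (they may differ in your setup; check docker network ls first):

docker run --rm -it \
  --net dockerhadoopsparkworkbench_default \
  --volume "$(pwd)":/app \
  bde2020/spark-base:2.2.0-hadoop2.8-hive-java8 \
  /spark/bin/spark-submit \
    --class uk.ac.ncl.NGS_SparkGATK.Pipeline \
    --master spark://spark-master:7077 \
    /app/NGS-SparkGATK.jar HelloWorld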

Q: How can you scale out on multiple hosts?

Hello.

This is a very useful setup - well done.
Is it possible to scale out on multiple hosts, for a truly distributed cluster?

If yes, could you provide some info on how to do it?
If I figure it out, I can write it up step by step and add it to the documentation in a PR, if you'd like that.

TaskSchedulerImpl: Initial job has not accepted any resources

Tried running both docker-compose-hive.yml and docker-compose.yml on a 4-core, 8 GB VM running Ubuntu 14.

  • The Spark notebook is always waiting for the kernel to start.
  • Tried connecting to the Spark instance via a Scala program using a Spark driver (Scala 2.10, Spark 1.6.2) on the local machine (port 7077 has to be opened in the docker-compose file first).
    Connected successfully, and the worker is detected on the Spark UI (port 8080).
  • Tried creating a dataset from a text file on the local machine; the worker seems to be running. However, I receive the message "WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources" every 15 seconds.
  • Checked the Spark UI: it shows the worker is running, but the job does not seem to execute. It hangs without any progress. It is the only application in the Spark cluster.

Possibility :

  1. Not enough resources (previously it was 2 cores / 8 GB, and I increased it to 4 cores / 8 GB). I am only running a single worker and a single app, so I think there should be enough resources. I set the Spark conf to run only a single core with 1 GB of RAM per worker.
  2. The Spark worker cannot communicate with the Spark master (the Spark master successfully detects the worker in its UI, but when I try to see the worker details I am redirected to the worker's local IP, which is not accessible from the outside). Could this be the problem?

Connect spark-notebook to spark cluster

Hi,

I'm trying to connect spark-notebook to the Spark cluster. By default it runs the notebooks on a local Spark (the notebook jobs never appear on the Spark master page), and when I try to connect it to the cluster created by the docker-compose file, the kernel dies.

Following spark-notebook's documentation on this, I'm adding the following to the notebook's metadata:

  "customSparkConf": {
    "spark.app.name": "Notebook",
    "spark.master": "spark://spark-master:7077",
    "spark.executor.memory": "1G"
  },

Is there anything else I need to do/add?

MapReduce jobs cause the Namenode to stop

Hi,
we are facing a problem: when we run our MapReduce jobs, the Namenode crashes, so our jobs cannot complete. We kept hadoop.env exactly as it is in this repository; the only thing we changed was to expose port 9000 in docker-compose.yml so we can run our jobs.

We tested the same code in the AWS EMR in production as well as in our local Hadoop installation and the jobs are working fine.

The error we are having is the following:

java.io.EOFException: End of File Exception between local host is: "Administrators-MacBook-Pro-3.local/192.168.1.49"; destination host is: "192.168.99.100":9000; : java.io.EOFException; For more details see:  http://wiki.apache.org/hadoop/EOFException
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
 ...

Once we obtained this error, we tried to reload http://192.168.99.100:50070/dfshealth.html#tab-overview and there was no response, hence, the namenode was not working anymore.

Any suggestions? We have the feeling that there is a misconfiguration somewhere, but we are not sure where.

Thanks!

sparkContext not found in spark notebook

I tried to run the script in core/simple, but I get the following error:
<console>:15: error: not found: value sparkContext
       sparkContext.getConf.toDebugString
       ^
<console>:12: error: not found: value globalScope
       globalScope.sparkContext.setJobGroup("cell-958C08DD7D9E4D528E649879CA2912E4", "run-1510003533026: sparkContext.getConf.toDebugString")
       ^
What is happening here?

I used docker-compose.yml to start up the services.

Copying files to HDFS

Hello Ivan,

First, thanks for your response on Twitter, and for the whole project.

The issue I'm facing: I went through your blog post here:
https://medium.com/@ivanermilov/scalable-spark-hdfs-setup-using-docker-2fd0ffa1d6bf

I created the network and then used the commands in this repo to start my cluster. Commands used:

docker-compose -f docker-compose-hive.yml up -d namenode hive-metastore-postgresql
docker-compose -f docker-compose-hive.yml up -d datanode hive-metastore
docker-compose -f docker-compose-hive.yml up -d hive-server
docker-compose -f docker-compose-hive.yml up -d spark-master spark-worker spark-notebook hue

Now everything is working, and I can confirm it by checking the web interface of each service. The only step where I can't make any progress is copying files. I've tried the following, without success:

FIRST TRY: docker run -it --rm --env-file=../hadoop-hive.env --net hadoop uhopper/hadoop hadoop fs -mkdir -p /user/root
I noticed that uhopper/hadoop may not be from the same cluster, so I tried these instead:
docker run -it --rm --env-file=../hadoop.env --volume $(pwd):/data --net hadoop bde2020/hadoop-namenode:1.1.0-hadoop2.8-java8 hadoop fs -put /data/vannbehandlingsanlegg.csv /user/root

AND
docker run -it --rm --env-file=../hadoop.env --volume $(pwd):/data --net hadoop bde2020/hadoop-namenode:1.1.0-hadoop2.8-java8 fs -put /data/vannbehandlingsanlegg.csv /user/root
docker run -it --rm --env-file=../hadoop-hive.env --volume $(pwd):/data --net hadoop bde2020/hadoop-namenode:1.1.0-hadoop2.8-java8 fs -put /data/vannbehandlingsanlegg.csv /user/root
None of these worked. All give me the same error message:

Configure host resolver to only use files
-mkdir: java.net.UnknownHostException: namenode
Usage: hadoop fs [generic options] -mkdir [-p] <path> ...

Notes:

  • I modified the compose files and added:
    HOST_RESOLVER=files_only
    I also added an entry for the namenode in my /etc/hosts. But still nothing!

Is there something I'm missing here?

Thank you.
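A hedged note on this one: "UnknownHostException: namenode" usually means the helper container did not join the same Docker network as the namenode, so the hostname never resolves. A sketch of one way to check and retry, reusing the file and image names from the commands above (the network name is whatever the inspect command prints):

# Which networks is the namenode actually attached to?
docker inspect -f '{{range $k, $v := .NetworkSettings.Networks}}{{$k}} {{end}}' namenode

# Re-run the copy on that network (replace <network> with a name printed above)
docker run -it --rm --env-file=../hadoop-hive.env --volume "$(pwd)":/data \
  --net <network> \
  bde2020/hadoop-namenode:1.1.0-hadoop2.8-java8 \
  hadoop fs -put /data/vannbehandlingsanlegg.csv /user/root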

Unable to view worker details or Jobs

I recently ran the docker stack on my dev machine (OSX 10.11.5 (15F34)) running Docker Native (Version 1.12.1 (build: 12133)).

I've tried some of the notebooks in the 'core' folder and these appear to work.

I'm now trying to run the Twitter Stream example which appears to be running a Spark Job, but I'm not seeing any output from the unordered list or geo widgets. In fact, they don't seem to be displaying at all.

I'm trying to diagnose the problem, but when I click the 'open SparkUI' button in the notebook, it tries to open http://192.168.0.10:4040/, which fails to load. I'm using Docker native, so all my containers are addressable at localhost. I also couldn't see port 4040 being mapped in the docker-compose file. I tried adding it to the Spark master and worker, but it still fails to load.

I've also found that I can load the Spark master web UI at http://localhost:8080/. This works and I can see one worker. When I try and click on the worker it takes me to http://192.168.0.6:8081/ which fails to load. I've tried adding port 8081 to the docker compose file for both the master and worker, no dice.

So this issue is primarily about being unable to access the worker web UI and SparkUI via the notebook. If the widgets not appearing is a common issue, I'd like to know how to solve that too.

Thanks.

Hue is not working

Hi
I only executed "docker-compose up -d"; I did not make any changes.
Hue (the HDFS FileBrowser) is not working. After login I get this error:

(screenshot of the error)

How can I solve it?

Thanks

There are 1 datanode(s) running and 1 node(s) are excluded in this operation

Hi,
when I try to write a Parquet file to HDFS, I get the issue below:

File /data_set/hello/crm_last_month/2534cb7a-fc07-401e-bdd3-2299e7e657ea.parquet could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and 1 node(s) are excluded in this operation.
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1733)
at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:265)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2496)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:828)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:506)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:845)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:788)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1807)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2455)

    at org.apache.hadoop.ipc.Client.call(Client.java:1475)
    at org.apache.hadoop.ipc.Client.call(Client.java:1412)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
    at com.sun.proxy.$Proxy288.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:418)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
    at com.sun.proxy.$Proxy289.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1455)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1251)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:448)

Hue docker image not available

Greetings,

I tried to run docker-compose up -d and got an ERROR (hue image not found), plus another one:
Unsupported config option for hue service: 'container_name'
This might be because the hue docker image is not available.
Could you please point me to where I can pull the hue image, or, if I use Hue's official image, tell me what corresponding changes I need to make in the code to run the setup?
Kindly suggest.
Thanks.

Question: how to connect from external client (intellij) to spark?

Hi, I looked at http://localhost:8080 and saw that the Spark master is at spark://26a6a4b99956:7077. I'm not sure I was doing the right thing, but I tried creating a Spark conf (in local IntelliJ) with: val conf = new SparkConf().setAppName("myapp").setMaster("spark://26a6a4b99956:7077")

but when I try to run some Spark code I get:

17/12/31 13:10:40 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://26a6a4b99956:7077...
17/12/31 13:10:40 WARN StandaloneAppClient$ClientEndpoint: Failed to connect to master 26a6a4b99956:7077
org.apache.spark.SparkException: Exception thrown in awaitResult: 
	at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)

I tried playing with network_mode: host in the spark subsection, but that didn't work either.

Is there a standard way you know of to start this cluster and then connect to it with Spark code from my client (currently IntelliJ)?

Thanks

Port 8081 binding error when I scale up

➜ docker-hadoop-spark-workbench git:(master) sudo docker-compose up -d
Creating network "dockerhadoopsparkworkbench_default" with the default driver
Creating spark-notebook ...
Creating spark-master ...
Creating dockerhadoopsparkworkbench_hue_1 ...
Creating namenode ...
Creating spark-notebook
Creating spark-master
Creating dockerhadoopsparkworkbench_hue_1
Creating namenode ... done
Creating dockerhadoopsparkworkbench_datanode_1 ...
Creating dockerhadoopsparkworkbench_hue_1 ... done
Creating dockerhadoopsparkworkbench_datanode_1
Creating dockerhadoopsparkworkbench_spark-worker_1 ... done
➜ docker-hadoop-spark-workbench git:(master) sudo docker-compose scale spark-worker=3
Starting dockerhadoopsparkworkbench_spark-worker_1 ... done
Creating dockerhadoopsparkworkbench_spark-worker_2 ...
Creating dockerhadoopsparkworkbench_spark-worker_3 ...
Creating dockerhadoopsparkworkbench_spark-worker_2 ... error
Creating dockerhadoopsparkworkbench_spark-worker_3 ... error

ERROR: for dockerhadoopsparkworkbench_spark-worker_2 Cannot start service spark-worker: driver failed programming external connectivity on endpoint dockerhadoopsparkworkbench_spark-worker_2 (796ee234433b58f170caabee79d04644276ebc04f9eb8509f7ffd0390dca6658): Bind for 0.0.0.0:8081 failed: port is already allocated

ERROR: for dockerhadoopsparkworkbench_spark-worker_3 Cannot start service spark-worker: driver failed programming external connectivity on endpoint dockerhadoopsparkworkbench_spark-worker_3 (925cb4a1c2bfe6132d39b58862dfdd1d5b15feb08f37cdb6070a4432c43da8d1): Bind for 0.0.0.0:8081 failed: port is already allocated
ERROR: Cannot start service spark-worker: driver failed programming external connectivity on endpoint dockerhadoopsparkworkbench_spark-worker_2 (796ee234433b58f170caabee79d04644276ebc04f9eb8509f7ffd0390dca6658): Bind for 0.0.0.0:8081 failed: port is already allocated
➜ docker-hadoop-spark-workbench git:(master)


It is caused by spark-worker_1 having already occupied port 8081, so spark-worker_2 and the other workers fail to bind it.
What should I do now? What is port 8081 used for?
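A hedged note: 8081 is only the Spark worker's web UI port. If docker-compose.yml publishes it with a fixed host-side mapping such as 8081:8081, the host can bind 0.0.0.0:8081 only once, so every additional worker fails exactly as shown above. A sketch of one workaround, assuming you are free to edit the compose file:

# 1. In docker-compose.yml, remove (or make unique per worker) the host-side
#    "8081:8081" mapping for the spark-worker service; the workers still reach
#    the master over the Docker network without a published UI port.
# 2. Scaling then no longer collides on the host port:
docker-compose up -d
docker-compose scale spark-worker=3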

Steps compatible with Windows?

Hi,
Are the steps below compatible with Docker on Windows?

docker network create hadoop
docker-compose up -d
docker-compose scale spark-worker=3

A few issues that I observed:

  1. docker-compose scale did not work for me unless I killed the Docker containers and restarted them.
  2. The "data" directory was not created automatically.
  3. The command below did not work when executed from the data folder:
    docker run -it --rm --env-file=../hadoop.env --volume $(pwd):/data --net hadoop uhopper/hadoop hadoop fs -put vannbehandlingsanlegg.csv /user/root

error in spark notebook

print ("user")

I got the error below:
<console>:12: error: not found: value globalScope
       globalScope.sparkContext.setJobGroup("cell-A0F42CF740B04DD287456255B6E4528E", "run-1534702622531: print ('user')")

General Questions - Multihost, Spark Version and Apache Zeppelin

Hi

I just have some general questions. How do you handle the docker containers on multiple physical hosts?

Is it possible to extend the example to Spark 2.0 and also add Apache Zeppelin as a notebook driver to this repo?

And is it also possible to scale the HDFS datanode?

BR

Why does HDFS run without YARN?

Could you please explain why HDFS is run without YARN? I think YARN is necessary for scheduling Spark jobs.

Thanks,

ClassCastException while running example from BDE2020 blog

I downloaded the data file and ran the code from https://www.big-data-europe.eu/scalable-sparkhdfs-workbench-using-docker/

The code is sc.textFile("/user/wanglei/vannbehandlingsanlegg.csv").count(), but I get an exception like this:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 1.0 failed 4 times, most recent failure: Lost task 1.3 in stage 1.0 (TID 10, 172.18.0.6, executor 0): java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2133)
at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1305)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2237)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2155)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2013)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2231)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2155)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2013)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:80)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1958)
at org.apache.spark.rdd.RDD.count(RDD.scala:1157)
... 63 elided
Caused by: java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2133)
at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1305)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2237)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2155)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2013)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2231)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2155)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2013)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:80)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Spark Worker not connected to Spark Master

I'm sorry for opening so many issues...

I am using the swarm version of HDFS and Spark; in particular, I think it is a version from a few weeks ago. This is the docker-compose-spark.yml file:

version: '3.2'
services:
  spark-master:
    image: vzzarr/spark-master:gatk_env
    networks:
      - core
    deploy:
      replicas: 1
      mode: replicated
      restart_policy:
        condition: on-failure
    ports:
      - 8080:8080
      - 7077:7077
    env_file:
      - ./hadoop-swarm.env
    volumes:
      - /data0/reference/hg19-ucsc/:/reference/hg19-ucsc/
      - /data0/fastq/:/fastq/
      - ../../:/NGS-SparkGATK/
      - /data/ngs/:/ngs/
  spark-worker:
    image: bde2020/spark-worker:2.1.0-hadoop2.8-hive-java8
    networks:
      - core
    environment:
      - SPARK_MASTER=spark://spark-master:7077
    deploy:
      replicas: 5
      #mode: global
      restart_policy:
        condition: on-failure
    env_file:
      - ./hadoop-swarm.env

networks:
  core:
    external: true

As you can see, there are some personal modifications to your original file, due to implementation requirements.

In particular, I am using two Azure VMs for testing, where one is the Swarm manager and the other a Swarm worker.

I would have expected all the Spark workers to connect directly to the Spark master, as explained in this Spark tutorial, which states: "Once you have started a worker, look at the master’s web UI (http://localhost:8080 by default). You should see the new node listed there, along with its number of CPUs and memory (minus one gigabyte left for the OS)".
But using lynx http://localhost:8080 (lynx is a terminal browser) while I am executing a Spark job, I see:

 * URL: spark://475ac7c0cd02:7077
 * REST URL: spark://475ac7c0cd02:6066 (cluster mode)
 * Alive Workers: 0
 * Cores in use: 0 Total, 0 Used
 * Memory in use: 0.0 B Total, 0.0 B Used
 * Applications: 0 Running, 0 Completed
 * Drivers: 0 Running, 0 Completed
 * Status: ALIVE

It seems to me that the Spark master can't see the Spark workers. Moreover, while I can see that the Swarm manager is really "working" (20-30 GB of RAM occupied and all CPUs at 97-100%), the Swarm worker's load is clearly lower (<2 GB of RAM and CPUs at 0-2%), giving me the impression that the workload is not distributed.

So my question is: is the worker's connection to the master handled automatically? Or should I set it up myself (as suggested in the Spark tutorial with ./sbin/start-slave.sh <master-spark-URL>)? Or am I doing something wrong?

Thank you very much for your patience and your time.
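A hedged way to narrow this down: if a worker registered with the master, its service logs say so explicitly. A sketch, assuming a Docker version recent enough for docker service logs and a stack name of "spark" (adjust the names to your deployment):

docker service ls
docker service logs --tail 50 spark_spark-worker
# A worker that reached the master logs a line like:
#   INFO Worker: Successfully registered with master spark://spark-master:7077
# If it keeps retrying the connection instead, the overlay network or the
# SPARK_MASTER value is the first thing to check.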

Hue is not able to login

With the master branch code, Hue throws an exception when logging in with user 'hue', as described here: http://www.big-data-europe.eu/scalable-sparkhdfs-workbench-using-docker/

namenode | 16/07/29 15:45:04 WARN security.UserGroupInformation: No groups available for user hue
namenode | 16/07/29 15:45:04 WARN security.UserGroupInformation: No groups available for user hue
namenode | 16/07/29 15:45:04 INFO namenode.FSEditLog: Number of transactions: 2 Total time for transactions(ms)
: 4 Number of transactions batched in Syncs: 0 Number of syncs: 2 SyncTimes(ms): 1
hdfsfb | 172.18.0.1 - - [29/Jul/2016 08:45:05] "POST /accounts/login/ HTTP/1.1" 302 -
hdfsfb | 172.18.0.1 - - [29/Jul/2016 08:45:05] "GET / HTTP/1.1" 500 -
hdfsfb | Traceback (most recent call last):
hdfsfb |   File "/opt/hue/build/env/lib/python2.7/site-packages/Django-1.6.10-py2.7.egg/django/contrib/staticfiles/handlers.py", line 67, in __call__
hdfsfb |     return self.application(environ, start_response)
hdfsfb |   File "/opt/hue/build/env/lib/python2.7/site-packages/Django-1.6.10-py2.7.egg/django/core/handlers/wsgi.py", line 206, in __call__
hdfsfb |     response = self.get_response(request)
hdfsfb |   File "/opt/hue/build/env/lib/python2.7/site-packages/Django-1.6.10-py2.7.egg/django/core/handlers/base.py", line 194, in get_response
hdfsfb |     response = self.handle_uncaught_exception(request, resolver, sys.exc_info())
hdfsfb |   File "/opt/hue/build/env/lib/python2.7/site-packages/Django-1.6.10-py2.7.egg/django/core/handlers/base.py", line 229, in handle_uncaught_exception
hdfsfb |     return debug.technical_500_response(request, *exc_info)
hdfsfb |   File "/opt/hue/build/env/lib/python2.7/site-packages/django_extensions-1.5.0-py2.7.egg/django_extensions/management/technical_response.py", line 5, in null_technical_500_response
hdfsfb |     six.reraise(exc_type, exc_value, tb)
hdfsfb |   File "/opt/hue/build/env/lib/python2.7/site-packages/Django-1.6.10-py2.7.egg/django/core/handlers/base.py", line 112, in get_response
hdfsfb |     response = wrapped_callback(request, *callback_args, **callback_kwargs)
hdfsfb |   File "/opt/hue/build/env/lib/python2.7/site-packages/Django-1.6.10-py2.7.egg/django/db/transaction.py", line 371, in inner
hdfsfb |     return func(*args, **kwargs)
hdfsfb |   File "/opt/hue/desktop/core/src/desktop/views.py", line 283, in index
hdfsfb |     return redirect(reverse('about:index'))
hdfsfb |   File "/opt/hue/build/env/lib/python2.7/site-packages/Django-1.6.10-py2.7.egg/django/core/urlresolvers.py", line 532, in reverse
hdfsfb |     key)
hdfsfb | NoReverseMatch: u'about' is not a registered namespace

How to enable Pyspark in Jupyter Notebook

I am a relative newbie to Docker containers and have found your project very useful. I want to develop in Python using SparkSQL notebooks, and unfortunately the notebook in this bundle only supports Scala. How can I modify this build to incorporate support for Python notebooks and SparkSQL?

Any assistance would be appreciated

Regards

Anant

Job aborted due to stage failure while reading a simple Text File from HDFS

I'm working with Spark notebooks, following "Scalable Spark/HDFS Workbench using Docker".

val textFile = sc.textFile("/user/root/vannbehandlingsanlegg.csv")

textFile: org.apache.spark.rdd.RDD[String] = /user/root/vannbehandlingsanlegg.csv MapPartitionsRDD[1] at textFile at <console>:67

It should show the execution time and the number of lines in the CSV file, but instead I get the following error:

cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD

I have been searching and saw that it could be related to executor dependencies. Any ideas?

Spark cannot read a file from Hadoop in an overlay network (swarm mode)

I have:
namenode (10.0.5.x) on machine-1
spark master (10.0.5.x) on machine-1
network endpoint (10.0.5.3) on machine-2
spark worker (10.0.5.x) on machine-2
datanode (10.0.5.x) on machine-2

My code runs on the Spark master (using pyspark):
text = sc.textFile("hdfs://namenode:9000/path/file")
text.collect()

I created the swarm with Spark (gettyimages) together with your Hadoop, and I cannot read data from Hadoop. The worker log says: Failed to connect to /10.0.5.3:50010 for block BP-1439091006-10.0.5.76-1536712157279:blk_1073741825_1001, add to deadNodes and continue.

10.0.5.3 is the network endpoint.
Why does the connection use the endpoint IP?

I can still access data on the namenode, though, so why the endpoint IP?

docker-compose-hadoop.yml

version: '3'
services:
  namenode:
    image: bde2020/hadoop-namenode:2.0.0-hadoop2.7.4-java8
    ports:
      - 50070:50070
    volumes:
      - ./namenode:/hadoop/dfs/name
      - ./hadoop-data:/hadoop-data
    environment:
      - CLUSTER_NAME=test
    env_file:
      - ./hadoop.env
    deploy:
      mode: replicated
      replicas: 1
      placement:
        constraints: [node.role == manager]
      restart_policy:
        condition: on-failure

  datanode:
    image: bde2020/hadoop-datanode:2.0.0-hadoop2.7.4-java8
    volumes:
      - ./datanode:/hadoop/dfs/data
    env_file:
      - ./hadoop.env
    environment:
      SERVICE_PRECONDITION: "namenode:50070"
    deploy:
      mode: global
      placement:
        constraints: [node.role == worker]
      restart_policy:
        condition: on-failure

networks:
    default:
        external:
            name: hadoop-spark-swarm-network

docker-compose-spark.yml

version: '3'
services:
  master:
    image: gettyimages/spark
    command: bin/spark-class org.apache.spark.deploy.master.Master
    hostname: master
    environment:
      MASTER: spark://master:7077
      SPARK_CONF_DIR: /conf
      SPARK_PUBLIC_DNS: localhost
      SPARK_MASTER_HOST: 0.0.0.0
    env_file:
      - ./hadoop.env
    ports:
      - 4040:4040
      - 6066:6066
      - 7077:7077
      - 8001:8080
      - 8888:8888
    volumes:
      - ./data:/tmp/data
    deploy:
      replicas: 1
      restart_policy:
        condition: on-failure
      placement:
        constraints: [node.role == manager]

  worker:
    image: gettyimages/spark
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
    hostname: worker
    environment:
      SPARK_CONF_DIR: /conf
      SPARK_WORKER_CORES: 4
      SPARK_WORKER_MEMORY: 6g
      SPARK_WORKER_PORT: 8881
      SPARK_WORKER_WEBUI_PORT: 8081
      SPARK_PUBLIC_DNS: localhost
    env_file:
      - ./hadoop.env
    depends_on:
      - master
    links:
      - master
    ports:
      - 8081:8081
    volumes:
      - ./data:/tmp/data
    deploy:
      replicas: 1
      restart_policy:
        condition: on-failure
      placement:
        constraints: [node.role == worker]

networks:
    default:
        external:
            name: hadoop-spark-swarm-network

I created my own network with: docker network create -d overlay hadoop-spark-swarm-network
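A hedged observation on the 10.0.5.3 address: in swarm mode each service gets a virtual IP (VIP) in addition to its container IPs, and an HDFS client that is handed the VIP instead of the datanode's own address will try to reach port 50010 on an IP where nothing listens directly. A sketch for making the mismatch visible (stack and service names are assumptions):

# VIP assigned to the datanode service
docker service inspect --format '{{json .Endpoint.VirtualIPs}}' <stack>_datanode
# Actual container IP(s) of the running datanode task (run on the node hosting it)
docker inspect --format '{{range .NetworkSettings.Networks}}{{.IPAddress}} {{end}}' \
  "$(docker ps -q --filter name=datanode)"
# If the "Failed to connect to" address in the worker log matches the VIP rather
# than the container IP, the datanode address advertised to clients is the issue.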

Failed to connect to namenode:8020

I used this repo to bring up a Docker swarm cluster, followed everything in the swarm directory step by step, and also modified the Makefile in the main directory as follows:

get-example:
	if [ ! -f example/SparkWriteApplication.jar ]; then \
		wget -O example/SparkWriteApplication.jar https://www.dropbox.com/s/7dn0horm8ocbu0p/SparkWriteApplication.jar ; \
	fi

example: get-example
	docker run --rm -it --network workbench --env-file ./swarm/hadoop.env -e SPARK_MASTER=spark://sparm-master:7077 --volume $(shell pwd)/example:/example bde2020/spark-base:2.2.0-hadoop2.8-hive-java8 /spark/bin/spark-submit --master spark://spark-master:7077 /example/SparkWriteApplication.jar
	docker exec -it namenode hadoop fs -cat /tmp/numbers-as-text/part-00000

When I execute make example, it runs and gives me an exception:
Failed to connect to server: namenode/10.0.0.102:8020: try once and fail.
java.net.ConnectException: Connection refused

I have allowed all of the necessary ports and still cannot connect to the namenode... any suggestions?

How do I get my Hadoop home directory?

When I start the docker container, where will my Hadoop home directory be? What is the path of the Hadoop installation? I want to point Sqoop to it, and Sqoop needs to know the Hadoop directory of the installation on my system.
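A hedged way to look this up in the running containers (the bde2020 Hadoop images generally export the usual Hadoop environment variables, but verify in your own setup):

docker exec -it namenode bash -c 'env | grep -i hadoop; which hadoop'
# The HADOOP_HOME (or the directory containing bin/hadoop) printed here is the
# path a tool such as Sqoop would need; note that it is a path inside the
# container, not on the host.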

Error when accessing hdfs file from remote spark driver

Thank you for sharing the results of your project. I'm trying out the hadoop-spark-workbench and I ran into some trouble when trying to access data stored in HDFS.

My setup is as follows:

  • I set up the hadoop-spark-workbench in a physical server inside our internal network.
  • I run a remote spark-shell from my laptop by specifying the master address.

When I try to load data from HDFS, for example using the following line:

val data = sparkContext.textFile("hdfs://namenode/data.txt")

I get a connection refused error, even though port 8082 is forwarded in the Docker setup.
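A hedged sanity check for this kind of setup: the hdfs:// URI needs the namenode's RPC port (commonly 8020 or 9000), not only a web UI port, and the hostname namenode must also resolve on the machine running the driver, because HDFS clients talk to the namenode and datanodes directly. For example:

# From the laptop running spark-shell:
ping -c 1 namenode          # does the hostname resolve at all outside Docker?
nc -vz <server-ip> 8020     # is the RPC port actually published/forwarded?
# If either fails, textFile("hdfs://namenode/...") cannot succeed regardless of
# which web UI ports are reachable.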

Global Values not found

Hi,
I have set up the Spark master and worker nodes via Docker from this repo.
Upon running the sample code, or any code for that matter, the Spark notebook throws the following error:

error: not found: value globalScope
error: not found: value SparkSession

Problem while running 'docker-compose up' command

I was trying this repository, and when I ran the 'docker-compose up' command it got stuck at:

datanode1 | 16/06/09 11:12:49 WARN datanode.DataNode: Problem connecting to server: namenode:8020
datanode2 | 16/06/09 11:12:52 WARN datanode.DataNode: Problem connecting to server: namenode:8020
Can anyone help?

latest version of hue and hue.ini in volume

The current Hue version has a lot of issues for me: it is outdated and partially broken. In our lab we have our own docker-compose configuration that uses the latest Hue 4. I am ready to contribute by updating Hue here, but for this I will probably need to include a hue.ini file in the Hue volume.

Failed on Connection Exception

I am facing an issue where my job cannot connect to HDFS. Basically, I am trying to run a simple Hadoop job against the docker containers. I followed the README and I was able to view all the UIs as described, on the address of my docker machine (http://192.168.99.100:port). Everything works as expected, but when I try to run the following Java code

// fs.defaultFS belongs on the job's Configuration, not on the Job object itself
Configuration conf = new Configuration();
conf.set("fs.defaultFS", "hdfs://192.168.99.100:8020");  // also tried with port 9000
Job job = Job.getInstance(conf);

// settings of the job here

return job.waitForCompletion(true) ? 0 : 1;  // <--- this line throws the exception

I get the following exception:

Exception in thread "main" java.net.ConnectException: Call From Pitagora.local/192.168.99.1 to 192.168.99.100:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:801)
        ....

The previous code works against my local instance of Hadoop, but not against the one in Docker. I tried changing the port, IP, etc., but nothing seems to work.
Any advice?

Thanks in advance
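A hedged first check, assuming the namenode container is simply named namenode: list which ports the container actually publishes on the Docker machine, since the compose files do not necessarily expose the RPC port that fs.defaultFS points at.

docker port namenode
# If neither 8020 nor 9000 appears here, add the namenode RPC port to the
# "ports:" section of the namenode service in docker-compose.yml and recreate
# the container before pointing the job at 192.168.99.100.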

org.apache.hadoop.hdfs.server.common.IncorrectVersionException

Thanks for the reply to the previous issue

I tried to execute the new setup you suggested in the previous issue. Following the README.md, after executing make hadoop I'm facing this problem:

After running sudo docker service ls I noticed that hadoop_datanode doesn't start, so investigating with sudo docker logs -f hadoop_datanode.yrqg6y0x6i9r6fzumsmas6w2d.6gmr55bf2tc68cahnyjuu5xkw I found this exception:
org.apache.hadoop.hdfs.server.common.IncorrectVersionException: Unexpected version of storage directory /hadoop/dfs/data. Reported: -57. Expecting = -56.

Is there anything wrong I could have done? Or is there an incompatibility between the docker containers?

I'm awaiting a kind reply; ask me if you need more info from the stack trace.

Any way to run this in swarm mode?

Hi,

This configuration and these images worked very well for me. Still, being able to deploy on only one physical machine is an issue. The docker-compose files are version 2, which is not compatible with swarm mode. Is there something I'm missing, or is this meant to be used on only one physical machine?

Thanks in advance!

P.S.: For scaling I am using pull request #25

incompatible clusterID Hadoop


Hi,
any time I reboot the swarm I have this problem:

java.io.IOException: Incompatible clusterIDs in /hadoop/dfs/data: namenode clusterID = CID-b25a0845-5c64-4603-a2cb-d7878c265f44; datanode clusterID = CID-f90183ca-4d87-4b49-8fb2-ca642d46016c
at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:777)

FATAL datanode.DataNode: Initialization failed for Block pool (Datanode Uuid unassigned) service to namenode/10.0.0.7:8020. Exiting.
java.io.IOException: All specified directories are failed to load.
at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:574)

I solved this problem by deleting this docker volume:

sudo docker volume inspect hadoop_datanode

[
    {
        "CreatedAt": "2018-05-10T19:35:31Z",
        "Driver": "local",
        "Labels": {
            "com.docker.stack.namespace": "hadoop"
        },
        "Mountpoint": "/data0/docker_var/volumes/hadoop_datanode/_data",
        "Name": "hadoop_datanode",
        "Options": {},
        "Scope": "local"
    }
]
But this volume contains the files I put into HDFS, so this way I have to put the files into HDFS again every time I deploy the swarm. I'm not sure this is the right way to solve the problem.
Googling, I found one solution, but I don't know how to apply it before the swarm reboots. This is the solution:
The problem is with the property name dfs.datanode.data.dir; it is misspelt as dfs.dataode.data.dir. This invalidates the property from being recognised, and as a result the default location of ${hadoop.tmp.dir}/hadoop-${USER}/dfs/data is used as the data directory.
hadoop.tmp.dir is /tmp by default; on every reboot the contents of this directory are deleted, forcing the datanode to recreate the folder on startup. Hence the incompatible clusterIDs.
Edit this property name in hdfs-site.xml before formatting the namenode and starting the services.

thanks.
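A hedged way to confirm the mismatch before wiping anything: HDFS stores the clusterID in a VERSION file under the name and data directories, which in these compose files are mounted at /hadoop/dfs/name and /hadoop/dfs/data. Container names depend on the stack and the commands must run on the node hosting each container, so the filters below are only a sketch:

docker exec "$(docker ps -q --filter name=namenode)" \
  grep clusterID /hadoop/dfs/name/current/VERSION
docker exec "$(docker ps -q --filter name=datanode)" \
  grep clusterID /hadoop/dfs/data/current/VERSION
# If only the datanode side is stale, clearing just the datanode volume (as
# described above) lets it re-register; reformatting the namenode instead would
# invalidate the data already stored in HDFS.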

Error when accessing Hue UI

I cannot access Hue at http://docker-host-ip:8088/

The first time I access it I can create an account with username hue and a password, but after registering the user the web server responds with this error:

Traceback (most recent call last):
  File "/opt/hue/build/env/lib/python2.7/site-packages/Django-1.6.10-py2.7.egg/django/contrib/staticfiles/handlers.py", line 67, in __call__
    return self.application(environ, start_response)
  File "/opt/hue/build/env/lib/python2.7/site-packages/Django-1.6.10-py2.7.egg/django/core/handlers/wsgi.py", line 206, in __call__
    response = self.get_response(request)
  File "/opt/hue/build/env/lib/python2.7/site-packages/Django-1.6.10-py2.7.egg/django/core/handlers/base.py", line 194, in get_response
    response = self.handle_uncaught_exception(request, resolver, sys.exc_info())
  File "/opt/hue/build/env/lib/python2.7/site-packages/Django-1.6.10-py2.7.egg/django/core/handlers/base.py", line 229, in handle_uncaught_exception
    return debug.technical_500_response(request, *exc_info)
  File "/opt/hue/build/env/lib/python2.7/site-packages/django_extensions-1.5.0-py2.7.egg/django_extensions/management/technical_response.py", line 5, in null_technical_500_response
    six.reraise(exc_type, exc_value, tb)
  File "/opt/hue/build/env/lib/python2.7/site-packages/Django-1.6.10-py2.7.egg/django/core/handlers/base.py", line 112, in get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/opt/hue/build/env/lib/python2.7/site-packages/Django-1.6.10-py2.7.egg/django/db/transaction.py", line 371, in inner
    return func(*args, **kwargs)
  File "/opt/hue/desktop/core/src/desktop/views.py", line 283, in index
    return redirect(reverse('about:index'))
  File "/opt/hue/build/env/lib/python2.7/site-packages/Django-1.6.10-py2.7.egg/django/core/urlresolvers.py", line 532, in reverse
    key)
NoReverseMatch: u'about' is not a registered namespace

Spark notebook not working on AWS EMR

After starting the HDFS/Spark Workbench, and opening up any notebook on Spark Notebook, I get the following error:

"Dead Kernel: The kernel has died, and the automatic restart has failed. It is possible the kernel cannot be restarted. If you are not able to restart the kernel, you will still be able to save the notebook, but running code will no longer work until the notebook is reopened."

Selecting "Manual Restart" results in "No kernel" and as a result no cells in the notebook can be run.

Same setup on Mac OS X works like a charm. How can I make this work on AWS EMR? Thanks!

java.net.UnknownHostException: namenode

Hi,
I'm running a Scala Spark job. I created an uber jar and used spark-submit to send it to the Docker Spark master:
spark-submit --class "myclass" --master spark://localhost:7077 target/myjob-1.0-SNAPSHOT.jar

It executes well, but when trying to write to HDFS I get the following error:

Exception in thread "main" java.lang.IllegalArgumentException: java.net.UnknownHostException: namenode

In the Spark job I'm using this line to write to HDFS:
myrdd.saveAsTextFile("hdfs://namenode:8020/myfile")

Which host/port do I have to use to make this work? I tried localhost, among others, without success. Where can I set the DNS of my namenode?

Broken hadoop package - missing GLIBC_2.14

hadoop fs -ls /
shows:
17/03/25 13:49:32 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

It looks like the problem is related to a missing GLIBC version:

ldd /opt/hadoop-2.7.1/lib/native/libhadoop.so
/opt/hadoop-2.7.1/lib/native/libhadoop.so: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.14' not found (required by /opt/hadoop-2.7.1/lib/native/libhadoop.so)

GlusterFS?

What do you think about adding GlusterFS as an alternative file system in the swarm stack?
Dealing with HDFS is a lot of pain, and the namenode is a single point of failure (it also keeps its data in a local volume, so the service always has to be placed on the same node).
In our lab, we are now considering Alluxio + GlusterFS, or other alternatives to HDFS.
