
Docker container for Big Data Geospatial Analysis in Kafka & Spark Streaming (PySpark)

This Dockerfile sets up a complete streaming environment for experimenting with Kafka and Spark Streaming (PySpark). It installs

  • Kafka 2.4.0;
  • Spark 2.4.4.

It additionally installs sbt, Anaconda, and Jupyter notebooks (see section 2 for details).

0. Docker installation

Docker is a software container platform that allows you to isolate OS environments. Its main benefits over virtual machines are a reduced footprint (starting from 60MB for a Linux distribution, for example Ubuntu) and an easy way to reconfigure an OS environment. Note: the Docker container for this course is 7.7GB, mostly due to the Anaconda (>1GB) and Spark (>1GB) distributions.

For more information on Docker, see: https://www.docker.com

To download and install, see: https://docs.docker.com/get-docker/ (NB: install the Community Edition)

Get started Docker tutorial: https://docs.docker.com/get-started/

We strongly encourage you to follow the tutorial and learn how to build, run, pull, and push Docker images.

1. Get started

To avoid building the image from scratch, a prebuilt image is available on Docker Hub (see below).

1.1. Pull image

Pull the image and create a folder for your project

The image is called kafkasparkgio and is available on Docker Hub (note: the image is 7.7GB, so make sure you have a reasonably good Internet connection).

To install the image, use the standard docker pull command

docker pull giobbu/kafkasparkgio

Create a folder and cd into it.

mkdir Folder_Geospatial_KafkaSpark 
cd Folder_Geospatial_KafkaSpark 

Git clone the repository for the course

git clone https://github.com/giobbu/Geospatial_KafkaSpark

cd into the Geospatial_KafkaSpark folder

cd Geospatial_KafkaSpark

Finally, grant read/write permissions recursively to all users (this eases file sharing with the Docker container)

chmod -R a+rwx .

The Docker container should now be able to read/write to your host Geospatial_KafkaSpark folder.

1.2. Start container

1.2.1. Start Docker container

From the Geospatial_KafkaSpark folder, start the container with

docker run -v `pwd`:/home/guest/host -p 8888:8888 -p 4040:4040 -p 23:22 -it giobbu/kafkasparkgio bash

Notes

  • -v is used to share a folder between the host and the container (the permissions granted above allow your changes to be saved on your computer). The "-v `pwd`:/home/guest/host" option mounts the current local folder (i.e. the folder containing the Dockerfile, ipynb files, etc.) into the container;
  • -it starts the Docker container in interactive mode, so you can use the console and Bash;
  • -p maps ports between the container and the host. 8888 is the Jupyter notebook port, and 4040 the Spark UI port.

SSH allows you to connect to the container

ssh -p 23 guest@containerIP

where 'containerIP' is the IP of the container (127.0.0.1 on Linux). The password is 'guest'.

1.2.2. Start Services

Once the container is running, you are logged in as root. Run startup_script.sh (in /usr/bin) to start

  • SSH server. You can connect to the container using user 'guest' and password 'guest'
  • Zookeeper server
  • Kafka server
startup_script.sh
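
To check that the services came up, the snippet below (a minimal sketch, assuming the kafka-python client is available in the container's Anaconda environment) connects to the local Kafka broker and lists its topics.

from kafka import KafkaConsumer

# Connect to the Kafka broker started by startup_script.sh
consumer = KafkaConsumer(bootstrap_servers="localhost:9092")
print(consumer.topics())  # set of existing topics (may be empty at first)
consumer.close()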

1.2.3. Connect, open notebook and start streaming

Connect as user 'guest' and go to the 'host' folder (shared with the host)

su guest
cd ~/host

Start Jupyter notebook

notebook

and connect from your browser at host:8888 (where 'host' is the IP of your host; if run locally on your computer, this should be 127.0.0.1, or 192.168.99.100 with Docker Toolbox; check the Docker documentation)

Start Kafka producer

Open OBUcsvToKafka.ipynb and run all cells.
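
For reference, the sketch below shows the general shape of such a CSV-to-Kafka producer. It assumes the kafka-python client; the file path and the topic name "obu" are hypothetical placeholders, so the actual notebook may differ.

import csv
import time
from kafka import KafkaProducer

# Connect to the Kafka broker running inside the container
producer = KafkaProducer(bootstrap_servers="localhost:9092")

with open("data/obu_records.csv") as f:  # hypothetical CSV path
    for row in csv.reader(f):
        # Send each CSV row as a comma-joined UTF-8 message
        producer.send("obu", ",".join(row).encode("utf-8"))
        time.sleep(0.1)  # throttle to simulate a live stream

producer.flush()
producer.close()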

Start Kafka receiver

Open sparkStreamingPredictionsOBU.ipynb and run the cells up to the one that starts the streaming.
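
The receiver side follows the usual Spark Streaming pattern. Below is a minimal sketch, assuming the spark-streaming-kafka-0-8 connector is on the classpath (as configured for Spark 2.4.4) and the same hypothetical topic name "obu"; the actual notebook may differ.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="OBUStreamReceiver")
ssc = StreamingContext(sc, batchDuration=5)  # 5-second micro-batches

# Read directly from the Kafka broker running inside the container
stream = KafkaUtils.createDirectStream(
    ssc, ["obu"], {"metadata.broker.list": "localhost:9092"})

# Each record is a (key, value) pair; print a few values per batch
stream.map(lambda kv: kv[1]).pprint()

ssc.start()
ssc.awaitTermination()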

Connect to Spark UI

The Spark UI is available in your browser at host:4040

2. Container configuration details

The container is based on the CentOS 6 Linux distribution. The main steps of the build process are

  • Install some common Linux tools (wget, unzip, tar, SSH tools, ...) and Java (1.8)
  • Create a guest user (the UID is important for sharing folders with the host, see below), and install Spark, sbt, Kafka, Anaconda, and Jupyter notebooks for the guest user
  • Switch back to the root user, and install the startup script (for starting SSH), the sentenv.sh script that sets up environment variables (Java, Kafka, Spark, ...), and spark-defaults.conf

User UID

In the Dockerfile, the line

RUN useradd guest -u 1000

creates the guest user under which the container runs. The username is 'guest', the password is 'guest', and the '-u' parameter sets the Linux UID for that user.

To make folder sharing between the container and your host easier, make sure this UID matches your user's UID on the host. You can check your host UID with

echo $UID

3. Troubleshooting

Issue: 127.0.0.1 refused to connect

On some versions of Mac OSX and Windows, the Docker Toolbox is used instead of the native Docker Engine on the system.

In this case, instead of connecting to the IP address of the localhost (i.e. 127.0.0.1), one should connect to the IP address of the virtual machine where the Docker Engine is running (192.168.99.100, by default).

For example, to connect to the Jupyter Notebook, you should open your browser at 192.168.99.100:8888

Issue: Shared folder empty

The Docker Toolbox on Windows 10 Home shares by default only the primary hard drive (i.e. the one where Windows is installed -> C:).

If your machine has more than one hard disk, two solutions exist:

  • Either the Geospatial_KafkaSpark folder has to be located on the main hard drive (C:)
  • Or, a manual mountpoint to the second hard drive must be created in the configuration of VirtualBox (the provisioner running the virtual machine containing Docker Engine) as described here.

And now, enjoy it!
