Realest Data

To facilitate the modernization of Realest Estate Corp., we present this project: Realest Data, a modern data platform built on top of Spark and Jupyter.

Platform

The platform is composed of two main components: a set of ETL jobs and a set of analytics notebooks.

Extract

  • Spark is used for the heavy lifting, performing extractions and transformations on the data.

Explore

  • Jupyter is used to enable the more technically savvy to get their hands dirty with the data, either from the raw logs or from a database, in a reproducible and shareable manner, utilizing PySpark as an interface to Spark and Seaborn for visualization.

Setup

To work on or set up Realest Data, install the following tools according to their respective websites:

There may be some other things that require installation to get everything working on your own machine. Some googling will resolve the majority of these issues; if that doesn't work, feel free to create an issue so that this installation guide can be made more universal.

You can check if you have all the tools with:

$ make check

Explore Setup

First, we need to install the Python dependencies, preferably within a virtualenv.

$ make setup_pyspark

For more information, see the Seaborn and Jupyter documentation.

NOTE: In order to export Notebooks to PDFs, you'll need to install pandoc and TeX as detailed here

Use

Extract

jobs is an SBT project that contains Spark applications written in Scala, which operate on data in one place and save the results of those transformations in another.

Submitting a job is as easy as exporting the Master node to send the job to and using submit_job, specifying the job (in package.class format) you want to submit.

For example, to submit the TestJob to a local cluster (after setting one up, of course, with make start_local):

$ export SPARK_MASTER=spark://$(hostname):7077
$ make submit_job job=com.realest_estate.TestJob

Remember to tear down the cluster when you're done using it with make kill_local.

Explore

notebooks is the place to hold Jupyter notebooks, providing an easy yet powerful environment for manipulating data at any level in an ad-hoc, yet documentable, fashion. Investigations, prototypes, and research can all be performed in Python on top of PySpark and Seaborn. More languages and frameworks can be supported in the future.
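
To give a feel for that workflow, here is a minimal, illustrative sketch of a notebook cell: it loads the fixture CSV into a Spark DataFrame with PySpark, aggregates it, and plots the result with Seaborn. The file path and the Neighborhood column are assumptions for illustration, not a prescribed part of this repo.

from pyspark.sql import SparkSession
import seaborn as sns
import matplotlib.pyplot as plt

# When launched via `make pyspark`, a SparkSession is typically already
# available as `spark`; getOrCreate() reuses it if so.
spark = SparkSession.builder.getOrCreate()

# Load a CSV into a Spark DataFrame, letting Spark infer the schema.
# NOTE: the path and column name below are illustrative assumptions.
df = spark.read.csv("fixtures/daily_properties.csv", header=True, inferSchema=True)

# Aggregate in Spark, then bring the small result into pandas for plotting.
counts = df.groupBy("Neighborhood").count().toPandas()

# Visualize with Seaborn.
sns.barplot(data=counts, x="Neighborhood", y="count")
plt.xticks(rotation=90)
plt.show()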

Run which python and ensure that it's pointing to the python with the dependencies you need. If it isn't:

$ export PYSPARK_PYTHON=<path_to_python_exe>

To run a notebook server in a Spark cluster:

$ # export the url of the master node to `SPARK_MASTER`
$ export SPARK_MASTER=spark://<master_hostname>:<master_port>
$ make pyspark

Testing

Jobs

Once everything is installed, to start a local cluster, run a test job, and verify your local setup:

$ make test_local

NOTE: If this is your first time with sbt or Spark, this might take a little while, as sbt has to download the right versions of itself, Scala, and the project's dependencies. Also, for building the local cluster and tearing it down, you may be asked for your password to connect locally over SSH.

Notebooks

If a Spark cluster isn't running, you can start one locally with make start_local. Export the location of the Master node to SPARK_MASTER and run the Jupyter notebook:

$ export SPARK_MASTER=spark://<master_hostname>:<master_port>
$ make pyspark

A Jupyter notebook will open in your browser, pointed at the notebooks folder. Select the Test notebook and run it. If everything is cool, no error will be thrown, and you can go about your day.
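
The Test notebook's actual contents live in the repo; as a rough idea of what such a sanity check does, the sketch below (purely illustrative, not the notebook's real code) confirms the session is attached to the cluster and can run a trivial job.

from pyspark.sql import SparkSession

# Purely illustrative sanity check; not necessarily what the Test notebook does.
spark = SparkSession.builder.getOrCreate()
print(spark.sparkContext.master)   # expect something like spark://<master_hostname>:7077
print(spark.range(100).count())    # expect 100, with no errors thrown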

Example

In fixtures/ you'll find a CSV file called daily_properties. This is the training data from the Housing Prices Kaggle Challenge. In this example, we'll run a job to transform this data into something that we'll use in a notebook.

First, set up according to the steps listed above, including the "Explore" section. Once everything is configured, start a local cluster:

$ make start_local

Once the cluster is up, check out its status at http://localhost:8080/. You should have one worker and a Master node at port 7077 on your local machine.

Run the TransformDailyProperties job:

$ make submit_job job=com.realest_estate.TransformDailyProperties

Once that completes, if everything went well, the following command shouldn't error:

$ cat fixtures/transformed_daily_properties.csv/_SUCCESS

Start the Jupyter notebook server with:

$ make pyspark

Check out the Questions Investigation notebook; it answers a variety of questions about the data. Run it and you'll see how you're able to read the data returned by the job and perform queries on it.
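
If you'd like to poke at the job's output outside that notebook, a minimal sketch looks like the following. The transformed schema isn't documented here, so the Neighborhood and SalePrice columns are assumptions for illustration.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# The job wrote its result as a directory of part files; Spark reads it back as one dataset.
df = spark.read.csv("fixtures/transformed_daily_properties.csv", header=True, inferSchema=True)

# Example query: average sale price per neighborhood.
# NOTE: "Neighborhood" and "SalePrice" are assumed column names.
(df.groupBy("Neighborhood")
   .agg(F.avg("SalePrice").alias("avg_sale_price"))
   .orderBy(F.desc("avg_sale_price"))
   .show(10))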

Once you're done, do your computer a favor and tear down the cluster:

$ make kill_local

TODO

Right now, while this system is decoupled and organized, the lack of structure makes it rather haphazard to use the pieces in conjunction, due to implicit timing dependencies and inferred schemas. Moreover, the platform currently lacks a concrete database layer, which will make informal analysis by non-technical users (among other tasks) difficult.

Various things that can be done to productionize this platform:

  • Provision and orchestrate a Spark cluster using Terraform and Kubernetes
  • Provision and configure a Jenkins service to be able to structure pipelines from jobs with Terraform, Ansible, and Docker
  • Establish a schema registry so that jobs and databases can work in sync with different data shapes.
  • Design codepaths to provision and/or configure Databases to be populated by Lambda functions that run when jobs deposit their results in S3.
  • Adjust SBT so that only the job being submitted (and its dependencies) gets packaged

These are only some of the possible directions this project could go; it really depends on the business needs.
