alexioannides / pyspark-example-project

Implementing best practices for PySpark ETL jobs and applications.

Shell 17.09% Python 82.91%
pyspark etl-job python data-engineering spark data-science etl etl-pipeline

pyspark-example-project's Introduction

alexioannides

Python source code for generating my personal website - hosted by GitHub Pages at alexioannides.github.io - using the Pelican framework for static websites, together with the Flex theme.

The output of the build process is written to the output folder in the root directory, which is not version controlled by this repository. Instead, the output directory has its own repository at alexioannides, which is required for hosting with GitHub Pages.

Development Setup

The package's third-party dependencies are described in requirements.txt. Create a new virtual environment and install these dependencies as follows:

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Building the Website

To build the website, call Pelican:

pelican

Testing the Site Locally

We recommend setting RELATIVE_URLS = True when testing (do not forget to revert this before deploying) and then executing the following:

pelican --listen output

A test version of the website will then be available at http://localhost:8000.
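
For reference, a minimal sketch of the toggle, assuming the setting lives in the standard Pelican settings module:

# pelicanconf.py (assumed location of the Pelican settings)
# Relative URLs make the generated site browsable at http://localhost:8000;
# revert to False (and rebuild) before deploying.
RELATIVE_URLS = True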

Deploying to GitHub Pages

After testing locally, first ensure that RELATIVE_URLS = False, rebuilding the website if necessary. Then make sure that you are still in the output directory - remember that it is version controlled by a different repository - and commit and push the new changes to master as usual, e.g.,

git add -A
git commit -m "latest changes to alexioannides.github.io"
git push origin master

The updated website is usually available within a minute or two.

pyspark-example-project's People

Contributors

alexioannides, clarksun, marouenes, oliverw1

pyspark-example-project's Issues

Failed TestCase

ERROR: test_transform_data (tests.test_etl_job.SparkETLTests)
Test data transformer.

Traceback (most recent call last):
  File "/home/brix/pyspark-workloads/pyspark-example-project/tests/test_etl_job.py", line 27, in setUp
    self.spark, *_ = start_spark()
  File "/home/brix/pyspark-workloads/pyspark-example-project/dependencies/spark.py", line 88, in start_spark
    spark_sess = spark_builder.getOrCreate()
  File "/home/brix/pyspark-workloads/pyspark-example-project/.venv/lib/python3.6/site-packages/pyspark/sql/session.py", line 173, in getOrCreate
    sc = SparkContext.getOrCreate(sparkConf)
  File "/home/brix/pyspark-workloads/pyspark-example-project/.venv/lib/python3.6/site-packages/pyspark/context.py", line 349, in getOrCreate
    SparkContext(conf=conf or SparkConf())
  File "/home/brix/pyspark-workloads/pyspark-example-project/.venv/lib/python3.6/site-packages/pyspark/context.py", line 118, in __init__
    conf, jsc, profiler_cls)
  File "/home/brix/pyspark-workloads/pyspark-example-project/.venv/lib/python3.6/site-packages/pyspark/context.py", line 195, in _do_init
    self._encryption_enabled = self._jvm.PythonUtils.getEncryptionEnabled(self._jsc)
  File "/home/brix/pyspark-workloads/pyspark-example-project/.venv/lib/python3.6/site-packages/py4j/java_gateway.py", line 1487, in __getattr__
    "{0}.{1} does not exist in the JVM".format(self._fqn, name))
py4j.protocol.Py4JError: org.apache.spark.api.python.PythonUtils.getEncryptionEnabled does not exist in the JVM
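
For context, this Py4JError is typically a symptom of a mismatch between the pip-installed pyspark package and the Spark distribution it launches (the JVM it connects to lacks the getEncryptionEnabled helper). A minimal sanity check, assuming nothing about this particular environment beyond a standard PySpark install:

import os

import pyspark

# Compare the pip-installed PySpark with whatever Spark distribution
# SPARK_HOME points to; a mismatch between the two is the usual cause of
# "PythonUtils.getEncryptionEnabled does not exist in the JVM".
print('pyspark (pip) version:', pyspark.__version__)
print('SPARK_HOME:', os.environ.get('SPARK_HOME', '<not set>'))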

import sklearn fails

Hi, I am referring to your project in order to write ETL apps using pyspark. I am just importing sklearn in my app. I am running spark-submit locally.

jobs/etl_job.py fails with the following error:


Traceback (most recent call last):
  File "/Users/divay/Documents/pyspark-example-project/jobs/etl_job.py", line 41, in <module>
    import sklearn
  File "/private/var/folders/3q/zy4z346d5dv7g6r9q_f0z2jw0000gp/T/pip-install-fe4mp1/scikit-learn/sklearn/__init__.py", line 63, in <module>
  File "/private/var/folders/3q/zy4z346d5dv7g6r9q_f0z2jw0000gp/T/pip-install-fe4mp1/scikit-learn/sklearn/__check_build/__init__.py", line 46, in <module>
  File "/private/var/folders/3q/zy4z346d5dv7g6r9q_f0z2jw0000gp/T/pip-install-fe4mp1/scikit-learn/sklearn/__check_build/__init__.py", line 26, in raise_build_error
OSError: [Errno 20] Not a directory: '/Users/divay/Documents/pyspark-example-project/packages.zip/sklearn/__check_build'

ModuleNotFoundError: No module named 'dependencies'

When I run the code with the following command:

$ spark-submit --master local[*] jobs/reconciliation.py

I get the error:

ModuleNotFoundError: No module named 'dependencies'

It's because jobs and dependencies are sibling folders. Can someone please help me see where I am going wrong?
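
For context, a minimal sketch of why this import only resolves when the dependencies package is shipped to Spark; the --py-files / packages.zip convention mentioned in the comments is taken from this repository's layout, and the job file name comes from the command above:

# jobs/reconciliation.py
# 'dependencies' is a sibling package of 'jobs', so this import only works if
# the package is on the driver's (and executors') PYTHONPATH - for example by
# zipping it and shipping the archive with spark-submit --py-files (as this
# repository does with packages.zip), or by launching from the project root
# with PYTHONPATH set to include it.
from dependencies.spark import start_spark

spark, *_ = start_spark()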

etl_config.json not loaded in EMR

First of all, thanks for the great work! I am new to Spark and this repo has really helped me get started.

I am trying to get my ETL job running on AWS EMR in cluster mode, but got hit with an issue where the PySpark program failed to load my config.json passed in through --files s3://path.... I googled a bit and figured this might be an issue with how --files only copies files to the executor nodes, not the driver node, so the code here https://github.com/AlexIoannides/pyspark-example-project/blob/master/dependencies/spark.py#L93 finds nothing when iterating through the Spark root dir (for reference, I based my guess on this SO post: https://stackoverflow.com/questions/47187533/files-option-in-pyspark-not-working).

Could this be the issue and is there a workaround for it? Or am I not making sense at all?

Best practice around passing DF to multiple functions

Hi Alex,
This is a very informative and helpful project.
However, I have a question about the ETL job.

In extract_data(spark), we build a DataFrame from different sources (e.g. S3, CSV, or a database).

Suppose that in transform_data(df, steps_per_floor_) we have multiple other functions that transform the data,

e.g. cleaning(df), setup_different_conditional_edits(df), windowing_function_transformations(df), other_statistical_transformations(df).

Is it efficient to pass a DataFrame as a Python function argument, or is using a UDF more efficient?
What if the DataFrame is on the order of 10-100 GB?

Can you point me in the right direction?
Thanks for this helpful project :)
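
For what it is worth, a minimal sketch of the pattern described above - passing the DataFrame through plain Python functions (the function bodies are placeholders, not this project's code). Passing a DataFrame as an argument is cheap regardless of data volume, because it only passes a reference to a lazy logical plan; nothing is computed until an action runs, so 10-100 GB of data does not make the function calls themselves more expensive. Python UDFs, by contrast, operate row by row and are usually slower than built-in DataFrame expressions, so they are not a performance optimisation here.

from pyspark.sql import DataFrame


def cleaning(df: DataFrame) -> DataFrame:
    # placeholder transformation
    return df.dropna()


def windowing_function_transformations(df: DataFrame) -> DataFrame:
    # placeholder for window-function logic
    return df


def transform_data(df: DataFrame, steps_per_floor_: int) -> DataFrame:
    # steps_per_floor_ is kept only to mirror the signature in the question.
    # Each call below just extends the lazy logical plan; Spark executes
    # nothing until an action (write, count, show, ...) is triggered.
    df = cleaning(df)
    df = windowing_function_transformations(df)
    return df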

Add License

Please add a LICENSE file to the repo, so that users can be sure of how it may be used in closed-source codebases. See here.

Setup and Teardown should be @classmethods setUpClass and tearDownClass

If they are not class methods, then setUp and tearDown are invoked for every test, and a separate Spark session is created for each of those tests.

import logging
import unittest

from pyspark.sql import SparkSession

from jobs.etl_job import transform_data  # import path assumed; the function under test


class PySparkTest(unittest.TestCase):

    @classmethod
    def suppress_py4j_logging(cls):
        logger = logging.getLogger('py4j')
        logger.setLevel(logging.WARN)

    @classmethod
    def create_testing_pyspark_session(cls):
        return SparkSession \
            .builder \
            .master('local[*]') \
            .appName("my-local-testing-pyspark-context") \
            .getOrCreate()

    @classmethod
    def setUpClass(cls):
        cls.suppress_py4j_logging()
        cls.spark = cls.create_testing_pyspark_session()
        cls.test_data_path = "<PATH>"
        cls.df = cls.spark.read.options(header='true', inferSchema='true') \
            .csv(cls.test_data_path)
        cls.df_expected = transform_data(cls.df, cls.spark)

    @classmethod
    def tearDownClass(cls):
        cls.spark.stop()
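
With setUpClass and tearDownClass, the SparkSession is created and stopped once for the whole test class rather than once per test method, so the test suite only pays the Spark start-up cost a single time.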

Having problems with running this code under PySpark 2.3.1

Hi Alex,

Thanks a lot for creating this example. I cloned your repository and tried to run it on my laptop (macOS Sierra, pyspark==2.3.1, py4j==0.10.7) and encountered the following problem:

Traceback (most recent call last):
  File "/Users/al/Desktop/dev/Spark/etl/etl_job.py", line 237, in <module>
    main()
  File "/Users/al/Desktop/dev/Spark/etl/etl_job.py", line 56, in main
    files=['etl_config.json'])
  File "/Users/al/Desktop/dev/Spark/etl/etl_job.py", line 209, in start_spark
    spark_sess = spark_builder.getOrCreate()
  File "/usr/local/Cellar/apache-spark/2.3.1/libexec/python/lib/pyspark.zip/pyspark/sql/session.py", line 173, in getOrCreate
  File "/usr/local/Cellar/apache-spark/2.3.1/libexec/python/lib/pyspark.zip/pyspark/context.py", line 343, in getOrCreate
  File "/usr/local/Cellar/apache-spark/2.3.1/libexec/python/lib/pyspark.zip/pyspark/context.py", line 115, in __init__
  File "/usr/local/Cellar/apache-spark/2.3.1/libexec/python/lib/pyspark.zip/pyspark/context.py", line 292, in _ensure_initialized
  File "/usr/local/Cellar/apache-spark/2.3.1/libexec/python/lib/pyspark.zip/pyspark/java_gateway.py", line 120, in launch_gateway
TypeError: __init__() got an unexpected keyword argument 'auth_token'
2018-08-24 15:24:22 INFO ShutdownHookManager:54 - Shutdown hook called

What needs to be modified in your source code in order to run it?

Thanks!

Issue while executing the code via PyCharm

File "/home/ashish/Downloads/pyspark-example-project-master/jobs/etl_job.py", line 57, in main
data_transformed = transform_data(data, config['steps_per_floor'])
TypeError: 'NoneType' object is not subscriptable
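
For context, this TypeError usually means config is None: when the job is launched directly from the IDE rather than via spark-submit with --files etl_config.json, start_spark() finds no config file in the Spark working directory and returns no config, so config['steps_per_floor'] fails. A minimal defensive sketch - the helper and fallback path below are assumptions for illustration, not the project's own code:

import json


def load_config_or_fallback(config, fallback_path='configs/etl_config.json'):
    """Return the config dict, loading a local copy if Spark supplied none."""
    if config is not None:
        return config
    # fallback_path is assumed from the repository layout
    with open(fallback_path) as config_file:
        return json.load(config_file)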
