alexioannides / pyspark-example-project

Implementing best practices for PySpark ETL jobs and applications.

Shell 17.09% Python 82.91%
pyspark etl-job python data-engineering spark data-science etl etl-pipeline

pyspark-example-project's Introduction

alexioannides

Python source code for generating my personal website - hosted by GitHub Pages at alexioannides.github.io - using the Pelican framework for static websites, together with the Flex theme.

The output of the build process is written to the output folder in the root directory, which is not version controlled by this repository. Instead, the output directory has its own repository at alexioannides, which is required for hosting with GitHub Pages.

Development Setup

The package's third-party dependencies are described in requirements.txt. Create a new virtual environment and install these dependencies as follows:

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Building the Website

To build the website, call Pelican:

pelican

Testing the Site Locally

We recommend setting RELATIVE_URLS = True when testing (do not forget to revert this before deploying) and then executing the following:

pelican --listen output

A test version of the website will then be available at http://localhost:8000.
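
For reference, a minimal sketch of the toggle, assuming the setting lives in the standard Pelican settings module:

# pelicanconf.py (assumed location of the Pelican settings)
# Relative URLs make the generated site browsable at http://localhost:8000;
# revert to False (and rebuild) before deploying.
RELATIVE_URLS = True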

Deploying to GitHub Pages

After testing locally, first ensure that RELATIVE_URLS = False, rebuilding the website if necessary. Then make sure that you are still in the output directory - remember that it is version controlled by a different repository - and commit and push the new changes to master as usual, e.g.,

git add -A
git commit -m "latest changes to alexioannides.github.io"
git push origin master

The updated website is usually available within a minute or two.

pyspark-example-project's People

Contributors

alexioannides, clarksun, marouenes, oliverw1

pyspark-example-project's Issues

Failed TestCase

ERROR: test_transform_data (tests.test_etl_job.SparkETLTests)
Test data transformer.

Traceback (most recent call last):
  File "/home/brix/pyspark-workloads/pyspark-example-project/tests/test_etl_job.py", line 27, in setUp
    self.spark, *_ = start_spark()
  File "/home/brix/pyspark-workloads/pyspark-example-project/dependencies/spark.py", line 88, in start_spark
    spark_sess = spark_builder.getOrCreate()
  File "/home/brix/pyspark-workloads/pyspark-example-project/.venv/lib/python3.6/site-packages/pyspark/sql/session.py", line 173, in getOrCreate
    sc = SparkContext.getOrCreate(sparkConf)
  File "/home/brix/pyspark-workloads/pyspark-example-project/.venv/lib/python3.6/site-packages/pyspark/context.py", line 349, in getOrCreate
    SparkContext(conf=conf or SparkConf())
  File "/home/brix/pyspark-workloads/pyspark-example-project/.venv/lib/python3.6/site-packages/pyspark/context.py", line 118, in __init__
    conf, jsc, profiler_cls)
  File "/home/brix/pyspark-workloads/pyspark-example-project/.venv/lib/python3.6/site-packages/pyspark/context.py", line 195, in _do_init
    self._encryption_enabled = self._jvm.PythonUtils.getEncryptionEnabled(self._jsc)
  File "/home/brix/pyspark-workloads/pyspark-example-project/.venv/lib/python3.6/site-packages/py4j/java_gateway.py", line 1487, in __getattr__
    "{0}.{1} does not exist in the JVM".format(self._fqn, name))
py4j.protocol.Py4JError: org.apache.spark.api.python.PythonUtils.getEncryptionEnabled does not exist in the JVM
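
For context, this Py4JError is typically a symptom of a mismatch between the pip-installed pyspark package and the Spark distribution it launches (the JVM it connects to lacks the getEncryptionEnabled helper). A minimal sanity check, assuming nothing about this particular environment beyond a standard PySpark install:

import os

import pyspark

# Compare the pip-installed PySpark with whatever Spark distribution
# SPARK_HOME points to; a mismatch between the two is the usual cause of
# "PythonUtils.getEncryptionEnabled does not exist in the JVM".
print('pyspark (pip) version:', pyspark.__version__)
print('SPARK_HOME:', os.environ.get('SPARK_HOME', '<not set>'))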

import sklearn fails

Hi, I am referring to your project in order to write ETL apps using pyspark. I am just importing sklearn in my app. I am running spark-submit locally.

jobs/etl_job.py fails with the following error:


Traceback (most recent call last):
  File "/Users/divay/Documents/pyspark-example-project/jobs/etl_job.py", line 41, in <module>
    import sklearn
  File "/private/var/folders/3q/zy4z346d5dv7g6r9q_f0z2jw0000gp/T/pip-install-fe4mp1/scikit-learn/sklearn/__init__.py", line 63, in <module>
  File "/private/var/folders/3q/zy4z346d5dv7g6r9q_f0z2jw0000gp/T/pip-install-fe4mp1/scikit-learn/sklearn/__check_build/__init__.py", line 46, in <module>
  File "/private/var/folders/3q/zy4z346d5dv7g6r9q_f0z2jw0000gp/T/pip-install-fe4mp1/scikit-learn/sklearn/__check_build/__init__.py", line 26, in raise_build_error
OSError: [Errno 20] Not a directory: '/Users/divay/Documents/pyspark-example-project/packages.zip/sklearn/__check_build'

ModuleNotFoundError: No module named 'dependencies'

When I run the code with the following command:

$ spark-submit --master local[*] jobs/reconciliation.py

I get the error:

ModuleNotFoundError: No module named 'dependencies'

It's because jobs and dependencies are sibling folders. Can someone please help me see where I am going wrong?
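
For context, a minimal sketch of why this import only resolves when the dependencies package is shipped to Spark; the --py-files / packages.zip convention mentioned in the comments is taken from this repository's layout, and the job file name comes from the command above:

# jobs/reconciliation.py
# 'dependencies' is a sibling package of 'jobs', so this import only works if
# the package is on the driver's (and executors') PYTHONPATH - for example by
# zipping it and shipping the archive with spark-submit --py-files (as this
# repository does with packages.zip), or by launching from the project root
# with PYTHONPATH set to include it.
from dependencies.spark import start_spark

spark, *_ = start_spark()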

etl_config.json not loaded in EMR

First of all, thanks for the great work! I am new to Spark and this repo has really helped me get started.

I am trying to get my ETL job running on AWS EMR in cluster mode, but got hit with an issue where the PySpark program failed to load my config.json passed in through --files s3://path.... I googled a bit and figured this might be an issue with how --files only copies files to the executor nodes, not the driver node, so the code here https://github.com/AlexIoannides/pyspark-example-project/blob/master/dependencies/spark.py#L93 finds nothing when iterating through the Spark root dir (for reference, I based my guess on this SO post: https://stackoverflow.com/questions/47187533/files-option-in-pyspark-not-working).

Could this be the issue and is there a workaround for it? Or am I not making sense at all?

Best practice around passing DF to multiple functions

Hi Alex,
This is a very informative and helpful project.
However, I have a question about the ETL job.

In extract_data(spark), we build a DataFrame from different sources (e.g. S3, CSV, or a database).

Suppose that in transform_data(df, steps_per_floor_) we have multiple other functions that transform the data,

e.g. cleaning(df), setup_different_conditional_edits(df), windowing_function_transformations(df), other_statistical_transformations(df).

Is it efficient to pass a DataFrame as a Python function argument, or is using a UDF more efficient?
What if the DataFrame is on the order of 10-100 GB?

Can you point me in the right direction?
Thanks for this helpful project :)
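
For what it is worth, a minimal sketch of the pattern described above - passing the DataFrame through plain Python functions (the function bodies are placeholders, not this project's code). Passing a DataFrame as an argument is cheap regardless of data volume, because it only passes a reference to a lazy logical plan; nothing is computed until an action runs, so 10-100 GB of data does not make the function calls themselves more expensive. Python UDFs, by contrast, operate row by row and are usually slower than built-in DataFrame expressions, so they are not a performance optimisation here.

from pyspark.sql import DataFrame


def cleaning(df: DataFrame) -> DataFrame:
    # placeholder transformation
    return df.dropna()


def windowing_function_transformations(df: DataFrame) -> DataFrame:
    # placeholder for window-function logic
    return df


def transform_data(df: DataFrame, steps_per_floor_: int) -> DataFrame:
    # steps_per_floor_ is kept only to mirror the signature in the question.
    # Each call below just extends the lazy logical plan; Spark executes
    # nothing until an action (write, count, show, ...) is triggered.
    df = cleaning(df)
    df = windowing_function_transformations(df)
    return df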

Add License

Please add a LICENSE file to the repo, so that users can be sure of how it may be used in closed-source codebases. See here.

Setup and Teardown should be @classmethods setUpClass and tearDownClass

If they are not class methods, then setUp and tearDown are invoked for every test, and a separate Spark session is created for each of those tests.

import logging
import unittest

from pyspark.sql import SparkSession

from jobs.etl_job import transform_data  # import path assumed; the function under test


class PySparkTest(unittest.TestCase):

    @classmethod
    def suppress_py4j_logging(cls):
        logger = logging.getLogger('py4j')
        logger.setLevel(logging.WARN)

    @classmethod
    def create_testing_pyspark_session(cls):
        return SparkSession \
            .builder \
            .master('local[*]') \
            .appName("my-local-testing-pyspark-context") \
            .getOrCreate()

    @classmethod
    def setUpClass(cls):
        cls.suppress_py4j_logging()
        cls.spark = cls.create_testing_pyspark_session()
        cls.test_data_path = "<PATH>"
        cls.df = cls.spark.read.options(header='true', inferSchema='true') \
            .csv(cls.test_data_path)
        cls.df_expected = transform_data(cls.df, cls.spark)

    @classmethod
    def tearDownClass(cls):
        cls.spark.stop()
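
With setUpClass and tearDownClass, the SparkSession is created and stopped once for the whole test class rather than once per test method, so the test suite only pays the Spark start-up cost a single time.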

Having problems with running this code under PySpark 2.3.1

Hi Alex,

Thanks a lot for creating this example. I cloned your repository and tried to run it on my laptop (macOS Sierra, pyspark==2.3.1, py4j==0.10.7) and encountered the following problem:

Traceback (most recent call last):
  File "/Users/al/Desktop/dev/Spark/etl/etl_job.py", line 237, in <module>
    main()
  File "/Users/al/Desktop/dev/Spark/etl/etl_job.py", line 56, in main
    files=['etl_config.json'])
  File "/Users/al/Desktop/dev/Spark/etl/etl_job.py", line 209, in start_spark
    spark_sess = spark_builder.getOrCreate()
  File "/usr/local/Cellar/apache-spark/2.3.1/libexec/python/lib/pyspark.zip/pyspark/sql/session.py", line 173, in getOrCreate
  File "/usr/local/Cellar/apache-spark/2.3.1/libexec/python/lib/pyspark.zip/pyspark/context.py", line 343, in getOrCreate
  File "/usr/local/Cellar/apache-spark/2.3.1/libexec/python/lib/pyspark.zip/pyspark/context.py", line 115, in __init__
  File "/usr/local/Cellar/apache-spark/2.3.1/libexec/python/lib/pyspark.zip/pyspark/context.py", line 292, in _ensure_initialized
  File "/usr/local/Cellar/apache-spark/2.3.1/libexec/python/lib/pyspark.zip/pyspark/java_gateway.py", line 120, in launch_gateway
TypeError: __init__() got an unexpected keyword argument 'auth_token'
2018-08-24 15:24:22 INFO ShutdownHookManager:54 - Shutdown hook called

What needs to be modified in your source code in order to run it?

Thanks!

Issue while executing the code via PyCharm

File "/home/ashish/Downloads/pyspark-example-project-master/jobs/etl_job.py", line 57, in main
data_transformed = transform_data(data, config['steps_per_floor'])
TypeError: 'NoneType' object is not subscriptable
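
For context, this TypeError usually means config is None: when the job is launched directly from the IDE rather than via spark-submit with --files etl_config.json, start_spark() finds no config file in the Spark working directory and returns no config, so config['steps_per_floor'] fails. A minimal defensive sketch - the helper and fallback path below are assumptions for illustration, not the project's own code:

import json


def load_config_or_fallback(config, fallback_path='configs/etl_config.json'):
    """Return the config dict, loading a local copy if Spark supplied none."""
    if config is not None:
        return config
    # fallback_path is assumed from the repository layout
    with open(fallback_path) as config_file:
        return json.load(config_file)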
