pyspark-examples's Introduction

Explanations of all the PySpark RDD, DataFrame, and SQL examples in this project are available at the Apache PySpark Tutorial. All of these examples are coded in Python and tested in our development environment.

Table of Contents (Spark Examples in Python)

PySpark Basic Examples

  • How to create SparkSession (see the sketch after this list)
  • PySpark – Accumulator
  • PySpark Broadcast variables
  • PySpark – repartition() vs coalesce()
  • PySpark – Parallelize
  • PySpark – RDD
  • PySpark – Web/Application UI
  • PySpark – SparkSession
  • PySpark – Cluster Managers
  • PySpark – Install on Windows
  • PySpark – Modules & Packages
  • PySpark – Advantages
  • PySpark – Features
  • PySpark – What is it? & Who uses it?
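
A minimal sketch of the SparkSession topics above (the master setting and app name are illustrative; local[1] runs Spark locally with one core):

from pyspark.sql import SparkSession

# Build a new session or reuse an existing one.
spark = SparkSession.builder \
    .master("local[1]") \
    .appName("SparkByExamples.com") \
    .getOrCreate()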

PySpark DataFrame Examples

  • PySpark – Create a DataFrame (see the sketch after this list)
  • PySpark – Create an empty DataFrame
  • PySpark – Convert RDD to DataFrame
  • PySpark – Convert DataFrame to Pandas
  • PySpark – StructType & StructField
  • Using PySpark Row on DataFrame and RDD
  • Select columns from PySpark DataFrame
  • PySpark Collect() – Retrieve data from DataFrame
  • PySpark withColumn to update or add a column
  • PySpark where() and filter() functions
  • PySpark – Distinct to drop duplicate rows
  • PySpark orderBy() and sort() explained
  • PySpark groupBy() Explained with Examples
  • PySpark Join Types Explained with Examples
  • PySpark Union and UnionAll Explained
  • PySpark UDF (User Defined Function)
  • PySpark flatMap() Transformation
  • PySpark map() Transformation
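
A minimal sketch of creating a DataFrame, converting it to an RDD and back, and converting it to pandas (the sample rows and column names are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkByExamples.com").getOrCreate()

# Create a DataFrame from a list of tuples plus column names.
df = spark.createDataFrame([("James", 30), ("Anna", 25)], ["name", "age"])
df.show()

# Convert to an RDD and back, then to a pandas DataFrame.
rdd = df.rdd
df2 = rdd.toDF(["name", "age"])
pandas_df = df2.toPandas()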

PySpark SQL Functions

  • PySpark Aggregate Functions with Examples (see the sketch after this list)
  • PySpark Window Functions
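
A short sketch of one aggregate and one window function (sample data and column names are illustrative):

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("SparkByExamples.com").getOrCreate()
df = spark.createDataFrame([("Sales", 3000), ("Sales", 4600), ("HR", 3900)], ["dept", "salary"])

# Aggregate: total salary per department.
df.groupBy("dept").agg(F.sum("salary").alias("total_salary")).show()

# Window: rank rows by salary within each department.
w = Window.partitionBy("dept").orderBy(F.desc("salary"))
df.withColumn("rank", F.row_number().over(w)).show()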

PySpark Datasources

  • PySpark Read CSV file into DataFrame (see the sketch after this list)
  • PySpark read and write Parquet File
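
A minimal read/write sketch (the file paths are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkByExamples.com").getOrCreate()

# Read a CSV file with a header row into a DataFrame.
df = spark.read.option("header", True).csv("/tmp/resources/people.csv")

# Write it out as Parquet and read it back.
df.write.mode("overwrite").parquet("/tmp/output/people.parquet")
df2 = spark.read.parquet("/tmp/output/people.parquet")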

pyspark-examples's People

Contributors

haikaruna, nnkumar13, sparkcodegeeks, sparkcodegeeks1, wtysos11


pyspark-examples's Issues

AttributeError: 'NoneType' object has no attribute 'rdd'

Line 39,

keysDF = df.select(explode(map_keys(df.properties))).distinct().show()

throws the error below:

AttributeError                            Traceback (most recent call last)
<ipython-input> in <module>
----> 1 keysList = keysDF.rdd.map(lambda x:x[0]).collect()

AttributeError: 'NoneType' object has no attribute 'rdd'
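
A likely cause: show() prints the DataFrame and returns None, so keysDF ends up as None before the next cell runs. A minimal fix (a sketch, assuming df has a MapType column named properties, as in the original example) is to keep the DataFrame and call show() separately:

from pyspark.sql.functions import explode, map_keys

# Keep the DataFrame; don't chain .show(), which prints and returns None.
keysDF = df.select(explode(map_keys(df.properties))).distinct()
keysDF.show()

keysList = keysDF.rdd.map(lambda x: x[0]).collect()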

Exponential smoothing in PySpark

Hello, I have pandas code for exponential smoothing, but I am not able to do the same in PySpark:
def exponential_smoothing(x, alpha):
    result = []
    for value in x:
        if result:
            smoothed_value = alpha * value + (1 - alpha) * result[-1]
        else:
            smoothed_value = value
        result.append(smoothed_value)
    return result

def apply_exponential_smoothing(df, alpha):
    df['product_area_sales_value_N_mean_T'] = df.groupby(['area_id', 'product_id'])['product_area_sales_value_N_mean'].transform(lambda x: exponential_smoothing(x, alpha))
    df['product_area_sales_unit_N_mean_T'] = df.groupby(['area_id', 'product_id'])['product_area_sales_unit_N_mean'].transform(lambda x: exponential_smoothing(x, alpha))
    return df

tmp3 = apply_exponential_smoothing(tmp3, alpha=0.8)
This is the code. In PySpark I am not able to fetch the previous row's smoothed value, and there is no built-in function for this. Please suggest a solution in Spark.
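
One way to express this in PySpark (a sketch, assuming Spark 3.0+ with pyarrow installed, a Spark DataFrame named sdf, and an ordering column named date; the other column names follow the pandas code above) is to reuse the pandas function inside groupBy().applyInPandas(), which hands each (area_id, product_id) group to pandas whole, so the previous row's smoothed value is available:

import copy
from pyspark.sql.types import DoubleType

def smooth_group(pdf):
    # pdf is a pandas DataFrame holding one (area_id, product_id) group.
    pdf = pdf.sort_values('date')  # 'date' is an assumed ordering column
    pdf['product_area_sales_value_N_mean_T'] = exponential_smoothing(pdf['product_area_sales_value_N_mean'], 0.8)
    pdf['product_area_sales_unit_N_mean_T'] = exponential_smoothing(pdf['product_area_sales_unit_N_mean'], 0.8)
    return pdf

# applyInPandas needs the output schema: the input schema plus the two new columns.
out_schema = copy.deepcopy(sdf.schema)
out_schema.add('product_area_sales_value_N_mean_T', DoubleType())
out_schema.add('product_area_sales_unit_N_mean_T', DoubleType())

result = sdf.groupBy('area_id', 'product_id').applyInPandas(smooth_group, schema=out_schema)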

sparkbyexamples.com website not working

Hi,

Thank you for creating such awesome tutorials. I just want to give you a kind notice that the sparkbyexamples.com website is not working for some reason.


Please help to resolve this.

Error running PySpark in Jupyter Notebook on Windows: Exception: Java gateway process exited before sending its port number

Hello, I am trying to run the PySpark examples on a local Windows machine with Jupyter Notebook using Anaconda. I followed this tutorial and did not find any issue during the installation. However, I still get the following error message when running this example:

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.functions import to_timestamp, current_timestamp
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, LongType

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()





Exception                                 Traceback (most recent call last)
<ipython-input> in <module>
      5 from pyspark.sql.types import StructType, StructField, StringType, IntegerType, LongType
      6
----> 7 spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

~\Anaconda3\envs\sparkenv\lib\site-packages\pyspark\sql\session.py in getOrCreate(self)
226 sparkConf.set(key, value)
227 # This SparkContext may be an existing one.
--> 228 sc = SparkContext.getOrCreate(sparkConf)
229 # Do not update SparkConf for existing SparkContext, as it's shared
230 # by all sessions.

~\Anaconda3\envs\sparkenv\lib\site-packages\pyspark\context.py in getOrCreate(cls, conf)
382 with SparkContext._lock:
383 if SparkContext._active_spark_context is None:
--> 384 SparkContext(conf=conf or SparkConf())
385 return SparkContext._active_spark_context
386

~\Anaconda3\envs\sparkenv\lib\site-packages\pyspark\context.py in __init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, profiler_cls)
142 " is not allowed as it is a security risk.")
143
--> 144 SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
145 try:
146 self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,

~\Anaconda3\envs\sparkenv\lib\site-packages\pyspark\context.py in _ensure_initialized(cls, instance, gateway, conf)
329 with SparkContext._lock:
330 if not SparkContext._gateway:
--> 331 SparkContext._gateway = gateway or launch_gateway(conf)
332 SparkContext._jvm = SparkContext._gateway.jvm
333

~\Anaconda3\envs\sparkenv\lib\site-packages\pyspark\java_gateway.py in launch_gateway(conf, popen_kwargs)
106
107 if not os.path.isfile(conn_info_file):
--> 108 raise Exception("Java gateway process exited before sending its port number")
109
110 with open(conn_info_file, "rb") as info:

Exception: Java gateway process exited before sending its port number
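
A common cause of this exception is that PySpark cannot locate a suitable Java runtime: JAVA_HOME is unset, points to a bare JRE, or points to an unsupported Java version. A minimal sketch of a fix inside the notebook (the JDK path below is hypothetical; point it at your actual JDK 8 or 11 installation before building the session):

import os

# Hypothetical path; adjust to wherever your JDK is actually installed.
os.environ["JAVA_HOME"] = r"C:\Program Files\Java\jdk1.8.0_281"

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

If Spark itself is not on the Python path in Jupyter, the findspark package (pip install findspark, then findspark.init() before the imports) can also help locate the installation.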


