databricks / spark-the-definitive-guide

Spark: The Definitive Guide's Code Repository

Home Page: http://shop.oreilly.com/product/0636920034957.do

License: Other

Java 4.89% Python 33.36% Scala 54.80% R 2.20% TSQL 4.76%

spark-the-definitive-guide's Introduction

Spark: The Definitive Guide

This is the central repository for all materials related to Spark: The Definitive Guide by Bill Chambers and Matei Zaharia.

This repository is currently a work in progress and new material will be added over time.

Code from the book

You can find the code from the book in the code subfolder where it is broken down by language and chapter.

How to run the code

Run on your local machine

To run the examples on your local machine, either copy everything in the data subfolder to /data on your computer, or change the path in each example to point at wherever you placed that particular dataset.

Run on Databricks

To run these notebooks on Databricks, you need to do two things:

  1. Sign up for an account. You can do that here.
  2. Import individual Notebooks to run on the platform

Databricks is a zero-management cloud platform that provides:

  • Fully managed Spark clusters
  • An interactive workspace for exploration and visualization
  • A production pipeline scheduler
  • A platform for powering your favorite Spark-based applications

Instructions for importing

  1. Navigate to the notebook you would like to import

For instance, you might go to this page. From there, open the raw version of the file by clicking the Raw button and save it to your desktop. Alternatively, you can clone the entire repository and navigate to the file on your computer.

  2. Upload it to Databricks

Read the instructions here. Open the Databricks workspace, choose Import in the directory you want, and select the file on your computer to upload it. Due to a recent security upgrade, notebooks cannot be imported from external URLs, so you must upload them from your computer.

  3. You're almost ready to go!

Now you just need to run the notebooks. All of the examples run on Databricks Runtime 3.1 and above, so be sure to create a cluster with a version equal to or greater than that. Once you've created your cluster, attach the notebook to it.

  4. Replace the data path in each notebook

Rather than uploading all of the data yourself, you only have to change the path in each chapter from /data to /databricks-datasets/definitive-guide/data, as in the sketch below. Find and replace makes this quick, and once it's done all of the examples should run without issue.
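
As a quick illustration, here is what one of the reads might look like after the path replacement, assuming the flight-data CSV layout that ships in this repository's data folder (a sketch, not an excerpt from the book):

// Read one of the repo's datasets on Databricks after swapping the /data
// prefix for /databricks-datasets/definitive-guide/data
val flightData2015 = spark.read
  .option("inferSchema", "true")
  .option("header", "true")
  .csv("/databricks-datasets/definitive-guide/data/flight-data/csv/2015-summary.csv")

flightData2015.take(3)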

spark-the-definitive-guide's People

Contributors

abouklila, adigiosaffatte, bllchmbrs, evohnave, hajimurtaza, neeleshkumar-mannur

spark-the-definitive-guide's Issues

ch3 sample code error

// in Scala
case class Flight(DEST_COUNTRY_NAME: String,
                  ORIGIN_COUNTRY_NAME: String,
                  count: BigInt)
val flightsDF = spark.read
  .parquet("/data/flight-data/parquet/2010-summary.parquet/")
val flights = flightsDF.as[Flight]

error: Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
val flights = flightsDF.as[Flight]
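
The error message itself points at the likely fix: the implicit encoders have to be in scope before calling .as[Flight]. A minimal sketch of the corrected snippet, assuming it is run somewhere spark.implicits._ is not already imported (for example, a compiled application rather than the spark-shell):

// in Scala
import spark.implicits._   // brings the encoders for case classes into scope

case class Flight(DEST_COUNTRY_NAME: String,
                  ORIGIN_COUNTRY_NAME: String,
                  count: BigInt)

val flightsDF = spark.read
  .parquet("/data/flight-data/parquet/2010-summary.parquet/")
val flights = flightsDF.as[Flight]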

Code Error: Advanced_Analytics_and_Machine_Learning-Chapter_25_Preprocessing_and_Feature_Engineering.scala

When executing the code on lines 2 to 7 of "code/Advanced_Analytics_and_Machine_Learning-Chapter_25_Preprocessing_and_Feature_Engineering.scala", shown below,

val sales = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/data/retail-data/by-day/*.csv")
  .coalesce(5)
  .where("Description IS NOT NULL")

an error occurs:

Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`Description`' given input columns: [<!DOCTYPE html>]; line 1 pos 0;
'Filter isnotnull('Description)
+- Relation[<!DOCTYPE html>#10] csv
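
The only input column Spark sees is <!DOCTYPE html>, which usually means the files under /data/retail-data/by-day/ are saved HTML pages (for example, GitHub's HTML view of the files) rather than the raw CSVs. A quick, hedged way to check is to read them as plain text and look at the first rows; the real retail CSVs start with a header like InvoiceNo,StockCode,Description,..., not an HTML document:

// in Scala -- sanity check on the downloaded files
spark.read.text("/data/retail-data/by-day/*.csv").show(3, false)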

chapter 24 missing python code

The end of Chapter 24 is Scala only. Is this intentional?

The training summary, and persisting and applying models, are covered in Scala only.

I was able to get the summary to work in Python 3 with Spark 2.3, but not the persist-and-apply part. Can Python be added to the code?

trainedPipeline = tvsFitted.bestModel
lrBestModel = trainedPipeline.stages[1]
lrBestModel.summary.objectiveHistory

Code in Chapter 22 - Windows on Event Time

On page 371, in the Python example, there is a backslash after "streaming" that should not be there because there is additional code on the same line. The fix is either to remove the backslash or to start .selectExpr( on the next line.

Ch 6 Regex. Code is wrong

I'm learning Spark using the book and I ran into this code which returns an error:

from pyspark.sql.functions import expr, locate

simpleColors = ["black", "white", "red", "green", "blue"]
def color_locator(column, color_string):
  return locate(color_string.upper(), column)\
          .cast("boolean")\
          .alias("is_" + c)
selectedColumns = [color_locator(df.Description, c) for c in simpleColors]
selectedColumns.append(expr("*")) # has to be a Column type

df.select(*selectedColumns).where(expr("is_white OR is_red"))\
  .select("Description").show(3, False)

Error: NameError: name 'c' is not defined

Even if I define a dummy 'c' within the function, I get the following:

AnalysisException: "cannot resolve '`is_white`' given input columns: [Quantity, StockCode, is_, InvoiceNo, is_, is_, CustomerID, is_, Description, InvoiceDate, UnitPrice, is_, Country]; line 1 pos 0;\n'Filter ('is_white || 'is_red)\n+- Project [cast(locate(BLACK, Description#14, 1) as boolean) AS is_#1632, cast(locate(WHITE, Description#14, 1) as boolean) AS is_#1633, cast(locate(RED, Description#14, 1) as boolean) AS is_#1634, cast(locate(GREEN, Description#14, 1) as boolean) AS is_#1635, cast(locate(BLUE, Description#14, 1) as boolean) AS is_#1636, InvoiceNo#12, StockCode#13, Description#14, Quantity#15, InvoiceDate#16, UnitPrice#17, CustomerID#18, Country#19]\n   +- Relation[InvoiceNo#12,StockCode#13,Description#14,Quantity#15,InvoiceDate#16,UnitPrice#17,CustomerID#18,Country#19] csv\n"

Any suggestions?

Thank you!

Databricks or iPython Notebooks

It would be very helpful and time-saving to have the Databricks or IPython notebooks available to import, rather than only the .py code files. Could you upload them separately to a directory?

ch10 - Spark SQL - Inserting into tables: Cannot safely cast 'count': string to bigint

Query in the book:

INSERT INTO partitioned_flights
PARTITION (DEST_COUNTRY_NAME="UNITED STATES")
SELECT count, ORIGIN_COUNTRY_NAME FROM flights
WHERE DEST_COUNTRY_NAME='UNITED STATES'
LIMIT 12

In Spark 3.0, the above query returns the following error: AnalysisException: Cannot write incompatible data to table 'default.partitioned_flights':

  • Cannot safely cast 'count': string to bigint

So I modified the query as below to cast the count column to an integer:

INSERT INTO partitioned_flights
PARTITION (DEST_COUNTRY_NAME = "UNITED STATES")
SELECT ORIGIN_COUNTRY_NAME, cast(count as int) count1 FROM flights
WHERE DEST_COUNTRY_NAME = "UNITED STATES"
LIMIT 12

Could you please check on this?
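
A hedged note: the error names bigint as the target column type, so casting count to BIGINT rather than INT matches the table schema exactly. A sketch of the same statement run through spark.sql, assuming the flights view and partitioned_flights table from the chapter already exist:

// in Scala -- cast to the type named in the error message
spark.sql("""
  INSERT INTO partitioned_flights
  PARTITION (DEST_COUNTRY_NAME = 'UNITED STATES')
  SELECT ORIGIN_COUNTRY_NAME, CAST(count AS BIGINT) AS count
  FROM flights
  WHERE DEST_COUNTRY_NAME = 'UNITED STATES'
  LIMIT 12
""")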

Not able to locate files in databricks-datasets

Could you please give proper instructions for the path to the data uploaded in databricks-datasets?

Also, I am not able to load the locally saved data folder directly to my Databricks cluster, and uploading each file separately is really cumbersome.
Is there any solution for that?

Error while connecting to Databases in DataBrick Community Edition on Cloud

Hi,
I have downloaded the repository and was able to execute and practice all of the examples. But when I try to execute the examples related to the SQL data source from Chapter 9 (Data Sources), I get the following error. I don't have any clue what to do, so please guide me. Thanks in advance.
Error: "java.sql.SQLException: path to '/databricks-datasets/definitive-guide/data/flight-data/jdbc/my-sqlite.db': '/databricks-datasets' does not exist"

The path does exist; when I run %fs ls I get the following:

dbfs:/databricks-datasets/definitive-guide/data/flight-data/jdbc/my-sqlite.db

The parameters are as follows:

driver = "org.sqlite.JDBC"
path = "/databricks-datasets/definitive-guide/data/flight-data/jdbc/my-sqlite.db"
url = "jdbc:sqlite:" + path
tablename = "flight_info"
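
A hedged suggestion, assuming a Databricks cluster where DBFS is FUSE-mounted under /dbfs: the SQLite JDBC driver opens the database file through the local filesystem, so it needs the /dbfs/... form of the path rather than the dbfs:/... or bare /databricks-datasets form. A sketch:

// in Scala -- point the SQLite JDBC driver at the FUSE-mounted path
val driver = "org.sqlite.JDBC"
val path = "/dbfs/databricks-datasets/definitive-guide/data/flight-data/jdbc/my-sqlite.db"
val url = "jdbc:sqlite:" + path
val tablename = "flight_info"

val dbDataFrame = spark.read.format("jdbc")
  .option("url", url)
  .option("dbtable", tablename)
  .option("driver", driver)
  .load()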

Chapter 6 - Working with JSON

Under the Working with JSON section in the book, it is mentioned that
"The equivalent in SQL would be
jsonDF.selectExpr("json_tuple(jsonString, '$.myJSONKey.myJSONValue[1]') as column").show(2)"
but that is not SQL syntax. Could you please check and let us know the equivalent SQL expression for the JSON transformation?

color_locator() error? Chapter 6 p. 96

recommend:

def color_locator(column, color_string):
  return locate(color_string.upper(), column)\
          .cast("boolean")\
          .alias("is_" + color_string)  # existing code has 'c' instead of 'color_string' - doesn't work

Chapter 3: java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext.

I created a Scala program and am trying to run the streaming code. It is failing with the error below.

19/11/09 07:25:24 ERROR StreamExecution: Query customer_purchases_2 [id = 6863c8d1-fd1c-49eb-a454-92825f0d2782, runId = 8adbd1e6-04e1-4b79-9fad-05950f359e77] terminated with error
java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext.
This stopped SparkContext was created at:

org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:901)
spark_guide.chapter3.StructuredStreaming$.delayedEndpoint$spark_guide$chapter3$StructuredStreaming$1(StructuredStreaming.scala:12)
spark_guide.chapter3.StructuredStreaming$delayedInit$body.apply(StructuredStreaming.scala:9)
scala.Function0$class.apply$mcV$sp(Function0.scala:34)
scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
scala.App$$anonfun$main$1.apply(App.scala:76)
scala.App$$anonfun$main$1.apply(App.scala:76)
scala.collection.immutable.List.foreach(List.scala:392)
scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
scala.App$class.main(App.scala:76)
spark_guide.chapter3.StructuredStreaming$.main(StructuredStreaming.scala:9)
spark_guide.chapter3.StructuredStreaming.main(StructuredStreaming.scala)
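
A hedged reading of the stack trace: the streaming query was started from the body of a Scala App, and once main() returns, the application shuts down and stops the SparkContext while the query is still running. A minimal sketch of a fix, assuming the purchaseByCustomerPerHour DataFrame from Chapter 3:

// in Scala -- block until the streaming query terminates so the
// SparkContext is not stopped while the stream is still running
val query = purchaseByCustomerPerHour.writeStream
  .format("memory")
  .queryName("customer_purchases_2")
  .outputMode("complete")
  .start()

query.awaitTermination()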

Chapter_3_sort_code_missing

When introducing the streaming DataFrame, the following code:

from pyspark.sql.functions import window, column, desc, col
staticDataFrame\
.selectExpr( # a variant of select that accepts SQL expressions
    "CustomerId",
    "(UnitPrice * Quantity) as total_cost",
    "InvoiceDate") \
.groupBy(
    col("CustomerId"), window(col("InvoiceDate"), "1 day")) \
.sum("total_cost") \
.show(5)

produces output different from the one shown in the chapter, because it is missing a sorting line.

I think the correct code should be:

from pyspark.sql.functions import window, column, desc, col
staticDataFrame\
.selectExpr( # a variant of select that accepts SQL expressions
    "CustomerId",
    "(UnitPrice * Quantity) as total_cost",
    "InvoiceDate") \
.groupBy(
    col("CustomerId"), window(col("InvoiceDate"), "1 day")) \
.sum("total_cost") \
.sort(desc("sum(total_cost)")) \
.show(5)

Clarification on loading data folder having multiple CSV file from local hard drive

Hi,
I'm really stuck with this section of the Spark book:
staticDataFrame = spark.read.format("csv")\
  .option("header", "true")\
  .option("inferSchema", "true")\
  .load("/mnt/defg/retail-data/by-day/*.csv")

  1. I'm not able to understand the load("/mnt/...") part. I have downloaded the data to my local drive, but now the issue is loading it. How do I load the data? (See the sketch below.)

  2. Is the /mnt/defg path mounted via S3, or by some other method?
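
A hedged note on the path: /mnt/defg looks like a Databricks mount path from the authors' environment, not a directory that exists on a local machine. Running locally, the load path should point at wherever the repository's data folder was copied, as the README describes. A sketch, assuming the data folder was copied to /data:

// in Scala -- local read of the repo's retail data
val staticDataFrame = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/data/retail-data/by-day/*.csv")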

not a member of org.apache.spark.sql.SparkSession

In Chapter 2, when loading the flightData2015 CSV, the action commands do not work at the Scala prompt in Spark 2.4.4. You get messages like:
:37 error: value take is not a member of org.apache.spark.sql.SparkSession
:37 error: value sort is not a member of org.apache.spark.sql.SparkSession
:37 error: value show is not a member of org.apache.spark.sql.SparkSession

According to some Stack Overflow posts, this is caused by the wrong version of Maven. However, the README for the Spark download says that downloading Maven separately is not needed because it is included in the pre-built package I downloaded in Chapter 1.

Do you have any suggestions on how to fix this? I don't see any Maven directories or files in this pre-built package spark-2.4.4-bin-hadoop2.7.gz.
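
A hedged observation: the error says take, sort, and show are being called on a SparkSession, which suggests flightData2015 ended up bound to spark itself (for example, through a partial paste) rather than to the DataFrame returned by the read; it does not look like a Maven problem. A sketch of the Chapter 2 read for comparison, assuming the repo's data was copied to /data:

// in Scala -- take/sort/show are DataFrame methods, so they should be
// called on the result of spark.read...csv(...), not on spark
val flightData2015 = spark.read
  .option("inferSchema", "true")
  .option("header", "true")
  .csv("/data/flight-data/csv/2015-summary.csv")

flightData2015.take(3)
flightData2015.sort("count").show(5)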

Code issue chapter2

There is an issue in the last code example, which finds the top destination countries: count should be cast to int before taking the sum, and a syntax error also occurs with the existing code.

Spark SQL Help

I enter this pyspark command:

maxSql = spark.sql(""" SELECT DEST_COUNTRY_NAME, sum(count) as destination_total FROM flight_data_2015 GROUP BY DEST_COUNTRY_NAME ORDER BY sum(count) DESC LIMIT 5 """)

Error:

19/09/17 13:38:04 WARN MetaData: Metadata has jdbc-type of integer yet this is not valid. Ignored
19/09/17 13:38:04 WARN MetaData: Metadata has jdbc-type of integer yet this is not valid. Ignored
19/09/17 13:38:04 WARN MetaData: Metadata has jdbc-type of integer yet this is not valid. Ignored
19/09/17 13:38:04 WARN MetaData: Metadata has jdbc-type of integer yet this is not valid. Ignored
19/09/17 13:38:04 WARN MetaData: Metadata has jdbc-type of integer yet this is not valid. Ignored
19/09/17 13:38:04 WARN MetaData: Metadata has jdbc-type of integer yet this is not valid. Ignored
19/09/17 13:38:04 WARN MetaData: Metadata has jdbc-type of integer yet this is not valid. Ignored
19/09/17 13:38:04 WARN MetaData: Metadata has jdbc-type of integer yet this is not valid. Ignored
19/09/17 13:38:04 WARN MetaData: Metadata has jdbc-type of integer yet this is not valid. Ignored
19/09/17 13:38:04 WARN MetaData: Metadata has jdbc-type of integer yet this is not valid. Ignored
19/09/17 13:38:04 WARN MetaData: Metadata has jdbc-type of integer yet this is not valid. Ignored
19/09/17 13:38:04 WARN MetaData: Metadata has jdbc-type of integer yet this is not valid. Ignored
19/09/17 13:38:07 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mordekay/Downloads/spark-2.4.4-bin-hadoop2.7/python/pyspark/sql/session.py", line 767, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
  File "/Users/mordekay/Downloads/spark-2.4.4-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __ca
ll__
  File "/Users/mordekay/Downloads/spark-2.4.4-bin-hadoop2.7/python/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u'Table or view not found: flight_data_2015; line 1 pos 64'

What is the reason for this error?
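
A hedged pointer, in Scala for consistency with the other sketches here (the PySpark method name is the same): "Table or view not found" usually means the DataFrame was never registered as a temporary view in this session. Chapter 2 registers it before running the SQL:

// in Scala -- register the view, then the SQL can see it
flightData2015.createOrReplaceTempView("flight_data_2015")

val maxSql = spark.sql("""
  SELECT DEST_COUNTRY_NAME, sum(count) AS destination_total
  FROM flight_data_2015
  GROUP BY DEST_COUNTRY_NAME
  ORDER BY sum(count) DESC
  LIMIT 5
""")
maxSql.show()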

samples chapter 5?

In chapter 5 the following dataframe is used:

val df = spark.read.format("csv")
.option("header", "true")
.option("inferSchema", "true")
.load("dbfs:/mnt/defg/streaming/*.csv")
.coalesce(5)

I cannot find a /streaming subdirectory in the /data subdirectory

When using retail-data/all/online-retail-dataset.csv, I have problems getting InvoiceDate recognized as a date.

Chapter 7 - A small correction in sql code in group query

Please find below a correction to the book, under the Grouping section.
In the book:

-- in SQL
%sql
SELECT count(*) FROM dfTable
GROUP BY InvoiceNo, CustomerId

+---------+----------+-----+
|InvoiceNo|CustomerId|count|
+---------+----------+-----+
|   536846|     14573|   76|
|   537026|     12395|   12|
|   537883|     14437|    5|
...
|   543188|     12567|   63|
|   543590|     17377|   19|
|  C543757|     13115|    1|
|  C544318|     12989|    1|
+---------+----------+-----+

Correct one:

-- in SQL
SELECT InvoiceNo, CustomerId, count(*)
FROM dfTable
GROUP BY InvoiceNo, CustomerId

Could you please modify the source accordingly?

Structured Streaming

Hi - I am following the structured streaming example from the console (using pyspark). I successfully read the JSON files (which I load from S3) and set the files per trigger. However, once I start the activityQuery using the code below, I can't get back into the shell because it just runs continuously. So I cannot execute the activityQuery.awaitTermination command or the spark.sql select command to see the activity_counts table. If I try to create another shell and run pyspark, none of the tables are available or visible.

activityQuery = activityCounts.writeStream.queryName("activity_counts")\
.format("memory").outputMode("complete")\
.start()
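
A hedged note: start() normally returns right away with a StreamingQuery handle, and awaitTermination() is the call that blocks, so interactive queries against the in-memory table should happen before (or instead of) awaiting termination in a shell session. A sketch of that pattern, in Scala for consistency with the other sketches here and assuming the activityCounts streaming DataFrame from the book:

// in Scala -- query the in-memory sink while the stream is running
val activityQuery = activityCounts.writeStream
  .queryName("activity_counts")
  .format("memory")
  .outputMode("complete")
  .start()

for (i <- 1 to 5) {
  spark.sql("SELECT * FROM activity_counts").show()
  Thread.sleep(1000)
}

activityQuery.awaitTermination()

The memory sink's table also lives only in the session that started the query, which would explain why a second pyspark shell cannot see it.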

Chapter 9: java.lang.ClassNotFoundException: org.sqllite.JDBC

Hi, I am trying to use SQLite to create a DataFrame. Below is the code.

driver = "org.sqllite.JDBC"
path = "/dbfs/databricks-datasets/definitive-guide/data/flight-data/jdbc/my-sqlite.db"
url = "jdbc:sqllite:" + path
table_name = "flight_info"

dfDF = spark.read.format("jdbc").option("url", url)\
  .option("dbtable", table_name)\
  .option("driver", driver)\
  .load()

I am getting a class-not-found error,
java.lang.ClassNotFoundException: org.sqllite.JDBC, on the line that has the driver option.

Is it possible to run JDBC code on Community Edition? Please let me know what I need to do to sort out this error.
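
A hedged observation from the error text: the SQLite driver class is org.sqlite.JDBC and the URL scheme is jdbc:sqlite:, each with a single "l", as in the repo's other JDBC examples; the extra "l" in org.sqllite.JDBC is the usual cause of this ClassNotFoundException.

// in Scala -- spelling that matches the bundled SQLite JDBC driver
val driver = "org.sqlite.JDBC"
val path = "/dbfs/databricks-datasets/definitive-guide/data/flight-data/jdbc/my-sqlite.db"
val url = "jdbc:sqlite:" + path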

Chapter 7 - A small correction to sql code in sumDistinct

Please find below a correction to the book, under the sumDistinct section.
In the book:

-- in SQL
select sum(Quantity) from dfTable -- 29310

This query actually returns 5176450, not 29310.

Correct one:

-- in SQL
select sum(distinct(Quantity)) from dfTable -- 29310

This query returns 29310, matching the comment.

Could you please modify the source accordingly?
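
For reference, a hedged sketch of the DataFrame-side equivalent, assuming the retail df used in that chapter; sumDistinct aggregates only the distinct Quantity values:

// in Scala
import org.apache.spark.sql.functions.sumDistinct

df.select(sumDistinct("Quantity")).show() // 29310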

staticSchema error in chapter 3

I get the following error when running this code from Chapter 3 (Structured Streaming):

in Python

streamingDataFrame = spark.readStream\
    .schema(staticSchema)\
    .option("maxFilesPerTrigger", 1)\
    .format("csv")\
    .option("header", "true")\
    .load("/data/retail-data/by-day/*.csv")

NameError: name 'staticSchema' is not defined

NameError Traceback (most recent call last)
in
2 #How many files read together is identified by maxFilesPerTrigger
3 streamingDataFrame = spark.readStream
----> 4 .schema(staticSchema)
5 .option("maxFilesPerTrigger", 1)
6 .format("csv")\

NameError: name 'staticSchema' is not defined

Can anyone guide me on this? I am running the code on a Databricks Community Edition cluster.

Thanks,
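
A hedged pointer: staticSchema is defined earlier in Chapter 3 from the static read, so the NameError simply means that earlier step was not run in this session. A sketch of the full sequence, in Scala for consistency with the other sketches here (the PySpark version mirrors it):

// in Scala -- the streaming read reuses the schema inferred by the static read
val staticDataFrame = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/data/retail-data/by-day/*.csv")

val staticSchema = staticDataFrame.schema

val streamingDataFrame = spark.readStream
  .schema(staticSchema)
  .option("maxFilesPerTrigger", 1)
  .format("csv")
  .option("header", "true")
  .load("/data/retail-data/by-day/*.csv")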

Chapter 13 - Advanced RDD example of Custom partitioner may need correction

I'm studying Spark's advanced RDD API and got a little confused by one example.

// in Scala
import org.apache.spark.Partitioner

class DomainPartitioner extends Partitioner {
  def numPartitions = 3
  def getPartition(key: Any): Int = {
    val customerId = key.asInstanceOf[Double].toInt
    if (customerId == 17850.0 || customerId == 12583.0) {
      return 0
    } else {
      return new java.util.Random().nextInt(2) + 1
    }
  }
}

As far as I can see in the code documentation, a partitioner must return the same partition ID for the same key. That is not true of the example above. Doesn't assigning a random ID to a key break the Partitioner contract?
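
The concern is reasonable: Spark's Partitioner contract expects getPartition to map the same key to the same partition every time. A hedged sketch of a deterministic variant that keeps the book's intent (isolate the two heavy customer IDs, spread everything else) without the random assignment:

// in Scala -- deterministic: the same customerId always lands in the same partition
import org.apache.spark.Partitioner

class DomainPartitioner extends Partitioner {
  def numPartitions = 3
  def getPartition(key: Any): Int = {
    val customerId = key.asInstanceOf[Double].toInt
    if (customerId == 17850 || customerId == 12583) {
      0
    } else {
      // hash the remaining keys into partitions 1 and 2
      1 + java.lang.Math.floorMod(customerId, numPartitions - 1)
    }
  }
}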

Deep Learning - Transfer learning

Hello,

I tried to follow the transfer learning example in PySpark on pages 534-535. However, when I try to do p_model = p.fit(train_df), I get a 'Number of source Raster bands and source color space components do not match' error. I tried googling it, but I wasn't sure what to do. The trace includes references to java.awt.image.ColorConvertOp.filter, com.sun.imageio.plugins.jpeg.JPEGImageReader..., and javax.imageio.ImageIO.read. Note that I used Spark ML's ImageSchema read function because the sparkdl readImages function seems to be deprecated.

Thanks in advance,
Nick

any instruction to import your code and data

It's my first time using Databricks, and I can't find a way to import your code and data into Databricks directly (I only found options like uploading a file or importing by URL). Could you add instructions to the README, or link to them?

chap3 - streaming write results in console

Hi,

I started learning Apache Spark by reading this book, and I'm now at the streaming part of Chapter 3.
For the code snippets, I chose Python 3.
My problem is that nothing is displayed in the console, as the book says should happen, when I run this code:

purchaseByCustomerPerHour.writeStream \
    .format("console") \
    .queryName("customer_purchases_3") \
    .outputMode("complete") \
    .start() \
    .show() 

I don't know if I'm doing something wrong; please tell me how to display the results in the console when the query starts and at every update.
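
A hedged note: with the console sink, each micro-batch is printed to the driver's stdout automatically, and start() returns a StreamingQuery handle rather than a DataFrame, so there is nothing to call show() on. Keeping the handle and blocking on awaitTermination() lets the batches keep appearing. A sketch, in Scala for consistency with the other sketches here and assuming the purchaseByCustomerPerHour DataFrame from Chapter 3:

// in Scala -- the console sink prints each micro-batch on its own
val query = purchaseByCustomerPerHour.writeStream
  .format("console")
  .queryName("customer_purchases_3")
  .outputMode("complete")
  .start()

query.awaitTermination()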

ImportError: cannot import name '_centered' from 'scipy.signal.signaltools'

Hello,
I get the following error while importing seasonal_decompose (from statsmodels.tsa.seasonal import seasonal_decompose).
List of libraries installed on the Databricks cluster:
"azure-storage-blob": 12.9.0
"azure-identity": 1.7.1
"azure-keyvault": 4.2.0
"pandas": 1.3.4
"numpy": 1.20.3
"geopandas": 0.9.0
"shapely": 1.7.1
"dill": (latest available version)
"esda": 2.4.1
"rtree": 0.9.7
"scikit-learn": 1.0.1
"numba": 0.54.1
"setuptools": 58.0.4
"libpysal": 4.5.1
"xgboost": 1.4.2
"catboost": 1.0.3
"pyyaml": 6.0.0
"findspark": 1.3.0
"cvxpy": 1.3.0
"apache-sedona": 1.1.0
"scipy": 1.8.1
"hdbscan": 0.8.32
statsmodels

Please guide me.

what does "in SQL" mean in book

There are many statements in the book like this:
// in Scala
df.select("ORIGIN_COUNTRY_NAME", "DEST_COUNTRY_NAME").distinct().count()
-- in SQL
SELECT COUNT(DISTINCT(ORIGIN_COUNTRY_NAME, DEST_COUNTRY_NAME)) FROM dfTable
I understand "in Scala " means execution in spark-shell. Does "in SQL" mean short hand for spark.sql("xx")? similar question is how can I execute commands in Spark-The-Definitive-Guide/code/xx.sql file?

Ch9 - Writing to SQL database - SQLException: [SQLITE_ERROR] SQL error or missing database (no such table: flight_info)

When executing the command below to write to the following path, I am facing this issue:

val newPath = "jdbc:sqlite://tmp/my-sqlite.db"
val tablename = "flight_info"
val props = new java.util.Properties
props.setProperty("driver", "org.sqlite.JDBC")
csvFile.write.mode("overwrite").jdbc(newPath, tablename, props)

Driver Stack trace at high level:
Caused by: java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such table: flight_info)
at org.sqlite.core.DB.newSQLException(DB.java:890)
at org.sqlite.core.DB.newSQLException(DB.java:901)
at org.sqlite.core.DB.throwex(DB.java:868)
at org.sqlite.core.NativeDB.prepare(Native Method)
at org.sqlite.core.DB.prepare(DB.java:211)

Could you please review and let us know how to resolve this issue?

Strange behaviour of REPL when running first few Ch3 examples

If I run the following code from Chapter 3 as two separately pasted snippets all is OK.

// in Scala - snippet 1
import spark.implicits._
case class Flight(DEST_COUNTRY_NAME: String,
                  ORIGIN_COUNTRY_NAME: String,
                  count: BigInt)
val flightsDF = spark.read
  .parquet("/data/flight-data/parquet/2010-summary.parquet/")
val flights = flightsDF.as[Flight]

// COMMAND ----------

// in Scala - snippet 2
flights
  .filter(flight_row => flight_row.ORIGIN_COUNTRY_NAME != "Canada")
  .map(flight_row => flight_row)
  .take(5)

If I run it as one pasted snippet the result is a java exception.

java.lang.NullPointerException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

I am using Spark 2.2.2.

Kind regards, Lawrence

Sessionization Example never times out

When running lines 239-319 of https://github.com/databricks/Spark-The-Definitive-Guide/blob/master/code/Streaming-Chapter_22_Event-Time_and_Stateful_Processing.scala (plus a few boilerplate lines, using Spark 2.2.0), the "if (oldState.hasTimedOut)" branch on line 281 is never true for me, i.e. the timer never expires. I only see output from the "if (state.values.length > 1000)" branch on line 287. The only modification I made to the code is changing the writeStream.format from "memory" to "json" (and adding the path and checkpointLocation), and I'm using the first 6 files/parts of the activity-data, both with and without readStream.option("maxFilesPerTrigger", 1). I also tried changing line 309 to "CURRENT_TIMESTAMP as timestamp", but it didn't help. Do you get any output (besides empty files) if you comment out lines 287-292? Could you please confirm whether there is a bug in the code or in Spark, or whether I am doing something wrong?

Ch 6 Working With JSON

Using Python...

Having trouble replicating the use of get_json_object and json_tuple (p. 110). One small issue is the use of "... as 'column'" [easily fixed by using .alias("column") instead], but even when accounting for this I'm getting a "DataFrame object is not callable" error. If I replace col("jsonString") with jsonDF.jsonString (wherever it is used), the error goes away. This roughly matches what the documentation for get_json_object uses, so I'm guessing it's the right answer. And when I run it this way I get the results shown in the book...

Also, the SQL string only calls out one column, "column"... and it doesn't look like the pure SQL code of the previous examples, but I was able to reproduce the results.

If you see what I mean and agree, I'll do a PR (the errata page is still down...).

I won't address the Scala code, but it looks as if it would have the same issues...

Code samples in repo

Going through the guide, it's a bit tedious to type out all of the code examples. Copy-pasting from the book is tedious as well because of line endings and curly quotes (“ ”) instead of straight quotes ("").

Would you accept a PR that adds the code examples from the book? Many textbooks make their code examples available on their website.
