In this section, you learned about the challenges of working with big data and the need for distributed computing frameworks like Apache Spark.
You will be able to:
- Describe the key components of Spark and the Big Data ecosystem
- Perform analyses using Spark
The key takeaways from this section include:
- Big Data usually refers to datasets, typically terabytes (TB) to petabytes (PB) in size, that grow so large that they become awkward to work with using traditional database management systems and analytical approaches
- MapReduce deals with Big Data Analytics by splitting a big dataset into smaller subsets that are distributed over several machines and processed in parallel (see the word count sketch after this list)
- Before starting to work, you need to install Docker and Kitematic in your environment
- Make sure to test your installation to confirm everything is working
- When you start working with PySpark, you have to create a `SparkContext()` (see the RDD sketch after this list)
- The creation of RDDs is essential when working with PySpark
- Examples of actions and transformations include `collect()`, `count()`, `filter()`, `first()`, `take()`, and `reduce()`
- Machine Learning on the scale of big data can be done with Spark using the `ml` library (see the regression sketch after this list)
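
To make the MapReduce pattern concrete, here is a minimal word count sketch in PySpark. The input file name `sample.txt` is hypothetical; any local text file will do.

```python
from pyspark import SparkContext

# The SparkContext is the entry point to a (here, local) Spark cluster
sc = SparkContext("local[*]", "WordCount")

# Hypothetical input file; substitute any text file you have locally
lines = sc.textFile("sample.txt")

word_counts = (lines.flatMap(lambda line: line.split())  # map: lines -> words
                    .map(lambda word: (word, 1))         # map: word -> (word, 1)
                    .reduceByKey(lambda a, b: a + b))    # reduce: sum counts per word

print(word_counts.take(10))  # take() is an action, so it triggers the computation
sc.stop()
```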
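
Here is a minimal sketch of creating a `SparkContext`, building an RDD, and exercising the transformations and actions listed above; the numbers are arbitrary illustration data.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "RDDBasics")

# parallelize() turns a local Python collection into an RDD
rdd = sc.parallelize([1, 2, 3, 4, 5, 6])

# filter() is a transformation: it is lazy and returns a new RDD
evens = rdd.filter(lambda x: x % 2 == 0)

# Actions trigger the computation and return results to the driver
print(evens.collect())                 # [2, 4, 6]
print(evens.count())                   # 3
print(evens.first())                   # 2
print(evens.take(2))                   # [2, 4]
print(rdd.reduce(lambda a, b: a + b))  # 21

sc.stop()
```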
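
Finally, a minimal regression sketch with the `ml` library, which works on DataFrames rather than RDDs. The column names and the tiny dataset are made up for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# The ml library is DataFrame-based, so we start from a SparkSession
spark = SparkSession.builder.appName("MLSketch").getOrCreate()

# Made-up dataset: one feature column ("x") and a label
df = spark.createDataFrame(
    [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)],
    ["x", "label"],
)

# ml estimators expect features packed into a single vector column
assembler = VectorAssembler(inputCols=["x"], outputCol="features")
train = assembler.transform(df)

model = LinearRegression(featuresCol="features", labelCol="label").fit(train)
print(model.coefficients, model.intercept)

spark.stop()
```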