The key takeaways from this section include:
- Big Data usually refers to datasets that grow so large that they become awkward to work with using traditional database management systems and analytical approaches
- Big Data typically ranges from terabytes (TB) to petabytes (PB) in size
- MapReduce splits a big dataset into smaller sets that are distributed over several machines, which makes Big Data Analytics feasible
- Before starting to work, you need to install Docker and Kitematic in your environment
- Make sure to test your installation so you're sure everything is working
- When you start working with PySpark, you first have to create a `SparkContext()`
- The creation of RDDs is essential when working with PySpark
- Examples of actions and transformations include `collect()`, `count()`, `filter()`, `first()`, `take()`, and `reduce()`
- Machine Learning on the scale of Big Data can be done with Spark using the `ml` library