
Parallel and Distributed Computing with Map-Reduce

Introduction

MapReduce is a programming paradigm that enables scalability across hundreds or thousands of servers for big data analytics. The underlying concept can be somewhat difficult to grasp at first, because this paradigm differs from traditional programming practices. This lesson aims to present a simple yet intuitive account of MapReduce that we shall put into practice in upcoming labs.

In a nutshell, the term "MapReduce" refers to two distinct tasks. The first is the Map job, which takes one set of data and transforms it into another set of data in which individual elements are broken down into tuples (key/value pairs). The Reduce job then takes the output of a map as its input and combines those data tuples into a smaller set of tuples.
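As a first taste, here is a minimal plain-Python sketch of that two-step idea; the sales records and city names are made up purely for illustration:

from collections import defaultdict

# Hypothetical raw records (made up for illustration): "city,sale amount"
records = ["london,5", "paris,3", "london,2", "tokyo,7"]

# Map job: transform each record into a (key, value) tuple
mapped = []
for record in records:
    city, amount = record.split(",")
    mapped.append((city, int(amount)))

# Reduce job: combine all tuples that share a key into a smaller set of tuples
totals = defaultdict(int)
for city, amount in mapped:
    totals[city] += amount

print(dict(totals))   # {'london': 7, 'paris': 3, 'tokyo': 7}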

Let's see this with the help of some simple examples in this lesson.

Objectives

You will be able to:

  • Understand distributed and parallel computing environments
  • Explain how Map-Reduce differs from traditional programming approaches
  • Understand how Map-Reduce works using a simple word count example

Parallel and Distributed Processing

The MapReduce programming paradigm is designed to allow parallel and distributed processing of large datasets that classify as Big Data. MapReduce allows us to convert such big datasets into sets of tuples as key:value pairs, as we'll see shortly. These pairs are analogous to the data structures we have seen with dictionaries, JSON files, etc. The tuples are combined and reduced in a computational environment that allows distributed execution of complex tasks on a group (cluster) of interconnected computers.

So in simpler terms, MapReduce uses parallel distributed computing to turn big data into regular data.

Let's first see what we mean by parallel and distributed processing below.

Distributed Processing Systems

A distributed system is a network of computers that communicate with each other in order to achieve a unified goal.

The computers in a distributed system are independent and do not share physical memory or processors. They communicate with each other via messages transferred from one computer to another over a network. The computers in such an environment are normally referred to as the nodes of that network. Messages between nodes can communicate many things:

  • Computers communicate among themselves to execute computing procedures with given conditions
  • Computers send and receive actual packets of data
  • Computers send signals that tell other nodes to behave a certain way, and so on (a minimal code sketch of this message-passing idea follows below)
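As a rough illustration of message passing, the sketch below simulates two "nodes" as separate Python processes exchanging messages through queues on a single machine; a real distributed system would send these messages over a network instead:

from multiprocessing import Process, Queue

def worker(inbox, outbox):
    # A "node" waits for a message, acts on it, and sends a reply
    task = inbox.get()
    outbox.put(f"processed {task}")

if __name__ == "__main__":
    inbox, outbox = Queue(), Queue()
    node = Process(target=worker, args=(inbox, outbox))
    node.start()

    inbox.put("data packet #1")      # send a message to the other "node"
    print(outbox.get())              # its reply: "processed data packet #1"
    node.join()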

The terms "concurrent computing", "parallel computing", and "distributed computing" have a lot of overlap, and no clear distinction exists between them. The same system may be characterized both as "parallel" and "distributed"; the processors in a typical distributed system often run concurrently, in parallel.

In the image below, (a) and (b) are examples of distributed networks in which each node has its own memory and processing capabilities. In (c), memory is shared between processors to allow real-time, parallel execution of tasks on multiple processors.

Nodes (i.e. the processors shown above) in a distributed system can have different roles. A node's role depends on the goal of the system and the node's own hardware and software properties. There are two predominant ways of organizing computers in a distributed system. The first is the client-server architecture, and the second is the peer-to-peer architecture.

We know from earlier lessons that the client-server architecture operates by allowing a client to make requests for communication/data to the server, which in turn accepts or rejects these requests. In a peer-to-peer system (think of file sharing systems like BitTorrent), all connected computers can communicate freely with each other without requiring any approval from a server. Here is a great article highlighting the differences between the two approaches.

Parallel Processing Systems

Parallel distributed processing refers to a powerful framework where massive volumes of data are processed very quickly by distributing processing tasks across clusters of computers.

Consider the world-famous Moore's Law and how computational speeds have grown over the past 50 years or so. Despite this explosion in speed, computers today still aren't able to keep up with the scale of Big Data. Take the example of gene sequencing technology in the field of genomics. This technology is making ever larger amounts of gene-sequence data available, more quickly. The speed required to process this data is growing much faster than the rate at which processors are getting faster under Moore's Law. This is shown in the image below.

In other words, for genetic analysis experiments, computers are becoming less and less able to cope with the scale of the processing problem each year, even though the computers themselves are getting faster.

Multi-Processor Architectures

Individual processors, however fast they get, will always have physical and mechanical constraints on speed and storage, as the chart above suggests. An obvious solution that the technology world is turning to is using multiple processors. If two, three, or more processors are available, then many programs can be executed more quickly in a distributed environment, as mentioned earlier. While one processor (a node) works on one aspect of some computation, other nodes can work on another. All of them can share the same data, and the task proceeds in parallel. A comparison between sequential processing (traditional processing on a single processor) and parallel processing is shown in the image below:
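To make the sequential-versus-parallel comparison concrete in code, here is a minimal sketch using Python's multiprocessing.Pool; the slow_square function and the half-second delay are stand-ins for a genuinely expensive computation:

from multiprocessing import Pool
import time

def slow_square(x):
    time.sleep(0.5)          # stand-in for an expensive computation
    return x * x

if __name__ == "__main__":
    data = list(range(8))

    start = time.time()
    sequential = [slow_square(x) for x in data]       # one processor, one task at a time
    print(f"sequential: {time.time() - start:.1f}s")  # roughly 8 x 0.5 = 4 seconds

    start = time.time()
    with Pool(processes=4) as pool:
        parallel = pool.map(slow_square, data)        # the same work spread over 4 workers
    print(f"parallel:   {time.time() - start:.1f}s")  # roughly a quarter of that

    print(sequential == parallel)                     # True: same results, computed faster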

In order to work together, multiple processors need to be able to share information with each other. This is handled by a distributed processing framework such as Hadoop or Spark. The role of a processor in a parallel computation is to carry out the evaluation and execution rules of a specified programming language.

In contrast to the Message Passing Model for distributed environments described earlier, there is also a Shared Memory Model for parallel processing, in which different processes may execute different statements, but any statement can affect the shared environment.
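A minimal sketch of the shared memory idea, assuming Python's multiprocessing primitives on a single machine as a stand-in for true multi-processor shared memory; every process executes its own statement, but all of them affect the same shared value:

from multiprocessing import Process, Value, Lock

def add_to_total(shared_total, lock, amount):
    # Each process runs a different statement, but all affect the shared environment
    with lock:                       # the lock stops two processes writing at once
        shared_total.value += amount

if __name__ == "__main__":
    total = Value("i", 0)            # an integer living in shared memory
    lock = Lock()
    workers = [Process(target=add_to_total, args=(total, lock, n)) for n in range(1, 5)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print(total.value)               # 1 + 2 + 3 + 4 = 10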

So What is Map-Reduce ?

MapReduce is a software framework developed for processing datasets that qualify as "Big Data" in a parallel and distributed processing environment over several computers/nodes connected to each other as part of a cluster.

The key idea behind MapReduce is mapping a large dataset into a collection of key:value pairs, and then reducing all pairs that share the same key. The connected nodes act as either Mappers or Reducers according to the task allocated to them by a dedicated Master, which, in the simplest terms, is responsible for task allocation, fault tolerance and other control tasks.

This may sound a bit counterintuitive at first, but we shall soon look at the simple (and standard) example used to introduce MapReduce: the Word Count problem. The overall concept of MapReduce is very simple yet very powerful:

  • Somehow, all data can be mapped to key:value pairs. Yes, ALL data.
  • Keys and values themselves can be of ANY data type, as the small sketch below illustrates.
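A tiny illustrative sketch of that flexibility; the example pairs below are made up, but they show string, composite and numeric keys and values living side by side as ordinary Python tuples:

# Illustrative only: very different kinds of data, all expressed as (key, value) tuples
word_pair   = ("deer", 1)                                           # string key, integer value
tweet_pair  = ("user_42", "Just finished the MapReduce lesson!")    # string key, string value
sensor_pair = (("station_7", "2019-01-01"), 21.5)                   # composite (tuple) key, float value

for key, value in [word_pair, tweet_pair, sensor_pair]:
    print(key, "->", value)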

MapReduce Workflow

The MapReduce workflow involves creating map and reduce tasks as mentioned above. Let's look at these in a bit more detail here.

MAP Task (Splitting & Mapping)

The dataset that needs processing must first be transformed into key:value pairs and split into fragments, which are then assigned to map tasks. Each computing cluster is assigned a number of map tasks, which are subsequently distributed among its nodes.

A Computing Cluster is a group of nodes that are connected to each other and perform a shared computing task.

After processing of the original key:value pairs, some intermediate key:value pairs are generated. The intermediate key:value pairs are sorted by their key values to create a new list of key:value pairs.

REDUCE Task (Shuffling, Reducing)

The list from the map task is divided into a new set of fragments. The number of these new fragments will be the same as the number of reduce tasks, and every reduce task is assigned one fragment. The reduce task processes its fragment and produces an output, which is also a set of key:value pairs. Reduce tasks may also be distributed among the different nodes of the cluster. After the task is completed, the final output is written to the file system. The underlying file system is usually HDFS (the Hadoop Distributed File System).

We can efficiently break down and begin to make sense of a huge volume, velocity, and variety of data by using map and reduce tasks as shown above. A general diagram of the MapReduce architecture is shown below:

The complete execution process (the execution of both Map and Reduce tasks) is controlled by two types of entities:

Jobtracker: acts as the master (responsible for the complete execution of a submitted job)

Multiple Task Trackers: act as slaves, each of them performing part of the job

The Word Count problem

The Map-Reduce paradigm is quite different from traditional programming paradigms, as you can probably tell by now. Let's look at the classic "word count" example from the days of early Unix systems, https://en.wikipedia.org/wiki/Wc_(Unix), which counts how often words appear in a document. In a MapReduce environment, the solution requires one map-reduce job; more complex tasks might need more jobs. Here we shall keep it simple, as shown in the image below, where we count how many times each word appears in the given document. This toy example scales up to analyzing terabytes of text in a digital library.

A brief description of each phase:

  1. Split: Split the input data into a number of fragments according to a size defined during configuration. (In the Hadoop file system "HDFS", the default block size is 64 MB, which can be changed if required.)

  2. Mapper: Each mapper processes an individual fragment rather than the whole document; a fragment contains only part of the document. In the map phase, we tokenize the words in each line of the given fragment into <Key, Value> pairs: for each token we set key = word and value = 1.

  3. Shuffler: The shuffler sorts the intermediate pairs so that all identical keys are grouped together before they are passed to the reducer.

  4. Reducer: A reducer processes each sorted split. In the word count example, the reducer calculates the number of occurrences of each word by summing the values it receives.

We shall soon look at programming this simple example in an emulated Spark environment. Meanwhile, have a look at this article to read about the process in a bit more detail. In essence:

A MapReduce program is defined by at least two functions, operating on three sets of key:value pairs:

map: (k1, v1) → [(k2, v2)]

reduce: (k2, [v2]) → [(k3, v3)]

General pseudocode for the word count map and reduce tasks would look like this:

# Count word frequency
def map(doc):
    # MAP: emit a (word, 1) pair for every word in the document fragment
    for word in doc.split(' '):
        emit(word, 1)

def reduce(key, values):
    # REDUCE: sum up all the counts collected for this word
    emit(key, sum(values))
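To see how these pieces fit together, here is a minimal, single-machine simulation of the whole pipeline in plain Python. The function names map_fn and reduce_fn, the toy fragments, and the use of sorted() plus groupby() in place of a real shuffle phase are all choices made here so the sketch runs without a cluster or any framework:

from itertools import groupby

def map_fn(fragment):
    # MAP: emit a (word, 1) pair for every word in the fragment
    return [(word, 1) for word in fragment.split()]

def reduce_fn(word, counts):
    # REDUCE: one call per distinct word, fed all of that word's counts
    return (word, sum(counts))

# SPLIT: a toy document broken into fragments (one per mapper)
fragments = ["deer bear river", "car car river", "deer car bear"]

# MAP: each fragment is processed independently (on a cluster, on different nodes)
mapped = [pair for fragment in fragments for pair in map_fn(fragment)]

# SHUFFLE: sort the intermediate pairs so identical keys sit next to each other
shuffled = sorted(mapped)

# REDUCE: group by key and combine each group's values
counts = [reduce_fn(word, [v for _, v in group])
          for word, group in groupby(shuffled, key=lambda pair: pair[0])]

print(counts)   # [('bear', 2), ('car', 3), ('deer', 2), ('river', 2)]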

Similarly, we can chain several Map-Reduce jobs together in order to complete a given task. This means that once the first MapReduce job is finished, its output becomes the input for the second MapReduce job, whose output could be the final result (or be fed into yet another job).

Let's assume that we would like to extend the word count program to count all words in a given Twitter dataset. The first MapReduce job would read our Twitter data and extract the tweet text. The second MapReduce job would be the word count job, which analyzes that text and produces statistics about it. So this is simply chaining together multiple jobs:

InputFile -> Map-1 -> Reduce-1 -> Output-1 -> Map-2 -> Reduce-2 -> Output-2 -> ... -> Map-x
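A hypothetical sketch of such a two-job chain, again in plain Python rather than a real framework; the toy tweets, the field names, and the use of collections.Counter are illustrative choices, not part of any actual Twitter pipeline:

from collections import Counter

# Hypothetical two-job chain; the toy tweets below are purely illustrative
tweets = [
    {"id": 1, "user": "a", "text": "big data is big"},
    {"id": 2, "user": "b", "text": "map reduce maps big data"},
]

# Job 1: extract the tweet text (its output becomes the input of Job 2)
job1_output = [(tweet["id"], tweet["text"]) for tweet in tweets]

# Job 2: the word count job, run over Job 1's output
word_counts = Counter()
for _, text in job1_output:
    word_counts.update(text.split())

print(word_counts)   # Counter({'big': 3, 'data': 2, ...})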

Next, we shall look briefly at Hadoop, a distributed processing framework built on the MapReduce programming paradigm, and at Apache Spark, which adds extra features of security and fault tolerance to its MapReduce offering, making it an industry standard. We shall also look at programming the word count problem mentioned above.

Additional Resources

Visit the following external links to read about the above descriptions and examples in more detail.

Summary

In this lesson, we looked at how MapReduce provides a programming paradigm quite different from traditional programming practices, yet very powerful and effective for processing large amounts of data. Next, we shall look at the Spark programming environment and some coding exercises to get to grips with PySpark programming.
