kompetens-bigdata

Sources and presentation for the Big Data competence area (Kompetensområde BigData)

Pre-load by visiting the Downloads page

Definition of Big Data

Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process data within a tolerable elapsed time. (Snijders, C.; Matzat, U.; Reips, U.-D. (2012). "'Big Data': Big gaps of knowledge in the field of Internet")

The three increasing Vs:

Volume (amount of data)
Velocity (speed of data in and out)
Variety (range of data types and sources)

Scaling and Hadoop

Horizontal scaling, that is, adding more server hardware, is important: both volume and velocity should scale linearly with the number of servers (nodes). Hadoop is a Java software framework that lets you set up a cluster of commodity server hardware, across which you distribute your data storage and processing so that they scale linearly.

Downloading, building and configuring Hadoop from the Apache open-source release takes considerable effort, so several distributions exist to simplify that step:

Cloudera CDH
Hortonworks
Microsoft HDInsight
IBM Big Data Platform
Amazon Web Services

Many of these also offer a quick-start single-node system, set up and ready to run in a VirtualBox VM.

Storing data

Data is often stored in a database or in a file system.

HDFS

The most common file system is HDFS, a component of Hadoop. Each node in the Hadoop cluster contributes disk space to HDFS and stores part of the data. All data is also replicated across several nodes (for example 3) to provide redundancy in case of a node failure.
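The replication idea can be illustrated with a minimal Python sketch. It is not HDFS code: the function names `place_blocks` and `readable_blocks`, the node names and the simple round-robin placement are all assumptions made for illustration, and real HDFS chooses replica locations with its own placement policy.

```python
def place_blocks(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes (toy round-robin policy)."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for block in range(num_blocks):
        # Consecutive nodes, wrapping around, so replicas never share a node.
        placement[block] = [
            nodes[(block + i) % len(nodes)] for i in range(replication)
        ]
    return placement


def readable_blocks(placement, failed_node):
    """Return the blocks that still have at least one replica on a live node."""
    return [
        block
        for block, replicas in placement.items()
        if any(node != failed_node for node in replicas)
    ]


# Hypothetical four-node cluster storing six blocks with three replicas each.
nodes = ["node1", "node2", "node3", "node4"]
placement = place_blocks(6, nodes)
survivors = readable_blocks(placement, failed_node="node2")
```

With three replicas per block, every block stays readable after any single node fails, which is the redundancy described above.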

Databases

Many databases have been developed to manage larger data sets with higher throughput, compared to relational databases. Some examples:

BigTable (Google)
HBase (Hadoop)
Cassandra (Apache)
MongoDB

Processing data

Hadoop is the dominant framework for scaling out and distributing the processing of big data. Two processing models are commonly used:

MapReduce
Spark
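As a rough illustration of the MapReduce model, the sketch below counts words in plain Python on a single machine. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and this is not the Hadoop API; in a real cluster the map and reduce steps run in parallel on different nodes and the framework performs the shuffle.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data is big", "data is everywhere"])
print(counts)
```

Because each line is mapped independently and each key is reduced independently, both phases can be spread over many nodes.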

MapReduce high-level frameworks

Nowadays, you rarely develop pipelines based on the low-level MapReduce API. Instead, you use other tools, frameworks or languages, such as

Crunch, a Java pipeline builder framework for Spark or MapReduce
Pig, a high-level data-flow language and execution framework for parallel computation
Hive, which provides data summarization and ad hoc querying

Sending data

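Kafka is a commonly used message bus to which you connect producers and consumers. Each consumer can receive every message that is produced, even if its network connection is interrupted: Kafka retains messages in a log (for a configured retention period), and the consumer resumes from its last position once it reconnects.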

Two common use cases of Kafka:

Collecting log data from front-end servers and writing the logs to HDFS
Real-time processing of streams with high bandwidths, e.g. web site analytics
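Why a consumer does not lose messages after an interruption can be sketched with a toy in-memory log. This is not the Kafka client API: `TopicLog` and `Consumer` are hypothetical classes, and the sketch assumes a single partition and a log that never expires messages.

```python
class TopicLog:
    """Append-only message log; each message gets an increasing offset."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        self._messages.append(message)
        return len(self._messages) - 1  # offset of the new message

    def read_from(self, offset):
        return self._messages[offset:]


class Consumer:
    """Tracks the offset of the next message it has not yet read."""

    def __init__(self, log):
        self.log = log
        self.next_offset = 0
        self.received = []

    def poll(self):
        batch = self.log.read_from(self.next_offset)
        self.received.extend(batch)
        self.next_offset += len(batch)
        return batch


log = TopicLog()
consumer = Consumer(log)

log.append("m0")
log.append("m1")
consumer.poll()

# The consumer is disconnected while the producer keeps writing.
log.append("m2")
log.append("m3")

# After reconnecting, the consumer continues from its stored offset.
missed = consumer.poll()
```

The key design point is that the log, not the connection, holds the messages, so a consumer only needs to remember how far it has read.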

Serializing data

Common formats of serialized big data are

Avro (Apache Hadoop), which supports schema evolution
ProtoBuf (Google)
JSON (JavaScript Object Notation)
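Schema evolution means that data written with an older schema can still be read after fields have been added. The sketch below imitates that idea with plain JSON and Python's standard `json` module; it does not use the Avro library, and `READER_SCHEMA_V2`, `serialize` and `deserialize` are hypothetical names chosen for this example.

```python
import json

# Hypothetical reader schema: field name -> default value.
# A default of None marks the field as required.
READER_SCHEMA_V2 = {"user": None, "clicks": None, "country": "unknown"}


def serialize(record):
    """Encode a record as UTF-8 JSON bytes."""
    return json.dumps(record, sort_keys=True).encode("utf-8")


def deserialize(payload, schema):
    """Decode a record, filling fields missing from old data with defaults."""
    raw = json.loads(payload.decode("utf-8"))
    record = {}
    for field, default in schema.items():
        if field in raw:
            record[field] = raw[field]
        elif default is not None:
            record[field] = default
        else:
            raise ValueError(f"missing required field: {field}")
    return record


# Written before the "country" field existed.
old_payload = serialize({"user": "anna", "clicks": 3})
new_record = deserialize(old_payload, READER_SCHEMA_V2)
print(new_record)
```

The old record is read without error because the newer schema supplies a default for the field that was added later.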
