kompetens-bigdata

Sources and presentation for the Big Data competence area (Kompetensomrade BigData)

Pre-load by visiting the Downloads page

Definition of Big Data

Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process data within a tolerable elapsed time. (Snijders, C.; Matzat, U.; Reips, U.-D. (2012). "'Big Data': Big gaps of knowledge in the field of Internet")

The three increasing Vs:

  • Volume (amount of data)
  • Velocity (speed of data in and out)
  • Variety (range of data types and sources)

Scaling and Hadoop

Horizontal scaling, that is, adding more server hardware, is important: both volume and velocity should scale linearly with the number of servers (nodes). Hadoop is a Java software framework that lets you set up a cluster of commodity server hardware, across which you distribute your data storage and processing so that it scales linearly.

Downloading, building and configuring Hadoop from the Apache open-source release takes considerable effort, so various distributions exist to simplify that step:
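As a rough illustration of why adding nodes spreads the load, the sketch below assigns records to nodes by hashing their keys. It is a generic illustration, not Hadoop's own placement mechanism; the names node_for and load_per_node and the record keys are made up for this example.

```python
import zlib
from collections import Counter

def node_for(key, num_nodes):
    """Pick a node index for a key using a deterministic hash."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes

def load_per_node(keys, num_nodes):
    """Count how many keys land on each node."""
    return Counter(node_for(key, num_nodes) for key in keys)

# Hypothetical record keys, for illustration only.
keys = [f"record-{i}" for i in range(10_000)]

# With more nodes, the busiest node holds a smaller share of the records.
for num_nodes in (2, 4, 8):
    load = load_per_node(keys, num_nodes)
    print(num_nodes, "nodes, busiest node holds", max(load.values()), "records")
```

Each time the node count doubles, the share held by any single node should drop to roughly half, which is the behaviour that "scales linearly" describes.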

  • Cloudera CDH
  • Hortonworks
  • Microsoft HDInsight
  • IBM Big Data Platform
  • Amazon Web Services

Many of these also offer a quick-start single-node system, set up and ready to run in a VirtualBox VM.

Storing data

Data is often stored in a database or in a file system.

HDFS

The most common file system for big data is HDFS (Hadoop Distributed File System), which is a component of Hadoop. Each node in the Hadoop cluster contributes disk space to HDFS and stores a part of the data. All data is also replicated across several nodes (for example 3) to provide redundancy if a node fails.
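The replication idea can be sketched with a toy placement function. This is a simplified round-robin model, assumed here for illustration; real HDFS uses its own placement policy, and the names place_blocks and readable_blocks are hypothetical.

```python
def place_blocks(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            nodes[(block + r) % len(nodes)] for r in range(replication)
        ]
    return placement

def readable_blocks(placement, failed_node):
    """Return the blocks that still have a replica on a surviving node."""
    return [
        block
        for block, replicas in placement.items()
        if any(node != failed_node for node in replicas)
    ]

nodes = ["node1", "node2", "node3", "node4"]
placement = place_blocks(6, nodes, replication=3)

print(placement[0])                               # ['node1', 'node2', 'node3']
print(len(readable_blocks(placement, "node2")))   # 6
```

Because every block lives on three different nodes, losing any single node leaves all six blocks readable.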

Databases

Many databases have been developed to manage larger data sets at higher throughput than traditional relational databases. Some examples:

  • BigTable (Google)
  • HBase (Hadoop)
  • Cassandra (Apache)
  • MongoDB

Processing data

Hadoop is the dominant framework for scaling out and distributing the processing of big data. Two processing models are in common use:

  • MapReduce
  • Spark
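To make the MapReduce model concrete, here is a minimal single-process word count that mirrors its three stages: map, shuffle (group by key) and reduce. It is a sketch of the pattern only and does not use the Hadoop API; all function names are illustrative.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

counts = word_count(["big data big cluster", "data node"])
print(counts)  # {'big': 2, 'data': 2, 'cluster': 1, 'node': 1}
```

In a real cluster the map and reduce calls run on different nodes, and the shuffle step moves data between them over the network.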

MapReduce high-level frameworks

Nowadays, you rarely develop pipelines based on the low-level MapReduce API. Instead, you use other tools, frameworks or languages, such as

  • Crunch, a Java framework for building MapReduce or Spark pipelines
  • Pig, a high-level data-flow language and execution framework for parallel computation
  • Hive, which provides data summarization and ad hoc querying

Sending data

Kafka is a commonly used message bus to which you can connect producers and consumers. Each consumer is guaranteed to receive all messages produced, even if its network connection is interrupted, because messages are retained and the consumer resumes from where it left off.
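The toy model below shows how a retained log plus a per-consumer offset lets a consumer catch up after an interruption. It is an in-memory sketch, not the Kafka client API; the class TopicLog and the consumer names are invented for the example, and real Kafka keeps messages only for a configured retention period.

```python
class TopicLog:
    """Append-only message log with a read offset per consumer."""

    def __init__(self):
        self.messages = []
        self.offsets = {}

    def produce(self, message):
        """Append a message; it stays in the log after being read."""
        self.messages.append(message)

    def poll(self, consumer):
        """Return the messages this consumer has not seen yet and advance its offset."""
        start = self.offsets.get(consumer, 0)
        batch = self.messages[start:]
        self.offsets[consumer] = len(self.messages)
        return batch

log = TopicLog()
log.produce("m1")
log.produce("m2")
first = log.poll("analytics")    # consumer is connected and reads m1, m2

log.produce("m3")                # produced while "analytics" is disconnected
log.produce("m4")
second = log.poll("analytics")   # after reconnecting it resumes at m3

late = log.poll("archiver")      # a new consumer starts from the beginning

print(first, second, late)
```

Since reading does not remove messages, each consumer progresses independently and none of them affects what the others receive.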

Two common use cases of Kafka:

  • Collecting log data from front-end servers and writing the logs to HDFS
  • Real-time processing of high-bandwidth streams, e.g. web site analytics

Serializing data

Common formats of serialized big data are

  • Avro (Apache Hadoop), which supports schema evolution
  • ProtoBuf (Google)
  • JSON (JavaScript Object Notation)
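Schema evolution, mentioned for Avro above, means that records written with an older schema can still be read after fields are added. The sketch below imitates that idea with plain JSON and reader-side defaults. It does not use the Avro library; READER_SCHEMA, REQUIRED and decode are hypothetical names, and the field names are made up.

```python
import json

REQUIRED = object()  # marks reader fields that have no default value

# Newer reader schema: "referrer" was added later and therefore has a default.
READER_SCHEMA = {"user": REQUIRED, "page": REQUIRED, "referrer": "unknown"}

def decode(payload, reader_schema=READER_SCHEMA):
    """Decode a JSON record, filling missing fields from reader defaults
    and ignoring fields the reader does not know about."""
    written = json.loads(payload)
    record = {}
    for name, default in reader_schema.items():
        if name in written:
            record[name] = written[name]
        elif default is REQUIRED:
            raise ValueError(f"missing required field: {name}")
        else:
            record[name] = default
    return record

# A record serialized before "referrer" existed.
old_record = json.dumps({"user": "anna", "page": "/home"})
print(decode(old_record))  # {'user': 'anna', 'page': '/home', 'referrer': 'unknown'}
```

Giving every newly added field a default is what keeps old data readable; a new field without a default would make the old record fail to decode.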
