
101 Hadoop and HDFS

Module 1, Big Data course (81932), University of Bologna.

101-1 Cluster setup

Goal: set up connections to the classroom's cluster via PuTTY and WinSCP

  • Connect to isi-vclustN.csr.unibo.it via PuTTY (an OpenSSH alternative is sketched after this list)
    • N is the number of the node you have been assigned to
    • Get your credentials from https://tinyurl.com/bigdata20users
    • If connecting from outside the UniBo network, first connect to 137.204.72.5 via PuTTY using your institutional credentials; then open another window to connect to isi-vclustN.csr.unibo.it
  • Change your password: passwd <username>
  • Create a directory called bigdata in your home: mkdir bigdata
  • Connect to isi-vclustN.csr.unibo.it via WinSCP
  • Check that the directory exists
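
If you are not on Windows, or simply prefer a plain terminal, the same steps work with OpenSSH instead of PuTTY/WinSCP. A minimal sketch, assuming OpenSSH 7.3+ on your machine; replace N, <username>, and <unibo-user> with your own values:

# Direct connection from inside the UniBo network
ssh <username>@isi-vclustN.csr.unibo.it
# From outside the network, hop through the gateway (-J = ProxyJump)
ssh -J <unibo-user>@137.204.72.5 <username>@isi-vclustN.csr.unibo.it
# Copy a local file into your bigdata folder (WinSCP equivalent)
scp <localfile> <username>@isi-vclustN.csr.unibo.it:bigdata/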

101-2 HDFS disk usage

Goal: understand the basic HDFS commands that report on disk usage

# Free and used space on the whole filesystem (similar to df)
hdfs dfs -df -h
# Space consumed by each directory under the root (similar to du)
hdfs dfs -du -h /
# Cluster-wide report, including per-DataNode capacity and usage
hdfs dfsadmin -report
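
To get a single summarized figure for one directory instead of a per-child breakdown, -du also accepts the -s flag. A small example, assuming your HDFS home is /user/<username>:

hdfs dfs -du -s -h /user/<username>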

101-3 HDFS storing files

Goal: create/remove files and directories; navigate directories; change the replication factor of files.

From the shell

# Explore HDFS directories with -ls
hdfs dfs -ls /
# Create a bigdata folder in your HDFS home
hdfs dfs -mkdir bigdata
# Create a dummy file in your local working directory
echo 'This is a dummy file' > dummy.txt
# Put the dummy file into your bigdata folder on HDFS
hdfs dfs -put dummy.txt bigdata
# Change the replication factor of the dummy file to 5
hdfs dfs -setrep -w 5 bigdata/dummy.txt
# Verify that the number of replicas has actually increased
# (the second column of -ls is the replication factor)
hdfs dfs -ls bigdata
# Delete the bigdata folder and the dummy.txt file on HDFS
hdfs dfs -rm -r -skipTrash bigdata
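
To see which DataNodes actually hold each replica before you delete the file, the fsck tool can be pointed at a single path. A sketch, assuming your HDFS home is /user/<username> (fsck wants an absolute path):

hdfs fsck /user/<username>/bigdata/dummy.txt -files -blocks -locations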

From HDFS's web UI

HDFS provides a basic web interface with read-only access to the filesystem.

Go to Cloudera Manager (Username: student - Password: student) > HDFS service (left panel) > NameNode Web UI > Utilities > Browse the file system. Navigate to your folder and click on your file to check the blocks' locations and download the file.
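
The same NameNode web server also exposes a REST API (WebHDFS), which can be handy for scripted checks. A minimal sketch, assuming WebHDFS is enabled and the NameNode listens on the default Hadoop 2.x HTTP port 50070 (<namenode-host> is a placeholder for the actual NameNode hostname):

# List your bigdata folder over HTTP
curl -i "http://<namenode-host>:50070/webhdfs/v1/user/<username>/bigdata?op=LISTSTATUS"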

From Apache Hue

Apache Hue offers a more complete view of the filesystem: you can create, move, rename, and delete folders and files, change permissions, download files, and drag and drop to easily upload new files and folders.

Go to Apache Hue and click on the three-lines menu (top-left) > Files.

101-4 Virtual machine setup

The QuickStart VM for CDH 5.13 is the virtual machine from Cloudera that provides a safe environment for testing and self-learning. It is already available on the lab computers, but you can also download it from here and install it on your own machine.

Minimum requirements

Minimum requirements depend on the desired configuration:

  • 4 GB of RAM for running CDH alone
  • 8 GB of RAM for running Cloudera Express (i.e., with Cloudera Manager)
  • 12 GB of RAM for running Cloudera Enterprise (i.e., the commercial version)

We will use 6 GB of RAM to run Cloudera Express with a reduced number of services.

Running CDH and Cloudera Manager

  • Open VMware and make sure that the settings of the virtual machine comply with the minimum requirements.
  • Launch the virtual machine.
  • Go to System > Preferences > Keyboard > Layouts to set up the Italian keyboard
  • Open a new Terminal and launch Cloudera Manager with the command sudo /home/cloudera/cloudera-manager --force --express
  • Open the browser and select the Cloudera Manager bookmark
  • Stop every unnecessary service (i.e., HBase, Impala, Key-Value Store, Oozie, Solr, Sqoop 1 Client, Sqoop 2). Delete them if you don't want them starting up again when the cluster is restarted.
  • Remember: if you close the VM by suspending it, some services may be down when the VM is reopened and will need to be restarted; if you close the VM by shutting it down, Cloudera Manager itself will need to be relaunched when the VM is reopened.
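
To double-check from a terminal which Hadoop daemons are actually up after a restart, the JDK's jps tool lists the running Java processes. A quick sketch (run as root to see processes owned by all service users):

# Expect entries such as NameNode, DataNode, ResourceManager, NodeManager
sudo jps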

Loading datasets

Put the content of the dataset folder in the virtual machine either by setting up Git in the virtual machine, or by copy/pasting the folder from your physical machine.

Then, create a "dataset" folder in your home folder on HDFS and put the files there. Either use the following commands from a Terminal window or use the web UI of Hue.

hdfs dfs -mkdir dataset
hdfs dfs -put <localpath1> ... <localpathN> dataset
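
To confirm that the upload worked, you can reuse the commands from 101-2 and 101-3. A quick check:

# List the uploaded files, then print their total size
hdfs dfs -ls dataset
hdfs dfs -du -s -h dataset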

101-5 Differences between Cluster and Virtual machine

  • Java: 1.7 on both
  • CDH: 5.13 on both
  • Hadoop: 2.6 on both
  • Spark: 1.6 on both
  • Spark2: 2.1 on the cluster only
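
These versions can be verified directly on either environment with the standard version flags. A sketch (spark2-submit is an assumption based on the usual CDH naming for the Spark2 service):

java -version
hadoop version
spark-submit --version
# Cluster only; spark2-submit is the assumed CDH command name for Spark2
spark2-submit --version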
