
alexholmes / hadoop-book

Source code to accompany the book "Hadoop in Practice", published by Manning.

License: Apache License 2.0

Shell 3.36% R 0.48% Java 94.25% Python 0.89% Awk 0.31% Gnuplot 0.59% Protocol Buffer 0.08% Thrift 0.05%

hadoop-book's Introduction

Source code for the book "Hadoop in Practice", Manning Publications

Overview

This repo contains the code, scripts and data files that are referenced from the book Hadoop in Practice, published by Manning.

Issues

If you hit any compilation or execution problems, please create an issue and I'll look into it as soon as I can.

Hadoop Version

All the code has been exercised against CDH3u2, which for the purposes of the code is equivalent to Hadoop 0.20.x. There are a couple of places where I use features from Pig 0.9.1; these won't work with CDH3u1, which ships Pig 0.8.1.

I've recently run some basic MapReduce jobs against CDH4, and I also updated the examples so that they run against Hadoop 2. Please let me know on the Manning forum or in a GitHub issue if you encounter any problems.

Building and running

Download from GitHub

git clone git://github.com/alexholmes/hadoop-book.git

Build

cd hadoop-book
mvn package

Runtime Dependencies

Many of the examples use Snappy and LZOP compression, so you may hit runtime errors if they aren't installed and configured on your cluster.

Snappy can be installed on CDH by following the instructions at https://ccp.cloudera.com/display/CDHDOC/Snappy+Installation.

To install LZOP follow the instructions at https://github.com/kevinweil/hadoop-lzo.
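Once installed, the codecs have to be registered in your cluster configuration before the examples can use them. A minimal sanity-check sketch; the config path below is a typical CDH package location and is an assumption, so adjust it for your cluster:

```shell
# Check whether compression codecs are registered in core-site.xml.
# /etc/hadoop/conf is a typical CDH location (assumption -- adjust as needed).
CONF=/etc/hadoop/conf/core-site.xml
if grep -q io.compression.codecs "$CONF" 2>/dev/null; then
  echo "codecs configured in $CONF"
else
  echo "io.compression.codecs not found; check your cluster configuration"
fi
```

If the property is missing, follow the installation guides above to add the Snappy and LZO codec classes to `io.compression.codecs`.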

Run an example

# copy the input files into HDFS
hadoop fs -mkdir /tmp
hadoop fs -put test-data/ch1/* /tmp/

# replace the path below with the location of your Hadoop installation
# this isn't required if you are running CDH3
export HADOOP_HOME=/usr/local/hadoop

# run the map-reduce job
bin/run.sh com.manning.hip.ch1.InvertedIndexMapReduce /tmp/file1.txt /tmp/file2.txt output

hadoop-book's People

Contributors

alexholmes, cnauroth, wisonhuang


hadoop-book's Issues

How to clone the git repo on Ubuntu

Hi,

I have installed Ubuntu as my VMware guest OS. The following command does not seem to run on this OS. Could you let me know the right command for Ubuntu? Thanks!

$git clone git://github.com/alexholmes/hadoop-book.git

LY

the error message of running example

Hi Alex,

I followed the command line

copy the input files into HDFS

hadoop fs -mkdir /tmp
hadoop fs -put test-data/ch1/* /tmp/

replace the path below with the location of your Hadoop installation

this isn't required if you are running CDH3

export HADOOP_HOME=/usr/local/hadoop

But running the map-reduce job
bin/run.sh com.manning.hip.ch1.InvertedIndexMapReduce /tmp/file1.txt /tmp/file2.txt output

gives the following error message

Hadoop_HOME environment not set, but found /usr/lib/hadoop in path so using that
/usr/lib/hadoop/conf must be the hadoop config directory

I am using the cloudera-quickstart-vm
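Judging by the error message, run.sh falls back to guessing the installation directory when HADOOP_HOME is unset and then expects a conf directory underneath it. On packaged CDH installs the configuration typically lives elsewhere, so a workaround sketch is to set both variables explicitly. The paths below are typical CDH package locations and are assumptions; confirm them on your VM before relying on them:

```shell
# Point the script at an explicit install and config directory.
# /usr/lib/hadoop and /etc/hadoop/conf are typical CDH package
# locations (assumptions -- verify with ls on your VM).
export HADOOP_HOME=/usr/lib/hadoop
export HADOOP_CONF_DIR=/etc/hadoop/conf
```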

mvn package; Timeouts, Missing Files, Retries, ...

Aug 18, 2014 4:23:38 PM org.apache.maven.wagon.providers.http.httpclient.impl.execchain.RetryExec execute
INFO: I/O exception (java.net.SocketException) caught when processing request: Operation timed out
Aug 18, 2014 4:23:38 PM org.apache.maven.wagon.providers.http.httpclient.impl.execchain.RetryExec execute
INFO: Retrying request
Downloading: https://oss.sonatype.org/content/repositories/releases/com/sun/jersey/jersey-json/1.4/jersey-json-1.4.jar
Aug 18, 2014 4:23:54 PM org.apache.maven.wagon.providers.http.httpclient.impl.execchain.RetryExec execute
INFO: I/O exception (java.net.SocketException) caught when processing request: Operation timed out
Aug 18, 2014 4:23:54 PM org.apache.maven.wagon.providers.http.httpclient.impl.execchain.RetryExec execute
INFO: Retrying request
Downloading: https://oss.sonatype.org/content/repositories/releases/com/sun/jersey/jersey-server/1.4/jersey-server-1.4.jar
Aug 18, 2014 4:23:59 PM org.apache.maven.wagon.providers.http.httpclient.impl.execchain.RetryExec execute
INFO: I/O exception (java.net.SocketException) caught when processing request: Operation timed out
Aug 18, 2014 4:23:59 PM org.apache.maven.wagon.providers.http.httpclient.impl.execchain.RetryExec execute
INFO: Retrying request
Downloading: https://oss.sonatype.org/content/repositories/releases/javax/ws/rs/jsr311-api/1.1.1/jsr311-api-1.1.1.jar
Aug 18, 2014 4:24:09 PM org.apache.maven.wagon.providers.http.httpclient.impl.execchain.RetryExec execute
INFO: I/O exception (java.net.SocketException) caught when processing request: Operation timed out
Aug 18, 2014 4:24:09 PM org.apache.maven.wagon.providers.http.httpclient.impl.execchain.RetryExec execute
INFO: Retrying request
Downloading: https://oss.sonatype.org/content/repositories/releases/org/apache/pig/pig/0.9.0/pig-0.9.0.jar
Downloading: https://oss.sonatype.org/content/repositories/releases/org/apache/hadoop/avro/1.3.2/avro-1.3.2.jar
Downloading: https://oss.sonatype.org/content/repositories/releases/javax/transaction/jta/1.1/jta-1.1.jar
Aug 18, 2014 4:24:23 PM org.apache.maven.wagon.providers.http.httpclient.impl.execchain.RetryExec execute
INFO: I/O exception (java.net.SocketException) caught when processing request: Operation timed out
Aug 18, 2014 4:24:23 PM org.apache.maven.wagon.providers.http.httpclient.impl.execchain.RetryExec execute
INFO: Retrying request
Downloading: https://oss.sonatype.org/content/repositories/releases/org/apache/derby/derby/10.6.1.0/derby-10.6.1.0.jar
Downloading: https://oss.sonatype.org/content/repositories/releases/org/apache/mahout/mahout-core/0.9/mahout-core-0.9.jar
Aug 18, 2014 4:24:27 PM org.apache.maven.wagon.providers.http.httpclient.impl.execchain.RetryExec execute
INFO: I/O exception (java.net.SocketException) caught when processing request: Operation timed out
Aug 18, 2014 4:24:27 PM org.apache.maven.wagon.providers.http.httpclient.impl.execchain.RetryExec execute
INFO: Retrying request
Downloading: https://oss.sonatype.org/content/repositories/releases/org/apache/mahout/mahout-math/0.9/mahout-math-0.9.jar
Aug 18, 2014 4:24:40 PM org.apache.maven.wagon.providers.http.httpclient.impl.execchain.RetryExec execute
INFO: I/O exception (java.net.SocketException) caught when processing request: Operation timed out
Aug 18, 2014 4:24:40 PM org.apache.maven.wagon.providers.http.httpclient.impl.execchain.RetryExec execute
INFO: Retrying request
Downloading: https://oss.sonatype.org/content/repositories/releases/com/thoughtworks/xstream/xstream/1.4.4/xstream-1.4.4.jar
Downloading: https://oss.sonatype.org/content/repositories/releases/xmlpull/xmlpull/1.1.3.1/xmlpull-1.1.3.1.jar
Aug 18, 2014 4:24:52 PM org.apache.maven.wagon.providers.http.httpclient.impl.execchain.RetryExec execute
INFO: I/O exception (java.net.SocketException) caught when processing request: Operation timed out
Aug 18, 2014 4:24:52 PM org.apache.maven.wagon.providers.http.httpclient.impl.execchain.RetryExec execute
INFO: Retrying request
Downloading: https://oss.sonatype.org/content/repositories/releases/org/apache/lucene/lucene-core/4.6.1/lucene-core-4.6.1.jar
Downloading: https://oss.sonatype.org/content/repositories/releases/org/apache/lucene/lucene-analyzers-common/4.6.1/lucene-analyzers-common-4.6.1.jar
Aug 18, 2014 4:24:55 PM org.apache.maven.wagon.providers.http.httpclient.impl.execchain.RetryExec execute
INFO: I/O exception (java.net.SocketException) caught when processing request: Operation timed out
Aug 18, 2014 4:24:55 PM org.apache.maven.wagon.providers.http.httpclient.impl.execchain.RetryExec execute
INFO: Retrying request
Downloading: https://oss.sonatype.org/content/repositories/releases/org/apache/commons/commons-math3/3.2/commons-math3-3.2.jar

^CDavid-Laxers-MacBook-Pro:hadoop-book(Manning) davidlaxer$ wget https://oss.sonatype.org/content/repositories/releases/com/thoughtworks/xstream/xstream/1.4.4/xstream-1.4.4.jar
--2014-08-18 16:25:19-- https://oss.sonatype.org/content/repositories/releases/com/thoughtworks/xstream/xstream/1.4.4/xstream-1.4.4.jar
Resolving oss.sonatype.org (oss.sonatype.org)... 207.223.241.93
Connecting to oss.sonatype.org (oss.sonatype.org)|207.223.241.93|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2014-08-18 16:25:20 ERROR 404: Not Found.

David-Laxers-MacBook-Pro:hadoop-book(Manning) davidlaxer$

mvn package BUILD FAILURE

[ERROR] Failed to execute goal on project hadoop-book: Could not resolve dependencies for project com.manning.hip:hadoop-book:jar:1.0.0-SNAPSHOT: The following artifacts could not be resolved: org.apache.thrift:hadoopbook-libthrift:jar:0.5.0, com.hadoop.compression.lzo:hadoop-lzo:jar:0.4.14, com.maxmind.geoip:maxmind-geoip:jar:1.2.5: Could not transfer artifact org.apache.thrift:hadoopbook-libthrift:jar:0.5.0 from/to hadoop-non-releases (http://alexholmes.github.com/hadoop-book-mvn-repo/repository/releases/): GET request of: org/apache/thrift/hadoopbook-libthrift/0.5.0/hadoopbook-libthrift-0.5.0.jar from hadoop-non-releases failed: Connection reset -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.

NoSuchMethodError

When I run the following Script:-
bin/run.sh com.manning.hip.ch1.InvertedIndexMapReduce
/tmp/file1.txt /tmp/file2.txt output

I get the following Error Message:-
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.security.authorize.AccessControlList.getACLString()Ljava/lang/String;

Please Help.

package issue

Hi alex,

I am still getting the mvn package failure after clearing the Maven cache.

[ERROR] Failed to execute goal on project hadoop-book: Could not resolve dependencies for project com.manning.hip:hadoop-book:jar:1.0.0-SNAPSHOT: The following artifacts could not be resolved: com.cloudera.crunch:crunch:jar:0.1.0, org.apache.hbase:hbase:jar:0.90.4-cdh3u2, org.apache.hadoop.hive:hive-serde:jar:0.7.1-cdh3u2, org.apache.hadoop.hive:hive-exec:jar:0.7.1-cdh3u2, org.apache.hadoop.hive:hive-metastore:jar:0.7.1-cdh3u2, org.apache.mahout:mahout-core:jar:0.5-cdh3u2: Failure to find com.cloudera.crunch:crunch:jar:0.1.0 in http://alexholmes.github.com/hadoop-book-mvn-repo/repository/releases/ was cached in the local repository, resolution will not be reattempted until the update interval of hadoop-non-releases has elapsed or updates are forced -> [Help 1].
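The "was cached in the local repository" part of this error means Maven recorded the failed lookup and will not retry it until the repository's update interval elapses. A sketch of forcing a retry; the cache path follows Maven's default ~/.m2 layout, which is an assumption if you have customized your local repository location:

```shell
# Remove the cached failure marker for the unresolved artifact so the
# next build attempts the download again (default ~/.m2 layout assumed).
rm -rf ~/.m2/repository/com/cloudera/crunch/crunch/0.1.0
# Then force Maven to re-check remote repositories:
#   mvn -U package
```

The `-U` flag tells Maven to force a check for updated releases and snapshots on remote repositories instead of trusting cached resolution results.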

Please help me out here

mvn package error

I am trying to execute "mvn package" but get the following error.

I tried this earlier on the HortonWorks VM and now on the Cloudera VM. Please help, as I am not able to move forward with the book.


Downloading: https://repository.cloudera.com/content/repositories/snapshots/org/apache/ftpserver/ftpserver-deprecated/1.0.0-M2/ftpserver-deprecated-1.0.0-M2.pom
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 7.421s
[INFO] Finished at: Tue Oct 15 08:52:25 PDT 2013
[INFO] Final Memory: 8M/57M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project hadoop-book: Could not resolve dependencies for project com.manning.hip:hadoop-book:jar:1.0.0-SNAPSHOT: Failed to collect dependencies for [com.google.code.findbugs:jsr305:jar:1.3.9 (compile), org.codehaus.jackson:jackson-mapper-asl:jar:1.8.6 (compile), org.apache.avro:avro:jar:1.6.1 (compile), org.apache.avro:avro-mapred:jar:1.6.1 (compile), com.cloudera.crunch:crunch:jar:0.1.0 (compile), org.apache.thrift:hadoopbook-libthrift:jar:0.5.0 (compile), junit:junit:jar:4.9 (test), org.apache.hadoop:hadoop-core:jar:0.20.2-cdh3u2 (compile), org.apache.hbase:hbase:jar:0.90.4-cdh3u2 (compile), org.apache.pig:pig:jar:0.9.0 (compile), org.apache.hadoop.hive:hive-serde:jar:0.7.1-cdh3u2 (compile), org.apache.hadoop.hive:hive-exec:jar:0.7.1-cdh3u2 (compile), org.apache.hadoop.hive:hive-metastore:jar:0.7.1-cdh3u2 (compile), org.apache.mahout:mahout-core:jar:0.5-cdh3u2 (compile), com.googlecode.json-simple:json-simple:jar:1.1 (compile), com.google.guava:guava:jar:r09 (compile), commons-io:commons-io:jar:2.0.1 (compile), com.google.protobuf:protobuf-java:jar:2.3.0 (compile), com.hadoop.compression.lzo:hadoop-lzo:jar:0.4.14 (compile), edu.uci.ics.crawler4j:crawler4j:jar:2.6.1 (compile), com.twitter.elephantbird:elephant-bird:jar:2.0.5 (compile), com.maxmind.geoip:maxmind-geoip:jar:1.2.5 (compile), org.apache.hadoop.contrib.utils:join:jar:0.20.2 (compile), cascading:cascading-core:jar:1.2.5 (compile), org.apache.mrunit:mrunit:jar:0.5.0-incubating (test), org.apache.hadoop:hadoop-test:jar:0.20.2-cdh3u2 (test)]: Failed to read artifact descriptor for org.apache.ftpserver:ftpserver-core:jar:1.0.0: Could not transfer artifact org.apache.ftpserver:ftpserver-core:pom:1.0.0 from/to central (http://repo.maven.apache.org/maven2): Connection to http://repo.maven.apache.org refused: Connection refused -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.

mvn package error

I have just started reading the book and trying to create a package for the code used in the book.
$ cd hadoop-book
$ mvn package

I am using the ThoughtWorks sandbox VM.

Getting the following error:-

[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 3:32.748s
[INFO] Finished at: Mon Oct 14 00:25:40 PDT 2013
[INFO] Final Memory: 9M/38M
[INFO] ------------------------------------------------------------------------

[ERROR] Failed to execute goal on project hadoop-book: Could not resolve dependencies for project com.manning.hip:hadoop-book:jar:1.0.0-SNAPSHOT: Failed to collect dependencies at com.cloudera.crunch:crunch:jar:0.1.0 -> org.slf4j:slf4j-log4j12:jar:1.6.1 -> log4j:log4j:jar:1.2.16: Failed to read artifact descriptor for log4j:log4j:jar:1.2.16: Could not transfer artifact log4j:log4j:pom:1.2.16 from/to hadoop-non-releases (http://alexholmes.github.com/hadoop-book-mvn-repo/repository/releases/): Failed to transfer file: http://alexholmes.github.com/hadoop-book-mvn-repo/repository/releases/log4j/log4j/1.2.16/log4j-1.2.16.pom. Return code is: 500 , ReasonPhrase:Server Error. -> [Help 1]

mvn package errors

I did as instructed:

git clone https://github.com/alexholmes/hadoop-book.git
Initialized empty Git repository in /home/cloudera/hadoop-book/hadoop-book/.git/
remote: Counting objects: 936, done.
remote: Compressing objects: 100% (447/447), done.
remote: Total 936 (delta 390), reused 896 (delta 350)
Receiving objects: 100% (936/936), 1.17 MiB | 305 KiB/s, done.
Resolving deltas: 100% (390/390), done.

but when running mvn package I get many failures from links that no longer exist:

Downloading: http://alexholmes.github.com/hadoop-book-mvn-repo/repository/releases/org/apache/avro/avro/1.6.1/avro-1.6.1.pom
Feb 24, 2013 8:11:16 AM org.apache.maven.wagon.providers.http.httpclient.impl.client.DefaultRequestDirector tryExecute
INFO: I/O exception (org.apache.maven.wagon.providers.http.httpclient.NoHttpResponseException) caught when processing request: The target server failed to respond
Feb 24, 2013 8:11:16 AM org.apache.maven.wagon.providers.http.httpclient.impl.client.DefaultRequestDirector tryExecute
INFO: Retrying request
