Coder Social home page Coder Social logo

netflix-prize-hadoop's Introduction

This repository contains code to demonstrate Hadoop by attempting to solve the Netflix Prize.

Code by the Hacked Existence Team. Project Page: http://hackedexistence.com/project-netflix.html

This code is meant to be run against the dataset provided by Netflix to teams who registered for the Netflix Prize Competition (http://www.netflixprize.com). It is 2.1 Gigabytes in size broken down into individual files for each movie with a total of 17,770 movies in the data set.

Files:

netflixReorg.awk This file is 3 lines of awk code that were used to reorganize the netflix prize dataset into something that would be much more efficient for use with hadoop. It was used to take the first line of each file which contained the MovieID and append it to the beginning of each remaining line in the file so that Hadoop could be fed each line individually to compute against.

Netflix1

The Netflix1 program was written to produce statistical information about each movie in the dataset. It took the entire dataset as input and produced the first date rated, last date rated, total rating count and average rating for each movie in the dataset.

javaNetflix1/MyMapper.java

This file is the mapper written in Java for the Netflix1 project. The output hashmap that gets sent to the reducer uses the MovieID as the key and the Rating,Date as the value (Rating and Date are combined as the value and separated by a comma which can then be parsed in the reducer).

pythonNetflix1/pyMapper.py

This file is the mapper for Netflix1 written in Python which is executed in the hadoop cloud by utilizing the streaming interface. The output hashmap that gets sent to the reducer uses the MovieID as the key and Rating,Date as the value (Rating and Date are combined as the value and separated with a tab character '\t' which can be parsed in the reducer).

awkNetflix1/awkMapper.awk

This file is the mapper for Netflix1 written in Awk which is executed in the hadoop cloud by utilizing the streaming interface. The output hashmap that gets sent to the reducer uses the MovieID as the key and Rating,Date as the value (Rating and Date are combined as the value and separated with a comma whcih can be parsed in the reducer).

javaNetflix1/MyReducer.java

This file is the reducer written in Java for the Netflix1 project. It takes the output from MyMapper.java as input and computes the "First Date Rated, Last Date Rated, Production Date, Number of Ratings, Average Rating, Movie Title" for each movie in the data set.

pythonNetflix1/pyReducer.py

This file is the reducer writte in Python for the Netflix1 project. It takes the output from pyMapper.py as input and computes the "First Date Rated, Last Date Rated, Production Date, Number of Ratings, Average Rating, Movie Title" for each movie in the data set.

awkNetflix1/pyReducer.py

Because of the logic that had to be executed in the reducer, we were not able to write a reducer in Awk, and since the Awk mapper used the streaming interface, we chose to use the python reducer with the Awk mapper so that we could compare run times between the Python mapper and the Awk mapper using the same reducer.

javaNetflix1/netflix1Driver.java

This is the Java driver file for the Netflix1 project, it registers the input data set in Distributed Cache, defines the mapper and reducer files/methods, and submits the job to the hadoop cloud for processing.

pythonNetflix1/pyNetflix1.sh

This is a shell script that calls the hadoop streaming command and sets all the job configuration as variables that it passes to the streaming command.

awkNetflix1/awkNetflix1.sh

This is a shell script that calls the hadoop streaming command and sets all the job configuration as variables that it passes to the streaming command.

Netflix2

The Netflix2 program was written to produce statistical information about each user in the dataset. It took the entire dataset as input and produced the total number of ratings, the overall average rating, and the rating delay (the number of days from movie production to date rated) for each user in the data set.

javaNetflix2/MyMapper.java

This file is the mapper written in Java for the Netflix2 project. It takes the whole data set as input and outputs a hashmap that gets sent to the reducer using UserID as the key and "MovieID,Rating,RatingDelay" as the value. It should also be noted that this mapper disregards any ratings that have NULL as the production date.

pythonNetflix2/pyMapper.py

This file is the mapper written in Python for the Netflix2 project. It takes the whole data set as input and outputs a hashmap that gets sent to the reducer using UserID as the key and "MovieID,Rating,RatingDelay" as the value. It should also be noted that this mapper disregards any ratings that have NULL as the production date, and reports the MovieID of disregarded movie to the job tracker.

javaNetflix2/MyReducer.java

This file is the reducer written in Java for the Netflix2 project. It takes the output hashmap from javaNetflix2/MyMapper.java as input. Its final output is UserID RatingCount,RatingAverage,RatingDelay.

pythonNetflix2/pyReducer.py

This file is the reducer written in Python for the Netflix2 project. It takes the output hashmap from pythonNetflix2/pyMapper.py as input. Its final output is UserID RatingCount,RatingAverage,RatingDelay.

javaNetflix2/netflix2Driver.java

This is the Java driver file for the Netflix2 project, it registers the input data set in Distributed Cache, defines the mapper and reducer files/methods, and submits the job to the hadoop cloud for processing.

pythonNetflix2/pyNetflix2.sh

This is a shell script that calls the hadoop streaming command and sets all the job configuration as variables that it passes to the streaming command.

netflix-prize-hadoop's People

Contributors

r3dfish avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.