
sudilhasitha / python-mapreduce---web-log-and-sales-analysis


This project forked from siva1987c/python-mapreduce---web-log-and-sales-analysis


MapReduce scripts written in Python for Hadoop Streaming


python-mapreduce---web-log-and-sales-analysis's Introduction

mapreduce

Repository of MapReduce scripts that I wrote in Python for the Udacity course Intro to Hadoop and MapReduce, a course about Hadoop and how to write MapReduce programs. Each mapper script has a complementary reducer script. Every script reads its input from standard input and writes its results to standard output, so a mapper and its reducer can be chained with pipes. The scripts can be run locally, or they can be used to create MapReduce jobs in Hadoop with Hadoop Streaming. As detailed below, the scripts process data from one of three sources: data about users of and posts on Udacity's forums [link], weblog data [link], and sales data from different stores [link].

To run a mapper and a reducer on 100 lines of data and view the output on a Linux machine, try a command like the one below.

head -n 100 ./example_data.txt | ./example_mapper.py | sort | ./example_reducer.py | less
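The command above can also be mimicked inside a single Python process, which may help when reasoning about what each stage contributes. The sketch below is only an illustration: the function names (mapper, reducer, run_pipeline) and the "key<TAB>1" record format are assumptions, not code taken from this repository.

```python
from itertools import groupby


def mapper(lines):
    """Emit one tab-separated 'key<TAB>1' record per non-empty input line."""
    for line in lines:
        key = line.strip()
        if key:
            yield f"{key}\t1"


def reducer(sorted_lines):
    """Sum the values of consecutive records that share a key.

    This relies on the input being sorted by key, which `sort` provides
    locally and the shuffle phase provides under Hadoop Streaming.
    """
    records = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(records, key=lambda rec: rec[0]):
        total = sum(int(value) for _, value in group)
        yield f"{key}\t{total}"


def run_pipeline(lines):
    """In-process stand-in for: cat data | mapper | sort | reducer."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    for record in run_pipeline(["b", "a", "b", "", "a", "b"]):
        print(record)
```

The reducer never holds more than one key's records at a time, which is the main reason the sort step sits between the two scripts.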

Forum Data

  • average_length_mapper.py and average_length_reducer.py measure the length of every question and compute the average length of the answers to each question. Results are output in three columns: (1) question ID, (2) question length, and (3) mean answer length. Questions without answers receive an average answer length of zero.

  • popular_tags_mapper.py and popular_tags_reducer.py count the number of times each tag is applied to a post by its author and keep track of the N most popular tags, where N is the number of tags to output. Results are output in two columns, sorted by popularity: (1) tag and (2) number of posts with the tag.

  • inverted_index_mapper.py and inverted_index_reducer.py count the number of times a term appears in posts (e.g., 'fantastic') and keep track of the posts in which the term appears. Two results are output: (1) term count and (2) list of posts in which the term appears.

  • student_times_mapper.py and student_times_reducer.py count the number of times that each user posts at each hour of the day and find the hour at which each user posted most often. Results are output in two columns: (1) user ID and (2) hour with the most posts. Hours that tie are output as separate records.

  • study_groups_mapper.py and study_groups_reducer.py create a list of the users who interacted via a question; that is, for each question, the list includes the ID of the user who asked the question and the IDs of the users who answered and the users who commented on the question and the answers. Results are output in two columns: (1) question ID and (2) list of user IDs.

  • combine_mapper.py and combine_reducer.py output information from forum_node about the posts and combine it with the reputation statistics of their authors from forum_users.
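As an illustration of the reducer side of the forum scripts, here is a minimal sketch of the busiest-hour logic described for student_times, including the rule that tied hours become separate records. It assumes the mapper has already emitted sorted "user_id<TAB>hour" lines; that record format and the function name busiest_hours are hypothetical and may differ from the actual scripts.

```python
from collections import Counter
from itertools import groupby


def busiest_hours(sorted_lines):
    """Yield 'user_id<TAB>hour' for each user's most frequent posting hour.

    Input lines are assumed to be 'user_id<TAB>hour', sorted by user ID.
    Hours that tie for the maximum are emitted as separate records.
    """
    records = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for user_id, group in groupby(records, key=lambda rec: rec[0]):
        # Count how many posts this user made in each hour.
        counts = Counter(int(hour) for _, hour in group)
        top = max(counts.values())
        # Emit every hour that reaches the maximum, in ascending order.
        for hour in sorted(h for h, n in counts.items() if n == top):
            yield f"{user_id}\t{hour}"


if __name__ == "__main__":
    sample = ["u1\t9", "u1\t9", "u1\t14", "u2\t3", "u2\t22"]
    for record in busiest_hours(sample):
        print(record)
```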

Web Log Data

  • count_mapper.py and count_reducer.py count the number of hits to each webpage. Results are output in two columns: (1) webpage and (2) number of hits.

  • maximum_mapper.py and maximum_reducer.py operate similarly, but they only output these results for the webpage with the most hits.
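The two weblog jobs can be sketched together as follows. This is a sketch under stated assumptions: it assumes the log is in Common Log Format, with the request quoted as "METHOD path PROTOCOL", and the names REQUEST_RE, map_hits, count_hits, and most_hit are invented for illustration rather than taken from the repository.

```python
import re
from collections import Counter

# Assumed Common Log Format request field, e.g. "GET /index.html HTTP/1.1".
REQUEST_RE = re.compile(r'"\S+ (\S+)[^"]*"')


def map_hits(log_lines):
    """Yield (page, 1) for every log line with a parsable request."""
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if match:
            yield match.group(1), 1


def count_hits(pairs):
    """Sum the counts per page, as the count reducer would."""
    hits = Counter()
    for page, n in pairs:
        hits[page] += n
    return hits


def most_hit(hits):
    """Return the (page, hits) pair with the most hits.

    Raises ValueError if `hits` is empty; ties go to the first page seen.
    """
    return max(hits.items(), key=lambda item: item[1])


if __name__ == "__main__":
    sample = [
        '10.0.0.1 - - [15/Jul/2009:15:50:35 -0700] "GET /index.html HTTP/1.1" 200 512',
        '10.0.0.2 - - [15/Jul/2009:15:50:36 -0700] "GET /about.html HTTP/1.1" 200 128',
        '10.0.0.3 - - [15/Jul/2009:15:50:37 -0700] "GET /index.html HTTP/1.1" 200 512',
    ]
    hits = count_hits(map_hits(sample))
    print(dict(hits))
    print(most_hit(hits))
```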

Sales Data

  • maximum_item_mapper.py and maximum_item_reducer.py find the highest price, across all stores, at which each item was sold. Two results are output: (1) item name and (2) price.

  • mean_mapper.py and mean_reducer.py compute the average item price, across all stores, on a given day of the week (e.g., Sunday). The only result that is output is the mean item price.
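Both sales computations reduce to a single pass over the records, as the sketch below suggests. The record layout is an assumption (six tab-separated fields: date, time, store, item, cost, payment, with dates as YYYY-MM-DD), and the function names are hypothetical; the actual scripts may parse the data differently.

```python
from datetime import datetime


def parse_sale(line):
    """Split an assumed 'date, time, store, item, cost, payment' record."""
    date, _time, _store, item, cost, _payment = line.rstrip("\n").split("\t")
    return datetime.strptime(date, "%Y-%m-%d"), item, float(cost)


def max_price_per_item(lines):
    """Return {item: highest price seen across all stores}."""
    best = {}
    for line in lines:
        _, item, cost = parse_sale(line)
        if item not in best or cost > best[item]:
            best[item] = cost
    return best


def mean_price_on_weekday(lines, weekday):
    """Mean cost of sales on one weekday (Monday=0 ... Sunday=6).

    Returns None when no sale falls on that weekday.
    """
    total, count = 0.0, 0
    for line in lines:
        when, _, cost = parse_sale(line)
        if when.weekday() == weekday:
            total += cost
            count += 1
    return total / count if count else None


if __name__ == "__main__":
    sample = [
        "2012-01-01\t09:00\tSan Jose\tBooks\t10.0\tVisa",
        "2012-01-01\t09:05\tAustin\tToys\t30.0\tCash",
        "2012-01-02\t10:00\tAustin\tBooks\t25.5\tCash",
    ]
    print(max_price_per_item(sample))
    print(mean_price_on_weekday(sample, 6))  # 2012-01-01 was a Sunday
```

Keeping a running total and count, rather than a list of prices, means memory use stays constant no matter how many sales are read.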
