Data Engineering Challenge

Collaborative efforts in music are always a gamble. While much of top 40 music consists of content resulting from partnerships between two or more artists, there are risks involved with bringing together such artists who may have opposing styles and motives. However, these risks are often worth the reward and juxtapositions of contrasting styles have led to some very successful efforts like those from Eminem and Dido (Stan), P!nk and Nate Ruess (Just Give Me a Reason), or even Lady Gaga and Kermit the Frog (Gypsy).

These partnerships aren't always between men and women, but a lot of the more interesting ones are. In this challenge, you'll use a (fictitious) dataset to try to determine which pairings like this (i.e. between male and female artists) would be most "interesting" based on the sentiment within statements made by users online. The sentiment of each statement will fall into one of 3 categories: positive, negative, or neutral. Your job will then be to pair the artists up and draw conclusions about the sentiment around each pair.

Dataset

The input dataset will consist of the following things:

user.name - The screenname of the user that mentioned the artist
artist.name - The name of the artist mentioned
artist.gender - The gender of the artist (either 'male' or 'female')
sentiment - The attitude within the statement made (either 1, -1, or 0 for positive, negative, and neutral respectively)

For example:

user.name	artist.name	artist.gender	sentiment
user1	Miley Cyrus	female	1
user2	Miley Cyrus	female	-1
user1	Elton John	male	1
user2	Elton John	male	-1
user1	Sam Smith	male	1
user2	Sam Smith	male	1
user1	Meghan Trainor	female	-1
user2	Meghan Trainor	female	-1
user1	Garth Brooks	male	1
user2	Garth Brooks	male	1
user3	Garth Brooks	male	0
user4	Garth Brooks	male	-1
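As a rough illustration of the record layout, here is a minimal Python sketch that parses rows into typed records. The names `Mention` and `read_mentions` are hypothetical, and the comma delimiter is an assumption taken from the example Pig script further down, which loads the data with PigStorage(',').

```python
import csv
import io
from typing import NamedTuple


class Mention(NamedTuple):
    """One row of the input dataset (hypothetical helper type)."""
    user_name: str
    artist_name: str
    artist_gender: str  # 'male' or 'female'
    sentiment: int      # 1, -1, or 0


def read_mentions(stream, delimiter=","):
    """Yield Mention records from delimited text, skipping blank and header rows."""
    for row in csv.reader(stream, delimiter=delimiter):
        if not row or row[0] == "user.name":
            continue
        user, artist, gender, sentiment = row
        yield Mention(user, artist, gender, int(sentiment))


# Two rows from the example above, written with commas instead of tabs.
sample = "user1,Miley Cyrus,female,1\nuser4,Garth Brooks,male,-1\n"
mentions = list(read_mentions(io.StringIO(sample)))
```

Passing `delimiter="\t"` would read the tab-separated layout shown in the example table instead.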

Questions

These questions all pertain to the data above and we only ask for answers to the first two, but if you're enjoying the problem then we would love to see answers to the others as well. Please submit an accompanying writeup to document your thought process when approaching the problems. When working in a collaborative environment, being able to explain yourself and justify your reasoning is often just as important as the work itself.

## Question 1: Sentiment Dissonance

Using Apache Pig (see the Apache Pig section below for more details on it), determine the single male/female artist pair with the largest difference in cumulative sentiment. We'll assume this difference would make the pairing more "interesting" since public opinion about the two artists is polarized.

An answer to this question should first determine the "net sentiment" for each artist. For example, Garth Brooks is mentioned in the example dataset above 4 times and the net sentiment for him over all mentions is 1 + 1 + 0 + -1 = 1. This value should be calculated for each artist and then all the male and female artists should be paired together and ordered by the absolute value of the difference in that value for each pair.

A result for the example dataset would be (ordered by difference):

female.artist	male.artist	sentiment.difference
Meghan Trainor	Sam Smith	abs(-2 - 2) = 4
Meghan Trainor	Garth Brooks	abs(-2 - 1) = 3
Miley Cyrus	Sam Smith	abs(0 - 2) = 2
Meghan Trainor	Elton John	abs(-2 - 0) = 2
Miley Cyrus	Garth Brooks	abs(0 - 1) = 1
Miley Cyrus	Elton John	abs(0 - 0) = 0

And we'd conclude that Meghan Trainor and Sam Smith make for the best pair.
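The question itself asks for a Pig solution, but the arithmetic behind the table can be cross-checked with a short Python sketch. This is only an illustration on the example rows, not a reference answer; the function names are hypothetical, and the order of tied pairs is not specified by the challenge.

```python
from collections import defaultdict
from itertools import product

# Example rows from the Dataset section: (user, artist, gender, sentiment)
ROWS = [
    ("user1", "Miley Cyrus", "female", 1),
    ("user2", "Miley Cyrus", "female", -1),
    ("user1", "Elton John", "male", 1),
    ("user2", "Elton John", "male", -1),
    ("user1", "Sam Smith", "male", 1),
    ("user2", "Sam Smith", "male", 1),
    ("user1", "Meghan Trainor", "female", -1),
    ("user2", "Meghan Trainor", "female", -1),
    ("user1", "Garth Brooks", "male", 1),
    ("user2", "Garth Brooks", "male", 1),
    ("user3", "Garth Brooks", "male", 0),
    ("user4", "Garth Brooks", "male", -1),
]


def net_sentiment(rows):
    """Return (net sentiment per artist, gender per artist)."""
    totals = defaultdict(int)
    genders = {}
    for _user, artist, gender, sentiment in rows:
        totals[artist] += sentiment
        genders[artist] = gender
    return totals, genders


def dissonance(rows):
    """Pair every female artist with every male artist, largest difference first."""
    totals, genders = net_sentiment(rows)
    females = [a for a, g in genders.items() if g == "female"]
    males = [a for a, g in genders.items() if g == "male"]
    pairs = [(f, m, abs(totals[f] - totals[m])) for f, m in product(females, males)]
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


if __name__ == "__main__":
    for female, male, diff in dissonance(ROWS):
        print(f"{female}\t{male}\t{diff}")
```

In Pig, the same shape would roughly correspond to a GROUP with SUM, a split by gender, and a CROSS, followed by an ORDER.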

Notes:

We don't care about pairings of same sex artists (e.g. Garth Brooks and Sam Smith)
The user.name field is irrelevant for this question

## Question 2: Making Decisions

Assume you run your own record label and your job is to determine which proposed male/female duet is worth producing music for. You'll encounter these opportunities once a week over the course of a year and each time you'll have to make a decision right away, and you can only choose one song to produce that year (and you'll never be offered the opportunity to work with the same duet twice). For example, you might run into a chance to produce a duet for Michael Bublé and Rihanna in week 1 and then another for Kenny Chesney and Iggy Azalea in week 2, but you can't wait until the end of week 2 to decide -- each opportunity expires at the end of the week in which it arose.

Finally, assume that the metric calculated in Question 1 for sentiment dissonance is a perfect estimator of success. Each time you run into the chance to produce a song for a duet, you can calculate the value from Question 1 for it and use that to make your decision.

Given this, write a program (in python, java, or bash) that will take strings on stdin in the form female.artist,male.artist (one such pair per line) and output a decision as 'yes' or 'no' for each pair. This program can assume that only artists seen in the input dataset for Question 1 will be used and that you can do anything within the program you'd like, but you must make a decision for each pair at the time it is seen (i.e. you can't look at them all and then decide).

Notes:

No Googol-ing solutions for this please! We'd much prefer your own approach.
Remember, you'll never see an offer for the same duet twice in this problem.
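To make the input/output contract concrete, here is a minimal Python skeleton of a streaming decision loop. It deliberately uses a fixed-threshold placeholder rule, which is not meant as a good strategy; choosing the rule is the actual question. The `NET` lookup, `THRESHOLD` value, and `decide` function are all hypothetical, and the sketch assumes artist names contain no commas.

```python
import sys

# Hypothetical net-sentiment lookup. In a real solution this would be computed
# from the Question 1 dataset; these values come from the small example table.
NET = {
    "Miley Cyrus": 0,
    "Meghan Trainor": -2,
    "Elton John": 0,
    "Sam Smith": 2,
    "Garth Brooks": 1,
}

# Placeholder cut-off (an assumption for illustration, not a recommended value).
THRESHOLD = 3


def decide(lines, net=NET, threshold=THRESHOLD):
    """Yield 'yes' or 'no' for each 'female.artist,male.artist' line as it arrives.

    At most one 'yes' is ever produced, since only one song can be chosen.
    """
    chosen = False
    for line in lines:
        line = line.strip()
        if not line:
            continue
        female, male = line.split(",", 1)
        score = abs(net[female] - net[male])  # the Question 1 metric
        if not chosen and score >= threshold:
            chosen = True
            yield "yes"
        else:
            yield "no"


if __name__ == "__main__":
    # Decisions are printed line by line, before later offers are read.
    for decision in decide(sys.stdin):
        print(decision, flush=True)
```

Because `decide` is a generator over the input stream, each decision is emitted before the next pair is read, which matches the "decide when seen" constraint.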

## Question 3: Cohort Sentiment (Optional)

Using Pig again, determine the cumulative, net sentiment for each pair of male and female artists that are mentioned by the same users.

An answer to this question should first determine which artists are mentioned by the same users and then for each of those users, determine their "net sentiment" about the pairing. For example, user1 in the example dataset above mentions both Miley Cyrus and Garth Brooks with a sentiment of 1 (i.e. positive) for each. The "net sentiment" for that user about this artist pairing is then 1 + 1 = 2. Keep in mind that we only care about pairs containing one male and one female artist though -- so we wouldn't care about the net value for this user in regards to a pairing of say Elton John and Garth Brooks, who are both male artists mentioned (by user1).

The "cumulative" value for each male/female artist pair is then the sum of the net sentiment from each user that mentions both. For example, the resulting value for the pairing of Miley Cyrus and Garth Brooks should be the sum of the net sentiment from user1 and user2, and would not include any contributions for user3 or user4 who only mention Garth Brooks.

The result for the example dataset would be:

female.artist	male.artist	cohort.sentiment
Meghan Trainor	Sam Smith	0
Meghan Trainor	Garth Brooks	0
Miley Cyrus	Sam Smith	2
Meghan Trainor	Elton John	-2
Miley Cyrus	Garth Brooks	2
Miley Cyrus	Elton John	0
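The table above can be reproduced with a short Python sketch of the per-user logic. It is only an illustration on the example rows (the question asks for Pig), the function name is hypothetical, and it assumes that repeated mentions of one artist by the same user are summed, which the example data never exercises.

```python
from collections import defaultdict

# Example rows from the Dataset section: (user, artist, gender, sentiment)
ROWS = [
    ("user1", "Miley Cyrus", "female", 1),
    ("user2", "Miley Cyrus", "female", -1),
    ("user1", "Elton John", "male", 1),
    ("user2", "Elton John", "male", -1),
    ("user1", "Sam Smith", "male", 1),
    ("user2", "Sam Smith", "male", 1),
    ("user1", "Meghan Trainor", "female", -1),
    ("user2", "Meghan Trainor", "female", -1),
    ("user1", "Garth Brooks", "male", 1),
    ("user2", "Garth Brooks", "male", 1),
    ("user3", "Garth Brooks", "male", 0),
    ("user4", "Garth Brooks", "male", -1),
]


def cohort_sentiment(rows):
    """Sum, over users who mention both artists, the user's net sentiment per pair."""
    # user -> gender -> artist -> that user's summed sentiment for the artist
    by_user = defaultdict(lambda: {"female": defaultdict(int), "male": defaultdict(int)})
    for user, artist, gender, sentiment in rows:
        by_user[user][gender][artist] += sentiment

    totals = defaultdict(int)
    for mentions in by_user.values():
        # Only female/male combinations mentioned by this same user contribute.
        for female, f_score in mentions["female"].items():
            for male, m_score in mentions["male"].items():
                totals[(female, male)] += f_score + m_score
    return dict(totals)


if __name__ == "__main__":
    for (female, male), value in cohort_sentiment(ROWS).items():
        print(f"{female}\t{male}\t{value}")
```

Users such as user3 and user4, who mention only a male artist, produce no pairs and so contribute nothing.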

## Question 4: Anomalies (Optional)

In all the previous questions, net sentiment was calculated or used to create a solution -- which is not ideal. Considering only positive/negative/neutral opinions like this can be useful but also involves a significant loss of information, namely how many opinions are being included. A net value of 0 calculated this way could result from a single neutral opinion or 1000 opinions, half negative and half positive.

This question then will involve taking a different approach to Question 1 where instead of using net sentiment, you will use the frequency of each sentiment type to figure out which artist pairs are least like the others. The first step in answering this question will be to compute a very similar result to Question 1 but with 3 statistics instead of 1, like this:

female.artist	male.artist	num.positive	num.negative	num.neutral
Meghan Trainor	Sam Smith	2	2	0
Meghan Trainor	Garth Brooks	2	3	1
.. and so on ...

Note that the sign or value of the sentiment no longer matters -- the result above includes only the count of the number of occurrences of each sentiment type for a specific artist pair.
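As a sketch of this first step only, the per-pair frequencies for the example dataset could be computed in Python as below. The function name is hypothetical, and the sketch assumes a pair's counts are simply the sum of the two artists' individual counts, which is consistent with the two rows shown above. It says nothing about how to pick the 10 most unusual pairs.

```python
from collections import Counter
from itertools import product

# Example rows from the Dataset section: (user, artist, gender, sentiment)
ROWS = [
    ("user1", "Miley Cyrus", "female", 1),
    ("user2", "Miley Cyrus", "female", -1),
    ("user1", "Elton John", "male", 1),
    ("user2", "Elton John", "male", -1),
    ("user1", "Sam Smith", "male", 1),
    ("user2", "Sam Smith", "male", 1),
    ("user1", "Meghan Trainor", "female", -1),
    ("user2", "Meghan Trainor", "female", -1),
    ("user1", "Garth Brooks", "male", 1),
    ("user2", "Garth Brooks", "male", 1),
    ("user3", "Garth Brooks", "male", 0),
    ("user4", "Garth Brooks", "male", -1),
]


def pair_frequencies(rows):
    """Map (female, male) -> (num_positive, num_negative, num_neutral)."""
    counts = {}   # artist -> Counter keyed by sentiment value (1, -1, 0)
    genders = {}
    for _user, artist, gender, sentiment in rows:
        counts.setdefault(artist, Counter())[sentiment] += 1
        genders[artist] = gender

    females = [a for a, g in genders.items() if g == "female"]
    males = [a for a, g in genders.items() if g == "male"]
    result = {}
    for female, male in product(females, males):
        combined = counts[female] + counts[male]
        result[(female, male)] = (combined[1], combined[-1], combined[0])
    return result


if __name__ == "__main__":
    for (female, male), stats in pair_frequencies(ROWS).items():
        print(female, male, *stats, sep="\t")
```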

Given these frequencies, we ask that you determine which 10 artist pairs (of differing gender) are least like the others based on all 3 statistics. For example, if we were to ignore neutral sentiment for now and just consider positive and negative sentiment, then this is how the different artist pairings relate to one another:

(Figure: plot of positive versus negative sentiment counts, where each dot corresponds to a single artist pair)

Notes:

There is no "right" answer to this question so feel free to take any approach you want!
You do not have to use Pig for this -- you can use anything you want like java, python + pandas, R, SQL or whatever you think would be best
The user.name field is irrelevant for this question

Apache Pig

Pig is a high-level, imperative-style programming language that is great for data munging. Because it is built on relational-algebra operations, it looks a lot like SQL, but it is much better for chaining operations together into more complicated workflows.

We created an EC2 image with pig, python, and java (using Mortar). Here's how you can launch an EC2 instance:

log in to https://console.aws.amazon.com/ec2 and proceed to "Launch Instance"
select the "N. Virginia" region from the dropdown in the top right (otherwise the correct AMI won't be available)
click on "Community AMIs" and search for "nbs" (or "nbs-eng-challenge" if there are many results)
select the "nbs-eng-challenge" image (AMI ami-42bd232a)
select the "m3.medium" instance size and follow the steps to launch the instance

Note: m3.medium costs $0.07/hr ($1.68/day). NBS will send you a $5 Amazon gift card to cover the cost.

Once the instance launches, you can log into it via ssh (as user nbs, password 1d4396ae44931bd626b38448824c0d20), and you'll find everything you need in the directory /home/nbs/data_engineer_challenge.

We started a Pig script for you that will read the input data from the correct place (~/data_engineer_challenge/data/sentiment) and count the number of times each artist is mentioned. Here is an example of how to run that script as well as the expected output:

nbs@ip-10-169-43-241:~$ cd data_engineer_challenge/
nbs@ip-10-169-43-241:~/data_engineer_challenge$ mortar local:run pigscripts/sentiment.pig 

Launching Pig: 3 jobs scheduled
Full logs will be written to logs/local-pig.log

Starting job job_local_0001
Map 001: 100%       Input records:  1	Output records: 1

... [ Omitting some output for brevity ] ...

Pig run completed in 21 seconds. 3/3 jobs successful.

(Sammy Hagar,79)
(Carrie Underwood,76)
(Demi Lovato,74)
(Andrew W.K.,73)
(Akon,72)
... [ Omitting some output for brevity ] ...

Success! - No error expected.

The mortar local:run command should be all you need to run any pigscripts you create or modify in ~/data_engineer_challenge/pigscripts. The example script, sentiment.pig, looks like this:

-- Load raw data from csv files in project data directory using comma delimiter
raw = LOAD '/home/nbs/data_engineer_challenge/data/sentiment' USING PigStorage(',') 
  AS (user_name:chararray, artist_name:chararray, artist_gender:chararray, sentiment:int);

-- Group all the mention records by artist name
grouped = GROUP raw BY artist_name;

-- Determine the number of records associated with each artist
counts = FOREACH grouped GENERATE group AS artist_name, COUNT(raw) AS count;

-- Order the results by the number of records per artist
result = ORDER counts BY count DESC;

-- Print the results to stdout
DUMP result;

Hopefully that seems pretty intuitive. There are many tutorials out there like this one, and for the sake of answering the questions you shouldn't have to learn about many Pig constructs other than the basic ones like FOREACH, JOIN, GROUP, ORDER, CROSS, and STORE/LOAD. There are a lot of others available though so you might find some fancy ways to use them. The version of Pig installed is 0.12 and you can find the full documentation here.
