Data Engineering Challenge

Collaborative efforts in music are always a gamble. While much of top 40 music consists of content resulting from partnerships between two or more artists, there are risks involved with bringing together such artists who may have opposing styles and motives. However, these risks are often worth the reward and juxtapositions of contrasting styles have led to some very successful efforts like those from Eminem and Dido (Stan), P!nk and Nate Ruess (Just Give Me a Reason), or even Lady Gaga and Kermit the Frog (Gypsy).

These partnerships aren't always between men and women, but a lot of the more interesting ones are. In this challenge, you'll use a (fictitious) dataset to try to determine which pairings like this (i.e. between male and female artists) would be most "interesting" based on the sentiment within statements made by users online. The sentiment of each statement will fall into one of 3 categories: positive, negative, or neutral. Your job will then be to pair the artists up and draw conclusions about the sentiment around each pair.

Dataset

The input dataset will consist of the following fields:

  • user.name - The screen name of the user who mentioned the artist
  • artist.name - The name of the artist mentioned
  • artist.gender - The gender of the artist (either 'male' or 'female')
  • sentiment - The attitude within the statement made (either 1, -1, or 0 for positive, negative, and neutral respectively)

For example:

| user.name | artist.name | artist.gender | sentiment |
|---|---|---|---|
| user1 | Miley Cyrus | female | 1 |
| user2 | Miley Cyrus | female | -1 |
| user1 | Elton John | male | 1 |
| user2 | Elton John | male | -1 |
| user1 | Sam Smith | male | 1 |
| user2 | Sam Smith | male | 1 |
| user1 | Meghan Trainor | female | -1 |
| user2 | Meghan Trainor | female | -1 |
| user1 | Garth Brooks | male | 1 |
| user2 | Garth Brooks | male | 1 |
| user3 | Garth Brooks | male | 0 |
| user4 | Garth Brooks | male | -1 |

Questions

These questions all pertain to the data above and we only ask for answers to the first two, but if you're enjoying the problem then we would love to see answers to the others as well. Please submit an accompanying writeup to document your thought process when approaching the problems. When working in a collaborative environment, being able to explain yourself and justify your reasoning is often just as important as the work itself.

## Question 1: Sentiment Dissonance

Using Apache Pig (see the Apache Pig section for more details on it), determine the single male/female artist pair with the largest difference in cumulative sentiment. We'll assume this difference would make the pairing more "interesting" since the public opinion about each artist is polarized.

An answer to this question should first determine the "net sentiment" for each artist. For example, Garth Brooks is mentioned in the example dataset above 4 times and the net sentiment for him over all mentions is 1 + 1 + 0 + -1 = 1. This value should be calculated for each artist and then all the male and female artists should be paired together and ordered by the absolute value of the difference in that value for each pair.

A result for the example dataset would be (ordered by difference):

| female.artist | male.artist | sentiment.difference |
|---|---|---|
| Meghan Trainor | Sam Smith | abs(-2 - 2) = 4 |
| Meghan Trainor | Garth Brooks | abs(-2 - 1) = 3 |
| Miley Cyrus | Sam Smith | abs(0 - 2) = 2 |
| Meghan Trainor | Elton John | abs(-2 - 0) = 2 |
| Miley Cyrus | Garth Brooks | abs(0 - 1) = 1 |
| Miley Cyrus | Elton John | abs(0 - 0) = 0 |

And we'd conclude that Meghan Trainor and Sam Smith make for the best pair.
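The challenge asks for a Pig solution, but the arithmetic behind the table can be checked with a short Python sketch. This is only an illustration on the 12 example rows; the names `ROWS`, `net_sentiment`, and `rank_pairs` are hypothetical and not part of the challenge materials.

```python
from collections import defaultdict
from itertools import product

# Example rows from the Dataset section: (user, artist, gender, sentiment).
ROWS = [
    ("user1", "Miley Cyrus", "female", 1),
    ("user2", "Miley Cyrus", "female", -1),
    ("user1", "Elton John", "male", 1),
    ("user2", "Elton John", "male", -1),
    ("user1", "Sam Smith", "male", 1),
    ("user2", "Sam Smith", "male", 1),
    ("user1", "Meghan Trainor", "female", -1),
    ("user2", "Meghan Trainor", "female", -1),
    ("user1", "Garth Brooks", "male", 1),
    ("user2", "Garth Brooks", "male", 1),
    ("user3", "Garth Brooks", "male", 0),
    ("user4", "Garth Brooks", "male", -1),
]


def net_sentiment(rows):
    """Sum the sentiment over all mentions, keyed by (artist, gender)."""
    totals = defaultdict(int)
    for _user, artist, gender, sentiment in rows:
        totals[(artist, gender)] += sentiment
    return totals


def rank_pairs(rows):
    """Return (female, male, |difference|) tuples, largest difference first."""
    totals = net_sentiment(rows)
    females = [(a, s) for (a, g), s in totals.items() if g == "female"]
    males = [(a, s) for (a, g), s in totals.items() if g == "male"]
    pairs = [(f, m, abs(fs - ms)) for (f, fs), (m, ms) in product(females, males)]
    return sorted(pairs, key=lambda p: p[2], reverse=True)


ranked = rank_pairs(ROWS)
print(ranked[0])  # ('Meghan Trainor', 'Sam Smith', 4)
```

In Pig, the same shape would likely come from a GROUP plus SUM per artist, followed by a CROSS of the female and male relations.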

Notes:

  1. We don't care about pairings of same-sex artists (e.g. Garth Brooks and Sam Smith)
  2. The user.name field is irrelevant for this question

## Question 2: Making Decisions

Assume you run your own record label and your job is to determine which proposed male/female duet is worth producing music for. You'll encounter one such opportunity each week over the course of a year, and each time you'll have to make a decision right away. You can only choose one song to produce that year, and you'll never be offered the opportunity to work with the same duet twice. For example, you might run into a chance to produce a duet for Michael Bublé and Rihanna in week 1 and then another for Kenny Chesney and Iggy Azalea in week 2, but you can't wait until the end of week 2 to decide -- each opportunity expires at the end of the week in which it arose.

Finally, assume that the metric calculated in Question 1 for sentiment dissonance is a perfect estimator of success. Each time you run into the chance to produce a song for a duet, you can calculate the value from Question 1 for it and use that to make your decision.

Given this, write a program (in Python, Java, or Bash) that will take strings on stdin in the form female.artist,male.artist (one such pair per line) and output a decision as 'yes' or 'no' for each pair. This program can assume that only artists seen in the input dataset for Question 1 will be used. You can do anything you'd like within the program, but you must make a decision for each pair at the time it is seen (i.e. you can't look at them all and then decide).

Notes:

  1. No Googol-ing solutions for this please! We'd much prefer your own approach.
  2. Remember, you'll never see an offer for the same duet twice in this problem.
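One classical way to frame this kind of one-shot, decide-immediately problem is as an optimal-stopping rule: decline an initial batch of offers while recording the best score seen, then accept the first later offer that beats it. The sketch below illustrates that idea only and is not necessarily the approach the authors have in mind. The class name `OneShotDecider`, the default of 52 offers per year, the n/e cutoff, and the fallback of accepting the final offer are all assumptions made for illustration. The `score` argument stands for the Question 1 dissonance value of the offered pair.

```python
import math


class OneShotDecider:
    """Say 'yes' to at most one offer, deciding on each as it arrives."""

    def __init__(self, total_offers=52):
        self.total_offers = total_offers
        # Observation phase (assumption): decline roughly the first n/e offers.
        self.cutoff = int(total_offers / math.e)
        self.seen = 0
        self.best_observed = float("-inf")
        self.accepted = False

    def decide(self, score):
        self.seen += 1
        if self.accepted:
            return "no"
        if self.seen <= self.cutoff:
            self.best_observed = max(self.best_observed, score)
            return "no"
        # Accept the first offer that beats the observation phase,
        # or the final offer if nothing has been accepted yet.
        if score > self.best_observed or self.seen >= self.total_offers:
            self.accepted = True
            return "yes"
        return "no"


decider = OneShotDecider(total_offers=5)
answers = [decider.decide(s) for s in [3, 2, 5, 9, 1]]
print(answers)  # ['no', 'no', 'yes', 'no', 'no']
```

A full program would read each `female.artist,male.artist` line from stdin, look up the pair's dissonance value, and print the result of `decide` before reading the next line.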

## Question 3: Cohort Sentiment (Optional)

Using Pig again, determine the cumulative, net sentiment for each pair of male and female artists that are mentioned by the same users.

An answer to this question should first determine which artists are mentioned by the same users and then for each of those users, determine their "net sentiment" about the pairing. For example, user1 in the example dataset above mentions both Miley Cyrus and Garth Brooks with a sentiment of 1 (i.e. positive) for each. The "net sentiment" for that user about this artist pairing is then 1 + 1 = 2. Keep in mind that we only care about pairs containing one male and one female artist though -- so we wouldn't care about the net value for this user in regards to a pairing of say Elton John and Garth Brooks, who are both male artists mentioned (by user1).

The "cumulative" value for each male/female artist pair is then the sum of the net sentiment from each user that mentions both. For example, the resulting value for the pairing of Miley Cyrus and Garth Brooks should be the sum of the net sentiment from user1 and user2, and would not include any contributions for user3 or user4 who only mention Garth Brooks.

The result for the example dataset would be:

| female.artist | male.artist | cohort.sentiment |
|---|---|---|
| Meghan Trainor | Sam Smith | 0 |
| Meghan Trainor | Garth Brooks | 0 |
| Miley Cyrus | Sam Smith | 2 |
| Meghan Trainor | Elton John | -2 |
| Miley Cyrus | Garth Brooks | 2 |
| Miley Cyrus | Elton John | 0 |
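The cohort calculation can be sanity-checked outside Pig with a small Python sketch. It assumes each user mentions a given artist at most once, as in the example data; `ROWS` and `cohort_sentiment` are hypothetical names used only for illustration.

```python
from collections import defaultdict

# Example rows from the Dataset section: (user, artist, gender, sentiment).
ROWS = [
    ("user1", "Miley Cyrus", "female", 1),
    ("user2", "Miley Cyrus", "female", -1),
    ("user1", "Elton John", "male", 1),
    ("user2", "Elton John", "male", -1),
    ("user1", "Sam Smith", "male", 1),
    ("user2", "Sam Smith", "male", 1),
    ("user1", "Meghan Trainor", "female", -1),
    ("user2", "Meghan Trainor", "female", -1),
    ("user1", "Garth Brooks", "male", 1),
    ("user2", "Garth Brooks", "male", 1),
    ("user3", "Garth Brooks", "male", 0),
    ("user4", "Garth Brooks", "male", -1),
]


def cohort_sentiment(rows):
    """Map (female, male) -> summed per-user net sentiment for that pair."""
    # Collect each user's mentions so pairs are only formed within one user.
    by_user = defaultdict(list)
    for user, artist, gender, sentiment in rows:
        by_user[user].append((artist, gender, sentiment))

    totals = defaultdict(int)
    for mentions in by_user.values():
        females = [(a, s) for a, g, s in mentions if g == "female"]
        males = [(a, s) for a, g, s in mentions if g == "male"]
        # A user who mentions only one gender contributes nothing.
        for f_name, f_sent in females:
            for m_name, m_sent in males:
                totals[(f_name, m_name)] += f_sent + m_sent
    return dict(totals)


result = cohort_sentiment(ROWS)
print(result[("Miley Cyrus", "Garth Brooks")])  # 2
```

In Pig, the per-user pairing would most naturally be a self-JOIN on the user name after splitting the data by gender.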

## Question 4: Anomalies (Optional)

In all the previous questions, net sentiment was calculated or used to create a solution -- which is not ideal. Considering only positive/negative/neutral opinions like this can be useful but also involves a significant loss of information, namely how many opinions are being included. A net value of 0 calculated this way could result from a single neutral opinion or from 1000 opinions, half negative and half positive.

This question will therefore involve taking a different approach to Question 1: instead of using net sentiment, you will use the frequency of each sentiment type to figure out which artist pairs are least like the others. The first step in answering this question will be to compute a result very similar to that of Question 1 but with 3 statistics instead of 1, like this:

| female.artist | male.artist | num.positive | num.negative | num.neutral |
|---|---|---|---|---|
| Meghan Trainor | Sam Smith | 2 | 2 | 0 |
| Meghan Trainor | Garth Brooks | 2 | 3 | 1 |

... and so on ...

Note that the sign or value of the sentiment no longer matters -- the result above includes only the count of occurrences of each sentiment type for a specific artist pair.

Given these frequencies, we ask that you determine which 10 artist pairs (of differing gender) are least like the others based on all 3 statistics. For example, if we were to ignore neutral sentiment for now and just consider positive and negative sentiment, then this is how the different artist pairings relate to one another:

[Figure: scatter plot of positive vs. negative sentiment counts; each dot corresponds to a single artist pair]

Notes:

  1. There is no "right" answer to this question so feel free to take any approach you want!
  2. You do not have to use Pig for this -- you can use anything you want like java, python + pandas, R, SQL or whatever you think would be best
  3. The user.name field is irrelevant for this question
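Since there is no single right answer here, the following Python sketch shows just one possible starting point: build the three counts per pair, then rank pairs by their Euclidean distance from the mean count vector. The distance-from-centroid measure is an assumption chosen for simplicity, not a method prescribed by the challenge, and `pair_frequencies` and `most_unusual` are hypothetical names.

```python
import math
from collections import Counter
from itertools import product

# Example rows from the Dataset section: (user, artist, gender, sentiment).
ROWS = [
    ("user1", "Miley Cyrus", "female", 1),
    ("user2", "Miley Cyrus", "female", -1),
    ("user1", "Elton John", "male", 1),
    ("user2", "Elton John", "male", -1),
    ("user1", "Sam Smith", "male", 1),
    ("user2", "Sam Smith", "male", 1),
    ("user1", "Meghan Trainor", "female", -1),
    ("user2", "Meghan Trainor", "female", -1),
    ("user1", "Garth Brooks", "male", 1),
    ("user2", "Garth Brooks", "male", 1),
    ("user3", "Garth Brooks", "male", 0),
    ("user4", "Garth Brooks", "male", -1),
]


def pair_frequencies(rows):
    """Map (female, male) -> (num_positive, num_negative, num_neutral)."""
    counts = {}  # (artist, gender) -> Counter keyed by sentiment value
    for _user, artist, gender, sentiment in rows:
        counts.setdefault((artist, gender), Counter())[sentiment] += 1
    females = [k for k in counts if k[1] == "female"]
    males = [k for k in counts if k[1] == "male"]
    result = {}
    for f, m in product(females, males):
        combined = counts[f] + counts[m]
        result[(f[0], m[0])] = (combined[1], combined[-1], combined[0])
    return result


def most_unusual(freqs, k=10):
    """Return the k pairs farthest (Euclidean) from the mean count vector."""
    vectors = list(freqs.values())
    centroid = tuple(sum(col) / len(vectors) for col in zip(*vectors))
    ranked = sorted(freqs, key=lambda p: math.dist(freqs[p], centroid), reverse=True)
    return ranked[:k]


freqs = pair_frequencies(ROWS)
print(freqs[("Meghan Trainor", "Garth Brooks")])  # (2, 3, 1)
```

On the real dataset the counts vary widely between artists, so scaling each statistic before measuring distance may be worth considering.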

Apache Pig

Pig is a high-level, imperative-style programming language that is great for data munging. Because its operators are rooted in relational algebra, it looks a lot like SQL, but it is much better suited to chaining operations together into more complicated workflows.

We created an EC2 image with Pig, Python, and Java (using Mortar). Here's how you can launch an EC2 instance:

  • log in to https://console.aws.amazon.com/ec2 and proceed to "Launch Instance"
  • select the "N. Virginia" region from the dropdown in the top right (otherwise the correct AMI won't be available)
  • click on "Community AMIs" and search for "nbs" (or "nbs-eng-challenge" if there are many results)
  • select the "nbs-eng-challenge" image (AMI ami-42bd232a)
  • select the "m3.medium" instance size and follow the steps to launch the instance

Note: m3.medium costs $0.07/hr ($1.68/day). NBS will send you a $5 Amazon gift card to cover the cost.

Once the instance launches, you can log into it via ssh (as user nbs, password 1d4396ae44931bd626b38448824c0d20), and you'll find everything you need in the directory /home/nbs/data_engineer_challenge.

We started a Pig script for you that will read the input data from the correct place (~/data_engineer_challenge/data/sentiment) and count the number of times each artist is mentioned. Here is an example of how to run that script as well as the expected output:

nbs@ip-10-169-43-241:~$ cd data_engineer_challenge/
nbs@ip-10-169-43-241:~/data_engineer_challenge$ mortar local:run pigscripts/sentiment.pig 

Launching Pig: 3 jobs scheduled
Full logs will be written to logs/local-pig.log

Starting job job_local_0001
Map 001: 100%       Input records:  1	Output records: 1

... [ Omitting some output for brevity ] ...

Pig run completed in 21 seconds. 3/3 jobs successful.

(Sammy Hagar,79)
(Carrie Underwood,76)
(Demi Lovato,74)
(Andrew W.K.,73)
(Akon,72)
... [ Omitting some output for brevity ] ...

Success! - No error expected.

The mortar local:run command should be all you need to run any pigscripts you create or modify in ~/data_engineer_challenge/pigscripts. The example script, sentiment.pig, looks like this:

-- Load raw data from csv files in project data directory using comma delimiter
raw = LOAD '/home/nbs/data_engineer_challenge/data/sentiment' USING PigStorage(',') 
  AS (user_name:chararray, artist_name:chararray, artist_gender:chararray, sentiment:int);

-- Group all the mention records by artist name
grouped = GROUP raw BY artist_name;

-- Determine the number of records associated with each artist
counts = FOREACH grouped GENERATE group AS artist_name, COUNT(raw) AS count;

-- Order the results by the number of records per artist
result = ORDER counts BY count DESC;

-- Print the results to stdout
DUMP result;

Hopefully that seems pretty intuitive. There are many tutorials out there like this one, and for the sake of answering the questions you shouldn't have to learn about many Pig constructs other than the basic ones like FOREACH, JOIN, GROUP, ORDER, CROSS, and STORE/LOAD. There are a lot of others available though so you might find some fancy ways to use them. The version of Pig installed is 0.12 and you can find the full documentation here.
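For readers more at home in Python than Pig, the GROUP / COUNT / ORDER steps of sentiment.pig correspond roughly to the sketch below. It assumes the same comma-delimited layout that the script's `PigStorage(',')` call implies, and it runs on a few sample rows taken from the Dataset section rather than the real data directory; `SAMPLE` and `count_mentions` are hypothetical names.

```python
import csv
import io
from collections import Counter

# A few example rows in the comma-delimited layout the Pig script loads:
# user_name, artist_name, artist_gender, sentiment
SAMPLE = """user1,Miley Cyrus,female,1
user2,Miley Cyrus,female,-1
user1,Garth Brooks,male,1
user2,Garth Brooks,male,1
user3,Garth Brooks,male,0
user4,Garth Brooks,male,-1
"""


def count_mentions(lines):
    """Count records per artist, most-mentioned first (GROUP + COUNT + ORDER)."""
    reader = csv.reader(lines)
    counts = Counter(row[1] for row in reader if row)
    return counts.most_common()


result = count_mentions(io.StringIO(SAMPLE))
print(result)  # [('Garth Brooks', 4), ('Miley Cyrus', 2)]
```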

Contributors

eczech, spucci