Team-16 Project

Introduction

The objective of this project is to predict the empirical click-through rate (CTR) with a statistical model built on the categorical features of the given data.

The data we were provided with is the Track 2 data of KDD Cup 2012, which includes 12 categorical features in the main data file and five additional data files: queryid_tokensid.txt, purchasedkeywordid_tokensid.txt, titleid_tokensid.txt, descriptionid_tokensid.txt, and userid_profile.txt.

The main data was divided into three sections: training, testing, and validation. The training portion is used to train the statistical model, and the testing portion is used to test it. The validation set then validates the model by having it predict data that is completely unknown to us.

The categorical features we chose to employ were gender, age, and the token similarity between the query and the title, keyword, and description. The statistical model we decided to use for this project was the Naive Bayes classifier.

Features

Aggregating the data

Prediction Based on Age

We aggregate our data using MapReduce, which is run on Amazon Web Services. Since age is in a separate file from the instance file, our first step is to join these files together. To do this, we run a MapReduce job sorted by the key userid. To begin, we place userid_profile.txt with our training data so that everything can be read at once. To make it easier to reference when specifying our inputs, userid_profile.txt is renamed part-uid and placed in the training data folder.

MapReduce is then run with mapper_age_1.py and reducer_age_1.py with the inputs coming from the "training-60" folder including our userid_profile data. The output data is in the form:

'age \t click \t impression'.
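
The join described above can be sketched as a reduce-side join. This is a minimal illustration, not the contents of mapper_age_1.py or reducer_age_1.py: the function names, the column layouts (three tab-separated fields for a profile line, twelve for an instance line with click and impression first and userid last), and the "UNK" placeholder for users without a profile row are all assumptions.

```python
from itertools import groupby


def map_line(line):
    """Turn one input line into a (userid, tag, payload) record.

    Assumed layouts (hypothetical, tab-separated):
      profile line:  userid, gender, age                 (3 fields)
      instance line: click, impression, ..., userid      (12 fields)
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) == 3:
        userid, _gender, age = fields
        return (userid, "A", age)  # tag "A" sorts before "B"
    click, impression, userid = fields[0], fields[1], fields[-1]
    return (userid, "B", f"{click}\t{impression}")


def reduce_sorted(records):
    """Join records sorted by (userid, tag); yield 'age<TAB>click<TAB>impression'."""
    for _userid, group in groupby(records, key=lambda r: r[0]):
        age = "UNK"  # assumed placeholder when no profile row exists
        for _, tag, payload in group:
            if tag == "A":
                age = payload  # the profile row arrives first within a user
            else:
                yield f"{age}\t{payload}"
```

Locally, `reduce_sorted(sorted(map_line(l) for l in lines))` mimics the shuffle-and-sort step that sits between the mapper and the reducer.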

The results of this MapReduce are then used as inputs for the second MapReduce, which uses mapper_age_2.py and reducer_age_2.py. This outputs data in the form:

'feature value \t feature name \t clicks \t impressions'.
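
As a rough sketch of what this second pass computes, the snippet below sums clicks and impressions per feature value and emits lines in the four-column form. The function name `aggregate` and its `feature_name` parameter are hypothetical, and an in-memory dictionary stands in for the streaming reducer.

```python
from collections import defaultdict


def aggregate(lines, feature_name="age"):
    """Sum clicks and impressions per feature value.

    Input lines:  'value<TAB>click<TAB>impression'
    Output lines: 'value<TAB>feature_name<TAB>clicks<TAB>impressions'
    """
    totals = defaultdict(lambda: [0, 0])  # value -> [clicks, impressions]
    for line in lines:
        value, click, impression = line.rstrip("\n").split("\t")
        totals[value][0] += int(click)
        totals[value][1] += int(impression)
    return [
        f"{value}\t{feature_name}\t{clicks}\t{impressions}"
        for value, (clicks, impressions) in sorted(totals.items())
    ]
```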

Prediction Based on Gender

As with age, gender is in a separate file from the instances file, and the data is aggregated in a similar manner. Using the same initial input as our MapReduce for age, we run mapper_gender_1.py and reducer_gender_1.py. The output data is in the form:

'gender \t click \t impression'.

The results of this MapReduce, as with age, are then used as inputs for the second MapReduce, which uses mapper_gender_2.py and reducer_gender_2.py. This outputs data in the form:

'feature value \t feature name \t clicks \t impressions'.

Prediction By Similarity Index

To begin, the files titleid_tokensid.txt, queryid_tokensid.txt, descriptionid_tokensid.txt, and purchasedkeywordid_tokensid.txt were each run in MapReduce with their corresponding file_append_*.py file (e.g. file_append_title.py was run with the titleid_tokensid.txt file and an identity reducer). This was done because the token files would otherwise be indistinguishable from one another.
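
To illustrate why this tagging step helps, here is a minimal sketch that appends a source label to every line of a token file, so that later jobs can tell a title row from a query row. The function name, the tag strings, and placing the tag in a trailing tab-separated column are assumptions; the actual file_append_*.py scripts may mark lines differently.

```python
def append_source_tag(lines, tag):
    """Yield each line with a trailing tab-separated source tag."""
    for line in lines:
        stripped = line.rstrip("\n")
        yield f"{stripped}\t{tag}"
```

For example, `append_source_tag(open_file, "title")` would turn a row holding an id and its tokens into the same row followed by the label `title`.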

Using the outputs from the above MapReduce along with the training data, we ran four MapReduce jobs to append the tokens to their corresponding ids. The MapReduce files are as follows:

  • mapper_tok_1.py, reducer_tok_1.py
  • mapper_tok_2.py, reducer_tok_2.py
  • mapper_tok_3.py, reducer_tok_3.py
  • mapper_tok_4.py, reducer_tok_4.py

After all MapReduce jobs are completed, a final MapReduce is run using the file token_simi.py as the mapper and an identity reducer. This final MapReduce calculates similarity ratios for the number of matching tokens between:

  • query to title
  • query to description
  • query to key
  • query to title and description
  • query to title and key
  • query to description and key
  • query to title, key, description
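
The seven ratios listed above can be sketched as follows. The exact definition used in token_simi.py is not given here, so this example assumes the ratio is the fraction of query tokens that also appear in the other field (or in the union of several fields), and that tokens are separated by "|". All function and key names are hypothetical.

```python
from itertools import combinations


def tokens(field):
    """Split a '|'-separated token string into a set (format is an assumption)."""
    return set(field.split("|")) if field else set()


def similarity(query, other):
    """Fraction of query tokens that also occur in `other`; 0.0 for an empty query."""
    return len(query & other) / len(query) if query else 0.0


def all_ratios(query, title, description, key):
    """Return the seven query-to-field(s) ratios as a dict."""
    q = tokens(query)
    fields = {
        "title": tokens(title),
        "description": tokens(description),
        "key": tokens(key),
    }
    ratios = {}
    for size in (1, 2, 3):
        for names in combinations(fields, size):
            # Compare the query against the union of the chosen fields.
            combined = set().union(*(fields[n] for n in names))
            ratios["query_to_" + "_".join(names)] = similarity(q, combined)
    return ratios
```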

Model

Naive Bayes

naive_train_model.py is run locally, taking the output of the Aggregating the data MapReduce jobs as its input. This file calculates the probabilities:

  • P(feature = value | click)
  • P(feature = value | noclick)
  • P(click)
  • P(noclick)
  • P(feature = "UNK" | click)
  • P(feature = "UNK" | noclick)
where UNK refers to the unknown values for the feature. The output of this is called naive_probabilities.txt.
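
A minimal sketch of how these probabilities could be estimated from the aggregated counts is shown below. It is not the contents of naive_train_model.py: the function name, the returned dictionary layout, and the use of add-one smoothing with one extra "UNK" bucket per feature are assumptions made for illustration.

```python
from collections import defaultdict


def train(lines):
    """Estimate Naive Bayes probabilities from aggregated count lines.

    Input lines: 'value<TAB>feature<TAB>clicks<TAB>impressions'
    Returns {"prior": {...}, "cond": {feature: {value: (p_click, p_noclick)}}}.
    """
    counts = defaultdict(dict)  # feature -> value -> (clicks, noclicks)
    for line in lines:
        value, feature, clicks, impressions = line.rstrip("\n").split("\t")
        counts[feature][value] = (int(clicks), int(impressions) - int(clicks))

    model = {"prior": {}, "cond": {}}
    for feature, values in counts.items():
        total_click = sum(c for c, _ in values.values())
        total_noclick = sum(n for _, n in values.values())
        # Add-one smoothing (assumed); the "+ 1" reserves mass for UNK.
        denom_click = total_click + len(values) + 1
        denom_noclick = total_noclick + len(values) + 1
        table = {
            v: ((c + 1) / denom_click, (n + 1) / denom_noclick)
            for v, (c, n) in values.items()
        }
        table["UNK"] = (1 / denom_click, 1 / denom_noclick)
        model["cond"][feature] = table
        # Every feature is assumed to cover the same instances, so any
        # feature's totals give the same class prior.
        total = total_click + total_noclick
        model["prior"] = {"click": total_click / total,
                          "noclick": total_noclick / total}
    return model
```

With this smoothing, the conditional probabilities for each feature, including the UNK entry, sum to one within each class.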

After building the dictionary of conditional probabilities, we go back to using MapReduce for prediction.

The following files are run to get the validation data into a format that is useful for us. They are run with validation-20, titleid_tokensid.txt, queryid_tokensid.txt, descriptionid_tokensid.txt, and purchasedkeywordid_tokensid.txt (appended as in Prediction By Similarity Index), and userid_profile.txt:

  1. naive_mapper_tok_1.py, naive_reducer_tok_1.py
  2. naive_mapper_tok_2.py, naive_reducer_tok_2.py
  3. naive_mapper_tok_3.py, naive_reducer_tok_3.py
  4. naive_mapper_tok_4.py, naive_reducer_tok_4.py
  5. naive_mapper_agegender_5.py, naive_reducer_agegender_5.py
  6. naive_token_simi.py, identity reducer

The final output of these MapReduce jobs is then run through a final MapReduce. The following files are run with naive_probabilities.txt as a cache file. They will give our model's output predictions.

  1. naive_pred_mapper.py, identity reducer: runs predictions using age, gender, and every similarity ratio
  2. naive_pred_agender_mapper.py, identity reducer: runs predictions using only age and gender
  3. naive_pred_agqtkdsimi_mapper.py, identity reducer: runs predictions using age, gender, and the query to key, description, title ratio
  4. naive_pred_simi_mapper.py, identity reducer: runs predictions using only the similarity ratios
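
As an illustration of the prediction step, the sketch below combines a class prior with per-feature conditional probabilities and returns the posterior probability of a click. How the real mappers turn naive_probabilities.txt into a predicted CTR is not described here, so the function name, the dictionary layout, and the choice of the normalised posterior as the output are assumptions.

```python
import math


def predict_click_probability(instance, prior, cond):
    """Return P(click | features) under the Naive Bayes assumption.

    instance: {feature: value}
    prior:    {"click": p, "noclick": p}
    cond:     {feature: {value: (p_given_click, p_given_noclick), "UNK": (...)}}
    """
    log_click = math.log(prior["click"])
    log_noclick = math.log(prior["noclick"])
    for feature, value in instance.items():
        table = cond[feature]
        # Values never seen in training fall back to the UNK entry.
        p_click, p_noclick = table.get(value, table["UNK"])
        log_click += math.log(p_click)
        log_noclick += math.log(p_noclick)
    # Normalise in log space to avoid underflow with many features.
    shift = max(log_click, log_noclick)
    e_click = math.exp(log_click - shift)
    e_noclick = math.exp(log_noclick - shift)
    return e_click / (e_click + e_noclick)
```

Restricting `instance` to a subset of features (for example only age and gender) corresponds to the different mapper variants listed above.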

The outputs are downloaded, concatenated locally, and then passed to the R script auc.R to compute our AUC score.
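
For readers who want to see what an AUC over click and impression counts measures, here is a small Python sketch. It is not a translation of auc.R, whose contents are not shown here; the function name and the row layout (clicks, impressions, predicted CTR) are assumptions. Each impression is treated as one sample, clicked or not, and rows with tied predictions share a trapezoid of the ROC curve.

```python
from itertools import groupby


def weighted_auc(rows):
    """Area under the ROC curve for rows of (clicks, impressions, predicted_ctr)."""
    rows = sorted(rows, key=lambda r: r[2], reverse=True)
    total_pos = sum(c for c, _, _ in rows)
    total_neg = sum(i - c for c, i, _ in rows)
    if total_pos == 0 or total_neg == 0:
        raise ValueError("AUC is undefined without both clicks and non-clicks")

    area = 0.0
    seen_pos = 0
    # Walk down the ranking; each group of tied scores adds one trapezoid.
    for _, group in groupby(rows, key=lambda r: r[2]):
        group = list(group)
        pos = sum(c for c, _, _ in group)
        neg = sum(i - c for c, i, _ in group)
        area += neg * (seen_pos + pos / 2)
        seen_pos += pos
    return area / (total_pos * total_neg)
```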

*Note: A more detailed explanation of our project can be found in Final Report.pdf.
