The objective of this project is to predict the empirical click-through rate (CTR) with statistical models built on the categorical features of the given data.
The data we are provided with is the Track 2 data of the KDD Cup 2012, which includes 12 categorical features in the main data file, plus five additional data files: queryid_tokensid.txt, purchasedkeywordid_tokensid.txt, titleid_tokensid.txt, descriptionid_tokensid.txt, and userid_profile.txt.
The main data was divided into three sections: training, testing, and validation. The training portion is used to fit the statistical model, while the testing portion is used to evaluate it. The validation set then serves to validate the model, where the model attempts to predict data that is completely unknown to us.
The categorical features we chose to employ were gender, age, and the token similarity for QueryID, TitleID, Keyword, and Description. The statistical model we decided to utilize for this project was the Naive Bayes classifier.
We aggregate our data using MapReduce, run on Amazon Web Services. Since age is in a separate file from the instance file, our first step is to join these files with a MapReduce job sorted on the key userid. To make the inputs easy to reference together, userid_profile.txt is renamed part-uid and placed in the folder with our training data, so that everything can be read by a single job.
MapReduce is then run with mapper_age_1.py and reducer_age_1.py, with the inputs coming from the "training-60" folder, including our userid_profile data. The output data is in the form: 'age \t click \t impression'.
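The join step above can be sketched as a standard reduce-side join. This is a minimal illustration, not the project's actual scripts: the column layout is an assumption (profile rows as 'UserID \t Gender \t Age', instance rows starting with Click and Impression and ending with UserID).

```python
# Sketch of a reduce-side join in the spirit of mapper_age_1.py /
# reducer_age_1.py; field positions are assumptions, not the real layout.

def map_line(line):
    """Key every row by UserID and tag its source ('A' sorts before 'B',
    so a user's age reaches the reducer before that user's instances)."""
    fields = line.rstrip('\n').split('\t')
    if len(fields) == 3:                      # userid_profile row
        userid, _gender, age = fields
        return '%s\tA\t%s' % (userid, age)
    click, impression, userid = fields[0], fields[1], fields[-1]
    return '%s\tB\t%s\t%s' % (userid, click, impression)

def reduce_lines(sorted_lines):
    """Emit 'age \t click \t impression' for each joined instance."""
    out, age, current = [], None, None
    for line in sorted_lines:
        parts = line.split('\t')
        if parts[0] != current:
            current, age = parts[0], None
        if parts[1] == 'A':
            age = parts[2]
        elif age is not None:
            out.append('%s\t%s\t%s' % (age, parts[2], parts[3]))
    return out
```

Hadoop Streaming would apply the mapper to each input line and feed the shuffle-sorted result to the reducer; the tag column exists only so the profile row sorts ahead of the instance rows for the same user.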
The results of this MapReduce are then used as inputs for a second MapReduce, using mapper_age_2.py and reducer_age_2.py. This outputs data in the form: 'feature value \t feature name \t clicks \t impressions'.
As with age, gender is in a separate file from the instance file, and the aggregation proceeds in the same manner. Using the same initial input as the age MapReduce, we run mapper_gender_1.py and reducer_gender_1.py. The output data is in the form: 'gender \t click \t impression'.
The results of this MapReduce, as with age, are then used as inputs for a second MapReduce using mapper_gender_2.py and reducer_gender_2.py. This outputs data in the form: 'feature value \t feature name \t clicks \t impressions'.
To begin, the files titleid_tokensid.txt, queryid_tokensid.txt, descriptionid_tokensid.txt, and purchasedkeywordid_tokensid.txt were each run through MapReduce with their corresponding file_append_*.py file and an identity reducer (e.g. file_append_title.py was run with titleid_tokensid.txt). This was done because otherwise the rows of the four token files were indistinguishable from one another.
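A plausible sketch of what each file_append_*.py mapper does is simply to tag every row with the name of its source file; the label and the 'id \t token|token|...' column layout here are assumptions, not the project's exact code.

```python
# Sketch of a file_append_*.py mapper: tag each row with a source label so
# rows from the four token files stay distinguishable after later joins.

def append_label(line, label):
    ident, tokens = line.rstrip('\n').split('\t')
    return '%s\t%s\t%s' % (ident, label, tokens)
```

file_append_title.py would use a label like 'title', and the query, description, and keyword variants their own labels; the identity reducer then passes the tagged rows through unchanged.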
Using the outputs from the above MapReduce along with the training data, we ran four MapReduce jobs to append the tokens to their corresponding ids. The MapReduce files are as follows:
mapper_tok_1.py, reducer_tok_1.py
mapper_tok_2.py, reducer_tok_2.py
mapper_tok_3.py, reducer_tok_3.py
mapper_tok_4.py, reducer_tok_4.py
After all four jobs are completed, a final MapReduce is run using token_simi.py as the mapper and an identity reducer. This final MapReduce calculates similarity ratios for the number of matching tokens between:
- query to title
- query to description
- query to keyword
- query to title and description
- query to title and keyword
- query to description and keyword
- query to title, keyword, and description
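One plausible definition of such a ratio is the fraction of distinct query tokens that also appear in the other field; the exact formula used in token_simi.py may differ, so this is a sketch under that assumption.

```python
# Sketch of a matching-token similarity ratio: the share of the query's
# distinct tokens that also occur in the other field's token list.

def similarity(query_tokens, other_tokens):
    q = set(query_tokens.split('|'))
    o = set(other_tokens.split('|'))
    return len(q & o) / float(len(q)) if q else 0.0
```

For the combined ratios (e.g. query to title and description), `other_tokens` would be the union of both fields' token sets.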
naive_train_model.py is run locally with its inputs being the output of the Aggregating Data MapReduce. This file calculates the probabilities:
P(feature = value | click)
P(feature = value | noclick)
P(click)
P(noclick)
P(feature = "UNK" | click)
P(feature = "UNK" | noclick)
and writes them to naive_probabilities.txt.
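These probabilities can be derived from one feature's aggregated rows of (value, clicks, impressions) as sketched below; the add-alpha smoothing backing the "UNK" entries is an assumption about how naive_train_model.py handles unseen values.

```python
# Sketch of Naive Bayes training from aggregated counts for ONE feature.
# rows: (value, clicks, impressions); the smoothing scheme is an assumption.

def train_feature(rows, alpha=1.0):
    clicks = {v: c for v, c, imp in rows}
    noclicks = {v: imp - c for v, c, imp in rows}
    tot_c, tot_n = sum(clicks.values()), sum(noclicks.values())
    size = len(rows) + 1                      # one extra slot for "UNK"
    p_c = {v: (c + alpha) / (tot_c + alpha * size) for v, c in clicks.items()}
    p_n = {v: (c + alpha) / (tot_n + alpha * size) for v, c in noclicks.items()}
    p_c['UNK'] = alpha / (tot_c + alpha * size)   # P(feature="UNK" | click)
    p_n['UNK'] = alpha / (tot_n + alpha * size)   # P(feature="UNK" | noclick)
    p_click = tot_c / float(tot_c + tot_n)        # P(click); P(noclick) = 1 - P(click)
    return p_c, p_n, p_click
```

Each conditional distribution sums to one over the observed values plus "UNK", which is what lets prediction fall back gracefully on values never seen in training.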
After building the dictionary of conditional probabilities, we go back to using MapReduce for prediction.
The following files are run to get the validation data into a format useful to us. They are run with validation-20; titleid_tokensid.txt, queryid_tokensid.txt, descriptionid_tokensid.txt, and purchasedkeywordid_tokensid.txt (appended as in Prediction By Similarity Index); and userid_profile.txt:
naive_mapper_tok_1.py, naive_reducer_tok_1.py
naive_mapper_tok_2.py, naive_reducer_tok_2.py
naive_mapper_tok_3.py, naive_reducer_tok_3.py
naive_mapper_tok_4.py, naive_reducer_tok_4.py
naive_mapper_agegender_5.py, naive_reducer_agegender_5.py
naive_token_simi.py, identity reducer
The final output of these MapReduce jobs is then run through one last MapReduce. The following files are run with naive_probabilities.txt as a cache file; they produce our model's output predictions:
- naive_pred_mapper.py, identity reducer: runs predictions using age, gender, and every similarity ratio
- naive_pred_agender_mapper.py, identity reducer: runs predictions using only age and gender
- naive_pred_agqtkdsimi_mapper.py, identity reducer: runs predictions using age, gender, and the query to keyword, description, title ratio
- naive_pred_simi_mapper.py, identity reducer: runs predictions using only the similarity ratios
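The per-instance scoring inside these prediction mappers can be sketched as follows. The nested dict layout (feature, then value, then probability) is an assumed representation of the naive_probabilities.txt contents, not its actual on-disk format.

```python
# Sketch of Naive Bayes scoring: combine P(feature=value | click) terms in
# log space, fall back to the "UNK" entry for unseen values, and normalize
# against the no-click likelihood to get a predicted CTR.
import math

def predict_ctr(instance, p_c, p_n, p_click):
    log_c, log_n = math.log(p_click), math.log(1.0 - p_click)
    for feature, value in instance.items():
        log_c += math.log(p_c[feature].get(value, p_c[feature]['UNK']))
        log_n += math.log(p_n[feature].get(value, p_n[feature]['UNK']))
    return 1.0 / (1.0 + math.exp(log_n - log_c))
```

The four mappers differ only in which features they put into `instance`; everything else about the scoring is shared.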
The outputs are downloaded and concatenated locally, then run through the R script auc.R to get our AUC score.
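The score itself comes from auc.R; as a cross-check, the same quantity can be computed with the pairwise (Mann-Whitney) formulation of AUC, which the quadratic loop below implements directly.

```python
# AUC as the probability that a random clicked instance is scored above a
# random unclicked one, counting ties as half a win. O(n^2), fine for a
# sanity check on small samples.

def auc(labels, scores):
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```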
*Note: A more detailed explanation of our project can be found in Final Report.pdf.