The objective of this project is to predict the empirical click-through rate (CTR) with statistical models built on the categorical features of the given data.
The data we are provided with is the Track 2 data of the KDD Cup 2012, which includes 12 categorical features in the main data file, plus five additional data files: queryid_tokensid.txt, purchasedkeywordid_tokensid.txt, titleid_tokensid.txt, descriptionid_tokensid.txt, and userid_profile.txt.
The main data was divided into three sections: training, testing, and validation. The training portion is used to fit the statistical model, while the testing portion is used to evaluate it. The validation set then serves to validate the model, where the model attempts to predict data that is completely unknown to us.
The categorical features we chose to employ were gender, age, and the token similarity for QueryID, TitleID, Keyword, and Description. The statistical model we decided to utilize for this project was the Naive Bayes classifier.
We aggregate our data using MapReduce, run on Amazon Web Services. Since age is in a separate file from the instance file, our first step is to join these files with a MapReduce job sorted on the key userid. To make the inputs easy to reference together, userid_profile.txt is renamed part-uid and placed in the folder with our training data, so that everything can be read by a single job.
MapReduce is then run with mapper_age_1.py and reducer_age_1.py, with the inputs coming from the "training-60" folder, including our userid_profile data. The output data is in the form: 'age \t click \t impression'.
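The join step above can be sketched as a standard reduce-side join. This is a minimal illustration, not the project's actual scripts: the column layout is an assumption (profile rows as 'UserID \t Gender \t Age', instance rows starting with Click and Impression and ending with UserID).

```python
# Sketch of a reduce-side join in the spirit of mapper_age_1.py /
# reducer_age_1.py; field positions are assumptions, not the real layout.

def map_line(line):
    """Key every row by UserID and tag its source ('A' sorts before 'B',
    so a user's age reaches the reducer before that user's instances)."""
    fields = line.rstrip('\n').split('\t')
    if len(fields) == 3:                      # userid_profile row
        userid, _gender, age = fields
        return '%s\tA\t%s' % (userid, age)
    click, impression, userid = fields[0], fields[1], fields[-1]
    return '%s\tB\t%s\t%s' % (userid, click, impression)

def reduce_lines(sorted_lines):
    """Emit 'age \t click \t impression' for each joined instance."""
    out, age, current = [], None, None
    for line in sorted_lines:
        parts = line.split('\t')
        if parts[0] != current:
            current, age = parts[0], None
        if parts[1] == 'A':
            age = parts[2]
        elif age is not None:
            out.append('%s\t%s\t%s' % (age, parts[2], parts[3]))
    return out
```

Hadoop Streaming would apply the mapper to each input line and feed the shuffle-sorted result to the reducer; the tag column exists only so the profile row sorts ahead of the instance rows for the same user.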
The results of this MapReduce are then used as inputs for a second MapReduce, using mapper_age_2.py and reducer_age_2.py. This outputs data in the form: 'feature value \t feature name \t clicks \t impressions'.
As with age, gender is in a separate file from the instance file, and the aggregation proceeds in the same manner. Using the same initial input as the age MapReduce, we run mapper_gender_1.py and reducer_gender_1.py. The output data is in the form: 'gender \t click \t impression'.
The results of this MapReduce, as with age, are then used as inputs for a second MapReduce using mapper_gender_2.py and reducer_gender_2.py. This outputs data in the form: 'feature value \t feature name \t clicks \t impressions'.
To begin, the files titleid_tokensid.txt, queryid_tokensid.txt, descriptionid_tokensid.txt, and purchasedkeywordid_tokensid.txt were each run through MapReduce with their corresponding file_append_*.py file and an identity reducer (e.g. file_append_title.py was run with titleid_tokensid.txt). This was done because otherwise the rows of the four token files were indistinguishable from one another.
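A plausible sketch of what each file_append_*.py mapper does is simply to tag every row with the name of its source file; the label and the 'id \t token|token|...' column layout here are assumptions, not the project's exact code.

```python
# Sketch of a file_append_*.py mapper: tag each row with a source label so
# rows from the four token files stay distinguishable after later joins.

def append_label(line, label):
    ident, tokens = line.rstrip('\n').split('\t')
    return '%s\t%s\t%s' % (ident, label, tokens)
```

file_append_title.py would use a label like 'title', and the query, description, and keyword variants their own labels; the identity reducer then passes the tagged rows through unchanged.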
Using the outputs from the above MapReduce along with the training data, we ran four MapReduce jobs to append the tokens to their corresponding ids. The MapReduce files are as follows:
mapper_tok_1.py, reducer_tok_1.py
mapper_tok_2.py, reducer_tok_2.py
mapper_tok_3.py, reducer_tok_3.py
mapper_tok_4.py, reducer_tok_4.py
After all four jobs are completed, a final MapReduce is run using token_simi.py as the mapper and an identity reducer. This final MapReduce calculates similarity ratios for the number of matching tokens between:
- query to title
- query to description
- query to keyword
- query to title and description
- query to title and keyword
- query to description and keyword
- query to title, keyword, and description
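One plausible definition of such a ratio is the fraction of distinct query tokens that also appear in the other field; the exact formula used in token_simi.py may differ, so this is a sketch under that assumption.

```python
# Sketch of a matching-token similarity ratio: the share of the query's
# distinct tokens that also occur in the other field's token list.

def similarity(query_tokens, other_tokens):
    q = set(query_tokens.split('|'))
    o = set(other_tokens.split('|'))
    return len(q & o) / float(len(q)) if q else 0.0
```

For the combined ratios (e.g. query to title and description), `other_tokens` would be the union of both fields' token sets.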
naive_train_model.py is run locally with its inputs being the output of the Aggregating Data MapReduce. This file calculates the probabilities:
P(feature = value | click)
P(feature = value | noclick)
P(click)
P(noclick)
P(feature = "UNK" | click)
P(feature = "UNK" | noclick)
and writes them to naive_probabilities.txt.
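These probabilities can be derived from one feature's aggregated rows of (value, clicks, impressions) as sketched below; the add-alpha smoothing backing the "UNK" entries is an assumption about how naive_train_model.py handles unseen values.

```python
# Sketch of Naive Bayes training from aggregated counts for ONE feature.
# rows: (value, clicks, impressions); the smoothing scheme is an assumption.

def train_feature(rows, alpha=1.0):
    clicks = {v: c for v, c, imp in rows}
    noclicks = {v: imp - c for v, c, imp in rows}
    tot_c, tot_n = sum(clicks.values()), sum(noclicks.values())
    size = len(rows) + 1                      # one extra slot for "UNK"
    p_c = {v: (c + alpha) / (tot_c + alpha * size) for v, c in clicks.items()}
    p_n = {v: (c + alpha) / (tot_n + alpha * size) for v, c in noclicks.items()}
    p_c['UNK'] = alpha / (tot_c + alpha * size)   # P(feature="UNK" | click)
    p_n['UNK'] = alpha / (tot_n + alpha * size)   # P(feature="UNK" | noclick)
    p_click = tot_c / float(tot_c + tot_n)        # P(click); P(noclick) = 1 - P(click)
    return p_c, p_n, p_click
```

Each conditional distribution sums to one over the observed values plus "UNK", which is what lets prediction fall back gracefully on values never seen in training.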
After building the dictionary of conditional probabilities, we go back to using MapReduce for prediction.
The following files are run to get the validation data into a format useful to us. They are run with validation-20; titleid_tokensid.txt, queryid_tokensid.txt, descriptionid_tokensid.txt, and purchasedkeywordid_tokensid.txt (appended as in Prediction By Similarity Index); and userid_profile.txt:
naive_mapper_tok_1.py, naive_reducer_tok_1.py
naive_mapper_tok_2.py, naive_reducer_tok_2.py
naive_mapper_tok_3.py, naive_reducer_tok_3.py
naive_mapper_tok_4.py, naive_reducer_tok_4.py
naive_mapper_agegender_5.py, naive_reducer_agegender_5.py
naive_token_simi.py, identity reducer
The final output of these MapReduce jobs is then run through one last MapReduce. The following files are run with naive_probabilities.txt as a cache file; they produce our model's output predictions:
- naive_pred_mapper.py, identity reducer: runs predictions using age, gender, and every similarity ratio
- naive_pred_agender_mapper.py, identity reducer: runs predictions using only age and gender
- naive_pred_agqtkdsimi_mapper.py, identity reducer: runs predictions using age, gender, and the query to keyword, description, title ratio
- naive_pred_simi_mapper.py, identity reducer: runs predictions using only the similarity ratios
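The per-instance scoring inside these prediction mappers can be sketched as follows. The nested dict layout (feature, then value, then probability) is an assumed representation of the naive_probabilities.txt contents, not its actual on-disk format.

```python
# Sketch of Naive Bayes scoring: combine P(feature=value | click) terms in
# log space, fall back to the "UNK" entry for unseen values, and normalize
# against the no-click likelihood to get a predicted CTR.
import math

def predict_ctr(instance, p_c, p_n, p_click):
    log_c, log_n = math.log(p_click), math.log(1.0 - p_click)
    for feature, value in instance.items():
        log_c += math.log(p_c[feature].get(value, p_c[feature]['UNK']))
        log_n += math.log(p_n[feature].get(value, p_n[feature]['UNK']))
    return 1.0 / (1.0 + math.exp(log_n - log_c))
```

The four mappers differ only in which features they put into `instance`; everything else about the scoring is shared.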
The outputs are downloaded and concatenated locally, then run through the R script auc.R to get our AUC score.
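The score itself comes from auc.R; as a cross-check, the same quantity can be computed with the pairwise (Mann-Whitney) formulation of AUC, which the quadratic loop below implements directly.

```python
# AUC as the probability that a random clicked instance is scored above a
# random unclicked one, counting ties as half a win. O(n^2), fine for a
# sanity check on small samples.

def auc(labels, scores):
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```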
*Note: A more detailed explanation of our project can be found in Final Report.pdf.