
rk2900 / dlf


Deep learning for flexible market price modeling (landscape forecasting) in real-time bidding advertising. An implementation of our KDD 2019 paper, along with several other prediction models implemented in Python.

License: MIT License

Python 99.85% Roff 0.15%
deep-learning real-time-bidding market-price survival-analysis bid-landscape-forecasting

dlf's Introduction

Deep Landscape Forecasting for Real-time Bidding Advertising

This is the implementation for our KDD 2019 paper "Deep Landscape Forecasting for Real-time Bidding Advertising".

The preprint version of the paper has been published on arXiv: https://arxiv.org/abs/1905.03028.

If you have any problems, please feel free to contact the authors Kan Ren, Jiarui Qin and Lei Zheng.

Abstract

The emergence of real-time auctions in online advertising has drawn great attention to modeling market competition, i.e., bid landscape forecasting. The problem is formulated as forecasting the probability distribution of the market price for each ad auction. Considering the censorship issue caused by the second-price auction mechanism, many researchers have devoted their efforts to bid landscape forecasting by incorporating survival analysis from the medical research field. However, most existing solutions focus either on counting-based statistics over segmented sample clusters or on learning a parameterized model under heuristic assumptions about the distribution form. Moreover, they do not consider the sequential patterns of the features over the price space. To capture more sophisticated yet flexible patterns at a fine-grained level of the data, we propose a Deep Landscape Forecasting (DLF) model which combines deep learning for probability distribution forecasting with survival analysis for censorship handling. Specifically, we utilize a recurrent neural network to flexibly model the conditional winning probability with respect to each bid price. We then conduct bid landscape forecasting through the probability chain rule with strict mathematical derivations. In an end-to-end manner, we optimize the model by minimizing two negative likelihood losses with comprehensive motivations. Without any specific assumption about the distribution form of the bid landscape, our model shows great advantages over previous works in fitting various sophisticated market price distributions. In experiments over two large-scale real-world datasets, our model significantly outperforms state-of-the-art solutions under various metrics.
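The probability chain rule mentioned above can be illustrated numerically. The snippet below is only a minimal NumPy sketch under the assumption that the model outputs a conditional winning probability for each discretized bid price; it is not the repository's code, and the values of h are made up.

import numpy as np

# Hypothetical conditional winning probabilities h_l for a few discretized bid prices,
# i.e. the probability of winning at price b_l given that no lower price has won.
h = np.array([0.05, 0.10, 0.20, 0.30, 0.25])

# Chain rule: probability that the market price z falls into the l-th price interval,
# p_l = h_l * prod_{j < l} (1 - h_j).
not_won_before = np.concatenate(([1.0], np.cumprod(1.0 - h)[:-1]))
p = h * not_won_before

# Survival probability beyond the l-th price, S_l = prod_{j <= l} (1 - h_j).
survival = np.cumprod(1.0 - h)

print("market price distribution:", p)
print("survival curve:", survival)
print("total probability mass:", p.sum() + survival[-1])  # equals 1 by construction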

Setups

We recommend running with TensorFlow (>= 1.3) and Python 2.7.6.

The models are trained under the same hardware settings: an Intel(R) Core(TM) i7-6900K CPU, an NVIDIA GeForce GTX 1080Ti GPU and 128 GB of memory. The training time of each compared model is less than ten hours (as measured on the slowest model, MTLSA) on each dataset.

All the models are trained until convergence, and we consider learning rates from {1e-3, 1e-4, 1e-5}. The value of $\alpha$ is tuned to 0.25. The batch size is fixed at 128 and the embedding dimension is 32. All the deep learning models take the input features and feed them through an embedding layer for the subsequent feedforward calculation. The hyperparameters of each model are tuned and the best performances are reported.
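For convenience, the experimental settings above can be summarized in one place. The variable name below is hypothetical and simply restates the values from this section.

# Hypothetical summary of the experimental settings described above.
SETTINGS = {
    "learning_rate_candidates": [1e-3, 1e-4, 1e-5],  # tuned per model
    "alpha": 0.25,                                   # the $\alpha$ value mentioned above
    "batch_size": 128,
    "embedding_dim": 32,
}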

Data Preparation

The full dataset can be downloaded at this link; the corresponding MD5 checksum is 841698b0dd8718b1b4a4ff2e54bb72b4.

The raw data of iPinYou can be downloaded from Dropbox.

The feature engineering code is here; it is forked from, and slightly different from, the original repository.

Data specification

Each subset of the data contains a .yzbx.txt file, a feature dictionary featindex.txt and a .log.txt file. The .log.txt file was created from the raw data of the original data source (please refer to our paper). We then performed feature engineering according to the feature dictionary featindex.txt; the resulting feature-engineered data are stored in .yzbx.txt.

If you need to reproduce the experiments, you can run the models over .yzbx.txt.

In the .yzbx.txt file, each line is a sample containing the "yzbx" data, with the fields separated by spaces. Here z is the true market price, b is the proposed bid price and x is the list of features (multi-hot encoded as feat_id:1). In the experiments we only use the z, b and x data. Note that for the uncensored data z < b, while for the censored data z >= b.
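As a rough illustration of the format described above, here is a minimal line parser. The assumption that the first three space-separated fields are y, z and b, in that order, followed by the feat_id:1 features, is ours; please verify it against the actual .yzbx.txt files.

# Minimal, hypothetical parser for one line of a .yzbx.txt file.
# Assumed field order: y z b feat_id:1 feat_id:1 ... (verify against your data).
def parse_yzbx_line(line):
    fields = line.strip().split()
    y, z, b = (float(v) for v in fields[:3])
    features = [int(token.split(":")[0]) for token in fields[3:]]
    censored = z >= b  # losing (censored) sample, as noted above
    return y, z, b, features, censored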

Run the Codes

The running commands are listed below.

python km.py             # Kaplan-Meier
python gamma_model.py    # Gamma
python cox.py            # Lasso-Cox and DeepSurv
python deephit.py        # DeepHit
python DWPP.py           # DWPP
python RNN.py 0.0001      # for RNN
python DLF.py 0.0001     # for DLF

Citations

@inproceedings{ren2019deep,
  title={Deep Landscape Forecasting for Real-time Bidding Advertising},
  author={Ren, Kan and Qin, Jiarui and Zheng, Lei and Zhang, Weinan and Yu, Yong},
  booktitle={Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining},
  year={2019},
  organization={ACM}
}

dlf's People

Contributors

qinjr, rk2900


dlf's Issues

Question about c-index results

Hello! Firstly, I'd like to say that I enjoyed the paper and the model's idea of predicting the bidding landscape.

I'm trying to implement the paper's model in PyTorch. With my implementation, the best c-index values I can get on the test dataset are around 68%. The dataset I'm using is iPinYou campaign 2259. On the training set I get results closer to the paper's 87%, namely 83% (a score obtained from an intermediate checkpoint, before any hyperparameter search, so I think there is some room for closing the gap).

Do you perhaps remember whether the c-index was calculated on the training dataset?
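For reference, a generic way to compute a concordance index on censored data is the lifelines utility sketched below; the arrays are made up and this is not necessarily the evaluation protocol behind the numbers in the paper.

# Generic c-index computation with lifelines (not necessarily the paper's exact protocol).
import numpy as np
from lifelines.utils import concordance_index

times    = np.array([23.0, 77.0, 45.0, 12.0])  # observed market price z, or bid price b for censored samples
pred     = np.array([30.0, 60.0, 50.0, 10.0])  # model's predicted market price
observed = np.array([1, 0, 1, 1])              # 1 = uncensored (z < b), 0 = censored (z >= b)

print(concordance_index(times, pred, observed))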

What is the MD5 code?

What is the MD5 code in:
"The full dataset can be downloaded at this link and the corresponding MD5 code is 841698b0dd8718b1b4a4ff2e54bb72b4."?
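The MD5 code is a checksum used to verify that the downloaded archive is intact. A minimal way to check it in Python is sketched below; the file name DLF-data.7z is only an assumption taken from the next issue, so adjust the path to your own download.

# Verify the downloaded archive against the published MD5 checksum.
import hashlib

def md5sum(path, chunk_size=1 << 20):
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()

print(md5sum("DLF-data.7z") == "841698b0dd8718b1b4a4ff2e54bb72b4")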

about the DLF-data.7z

Thank you for your amazing work. However, I get the clone error below when trying to git clone this repository. Would you please add some other way to download "DLF-data.7z", e.g. Baidu Netdisk?

Downloading DLF-data.7z (1.3 GB)
Error downloading object: DLF-data.7z (b5c53e5): Smudge error: Error downloading DLF-data.7z (b5c53e513ba892ac455c1a7d461aa280eac8fa9db182ec5cb5e7fad1f043e9cd): batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.

Few questions while reading your paper

Hi Ren,

I have read your paper and encountered some questions.

1. What's the winning condition in one auction?

It confused me that your paper gives the winning condition as $z < b$, while you give the uncensored condition $z \le b$ in the README.

I also found an interesting statement about this dataset in a previously published paper [screenshot not reproduced here].

So, could an advertiser win the auction when z equals b?

I assume not for now.

2. What does h denote in your paper?

The definition in the paper is attached here [screenshot not reproduced here].

So let's take h as "the probability of just winning given $z \ge b_{l-1}$".
But if z lies in $V_l = (b_l, b_{l+1}]$, that means $z > b_l$; then how could we win by bidding $b_l$?

Since z needs to be no greater than b for the advertiser to win, I will give my understanding of h [screenshot not reproduced here].

And if you confirm that one cannot win with $z = b$, I believe the definition of $V_l$ should be $[b_l, b_{l+1})$.

Only in this way can one get the equation $h_l = \Pr(z \in V_{l-1} \mid z \ge b_{l-1})$.

3. What is the exact meaning of each h produced by the RNN cell?

After the second question, I'm trapped by an even bigger one while reading your code.

Assume that we still take h as the probability of just winning given $z \ge b_{l-1}$, which should be consistent between your paper and your code.

In the code below, you take the product of all the h's together, which seems unreasonable.

survival_rate_last_one = tf.reduce_prod(x[0:bid_len])
anlp_rate_last_one = tf.reduce_prod(x[0:market_len + 1])
anlp_rate_last_two = tf.reduce_prod(x[0:market_len])

This suggests to me that the output of the RNN may not be the h in your paper.
Instead, it looks like the losing probability at $b_l$ given $z \ge b_{l-1}$, not the winning probability, i.e. the survival rate at $b_l$.

So I wonder: have you given a reversed definition of h in the paper?
If not, what does the output of the RNN stand for?

Thanks.

How to plot the survival curve?

Hello, I would like to know how the two curves mentioned in your paper, 'Survival Curve of Different Model' and 'Figure 2: Learning curves', were drawn.
Thank you.
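For what it is worth, a generic way to draw a survival curve from per-price conditional winning probabilities is sketched below with matplotlib. It assumes such probabilities are available from a trained model and is not the plotting script that produced the figures in the paper; the probabilities here are made up.

# Generic survival-curve plot from hypothetical conditional winning probabilities.
import numpy as np
import matplotlib.pyplot as plt

prices = np.arange(1, 301)                        # discretized bid prices
h = 0.05 / (1.0 + np.exp(-(prices - 80) / 25.0))  # made-up conditional winning probabilities
survival = np.cumprod(1.0 - h)                    # S(b_l) = prod_{j <= l} (1 - h_j)

plt.step(prices, survival, where="post", label="predicted survival curve")
plt.xlabel("bid price")
plt.ylabel("survival probability")
plt.legend()
plt.savefig("survival_curve.png")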

Market Price (z) value in Losing Cases in Sample Data '2259'

Hi Kan,

Kudos for the excellent research and development work!

In the sample 'yzbx' dataset, we can see values of z >= b for the lost (censored) cases in both the training and the test datasets. Can you please let us know how those values were obtained, since it is not possible to observe the market price when the auction is lost? Was the market price randomly generated (with z >= b and z < MAX_SEQ_LEN)?

Question regarding the paper

Hi, Kan. I read your paper and find this model, which needs no distribution assumption, admirable and promising. Here are two questions I hope you can answer:

  1. In Equation (6), should the subscript of b be l instead of l-1? I found that in your KDD presentation the subscript had been corrected. [screenshot of Equation (6) not reproduced here]

  2. For the losing part of the second loss L_lose in Equation (12), should the range of l in the sum be [0, index of the bidding price]? The previous equations state "where $l_i$ is the interval index of the true market price $z_i \in V_{l_i}$ given the feature vector $x_i$"; however, the market price is censored in losing bids, so $l_i$ cannot be obtained. I therefore expect the range here to be different from the previous equations, even though the same notation is used. [screenshot of Equation (12) not reproduced here]

Thank you!
