
acbull / unbiased_lambdamart


Code for WWW'19 "Unbiased LambdaMART: An Unbiased Pairwise Learning-to-Rank Algorithm", which is based on LightGBM

License: MIT License

learning-to-rank lightgbm bias

unbiased_lambdamart's Introduction

Unbiased LambdaMart

Unbiased LambdaMART is an unbiased version of the traditional LambdaMART algorithm. It jointly estimates the biases at click positions and at unclick positions, and learns an unbiased ranker using a pairwise loss function.

The repository contains two parts: first, an implementation of Unbiased LambdaMART based on LightGBM; second, a simulated click dataset with its generation scripts for evaluation.

See our WWW 2019 (also known as The Web Conference) paper Unbiased LambdaMART: An Unbiased Pairwise Learning-to-Rank Algorithm for more details.

Overview

  • Unbiased_LambdaMart:

    An implementation of Unbiased LambdaMART based on LightGBM. Note that LightGBM supports a wide variety of applications of gradient-boosted decision trees; our modification is mainly in src/objective/rank_objective.hpp, the LambdaMART ranking objective file (a schematic sketch of the debiased gradient follows this list).

  • evaluation:

    Contains the synthetic click dataset generated using click models. This part of the code is mainly forked from https://github.com/QingyaoAi/Unbiased-Learning-to-Rank-with-Unbiased-Propensity-Estimation. We also add the config files needed to run our Unbiased LambdaMART on this synthetic dataset.
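
For intuition, here is a minimal Python sketch of the pairwise-debiasing idea: each pairwise lambda is divided by the estimated click propensity t+ of the clicked position and unclick propensity t- of the unclicked position, and the propensities are then re-estimated from the accumulated pairwise losses. All names, and the regularization exponent p, are our own illustration; the actual update lives in rank_objective.hpp.

import numpy as np

def debiased_lambdas(pairs, t_plus, t_minus, sigma=2.0, p=0.0):
    # pairs: (clicked_position, unclicked_position, score_difference) per pair
    # t_plus / t_minus: current propensity estimates indexed by position
    lambdas = []
    cost_plus = np.zeros_like(t_plus)    # accumulates evidence for t_plus
    cost_minus = np.zeros_like(t_minus)  # accumulates evidence for t_minus
    for i, j, s_diff in pairs:
        raw = -sigma / (1.0 + np.exp(sigma * s_diff))   # RankNet-style gradient
        lambdas.append(raw / (t_plus[i] * t_minus[j]))  # inverse-propensity weighting
        loss = np.log1p(np.exp(-sigma * s_diff))        # pairwise logistic loss
        cost_plus[i] += loss / t_minus[j]
        cost_minus[j] += loss / t_plus[i]
    # Re-estimate propensities: normalize to position 0, smooth with exponent 1/(p+1)
    t_plus = (cost_plus / max(cost_plus[0], 1e-12)) ** (1.0 / (p + 1.0))
    t_minus = (cost_minus / max(cost_minus[0], 1e-12)) ** (1.0 / (p + 1.0))
    return lambdas, t_plus, t_minus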

Setup

First, compile Unbias_LightGBM (the original LightGBM plus the Unbiased LambdaMART implementation).

On Linux, LightGBM can be built with CMake and gcc or Clang.

Install CMake with sudo apt install cmake.

Run the following commands:

cd Unbias_LightGBM/
mkdir build ; cd build
cmake ..
make -j4

Note: glibc >= 2.14 is required. After compilation, you will get a "lightgbm" executable file in the folder.

Example

We modified the original example file to give an illustration.

Compile, then run the following commands:

cd Unbias_LightGBM
cp ./lightgbm ./examples/lambdarank/
cd ./examples/lambdarank/
./lightgbm config="train.conf"

In addition to the original XXX.train file (which provides the features) and XXX.train.query file (which records which query each document belongs to), our modified LambdaMART requires an XXX.train.rank file that provides the position information used for debiasing. For your own data, remember to add this file.
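
Assuming the .rank file follows the same one-value-per-line layout as the other files (our reading of the example data, not an official spec), it can be produced with a few lines of Python; the positions below are purely illustrative:

# Write the logged display position of each document, one integer per line,
# aligned with the rows of XXX.train (values below are made up).
positions = [0, 1, 2, 0, 1]  # e.g. two queries returning 3 and 2 documents
with open("rank.train.rank", "w") as f:
    for pos in positions:
        f.write(f"{pos}\n")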

Evaluation

Firstly, download the dataset ranked by an initial SVM ranker from HERE and unzip it into the evaluation directory. Alternatively, you can generate it from scratch yourself by referring to the procedure of Qingyao Ai, et al.

Next, generate the synthetic dataset from click models by:

cd evaluation
mkdir test_data
cd scripts
python generate_data.py ../click_model/user_browsing_model_0.1_1_4_1.json

There are also other click-model configurations in evaluation/click_model/; you can use any of them.
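
Schematically, these click models turn relevance judgments into position-biased clicks: a document is clicked only if its position is examined and the document is attractive. A minimal sketch under a simple examination hypothesis (the decay and probabilities are illustrative, not the repo's settings):

import numpy as np

rng = np.random.default_rng(0)
examine_prob = 1.0 / np.arange(1, 11, dtype=float)  # examination decays with position

def simulate_clicks(relevance):
    # relevance: per-position click probability given examination, in [0, 1]
    relevance = np.asarray(relevance, dtype=float)
    examined = rng.random(relevance.size) < examine_prob[:relevance.size]
    return (examined & (rng.random(relevance.size) < relevance)).astype(int)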

Finally, move the compiled lightgbm file into evaluation/configs, and then run:

./lightgbm config='train.conf'
./lightgbm config='test.conf'

This generates the test results (LightGBM_predict_result.txt) based on the synthetic click data. Next, we evaluate them against the real data:

cd ../scripts
python eval.py ../configs/LightGBM_predict_result.txt  #or any other model output.
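
eval.py scores the model output against the true relevance labels. For reference, NDCG@10 (the number quoted in the issues below) can be computed as follows; whether eval.py uses exactly this exponential-gain variant is an assumption on our part:

import numpy as np

def ndcg_at_k(relevance_in_predicted_order, k=10):
    rel = np.asarray(relevance_in_predicted_order, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
    dcg = np.sum((2.0 ** rel - 1.0) * discounts)
    ideal = np.sort(np.asarray(relevance_in_predicted_order, dtype=float))[::-1][:k]
    idcg = np.sum((2.0 ** ideal - 1.0) * discounts)
    return dcg / idcg if idcg > 0 else 0.0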

Citation

Please consider citing the following paper when using our code for your application.

@inproceedings{unbias_lambdamart,
  title={Unbiased LambdaMART: An Unbiased Pairwise Learning-to-Rank Algorithm},
  author={Hu, Ziniu and Wang, Yang and Peng, Qu and Li, Hang},
  booktitle={Proceedings of the 2019 World Wide Web Conference},
  year={2019}
}

unbiased_lambdamart's People

Contributors

acbull, hbghhy, paddy74


unbiased_lambdamart's Issues

Recommend using a submodule + fork for Unbias_LightGBM

Because Unbias_LightGBM is effectively a fork of LightGBM, it would be sensible to create a fork of LightGBM with the necessary changes, renamed to Unbias_LightGBM, and add that fork as a submodule to this project. The new fork would initially be pinned to the commit of LightGBM that Unbias_LightGBM is based upon.

This would allow updates and bug fixes to LightGBM to be easily incorporated into this project, and would make it clearer that Unbias_LightGBM is LightGBM with modifications.

Add LICENSE.md to project root

As a publicly viewable project, Unbiased_LambdaMart should include a LICENSE.md file in order to convey explicitly how this project may be used.
As an unlicensed repository, the only rights granted to other users are to view and fork the repository. This is not in line with any desire for this work to be used in other projects.

From GitHub's licensing help page:

You're under no obligation to choose a license. It's your right not to
include one with your code or project, but please be aware of the
implications. Generally speaking, the absence of a license means that
the default copyright laws apply. This means that you retain all
rights to your source code and that nobody else may reproduce,
distribute, or create derivative works from your work. This might not
be what you intend.

Even if this is what you intend, if you publish your source code in a
public repository on GitHub, you have accepted the Terms of Service
which do allow other GitHub users some rights. Specifically, you allow
others to view and fork your repository.

If you want to share your work with others, we strongly encourage you
to include an open source license.

I would strongly recommend the MIT license to give this project the widest availability to other researchers; the BSD 3-clause license if you seek protections regarding promotion and advertising material; or Apache 2.0 if you want MIT with more words.

It also appears at first glance that the project is licensed under MIT, as that is the license included in the Unbias_LightGBM directory. However, this is not entirely clear, as that is also the license that ships with LightGBM.

Question about `position_bins`

Hi,

I'm trying to understand the implementation differences between this repo and lightgbm. Does

position_bins = 12       : this denotes the maximum positions taken into account.

effectively serve the same purpose as lambdarank_truncation_level in newer releases of LightGBM? It looks like each caps the number of results considered for a given query. I wanted to confirm.
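
Schematically, both settings bound how deep in the ranked list pairwise updates are formed; whether the two parameters have exactly the same semantics is the open question here. An illustrative reading in Python (labels and the cap value are hypothetical):

# Pairs are only formed when the better-ranked document sits above the cap.
labels = [2, 0, 1, 0, 0, 1]   # relevance labels for one query, in ranked order
cap = 12                      # cf. position_bins / lambdarank_truncation_level
pairs = [(i, j)
         for i in range(min(cap, len(labels)))
         for j in range(len(labels))
         if labels[i] > labels[j]]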

segmentation fault

I am getting a segmentation fault when running your version of lightgbm with the default train/test sets as provided at: https://github.com/Microsoft/LightGBM/tree/master/examples/lambdarank

 ./lightgbm config=train.conf
[LightGBM] [Info] Finished loading parameters
num_threads_: 8
[LightGBM] [Info] Loading query boundaries...
[LightGBM] [Info] Loading query boundaries...
[LightGBM] [Info] Finished loading data in 0.038245 seconds

  position         bias_i         bias_j         i_cost         j_cost
         0              1              1              0              0
         1              1              1              0              0
         2              1              1              0              0
         3              1              1              0              0
         4              1              1              0              0
         5              1              1              0              0
         6              1              1              0              0
         7              1              1              0              0
         8              1              1              0              0
         9              1              1              0              0
        10              1              1              0              0
        11              1              1              0              0
[LightGBM] [Info] Total Bins 6177
[LightGBM] [Info] Number of data: 3005, number of used features: 211
[LightGBM] [Info] Finished initializing training
[LightGBM] [Info] Started training...
[1]    6611 segmentation fault (core dumped)  ./lightgbm config=train.conf

How to tune hyperparameter when use the lambdamart example

I have read the paper and the train.conf file in the lambdarank example. There seem to be hyperparameters such as p and M in the paper, but I cannot find them in train.conf. Am I missing something?

I also want to use the library on a large-scale dataset. Training LightGBM M times will cost a lot of time. Could we instead train a few weak models at the beginning to estimate the position-bias parameters, and then re-train once with a more complex model? Did you try that?
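
For context, the paper's procedure alternates between fitting the ranker with debiased lambdas and re-estimating the position-bias factors; M counts the outer iterations and p regularizes the bias update. A hypothetical, heavily simplified sketch of that loop (in this repo both steps run inside the modified C++ objective, which may be why neither surfaces as a train.conf key):

import numpy as np

rng = np.random.default_rng(0)

def fit_weak_ranker(t_plus, t_minus):
    # Placeholder for a few boosting rounds with propensity-weighted lambdas
    return rng.random(10)

def estimate_propensities(scores, p):
    cost = np.abs(scores) + 1e-12
    return (cost / cost[0]) ** (1.0 / (p + 1.0))  # normalize to position 0

M, p = 5, 0.0                     # the paper's M (outer iterations) and p
t_plus = t_minus = np.ones(10)
for _ in range(M):
    scores = fit_weak_ranker(t_plus, t_minus)
    t_plus = estimate_propensities(scores, p)
    t_minus = estimate_propensities(scores, p)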

Unbiased LambdaMART NDCG is lower than original LambdaMART

I did the following steps:

  1. Download the generated dataset.
  2. Run python evaluation/scripts/generate_data.py evaluation/click_model/user_browsing_model_0.1_1_4_1.json to generate the train and test data.
  3. Run ./lightgbm config="train.conf" and get test NDCG@10 = 0.546817.
  4. Build the original LightGBM, version 2.1.1, the same version Unbiased_LambdaMart is based on.
  5. Run ./lightgbm config="train.conf" and get test NDCG@10 = 0.556632.
  6. Why is the Unbiased LambdaMART NDCG lower than the original LambdaMART's? The paper says Unbiased LambdaMART is better than the original.

Want to use PythonAPI to train and predict.

Hi,

I wanted to use the code in a similar way to standard LightGBM, where I just import lightgbm and use LGBMRanker to train, make predictions, etc.

I am currently using the Unbias_LightGBM/examples/lambdarank/train.conf file to train.
Can you please guide me on how I can do the same with this repo?
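
For reference, upstream LightGBM's Python API trains a ranker as in the sketch below; to reach the unbiased objective this way, you would presumably have to build LightGBM's python-package from this repo's modified sources, which we have not verified:

import numpy as np
import lightgbm as lgb

X = np.random.rand(100, 5)               # toy features
y = np.random.randint(0, 3, size=100)    # toy relevance labels
group = [10] * 10                        # ten queries of ten documents each

ranker = lgb.LGBMRanker(objective="lambdarank", n_estimators=50)
ranker.fit(X, y, group=group)
scores = ranker.predict(X)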

Question about reading "XXX.train" file

Hello!
I am now trying to train on a dataset containing NaN values. However, the provided samples are in libSVM format, and libSVM only allows numerical values, not NaN. I therefore converted the data to a ".npy" file to try. Does the provided code support reading other file formats, such as ".npy" or ".csv"? If it does, I would like some details about it.
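
One possible route (our suggestion, not a documented feature of this repo) is to impute the NaNs and dump the arrays back to the libSVM format that the example configs expect, e.g. with scikit-learn; the file names below are hypothetical:

import numpy as np
from sklearn.datasets import dump_svmlight_file

X = np.load("features.npy")        # hypothetical input files
y = np.load("labels.npy")
X = np.nan_to_num(X, nan=0.0)      # libSVM cannot represent NaN; impute first
dump_svmlight_file(X, y, "converted.train", zero_based=False)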

Why is sigma = 2?

Just a question about the implementation of Unbiased LambdaMART: why is the sigma coefficient 2? Is this related to numerical stability? The paper states "sigma is a constant with default value of 2" (Section 4.3) but doesn't give a reason. Most implementations of the LambdaRank gradient default to sigma = 1, including LightGBM. I'm wondering what benefits drove you to pick sigma = 2 rather than 1.

Thank you!
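
For reference, sigma enters the standard RankNet/LambdaRank pairwise gradient as a steepness constant on the score difference, so changing it rescales the gradients rather than changing what is optimal:

import numpy as np

def pairwise_gradient(s_i, s_j, sigma=2.0):
    # lambda_ij for a pair where document i should rank above document j
    return -sigma / (1.0 + np.exp(sigma * (s_i - s_j)))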

AppleClang not supported -- Setup on Mac.

I am trying to set up this repo on macOS Mojave 10.14.4. When I make the build directory and run 'cmake ..' from it, it shows me the following error:

CMake Error at CMakeLists.txt:27 (message):
AppleClang wasn't supported. Please see
https://github.com/Microsoft/LightGBM/blob/master/docs/Installation-Guide.rst#macos

-- Configuring incomplete, errors occurred!

I went to the above link and installed cmake and libomp through brew. I copied the cmake command given at that link, and it still showed me the same error as before.

When I try to install LightGBM through the same process, it works seamlessly.

How do I solve this issue and set up this repo?

Jupyter notebook example

Could you let me know how to use it in a Jupyter notebook?

Should I add something to the lightgbm package?

Can you add an example?
