Open sourced implementation of this paper: Exploiting Result Diversification Methods for Feature Selection in Learning to Rank
The following 5 feature selection methods for LTR have been implemented:
- MMR (Maximal Marginal Relevance)
- MSD (Maximum Sum Dispersion)
- MPT (Modern Portfolio Theory)
- GAS (Greedy Search Algorithm)
- TopK
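To give a flavour of how these diversification-style selectors work, here is a minimal, hypothetical sketch of MMR-style greedy selection. It is not the repo's implementation; it assumes you already have a per-feature importance vector and a pairwise feature-similarity matrix, both of which this repo computes for you.

```python
import numpy as np

def mmr_select(importance, similarity, k, balance=0.5):
    """Illustrative MMR-style greedy selection (not the repo's code).

    importance: 1-D array of per-feature importance scores (e.g. per-feature NDCG/MAP).
    similarity: symmetric matrix of pairwise feature similarities.
    balance:    trade-off in [0, 1] between importance and diversity.
    """
    selected = [int(np.argmax(importance))]        # start with the single best feature
    while len(selected) < k:
        best, best_score = None, float("-inf")
        for f in range(len(importance)):
            if f in selected:
                continue
            # penalise features that are too similar to anything already chosen
            redundancy = max(similarity[f][s] for s in selected)
            score = balance * importance[f] - (1 - balance) * redundancy
            if score > best_score:
                best, best_score = f, score
        selected.append(best)
    return selected
```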
I will assume that you have Python 2.7 installed. The remaining requirements are listed in requirements.txt:
pip install -r requirements.txt
You will also need an appropriate binary of trec_eval available in your PATH. The following steps might be helpful:
git clone https://github.com/usnistgov/trec_eval.git
cd trec_eval
make # this should generate the trec_eval binary
sudo chmod +x trec_eval
sudo cp trec_eval /bin/
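If you want to double-check from Python that the binary is reachable, a tiny sanity check like the following will do (purely illustrative, not part of the repo):

```python
import os
import subprocess

try:
    with open(os.devnull, "w") as devnull:
        # trec_eval with no arguments just prints its usage and exits
        # non-zero, so only an OSError means the binary was not found.
        subprocess.call(["trec_eval"], stdout=devnull, stderr=devnull)
    print("trec_eval found on PATH")
except OSError:
    print("trec_eval not found -- revisit the steps above")
```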
Finally, just clone this repo and check the next section to see how to use it.
git clone https://github.com/HarshTrivedi/feature-selection-for-learning-to-rank.git
I will assume that you have your training and testing data in svm_light format; details of the format are available in the SVM-light documentation. The home directory of this repo contains a sample trainingset.txt and testset.txt.
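As a rough illustration of what one line of such a file looks like and how it could be read, here is a small sketch assuming the usual LETOR-style svm_light layout (this is not code from the repo):

```python
# A line typically looks like:
#   2 qid:10 1:0.03 2:0.00 3:1.00 ... # optional comment
def parse_svmlight_line(line):
    line = line.split("#")[0].strip()       # drop any trailing comment
    tokens = line.split()
    relevance = int(tokens[0])               # graded relevance label
    qid = tokens[1].split(":")[1]            # query id
    features = {}
    for token in tokens[2:]:
        index, value = token.split(":")
        features[int(index)] = float(value)
    return relevance, qid, features
```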
So you will need to replace the following files with your own:
- trainingset.txt
- testset.txt
- qrels.qrel
Once the input files are replaced, you need to set some configurations for the run. They can be set in the config.py file. The following parameters are tunable:
# example configurations
total_number_of_features = 45
n = [10, 15, 20, 25, 30]
methods = ["msd", "mpt", "mmr", "gas", "topk" ]
balancing_factors = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
measure = "ndcg" # ndcg or map
Config Details:
- total_number_of_features: total number of features in your dataset.
- n: list of numbers of features to be selected.
- methods: list of methods to use in this run.
- balancing_factors: each of the first 4 methods uses a balancing factor in [0, 1], which balances the importance of the selected features against the diversity among them; this is the list of balancing factors to try.
- measure: the quantity used for measuring feature importance (ndcg or map).
With this configuration, you will get a list of selected features for each method in methods, using each balancing factor from balancing_factors, and for each k in n.
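Concretely, the run grid is the cross product of these lists. Here is a hypothetical sketch of the expansion (not the repo's exact code); note that topk has no relevance/diversity trade-off, so the balancing factor only matters for the first four methods:

```python
import itertools

methods = ["msd", "mpt", "mmr", "gas", "topk"]
balancing_factors = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
n = [10, 15, 20, 25, 30]

# One feature-selection run per (method, balancing_factor, k) combination.
for method, bf, k in itertools.product(methods, balancing_factors, n):
    print("%s  balance=%.1f  k=%d" % (method, bf, k))
```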
Once this is done, the following scripts need to be run sequentially:
1. preprocess.py
2. select_features.py
3. generate_examples_from_generated_features.py
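If you prefer to drive the three steps from a single script, a small wrapper along these lines works (an assumed convenience, not something shipped with the repo):

```python
import subprocess

# Run the three pipeline stages in order; stop immediately if one fails.
for script in ["preprocess.py",
               "select_features.py",
               "generate_examples_from_generated_features.py"]:
    print("running %s ..." % script)
    subprocess.check_call(["python", script])
```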
As a result, you will have the following 5 additional directories:
- features_selected
- rankings
- query_wise_rankings
- ranking_scores
- feature_selected_example_files
You would mainly be interested in 2 directories: features_selected and feature_selected_example_files. The former has the list of features selected for each combination of configurations from config.py. The latter has various versions of trainingset.txt and testset.txt (containing only the selected features), each corresponding to one of those configuration combinations.
- features_selected/ contains files in this format: <method_name>_<balancing_factor>.ranking. For example: gas_0.1.ranking
- Each of these .ranking files has feature numbers one per line. So if k features are to be selected in that configuration, the top k feature numbers from that file will be picked (see the sketch after this list).
- feature_selected_example_files/ contains files in this format: <method_name>_<k>_<balancing_factor>_<training_examples|testing_examples>.dat. For example, mmr_20_0.5_training_examples.dat.
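For instance, reading the k features chosen by a given configuration out of a .ranking file could look like this (a small sketch assuming the line-wise layout described above):

```python
def top_k_features(ranking_path, k):
    # Each .ranking file lists feature numbers one per line, best first;
    # the first k lines are the features chosen for that configuration.
    with open(ranking_path) as f:
        return [int(line.strip()) for line in f if line.strip()][:k]

# e.g. the 20 features picked by GAS with balancing factor 0.1
selected = top_k_features("features_selected/gas_0.1.ranking", 20)
print(selected)
```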
If you use this in your research, please cite the above paper using:
@inproceedings{naini2014exploiting,
title={Exploiting result diversification methods for feature selection in learning to rank},
author={Naini, Kaweh Djafari and Altingovde, Ismail Sengor},
booktitle={European Conference on Information Retrieval},
pages={455--461},
year={2014},
organization={Springer}
}
Please note: I am NOT one of the authors of the paper, so you may want to verify the implementation yourself!
In case of any problem, please contact me at: [email protected]
Hope you find it useful : )