
allegro / allrank

829 stars · 28 watchers · 114 forks · 129 KB

allRank is a framework for training learning-to-rank neural models based on PyTorch.

License: Apache License 2.0

Dockerfile 0.31% Makefile 0.26% Python 98.30% Shell 1.13%
learning-to-rank ndcg ranking information-retrieval pytorch python machine-learning deep-learning transformer click-model

allrank's Introduction

allRank : Learning to Rank in PyTorch

About

allRank is a PyTorch-based framework for training neural Learning-to-Rank (LTR) models, featuring implementations of:

  • common pointwise, pairwise and listwise loss functions
  • fully connected and Transformer-like scoring functions
  • commonly used evaluation metrics like Normalized Discounted Cumulative Gain (NDCG) and Mean Reciprocal Rank (MRR)
  • click-models for experiments on simulated click-through data

Motivation

allRank provides an easy and flexible way to experiment with various LTR neural network models and loss functions. It is easy to add a custom loss, and to configure the model and the training procedure. We hope that allRank will facilitate both research in neural LTR and its industrial applications.

Features

Implemented loss functions:

  1. ListNet (for binary and graded relevance)
  2. ListMLE
  3. RankNet
  4. Ordinal loss
  5. LambdaRank
  6. LambdaLoss
  7. ApproxNDCG
  8. RMSE
  9. NeuralNDCG (introduced in https://arxiv.org/pdf/2102.07831)

Getting started guide

To help you get started, we provide a run_example.sh script which generates dummy ranking data in libsvm format and trains a Transformer model on the data using the provided example config.json file. Once you run the script, the dummy data can be found in the dummy_data directory and the results of the experiment in the test_run directory. Docker is required to run the example.

Getting the right architecture version (GPU vs CPU-only)

Since the torch binaries differ for GPU and CPU (and the GPU version does not work on CPU-only machines), you must select and build the appropriate Docker image version.

To do so, pass gpu or cpu as the arch_version build-arg:

docker build --build-arg arch_version=${ARCH_VERSION}

When calling run_example.sh, you can select the proper version with the first command-line argument, e.g.

run_example.sh gpu ...

with cpu being the default if not specified.
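
For example, to build the CPU-only image (the -t allrank tag and the trailing build context mirror the build command that appears in the issues further down this page):

docker build --build-arg arch_version=cpu -t allrank .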

Configuring your model & training

To train your own model, configure your experiment in a config.json file and run

python allrank/main.py --config_file_name allrank/config.json --run_id <the_name_of_your_experiment> --job_dir <the_place_to_save_results>

All the hyperparameters of the training procedure (model definition, data location, loss and metrics used, training hyperparameters, etc.) are controlled by the config.json file. We provide a template file, config_template.json, where the supported attributes, their meaning, and possible values are explained. Note that, following the MSLR-WEB30K convention, your libsvm file with training data should be named train.txt. You can specify the name of the validation dataset (e.g. valid or test) in the config. Results will be saved under the path <job_dir>/results/<run_id>.
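
For orientation, below is one complete config, condensed from the MSLR-WEB30K example that appears in the issues further down this page (ordinal loss with a Transformer scoring function); treat it as an illustration rather than a recommended setup:

{
    "model": {
        "fc_model": {"sizes": [144], "input_norm": "True", "activation": null, "dropout": 0.0},
        "transformer": {"N": 4, "d_ff": 512, "h": 2, "positional_encoding": null, "dropout": 0.4},
        "post_model": {"output_activation": "Sigmoid", "d_output": 4}
    },
    "data": {"path": "./Fold1/", "validation_ds_role": "vali", "num_workers": 16, "batch_size": 64, "slate_length": 240},
    "optimizer": {"name": "Adam", "args": {"lr": 0.001}},
    "lr_scheduler": {"name": "StepLR", "args": {"step_size": 50}},
    "training": {"epochs": 100, "gradient_clipping_norm": [], "early_stopping_patience": 10},
    "metrics": ["ndcg_10"],
    "loss": {"name": "ordinal", "args": {"n": 4}},
    "val_metric": "ndcg_10",
    "detect_anomaly": "True"
}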

Google Cloud Storage is supported in allRank as a place for data and job results.

Implementing custom loss functions

To experiment with your own custom loss, you need to implement a function that takes two tensors (model prediction and ground truth) as input, put it in the losses package, and make sure it is exposed on the package level. To use it in training, simply pass the name (and args, if your loss has any hyperparameters) of your function in the correct place in the config file:

"loss": {
    "name": "yourLoss",
    "args": {
        "arg1": val1,
        "arg2: val2
    }
  }
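
For illustration, here is a minimal sketch of such a custom loss. The name yourLoss and the args mirror the config snippet above; the body is just a placeholder, not an allRank implementation, and a real loss should also handle the padding value (the bundled losses expose a padded_value_indicator argument for this):

import torch

def yourLoss(y_pred, y_true, arg1=1.0, arg2=0.0):
    """
    Placeholder listwise loss: weighted MSE between predicted scores and relevance labels.
    :param y_pred: model predictions, shape [batch_size, slate_length]
    :param y_true: ground-truth relevance labels, same shape
    """
    return arg1 * torch.mean((y_pred - y_true) ** 2) + arg2

Drop the function into the losses package (e.g. a module under allrank/models/losses re-exported from its __init__.py; the exact layout may differ in your version) so that it is visible at the package level.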

Applying click-model

To apply a click model, you first need a trained allRank model. Then run:

python allrank/rank_and_click.py --input-model-path <path_to_the_model_weights_file> --roles <comma_separated_list_of_ds_roles_to_process e.g. train,valid> --config_file_name allrank/config.json --run_id <the_name_of_your_experiment> --job_dir <the_place_to_save_results>

The model will be used to rank all slates from the dataset specified in the config. Next, the click model configured in the config will be applied, and the resulting click-through dataset will be written under <job_dir>/results/<run_id> in libSVM format. The path to the results directory may then be used as an input for another allRank model training.

Continuous integration

You should run scripts/ci.sh to verify that code passes style guidelines and unit tests.

Research

This framework was developed to support the research project Context-Aware Learning to Rank with Self-Attention. If you use allRank in your research, please cite:

@article{Pobrotyn2020ContextAwareLT,
  title={Context-Aware Learning to Rank with Self-Attention},
  author={Przemyslaw Pobrotyn and Tomasz Bartczak and Mikolaj Synowiec and Radoslaw Bialobrzeski and Jaroslaw Bojar},
  journal={ArXiv},
  year={2020},
  volume={abs/2005.10084}
}

Additionally, if you use the NeuralNDCG loss function, please cite the corresponding work, NeuralNDCG: Direct Optimisation of a Ranking Metric via Differentiable Relaxation of Sorting:

@article{Pobrotyn2021NeuralNDCG,
  title={NeuralNDCG: Direct Optimisation of a Ranking Metric via Differentiable Relaxation of Sorting},
  author={Przemyslaw Pobrotyn and Radoslaw Bialobrzeski},
  journal={ArXiv},
  year={2021},
  volume={abs/2102.07831}
}

License

Apache 2 License

allrank's People

Contributors

allegro-bot, deejay1, kretes, lidalida, przemekpobrotyn, sadaharu-inugami


allrank's Issues

the models folder and predictions folder are empty!

Hello there,
Thank you for the package.
I managed to run it successfully, but I am wondering how I can see the prediction results on a test set. Assuming I have trained a model, I now want to load it and see the actual predictions on a test set, as well as the metrics.
Now, in the output folder, the models folder and predictions folder are empty! However, I have the logfile that shows a successful run.

I can explore the code myself, but I thought you might already know the reason.

How to get prediction score

Similar question as this in this post #59.

That post uses __rank_slates to obtain predicted ranks for y, but that doesn't seem correct to me. When you look at the function, it just returns the original y vector after it has been reordered (reranked), not the predicted ranks.

Is there a function that returns the predicted ranks? Or can we get a predicted score and then generate the ranks ourselves?
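
Not an official API, but a minimal sketch of how one might pull raw per-document scores out of a trained model, based on what __rank_slates appears to do internally. It assumes a trained model, a dataloader val_dl, and a device dev as in the training script; the (X, y, indices) batch layout, the model.score(X, mask, indices) method, and the PADDED_Y_VALUE import path are assumptions to verify against your version:

import torch
from allrank.data.dataset_loading import PADDED_Y_VALUE  # assumed location of the -1 padding constant

model.eval()
with torch.no_grad():
    for xb, yb, indices in val_dl:                       # dataloader assumed to yield (X, y, indices)
        X = xb.type(torch.float32).to(dev)
        y_true = yb.to(dev)
        mask = (y_true == PADDED_Y_VALUE)                # True on padded slots of short slates
        scores = model.score(X, mask, indices.to(dev))   # raw scores, shape [batch_size, slate_length]
        order = scores.argsort(descending=True, dim=-1)  # per-slate document order, best first

Sorting each row of scores (or taking argsort as above) would then give per-document rankings rather than the reranked labels that __rank_slates returns.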

Question about mrr calculation

Hi everyone, thanks for releasing a great tool.
I have a couple of questions regarding the MRR calculation.

The first one is why do you need cloning in this place:

y_true = y_true.clone()
y_pred = y_pred.clone()

The second question is about padding_indicator.
Could you elaborate or provide any examples of using this parameter?
I'm not very familiar with ranking systems, and I couldn't find any reference on the web yet.

How to produce predictions?

I have created a minimal reproducible example that purposefully has a perfect relationship between the feature used for prediction and the corresponding label, so we can test whether the algorithm works correctly (indeed, for large enough data I get ndcg=1.0, so it appears to work correctly):

For illustrative purposes, I use a small dataset:

import numpy as np
import pandas as pd

num_qid = 5
num_obs_per_qid =5
numRows = num_qid * num_obs_per_qid
num_ranks = 5


df = pd.DataFrame({
    "qid":[i for i in range(num_qid) for j in range(num_obs_per_qid)],
    "uniqueID":num_qid*list(range(num_obs_per_qid)),
    "feature":np.random.random(size=(numRows,))
})

#df['label'] = pd.qcut(df["feature"], q=5, labels=False, precision=0, duplicates='raise')
df['label'] = df.groupby("qid")["feature"].apply(lambda x: pd.qcut(x, q=num_ranks, labels=False, precision=0, duplicates='raise'))

train_rows = round(0.8 * num_qid)
vali_rows  = round(0.9 * num_qid)

train = df[df['qid']<=train_rows]
vali  = df[(df['qid']>train_rows)&(df['qid']<=vali_rows)]
test  = df[(df['qid']>vali_rows)]

I use the code below to produce predictions, but cannot make sense of slates_X and slates_y. I don't understand what those are and how to merge them with my test df.



from sklearn.datasets import dump_svmlight_file

def df_to_libsvm(df: pd.DataFrame, folderName, fileName):
    x = df[['feature']]
    y = df['label']
    query_id  = df['qid']
    dump_svmlight_file(X=x, y=y, query_id= query_id, f=f'{folderName}/{fileName}.txt', zero_based=True)


df_to_libsvm(train, 'train_data', 'train')
df_to_libsvm(vali, 'train_data', 'vali')
df_to_libsvm(test, 'test_data', 'test')
df_to_libsvm(test, 'test_data', 'vali')


from argparse import ArgumentParser

parser = ArgumentParser("allRank")

parser.add_argument("--job-dir", help="Base output path for all experiments", required=False, default = "test_run")

parser.add_argument("--run-id", help="Name of this run to be recorded (must be unique within output dir)", required=False, default = "test_run")

parser.add_argument("--config-file-name", type=str, help="Name of json file with config", required=False, default = "../scripts//local_config.json")

# 'args=[]' is needed here so argparse parses an empty list instead of sys.argv (e.g. in a notebook)
args = parser.parse_args(args=[])
paths = PathsContainer.from_args(args.job_dir, args.run_id, args.config_file_name)
# reproducibility
torch.manual_seed(42)
torch.cuda.manual_seed_all(42)
np.random.seed(42)
create_output_dirs(paths.output_dir)
logger = init_logger(paths.output_dir)
# logger.info(f"created paths container {paths}")

# read config
config = Config.from_json(paths.config_path)
logger.info("Config:\n {}".format(pformat(vars(config), width=1)))

output_config_path = os.path.join(paths.output_dir, "used_config.json")

# Note: 'cp' is a Unix/Linux command; on Windows use 'copy' instead
execute_command("cp {} {}".format(paths.config_path, output_config_path))

# had to prepend '..' to this path; couldn't set it in the local_config.json file directly (originally config.data.path = '../allrank/dummy_data')
config.data.path = '../allrank/train_data'
# train_ds, val_ds
train_ds, val_ds = load_libsvm_dataset(
    input_path=config.data.path,
    slate_length=config.data.slate_length,
    validation_ds_role=config.data.validation_ds_role,
)

n_features = train_ds.shape[-1]
assert n_features == val_ds.shape[-1], "Last dimensions of train_ds and val_ds do not match!"

# train_dl, val_dl
train_dl, val_dl = create_data_loaders(
    train_ds, val_ds, num_workers=config.data.num_workers, batch_size=config.data.batch_size)

# gpu support
dev = get_torch_device()
logger.info("Model training will execute on {}".format(dev.type))

# instantiate model
model = make_model(n_features=n_features, **asdict(config.model, recurse=False))
if torch.cuda.device_count() > 1:
    model = CustomDataParallel(model)
    logger.info("Model training will be distributed to {} GPUs.".format(torch.cuda.device_count()))
model.to(dev)

# load optimizer, loss and LR scheduler
optimizer = getattr(optim, config.optimizer.name)(params=model.parameters(), **config.optimizer.args)
loss_func = partial(getattr(losses, config.loss.name), **config.loss.args)
if config.lr_scheduler.name:
    scheduler = getattr(optim.lr_scheduler, config.lr_scheduler.name)(optimizer, **config.lr_scheduler.args)
else:
    scheduler = None

with torch.autograd.detect_anomaly() if config.detect_anomaly else dummy_context_mgr():  # type: ignore
    # run training
    result = fit(
        model=model,
        loss_func=loss_func,
        optimizer=optimizer,
        scheduler=scheduler,
        train_dl=train_dl,
        valid_dl=val_dl,
        config=config,
        device=dev,
        output_dir=paths.output_dir,
        tensorboard_output_path=paths.tensorboard_output_path,
        **asdict(config.training)
    )
# had to prepend '..' to this path; couldn't set it in the local_config.json file directly (originally config.data.path = '../allrank/dummy_data')
config.data.path = '../allrank/test_data'
# test_ds, val_ds
test_ds, val_ds = load_libsvm_dataset(
    input_path=config.data.path,
    slate_length=config.data.slate_length,
    validation_ds_role=config.data.validation_ds_role,
    name_of_file= "test"
)
test_dl, val_dl = create_data_loaders(test_ds, val_ds, num_workers=config.data.num_workers, batch_size=config.data.batch_size)

slates_X, slates_y = __rank_slates(test_dl, model)

Question about positional encoding

Can you provide some guidance on positional encoding? What should I put for 'positional_encoding' in the config file, and how can I include the original ranking as part of the input data? Also, can you provide some explanation of 'fixed positional encoding' and 'learnable positional encoding'? Many thanks!

something wrong in run_example.sh

The --output and --config-file-name arguments used in run_example.sh are wrong:

old version: docker run -e PYTHONPATH=/allrank -v $PROJECT_DIR:/allrank allrank:latest /bin/sh -c 'python allrank/data/generate_dummy_data.py && python allrank/main.py --config-file-name allrank/config.json --run-id test_run --output /allrank/test_run'

new version: docker run -e PYTHONPATH=/allrank -v $PROJECT_DIR:/allrank allrank:latest /bin/sh -c 'python allrank/data/generate_dummy_data.py && python allrank/main.py --config-file-name ./scripts/local_config.json --run-id test_run --job-dir /allrank/test_run'

How to get predictions for each observation or row in my data.

I would like to obtain the predicted rank of each observation (row) in my dataset (val_dl). My dataset has 221,567 rows, 2,958 qids, and 5 features.

When I run:

slates_X, slates_y = __rank_slates(val_dl, model)

the shape of slates_y is:

slates_y.shape
torch.Size([2958, 96])

If I understand this shape correctly, the number of rows of slates_y corresponds to the number of qids in my dataset.

But this does not give me the predicted rank for each row, or does it?

It is also not clear to me what the 96 columns are. Are these maybe the predicted ranks in previous layers?

I have tried this with rank_slates instead (which is just a wrapper around __rank_slates) and got the same result.

How to run "run_example.sh"

I've been trying for several days to run "run_example.sh" but I'm not having any success; can someone help?
I have already installed Docker and tried to create an image and container with the provided Dockerfile, but when I run "./run_example.sh" in the terminal, it opens the code in Visual Studio Code... and doesn't do anything.

Any help is appreciated, thanks in advance!

Docker error when running run_example.sh on Mac M1

I get the following error on running the example script on my Mac M1

executor failed running [/bin/sh -c make -C /allrank install-reqs]: exit code: 2
Unable to find image 'allrank:latest' locally
docker: Error response from daemon: pull access denied for allrank, repository does not exist or may require 'docker login': denied: requested access to the resource is denied.
See 'docker run --help'.

I also have an older MacBook Pro and did not face any such issues when I ran the same file on that. How can I fix this?

gpus > 1

It seems allRank suffers from the issue raised in this link:

https://discuss.pytorch.org/t/solved-keyerror-unexpected-key-module-encoder-embedding-weight-in-state-dict/1686/3

So, instead of directly loading the saved model with:
model.load_state_dict(load_state_dict_from_file('model.pkl', dev))

We have to do the following:

from collections import OrderedDict

state_dict = load_state_dict_from_file('model.pkl', dev)
new_state_dict = OrderedDict()
for k, v in state_dict.items():
    # strip the 'module.' prefix added by DataParallel
    name = k[7:] if 'module.' in k else k
    new_state_dict[name] = v
model.load_state_dict(new_state_dict)

How to produce predictions?

For my use case, I would like to obtain, for each qid, the highest- and lowest-ranked observations, identified by the unique_ID.

I have created a minimal reproducible example that purposefully has a perfect relationship between the feature used for prediction and the corresponding label, so we can test whether the algorithm works correctly (indeed, for large enough data I get ndcg=1.0, so it appears to work correctly).

I have not been able to merge my predicted ranks back to the original dataset in the correct order. The slates_y tensor is not in an order that matches my test_df. Is there any way I can match the slates_y tensor back to test_df in the correct order, i.e. so that each row matches the correct unique_ID in test_df?

For illustrative purposes, I use a small dataset:

import numpy as np
import pandas as pd

num_qid = 10
num_obs_per_qid =10
numRows = num_qid * num_obs_per_qid
num_ranks = 5


df = pd.DataFrame({
    "qid":[i for i in range(num_qid) for j in range(num_obs_per_qid)],
    "uniqueID":num_qid*list(range(num_obs_per_qid)),
    "feature":np.random.random(size=(numRows,))
})

#df['label'] = pd.qcut(df["feature"], q=5, labels=False, precision=0, duplicates='raise')
df['label'] = df.groupby("qid")["feature"].apply(lambda x: pd.qcut(x, q=num_ranks, labels=False, precision=0, duplicates='raise'))

train_rows = round(0.7 * num_qid)
vali_rows  = round(0.8 * num_qid)

train = df[df['qid']<=train_rows]
vali  = df[(df['qid']>train_rows)&(df['qid']<=vali_rows)]
test  = df[(df['qid']>vali_rows)]

I use the code below to produce predictions, but cannot make sense of the order of slates_y, so I am unable to merge it back to test_df in the correct order.



from sklearn.datasets import dump_svmlight_file

def df_to_libsvm(df: pd.DataFrame, folderName, fileName):
    x = df[['feature']]
    y = df['label']
    query_id  = df['qid']
    dump_svmlight_file(X=x, y=y, query_id= query_id, f=f'{folderName}/{fileName}.txt', zero_based=True)


df_to_libsvm(train, 'train_data', 'train')
df_to_libsvm(vali, 'train_data', 'vali')
df_to_libsvm(test, 'test_data', 'test')
df_to_libsvm(test, 'test_data', 'vali')


from argparse import ArgumentParser

parser = ArgumentParser("allRank")

parser.add_argument("--job-dir", help="Base output path for all experiments", required=False, default = "test_run")

parser.add_argument("--run-id", help="Name of this run to be recorded (must be unique within output dir)", required=False, default = "test_run")

parser.add_argument("--config-file-name", type=str, help="Name of json file with config", required=False, default = "../scripts//local_config.json")

# 'args=[]' is needed here so argparse parses an empty list instead of sys.argv (e.g. in a notebook)
args = parser.parse_args(args=[])
paths = PathsContainer.from_args(args.job_dir, args.run_id, args.config_file_name)
# reproducibility
torch.manual_seed(42)
torch.cuda.manual_seed_all(42)
np.random.seed(42)
create_output_dirs(paths.output_dir)
logger = init_logger(paths.output_dir)
# logger.info(f"created paths container {paths}")

# read config
config = Config.from_json(paths.config_path)
logger.info("Config:\n {}".format(pformat(vars(config), width=1)))

output_config_path = os.path.join(paths.output_dir, "used_config.json")

# Note: 'cp' is a Unix/Linux command; on Windows use 'copy' instead
execute_command("cp {} {}".format(paths.config_path, output_config_path))

# had to prepend '..' to this path; couldn't set it in the local_config.json file directly (originally config.data.path = '../allrank/dummy_data')
config.data.path = '../allrank/train_data'
# train_ds, val_ds
train_ds, val_ds = load_libsvm_dataset(
    input_path=config.data.path,
    slate_length=config.data.slate_length,
    validation_ds_role=config.data.validation_ds_role,
)

n_features = train_ds.shape[-1]
assert n_features == val_ds.shape[-1], "Last dimensions of train_ds and val_ds do not match!"

# train_dl, val_dl
train_dl, val_dl = create_data_loaders(
    train_ds, val_ds, num_workers=config.data.num_workers, batch_size=config.data.batch_size)

# gpu support
dev = get_torch_device()
logger.info("Model training will execute on {}".format(dev.type))

# instantiate model
model = make_model(n_features=n_features, **asdict(config.model, recurse=False))
if torch.cuda.device_count() > 1:
    model = CustomDataParallel(model)
    logger.info("Model training will be distributed to {} GPUs.".format(torch.cuda.device_count()))
model.to(dev)

# load optimizer, loss and LR scheduler
optimizer = getattr(optim, config.optimizer.name)(params=model.parameters(), **config.optimizer.args)
loss_func = partial(getattr(losses, config.loss.name), **config.loss.args)
if config.lr_scheduler.name:
    scheduler = getattr(optim.lr_scheduler, config.lr_scheduler.name)(optimizer, **config.lr_scheduler.args)
else:
    scheduler = None

with torch.autograd.detect_anomaly() if config.detect_anomaly else dummy_context_mgr():  # type: ignore
    # run training
    result = fit(
        model=model,
        loss_func=loss_func,
        optimizer=optimizer,
        scheduler=scheduler,
        train_dl=train_dl,
        valid_dl=val_dl,
        config=config,
        device=dev,
        output_dir=paths.output_dir,
        tensorboard_output_path=paths.tensorboard_output_path,
        **asdict(config.training)
    )
# had to prepend '..' to this path; couldn't set it in the local_config.json file directly (originally config.data.path = '../allrank/dummy_data')
config.data.path = '../allrank/test_data'
# test_ds, val_ds
test_ds, val_ds = load_libsvm_dataset(
    input_path=config.data.path,
    slate_length=config.data.slate_length,
    validation_ds_role=config.data.validation_ds_role,
    name_of_file= "test"
)
test_dl, val_dl = create_data_loaders(test_ds, val_ds, num_workers=config.data.num_workers, batch_size=config.data.batch_size)

slates_X, slates_y = __rank_slates(test_dl, model)

My experiments consistently underperform in comparison to the paper's reported results.

Although I followed the exact settings outlined in the reproducibility file, my experiments consistently yield inferior results compared to those reported in the paper. Any suggestions, recommendations, or additional details I might have overlooked?

WEB30K — results reported in the paper

Loss           Self-attention                            MLP
               NDCG@5        NDCG@10       NDCG@30       NDCG@5        NDCG@10       NDCG@30
NDCGLoss 2++   52.65±0.37    54.49±0.27    59.80±0.08    49.15±0.44    51.22±0.34    57.14±0.23
LambdaRank     52.29±0.31    54.08±0.19    59.48±0.12    48.77±0.38    50.85±0.28    56.72±0.17

WEB30K — reproduced results

Loss           Self-attention                               MLP
               NDCG@5         NDCG@10        NDCG@30        NDCG@5         NDCG@10        NDCG@30
NDCGLoss 2++   48.825±0.025   50.587±0.062   56.473±0.012   48.084±0.118   49.623±0.106   55.497±0.108
LambdaRank     48.015±0.351   49.602±0.147   55.466±0.180   41.739±0.341   43.562±0.159   49.656±0.134

What's the expected input to the losses and metrics?

E.g. for approxNDCGLoss, what is y_pred and y_true?

  • what is the range of the inputs?
  • does a slate have to be in order? e.g. best ranked first?

From what I gathered:

  • values should be in [0, ...)
  • higher values mean better rank
  • the order does not matter

Is that correct? And do all losses and metrics follow this API?
Thanks!
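
For what it's worth, a minimal sketch of how the inputs are commonly laid out (extra arguments left at their defaults; this is an illustration, not an authoritative statement of the API contract):

import torch
import allrank.models.losses as losses

# one slate of three documents, shape [batch_size, slate_length]
y_true = torch.tensor([[2.0, 0.0, 1.0]])                       # graded relevance, higher = more relevant
y_pred = torch.tensor([[0.9, -0.3, 0.4]], requires_grad=True)  # raw model scores for the same documents

loss = losses.approxNDCGLoss(y_pred, y_true)  # documents may appear in any order within the slate
loss.backward()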

question about the version

Did you use version 1.4.3 in the paper Context-Aware Learning to Rank with Self-Attention? When I try it, I get a score of about 52.8, which is a little higher than the 52.37 in the paper.

About the padding value

Hello,

I noticed that the padding value in this code/repository is set to -1, and I was curious about the reason behind this choice. Could someone kindly explain why -1 was chosen as the padding value, and what we should pay attention to when working with this particular padding value?

I appreciate any insights you can provide. Thank you!

Getting started without Docker

Is it possible to get started without Docker (I have technical issues with it)? Is it possible to install it as a pip package? A lot of the imports require the library to be properly installed.

I have cloned the repo and run the usual:

pip install -r requirements.txt
python setup.py install

But the imports are still not working:

import allrank.models.losses as losses

OSError: [WinError 127] The specified procedure could not be found. Error loading "C:\Users\cramk\anaconda3\lib\site-packages\torch\lib\caffe2_detectron_ops.dll" or one of its dependencies.

How to connect with Elasticsearch

Hi,
I want to use this package with Elasticsearch LTR so that I can use the ES rescore function while searching. How can I add this package to Elasticsearch?

About the input of the loss function

In the loss function 'neuralNDCG', the input 'y_pred' is the prediction from the model, and its shape is '[batch_size, slate_length]'. What does 'slate_length' mean? Is each element in 'y_pred' a similarity score or a ranking value? Can you show an example of the loss function inputs?

code for NeuralNDCG

Recently, I read a paper about image retrieval, named SmoothAP, aimed at optimizing a neural network directly through the AP metric. It motivated me to take this idea to other metrics, such as NDCG.

As you know, I have seen your paper NeuralNDCG https://docs.mlinpl.org/virtual-event/2020/posters/37-NeuralNDCG_Direct_Optimisation_of_a_Ranking_Metric_via_Differentiable_Relaxation_of_Sorting.pdf. I am so excited about this work. It fully implements what I was thinking about.

I can't wait to use it in my task. Would you mind sharing the code with me?

sincere thanks!

Which loss functions are suitable for sentence ordering?

The features are tokenized sentences, and the targets (y_true) are normalized rankings. A typical model accepts tokenized sentences as inputs and outputs their order/ranks.

`x`: tokenize(['sentence 1', 'sentence 2', 'sentence 3', 'sentence 4'])
`y_true`: [0., 0.33333333, 0.66666667, 1.]

Which of the loss function implementations is suitable for this kind of data?

ndcgLoss2PP backpropagation

The ndcgLoss2PP needs EM to optimise (according to the paper). I couldn't find an EM implementation in the code; maybe I missed something? Could someone briefly explain what is happening (or point to any reference)?

Location of loss functions??

Hello, thank you for all the efforts taken to develop this library. I wanted to see the implementation of all the losses. I tried searching but could not find their exact location in the repository. I was particularly interested in the lambda loss. Could you please guide me to it?

RuntimeError Numpy not available

I built the docker image with

docker build --build-arg arch_version=${ARCH_VERSION} -t allrank .
sh scripts/run_example.sh

But when I run run_example.sh I get the RuntimeError from the title ("Numpy not available"); the full traceback is in the attached screenshot. I tried adding pandas and numpy to requirements.txt, with no change.

Transformer implementation seemingly not corresponding to paper

Hi there,

First off - thank you for open sourcing this very interesting work(!).

I'm having a look at this repo along with the paper "Context-Aware Learning to Rank with Self-Attention", and it seems there is a bug, as the code doesn't seem to correspond exactly to what's written in the paper:

https://github.com/allegro/allRank/blob/master/allrank/models/transformer.py#L105
Here you seemingly apply LayerNorm to the inputs to the MultiHeadAttention and Fully-Connected layers.
However, in the paper (and from what I gather is general practice for Transformers), you've written that LayerNorm is applied to the outputs of these corresponding layers.

It seems that this repo isn't actively worked on, but nonetheless I thought I'd let you know.

Thanks & best wishes,
Patrick

question about the result

I standardized the features and set up config.json as shown in the attached screenshot.
I ran on Fold 1 of MSLR-WEB30K and only get an NDCG@5 of 0.502 on the test set. Is there any step I am missing?
Thanks a lot!

Results on MSNWEB30K

Hello developers,
I wanted to reproduce your results using a self-attentive model on MSLR-WEB30K (msn30k). In particular, I was interested in reproducing the 0.5431 NDCG@10 that you obtain with the Ordinal loss (Table 3 of your paper).
This is my config.json, based on the hyperparameters specified in the article.

    "model": {
        "fc_model": {
            "sizes": [144],
            "input_norm": "True",
            "activation": null,
            "dropout": 0.0
        },
        "transformer": {
            "N": 4,
            "d_ff": 512,
            "h": 2,
            "positional_encoding": null,
            "dropout": 0.4
        },
        "post_model": {
            "output_activation": "Sigmoid",
            "d_output": 4
        }
    },
    "data": {
        "path": "./Fold1/",
        "validation_ds_role": "vali",
        "num_workers": 16,
        "batch_size": 64,
        "slate_length": 240
    },
    "optimizer": {
        "name": "Adam",
        "args": {
            "lr": 0.001
        }
    },
    "lr_scheduler": {
        "name": "StepLR",
        "args": {
            "step_size": 50
        }
    },
    "training": {
        "epochs": 100,
        "gradient_clipping_norm": [],
        "early_stopping_patience": 10
    },
    "metrics": ["ndcg_10"],
    "loss": {
        "name": "ordinal",
        "args": {
            "n": 4
        }
    },
    "val_metric": "ndcg_10",
    "detect_anomaly": "True"
}

The best result I can get is 0.4307 of ndcg@10 on the validation set. What am I missing?

How to integrate with regular PyTorch models

Hi!

I am currently looking into rephrasing a classification problem as a learning-to-rank problem. I have found your repo and it looks quite promising; you have definitely put a lot of effort into this.

My main question is about how I can use parts of this repo: import them and use them in my own model, which operates on 1D vectors of inputs and outputs a scalar prediction per input example. In total it produces a 1D tensor of roughly 10k predictions between 0 and 1, and I have a 1D tensor of labels between 0 and 1 of identical length to compare them with.

For the losses, taking lambdaLoss as example:

:param y_pred: predictions from the model, shape [batch_size, slate_length]
:param y_true: ground truth labels, shape [batch_size, slate_length]

How can I apply that to my outputs/labels? Do I have to sample n negatives for each positive to obtain m slates of length n+1 before I can feed them into the loss functions? In that case y_pred and y_true would have the shape [m, n+1]. Is there a (preferably list-wise) loss function into which I can directly drop my 1D output/label vectors (just like PyTorch's BCE or MSE losses)?
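
A minimal sketch of that sampling/reshaping idea, under the assumption that every slate has the same length (unequal slates would additionally need label padding with the library's padding value); lambdaLoss is called with its defaults:

import torch
import allrank.models.losses as losses

m, n = 8, 5                                                # m slates of length n, e.g. 1 positive + n-1 sampled negatives
scores_1d = torch.rand(m * n, requires_grad=True)          # flat 1D model outputs, grouped slate by slate
labels_1d = torch.tensor([1.] + [0.] * (n - 1)).repeat(m)  # flat 1D binary labels in the same order

y_pred = scores_1d.view(m, n)                              # reshape to [batch_size, slate_length]
y_true = labels_1d.view(m, n)

loss = losses.lambdaLoss(y_pred, y_true)                   # default arguments assumed
loss.backward()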

For the models:

How can I make the architecture of my model more LTR-sensitive? Assume I have a network of fully connected layers which passes its output through a sigmoid layer to predict class probabilities. What would have to change to optimize the architecture for a learning-to-rank task? Which of the ideas and implementations of this repo can I leverage to accomplish this?

Thanks in advance for your time and effort. I would be more than happy to elaborate if some of my questions are confusing or don't make sense.

Cheers,
Florin

Embeddings

Has anybody tried to use in-line trained embeddings?
Not all columns are simple numerics. For categorical features, or for numerics where bucketization may lead to a better
learned representation, using an NxM embedding matrix for some features can be helpful.
In Tensorflow it can be done using these helpers:

bucketized_col1 = tf.feature_column.bucketized_column(tf.feature_column.numeric_column(key='col1', shape=[1,], default_value=-2, dtype=tf.int64), boundaries=[1,3,4,5,10,14,18,27,52,61,79])
feature_columns["col1_emb"] = tf.feature_column.embedding_column(bucketized_col1, 5)

Forward vs Backward Slash in Path definition

I am dealing with a very simple problem but haven't been able to find a good solution (or any solution):

The local_config.json file defines the folder path UNIX-style, with forward slashes:
"path": "/allrank/dummy_data",

Now I can't change this to backslashes in there, because JSON treats backslashes as escape characters.

Next, the main file passes this as an argument to the load_libsvm_dataset function:

input_path=config.data.path

And in there, this problem is supposed to be fixed with:

input_path =  Path(input_path)    
path = os.path.join(input_path, "{}.txt".format(role))

But it isn't converted correctly, so I get this error:

FileNotFoundError: [Errno 2] No such file or directory: '/allrank/dummy_data\train.txt'

I have tried fixing this with Path and various other conversions but couldn't get it to work (when I use Path I end up getting double backslashes). Can anyone suggest a solution?
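
One possible workaround sketch: build the path entirely with pathlib instead of mixing it with os.path.join, so the separators stay consistent on Windows (the path below is illustrative):

from pathlib import Path

input_path = Path("../allrank/dummy_data")   # illustrative relative path
role = "train"
path = input_path / f"{role}.txt"            # pathlib joins with the platform's separator
print(path)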

How to use your own dataset?

Would it be possible to provide a brief explanation of what the columns in the example dataset are (I am familiar with the libsvm format from this description: https://catboost.ai/en/docs/concepts/input-data_libsvm but still don't fully understand)? The reason I am asking is that I would like to run the model on my own dataset, but it isn't quite clear to me what the format of the data should be. I know that the paper used MSLR-WEB30K, but that one looked a bit different from this one.

In particular, the dataset has (see the example line after this list):

  1. In the first column, values from 0 to 4? Are these the predicted ranks?
  2. The 2nd column has the qid with integer values from 0 to 99. I assume these are the query IDs.
  3. The remaining columns appear to be features? An integer indicating the feature index, followed by its value (and so on)?
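
For reference, a generic line of the libsvm ranking format looks like this (values made up): the first column is the ground-truth relevance label (0-4 in MSLR-WEB30K-style data), followed by the query id and then feature_index:feature_value pairs:

2 qid:17 1:0.31 2:0.74 3:0.05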

Only ordinal loss works

When I try to use a different loss function, like rankNet, I get this error:

ValueError: Target size (torch.Size([2703])) must be the same as input size (torch.Size([2703, 4]))

This is the config that I'm using

{
  "model": {
    "fc_model": {
      "sizes": [
        64
      ],
      "input_norm": false,
      "activation": null,
      "dropout": 0.0
    },
    "transformer": {
      "N": 1,
      "d_ff": 64,
      "h": 1,
      "positional_encoding": null,
      "dropout": 0.0
    },
    "post_model": {
      "output_activation": "Sigmoid",
      "d_output": 4
    }
  },
  "data": {
    "path": "/home/username/data",
    "validation_ds_role": "vali",
    "num_workers": 1,
    "batch_size": 64,
    "slate_length": 240
  },
  "optimizer": {
    "name": "Adam",
    "args": {
      "lr": 0.004
    }
  },
  "lr_scheduler": {
    "name": "StepLR",
    "args": {
      "step_size": 3,
      "gamma": 0.5
    }
  },
  "training": {
    "epochs": 50,
    "early_stopping_patience": 100,
    "gradient_clipping_norm": null
  },
  "val_metric": "ndcg_50",
  "metrics": [
    "ndcg_50"
  ],
  "loss": {
    "name": "rankNet",
    "args": {}
  },
  "expected_metrics" : {
    "val": {
      "ndcg_50": 0.785
    }
  }
}

Question regarding inferring the rank given model scores

First, let me thank you for this nice library! It's been of great help for my project.

My question is about how to infer the most likely rank of items given the scores of a model trained with ListMLE.
My understanding was that ordering the items by their scores (descending) would give you the most likely ranking, according to the PL model.
However, while toying with the model, I found that ordering the scores in ascending order gives sensible results, while ordering in descending order doesn't.
Is this intended, or does it sound like I'm doing something wrong?

Thank you!

Stuck after distributed to GPU

First, thank you for your awesome work.

I've run into some problems when running your code.

For multiple queries (number of unique qids > 1), the code gets stuck at this point:

...
[INFO] 2021-07-07 09:25:46 - loaded dataset with 2 queries
[INFO] 2021-07-07 09:25:46 - longest query had 171 documents
[INFO] 2021-07-07 09:25:46 - val DS shape: [2, 171, 175]
[INFO] 2021-07-07 09:25:46 - Will pad to the longest slate: 171
[INFO] 2021-07-07 09:25:46 - total batch size is 128
[INFO] 2021-07-07 09:25:46 - Model training will execute on cuda
[INFO] 2021-07-07 09:25:46 - Model training will be distributed to 8 GPUs.
[INFO] 2021-07-07 09:25:48 - Model has 36868 trainable parameters
[INFO] 2021-07-07 09:25:48 - Current learning rate: 0.001

I've waited for about three hours with still no progress (both for my dataset and your generated dummy dataset). I can't even use Ctrl+C to stop the process. When I check CUDA with nvidia-smi, GPU usage looks normal.

Then I tried with a single-query dataset (all qid:0), and the code runs fine.

So, what might cause this problem?

System: Ubuntu 20.04
GPUs: A100, 40G
Config: local_config.json, with only the data root path changed

np.unique() does not preserve order

I think there is a bug in LibSVMDataset

When creating groups and then splitting the input X, np.unique() does not preserve the order of the query_ids, so the split won't be done correctly!

A better way would be:

self.query_ids = Counter(query_ids)
groups = np.cumsum(list(self.query_ids.values()))
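
A small sketch of the difference (Counter preserves insertion order on Python 3.7+, while np.unique sorts its output):

import numpy as np
from collections import Counter

query_ids = np.array([3, 3, 1, 1, 1, 2])          # query ids that are not globally sorted

print(np.unique(query_ids, return_counts=True))   # (array([1, 2, 3]), array([3, 1, 2])) -- sorted, original order lost

counts = Counter(query_ids)                       # keeps first-seen order: 3, 1, 2
groups = np.cumsum(list(counts.values()))         # [2, 5, 6] -- split points that match the data order
print(groups)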

How to get predictions from the model?

This question was asked before. Is there a way to get the predictions of the model? What I mean is: for the test dataset, can we obtain the predicted ranks for each observation, so we can do our own evaluation of model performance?

Similar to the way the usual sklearn models would have a fit and a predict function?

This would be particularly useful to evaluate the model in a broader context and compare against other models that might not produce the same evaluation metrics.
