attention-learn-to-route's Introduction

Note: I am currently not able to actively maintain this repository. Please also check out more recent implementations, e.g. https://github.com/ai4co/rl4co and https://github.com/cpwan/RLOR.

Attention, Learn to Solve Routing Problems!

Attention-based model for learning to solve the Travelling Salesman Problem (TSP), the Vehicle Routing Problem (VRP), the Orienteering Problem (OP) and the (Stochastic) Prize Collecting TSP (PCTSP). Trained using REINFORCE with a greedy rollout baseline.

TSP100

Paper

For more details, please see our paper Attention, Learn to Solve Routing Problems! which has been accepted at ICLR 2019. If this code is useful for your work, please cite our paper:

@inproceedings{
    kool2018attention,
    title={Attention, Learn to Solve Routing Problems!},
    author={Wouter Kool and Herke van Hoof and Max Welling},
    booktitle={International Conference on Learning Representations},
    year={2019},
    url={https://openreview.net/forum?id=ByxBFsRqYm},
}

Dependencies

Quick start

To train on TSP instances with 20 nodes, using rollout as the REINFORCE baseline:

python run.py --graph_size 20 --baseline rollout --run_name 'tsp20_rollout'

Usage

Generating data

Training data is generated on the fly. To generate validation and test data (same as used in the paper) for all problems:

python generate_data.py --problem all --name validation --seed 4321
python generate_data.py --problem all --name test --seed 1234

Training

To train on TSP instances with 20 nodes, using rollout as the REINFORCE baseline and the generated validation set:

python run.py --graph_size 20 --baseline rollout --run_name 'tsp20_rollout' --val_dataset data/tsp/tsp20_validation_seed4321.pkl

Multiple GPUs

By default, training will happen on all available GPUs. To disable CUDA entirely, add the flag --no_cuda. Set the environment variable CUDA_VISIBLE_DEVICES to only use specific GPUs:

CUDA_VISIBLE_DEVICES=2,3 python run.py 

Note that using multiple GPUs has limited efficiency for small problem sizes (up to 50 nodes).

Warm start

You can initialize a run using a pretrained model by using the --load_path option:

python run.py --graph_size 100 --load_path pretrained/tsp_100/epoch-99.pt

The --load_path option can also be used to load an earlier run, in which case the optimizer state will also be loaded:

python run.py --graph_size 20 --load_path 'outputs/tsp_20/tsp20_rollout_{datetime}/epoch-0.pt'

The --resume option can be used instead of the --load_path option. It will try to resume the run, e.g. additionally load the baseline state, set the current epoch/step counter and set the random number generator state.

Evaluation

To evaluate a model, you can add the --eval_only flag to run.py, or use eval.py, which will additionally measure timing and save the results:

python eval.py data/tsp/tsp20_test_seed1234.pkl --model pretrained/tsp_20 --decode_strategy greedy

If the epoch is not specified, by default the last one in the folder will be used.
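
The default epoch selection described above can be pictured with the sketch below. It is not the repository's implementation: `latest_checkpoint` is a hypothetical helper, and it assumes that checkpoints are named `epoch-{N}.pt`, as in the pretrained folders used in the commands above.

```python
import re
from pathlib import Path
from typing import Optional


def latest_checkpoint(folder: str) -> Optional[Path]:
    """Return the epoch-{N}.pt file with the largest N, or None if there is none."""
    pattern = re.compile(r"epoch-(\d+)\.pt")
    best_epoch, best_path = -1, None
    for path in Path(folder).iterdir():
        match = pattern.fullmatch(path.name)
        # Compare epochs as integers so that epoch-100 ranks above epoch-99.
        if match and int(match.group(1)) > best_epoch:
            best_epoch, best_path = int(match.group(1)), path
    return best_path
```

Comparing the epoch numbers as integers matters here: a plain alphabetical sort of the file names would place `epoch-99.pt` after `epoch-100.pt`.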

Sampling

To report the best of 1280 sampled solutions, use

python eval.py data/tsp/tsp20_test_seed1234.pkl --model pretrained/tsp_20 --decode_strategy sample --width 1280 --eval_batch_size 1

Beam Search (not in the paper) has also been added and can be used via --decode_strategy bs --width {beam_size}.
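
As a rough illustration of what "best of N sampled solutions" means, here is a minimal sketch on a toy TSP instance. It assumes uniform random permutations as a stand-in for sampling from the trained policy, and the helper names `tour_length` and `best_of_n` are hypothetical; none of this is the code in eval.py.

```python
import math
import random


def tour_length(points, tour):
    """Length of the closed tour visiting `points` in the order given by `tour`."""
    return sum(
        math.dist(points[tour[i]], points[tour[(i + 1) % len(tour)]])
        for i in range(len(tour))
    )


def best_of_n(points, n_samples, rng):
    """Sample `n_samples` tours and keep the shortest one."""
    best_tour, best_cost = None, float("inf")
    for _ in range(n_samples):
        # Stand-in for sampling from the learned policy: a uniform random permutation.
        tour = rng.sample(range(len(points)), len(points))
        cost = tour_length(points, tour)
        if cost < best_cost:
            best_tour, best_cost = tour, cost
    return best_tour, best_cost


rng = random.Random(1234)
points = [(rng.random(), rng.random()) for _ in range(20)]
_, cost_1 = best_of_n(points, 1, random.Random(0))
best_tour, cost_1280 = best_of_n(points, 1280, random.Random(0))
print(f"1 sample: {cost_1:.2f}, best of 1280: {cost_1280:.2f}")
```

Because both calls start from the same seed, the first sampled tour is identical, so the best-of-1280 cost can never be higher than the single-sample cost.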

To run baselines

Baselines for different problems are within the corresponding folders and can be run (on multiple datasets at once) as follows:

python -m problems.tsp.tsp_baseline farthest_insertion data/tsp/tsp20_test_seed1234.pkl data/tsp/tsp50_test_seed1234.pkl data/tsp/tsp100_test_seed1234.pkl

To run baselines, you need to install Compass by running the install_compass.sh script from within the problems/op directory and Concorde using the install_concorde.sh script from within problems/tsp. LKH3 should be automatically downloaded and installed when required. To use Gurobi, obtain a (free academic) license and follow the installation instructions.

Other options and help

python run.py -h
python eval.py -h

Example CVRP solution

See plot_vrp.ipynb for an example of loading a pretrained model and plotting the result for Capacitated VRP with 100 nodes.

CVRP100

Acknowledgements

Thanks to pemami4911/neural-combinatorial-rl-pytorch for getting me started with the code for the Pointer Network.

This repository includes adaptions of the following repositories as baselines:

attention-learn-to-route's People

Contributors

lhoupert, ltluttmann, wouterkool

attention-learn-to-route's Issues

questions about how to model CVRP constraints

Hello,

First of all, I really admire your work and repository. I am very curious how to model, as a MIP constraint, the condition that a vehicle serves all customers by making repeated trips. I'm having some difficulty with this. If it is convenient, could you provide the Gurobi code? Maybe I can find the answer there.

Reimplementation in RL4CO

Hi there 👋🏼

First of all, thanks a lot for your library, it has inspired several works in our research group!
We are actively developing RL4CO, a library for all things Reinforcement Learning for Combinatorial Optimization. We started the library by modularizing the Attention Model, which is the basis for several other autoregressive models. We also used some recent software (such as TorchRL, TensorDict, PyTorch Lightning and Hydra) as well as routines such as FlashAttention, and made everything as easy to use as possible in the hope of helping practitioners and researchers.

We welcome you to check RL4CO out ^^

Solving the Multi vehicle VRP.

Hello, author
Since multi-vehicle scheduling is the common case in practice, is it possible for this framework to solve the multi-vehicle VRP?

Number of available vehicles?

Hello 👋,

I am wondering if the model for VRP could be extended to take an input of a number of vehicles and the model outputs a desired number of routes? Any tips on how to approach it?

Btw, amazing work on the paper!

Many thanks 🙏

Question - how long does training the TSP model take?

Hi!

I'm executing the following command to train a TSP model:

python run.py --graph_size 20 --baseline rollout --run_name 'tsp20_rollout'
I've run the code for 1 epoch and it took 10 hours. Is this expected?

I'm running the code under Windows on an MSI laptop with a GPU and CUDA enabled.

Thanks in advance!

Reimplementation in RL platform (CleanRL)

Hello there, my team has been trying to implement the Attention Model in RL platforms so that we can try out different RL algorithms. Eventually, we succeeded in implementing the most efficient one with PPO in CleanRL. We are able to train the Attention Model in 3 hours for 50-node problems (it took 25 hours with the original code).

Moreover, we have broken down the Attention Model into several components. It would be a good resource for anyone interested in learning or developing the Attention Model.

We implemented the vehicle routing problems with the OpenAI gym interface. It may be easier to extend to other new problems.

We have released the source code for our implementation in RLOR: A Flexible Framework of Deep Reinforcement Learning for Operation Research. Feel free to check it out 😆 !

Running time

Hello!
I ran the code with the following command:
python run.py --graph_size 20 --baseline rollout --run_name 'tsp20_rollout' --val_dataset data/tsp/tsp20_validation_seed4321.pkl
but an epoch took nearly 3 hours.
I ran the code on a single GPU (Nvidia Tesla V100S) and I verified that opts.device is indeed cuda:0. What is the reason for this?
Thanks in advance!

Evaluation with the pretrained model for TSP problem asserts an error

Command: python3 eval.py data/tsp/tsp20_validation_seed4321.pkl --model pretrained/tsp_20 --decode_strategy greedy

Trace:

  [*] Loading model from pretrained/tsp_20/epoch-99.pt
  0%|                                                                                            | 0/10 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "eval.py", line 216, in <module>
    eval_dataset(dataset_path, width, opts.softmax_temperature, opts)
  File "eval.py", line 70, in eval_dataset
    results = _eval_dataset(model, dataset, width, softmax_temp, opts, device)
  File "eval.py", line 140, in _eval_dataset
    sequences, costs = model.sample_many(batch, batch_rep=batch_rep, iter_rep=iter_rep)
  File "/home/thyoon/attention-learn-to-route/nets/attention_model.py", line 288, in sample_many
    batch_rep, iter_rep
  File "/home/thyoon/attention-learn-to-route/utils/functions.py", line 189, in sample_many
    _log_p, pi = inner_func(input)
  File "/home/thyoon/attention-learn-to-route/nets/attention_model.py", line 285, in <lambda>
    lambda input: self._inner(*input),  # Need to unpack tuple into arguments
  File "/home/thyoon/attention-learn-to-route/nets/attention_model.py", line 234, in _inner
    batch_size = state.ids.size(0)
  File "/home/thyoon/attention-learn-to-route/problems/tsp/state_tsp.py", line 31, in __getitem__
    assert torch.is_tensor(key) or isinstance(key, slice)  # If tensor, idx all tensors by this tensor:
AssertionError

RuntimeError: CUDA error: unknown error

Hi, I enjoyed your paper, but I ran into an issue with your code that I don't know how to solve. My environment:
cuda: 8.0 V8.0.61
python: 3.5
Pytorch: 0.4

This is the output:

{'baseline': None,
'batch_size': 512,
'bl_alpha': 0.05,
'bl_warmup_epochs': 0,
'checkpoint_epochs': 1,
'embedding_dim': 128,
'epoch_size': 1280000,
'epoch_start': 0,
'eval_batch_size': 1024,
'eval_only': False,
'exp_beta': 0.8,
'graph_size': 20,
'hidden_dim': 128,
'load_path': None,
'log_dir': 'logs',
'log_step': 50,
'lr_critic': 0.0001,
'lr_decay': 1.0,
'lr_model': 0.0001,
'max_grad_norm': 1.0,
'model': 'attention',
'n_encode_layers': 3,
'n_epochs': 100,
'no_cuda': False,
'no_progress_bar': False,
'no_tensorboard': False,
'normalization': 'batch',
'output_dir': 'outputs',
'problem': 'tsp',
'resume': None,
'run_name': 'run_20180906T163914',
'save_dir': 'outputs/tsp_20/run_20180906T163914',
'seed': 1234,
'tanh_clipping': 10.0,
'use_cuda': True,
'val_dataset': None,
'val_size': 10000}
Traceback (most recent call last):
File "/home/savantning/PycharmProjects/attention-tsp-master/run.py", line 168, in
run(get_options())
File "/home/savantning/PycharmProjects/attention-tsp-master/run.py", line 67, in run
opts.use_cuda
File "/home/savantning/PycharmProjects/attention-tsp-master/utils.py", line 26, in maybe_cuda_model
model.cuda()
File "/home/savantning/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 258, in cuda
return self._apply(lambda t: t.cuda(device))
File "/home/savantning/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 185, in _apply
module._apply(fn)
File "/home/savantning/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 191, in _apply
param.data = fn(param.data)
File "/home/savantning/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 258, in
return self._apply(lambda t: t.cuda(device))
RuntimeError: CUDA error: unknown error

Process finished with exit code 1

thank you very much

Are the experiments fair enough?

Hello,

I really enjoy your paper and code. But while running the LKH-3 baseline for the CVRP problem, I found something strange.
In vrp_baseline.py, line 101, the parameter MAX_TRIALS is set to:

"MAX_TRIALS": 10000,

But according to the LKH-3 docs, the default value of MAX_TRIALS is the number of nodes (which varies from 20 to 100 in the experiments). So I ran a further experiment on several random graphs with 20 nodes, setting MAX_TRIALS to 20 and 10000 respectively. Here are the results:

      MAX_TRIALS = 20        MAX_TRIALS = 10000
#     Obj    Time (s)        Obj    Time (s)
1     407    0.08            407    17.99
2     432    0.09            432    22.27
3     365    0.08            365    17.81
4     341    0.07            341    15.14
5     221    0.10            221    20.44

It seems that setting MAX_TRIALS to 20 or 10000 returns the same objective value, but the run times differ by roughly 200x. So I doubt whether the experiments are fair to the baselines. Or is there anything I did wrong?
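
The run-time gap can be checked directly from the reported numbers. The short script below only redoes the arithmetic on the timings listed in the table of this issue; the variable names are illustrative.

```python
# Timings (seconds) copied from the table above.
time_20 = [0.08, 0.09, 0.08, 0.07, 0.10]
time_10000 = [17.99, 22.27, 17.81, 15.14, 20.44]

# Per-instance slowdown and the slowdown of the total run time.
ratios = [slow / fast for fast, slow in zip(time_20, time_10000)]
overall = sum(time_10000) / sum(time_20)

for i, ratio in enumerate(ratios, start=1):
    print(f"instance {i}: {ratio:.0f}x slower with MAX_TRIALS = 10000")
print(f"overall: {overall:.0f}x")
```

Every one of the five instances is more than 200 times slower with the larger setting, which is consistent with the "200x" figure quoted in the question.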

Bug in Beam Search

There might be a bug in the beam search. If my understanding is correct, increasing the beam size should never worsen the result. When testing on very large problems (CVRP1000) with beam sizes up to 4096, I'm seeing inconsistent results. Up to a beam size of about 16 the results get better; then they start to fluctuate a bit, with a tendency to get worse again. A beam size of 4096 then yields the same result as a beam size of two.

I'm running something like this:
srun python eval.py
data/vrp/vrp_uniform_1000.pkl
-f
--no_progress_bar
--decode_strategy bs
--eval_batch_size 1
--width 1 2 4 8 16 32 64 128 256 512 1024 2048 4096
--model pretrained/cvrp_50/epoch-99.pt
--max_calc_batch_size 10000
--softmax_temperature 1

Is this known? Anything I should try?

Exception in thread pool task: mutex lock failed: Invalid argument

Hello Wouter,

After finishing epochs 0 and 1 and before the validation starts, I got a "mutex lock failed: Invalid argument" error. I have torch properly installed. What might be a possible cause?

| 0/10 [00:00<?, ?it/s][E thread_pool.cpp:113] Exception in thread pool task: mutex lock failed: Invalid argument
[E thread_pool.cpp:113] Exception in thread pool task: mutex lock failed: Invalid argument
[E thread_pool.cpp:113] Exception in thread pool task: mutex lock failed: Invalid argument
[E thread_pool.cpp:113] Exception in thread pool task: mutex lock failed: Invalid argument
Traceback (most recent call last):
  File "/Users/admin/Desktop/attention-learn-to-route-master/run.py", line 172, in
    run(get_options())
  File "/Users/admin/Desktop/attention-learn-to-route-master/run.py", line 158, in run
    train_epoch(
  File "/Users/admin/Desktop/attention-learn-to-route-master/train.py", line 116, in train_epoch
    avg_reward = validate(model, val_dataset, opts)
  File "/Users/admin/Desktop/attention-learn-to-route-master/train.py", line 22, in validate
    cost = rollout(model, dataset, opts)
  File "/Users/admin/Desktop/attention-learn-to-route-master/train.py", line 40, in rollout
    return torch.cat([
  File "/Users/admin/Desktop/attention-learn-to-route-master/train.py", line 41, in
    eval_model_bat(bat)
  File "/Users/admin/Desktop/attention-learn-to-route-master/train.py", line 37, in eval_model_bat
    cost, _ = model(move_to(bat, opts.device))
  File "/opt/homebrew/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Users/admin/Desktop/attention-learn-to-route-master/nets/attention_model.py", line 137, in forward
    _log_p, pi = self._inner(input, embeddings)
  File "/Users/admin/Desktop/attention-learn-to-route-master/nets/attention_model.py", line 252, in _inner
    log_p, mask = self._get_log_p(fixed, state)
  File "/Users/admin/Desktop/attention-learn-to-route-master/nets/attention_model.py", line 358, in _get_log_p
    log_p, glimpse = self._one_to_many_logits(query, glimpse_K, glimpse_V, logit_K, mask)
  File "/Users/admin/Desktop/attention-learn-to-route-master/nets/attention_model.py", line 460, in _one_to_many_logits
    compatibility = torch.matmul(glimpse_Q, glimpse_K.transpose(-2, -1)) / math.sqrt(glimpse_Q.size(-1))
RuntimeError: value < sizeINTERNAL ASSERT FAILED at "../aten/src/ATen/TensorIterator.cpp":1557, please report a bug to PyTorch.
0%| | 0/10 [00:01<?, ?it/s]

Error when training a TSP

Hi:

I'm trying to train the model, but I get an error when running python run.py --graph_size 20 --baseline rollout --run_name 'tsp20_rollout'

The trace log looks like:


$ python run.py --graph_size 20 --baseline rollout --run_name 'tsp20_rollout'
{'baseline': 'rollout',
 'batch_size': 512,
 'bl_alpha': 0.05,
 'bl_warmup_epochs': 1,      
 'checkpoint_encoder': False,
 'checkpoint_epochs': 1,     
 'data_distribution': None,  
 'embedding_dim': 128,       
 'epoch_size': 1280000,      
 'epoch_start': 0,
 'eval_batch_size': 1024,    
 'eval_only': False,
 'exp_beta': 0.8,
 'graph_size': 20,
 'hidden_dim': 128,
 'load_path': None,
 'log_dir': 'logs',
 'log_step': 50,
 'lr_critic': 0.0001,
 'lr_decay': 1.0,
 'lr_model': 0.0001,
 'max_grad_norm': 1.0,
 'model': 'attention',
 'n_encode_layers': 3,
 'n_epochs': 100,
 'no_cuda': False,
 'no_progress_bar': False,
 'no_tensorboard': False,
 'normalization': 'batch',
 'output_dir': 'outputs',
 'problem': 'tsp',
 'resume': None,
 'run_name': 'tsp20_rollout_20210507T142303',
 'save_dir': 'outputs\\tsp_20\\tsp20_rollout_20210507T142303',
 'seed': 1234,
 'shrink_size': None,
 'tanh_clipping': 10.0,
 'use_cuda': False,
 'val_dataset': None,
 'val_size': 10000}
Evaluating baseline model on evaluation dataset
check2
  0%|          | 0/10 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "run.py", line 172, in <module>
    run(get_options())
  File "run.py", line 104, in run
    baseline = RolloutBaseline(model, problem, opts)
  File "C:\Users\lucia\Desktop\Trucksters-repo\atention_model\reinforce_baselines.py", line 151, in __init__
    self._update_model(model, epoch)
  File "C:\Users\lucia\Desktop\Trucksters-repo\atention_model\reinforce_baselines.py", line 171, in _update_model
    self.bl_vals = rollout(self.model, self.dataset, self.opts).cpu().numpy()
  File "C:\Users\lucia\Desktop\Trucksters-repo\atention_model\train.py", line 43, in rollout
    in tqdm(DataLoader(dataset, batch_size=opts.eval_batch_size), disable=opts.no_progress_bar)
  File "C:\Users\lucia\Desktop\Trucksters-repo\atention_model\train.py", line 42, in <listcomp>
    for bat
  File "C:\Users\lucia\Desktop\Trucksters-repo\atention_model\train.py", line 37, in eval_model_bat
    cost, _ = model(move_to(bat, opts.device))
  File "C:\Users\lucia\.virtualenvs\atention_model-OuevcIBs\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\lucia\Desktop\Trucksters-repo\atention_model\nets\attention_model.py", line 137, in forward
    _log_p, pi = self._inner(input, embeddings)
  File "C:\Users\lucia\Desktop\Trucksters-repo\atention_model\nets\attention_model.py", line 234, in _inner
    batch_size = state.ids.size(0)
  File "C:\Users\lucia\Desktop\Trucksters-repo\atention_model\problems\tsp\state_tsp.py", line 31, in __getitem__
    assert torch.is_tensor(key) or isinstance(key, slice)  # If tensor, idx all tensors by this tensor:
AssertionError
  0%|                                                                                                                                                                                                                               | 0/10 [00:00<?, ?it/s]

How can I get rid of it?

Thanks in advance

Unable to resize error

Hi. Many thanks for making your work public. It's been a pleasure reading your paper.

I tried running the code on Spyder. It works fine until at one point, it hits the following runtime error.

Start train epoch 12, lr=0.0001 for run run_20210510T145253
Evaluating baseline on dataset...
100%|██████████| 10/10 [00:00<00:00, 22.94it/s]
100%|██████████| 10/10 [00:03<00:00, 3.04it/s]
100%|██████████| 1/1 [00:00<00:00, 23.44it/s]
100%|██████████| 1/1 [00:00<00:00, 22.40it/s]
Finished epoch 12, took 00:00:03 s
Saving model and state...
Validating...
Validation overall avg_cost: -7.61328125 +- 0.06633966416120529
Evaluating candidate model on evaluation dataset
Epoch 12 candidate mean -7.60546875, baseline epoch 11 mean -7.64453125, difference 0.0390625
Start train epoch 13, lr=0.0001 for run run_20210510T145253
30%|███ | 3/10 [00:00<00:00, 22.74it/s]Evaluating baseline on dataset...
100%|██████████| 10/10 [00:00<00:00, 22.78it/s]
0%| | 0/10 [00:00<?, ?it/s]
Traceback (most recent call last):
File "", line 1, in
File "C:\Users\user\anaconda3\envs\attentionVRP\lib\multiprocessing\spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "C:\Users\user\anaconda3\envs\attentionVRP\lib\multiprocessing\spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
File "C:\Users\user\anaconda3\envs\attentionVRP\lib\site-packages\torch\multiprocessing\reductions.py", line 88, in rebuild_tensor
t = torch._utils._rebuild_tensor(storage, storage_offset, size, stride)
File "C:\Users\user\anaconda3\envs\attentionVRP\lib\site-packages\torch\_utils.py", line 133, in _rebuild_tensor
return t.set_(storage, storage_offset, size, stride)
RuntimeError: Trying to resize storage that is not resizable at ..\aten\src\TH\THStorageFunctions.cpp:87

The problem is OP with the const data distribution. To keep things simple, I set graph_size to 20, batch_size to 512, epoch_size to 5120, eval_batch_size to 512, and the number of epochs to 100. Other parameters are unchanged.

Any idea to tackle this problem?

Thanks in advance!

issue with warmup baseline

Hi Wouter,

first, big thanks for making the code available! I have a comment regarding the WarmupBaseline. As to my understanding, it should simply return the evaluation results of the "normal" baseline (self.baseline) once the number of warmup epochs is exceeded. However, the alpha attribute takes on values larger than 1 for all subsequent epochs:

def epoch_callback(self, model, epoch):
    # Need to call epoch callback of inner model (also after first epoch if we have not used it)
    self.baseline.epoch_callback(model, epoch)
    self.alpha = (epoch + 1) / float(self.n_epochs)
    if epoch < self.n_epochs:
        print("Set warmup alpha = {}".format(self.alpha))

which causes the respective if-else statement to fail:

def eval(self, x, c):
    if self.alpha == 1:
        return self.baseline.eval(x, c)
    if self.alpha == 0:
        return self.warmup_baseline.eval(x, c)
    v, l = self.baseline.eval(x, c)
    vw, lw = self.warmup_baseline.eval(x, c)
    # Return convex combination of baseline and of loss
    return self.alpha * v + (1 - self.alpha) * vw, self.alpha * l + (1 - self.alpha * lw)

So, I would propose either a clipping of the alpha attribute to a maximum value of 1 in the epoch_callback function or changing the respective if-statement to if self.alpha >= 1
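
The two proposed fixes can be sketched together as follows. This is a standalone illustration, not a patch to the repository's WarmupBaseline class: `warmup_alpha` and `blend` are hypothetical helper names, and the sketch assumes n_epochs > 0.

```python
def warmup_alpha(epoch: int, n_epochs: int) -> float:
    """Fraction of warmup completed after `epoch`, clipped to at most 1."""
    return min(1.0, (epoch + 1) / float(n_epochs))


def blend(alpha: float, value: float, warmup_value: float) -> float:
    """Convex combination of the main and the warmup baseline values."""
    if alpha >= 1:  # warmup finished: use the main baseline only
        return value
    if alpha == 0:  # warmup not started: use the warmup baseline only
        return warmup_value
    return alpha * value + (1 - alpha) * warmup_value


# With 2 warmup epochs, alpha stays at 1 for every later epoch.
print([warmup_alpha(epoch, 2) for epoch in range(5)])
```

Either change alone would be enough to keep the combination convex; applying both simply makes the intent explicit in two places.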

question regarding `_get_parallel_step_context`

Hi, thanks for making the code public!
I have a question regarding the function _get_parallel_step_context. In the line

batch_size, num_steps = current_node.size()

num_steps would always be 1, since current_node reads the prev_a of the TSP state. This means that the branch

if num_steps == 1: # We need to special case if we have only 1 step, may be the first or not

will always be taken, and lines 436 to 449 will never be used. Is this correct, or am I missing something here?
Thanks in advance and looking forward to your reply!
Jingwei

Default value for embed_dim in MultiHeadAttention class will cause an error.

@wouterkool
if you don't provide the embed_dim parameter when instantiating a MultiHeadAttention object, you will get the following error when trying to invoke its forward method:
AttributeError: 'MultiHeadAttention' object has no attribute 'W_out'.
This is due to this piece of code in the constructor:

if embed_dim is not None:
    self.W_out = nn.Parameter(torch.Tensor(n_heads, key_dim, embed_dim))

which will mean you have no W_out if embed_dim was None.

Regd Decoder

Hi WouterKool,

I would like to get some insights into the decoder implementation as it appears (?) to be different from your description in the paper.

Referring to this part in the code:
https://github.com/wouterkool/attention-learn-to-route/blob/master/nets/attention_model.py#L451

From your paper, I understand that the key and context query are used to get the probabilities for each node. In the code, I see that you are using the head (that is, the attention-based convex combination of the values of each node) and logit_K.

How is it different from your description in the paper? What is the function of logit_K?

Thanks,
-Ravi

std / sqrt(len)

Hi,
maybe I'm missing something obvious here, but could you explain why the std is divided again by sqrt(len), as this should already be taken care of in the std calculation?

print("Average cost: {} +- {}".format(np.mean(costs), 2 * np.std(costs) / np.sqrt(len(costs))))

Thanks!
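
One possible reading of the formula in this question: the standard deviation describes the spread of the individual costs, while dividing it by sqrt(len) gives the standard error of the mean, i.e. the uncertainty of the reported average. The factor 2 then gives roughly a 95% interval under a normal approximation. The sketch below uses made-up costs and the standard library instead of NumPy; `statistics.pstdev` matches `np.std` with its default `ddof=0`.

```python
import math
import statistics

# Made-up tour costs, for illustration only.
costs = [3.8, 4.1, 3.9, 4.3, 4.0, 3.7, 4.2, 4.0]

mean = statistics.fmean(costs)
std = statistics.pstdev(costs)        # spread of individual costs
sem = std / math.sqrt(len(costs))     # standard error of the mean
half_width = 2 * sem                  # roughly a 95% interval for the mean

print(f"Average cost: {mean:.3f} +- {half_width:.3f}")
```

The standard deviation does not shrink as more instances are evaluated, whereas the standard error does, which is why the two quantities are not interchangeable.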

Loss converges to negative infinity (passing zero)

Hello Kool:

I am using your code for my paper's experiments, but I encountered a curious training result when I trained the AM.

The loss value decreases continuously, which makes me happy; however, the loss passed zero and continued to decline towards negative infinity after 100 epochs. (By the way, I also encountered this issue before when I trained a GAN model.)

So I wonder what I can do to improve the training process. (PS: I didn't change the loss function or the training code.)

Thank you for your consideration!
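
As background for this question, the sketch below shows why a REINFORCE-style surrogate loss is not bounded below by zero. It assumes the usual form, the batch mean of (cost - baseline) * log-likelihood; the function name and the numbers are hypothetical, and the sketch does not by itself explain an unbounded decrease.

```python
def reinforce_loss(costs, baseline_costs, log_likelihoods):
    """Mean of (cost - baseline) * log-likelihood over a batch."""
    terms = [
        (c - b) * ll
        for c, b, ll in zip(costs, baseline_costs, log_likelihoods)
    ]
    return sum(terms) / len(terms)


# Log-likelihoods of sampled tours are always <= 0.
log_likelihoods = [-12.0, -15.0, -11.0]
baseline = [4.0, 4.0, 4.0]

# Samples worse than the baseline give a negative loss, better ones a positive loss.
worse_than_baseline = reinforce_loss([4.5, 4.2, 4.4], baseline, log_likelihoods)
better_than_baseline = reinforce_loss([3.5, 3.8, 3.6], baseline, log_likelihoods)
print(worse_than_baseline, better_than_baseline)
```

Under this assumption the sign of the loss only reflects whether the sampled tours are worse or better than the baseline, so the average and validation costs are more informative quantities to monitor than the loss value.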

Methods to build the context embedding and query

Hi Wouter,

Thanks for sharing the code, I've tried it and the result is quite promising :D
But I found that the method to build the context embedding as well as query may be different from the paper. In the paper it's built via horizontal concatenation:

kool2019_question

But in the code it seems that the projected graph embedding is added to the projected start and end nodes:
https://github.com/wouterkool/attention-learn-to-route/blob/master/nets/attention_model.py#L349

Do these differences matter?
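
One general fact that may be relevant here: applying a single linear projection to a concatenation gives the same result as summing separate projections of each part, where the separate weight matrices are the column blocks of the single one. The toy check below uses made-up 2-dimensional embeddings and ignores bias terms; it is not the repository code.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


# Hypothetical embeddings for the graph, the first node and the last node.
h_graph, h_first, h_last = [1.0, 2.0], [0.5, -1.0], [3.0, 0.0]

# One projection applied to the concatenation [h_graph, h_first, h_last] ...
W = [
    [1.0, 0.0, 2.0, -1.0, 0.5, 1.0],
    [0.0, 1.0, -2.0, 3.0, 1.5, -1.0],
]
concatenated = matvec(W, h_graph + h_first + h_last)

# ... equals the sum of separate projections using the column blocks of W.
W_graph = [row[0:2] for row in W]
W_first = [row[2:4] for row in W]
W_last = [row[4:6] for row in W]
summed = [
    a + b + c
    for a, b, c in zip(
        matvec(W_graph, h_graph), matvec(W_first, h_first), matvec(W_last, h_last)
    )
]
print(concatenated, summed)
```

A practical consequence of the summed form is that a part which does not change between decoding steps, such as the graph embedding, only needs to be projected once.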

AssertionError in 'State_tsp.py'

When I run 'run.py', there is always an assertion error:

torch.is_tensor(key) or isinstance(key, slice) # If tensor, idx all tensors by this tensor

please help me! Thanks!

2 GPUs less efficient than 1 ?

I really like this work: learning is pretty quick on small instances.
I have tried to learn vrp100 with 2 GPUs and was surprised to discover that it is slower than with only 1 GPU. As it is my first test with multiple GPUs, I may have done something wrong: is there a parameter that I need to set to run the code on multiple GPUs?
This test was done with the original sources, on gcloud with pytorch 0.4.1 (some deprecation warnings). I saw with the nvidia-smi command that both GPUs were busy. Your paper mentions a real gain with 2 GPUs... I'd like to see it as well!
Thanks in advance for any tips!

Manually specifying key_dim in MultiHeadAttention class causing size mismatch

@wouterkool
I think the topic is self-explanatory. Here's a small example that reproduces the error (I was just testing with some random data to understand the forward function):

import torch

# MultiHeadAttention is the class under test (defined in attention.py, per the traceback below)
batch_size = 3
n_query = 2
graph_size = 4
input_dim = 5

h = torch.normal(0, 0.5, [batch_size, graph_size, input_dim])
q = torch.normal(0, 0.5, [batch_size, n_query, input_dim])

attn = MultiHeadAttention(n_heads=6, input_dim=input_dim, embed_dim=12, key_dim=10)

out = attn.forward(q, h)

This will cause the following error:

Traceback (most recent call last):
  File "attention.py", line 123, in <module>
    out = attn.forward(q, h)
  File "attention.py", line 104, in forward
    self.W_out.view(-1, self.embed_dim)
RuntimeError: size mismatch, m1: [6 x 12], m2: [30 x 12] at /opt/conda/conda-bld/pytorch_1591914855613/work/aten/src/TH/generic/THTensorMath.cpp:41

Which is basically saying that the torch.mm function is receiving arguments with bad sizes:

        out = torch.mm(
            heads.permute(1, 2, 0, 3).contiguous().view(-1, self.n_heads * self.val_dim),
            self.W_out.view(-1, self.embed_dim)
        ).view(batch_size, n_query, self.embed_dim)

Changing the constructor arguments and passing key_dim=None makes the error go away.

The PCTSP related beam search issue

Hi, Thanks for open sourcing this excellent code! I am playing with the beam search decoder. However, I found that the beam search does not work correctly for PCTSP (including stochastic variants). The greedy decoder can give me around 3.19 on PCTSP-20, however, beam search with width 1 (this should be the same as greedy) gives me 7.44. I am wondering if there is a bug in the code or I did something wrong?

Thanks!

Beamsearch vs Sampling

Hi, thanks for the code! Could you briefly describe the difference between Sampling from the policy and your beamsearch implementation?

Runtime comparisons

Really enjoyed the paper -- have you done any comparisons between the runtimes of your method and other heuristic solvers (either learned or unlearned)? In Table 2, you show you can get a better quality solution than OR-Tools, but are the runtimes comparable? Would they be comparable as instance size grows?

Thanks
Ben

issue with the class AttentionModelFixed when trying to run simple_tsp.ipynb

Hi! I'm having the following issue when trying to run simple_tsp.ipynb on colab, any quick fix ? Thanks in advance

[*] Loading model from pretrained/tsp_50/epoch-99.pt

AssertionError Traceback (most recent call last)
in ()
52 tour_p = []
53 while(len(tour) < len(xy)):
---> 54 p = oracle(tour)
55
56 if sample:

1 frames
in oracle(tour)
28
29 # Compute query = context node embedding, add batch and step dimensions (both 1)
---> 30 query = fixed.context_node_projected + model.project_step_context(step_context[None, None, :])
31
32 # Create the mask and convert to bool depending on PyTorch version

/content/drive/My Drive/attention-learn-to-route/nets/attention_model.py in __getitem__(self, key)
30
31 def __getitem__(self, key):
---> 32 assert torch.is_tensor(key) or isinstance(key, slice)
33 return AttentionModelFixed(
34 node_embeddings=self.node_embeddings[key],

AssertionError:

the implementation of BatchNorm maybe have a mistake

Thanks for your code. I have a question about the implementation of BN. In line 140 of graph_encoder.py, "self.normalizer(input.view(-1, input.size(-1))).view(*input.size())", you merge batch_size and node_size, but I think the last dimension should be node_size, as in input = input.permute(0, 2, 1).
Please help me.

Runtime Error in Python 3.8

After switching to Python 3.8 I'm getting "RuntimeError: __class__ not set defining 'BatchBeam' as <class 'utils.beam_search.BatchBeam'>. Was __classcell__ propagated to type.__new__?"

After reading this post https://stackoverflow.com/questions/41343263/provide-classcell-example-for-python-3-6-metaclass and doing some initial debugging I'm guessing there should be some more arguments passed to super() method whenever you use it. It seems that all the classes deriving from NamedTuple are affected.

Using drop_last=True in DataLoaders

Currently, an error occurs when the epoch_size argument is not an integer multiple of batch_size.

assert opts.epoch_size % opts.batch_size == 0, "Epoch size must be integer multiple of batch size!"

I assume this is to ensure that no batch contains fewer than batch_size samples.

Is there any reason not to use the drop_last parameter of the DataLoader to discard such batches instead of raising an error?
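
The difference between the two options can be illustrated with a small pure-Python sketch. The helper batch_sizes is hypothetical and only mirrors the documented meaning of drop_last (discard the final incomplete batch). It does not use the repository's code.

```python
def batch_sizes(epoch_size, batch_size, drop_last):
    """Return the size of each batch produced for one epoch."""
    full, rest = divmod(epoch_size, batch_size)
    sizes = [batch_size] * full
    # Without drop_last, the leftover samples form one smaller final batch.
    if rest and not drop_last:
        sizes.append(rest)
    return sizes


print(batch_sizes(10, 4, drop_last=False))  # [4, 4, 2]
print(batch_sizes(10, 4, drop_last=True))   # [4, 4]
```

With drop_last the leftover samples are silently skipped each epoch, whereas the assertion makes the mismatch visible up front.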

Cannot allocate memory

When I run
python run.py --problem sdvrp --graph_size 20 --baseline rollout --run_name 'sdvrp20_rollout'
I get the following output:

{'baseline': 'rollout',
 'batch_size': 512,
 'bl_alpha': 0.05,
 'bl_warmup_epochs': 1,
 'checkpoint_encoder': False,
 'checkpoint_epochs': 1,
 'data_distribution': None,
 'embedding_dim': 128,
 'epoch_size': 1280000,
 'epoch_start': 0,
 'eval_batch_size': 1024,
 'eval_only': False,
 'exp_beta': 0.8,
 'graph_size': 20,
 'hidden_dim': 128,
 'load_path': None,
 'log_dir': 'logs',
 'log_step': 50,
 'lr_critic': 0.0001,
 'lr_decay': 1.0,
 'lr_model': 0.0001,
 'max_grad_norm': 1.0,
 'model': 'attention',
 'n_encode_layers': 3,
 'n_epochs': 100,
 'no_cuda': False,
 'no_progress_bar': False,
 'no_tensorboard': False,
 'normalization': 'batch',
 'output_dir': 'outputs',
 'problem': 'sdvrp',
 'resume': None,
 'run_name': 'sdvrp20_rollout_20210308T181630',
 'save_dir': 'outputs/sdvrp_20/sdvrp20_rollout_20210308T181630',
 'seed': 1234,
 'shrink_size': None,
 'tanh_clipping': 10.0,
 'use_cuda': True,
 'val_dataset': None,
 'val_size': 10000}
Evaluating baseline model on evaluation dataset
Start train epoch 0, lr=0.0001 for run sdvrp20_rollout_20210308T181630
  0%|                                                                                                       | 0/2500 [00:00<?, ?it/s]Traceback (most recent call last):
  File "run.py", line 172, in <module>
    run(get_options())
  File "run.py", line 158, in run
    train_epoch(
  File "/tmp/pycharm_project_356/train.py", line 84, in train_epoch
    for batch_id, batch in enumerate(tqdm(training_dataloader, disable=opts.no_progress_bar)):
  File "/root/miniconda3/envs/python38/lib/python3.8/site-packages/tqdm/std.py", line 1178, in __iter__
    for obj in iterable:
  File "/root/miniconda3/envs/python38/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 355, in __iter__
    return self._get_iterator()
  File "/root/miniconda3/envs/python38/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 301, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/root/miniconda3/envs/python38/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 914, in __init__
    w.start()
  File "/root/miniconda3/envs/python38/lib/python3.8/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/root/miniconda3/envs/python38/lib/python3.8/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/root/miniconda3/envs/python38/lib/python3.8/multiprocessing/context.py", line 277, in _Popen
    return Popen(process_obj)
  File "/root/miniconda3/envs/python38/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/root/miniconda3/envs/python38/lib/python3.8/multiprocessing/popen_fork.py", line 70, in _launch
    self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory

How can I fix this? By the way, which version of TensorFlow do you use?

About num_steps

I am confused about the function _get_parallel_step_context() in nets/attention_model.py#L367.
What does num_steps mean? Could you please explain it in detail?
Thanks! o( ̄▽ ̄)ブ

Question about training VRP model.

Hello, I tried to train a VRP model by adapting the TSP command. I used the command "python run.py --graph_size 20 --baseline rollout --run_name 'vrp20_rollout' --val_dataset data/vrp/vrp20_validation_seed4321.pkl".
But the terminal showed "ValueError: expected sequence of length 2 at dim 1 (got 20)". Could you please tell me how to train a VRP model?

The code gets stuck on line 84 of train.py

When I run the code, it gets stuck on line 84 of train.py:

for batch_id, batch in enumerate(tqdm(training_dataloader, disable=opts.no_progress_bar)):

I found that the dataset contains 1280000 samples, but the enumerate call seems to take too long: the code never even enters the for loop. I ran it for about 1 hour and it still had not entered the loop to execute the train_batch() line.
Can you help me with this problem?

About your modification for Python 3.8

This is related to #16. You removed the super() call from the customized dataloader, which causes the code to fail on Python 3.7. The old code runs fine for me. Just wanted to let you know about this issue.

Masking in SHA

Hi,

Would it be possible to apply masking only in the decoder single head attention? I think we have masking in both MHA and SHA in the decoder.

Best,
Shaghayegh

Problem with enumerate(tqdm(training_dataloader...))

Hi there,

I very much enjoyed your paper. I tried to run the code, but it somehow gets stuck at:

enumerate(tqdm(training_dataloader, disable=opts.no_progress_bar))

I ran the code on Windows using Anaconda. Can you please let me know whether this is normal and I just have to wait?

Thanks.

VRP without demand/capacity

Hello, I'm trying to implement this work to solve a VRP routing problem in my research, but I don't have the demand/capacity requirement. Is there a simple way I could remove this input to have just a standard VRP solver?

About path length

Hi,
I have read your paper and found that in the TSP all paths have the same length, so I wonder whether your method also works when the length of the paths isn't fixed.

Version of Google OR-tools

Hey, we appreciate the great contribution of AM to this field. I was wondering which version of Google OR-Tools you used. I am having trouble running it for PCTSP.
