noble-lab / casanovo Goto Github PK

View Code? Open in Web Editor NEW

102.0 7.0 35.0 26.36 MB

De Novo Mass Spectrometry Peptide Sequencing with a Transformer Model

Home Page: https://casanovo.readthedocs.io

License: Apache License 2.0

Python 100.00%

casanovo's Introduction

Casanovo

De Novo Mass Spectrometry Peptide Sequencing with a Transformer Model

If you use Casanovo in your work, please cite the following publications:

Yilmaz, M., Fondrie, W. E., Bittremieux, W., Oh, S. & Noble, W. S. De novo mass spectrometry peptide sequencing with a transformer model. in Proceedings of the 39th International Conference on Machine Learning - ICML '22 vol. 162 25514–25522 (PMLR, 2022). https://proceedings.mlr.press/v162/yilmaz22a.html
Yilmaz, M., Fondrie, W. E., Bittremieux, W., Melendez, C.F., Nelson, R., Ananth, V., Oh, S. & Noble, W. S. Sequence-to-sequence translation from mass spectra to peptides with a transformer model. in Nature Communications 15, 6427 (2024). doi:10.1038/s41467-024-49731-x

Documentation

https://casanovo.readthedocs.io/en/latest/

casanovo's People

Contributors

Stargazers

Watchers

casanovo's Issues

Add delta mass to model output

Document model configuration and options

The documentation should have detailed information about how to configure casanovo.

Add issue template for bugs, features etc.

This can wait until we have release notes and versions for Casanovo

I've been trying to run your code on Spyder through Anaconda, but I've been having trouble running the following line:
>>run casanovo --mode=denovo --model_path='pretrained_excl_clambacteria.ckpt' --test_data_path='dark_control_1.mgf' --config_path='config' --output_path='test.csv'

The following error is reported back:
**Traceback (most recent call last):

File "F:\Applications\Noble\casanovo\casanovo.py", line 3, in
from casanovo.denovo import train, test_evaluate, test_denovo

ModuleNotFoundError: No module named 'casanovo.denovo'; 'casanovo' is not a package**

I've noticed that some of the files being called appear to have different names to their current version on GitHub.
E.g. in casanovo.py, it tries to import 'train', 'test_evaluate', and 'test_denovo' from casanovo/denovo.
However, in this directory the files are named 'train_test', 'evaluate', and 'model'.

Having played around changing the relevant lines in this file to get it to work, it brought forth more import errors from other files.
Eventually 'fixing' them produces a circular import so I don't think what I've done is correct.
Do you know what might be causing this error? Am I correct in thinking these files are currently mis-named?

Any help would be greatly appreciated.
Best wishes,
Alex

de novo sequencing without evaluation

Greetings,

Firstly, I would like to thank you for providing this open source tool. I am currently interested to perform de novo sequencing without evaluation. Based on your github page, it was shown to run the command of:

casanovo --mode=denovo --model_path='path/to/pretrained' --test_data_path='path/to/test' --config_path='path/to/config' --output_path='path/to/output'

Could you assist me on where to obtain the --config_path file?
Does the --test_data_path mean the .raw proteomics file?
Do you have an example output file from Casanovo?

Regards,
Ben

pip install casanovo failed

Hi, I failed to pip install casanovo with many wrongs.
Is there a method that I can use python to run this project?

Add all de novo sequencing related arguments to CLI

Consider adding preprocess_spec , test_batch_size, num_workers, gpus

Errror running example job

When running the provided following command:
casanovo --mode=denovo --model=22-07-02_weights/pretrained_excl_mouse.ckpt --peak_path=sample_data/sample_preprocessed_spectra.mgf --config=casanovo/config.yaml

This error shows:
File "D:\miniconda3\lib\site-packages\torch\nn\modules\module.py", line 1497, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for Spec2Pep:
size mismatch for decoder.aa_encoder.weight: copying a param with shape torch.Size([25, 512]) from checkpoint, the shape in current model is torch.Size([29, 512]).
size mismatch for decoder.final.weight: copying a param with shape torch.Size([25, 512]) from checkpoint, the shape in current model is torch.Size([29, 512]).
size mismatch for decoder.final.bias: copying a param with shape torch.Size([25]) from checkpoint, the shape in current model is torch.Size([29]).

Did you recently updated depthcharge-ms, the pretrained weights, the sample .mgf file or config.yaml?
Shall I try any older version of any of these files? Thanks!

Add simple tutorial with sample data

Step by step guide to de novo sequencing with Casanovo with a sample mgf file.

config.yaml in small walkthrough example

I just tried to run the "small walkthrough" that's in the Casanovo readme, but it didn't work. I thought it was strange that the example makes no mention of the yaml configuration file. Sure enough, when I ran the example, I got an error message saying it couldn't find that file:

casanovo --mode=denovo --model_path=../../../data/22-07-02_weights/pretrained_excl_mouse.ckpt --test_data_path=mgf
/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/torch/cuda/__init__.py:83: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 9010). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at  ../c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0
Traceback (most recent call last):
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/bin/casanovo", line 8, in <module>
    sys.exit(main())
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/casanovo/casanovo.py", line 58, in main
    with open(abs_path) as f:
FileNotFoundError: [Errno 2] No such file or directory: '/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/casanovo/config.yaml'

We need to update the README to describe how to set up the config file.

Where is the folder ‘depthcharge’？

I failed to find the ‘depthcharge’ folder

Casanovo should implement the precursor filter

We should implement the precursor mass filter in Casanovo by subtracting 1 from the peptide-level score if the peptide mass is outside of a user-specified range. This range should be specified in the yaml file in units of ppm. We can name it precursor_window and have a default value of 50.

Add citation information and link to paper

Add more detailed installation instructions

This should tell users what they need to have installed before installing casanovo (Python >= 3.7, < 3.10).
Add instructions on how to set up a python/conda enviroment to install casanovo.
It should also link to PyTorch so users can install the version appropriate for their GPU and CUDA version.

f.suffix.lower() erroneous for training mgf file

One minor thing - in casanovo/denovo/train_test.py, the f.suffix.lower() part is consistently throwing an error, saying str doesn't have attribute suffix. I have changed it to os.path.splitext(f)[1].lower() instead and it works fine.

RuntimeError: CUDA out of memory.

I tried to use Casanovo to make predictions on an MGF file containing 31,078 spectra, and it ran out of memory. Is there anything I can do to mitigate this problem, other than breaking up the input file into small pieces or switching to a different machine?

casanovo --mode=denovo --model_path=/net/noble/vol1/home/noble/proj/2022_varun_ls-casanovo/data/22-07-02_weights/pretrained_excl_mouse.ckpt --test_data_path=20190227_231_15%_1 --output_path=20190227_231_15%_1 --config_path=config.yaml
INFO: De novo sequencing with Casanovo...
INFO: Created a temporary directory at /tmp/tmpzqps6s6h
INFO: Writing /tmp/tmpzqps6s6h/_remote_module_non_scriptable.py
INFO: Reading 1 files...
20190227_231_15%_1/20190227_231_15%_1.mgf: 31078spectra [00:08, 3647.09spectra/s]
/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py:287: LightningDeprecationWarning: Passing `Trainer(accelerator='ddp')` has been deprecated in v1.5 and will be removed in v1.7. Use `Trainer(strategy='ddp')` instead.
  f"Passing `Trainer(accelerator={self.distributed_backend!r})` has been deprecated"
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
[W socket.cpp:401] [c10d] The server socket cannot be initialized on [::]:55938 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [localhost]:55938 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [localhost]:55938 (errno: 97 - Address family not supported by protocol).
INFO: Added key: store_based_barrier_key:1 to store for rank: 0
INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Testing:  35% 11/31 [02:46<03:05,  9.26s/it]Traceback (most recent call last):
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/bin/casanovo", line 8, in <module>
    sys.exit(main())
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/casanovo/casanovo.py", line 83, in main
    denovo(test_data_path, model_path, config, output_path)  
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/casanovo/denovo/train_test.py", line 246, in denovo
    trainer.test(model_trained, loaders.test_dataloader())
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 911, in test
    return self._call_and_handle_interrupt(self._test_impl, model, dataloaders, ckpt_path, verbose, datamodule)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 954, in _test_impl
    results = self._run(model, ckpt_path=self.tested_ckpt_path)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1199, in _run
    self._dispatch()
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1275, in _dispatch
    self.training_type_plugin.start_evaluating(self)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 206, in start_evaluating
    self._results = trainer.run_stage()
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1286, in run_stage
    return self._run_evaluate()
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1334, in _run_evaluate
    eval_loop_results = self._evaluation_loop.run()
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 110, in advance
    dl_outputs = self.epoch_loop.run(dataloader, dataloader_idx, dl_max_batches, self.num_dataloaders)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 122, in advance
    output = self._evaluation_step(batch, batch_idx, dataloader_idx)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 213, in _evaluation_step
    output = self.trainer.accelerator.test_step(step_kwargs)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 247, in test_step
    return self.training_type_plugin.test_step(*step_kwargs.values())
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 450, in test_step
    return self.lightning_module.test_step(*args, **kwargs)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/casanovo/denovo/model.py", line 403, in test_step
    pred_seqs, scores = self.predict_step(batch)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/casanovo/denovo/model.py", line 188, in predict_step
    return self(batch[0], batch[1])
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/casanovo/denovo/model.py", line 163, in forward
    scores, tokens = self.greedy_decode(spectra, precursors)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/casanovo/denovo/model.py", line 212, in greedy_decode
    memories, mem_masks = self.encoder(spectra)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/depthcharge/components/transformers.py", line 105, in forward
    return self.transformer_encoder(peaks, src_key_padding_mask=mask), mask
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/torch/nn/modules/transformer.py", line 238, in forward
    output = mod(output, src_mask=mask, src_key_padding_mask=src_key_padding_mask)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/torch/nn/modules/transformer.py", line 456, in forward
    src_mask if src_mask is not None else src_key_padding_mask,
RuntimeError: CUDA out of memory. Tried to allocate 714.00 MiB (GPU 0; 7.79 GiB total capacity; 2.46 GiB already allocated; 632.94 MiB free; 3.65 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Training fails upon empty spectrum

Hello again,

For some reason, the on_train_epoch_end hook is called before training is complete. I am trying to train with a pretty large dataset (~1.2M spectra between the training and validation sets, with 1093774 spectra (17,091 batches of 64) for training and 123930 spectra (1,937 batches of 64) for validation). For some reason, casanovo is thinking that training is done after only 11,927 batches (which is significantly less than the 17,091 expected) and so calls on_train_epoch_end, which looks for a history to add the train loss to and fails:

...
/home/ubuntu/smsnet_val_data/1049.mgf: 46spectra [00:00, 2628.76spectra/s]
/home/ubuntu/smsnet_val_data/1276.mgf: 62spectra [00:00, 2859.35spectra/s]
/home/ubuntu/smsnet_val_data/1073.mgf: 1spectra [00:00, 4156.89spectra/s]
/home/ubuntu/smsnet_val_data/1224.mgf: 4spectra [00:00, 5257.67spectra/s]
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
Global seed set to 454
initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
INFO: Added key: store_based_barrier_key:1 to store for rank: 0
INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name    | Type             | Params
---------------------------------------------
0 | encoder | SpectrumEncoder  | 18.9 M
1 | decoder | PeptideDecoder   | 28.4 M
2 | softmax | Softmax          | 0     
3 | celoss  | CrossEntropyLoss | 0     
---------------------------------------------
47.3 M    Trainable params
0         Non-trainable params
47.3 M    Total params
189.387   Total estimated model params size (MB)
/home/ubuntu/mambaforge/envs/denovo/lib/python3.9/site-packages/pytorch_lightning/callbacks/model_checkpoint.py:631: UserWarning: Checkpoint directory /home/ubuntu/checkpoints exists and is not empty.
  rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")
Epoch 0:   0%|                                                                                                                                        | 0/19028 [00:00<?, ?it/s]/home/ubuntu/mambaforge/envs/denovo/lib/python3.9/site-packages/pytorch_lightning/utilities/data.py:59: UserWarning: Trying to infer the `batch_size` from an ambiguous collection. The batch size we found is 64. To avoid any miscalculations, use `self.log(..., batch_size=batch_size)`.
  warning_cache.warn(
Epoch 0:   0%|                                                                                                                  | 1/19028 [00:01<6:18:18,  1.19s/it, loss=0.868]INFO: Reducer buckets have been rebuilt in this iteration.
Epoch 0:  50%|█████████████████████████████████████████████████████████▏                                                        | 9549/19028 [30:34<30:20,  5.21it/s, loss=1.03]Epoch 0:  50%|█████████████████████████████████████████████████████████▎                                                        | 9575/19028 [30:39<30:16,  5.20it/s, loss=1.04]
Epoch 0:  50%|█████████████████████████████████████████████████████████▎                                                        | 9576/19028 [30:40<30:16,  5.20it/s, loss=1.04]

Epoch 0:  50%|█████████████████████████████████████████████████████████▍                                                        | 9577/19028 [30:40<30:16,  5.20it/s, loss=1.03]
Epoch 0:  50%|█████████████████████████████████████████████████████████▍                                                        | 9582/19028 [30:41<30:15,  5.20it/s, loss=1.01]
Epoch 0:  50%|█████████████████████████████████████████████████████████▍                                                        | 9583/19028 [30:41<30:15,  5.20it/s, loss=1.01]
Epoch 0:  50%|█████████████████████████████████████████████████████████▍                                                        | 9584/19028 [30:41<30:14,  5.20it/s, loss=0.99]
Epoch 0:  50%|████████████████████████████████████████████████████████▉                                                        | 9585/19028 [30:42<30:14,  5.20it/s, loss=0.986]
Epoch 0:  50%|████████████████████████████████████████████████████████▉                                                        | 9586/19028 [30:42<30:14,  5.20it/s, loss=0.975]
Epoch 0:  63%|██████████████████████████████████████████████████████████████████████▏                                         | 11929/19028 [39:47<23:40,  5.00it/s, loss=0.932]Traceback (most recent call last):
  File "/home/ubuntu/mambaforge/envs/denovo/bin/casanovo", line 8, in <module>
    sys.exit(main())
  File "/home/ubuntu/mambaforge/envs/denovo/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/ubuntu/mambaforge/envs/denovo/lib/python3.9/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/ubuntu/mambaforge/envs/denovo/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ubuntu/mambaforge/envs/denovo/lib/python3.9/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/ubuntu/mambaforge/envs/denovo/lib/python3.9/site-packages/casanovo/casanovo.py", line 32, in main
    train(train_data_path, val_data_path, model_path, config_path)
  File "/home/ubuntu/mambaforge/envs/denovo/lib/python3.9/site-packages/casanovo/denovo/train_test.py", line 134, in train
    trainer.fit(model, train_loader.train_dataloader(), val_loader.val_dataloader())
  File "/home/ubuntu/mambaforge/envs/denovo/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 740, in fit
    self._call_and_handle_interrupt(
  File "/home/ubuntu/mambaforge/envs/denovo/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/ubuntu/mambaforge/envs/denovo/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/ubuntu/mambaforge/envs/denovo/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1199, in _run
    self._dispatch()
  File "/home/ubuntu/mambaforge/envs/denovo/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1279, in _dispatch
    self.training_type_plugin.start_training(self)
  File "/home/ubuntu/mambaforge/envs/denovo/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
    self._results = trainer.run_stage()
  File "/home/ubuntu/mambaforge/envs/denovo/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1289, in run_stage
    return self._run_train()
  File "/home/ubuntu/mambaforge/envs/denovo/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1319, in _run_train
    self.fit_loop.run()
  File "/home/ubuntu/mambaforge/envs/denovo/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/home/ubuntu/mambaforge/envs/denovo/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 234, in advance
    self.epoch_loop.run(data_fetcher)
  File "/home/ubuntu/mambaforge/envs/denovo/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 151, in run
    output = self.on_run_end()
  File "/home/ubuntu/mambaforge/envs/denovo/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 298, in on_run_end
    self.trainer.call_hook("on_train_epoch_end")
  File "/home/ubuntu/mambaforge/envs/denovo/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1501, in call_hook
    output = model_fx(*args, **kwargs)
  File "/home/ubuntu/mambaforge/envs/denovo/lib/python3.9/site-packages/casanovo/denovo/model.py", line 425, in on_train_epoch_end
    self._history[-1]["train"] = train_loss
IndexError: list index out of range
Epoch 0:  63%|██████▎   | 11929/19028 [39:51<23:43,  4.99it/s, loss=0.932]

I tried wrapping self._history[-1]["train"] = train_loss in a try and except where if the self._history is not of length at least 1, this step is skipped. However, somehow this skips validation entirely and goes to the next training epoch without validation:

/home/ubuntu/mambaforge/envs/denovo/lib/python3.9/site-packages/pytorch_lightning/utilities/data.py:59: UserWarning: Trying to infer the `batch_size` from an ambiguous collection. The batch size we found is 64. To avoid any miscalculations, use `self.log(..., batch_size=batch_size)`.
  warning_cache.warn(
Global seed set to 454                                                                                                                                                          
Epoch 0:   0%|                                                                                                          | 1/19028 [00:01<6:21:44,  1.20s/it, loss=3.5, lr=0.000]INFO: Reducer buckets have been rebuilt in this iteration.
Epoch 0:  63%|███████████████████████████████████████████████████████████████▎                                     | 11929/19028 [39:49<23:41,  4.99it/s, loss=2.17, lr=5.96e-5]INFO: ---------------------------------------------------------------------------------------------------------
INFO:   Epoch |   Train Loss  |  Valid Loss | Valid AA precision | Valid AA recall | Valid Peptide recall 
INFO: ---------------------------------------------------------------------------------------------------------
INFO:       0 |      2.398110 |      3.474908 |      0.000851 |      0.007873 |      0.000000 
Epoch 1:   4%|████▍                                                                                                  | 829/19028 [02:41<59:12,  5.12it/s, loss=2.22, lr=6.37e-5]

I also added the learning rate to this printout just to make sure it hadn't stopped learning or something funky with learning rate had made it stop training. I am running with the same warmup and max iteration configuration that you are running with (100000, 600000). Is there anything that you have encountered that would suggest where the problem is coming from?

TypeError: init() got an unexpected keyword argument 'batch_first'

Hi,
I have tried two ways to run this project but both failed.
Method 1: casanovo --mode=eval --model_path='./casanovo_pretrained_model_weights/' --test_data_path='./Casanovo_preprocessed_data/' --config_path='../casanovo/config.py'
I got

INFO: Evaluating Casanovo...
Traceback (most recent call last):
  File "/usr/local/bin/casanovo", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/casanovo/casanovo.py", line 37, in main
    test_evaluate(test_data_path, model_path, config_path)
  File "/usr/local/lib/python3.8/dist-packages/casanovo/denovo/train_test.py", line 146, in test_evaluate
    model_trained = Spec2Pep().load_from_checkpoint(
  File "/usr/local/lib/python3.8/dist-packages/casanovo/denovo/model.py", line 103, in __init__
    self.encoder = SpectrumEncoder(
  File "/usr/local/lib/python3.8/dist-packages/depthcharge/components/transformers.py", line 60, in __init__
    layer = torch.nn.TransformerEncoderLayer(
TypeError: __init__() got an unexpected keyword argument 'batch_first'

Method 2: python casanovo.py --mode=eval --model_path='../../data/casanovo_pretrained_model_weights/' --test_data_path='../../data/Casanovo_preprocessed_data/' --config_path='config.py'
I got:

Traceback (most recent call last):
  File "casanovo.py", line 3, in <module>
    from casanovo.denovo import train, test_evaluate, test_denovo
  File "/home/bio/casanovo-main/casanovo/casanovo.py", line 3, in <module>
    from casanovo.denovo import train, test_evaluate, test_denovo
ModuleNotFoundError: No module named 'casanovo.denovo'; 'casanovo' is not a package

Casanovo should automatically download the pre-trained model weights and be able to use them easily.

IndexError on particularly ugly spectrum

With --mode=denovo and pretrained human weights this spectrum:

BEGIN IONS
TITLE=crasher
RTINSECONDS=2800
PEPMASS=1551
CHARGE=4+
1552.40144197 301.23929755
1551.20606409 1051.24587242
1550.88090936 1053.2816881
END IONS

raises an IndexError at https://github.com/Noble-Lab/casanovo/blob/main/casanovo/data/datasets.py#L185

To trigger, the peaks need to be sufficiently close to each other and to PEPMASS.

Max out the num_workers

See if we can assign the maximum num_workers, have to check PyTorch Lightning Documentation

Track input spectra origin

Currently predicted PSMs are identified by the index of their spectra in the data loader. However, when predicting from multiple input files, this doesn't include information from which file the spectra came, pretty much making the index useless.

Can we track the origin of the input spectra, i.e. using filename and scan number? This probably needs to be modified in the index in depthcharge, and should then be passed through when predicting (but no need to move that information into a tensor on the GPU) so that it is available when writing the output results.

Results

Hello!

I am interested in using Casanovo for inference, and I just wanted to make sure that I am using it correctly.

I am using PyTorch version 1.10.2 with Cuda 11.3 on a machine with a GPU and 8 workers, and adjusted the config.py file accordingly:

#Hardware options
num_workers = 8
gpus = [0] #None for CPU, int list to specify GPUs

When I run inference on the attached file (renamed to test.mgf.txt so it can attach here), I get the folllowing result:

spectrum_id,denovo_seq
0,LLAETLLR

However, I know that the inference from MSFragger yields QLEQVIAK, which is the true peptide. I am using the pretrained human weights. Would you mind terribly testing the algorithm on the attached file to see if I am simply using Casanovo wrong? Thank you so much, and I really appreciate the help.

test.mgf.txt

MGF parser failed

Hi~ I got the following error. My mgf file was converted from the .raw file using msconvert.

File "/home/songjian/anaconda2/envs/casanovo/lib/python3.7/site-packages/depthcharge/data/parsers.py", line 157, in parse_spectrum
self.annotations.append(spectrum["params"]["seq"])

Is there something wrong with me?

add a random vector to peaks encoded vector

Hi,
I read the code but can not understand the following:

        self.latent_spectrum = torch.nn.Parameter(torch.randn(1, 1, dim_model))

        # Add the spectrum representation to each input:
        latent_spectra = self.latent_spectrum.expand(peaks.shape[0], -1, -1)

        peaks = torch.cat([latent_spectra, peaks], dim=1)

why we should add a random vector to the peaks encoded vector?

Thanks for your kind answer.

Solve pytorch ligtning warning flags and compatibility with version>= 1.6

Create a documentation website

Use mkdocs (https://www.mkdocs.org/). Alternatively sphinx (https://www.sphinx-doc.org/en/master/)

Fix parentheses

See validation data in the below line.

casanovo/casanovo/denovo/train_test.py

Line 25 in a7cefa2

    
           train_mgf_files = [train_data_path / f for f in os.listdir(train_data_path) if train_data_path/f.suffix.lower() == ".mgf"]

parameter of reversing the sequence

Hi,
the param 'reverse_peptide_seqs' in config.py is defined but not used anywhere else.
Btw, why we should reverse the peptide sequence when encoding them?

Thanks for your kind answer.

Write unit tests

At least adapt the tests from Depthcharge. Other tests to be written TBD.

Data path compatibility in Windows

Hi Melih,

I think I have solved the install issue.
I was trying to install it with Python 3.10 and it appears torch is not fully supported on it yet and was going unreported when trying to install through Spyder.
Downgrading Python to 3.9 and installing through the command line correctly installed it and the packages are called perfectly.

I have however had issues trying to load in my datasets.
The example command is:
casanovo --mode=denovo --model_path='path/to/pretrained' --test_data_path='path/to/test' --config_path='path/to/config' --output_path='path/to/output'

I've tried different combinations of the following for the paths but have had no luck.

Specifying the paths as: path='path/to/file.mgf' and path='path/to/folder'
Writing the paths as: path='paths/specified/with/quotes' and path=paths/without/quotes.
Writing the paths with: path='paths/with/forward/slash' and path='paths\with\back\slash'.
I've put a few examples at the end of this message.

Could you please give some guidance on the best way to call the script with the correct arguments on Windows?

Best wishes,
Alex

A couple of examples:

casanovo --mode=denovo --model_path='F:\PhD_Data\Casanovo_Data' --test_data_path='F:\PhD_Data\Casanovo_Data' --config_path=F:\PhD_Data\Casanovo_Data' --output_path='F:\PhD_Data\Casanovo_Data'
casanovo --mode=denovo --model_path='F:\PhD_Data\Casanovo_Data\pretrained_excl_clambacteria.ckpt' --test_data_path='F:\PhD_Data\Casanovo_Data\dark_control_1.mgf' --config_path='F:\PhD_Data\Casanovo_Data\config_params.py' --output_path='F:\PhD_Data\Casanovo_Data\Output.csv'
casanovo --mode=denovo --model_path=F:\PhD_Data\Casanovo_Data\pretrained_excl_clambacteria.ckpt --test_data_path=F:\PhD_Data\Casanovo_Data\dark_control_1.mgf --config_path=F:\PhD_Data\Casanovo_Data\config_params.py --output_path=F:\PhD_Data\Casanovo_Data\Output.csv
casanovo --mode=denovo --model_path=F:/PhD_Data/Casanovo_Data/pretrained_excl_clambacteria.ckpt --test_data_path=F:/PhD_Data/Casanovo_Data/dark_control_1.mgf --config_path=F:/PhD_Data/Casanovo_Data/config_params.py --output_path=F:/PhD_Data/Casanovo_Data/Output.csv

Originally posted by @alexmunroclark in #2 (comment)

.d to mgf conversion for de novo

Hi,
I'd like to use a timsTOF .d file as an input. I've tried converting this to a .mgf with multiple tools yet I continually get an error. How can I process this?
Below is a screenshot of the error I get.
Thank you,
Jonathan

Casanovo fails when output path is not explicitly specified

It looks like if you don't specify --output_path=foo.tsv then Casanovo fails with this error message:

 File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/casanovo/denovo/model.py", line 435, in on_test_epoch_end
    with open(os.path.join(str(self.output_path),'casanovo_output.csv'), 'w') as f: 
FileNotFoundError: [Errno 2] No such file or directory: 'None/casanovo_output.csv'

FYI, the full output is below.

 casanovo --mode=denovo --model_path=../../../data/22-07-02_weights/pretrained_excl_mouse.ckpt --test_data_path=mgf --config_path=config.yaml
INFO: De novo sequencing with Casanovo...
INFO: Created a temporary directory at /tmp/tmpfm17yd8s
INFO: Writing /tmp/tmpfm17yd8s/_remote_module_non_scriptable.py
INFO: Reading 1 files...
mgf/2022_01_13_HAB_timeseries_DDA_54_17-17.mgf: 53109spectra [00:18, 2879.99spectra/s]
/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py:287: LightningDeprecationWarning: Passing `Trainer(accelerator='ddp')` has been deprecated in v1.5 and will be removed in v1.7. Use `Trainer(strategy='ddp')` instead.
  f"Passing `Trainer(accelerator={self.distributed_backend!r})` has been deprecated"
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
[W socket.cpp:401] [c10d] The server socket cannot be initialized on [::]:36057 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [localhost]:36057 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [localhost]:36057 (errno: 97 - Address family not supported by protocol).
INFO: Added key: store_based_barrier_key:1 to store for rank: 0
INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Testing: 100% 52/52 [12:38<00:00,  9.69s/it]Traceback (most recent call last):
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/bin/casanovo", line 8, in <module>
    sys.exit(main())
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/casanovo/casanovo.py", line 83, in main
    denovo(test_data_path, model_path, config, output_path)  
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/casanovo/denovo/train_test.py", line 246, in denovo
    trainer.test(model_trained, loaders.test_dataloader())
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 911, in test
    return self._call_and_handle_interrupt(self._test_impl, model, dataloaders, ckpt_path, verbose, datamodule)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 954, in _test_impl
    results = self._run(model, ckpt_path=self.tested_ckpt_path)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1199, in _run
    self._dispatch()
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1275, in _dispatch
    self.training_type_plugin.start_evaluating(self)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 206, in start_evaluating
    self._results = trainer.run_stage()
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1286, in run_stage
    return self._run_evaluate()
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1334, in _run_evaluate
    eval_loop_results = self._evaluation_loop.run()
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 151, in run
    output = self.on_run_end()
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 134, in on_run_end
    self._on_evaluation_epoch_end()
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 241, in _on_evaluation_epoch_end
    self.trainer.call_hook(hook_name)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1501, in call_hook
    output = model_fx(*args, **kwargs)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/casanovo/denovo/model.py", line 435, in on_test_epoch_end
    with open(os.path.join(str(self.output_path),'casanovo_output.csv'), 'w') as f: 
FileNotFoundError: [Errno 2] No such file or directory: 'None/casanovo_output.csv'

Provide model output in mzTab format

Use types for our command line arguments and improve --help doc

Right now, all of them say TEXT - For example, the --mode option looks like it takes two choices, but I don’t know what those are purely from looking at the help (in general, if an option has two choices, it would be better served by a flag instead). Try to be more descriptive with those options and improve documentation for --help.

Another format for config file

A .py file is not a proper config file format - Choose a standard such as .ini (https://docs.python.org/3/library/configparser.html), TOML, or YAML.

Document input and output

On README.md and on the website.

CPU mode does not work

I'd like to run Casanovo on a machine with no GPU, just in "de novo" mode (i.e., without training the model). The sample config.yaml file indicates that this is possible:

gpus: [0] #None for CPU, int list to specify GPUs

However, when I changed the above line to

gpus: None #None for CPU, int list to specify GPUs

I still get an error message:

 File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/utilities/device_parser.py", line 131, in _normalize_parse_gpu_string_input
    return int(s.strip())
ValueError: invalid literal for int() with base 10: 'None'

The full output is listed below.

casanovo --mode=denovo --model_path=../../../data/22-07-02_weights/pretrained_excl_mouse.ckpt --test_data_path=mgf --config_path=config.yaml
/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/torch/cuda/__init__.py:83: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 9010). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at  ../c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0
INFO: De novo sequencing with Casanovo...
INFO: Created a temporary directory at /tmp/tmp5l3bzhid
INFO: Writing /tmp/tmp5l3bzhid/_remote_module_non_scriptable.py
INFO: Reading 1 files...
mgf/2022_01_13_HAB_timeseries_DDA_54_17-17.mgf: 53109spectra [00:47, 1122.26spectra/s]
Traceback (most recent call last):
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/bin/casanovo", line 8, in <module>
    sys.exit(main())
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/casanovo/casanovo.py", line 83, in main
    denovo(test_data_path, model_path, config, output_path)  
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/casanovo/denovo/train_test.py", line 242, in denovo
    num_sanity_val_steps=config['num_sanity_val_steps']
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py", line 38, in insert_env_defaults
    return fn(self, **kwargs)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 426, in __init__
    gpu_ids, tpu_cores = self._parse_devices(gpus, auto_select_gpus, tpu_cores)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1543, in _parse_devices
    gpu_ids = device_parser.parse_gpu_ids(gpus)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/utilities/device_parser.py", line 78, in parse_gpu_ids
    gpus = _normalize_parse_gpu_string_input(gpus)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/utilities/device_parser.py", line 131, in _normalize_parse_gpu_string_input
    return int(s.strip())
ValueError: invalid literal for int() with base 10: 'None'

About output file

Hi,
This question may be a little silly. In the peptides of the output file, there are peptides such as "C+57.021GHTNNLRPK". I don't quite understand what "+57.021" means. The peptide predicted by the algorithm is "CGHTNNLRPK" or there are some unknown amino acid between C and GHTNNLRPK?

0 | LAHYNKR | 0.991221964
1 | VKEDPDGEAHR | 0.965903959
2 | C+57.021GHTNNLRPK | 0.991358876
3 | VVQEQGTHPK | 0.987940228
4 | KGKPELR | 0.991316216
5 | SLSHSPGK | 0.993293308

Thanks,
LeeLee

"suffix" error + strange issue with validation data

Hi @melihyilmaz,

One minor thing - in casanovo/denovo/train_test.py, the f.suffix.lower() part is consistently throwing an error, saying str doesn't have attribute suffix. I have changed it to os.path.splitext(f)[1].lower() instead and it works fine.

On another note, I am trying to train on a new dataset and am running into an issue with how the number of batches are calculated. I adjust the training and validation batch sizes to both be 64 in the config file.

However, when I do this with the attached files as train and validation data respectively - command is

casanovo --mode=train --model_path=/home/ubuntu/darkspectra/casanovo_pretrained_model_weights/pretrained_excl_human.ckpt --train_data_path=/home/ubuntu/play_train_data/  --val_data_path=/home/ubuntu/play_val_data

where I stick train.mgf.txt in play_train_data and val.mgf.txt in play_val_data (and rename them back to .mgf), I am seeing that the progress bar computes the number of train batches based on the number of combined spectra between the two files ((1698+713)/64 = 38, as opposed to the desired 27).

Can you replicate this behavior with these files? Thank you very much.

train.mgf.txt
val.mgf.txt

EDIT: It doesn't seem to be an issue with the number of batches like I suspected but rather with the validation data, since I tried swapping out the validation data with just the same training data file and it seems to work...

CUDA out of memory using a custom mgf file

Hi~
I trained Casanovo using my our mgf file （https://figshare.com/articles/dataset/Casanovo-Train-MGF/19204794， size 85M）. But the log reported:

attn = torch.bmm(q, k.transpose(-2, -1))
RuntimeError: CUDA out of memory. Tried to allocate 2.62 GiB (GPU 1; 10.76 GiB total capacity; 7.72 GiB already allocated; 1.84 GiB free; 7.74 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I used the default config.py (batch size 32) and 2080Ti.

Is there something wrong with my mgf file or else?

Casanovo predicts for invalid spectra

Invalid spectra after preprocessing are replaced by a dummy spectrum, but Casanovo still predicts a peptide for them. The resulting predictions are naturally incorrect, consisting of long peptide sequences with low(ish) scores (but not obviously wrong).

Instead invalid spectra should be filtered out or no prediction should be given. The former is probably better, because it might be a factor during training as well. I haven't fully been able to figure out how to skip items in the dataloader though.

pretrained_excl_yeast.ckpt not compatible with current implementation

The pretrained_excl_yeast.ckpt checkpoint has some missing/unexpected keys, it also differs in size compared to all other checkpoints (by a few bytes):

Missing key(s) in state_dict: "encoder.peak_encoder.sin_term", "encoder.peak_encoder.cos_term", "encoder.peak_encoder.int_encoder.weight", "decoder.charge_encoder.weight", "decoder.mass_encoder.sin_term", "decoder.mass_encoder.cos_term". 
	Unexpected key(s) in state_dict: "encoder.mz_encoder.sin_term", "encoder.mz_encoder.cos_term", "encoder.mz_encoder.linear.weight", "decoder.precursor_encoder.weight", "decoder.precursor_encoder.bias".

Correct type of YAML config values

Make sure that the YAML config values are parsed as the correct type. I.e. learning rate 5e-4 should be a float, not a string.

About depthcharge

Hi,

I use pip install git+https://github.com/Noble-Lab/casanovo.git#egg=casanovo code to install casanovo, Then I ran into some trouble, the following is my error message:

ERROR: Could not find a version that satisfies the requirement depthcharge-ms (unavailable) (from casanovo) (from versions: none)
ERROR: No matching distribution found for depthcharge-ms (unavailable)

I also tried installing from https://github.com/wfondrie/depthcharge, But it doesn't seem to work properly. The following is the error message:
ModuleNotFoundError: No module named 'depthcharge.embed'
how can I solve this problem?

Thanks,
LeeLee

Check if input files in the mzML, mzXML format work

Implement if not

Start release notes

Bullet list of what’s added and when bottom of README.md e.g. release 1.1 22-06-03: - config file format changed etc. ...

Output file naming

I think we should add a command line parameter such as --output-root that specifies the root of the output files produced by Casanovo. Then we should have two files, .casanovo.txt and .casanovo.log.txt, where the former contains the PSMs and the latter contains all of the messages sent to stderr.

train_data_path does not support multiple mgf files

Hi~
Thanks for your wonderful work on de novo sequencing. When I trained a model using multiple mgf files, errors happened.
My command:
casanovo --model=train --train_data_path=/my-mgf-folder ......
The error:
ValueError(f"Only MGF files are supported.")