huangnengcsu / sacall-basecaller

Python 87.79% Shell 12.21%


sacall-basecaller's Issues

Training script for the transformer

I was wondering if there might be a separate script to train the transformer model as defined by Transformer in ./transformer/modules.py? Something similar to ./ctc_train.py.

Many thanks in advance!

Fasta file after basecalling empty

Hi,
I am interested in using your basecaller. I have trained it on my own data, but the FASTA file produced by the basecalling step is empty.

I traced the problem to the decoding step that uses the ctcdecode module: the variable seq_lens, which is returned by the decode function of your BeamCTCDecoder class in ctc_decoder.py (line 130), contains only zeros. The same class then uses this variable in its convert_to_strings function, so the decoded sequence comes out empty. Can you help me with this issue?

Many thanks,
Monika
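
To illustrate the failure mode described above, here is a minimal sketch in plain Python. It assumes, as the report suggests, that convert_to_strings truncates each decoded beam to the length given in seq_lens. The function body, the label alphabet, and the sample values are hypothetical and are not taken from ctc_decoder.py.

```python
# Hypothetical stand-in for the conversion step; not the code from ctc_decoder.py.
LABELS = "-ACGT"  # assumed alphabet, with index 0 as the CTC blank


def convert_to_strings(sequences, seq_lens):
    """Turn each token-index list into a string, keeping only seq_lens[i] tokens."""
    results = []
    for tokens, length in zip(sequences, seq_lens):
        results.append("".join(LABELS[t] for t in tokens[:length]))
    return results


decoded = [[1, 2, 3, 4], [4, 3, 2, 1]]

print(convert_to_strings(decoded, [4, 3]))  # lengths reported correctly
print(convert_to_strings(decoded, [0, 0]))  # all-zero lengths give empty strings
```

If the real function behaves this way, printing seq_lens right after the decode call would confirm whether all-zero lengths are the cause of the empty output.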

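Hello, could you share the dataset used in your paper? When I train on data I processed myself, following the method you provide, the loss becomes NaN. I am not sure whether this is caused by my data processing or by a framework version mismatch, so I would like to compare. Thank you.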
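
Two things that can produce a NaN or infinite CTC loss are non-finite values in the input signal and labels that are longer than the number of output time steps. A minimal sketch of a pre-training check follows, in plain Python. The function name, the sample layout, and the downsample factor are assumptions for illustration and are not part of SACall.

```python
import math


def find_bad_samples(samples, downsample=1):
    """Return (index, reason) pairs for samples likely to break a CTC loss.

    samples: list of (signal, label) pairs, each a list of numbers.
    downsample: assumed factor by which the model shortens the signal.
    """
    bad = []
    for i, (signal, label) in enumerate(samples):
        if not all(math.isfinite(x) for x in signal):
            bad.append((i, "non-finite signal value"))
        elif len(label) > len(signal) // downsample:
            bad.append((i, "label longer than output length"))
    return bad


samples = [
    ([0.1, 0.2, 0.3, 0.4], [1, 2]),           # fine
    ([0.1, float("nan"), 0.3, 0.4], [1, 2]),  # NaN in the signal
    ([0.1, 0.2, 0.3, 0.4], [1, 2, 3]),        # 3 labels, only 2 output steps
]
print(find_bad_samples(samples, downsample=2))
```

Running a check like this on self-processed data is one way to separate a data problem from a framework version problem before comparing against the original dataset.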

Running ctc_train.py to create a custom model - example/instruction

Hi Huang Neng,

I am trying to create a custom model using your ctc_train.py script, and as a first step I am working with some teammates to replicate your Klebsiella pneumoniae model. I see that in issue #3 you addressed where to find training datasets, but I'm still unsure how to plug them into the Python script. Do you have an example command line for this script?

Specifically, we went to the website indicated in issue #3 where training datasets can be obtained. We went into the training dataset folder and downloaded one of the files, which has the following general structure:

- Klebsiella_pneumoniae_KSB1_7F
  - sloika_hdf5s
    - remapped_0000.hdf5
    - remapped_0001.hdf5
  - training_fast5s
    - 0000
    - 0001
    - strands_0000.txt
    - strands_0001.txt
  - validation_fast5s
  - filtered_strand_list.txt
  - read_references.fasta

Your Python script requires five inputs: -save_model (path for saving the model), -train_signal_path (directory with training fast5 files), -train_label_path (presumably a file with the expected sequences for those fast5 files), -test_signal_path (directory with testing fast5 files), and -test_label_path (presumably a file with the expected sequences for those fast5 files).

In this instance, we're not sure which of the given files are the label files. Are they the sloika hdf5 files, as in the following command?

python3 ctc_train.py -save_model ./test_ouput2 -as ../Klebsiella_pneumoniae_KSB1_7F/training_fast5s/0000 -al ../Klebsiella_pneumoniae_KSB1_7F/sloika_hdf5s/remapped_0000.hdf5 -es ../Klebsiella_pneumoniae_KSB1_7F/training_fast5s/0001 -el ../Klebsiella_pneumoniae_KSB1_7F/sloika_hdf5s/remapped_0001.hdf5

We ran this as a shot in the dark, but Data.DataLoader() wasn't happy with that, possibly because we were using the incorrect label files.

Traceback (most recent call last):
  File "ctc_train.py", line 246, in <module>
    main()
  File "ctc_train.py", line 240, in main
    train(model=model,
  File "ctc_train.py", line 43, in train
    batch = train_provider.next()
  File "/home/abd/Documents/basecaller/SACall-basecaller/generate_dataset/train_dataloader.py", line 59, in next
    signal, label = self.dataiter.next()
AttributeError: '_SingleProcessDataLoaderIter' object has no attribute 'next'

Alternatively, we wondered whether we need to manually split the read_references.fasta file into training and testing datasets, but at this point we're mostly guessing.

Do you know what the command line arguments should be if we were to make a model using the dataset referenced in issue #3, including file types and/or required preprocessing of the data?
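
Regarding the AttributeError in the traceback above: it is what Python raises when code calls .next() on an iterator that only implements __next__. Recent PyTorch releases no longer provide a .next() method on DataLoader iterators, so a likely fix is to use the built-in next() at line 59 of train_dataloader.py. The sketch below reproduces the pattern with a plain Python iterator. The Provider class and its restart-on-exhaustion behaviour are hypothetical stand-ins, not the repository's code.

```python
class Provider:
    """Hypothetical stand-in for the class in train_dataloader.py."""

    def __init__(self, batches):
        self.batches = batches
        self.dataiter = iter(batches)

    def next(self):
        try:
            # Built-in next() works on any iterator; .next() is Python 2 style.
            signal, label = next(self.dataiter)
        except StopIteration:
            # Restart at the end of an epoch (assumed behaviour).
            self.dataiter = iter(self.batches)
            signal, label = next(self.dataiter)
        return signal, label


provider = Provider([("sig0", "lab0"), ("sig1", "lab1")])
print(provider.next())
print(provider.next())
print(provider.next())  # wraps around to the first batch
```

This only addresses the iterator error. It does not answer which files the label arguments expect.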

torch

Hey, I've been trying to get SACall to run on my Windows and Mac machines, but to no avail.
I got further on Ubuntu; however, I'm unable to work around this error:

multiprocessing generate records file...
missing some component in fast5 file
finish.
Time cost: 0.00 min
Traceback (most recent call last):
  File "call.py", line 215, in <module>
    main()
  File "call.py", line 210, in main
    call_model = Call(argv).to(device)
  File "call.py", line 56, in __init__
    checkpoint = torch.load(opt.model)
  File "/home/dyalcin/.local/lib/python3.6/site-packages/torch/serialization.py", line 608, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/home/dyalcin/.local/lib/python3.6/site-packages/torch/serialization.py", line 787, in _legacy_load
    result = unpickler.load()
  File "/home/dyalcin/.local/lib/python3.6/site-packages/torch/serialization.py", line 743, in persistent_load
    deserialized_objects[root_key] = restore_location(obj, location)
  File "/home/dyalcin/.local/lib/python3.6/site-packages/torch/serialization.py", line 175, in default_restore_location
    result = fn(storage, location)
  File "/home/dyalcin/.local/lib/python3.6/site-packages/torch/serialization.py", line 151, in _cuda_deserialize
    device = validate_cuda_device(location)
  File "/home/dyalcin/.local/lib/python3.6/site-packages/torch/serialization.py", line 135, in validate_cuda_device
    raise RuntimeError('Attempting to deserialize object on a CUDA '
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.

I re-installed pytorch with the settings below, but still got the error.

I modified line 56
checkpoint = torch.load(opt.model,map_location=torch.device('cpu'))

and got this error

multiprocessing generate records file...
missing some component in fast5 file
finish.
Time cost: 0.00 min
[Info] Trained model state loaded.
Traceback (most recent call last):
  File "call.py", line 215, in <module>
    main()
  File "call.py", line 210, in main
    call_model = Call(argv).to(device)
  File "/home/dyalcin/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 899, in to
    return self._apply(convert)
  File "/home/dyalcin/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 570, in _apply
    module._apply(fn)
  File "/home/dyalcin/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 570, in _apply
    module._apply(fn)
  File "/home/dyalcin/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 570, in _apply
    module._apply(fn)
  [Previous line repeated 1 more time]
  File "/home/dyalcin/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 593, in _apply
    param_applied = fn(param)
  File "/home/dyalcin/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 897, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
  File "/home/dyalcin/.local/lib/python3.6/site-packages/torch/cuda/__init__.py", line 208, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled

I then changed line 207 to
device = torch.device('cpu')

and get

multiprocessing generate records file...
missing some component in fast5 file
finish.
Time cost: 0.00 min
[Info] Trained model state loaded.
0it [00:00, ?it/s]

Could you help figure out the problem?
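
The two manual edits described above (map_location on line 56 and a CPU device on line 207) can be combined so the same script runs with or without a GPU. The following is a minimal sketch, not the code in call.py. The helper names pick_device_name and load_checkpoint are hypothetical, while torch.device, torch.cuda.is_available, and the map_location argument of torch.load are standard PyTorch.

```python
def pick_device_name(cuda_available):
    """Return "cuda" when a GPU is usable, otherwise "cpu"."""
    return "cuda" if cuda_available else "cpu"


def load_checkpoint(path):
    """Load a checkpoint onto whichever device is available."""
    import torch  # imported here so pick_device_name works without PyTorch

    device = torch.device(pick_device_name(torch.cuda.is_available()))
    # map_location remaps tensors saved from a GPU onto the chosen device.
    checkpoint = torch.load(path, map_location=device)
    return checkpoint, device


print(pick_device_name(False))
```

The returned device would then be passed to .to(device) when the model is built. This does not explain the final "0it" output. Together with the "missing some component in fast5 file" message, that output may indicate that no reads were loaded from the input files.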
