sacall-basecaller's Issues

Training script for the transformer

I was wondering if there might be a separate script to train the transformer model as defined by Transformer in ./transformer/modules.py? Something similar to ./ctc_train.py.

Many thanks in advance!

Fasta file after basecalling empty

Hi,
I am interested in using your basecaller. I have trained it on my own data, but after the basecalling step the resulting FASTA file is empty.

I traced the problem to the decoding step in the ctcdecode module: the variable seq_lens, returned by the decode function of your BeamCTCDecoder class in ctc_decoder.py (line 130), contains only zeros. Because convert_to_strings in the same class uses this variable, the decoded sequence comes out empty. Can you help me with this issue?

Many thanks,
Monika
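
One way to narrow this down is to decode a few model outputs without ctcdecode at all. The following is a minimal sketch of greedy (best-path) CTC decoding in plain Python, not code from SACall: the alphabet `-ACGT`, the blank index 0, and the function name `greedy_ctc_decode` are assumptions made for illustration. If the best path is blank at every time step, an empty result from the beam decoder is likely a symptom of the trained model rather than of ctcdecode itself.

```python
from itertools import groupby
from typing import List, Sequence

# Assumption: index 0 is the CTC blank and the bases follow in this order.
ALPHABET = "-ACGT"
BLANK = 0


def greedy_ctc_decode(scores: Sequence[Sequence[float]]) -> str:
    """Best-path CTC decoding for one read.

    `scores` has one row per time step and one column per label.
    """
    # 1. Pick the highest-scoring label at every time step.
    best_path: List[int] = [
        max(range(len(step)), key=step.__getitem__) for step in scores
    ]
    # 2. Collapse runs of identical labels into a single label.
    collapsed = [label for label, _ in groupby(best_path)]
    # 3. Drop blanks and map the remaining indices to bases.
    return "".join(ALPHABET[i] for i in collapsed if i != BLANK)


# A toy output that encodes "AAC": A, A (repeat), blank, A, C.
healthy = [
    [0.10, 0.80, 0.05, 0.03, 0.02],
    [0.10, 0.80, 0.05, 0.03, 0.02],
    [0.90, 0.04, 0.03, 0.02, 0.01],
    [0.10, 0.80, 0.05, 0.03, 0.02],
    [0.10, 0.05, 0.80, 0.03, 0.02],
]
# A toy output where the blank wins at every step.
all_blank = [[0.96, 0.01, 0.01, 0.01, 0.01]] * 5

print(repr(greedy_ctc_decode(healthy)))
print(repr(greedy_ctc_decode(all_blank)))
```

Running the same check on the real network output (after converting it to a list of per-step score rows) shows whether the model ever prefers a base over the blank.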

Running ctc_train.py to create a custom model - example/instruction

Hi Huang Neng,

I am trying to create a custom model using your ctc_train.py Python script, and as a first step I am trying, together with some teammates, to replicate your creation of the Klebsiella pneumoniae model. I see that in issue #3 you addressed where to find training datasets, but I'm still unsure how to plug them into the script. Do you have an example command line for this script?

Specifically, we went to the website indicated in issue #3 where training datasets can be obtained. We went into the training dataset folder and downloaded one of the files, which has the following general structure:

- Klebsiella_pneumoniae_KSB1_7F
  - sloika_hdf5s
    - remapped_0000.hdf5
    - remapped_0001.hdf5
  - training_fast5s
    - 0000
    - 0001
    - strands_0000.txt
    - strands_0001.txt
  - validation_fast5s
  - filtered_strand_list.txt
  - read_references.fasta

Your python script requires 5 inputs, namely -save_model (path for saving your model), -train_signal_path (directory with training fast5 files), -train_label_path (presumably a file with expected sequences for the fast5 files), -test_signal_path (directory with testing fast5 files), and -test_label_path (presumably a file with expected sequences for the fast5 files).

In this instance, we're not sure which of the given files are the label files. Are they the sloika hdf5 files, as in the following command?

python3 ctc_train.py -save_model ./test_ouput2 -as ../Klebsiella_pneumoniae_KSB1_7F/training_fast5s/0000 -al ../Klebsiella_pneumoniae_KSB1_7F/sloika_hdf5s/remapped_0000.hdf5 -es ../Klebsiella_pneumoniae_KSB1_7F/training_fast5s/0001 -el ../Klebsiella_pneumoniae_KSB1_7F/sloika_hdf5s/remapped_0001.hdf5

We ran this as a shot in the dark, but Data.DataLoader() wasn't happy with that, possibly because we were using the incorrect label files.

Traceback (most recent call last):
  File "ctc_train.py", line 246, in <module>
    main()
  File "ctc_train.py", line 240, in main
    train(model=model,
  File "ctc_train.py", line 43, in train
    batch = train_provider.next()
  File "/home/abd/Documents/basecaller/SACall-basecaller/generate_dataset/train_dataloader.py", line 59, in next
    signal, label = self.dataiter.next()
AttributeError: '_SingleProcessDataLoaderIter' object has no attribute 'next'

Alternatively, we were wondering whether we need to manually split the read_references.fasta file into training and testing datasets, but at this point we're mostly just guessing.

Do you know what the command line arguments should be if we were to make a model using the dataset referenced in issue #3, including file types and/or required preprocessing of the data?
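
A note on the traceback above: the AttributeError is independent of which label files are passed. Recent PyTorch releases no longer provide a `.next()` method on DataLoader iterators, so `self.dataiter.next()` fails, while the built-in `next(self.dataiter)` works on any iterator. The sketch below shows the pattern with a plain list standing in for the DataLoader; the class name `BatchProvider` and the restart-on-exhaustion behaviour are hypothetical and not taken from train_dataloader.py.

```python
from typing import Iterable, Iterator


class BatchProvider:
    """Hypothetical stand-in for the provider in train_dataloader.py."""

    def __init__(self, loader: Iterable):
        self.loader = loader
        self.dataiter: Iterator = iter(loader)

    def next(self):
        try:
            # Built-in next() calls __next__, which every iterator defines.
            return next(self.dataiter)
        except StopIteration:
            # Assumption: start a new pass over the data when exhausted.
            self.dataiter = iter(self.loader)
            return next(self.dataiter)


# A list of (signal, label) pairs stands in for a real DataLoader here.
batches = [("signal_0", "label_0"), ("signal_1", "label_1")]
provider = BatchProvider(batches)

first = provider.next()
second = provider.next()
third = provider.next()  # wraps around to the first batch again
print(first, second, third)
```

In the actual script, the equivalent one-line change would be replacing `self.dataiter.next()` with `next(self.dataiter)`.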

torch

Hey, I've been trying to get SACall to run on my Windows and Mac machines, but to no avail.
I've been successful on Ubuntu; however, I'm unable to work around this error:

multiprocessing generate records file...
missing some component in fast5 file
finish.
Time cost: 0.00 min
Traceback (most recent call last):
  File "call.py", line 215, in <module>
    main()
  File "call.py", line 210, in main
    call_model = Call(argv).to(device)
  File "call.py", line 56, in __init__
    checkpoint = torch.load(opt.model)
  File "/home/dyalcin/.local/lib/python3.6/site-packages/torch/serialization.py", line 608, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/home/dyalcin/.local/lib/python3.6/site-packages/torch/serialization.py", line 787, in _legacy_load
    result = unpickler.load()
  File "/home/dyalcin/.local/lib/python3.6/site-packages/torch/serialization.py", line 743, in persistent_load
    deserialized_objects[root_key] = restore_location(obj, location)
  File "/home/dyalcin/.local/lib/python3.6/site-packages/torch/serialization.py", line 175, in default_restore_location
    result = fn(storage, location)
  File "/home/dyalcin/.local/lib/python3.6/site-packages/torch/serialization.py", line 151, in _cuda_deserialize
    device = validate_cuda_device(location)
  File "/home/dyalcin/.local/lib/python3.6/site-packages/torch/serialization.py", line 135, in validate_cuda_device
    raise RuntimeError('Attempting to deserialize object on a CUDA '
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.

I re-installed pytorch with the settings below, but still got the error.
[Screenshot: PyTorch installation settings]

I modified line 56 of call.py to
checkpoint = torch.load(opt.model,map_location=torch.device('cpu'))

and got this error

multiprocessing generate records file...
missing some component in fast5 file
finish.
Time cost: 0.00 min
[Info] Trained model state loaded.
Traceback (most recent call last):
  File "call.py", line 215, in <module>
    main()
  File "call.py", line 210, in main
    call_model = Call(argv).to(device)
  File "/home/dyalcin/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 899, in to
    return self._apply(convert)
  File "/home/dyalcin/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 570, in _apply
    module._apply(fn)
  File "/home/dyalcin/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 570, in _apply
    module._apply(fn)
  File "/home/dyalcin/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 570, in _apply
    module._apply(fn)
  [Previous line repeated 1 more time]
  File "/home/dyalcin/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 593, in _apply
    param_applied = fn(param)
  File "/home/dyalcin/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 897, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
  File "/home/dyalcin/.local/lib/python3.6/site-packages/torch/cuda/__init__.py", line 208, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled

I then changed line 207 to
device = torch.device('cpu')

and get

multiprocessing generate records file...
missing some component in fast5 file
finish.
Time cost: 0.00 min
[Info] Trained model state loaded.
0it [00:00, ?it/s]

Could you help figure out the problem?
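
The two manual edits described above (line 56 and line 207 of call.py) can be combined into one device-agnostic pattern: pick the device once, then use it both for `torch.load(..., map_location=...)` and for `.to(device)`. The sketch below assumes PyTorch is installed and uses a throwaway checkpoint and a tiny `torch.nn.Linear` model rather than a SACall model; the helper names `pick_device` and `load_checkpoint` are hypothetical.

```python
import os
import tempfile

import torch


def pick_device() -> torch.device:
    """Use the GPU when one is usable, otherwise fall back to the CPU."""
    return torch.device("cuda" if torch.cuda.is_available() else "cpu")


def load_checkpoint(path: str, device: torch.device) -> dict:
    """Load a checkpoint, remapping any CUDA storages onto `device`."""
    return torch.load(path, map_location=device)


# Demonstration with a throwaway checkpoint (not a SACall model file).
device = pick_device()
with tempfile.TemporaryDirectory() as tmp:
    ckpt_path = os.path.join(tmp, "demo.chkpt")
    torch.save({"weight": torch.zeros(2, 3)}, ckpt_path)
    checkpoint = load_checkpoint(ckpt_path, device)

# The model is moved to the same device that the checkpoint was mapped to.
model = torch.nn.Linear(3, 2).to(device)
print(device.type, checkpoint["weight"].device.type)
```

With both steps tied to the same `device`, the CUDA errors above should not occur on a CPU-only machine. The remaining `0it [00:00, ?it/s]` output, together with the earlier "missing some component in fast5 file" message, may indicate that no reads were parsed from the input, which would be an input-data problem rather than a PyTorch one.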
