jbloomlab / alignparse
Align sequences and then parse features.
License: Other
The number of passes is not parsed from `ccs` 5.0 output FASTQs. While this information is not essential, it would be nice to be able to get it the same way as for `ccs` 4.*.
@khdusenbury:
The `environment.yml` file has to have all of the dependencies used by the Jupyter notebooks. Ones that are also required by `alignparse` in `setup.py` will automatically be installed, but anything else used by the notebooks needs to be specified in `environment.yml`. For our notebooks, that is the `dms_variants` package.
So in `environment.yml`, underneath this line add:
- dms_variants>=0.4.1
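For context, a minimal sketch of what the updated `environment.yml` might look like; the channels and the other entries here are illustrative assumptions, not the repo's actual file:

```yaml
# hypothetical sketch of environment.yml -- channels and pins are assumptions
name: alignparse
channels:
  - bioconda
  - conda-forge
dependencies:
  - python>=3.7
  - pip
  - minimap2
  - pip:
      - alignparse
      - dms_variants>=0.4.1
```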
You can then confirm it works by going to mybinder and running the notebooks. Right now, the link you have in the examples won't work as it points to the `master` branch -- but that is fine, we want to keep it pointing to `master`, as we'll soon merge `docs` into `master`. But confirm that mybinder works by going to the link for `docs`, which is:
https://mybinder.org/v2/gh/jbloomlab/alignparse/docs?filepath=notebooks
After you've done this, perhaps the docs branch is done? If you think it is, then all that's left (except for the paper) is to increment the version in the CHANGELOG and `dms_variants/__init__.py` to 0.1.0 and submit a pull request to merge `docs` into `master`.
Is there a minimum version requirement for minimap2? If so it should be added to the documentation.
Hello,
I'm trying to use the minimap2 module of alignparse to map reads to a genome while retaining several tags in the FASTQ headers.
I'm using data that contains information on modified bases in MM and ML tags.
The ML tag is of type B, and this raises an error at line 308 of `minimap2.py`: `ValueError: bad tag type B for tag ML`.
Could the script be updated to also support tags of type B?
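Not the actual `minimap2.py` code, but a sketch of how a SAM optional field of type `B` (an array value such as `ML:B:C,5,10`) could be parsed alongside the scalar types, assuming tags arrive as raw `TAG:TYPE:VALUE` strings:

```python
def parse_sam_tag(field):
    """Parse a SAM optional field 'TAG:TYPE:VALUE' into (tag, value).

    Handles scalar types (A, i, f, Z, H) plus array type B, whose value
    is a subtype character followed by comma-separated numbers.
    """
    tag, tag_type, value = field.split(":", 2)
    if tag_type == "i":
        return tag, int(value)
    elif tag_type == "f":
        return tag, float(value)
    elif tag_type in ("A", "Z", "H"):
        return tag, value
    elif tag_type == "B":
        # first character of the value is the array subtype (c, C, s, S, i, I, f)
        subtype, _, nums = value.partition(",")
        cast = float if subtype == "f" else int
        return tag, [cast(n) for n in nums.split(",")]
    else:
        raise ValueError(f"bad tag type {tag_type} for tag {tag}")

print(parse_sam_tag("ML:B:C,5,10"))  # ('ML', [5, 10])
```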
For instances with multiple, very similar targets, it would be nice to only have to specify one set of `feature_parse_specs` rather than having to have a file with specs for each target.
We definitely want the default to require the `feature_parse_specs` file to have a dict keyed by target name, but a boolean option like "apply_to_all", allowing the user to provide a `.yaml` file with a single specs dictionary that is applied to all targets, might be nice. We could even customize it more and allow the user to provide a dict keyed by the `fps` dict names in the `.yaml` file with a list of which targets each should apply to.
This can be added later, though, as it is not necessary for current analyses.
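A rough sketch of the proposed behavior; `apply_to_all` and the helper name here are hypothetical, not part of the current API. Given a single specs dict, key it by every target name before handing it to the existing per-target machinery:

```python
import copy

def expand_specs(specs, target_names, apply_to_all=False):
    """Hypothetical helper: expand feature_parse_specs to all targets.

    If `apply_to_all` is True, `specs` is a single specs dict that gets
    deep-copied under every target name; otherwise it must already be
    keyed by target name and is returned unchanged.
    """
    if not apply_to_all:
        return specs
    return {name: copy.deepcopy(specs) for name in target_names}

single = {"gene": {"return": "mutations"}, "barcode": {"return": "sequence"}}
expanded = expand_specs(single, ["target1", "target2"], apply_to_all=True)
print(sorted(expanded))  # ['target1', 'target2']
```

Deep-copying keeps later per-target edits from leaking across targets.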
Right now, if you want to output the aligned and filtered reads to `.csv`, the files are given a default name of the form `target_aligned.csv`. If there are multiple runs with the same targets, the files for each run will have the same names. If `to_csv` is used, the user should be able to specify the file name to keep track of which run the reads came from.
Thank you for maintaining this wonderful package.
I found that the code includes the f"{variable=}" expression in `utils.py`, which is only available in Python 3.8 or higher. But the installation documentation says the package requires Python 3.7 or higher and is currently tested on Python 3.9. Either the code should be modified to use more general formatting (e.g., `.format()`), or the dependency information should be updated to Python >= 3.8.
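For reference, the `=` specifier in f-strings is a Python 3.8 addition; the pre-3.8 equivalent spells out the name and uses `!r`:

```python
x = 42

# Python >= 3.8 only: the '=' specifier echoes the expression itself.
new_style = f"{x=}"

# Equivalent that also works on Python 3.7:
old_style = "x={!r}".format(x)

print(new_style)  # x=42
print(old_style)  # x=42
```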
Thank you so much.
In the Examples page in the docs, add a link to the notebooks on mybinder. Here is an example from dms_variants of how to do this, so it looks like this when rendered.
`pip install alignparse` fails for me inside a clean MacOS 10.14 + Python 3.8 environment created with conda, due to a pysam build issue. Using Python 3.7, or else installing pysam with conda in 3.8, works just fine.
Quite likely a Mac-only issue that will be resolved in a future pysam update, but you could consider creating an environment.yml to allow conda users to install prebuilt dependencies via
conda env create -f environment.yml
Edit: conda would also allow managed installation of minimap2
The link in the README for citing alignparse (https://jbloomlab.github.io/alignparse/acknowledgements.html) returns a 404 instead of linking to the JOSS paper.
Hey! I'm having trouble getting the Jupyter notebooks to run on MyBinder, it seems to be a connection error (screenshot attached). Is this working for you guys?
I did try a different notebook on MyBinder and that's able to run
Maybe expand the main package README to have a short summary similar to that found on the index page of the docs.
Right now the README is very bare-bones compared to the docs index.
Current alignment with `OPTIONS_VIRUS_W_DEL` only aligns reads mapping to the forward strand. It should be corrected to align to both strands. The options should be as follows: '-xsplice:hq', '-un', '-C0', '--splice-flank=no', '-M=1', '--end-seed-pen=2', '--end-bonus=2', '--secondary=no', '--cs'
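As a concrete sketch, the proposed option set could be written as a Python tuple in the style of the existing `OPTIONS_*` constants in `alignparse.minimap2` (this just transcribes the list above; whether it matches the current constant is exactly the bug being reported):

```python
# Proposed minimap2 options from this issue; intended to allow
# alignments to both strands rather than forward-strand only.
OPTIONS_VIRUS_W_DEL = (
    "-xsplice:hq",
    "-un",
    "-C0",
    "--splice-flank=no",
    "-M=1",
    "--end-seed-pen=2",
    "--end-bonus=2",
    "--secondary=no",
    "--cs",
)
```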
I'm now getting a 'Cannot allocate memory' error when creating `Summaries` objects, even for my small test dataset in the `alignparse/notebooks` directory.
I'm not familiar enough with how the `pathos` multiprocessing works to really know how to debug this.
This is in `alignparse/targets.py`.
That GenBank field is allowed to be present multiple times, but your use of that field for a name breaks this expected behavior. This makes writing scripts to slice our vector maps into the amplicons your very helpful package requires very tedious, and it seems like something that could be fixed very easily just by respecting the GenBank format. For instance, we have many `primer_bind`-type features, but their names are different. You don't look at the names; you look at the type.
Can I ask why?
Forgive me if I missed it on the docs site, but the only places I see these mentioned are in the notebook examples, and those take for granted that the person either already has amplicon files made for them or already knows what these are, so that they can correctly make their own amplicon GenBank files.
Is there a place you can point me to that documents their definitions and how to make sure that what I end up labeling as these things in my files matches what your software expects?
In the Lassa virus docs, I found it confusing how the explanation of the parse specs and the targets was interspersed. You first discuss the parse specs, but then show the targets before showing the YAML. Maybe the order of this could be simplified by first doing all of the targets stuff and then doing all of the parse specs stuff?
The `consensus` module is useful, but currently its use is not illustrated in the examples.
@khdusenbury, I looked at the JOSS manuscript submission instructions, and I have the following suggestions to build on your progress so far:
First let's finish the docs, merge the `docs` branch, and make the stable version 0.1.0 of the package. That way the `docs` branch isn't mixing the paper and the documentation writing.
Then let's make a new `paper` branch and finish the JOSS manuscript in that.
I think the JOSS manuscript should go in its own subdirectory of the repo with a README etc.; maybe this could be a `./paper/` subdirectory or something.
Based on your current paper draft, I think it should be much more tailored to describe the specific application. Right now I think it focuses too much on general long-read sequencing. Our package definitely is not useful for standard uses of long-read sequencing like genome assembly: it is for the specific case of amplicons with defined features that need to be separated and analyzed. You sort of make this point, but not clearly and quickly enough. Make it much clearer the exact need it is filling: parsing complex sets of features from long sequences. I also definitely suggest splitting into sections and describing specific use cases (like barcoded DMS, where you can link to existing Shendure lab papers) and single-cell virus sequencing (where you can link to our paper). Also, obviously link the examples and key parts of the API in the paper, similar to the example.
Do you want to work along these lines and then maybe we can meet in person to revise?
The first question I had after reading the paper was whether I could use alignments that are already generated. This is answered nicely by the Lassa virus example in the documentation, but this isn't in the paper itself. I think the paper would benefit from mentioning this capability explicitly and referencing this example.
Warning:
FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
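The deprecated pattern and its replacement, sketched with a toy DataFrame (assuming the warning comes from row-appending somewhere in the code):

```python
import pandas as pd

df = pd.DataFrame({"target": ["T1"], "count": [10]})
new_rows = pd.DataFrame({"target": ["T2"], "count": [5]})

# Deprecated (and removed in pandas 2.0):
#   df = df.append(new_rows, ignore_index=True)

# Replacement:
df = pd.concat([df, new_rows], ignore_index=True)

print(len(df))  # 2
```

If the appends happen in a loop, it is also cheaper to collect the pieces in a list and call `pd.concat` once at the end.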
We should run the code over "real-size" data: like all of Danny's RecA PacBio runs and all of Kate's VEP pilot.
We want to make sure that:
It runs without error and returns sensible results.
It is not excessively slow. If it is too slow, we can work on speeding up code.
These would be `cs_to_sequence`, `cs_to_mutation_str`, `cs_to_nt_mutation_count`, `cs_to_op_mutation_count`, and `cs_to_clip_count`. Probably these should have an optional parameter (I'm not sure about the best default, maybe even False, although we'd then set it to True) to handle our custom `cs` strings with the clip in them.
The mutations should probably be in the same space-delimited form handled in the `consensus` module. Right now that allows two insertion forms: 'ins5len7' or 'ins5GAAGGCT'. I think we should use the latter; the former was just for backward compatibility. Also pay attention to this using 1-based numbering, as that will be the "human readable" format. I think 'del5to5' is in inclusive numbering, as that also seems more human readable.
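To make the format concrete, here is a sketch of splitting up such a space-delimited mutation string. The regexes encode my reading of the format described above (1-based substitutions like 'A5T', insertions like 'ins5GAAGGCT', inclusive deletions like 'del5to7'), not the actual `consensus`-module code:

```python
import re

MUT_REGEX = re.compile(
    r"(?P<sub>[ACGT]\d+[ACGT])"                   # substitution, e.g. A5T
    r"|ins(?P<ins_site>\d+)(?P<ins_seq>[ACGT]+)"  # insertion, e.g. ins5GAAGGCT
    r"|del(?P<del_start>\d+)to(?P<del_end>\d+)"   # inclusive deletion, e.g. del5to7
)

def classify_mutations(mut_str):
    """Split a space-delimited mutation string into (kind, text) tuples."""
    result = []
    for mut in mut_str.split():
        m = MUT_REGEX.fullmatch(mut)
        if m is None:
            raise ValueError(f"cannot parse mutation {mut!r}")
        if m.group("sub"):
            result.append(("substitution", mut))
        elif m.group("ins_site"):
            result.append(("insertion", mut))
        else:
            result.append(("deletion", mut))
    return result

print(classify_mutations("A5T ins7GA del9to11"))
# [('substitution', 'A5T'), ('insertion', 'ins7GA'), ('deletion', 'del9to11')]
```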
Right now the notebooks refer to `alignparse` functions like `Targets`, but those are not hyperlinked to the actual API documentation.
We want them to be hyperlinked. For instance, look at the dms_variants examples: when it refers to `CodonVariantTable`, that is actually hyperlinked.
Now that you have built an initial draft of the `gh-pages` docs, you can go back and hyperlink key references to the API in all three notebooks. For instance, what you currently have represented as `Targets` would now be hyperlinked in the Jupyter notebook Markdown.
You can get the exact hyperlinks for classes / methods / functions from the index here.
After running the examples, it doesn't seem like there's any information in the GenBank files that couldn't be included in the YAML file. This would be more straightforward for users who are unfamiliar with the GenBank Flat File format and/or don't have the tools to generate these files.
Two options would be to continue to require two input files but accept the sequence as a FASTA file instead of GenBank, or to allow the user to include the sequence in the YAML file itself.
This would be especially welcome because I didn't see anything in the `alignparse` documentation about how to generate the GenBank file. Some example code or a link to the relevant documentation elsewhere (Biopython?) should be added.
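A hypothetical sketch of what "sequence in the YAML itself" might look like; the field names and layout here are invented for illustration and are not the current alignparse input format:

```yaml
# hypothetical combined input -- field names are invented, not the real schema
mytarget:
  sequence: ATGACCGGTTAA   # full target sequence, replacing the GenBank record
  features:
    gene:
      start: 1
      end: 9
    barcode:
      start: 10
      end: 12
```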
It would be helpful to have additional information about filtered reads. I was thinking of adding a column with the `cs` tag of the feature that failed the filter to the 'filtered' dataframe. This can at least help users see if all reads failing a particular filter are failing for the same reason.
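Sketching the idea with plain records (the column names and `cs` strings here are guesses for illustration, not the actual alignparse schema):

```python
from collections import Counter

# hypothetical filtered-read records, with the proposed `feature_cs` column
filtered = [
    {"query_name": "read1", "filter_reason": "gene mutation_nt", "feature_cs": ":100*at:50"},
    {"query_name": "read2", "filter_reason": "gene mutation_nt", "feature_cs": ":100*at:50"},
    {"query_name": "read3", "filter_reason": "gene clip5", "feature_cs": ":20+acgt:80"},
]

# with the cs tag recorded, shared failure causes become visible:
print(Counter(r["feature_cs"] for r in filtered if r["filter_reason"] == "gene mutation_nt"))
# Counter({':100*at:50': 2})
```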
The `VEP_pilot.ipynb` should be updated to use `parse_alignment`.
Then we should decide: do we want this as an actual example that is fully fleshed out / explained for the docs? Or do we just want to keep it internal as a test, since the project is ongoing and we may not want to add it to the docs now?
I found a potential bug with alignparse's calling of consensus mutations (here, pertaining to an indel). I have attached a reduced table of ccs's (debugging_ccs.csv) and jupyter notebook (alignparse_bug.md) that demonstrate the problem alongside a different consensus variant with an indel that is being properly handled, for reference. I was able to reduce the default max_indel_diffs parameter to 1 to deal with my specific issue, but it made me wonder if this was a more general issue to be addressed.
In the attachments, I am trying to call a consensus variant for the barcode TCTCAATTAAGAAAAA, which has two ccs reads that agree on the point mutations as well as the presence of an insertion at residue 349, but one ccs annotates the insertion as 12 nt and the other has an additional 13th nt.
When I run simple_mutconsensus with default parameters, I get a consensus sequence for this barcode that contains all of the point mutations, but no indel at all, misleadingly looking like it's a shorter-length variant.
My interpretation of where this is going wrong is: when there are 2 ccs's, if an indel (and I worry whether the same thing happens with a point mutation?) is present in only one of the two variants, it is not a consensus indel because it is not present in >50% of ccs's. In this case, therefore, neither of these 2 ccs's has a consensus indel, as there is no consensus on the ins349 identity between the 2 ccs's. What goes wrong is that if neither of these conflicting ins349 insertions is consensus, they should be treated as minor indels; they're each present at >0.25 frequency, so per the max_minor_indel_frac=0.25 default parameter, this barcode should be dropped as an unresolvable consensus -- but this is evidently not happening!
My fix was to set max_indel_diffs=1, which should then discard this variant from consensus calling because the two ccs's have a total of 2 indel differences from one another. But this shouldn't be the ideal way to eliminate this type of barcode from consensus calling, and it seems to point to something wrong occurring in one or both of these situations: there being exactly 2 ccs's (so a consensus should require that both ccs's be in complete agreement), or something wrong with the indel QCing.
debugging_ccs.csv
alignparse_bug.md
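To illustrate the expected behavior, here is a toy re-implementation of how I understand the thresholds should interact; this is a simplified sketch of my expectation, not the actual `simple_mutconsensus` code:

```python
from collections import Counter

def call_indel_consensus(indels_per_ccs, max_minor_indel_frac=0.25):
    """Toy consensus over indel calls, one entry per ccs (None = no indel).

    An indel is consensus only if present in >50% of ccs's; any
    non-consensus indel at frequency > max_minor_indel_frac makes the
    barcode unresolvable (returns None to signal it should be dropped).
    """
    n = len(indels_per_ccs)
    counts = Counter(i for i in indels_per_ccs if i is not None)
    consensus = [i for i, c in counts.items() if c / n > 0.5]
    minor = {i: c / n for i, c in counts.items() if i not in consensus}
    if any(frac > max_minor_indel_frac for frac in minor.values()):
        return None  # drop barcode as unresolvable
    return consensus

# the reported case: 2 ccs's agreeing on an ins349 site but not its sequence;
# each variant is at frequency 0.5 > 0.25, so the barcode should be dropped
print(call_indel_consensus(["ins349+12nt", "ins349+13nt"]))  # None
```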
The API documentation says that gzip-compressed input files can be used, but this isn't mentioned in the manuscript or higher-level docs. This is a really nice convenience feature that should be mentioned in a more accessible place, possibly featured in one of the examples.
The tests don't seem to include compressed input files (I only saw uncompressed FASTQ), and these tests should be added if this is a supported feature.