alignparse's Issues

dependencies for mybinder

@khdusenbury:

The environment.yml file has to have all of the dependencies used by the Jupyter notebooks. Ones that are also required by alignparse in setup.py will automatically be installed, but anything else used by the notebooks needs to be specified in environment.yml. For our notebooks, that is the dms_variants package.

So in environment.yml, underneath this line add:

- dms_variants>=0.4.1

You can then confirm it works by going to mybinder and running the notebooks. Right now, the link you have in the examples won't work as it points to the master branch -- but that is fine; we want to keep it pointing to master because we'll soon merge docs into master. For now, confirm that mybinder works by going to the link for the docs branch, which is:
https://mybinder.org/v2/gh/jbloomlab/alignparse/docs?filepath=notebooks

After you've done this, perhaps the docs branch is done? If you think it is, then all that remains (except for the paper) is to increment the version in the CHANGELOG and alignparse/__init__.py to 0.1.0 and submit a pull request to merge docs into master.

minimap2 module: retain ML:B tag

Hello,
I'm trying to use the minimap2 module of alignparse to map reads to a genome while retaining several tags in the fastq headers.
I'm using data that contains information on modified bases in MM and ML tags.
The ML tag is of type B, and this gives an error in line 308 of minimap2.py: ValueError: bad tag type B for tag ML
Could the script be updated to also support tags with type B?
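For reference, pysam's documented get_tag already exposes array-valued (type B) tags; a minimal sketch (the BAM file name is hypothetical):

import pysam

# print the array-valued (type B) ML tag for each read that has one
with pysam.AlignmentFile("mapped.bam", "rb") as bam:
    for read in bam:
        if read.has_tag("ML"):
            ml, tag_type = read.get_tag("ML", with_value_type=True)
            print(read.query_name, tag_type, list(ml)[:5])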

Allow single `feature_parse_specs` dict for multiple targets

For instances with multiple, very similar targets, it would be nice to only have to specify one set of feature_parse_specs rather than having to have a file with specs for each target.

We definitely want the default to require that the feature_parse_specs file have a dict keyed by target name, but a boolean option (something like "apply_to_all") letting the user provide a .yaml file with a single specs dictionary that applies to all targets might be nice. We could even customize it further and allow the user to provide a dict keyed by the specs dict names in the .yaml file, with a list of which targets each should apply to.

This can be added later though as it is not necessary for current analyses.
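In the meantime, a user-side workaround sketch (file and target names here are hypothetical) is to duplicate a single specs dict under every target name before handing it to alignparse:

import yaml

# load one parse-specs dict and key it by every target name
with open("single_specs.yaml") as f:
    single_spec = yaml.safe_load(f)

all_specs = {name: single_spec for name in ["target1", "target2"]}

with open("feature_parse_specs.yaml", "w") as f:
    yaml.dump(all_specs, f)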

Specify `.csv` name in `parse_alignment`

Right now, if you want to output the aligned and filtered reads to .csv, the files are given a default name of the form target_aligned.csv. If there are multiple runs with the same targets, the files for each run will have the same names. If to_csv is used, the user should be able to specify the file name to keep track of which run the reads came from.
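Until such an option exists, one workaround sketch (run label and target list are hypothetical, and this assumes the default files land in the working directory) is to rename the per-target files after each run:

import os

run = "run1"  # hypothetical run label
for target in ["target1", "target2"]:
    default = f"{target}_aligned.csv"  # default name form noted above
    if os.path.exists(default):
        os.rename(default, f"{run}_{default}")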

f"string=" issue

Thank you for maintaining this wonderful package.

I found that the code uses the f"{variable=}" expression in utils.py, which is only available in Python 3.8 or higher. But the installation document says the package requires Python 3.7 or higher and is currently tested on Python 3.9. Either the code should be changed to more general formatting (e.g., .format()), or the dependency information should be updated to Python >= 3.8.
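For illustration, the 3.8-only feature alongside a 3.7-compatible equivalent:

x = 3
print(f"{x=}")           # self-documenting f-string, Python >= 3.8 only
print("x={}".format(x))  # equivalent that also works on Python 3.7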

Thank you so much.

PySAM problems with Python 3.8 + OS X

pip install alignparse fails for me inside a clean MacOS 10.14 + Python 3.8 environment created with conda due to a PySAM build issue. Using Python 3.7 or else installing pysam with conda in 3.8 works just fine.

Quite likely this is a Mac-only issue that will be resolved in a future pysam update, but you could consider creating an environment.yml to allow conda users to install prebuilt dependencies via
conda env create -f environment.yml

Edit: conda would also allow managed installation of minimap2

MyBinder Connection Failed

Hey! I'm having trouble getting the Jupyter notebooks to run on MyBinder, it seems to be a connection error (screenshot attached). Is this working for you guys?

[screenshot: MyBinder connection error]

I did try a different notebook on MyBinder, and that one is able to run.

expand README

Maybe expand the main package README to have a short summary similar to that found on the index page of the docs.

Right now the README is very bare-bones compared to the docs index.

Strand alignment with OPTIONS_VIRUS_W_DEL

Currently, alignment with OPTIONS_VIRUS_W_DEL only aligns reads mapping to the forward strand. It should be corrected to align to both strands. The options should be as follows: '-xsplice:hq', '-un', '-C0', '--splice-flank=no', '-M=1', '--end-seed-pen=2', '--end-bonus=2', '--secondary=no', '--cs'
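A sketch of using the corrected options, assuming alignparse.minimap2.Mapper accepts a custom options list the same way it accepts the built-in presets:

from alignparse.minimap2 import Mapper

# proposed options that align to both strands (from the list above)
both_strand_options = ['-xsplice:hq', '-un', '-C0', '--splice-flank=no',
                       '-M=1', '--end-seed-pen=2', '--end-bonus=2',
                       '--secondary=no', '--cs']
mapper = Mapper(both_strand_options)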

Why use Genbank's type for your feature name?

This is in alignparse/targets.py

[screenshot: the relevant code in alignparse/targets.py]

That GenBank field is allowed to be present multiple times, but your use of it as a name breaks this expected behavior. This makes writing scripts to slice our vector maps into the amplicons that your very helpful package requires very tedious, and it seems like something that could be fixed very easily just by respecting the GenBank format. For instance, we have many features of the primer_bind type, but their NAMES are different. You don't look at the names, though; you look at the type.

Can I ask why?

What do {'spacer', 'termini3', 'gene', 'termini5'} each represent?

Forgive me if I missed it on the docs site, but the only places I see these mentioned are in the notebook examples, and those take for granted that the person either already has amplicon files made for them or already knows what these are, so that they can correctly make their own amplicon GenBank files.

Is there a place you can point me to that documents their definitions, so that I can make sure that what I end up labeling with these names in my files is what your software expects?

LASV example targets / parse specs

In the Lassa virus docs, I found it confusing how the explanation of the parse specs and the targets was interspersed. You first discuss the parse specs, but then show the targets before showing the YAML. Maybe the order of this could be simplified by first doing all of the targets stuff and then doing all of the parse specs stuff?

JOSS paper

@khdusenbury, I looked at the JOSS manuscript submission instructions, and I have the following suggestions to build on your progress so far:

  1. First let's finish the docs, merge the docs branch, and make the stable version 0.1.0 of the package. That way the docs branch isn't mixing the paper and the documentation writing.

  2. Then let's make a new paper branch and finish the JOSS manuscript in that.

  3. I think the JOSS manuscript should go in its own subdirectory of the repo with a README etc; maybe this could be a ./paper/ subdirectory or something.

  4. Based on your current draft, I think the paper should be much more tailored to describe the specific application. Right now it focuses too much on general long-read sequencing. Our package is definitely not useful for standard uses of long-read sequencing like genome assembly: it is for the specific case of amplicons with defined features that need to be separated and analyzed. You sort of make this point, but not clearly and quickly enough. Make it much clearer exactly what need it fills: parsing complex sets of features from long sequences. I also definitely suggest splitting the paper into sections and describing specific use cases, like barcoded deep mutational scanning (where you can link to existing Shendure lab papers) and single-cell virus sequencing (where you can link to our paper). Also, obviously link the examples and key parts of the API in the paper, similar to the example.

Do you want to work along these lines and then maybe we can meet in person to revise?

Lassa virus example/other aligners in manuscript

The first question I had after reading the paper was whether I could use alignments that are already generated. This is answered nicely by the Lassa virus example in the documentation, but this isn't in the paper itself. I think the paper would benefit from mentioning this capability explicitly and referencing this example.

`pandas` has deprecated `append`

Warning:

FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
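The fix is mechanical; for example:

import pandas as pd

df = pd.DataFrame({"a": [1, 2]})
row = pd.DataFrame({"a": [3]})

# deprecated: df = df.append(row, ignore_index=True)
df = pd.concat([df, row], ignore_index=True)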

Test on real-size data

We should run the code over "real-size" data: like all of Danny's RecA PacBio runs and all of Kate's VEP pilot.

We want to make sure that:

  1. It runs without error and returns sensible results.

  2. It is not excessively slow. If it is too slow, we can work on speeding up code.

Add to `cs_parse` module functions to parse `cs` strings

These would be cs_to_sequence, cs_to_mutation_str, cs_to_nt_mutation_count, cs_to_op_mutation_count, and cs_to_clip_count. They should probably have an optional parameter that handles our custom cs strings with the clip in them (I'm not sure about the best default; maybe even False, although we'd then set it to True).

The mutations should probably be in the same space-delimited form handled in the consensus module. Right now that allows two insertion forms: 'ins5len7' or 'ins5GAAGGCT'. I think we should use the latter; the former was just for backward compatibility. Also pay attention to this using 1-based numbering, as that will be the "human readable" format. I think 'del5to5' is in inclusive numbering, as that also seems more human readable.
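As a starting point, a toy sketch (not part of the current API) of splitting a short-form cs string into its operations, which the proposed cs_to_* functions could build on:

import re

# match, substitution, insertion, and deletion ops of a short-form cs string
_CS_OP = re.compile(r"(:\d+|\*[acgtn][acgtn]|\+[acgtn]+|-[acgtn]+)")

def split_cs(cs):
    """Split a short-form cs string like ':10*at+gg-ca:5' into its ops."""
    ops = _CS_OP.findall(cs)
    assert "".join(ops) == cs, "unparseable cs string " + cs
    return ops

print(split_cs(":10*at+gg-ca:5"))  # [':10', '*at', '+gg', '-ca', ':5']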

add hyperlinks to API in example notebooks

Right now the notebooks refer to alignparse functions like Targets, but those are not hyperlinked to the actual API documentation.

We want them to be hyperlinked. For instance, look at the dms_variants examples: when they refer to CodonVariantTable, that name is actually hyperlinked to its entry in the API documentation.

Now that you have built an initial draft of the gh-pages docs, you can go back and hyperlink key references to the API in all three notebooks. For instance, what you currently have as plain text Targets would now be hyperlinked in the Jupyter notebook Markdown, like [Targets](<link to Targets in the API docs>).

You can get the exact hyperlinks for classes / methods / functions from the index here.

YAML + FASTA or YAML only input files

After running the examples, it doesn't seem like there's any information in the GenBank files that couldn't be included in the YAML file. Putting it there would be more straightforward for users who are unfamiliar with the GenBank flat-file format and/or don't have the tools to generate these files.

Two options would be to continue to require two input files, but accept the sequence as a FASTA file instead of Genbank, or to allow the user to include the sequence in the YAML file itself.

This would be especially welcome because I didn't see anything in the alignparse documentation about how to generate the GenBank file. Some example code or a link to the relevant documentation elsewhere (Biopython?) should be added.
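Until then, a minimal Biopython sketch (not from the alignparse docs; the sequence and names are hypothetical) of writing a GenBank file with a named feature:

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.SeqFeature import SeqFeature, FeatureLocation

record = SeqRecord(Seq("ATGGCCATTGTAATGGGCCGC"),
                   id="my_target", name="my_target",
                   description="toy amplicon",
                   annotations={"molecule_type": "DNA"})
# feature coordinates are 0-based and end-exclusive in Biopython
record.features.append(SeqFeature(FeatureLocation(0, 21), type="gene"))
SeqIO.write(record, "my_target.gb", "genbank")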

Add more info about filtered reads

It would be helpful to have additional information about filtered reads. I was thinking of adding a column with the cs tag of the feature that failed the filter to the 'filtered' dataframe. This can at least help users see if all reads failing a particular filter are failing for the same reason.

Flesh out `VEP` example -- or just make private tests?

The VEP_pilot.ipynb should be updated to use parse_alignment.

Then we should decide: do we want this as an actual example that is fully fleshed out and explained for the docs? Or do we just want to keep it internal as a test, since the project is ongoing and we may not want to add it to the docs now?

possible bug with indel consensus?

I found a potential bug with alignparse's calling of consensus mutations (here, pertaining to an indel). I have attached a reduced table of ccs's (debugging_ccs.csv) and a Jupyter notebook (alignparse_bug.md) that demonstrate the problem, alongside a different consensus variant with an indel that is being properly handled, for reference. I was able to reduce the default max_indel_diffs parameter to 1 to deal with my specific issue, but it made me wonder if this is a more general issue to be addressed.

In the attachments, I am trying to call a consensus variant for the barcode TCTCAATTAAGAAAAA, which has two ccs reads that agree on the point mutations as well as the presence of an insertion at residue 349, but one ccs annotates the insertion as 12 nt and the other has an additional 13th nt.

When I run simple_mutconsensus with default parameters, I get a consensus sequence for this barcode that contains all of the point mutations, but no indel at all, misleadingly looking like it's a shorter-length variant.

My interpretation of where this is going wrong: when there are 2 ccs's, if an indel (and I worry whether the same thing happens with a point mutation?) is present in only one of the two, it is not a consensus indel because it is not present in >50% of the ccs's. In this case, therefore, neither of these 2 ccs's has a consensus indel, as there is no consensus on the ins349 identity between them. What goes wrong is that if neither of the conflicting ins349 insertions is a consensus indel, they should each be treated as a minor indel, and each is present at >0.25 frequency, so per the default max_minor_indel_frac=0.25 parameter this barcode should be dropped as an unresolvable consensus -- but this is evidently not happening!

My fix was to set max_indel_diffs=1, which should then discard this variant from consensus calling because the two ccs's have a total of 2 indel differences from one another. But this shouldn't be the ideal way to eliminate this type of barcode from consensus calling, and it seems to point to something going wrong either when there are exactly 2 ccs's (a consensus should then require that both ccs's be in complete agreement), or in the indel QC, or both.
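A quick arithmetic check of that reasoning under the stated defaults:

# with 2 ccs's, a non-consensus indel present in 1 of the 2 reads has
# frequency 0.5, which exceeds the default max_minor_indel_frac=0.25,
# so the barcode should be dropped rather than returned without the indel
n_ccs = 2
minor_indel_frac = 1 / n_ccs    # 0.5
assert minor_indel_frac > 0.25  # => barcode should be dropped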
debugging_ccs.csv
alignparse_bug.md

Support for compressed input files

The API documentation says that gzip-compressed input files can be used, but this isn't mentioned in the manuscript or higher-level docs. This is a really nice convenience feature that should be mentioned in a more accessible place, possibly featured in one of the examples.

The tests don't seem to include compressed input files (I only saw uncompressed FASTQ), and these tests should be added if this is a supported feature.
