jbloomlab / alignparse
Align sequences and then parse features.
License: Other
The number of passes is not parsed from `ccs` 5.0 output FASTQs. While this information is not essential, it would be nice to be able to get it the same way as for `ccs` 4.*.
@khdusenbury:
The `environment.yml` file has to have all of the dependencies used by the Jupyter notebooks. Ones that are also required by `alignparse` in `setup.py` will automatically be installed, but anything else used by the notebooks needs to be specified in `environment.yml`. For our notebooks, that is the `dms_variants` package.
So in `environment.yml`, underneath this line add:
- dms_variants>=0.4.1
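For context, a minimal sketch of what the updated `environment.yml` might look like; the channels and the other entries here are illustrative assumptions, not the repo's actual file:

```yaml
# hypothetical sketch of environment.yml -- channels and pins are assumptions
name: alignparse
channels:
  - bioconda
  - conda-forge
dependencies:
  - python>=3.7
  - pip
  - minimap2
  - pip:
      - alignparse
      - dms_variants>=0.4.1
```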
You can then confirm it works by going to mybinder and running the notebooks. Right now, the link you have in the examples won't work as it points to the `master` branch -- but that is fine, we want to keep it pointing to `master`, as we'll soon merge `docs` into `master`. But confirm that mybinder works by going to the link for `docs`, which is:
https://mybinder.org/v2/gh/jbloomlab/alignparse/docs?filepath=notebooks
After you've done this, perhaps the docs branch is done? If you think it is, then all that's left (except for the paper) is to increment the version in the CHANGELOG and `dms_variants/__init__.py` to 0.1.0 and submit a pull request to merge `docs` into `master`.
Is there a minimum version requirement for minimap2? If so it should be added to the documentation.
Hello,
I'm trying to use the minimap2 module of alignparse to map reads to a genome while retaining several tags in the FASTQ headers.
I'm using data that contains information on modified bases in MM and ML tags.
The ML tag is of type B, and this raises an error at line 308 of `minimap2.py`: `ValueError: bad tag type B for tag ML`.
Could the script be updated to also support tags of type B?
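Not the actual `minimap2.py` code, but a sketch of how a SAM optional field of type `B` (an array value such as `ML:B:C,5,10`) could be parsed alongside the scalar types, assuming tags arrive as raw `TAG:TYPE:VALUE` strings:

```python
def parse_sam_tag(field):
    """Parse a SAM optional field 'TAG:TYPE:VALUE' into (tag, value).

    Handles scalar types (A, i, f, Z, H) plus array type B, whose value
    is a subtype character followed by comma-separated numbers.
    """
    tag, tag_type, value = field.split(":", 2)
    if tag_type == "i":
        return tag, int(value)
    elif tag_type == "f":
        return tag, float(value)
    elif tag_type in ("A", "Z", "H"):
        return tag, value
    elif tag_type == "B":
        # first character of the value is the array subtype (c, C, s, S, i, I, f)
        subtype, _, nums = value.partition(",")
        cast = float if subtype == "f" else int
        return tag, [cast(n) for n in nums.split(",")]
    else:
        raise ValueError(f"bad tag type {tag_type} for tag {tag}")

print(parse_sam_tag("ML:B:C,5,10"))  # ('ML', [5, 10])
```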
For instances with multiple, very similar targets, it would be nice to only have to specify one set of `feature_parse_specs` rather than having to have a file with specs for each target.
We definitely want the default to require the `feature_parse_specs` file to have a dict keyed by target name, but a boolean option like "apply_to_all", allowing the user to provide a `.yaml` file with a single specs dictionary that is applied to all targets, might be nice. We could even customize it more and allow the user to provide a dict keyed by the `fps` dict names in the `.yaml` file with a list of which targets each should apply to.
This can be added later, though, as it is not necessary for current analyses.
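A rough sketch of the proposed behavior; `apply_to_all` and the helper name here are hypothetical, not part of the current API. Given a single specs dict, key it by every target name before handing it to the existing per-target machinery:

```python
import copy

def expand_specs(specs, target_names, apply_to_all=False):
    """Hypothetical helper: expand feature_parse_specs to all targets.

    If `apply_to_all` is True, `specs` is a single specs dict that gets
    deep-copied under every target name; otherwise it must already be
    keyed by target name and is returned unchanged.
    """
    if not apply_to_all:
        return specs
    return {name: copy.deepcopy(specs) for name in target_names}

single = {"gene": {"return": "mutations"}, "barcode": {"return": "sequence"}}
expanded = expand_specs(single, ["target1", "target2"], apply_to_all=True)
print(sorted(expanded))  # ['target1', 'target2']
```

Deep-copying keeps later per-target edits from leaking across targets.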
Right now, if you want to output the aligned and filtered reads to `.csv`, the files are given a default name of the form `target_aligned.csv`. If there are multiple runs with the same targets, the files for each run will have the same names. If `to_csv` is used, the user should be able to specify the file name to keep track of which run the reads came from.
Thank you for maintaining this wonderful package.
I found that the code includes the f"{variable=}" expression in `utils.py`, which is only available in Python 3.8 or higher. But the installation documentation says the package requires Python 3.7 or higher and is currently tested on Python 3.9. Either the code should be modified to use more general formatting (e.g., `.format()`), or the dependency information should be updated to Python >= 3.8.
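For reference, the `=` specifier in f-strings is a Python 3.8 addition; the pre-3.8 equivalent spells out the name and uses `!r`:

```python
x = 42

# Python >= 3.8 only: the '=' specifier echoes the expression itself.
new_style = f"{x=}"

# Equivalent that also works on Python 3.7:
old_style = "x={!r}".format(x)

print(new_style)  # x=42
print(old_style)  # x=42
```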
Thank you so much.
In the Examples page in the docs, add a link to the notebooks on mybinder. Here is an example from dms_variants of how to do this, so it looks like this when rendered.
`pip install alignparse` fails for me inside a clean MacOS 10.14 + Python 3.8 environment created with conda, due to a pysam build issue. Using Python 3.7, or else installing pysam with conda in 3.8, works just fine.
Quite likely a Mac-only issue that will be resolved in a future pysam update, but you could consider creating an environment.yml to allow conda users to install prebuilt dependencies via
conda env create -f environment.yml
Edit: conda would also allow managed installation of minimap2
The link in the README for citing alignparse (https://jbloomlab.github.io/alignparse/acknowledgements.html) returns a 404 instead of linking to the JOSS paper.
Hey! I'm having trouble getting the Jupyter notebooks to run on MyBinder, it seems to be a connection error (screenshot attached). Is this working for you guys?
I did try a different notebook on MyBinder and that's able to run
Maybe expand the main package README to have a short summary similar to that found on the index page of the docs.
Right now the README is very bare-bones compared to the docs index.
Current alignment with `OPTIONS_VIRUS_W_DEL` only aligns reads mapping to the forward strand. It should be corrected to align to both strands. The options should be as follows: '-xsplice:hq', '-un', '-C0', '--splice-flank=no', '-M=1', '--end-seed-pen=2', '--end-bonus=2', '--secondary=no', '--cs'
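As a concrete sketch, the proposed option set could be written as a Python tuple in the style of the existing `OPTIONS_*` constants in `alignparse.minimap2` (this just transcribes the list above; whether it matches the current constant is exactly the bug being reported):

```python
# Proposed minimap2 options from this issue; intended to allow
# alignments to both strands rather than forward-strand only.
OPTIONS_VIRUS_W_DEL = (
    "-xsplice:hq",
    "-un",
    "-C0",
    "--splice-flank=no",
    "-M=1",
    "--end-seed-pen=2",
    "--end-bonus=2",
    "--secondary=no",
    "--cs",
)
```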
I'm now getting a 'Cannot allocate memory' error when creating `Summaries` objects, even for my small test dataset in the `alignparse/notebooks` directory.
I'm not familiar enough with how the `pathos` multiprocessing works to really know how to debug this.
This is in `alignparse/targets.py`.
That GenBank field is allowed to be present multiple times, but your use of that field for a name breaks this expected behavior. This makes writing scripts to slice our vector maps into the amplicons your very helpful package requires very tedious, and it seems like something that could be fixed very easily just by respecting the GenBank format. For instance, we have many `primer_bind`-type features, but their names are different. You don't look at the names; you look at the type.
Can I ask why?
Forgive me if I missed it on the docs site, but the only places I see these mentioned are in the notebook examples, and those take for granted that the person either already has amplicon files made for them or already knows what these are, so that they can correctly make their own amplicon GenBank files.
Is there a place you can point me to that documents their definitions and how to make sure that what I end up labeling as these things in my files matches what your software expects?
In the Lassa virus docs, I found it confusing how the explanation of the parse specs and the targets was interspersed. You first discuss the parse specs, but then show the targets before showing the YAML. Maybe the order of this could be simplified by first doing all of the targets stuff and then doing all of the parse specs stuff?
The `consensus` module is useful, but currently its use is not illustrated in the examples.
@khdusenbury, I looked at the JOSS manuscript submission instructions, and I have the following suggestions to build on your progress so far:
First let's finish the docs, merge the `docs` branch, and make the stable version 0.1.0 of the package. That way the `docs` branch isn't mixing the paper and the documentation writing.
Then let's make a new `paper` branch and finish the JOSS manuscript in that.
I think the JOSS manuscript should go in its own subdirectory of the repo with a README etc.; maybe this could be a `./paper/` subdirectory or something.
Based on your current paper draft, I think it should be much more tailored to describe the specific application. Right now I think it focuses too much on general long-read sequencing. Our package definitely is not useful for standard uses of long-read sequencing like genome assembly: it is for the specific case of amplicons with defined features that need to be separated and analyzed. You sort of make this point, but not clearly and quickly enough. Make it much clearer the exact need it is filling: parsing complex sets of features from long sequences. I also definitely suggest splitting into sections and describing specific use cases (like barcoded DMS, where you can link to existing Shendure lab papers) and single-cell virus sequencing (where you can link to our paper). Also, obviously link the examples and key parts of the API in the paper, similar to the example.
Do you want to work along these lines and then maybe we can meet in person to revise?
The first question I had after reading the paper was whether I could use alignments that are already generated. This is answered nicely by the Lassa virus example in the documentation, but this isn't in the paper itself. I think the paper would benefit from mentioning this capability explicitly and referencing this example.
Warning:
FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
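The deprecated pattern and its replacement, sketched with a toy DataFrame (assuming the warning comes from row-appending somewhere in the code):

```python
import pandas as pd

df = pd.DataFrame({"target": ["T1"], "count": [10]})
new_rows = pd.DataFrame({"target": ["T2"], "count": [5]})

# Deprecated (and removed in pandas 2.0):
#   df = df.append(new_rows, ignore_index=True)

# Replacement:
df = pd.concat([df, new_rows], ignore_index=True)

print(len(df))  # 2
```

If the appends happen in a loop, it is also cheaper to collect the pieces in a list and call `pd.concat` once at the end.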
We should run the code over "real-size" data: like all of Danny's RecA PacBio runs and all of Kate's VEP pilot.
We want to make sure that:
It runs without error and returns sensible results.
It is not excessively slow. If it is too slow, we can work on speeding up code.
These would be `cs_to_sequence`, `cs_to_mutation_str`, `cs_to_nt_mutation_count`, `cs_to_op_mutation_count`, and `cs_to_clip_count`. Probably these should have an optional parameter (I'm not sure about the best default, maybe even False, although we'd then set it to True) to handle our custom `cs` strings with the clip in them.
The mutations should probably be in the same space-delimited form handled in the `consensus` module. Right now that allows two insertion forms: 'ins5len7' or 'ins5GAAGGCT'. I think we should use the latter; the former was just for backward compatibility. Also pay attention to this using 1-based numbering, as that will be the "human readable" format. I think 'del5to5' is in inclusive numbering, as that also seems more human readable.
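To make the format concrete, here is a sketch of splitting up such a space-delimited mutation string. The regexes encode my reading of the format described above (1-based substitutions like 'A5T', insertions like 'ins5GAAGGCT', inclusive deletions like 'del5to7'), not the actual `consensus`-module code:

```python
import re

MUT_REGEX = re.compile(
    r"(?P<sub>[ACGT]\d+[ACGT])"                   # substitution, e.g. A5T
    r"|ins(?P<ins_site>\d+)(?P<ins_seq>[ACGT]+)"  # insertion, e.g. ins5GAAGGCT
    r"|del(?P<del_start>\d+)to(?P<del_end>\d+)"   # inclusive deletion, e.g. del5to7
)

def classify_mutations(mut_str):
    """Split a space-delimited mutation string into (kind, text) tuples."""
    result = []
    for mut in mut_str.split():
        m = MUT_REGEX.fullmatch(mut)
        if m is None:
            raise ValueError(f"cannot parse mutation {mut!r}")
        if m.group("sub"):
            result.append(("substitution", mut))
        elif m.group("ins_site"):
            result.append(("insertion", mut))
        else:
            result.append(("deletion", mut))
    return result

print(classify_mutations("A5T ins7GA del9to11"))
# [('substitution', 'A5T'), ('insertion', 'ins7GA'), ('deletion', 'del9to11')]
```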
Right now the notebooks refer to `alignparse` functions like `Targets`, but those are not hyperlinked to the actual API documentation.
We want them to be hyperlinked. For instance, look at the dms_variants examples: when it refers to `CodonVariantTable`, that is actually hyperlinked.
Now that you have built an initial draft of the `gh-pages` docs, you can go back and hyperlink key references to the API in all three notebooks. For instance, what you currently have represented as `Targets` would now be hyperlinked in the Jupyter notebook Markdown.
You can get the exact hyperlinks for classes / methods / functions from the index here.
After running the examples, it doesn't seem like there's any information in the GenBank files that couldn't be included in the YAML file. This would be more straightforward for users who are unfamiliar with the GenBank Flat File format and/or don't have the tools to generate these files.
Two options would be to continue to require two input files but accept the sequence as a FASTA file instead of GenBank, or to allow the user to include the sequence in the YAML file itself.
This would be especially welcome because I didn't see anything in the `alignparse` documentation about how to generate the GenBank file. Some example code or a link to the relevant documentation elsewhere (Biopython?) should be added.
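A hypothetical sketch of what "sequence in the YAML itself" might look like; the field names and layout here are invented for illustration and are not the current alignparse input format:

```yaml
# hypothetical combined input -- field names are invented, not the real schema
mytarget:
  sequence: ATGACCGGTTAA   # full target sequence, replacing the GenBank record
  features:
    gene:
      start: 1
      end: 9
    barcode:
      start: 10
      end: 12
```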
It would be helpful to have additional information about filtered reads. I was thinking of adding a column with the `cs` tag of the feature that failed the filter to the 'filtered' dataframe. This can at least help users see if all reads failing a particular filter are failing for the same reason.
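Sketching the idea with plain records (the column names and `cs` strings here are guesses for illustration, not the actual alignparse schema):

```python
from collections import Counter

# hypothetical filtered-read records, with the proposed `feature_cs` column
filtered = [
    {"query_name": "read1", "filter_reason": "gene mutation_nt", "feature_cs": ":100*at:50"},
    {"query_name": "read2", "filter_reason": "gene mutation_nt", "feature_cs": ":100*at:50"},
    {"query_name": "read3", "filter_reason": "gene clip5", "feature_cs": ":20+acgt:80"},
]

# with the cs tag recorded, shared failure causes become visible:
print(Counter(r["feature_cs"] for r in filtered if r["filter_reason"] == "gene mutation_nt"))
# Counter({':100*at:50': 2})
```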
The `VEP_pilot.ipynb` should be updated to use `parse_alignment`.
Then we should decide: do we want this as an actual example that is fully fleshed out / explained for the docs? Or do we just want to keep it internal as a test, since the project is ongoing and we may not want to add it to the docs now?
I found a potential bug with alignparse's calling of consensus mutations (here, pertaining to an indel). I have attached a reduced table of ccs's (debugging_ccs.csv) and jupyter notebook (alignparse_bug.md) that demonstrate the problem alongside a different consensus variant with an indel that is being properly handled, for reference. I was able to reduce the default max_indel_diffs parameter to 1 to deal with my specific issue, but it made me wonder if this was a more general issue to be addressed.
In the attachments, I am trying to call a consensus variant for the barcode TCTCAATTAAGAAAAA, which has two ccs reads that agree on the point mutations as well as the presence of an insertion at residue 349, but one ccs annotates the insertion as 12 nt and the other has an additional 13th nt.
When I run simple_mutconsensus with default parameters, I get a consensus sequence for this barcode that contains all of the point mutations, but no indel at all, misleadingly looking like it's a shorter-length variant.
My interpretation of where this is going wrong is: when there are 2 ccs's, if an indel (and I worry whether the same thing happens with a point mutation?) is present in only one of the two variants, it is not a consensus indel because it is not present in >50% of ccs's. In this case, therefore, neither of these 2 ccs's has a consensus indel, as there is no consensus on the ins349 identity between the 2 ccs's. What goes wrong is that if neither of these conflicting ins349 insertions is consensus, they should be treated as minor indels; they're each present at >0.25 frequency, so per the max_minor_indel_frac=0.25 default parameter, this barcode should be dropped as an unresolvable consensus -- but this is evidently not happening!
My fix was to set max_indel_diffs=1, which should then discard this variant from consensus calling because the two ccs's have a total of 2 indel differences from one another. But this shouldn't be the ideal way to eliminate this type of barcode from consensus calling, and it seems to point to something wrong occurring in one or both of these situations: there being exactly 2 ccs's (so a consensus should require that both ccs's be in complete agreement), or something wrong with the indel QCing.
debugging_ccs.csv
alignparse_bug.md
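To illustrate the expected behavior, here is a toy re-implementation of how I understand the thresholds should interact; this is a simplified sketch of my expectation, not the actual `simple_mutconsensus` code:

```python
from collections import Counter

def call_indel_consensus(indels_per_ccs, max_minor_indel_frac=0.25):
    """Toy consensus over indel calls, one entry per ccs (None = no indel).

    An indel is consensus only if present in >50% of ccs's; any
    non-consensus indel at frequency > max_minor_indel_frac makes the
    barcode unresolvable (returns None to signal it should be dropped).
    """
    n = len(indels_per_ccs)
    counts = Counter(i for i in indels_per_ccs if i is not None)
    consensus = [i for i, c in counts.items() if c / n > 0.5]
    minor = {i: c / n for i, c in counts.items() if i not in consensus}
    if any(frac > max_minor_indel_frac for frac in minor.values()):
        return None  # drop barcode as unresolvable
    return consensus

# the reported case: 2 ccs's agreeing on an ins349 site but not its sequence;
# each variant is at frequency 0.5 > 0.25, so the barcode should be dropped
print(call_indel_consensus(["ins349+12nt", "ins349+13nt"]))  # None
```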
The API documentation says that gzip-compressed input files can be used, but this isn't mentioned in the manuscript or higher-level docs. This is a really nice convenience feature that should be mentioned in a more accessible place, possibly featured in one of the examples.
The tests don't seem to include compressed input files (I only saw uncompressed FASTQ), and these tests should be added if this is a supported feature.