villen-lab / pyascore

A Python package for fast post-translational modification localization, powered by Cython.

Home Page: https://pyascore.readthedocs.io/

License: MIT License

Python 54.03% C++ 26.67% Cython 19.30%
bioinformatics proteomics post-translational-modification python cython cpp

pyascore's People

Contributors

anthonyofseattle, enrimassi, flobay, freejstone

pyascore's Issues

Add mzIdentML reading

I offer the ability to read pepXML, Percolator, and Mokapot output, but for some reason I have failed to add mzIdentML reading. This is an oversight and should be remedied.

Enhancements before publication

Before version 1.0.0, which will be the first version to appear alongside publication, I want to make the following improvements to the repository.

  1. Convert to a modern Python package design and upload to the Python Package Index (PyPI).
    Currently, I am using an older setup for a Python package, which I think makes it difficult to control compilation
    and to submit to PyPI. I would like to improve the package metadata and publish to PyPI so
    that users can install with pip and don't need to go through the drawn-out compilation process on Linux or Mac.

  2. Enhance the ModifiedPeptide class to be a natural Python iterator.
    Right now, the ModifiedPeptide class lacks a lot of the things that I would like to see in a good Python iterator.
    I would like to get this class working as smoothly as possible in Python, with options over precisely which ions
    are returned (site determining or not).
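As an illustration of the second item, here is a minimal sketch of what an iterator-friendly interface could look like. The `Fragment` and `FragmentCollection` names and the `site_determining_only` option are hypothetical stand-ins, not the actual pyAscore `ModifiedPeptide` API.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass(frozen=True)
class Fragment:
    label: str              # e.g. "b3" or "y5"
    mz: float               # fragment m/z
    site_determining: bool  # True if the ion distinguishes candidate sites


class FragmentCollection:
    """Hypothetical stand-in for an iterable peptide fragment container."""

    def __init__(self, fragments: List[Fragment],
                 site_determining_only: bool = False):
        self._fragments = list(fragments)
        self.site_determining_only = site_determining_only

    def __iter__(self) -> Iterator[Fragment]:
        # Yield fragments lazily, skipping non-site-determining ions on request.
        for frag in self._fragments:
            if self.site_determining_only and not frag.site_determining:
                continue
            yield frag

    def __len__(self) -> int:
        # Length reflects the current filter setting.
        return sum(1 for _ in self)
```

With `__iter__` and `__len__` defined, such an object works directly in `for` loops, list comprehensions, and `len()`.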

Compiling with mac

Hi there!

I have been trying to integrate pyAscore into a tool that I recently created to localize peptide identifications. I just wanted to flag that I had some issues trying to install it on my Mac computers (both M1 and Intel-based). I was able to install it successfully by changing the C++ compiler from clang to Homebrew g++ and specifying extra_compile_args=['-std=c++11'] in the setup.py file. I am not very familiar with Cython and C/C++ generally, so I am not sure if this is the best solution.
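A rough sketch of the workaround described above, assuming a setuptools-based `setup.py`. The helper name `cpp11_extension_kwargs` is made up for illustration, and the module name and source path in the comment are placeholders, not the real pyAscore layout.

```python
def cpp11_extension_kwargs():
    """Keyword arguments for a setuptools Extension compiled as C++11."""
    return {
        "language": "c++",
        # The flag the reporter added to get the build working on macOS.
        "extra_compile_args": ["-std=c++11"],
    }


# In setup.py (placeholder module name and source path):
#   from setuptools import Extension
#   ext = Extension("mypkg.module", sources=["mypkg/module.pyx"],
#                   **cpp11_extension_kwargs())
```

The compiler itself is usually selected outside `setup.py`, for example through the `CC` and `CXX` environment variables, which is presumably how the switch from clang to Homebrew g++ would be made.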

Cheers,

Jack

unittest errors relating to pyteomics

Hi me again!

Whilst I don't think there are any problems with your module, I wanted to flag some issues that occurred when running the unit tests. I think there might be a small bug in the pyteomics package, for which I opened an issue there. The errors are attached in the text file. After fixing the small bug in the pyteomics package, things work; however, I come across one last error. Are you able to reproduce these errors that are triggered by the pyteomics package?

unitttest-errors.txt

Cheers,

Jack

Scan extraction from mzIdentML files is incorrect

I was contacted by a user who noted that mzIdentML input from MS-GF+ was not working. Looking into the code, I determined that this line is the problem:

https://github.com/AnthonyOfSeattle/pyAscore/blob/c6d0146fcdcf6b5e7aba5f9700a7bc103e24b411/pyascore/parsing/id_parsers.py#L308

I have been doing my testing with Comet, which outputs spectrumIDs that look like scan=25465, but MS-GF+ outputs a lot more info: controllerType=0 controllerNumber=1 scan=25465.
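One way to handle both spectrumID styles is to search for the `scan=` token instead of assuming it is the whole string. This is a minimal sketch, not the fix that went into `id_parsers.py`; the function name `extract_scan` is an assumption.

```python
import re

# Matches the "scan=<digits>" token anywhere in a spectrumID string.
_SCAN_PATTERN = re.compile(r"scan=(\d+)")


def extract_scan(spectrum_id: str) -> int:
    """Return the scan number embedded in an mzIdentML spectrumID."""
    match = _SCAN_PATTERN.search(spectrum_id)
    if match is None:
        raise ValueError(f"No scan number found in spectrumID: {spectrum_id!r}")
    return int(match.group(1))
```

Both `extract_scan("scan=25465")` and `extract_scan("controllerType=0 controllerNumber=1 scan=25465")` return the same integer, so the Comet and MS-GF+ forms are treated alike.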

Extract ascores for all potential PTM sites

Thanks for this modern implementation of Ascore!

I would like to get scores for all potential PTM sites, but only the best score seems to be reported, even though the alternative sites are listed in the output. For instance:

Scan  LocalizedSequence  PepScore           Ascores    AltSites
3767  ALLSLRS[80]HK      23.64276885986328  12.159486  4

Additionally, when more than one alternate site is present, the indices seem to repeat the first alternate site instead of listing both alternate sites correctly:

Scan  LocalizedSequence  PepScore           Ascores    AltSites
4190  ALLSLHS[80]SK      35.06369400024414  7.7721076  4,4

Am I missing an option to report all scores, or could a simple change to the (Python) code allow me to parse all scores?

Command line argument problems

After independent tests by another lab member, a couple of issues were found:

  • Using lowercase letters to specify the modified amino acids results in a seg fault.
  • The documentation states that neutral losses are specified with a comma (,), but they are actually specified with a semicolon (;). The comma is the better choice, since an unquoted semicolon is treated as a command separator by Unix shells.
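For the first point, one option is to normalize and validate the residue argument in Python before it reaches the compiled code. A minimal sketch follows; `normalize_residues` is a hypothetical helper, not part of the pyAscore command line interface.

```python
# The 20 standard amino acid one-letter codes.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def normalize_residues(residues: str) -> str:
    """Upper-case a residue string such as "sty" and reject unknown letters."""
    upper = residues.strip().upper()
    invalid = sorted(set(upper) - VALID_RESIDUES)
    if not upper or invalid:
        raise ValueError(
            f"Invalid residue specification {residues!r}; "
            f"unrecognized characters: {invalid}"
        )
    return upper
```

Failing with a `ValueError` at argument-parsing time gives the user a readable message instead of a crash in the extension module.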

Error when number of scans in id file does not match spectra file

Sometimes, if a user specifies an identification file which has more or fewer scans than the spectrum file, pyAscore will error without an intuitive message. This situation happens if you aggregate all PSMs together (e.g. percolator or mokapot --aggregate output) and try to feed the result to pyAscore. This is not something I want to support, but the error message needs to be more transparent about what is going on.
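A clearer failure could come from comparing the two sets of scan numbers up front. The sketch below assumes both files have already been reduced to collections of integer scan numbers; `check_scans_present` is an illustrative name, not an existing pyAscore function.

```python
def check_scans_present(id_scans, spectrum_scans):
    """Raise an informative error if identified scans are absent from the spectra."""
    missing = sorted(set(id_scans) - set(spectrum_scans))
    if missing:
        preview = ", ".join(str(scan) for scan in missing[:5])
        raise ValueError(
            f"{len(missing)} scan(s) in the identification file were not found "
            f"in the spectrum file (first few: {preview}). "
            "Was the identification file aggregated across multiple runs?"
        )
```

Calling this before scoring starts turns an opaque downstream error into a message that points at the likely cause.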

mass cannot be matched

Hi pyAscore team,
I have a problem: my mass spectrometry data has two different modifications on K, and when I ran the code below, the parsed output for the following example spectrum seemed to be wrong. The two K modifications end up with the same mass.

## input code
pyascore.IdentificationParser(psm_file, "pepXML")

## pepXML information
<aminoacid_modification aminoacid="K" massdiff="42.010565" mass="170.105528446600005" variable="Y" binary="N" description="Acetyl (K)"/>
<aminoacid_modification aminoacid="C" massdiff="57.021464000000002" mass="160.030648985200003" variable="Y" binary="N" description="Carbamidomethyl (C)"/>
<aminoacid_modification aminoacid="K" massdiff="43.005814000000001" mass="171.100777414700019" variable="Y" binary="N" description="Carbamyl (K)"/>
....
<spectrum_query spectrum="01CPTAC_UCEC_A_PNNL_20180621_B1S1_f03.3256.3256.2" start_scan="3256" end_scan="3256" precursor_neutral_mass="1155.692700233000096" assumed_charge="2" index="3255" retention_time_sec="638.6" >
<search_result>
	<search_hit hit_rank="1" peptide="KKSLNPR" peptide_prev_aa="R" peptide_next_aa="R" protein="DECOY_sp|Q5TCZ1|SPD2A_HUMAN" num_tot_proteins="1" num_matched_ions="0" tot_num_ions="0" calc_neutral_pep_mass="1155.692700233000096" massdiff="0.0" num_tol_term="1" num_missed_cleavages="0" is_rejected="0" protein_descr="Protein No. 1">
		<modification_info modified_peptide="n[230]K[171]K[170]SLNPR" mod_nterm_mass="230.170757031900024">
			<mod_aminoacid_mass position="1" mass="171.100777414700019"/>
			<mod_aminoacid_mass position="2" mass="170.105528446599976"/>
		</modification_info>
		<search_score name="Posterior Error Probability" value="0.711939"/>
		<search_score name="Posterior Error Probability" value="0.711939"/>
		<analysis_result analysis="peptideprophet">
			<peptideprophet_result probability="0.288061" all_ntt_prob="(0.0000,0.0000,0.288061)"/>
		</analysis_result>
	</search_hit>
</search_result>
</spectrum_query>
...


## output data
scan	charge_state	score	peptide	mod_positions	mod_masses
....
3256	2		KKSLNPR	[0 0 2]	[230.17076111  42.010565    42.01056677]
....
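To illustrate how the two K modifications could be kept apart, here is a sketch that resolves each `mod_aminoacid_mass` value against the declared `aminoacid_modification` entries by nearest total mass. It uses the values from the pepXML above; `resolve_massdiff`, `DECLARED_MODS`, and the tolerance are assumptions for illustration and do not describe how pyAscore's parser works.

```python
# (aminoacid, massdiff, mass) taken from the aminoacid_modification lines above.
DECLARED_MODS = [
    ("K", 42.010565, 170.1055284466),
    ("C", 57.021464, 160.0306489852),
    ("K", 43.005814, 171.1007774147),
]


def resolve_massdiff(residue, observed_mass, declared=DECLARED_MODS, tol=1e-3):
    """Return the declared massdiff whose total mass is closest to observed_mass."""
    candidates = [
        (abs(mass - observed_mass), massdiff)
        for aa, massdiff, mass in declared
        if aa == residue
    ]
    if not candidates:
        raise ValueError(f"No modification declared for residue {residue!r}")
    error, massdiff = min(candidates)
    if error > tol:
        raise ValueError(
            f"No declared modification on {residue!r} within {tol} of {observed_mass}"
        )
    return massdiff
```

Under this scheme, position 1 (mass 171.100777...) would map to the Carbamyl massdiff and position 2 (mass 170.105528...) to the Acetyl massdiff, rather than both collapsing to 42.01.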

Allow to calculate Peptide Score for unambiguous psms

Hey,

Great package that you have put together.

We would love to implement this into our pipelines. For this, though, it would be great to have a Peptide Score for unambiguous peptide sequences as well. Of course, the Ascore is infinite, but the Peptide score is still available for the single (=best) possibility, as far as I understand the process.

In particular:
https://github.com/AnthonyOfSeattle/pyAscore/blob/25263db3d95fa1b7bfe8c80dd051c78a6ab012e4/pyascore/ptm_scoring/cpp/Ascore.cpp#L267-L271

I was wondering if one could add a flag to support this if performance is really that critical. But frankly, a 90 min raw file on a desktop computer was done in 5 seconds. And for us, it's roughly 20-25 % unambiguous, so that extra would then increase to 7 seconds. IMO still super fast :). So just making it possible as default would be amazing!

Off-topic:

  • The pip install works nicely.
  • Building it from git directly did not work for me because of compilation errors: pip install git+https://github.com/Villen-Lab/pyAscore

Fix numpy binary incompatibility

After install, an error or warning during import seems to occur for some installations:

ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 96 from C header, got 80 from PyObject

A quick poke around SO seems to indicate that this is an incompatibility stemming from which numpy version I used for compilation vs which one is installed for users. I imagine it can be fixed by raising the minimum numpy version.

Add False Localization Rate Estimate

After some reading and discussions at ASMS, it seems that it would be really nice to add an FLR calculation to pyAscore. For the user, this would just be another command line argument. Internally, the Ascore algorithm will be run on each PSM with the decoy amino acid tacked on to the possible amino acids for a mod. For example, if a user specifies phospho on STY and a decoy AA as A, then internally we run Ascore as phospho on STYA and handle the FLR calculations. Right now the idea would be to sort the PSMs based on score and then report the FLR at each score threshold.
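The sort-and-threshold idea can be sketched as follows. The estimator here (scaled decoy-site count divided by target-site count, capped at 1) and the `scale` parameter are assumptions chosen for illustration; the issue does not specify which FLR formula would be used.

```python
def flr_by_threshold(psms, scale=1.0):
    """Estimate FLR at each score threshold.

    psms: iterable of (score, localized_to_decoy) pairs, where
    localized_to_decoy is True when the best site is the decoy amino acid.
    scale: assumed correction for the target/decoy residue frequency ratio.
    """
    ranked = sorted(psms, key=lambda psm: psm[0], reverse=True)
    decoys = 0
    targets = 0
    results = []
    for score, localized_to_decoy in ranked:
        if localized_to_decoy:
            decoys += 1
        else:
            targets += 1
        # Assumed estimator: scaled decoy count over target count, capped at 1.
        flr = min(1.0, scale * decoys / targets) if targets else 1.0
        results.append((score, flr))
    return results
```

Each returned pair gives the estimated FLR among all PSMs scoring at or above that threshold, so a user could pick the lowest score that still meets a desired FLR.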

Extra PTMs and Unimod

@AnthonyOfSeattle I have a lot of files in which the extra modifications are all Unimod modifications. I would like to use your API to compute the Ascore at large scale for a lot of PTMs, but I need some way to pass the other PTMs (e.g. Oxidation) to the algorithm.

Do you have an example of how to do that for Unimod PTMs?

Reading mzIdentML from Comet

I have been having trouble reading mzIdentML files from Comet (v. 2021010). It seems there are some unresolved references in the file that are causing pyteomics' MzIdentML class to read the file super inefficiently. Sadly, this is not something I can fix here, but I want to make note of it. When using Comet, the pepXML output may be best as an input to pyAscore. Or, better yet, just use the TXT file output from Mokapot or Percolator.
