villen-lab / pyascore


A python package for fast post translational modification localization, powered by Cython.

Home Page: https://pyascore.readthedocs.io/

License: MIT License

Python 54.03% C++ 26.67% Cython 19.30%

bioinformatics proteomics post-translational-modification python cython cpp


pyascore's Issues

Add mzIdentML reading

I offer the ability to read pepXML, Percolator, and Mokapot output, but for some reason have failed to add mzIdentML reading. This is an oversight and should be remedied.

Enhancements before publication

Before version 1.0.0, which will be the first version to appear alongside publication, I want to make the following improvements to the repository.

- Convert to modern Python package design and upload to the Python Package Index.
  Currently, I am using an older setup for a Python package, which I think makes it difficult to control compilation and submit to the Python Package Index. I would like to improve the package metadata and upload to PyPI so that users don't need to go through the drawn-out compilation process on Linux or Mac.
- Enhance the ModifiedPeptide class to be a natural Python iterator.
  Right now, the ModifiedPeptide class lacks a lot of the things that I would like to see in a good Python iterator. I would like to get this class working as smoothly as possible in Python, with options over precisely which ions are returned (site determining or not).

I have been trying to embed pyAscore in a tool that I recently created to localize peptide identifications. I just wanted to flag that I had some issues installing it on my Mac computers (both M1 and Intel-based). I was able to install it by changing the C++ compiler from clang to Homebrew g++ and specifying extra_compile_args=['-std=c++11'] in the setup.py file. I am not very familiar with Cython or C/C++ generally, so I am not sure if this is the best solution.

Cheers,

Jack

unittest errors relating to pyteomics

Hi, me again!

Whilst I don't think there are any problems with your module, I wanted to flag some issues that occurred when running the unit tests. I think there might be a small bug in the pyteomics package, for which I have opened an issue there. The errors are attached in the text file. After fixing the small bug in the pyteomics package, things work; however, I come across one last error. Are you able to reproduce these errors that are triggered by the pyteomics package?

unitttest-errors.txt

Cheers,

Jack

Scan extraction from mzIdentML files is incorrect

I was contacted by a user who noted that mzIdentML input from MS-GF+ was not working. Looking into the code, I determined that this line is the problem:

https://github.com/AnthonyOfSeattle/pyAscore/blob/c6d0146fcdcf6b5e7aba5f9700a7bc103e24b411/pyascore/parsing/id_parsers.py#L308

I have been doing my testing with Comet, which outputs spectrumIDs that look like scan=25465, but MS-GF+ outputs a lot more info: controllerType=0 controllerNumber=1 scan=25465.
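
As a minimal sketch of a more tolerant extraction (not the code at the line linked above), the scan number could be pulled out with a regular expression that accepts both spectrumID styles quoted here. The name `extract_scan` is hypothetical and is not part of pyAscore.

```python
import re

# Matches "scan=<digits>" at the start of the ID or after whitespace,
# so extra key=value pairs before it (as in MS-GF+ output) are ignored.
SCAN_PATTERN = re.compile(r"(?:^|\s)scan=(\d+)")


def extract_scan(spectrum_id: str) -> int:
    """Return the scan number embedded in an mzIdentML spectrumID (hypothetical helper)."""
    match = SCAN_PATTERN.search(spectrum_id)
    if match is None:
        raise ValueError(f"No scan number found in spectrumID: {spectrum_id!r}")
    return int(match.group(1))


if __name__ == "__main__":
    print(extract_scan("scan=25465"))
    print(extract_scan("controllerType=0 controllerNumber=1 scan=25465"))
```

Both calls return the same scan number, which is the behavior the Comet-only testing did not exercise.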

Extract ascores for all potential PTM sites

Thanks for this modern implementation of Ascore!

I would like to get scores for all potential PTM sites, but only the best score seems to be reported, even though the alternative sites are listed in the output. For instance:

Scan	LocalizedSequence	PepScore	Ascores	AltSites
3767	ALLSLRS[80]HK	23.64276885986328	12.159486	4

Additionally, when more than one alternate site is present, the indices seem to repeat the first alternate site instead of listing both alternate sites correctly:

Scan	LocalizedSequence	PepScore	Ascores	AltSites
4190	ALLSLHS[80]SK	35.06369400024414	7.7721076	4,4

Am I missing an option to report all scores, or could a simple change to the (Python) code allow me to parse all scores?

Command line argument problems

After independent tests by another lab member, a couple of issues were found:

- Using lowercase to specify the modified amino acids results in a seg fault.
- The documentation states that neutral losses are specified with , but they are actually specified with ;. The former is better since the latter doesn't work well with Unix.
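
One way to guard against the lowercase case is to normalize and validate the residue string before it reaches the compiled code. The sketch below is only an illustration under that assumption; `normalize_residues` and `VALID_RESIDUES` are hypothetical names, not part of the pyAscore command line interface.

```python
# The 20 standard amino acids in one-letter code.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def normalize_residues(residues: str) -> str:
    """Uppercase a residue string such as "sty" and reject unknown letters (hypothetical helper)."""
    cleaned = residues.strip().upper()
    invalid = sorted(set(cleaned) - VALID_RESIDUES)
    if not cleaned or invalid:
        raise ValueError(
            f"Residues must be one-letter amino acid codes, got {residues!r} "
            f"(invalid characters: {invalid})"
        )
    return cleaned


if __name__ == "__main__":
    print(normalize_residues("sty"))
```

Failing early with a Python exception gives the user a readable message instead of a crash in the extension module.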

PercolatorTXT 'queue_size' argument

Hi,

I am having issues running pyAscore with files in the percolator TXT format.

Error when number of scans in id file does not match spectra file

Sometimes, if a user specifies an identification file which has more or fewer scans than the spectrum file, pyAscore will error without an intuitive message. This situation happens if you aggregate all PSMs together (e.g. percolator or mokapot --aggregate output) and try to feed the result to pyAscore. This is not something I want to support, but I need to be more transparent in the error message about what is going on.
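
A minimal sketch of such a check, assuming the scan numbers from both files are available as plain integers before scoring starts. The function name `check_scans_present` and the wording of the message are hypothetical.

```python
def check_scans_present(psm_scans, spectrum_scans):
    """Raise a descriptive error if any identified scan is absent from the spectra (hypothetical helper)."""
    missing = sorted(set(psm_scans) - set(spectrum_scans))
    if missing:
        # Show only the first few scans so the message stays readable.
        preview = ", ".join(str(scan) for scan in missing[:5])
        raise ValueError(
            f"{len(missing)} scan(s) in the identification file were not found "
            f"in the spectrum file (first missing: {preview}). "
            "Was the identification file aggregated across several runs?"
        )


if __name__ == "__main__":
    check_scans_present([1, 2], [1, 2, 3])  # passes silently
    try:
        check_scans_present([1, 4, 9], [1, 2, 3])
    except ValueError as err:
        print(err)
```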

mass cannot be matched

Hi pyAscore team,
My mass spectrometry data has two different modifications on K. When I parse the file with the code below, the result for the following example spectrum seems to be wrong: the two K modifications end up with the same mass.

## input code
pyascore.IdentificationParser(psm_file, "pepXML")

## pepXML information
<aminoacid_modification aminoacid="K" massdiff="42.010565" mass="170.105528446600005" variable="Y" binary="N" description="Acetyl (K)"/>
<aminoacid_modification aminoacid="C" massdiff="57.021464000000002" mass="160.030648985200003" variable="Y" binary="N" description="Carbamidomethyl (C)"/>
<aminoacid_modification aminoacid="K" massdiff="43.005814000000001" mass="171.100777414700019" variable="Y" binary="N" description="Carbamyl (K)"/>
....
<spectrum_query spectrum="01CPTAC_UCEC_A_PNNL_20180621_B1S1_f03.3256.3256.2" start_scan="3256" end_scan="3256" precursor_neutral_mass="1155.692700233000096" assumed_charge="2" index="3255" retention_time_sec="638.6" >
<search_result>
	<search_hit hit_rank="1" peptide="KKSLNPR" peptide_prev_aa="R" peptide_next_aa="R" protein="DECOY_sp|Q5TCZ1|SPD2A_HUMAN" num_tot_proteins="1" num_matched_ions="0" tot_num_ions="0" calc_neutral_pep_mass="1155.692700233000096" massdiff="0.0" num_tol_term="1" num_missed_cleavages="0" is_rejected="0" protein_descr="Protein No. 1">
		<modification_info modified_peptide="n[230]K[171]K[170]SLNPR" mod_nterm_mass="230.170757031900024">
			<mod_aminoacid_mass position="1" mass="171.100777414700019"/>
			<mod_aminoacid_mass position="2" mass="170.105528446599976"/>
		</modification_info>
		<search_score name="Posterior Error Probability" value="0.711939"/>
		<search_score name="Posterior Error Probability" value="0.711939"/>
		<analysis_result analysis="peptideprophet">
			<peptideprophet_result probability="0.288061" all_ntt_prob="(0.0000,0.0000,0.288061)"/>
		</analysis_result>
	</search_hit>
</search_result>
</spectrum_query>
...


## output data
scan	charge_state	score	peptide	mod_positions	mod_masses
....
3256	2		KKSLNPR	[0 0 2]	[230.17076111  42.010565    42.01056677]
....
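
For reference, the mass differences one would expect for the two lysines can be recovered from the pepXML values above by subtracting the monoisotopic residue mass of lysine (128.094963) and picking the closest declared massdiff. This is only a sketch of that arithmetic, not pyAscore's parser; `closest_modification` and the tolerance are assumptions.

```python
# Monoisotopic residue mass of lysine (C6H12N2O).
K_RESIDUE_MASS = 128.094963

# massdiff values declared for K in the pepXML header above.
DECLARED_K_MODS = {
    "Acetyl (K)": 42.010565,
    "Carbamyl (K)": 43.005814,
}


def closest_modification(observed_mass, residue_mass, declared, tol=0.01):
    """Return the declared modification whose massdiff best explains observed_mass (hypothetical helper)."""
    delta = observed_mass - residue_mass
    name, massdiff = min(declared.items(), key=lambda item: abs(item[1] - delta))
    if abs(massdiff - delta) > tol:
        raise ValueError(f"No declared modification within {tol} Da of {delta:.6f}")
    return name


if __name__ == "__main__":
    # mod_aminoacid_mass values for positions 1 and 2 of KKSLNPR.
    print(closest_modification(171.100777414700019, K_RESIDUE_MASS, DECLARED_K_MODS))
    print(closest_modification(170.105528446599976, K_RESIDUE_MASS, DECLARED_K_MODS))
```

By this arithmetic, position 1 corresponds to Carbamyl (about 43.0058) and position 2 to Acetyl (about 42.0106), whereas the output above lists roughly 42.0106 for both.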

Allow to calculate Peptide Score for unambiguous psms

Hey,

Great package that you put together there.

We would love to implement this into our pipelines. For this, though, it would be great to have a Peptide Score for unambiguous peptide sequences as well. Of course, the Ascore is infinite, but the Peptide score is still available for the single (=best) possibility, as far as I understand the process.

In particular:
https://github.com/AnthonyOfSeattle/pyAscore/blob/25263db3d95fa1b7bfe8c80dd051c78a6ab012e4/pyascore/ptm_scoring/cpp/Ascore.cpp#L267-L271

I was wondering if one could add a flag to support this if performance is really that critical. But frankly, a 90 min raw file on a desktop computer was done in 5 seconds. And for us, it's roughly 20-25 % unambiguous, so that extra would then increase to 7 seconds. IMO still super fast :). So just making it possible as default would be amazing!

Off-topic:

- The pip install works nicely.
- Building it from git directly did not work for me because of compilation errors: pip install git+https://github.com/Villen-Lab/pyAscore

Linux version needs to be set for github runners

Looks like GitHub recently changed the version of Linux that is referred to by ubuntu-latest. We need to explicitly set 20.04.

Fix numpy binary incompatibility

After install, an error or warning during import seems to occur for some installations:

ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 96 from C header, got 80 from PyObject

A quick poke around SO seems to indicate that this is an incompatibility stemming from which numpy version I used for compilation vs which one is installed for users. I imagine it can be fixed by raising the minimum numpy version.

Add False Localization Rate Estimate

After some reading and discussions at ASMS, it seems that it would be really nice to add an FLR calculation to pyAscore. For the user, this would just be another command line argument. Internally, the Ascore algorithm will be run on each PSM with the decoy amino acid tacked on to the possible amino acids for a mod. For example, if a user specifies phospho on STY and a decoy AA of A, then internally we run Ascore as phospho on STYA and handle the FLR calculations. Right now the idea would be to sort the PSMs based on score and then report the FLR at each score threshold.
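
The sort-and-report idea can be sketched as a running decoy fraction. This is a deliberately naive illustration, not a planned implementation: it assumes each PSM is reduced to a score and a flag saying whether the best localization landed on the decoy amino acid, and it omits any correction for how often decoy versus target residues occur. The name `running_flr` is hypothetical.

```python
def running_flr(psms):
    """Return (score, naive FLR estimate) pairs, with PSMs ordered from best to worst score.

    psms: iterable of (score, localized_to_decoy) tuples (assumed input shape).
    """
    ordered = sorted(psms, key=lambda psm: psm[0], reverse=True)
    results = []
    decoy_hits = 0
    for n_accepted, (score, is_decoy) in enumerate(ordered, start=1):
        decoy_hits += int(is_decoy)
        # Fraction of PSMs at or above this score that localized to the decoy residue.
        results.append((score, decoy_hits / n_accepted))
    return results


if __name__ == "__main__":
    example = [(20.0, False), (5.0, False), (10.0, True), (15.0, False)]
    for score, flr in running_flr(example):
        print(f"{score:6.2f}  {flr:.3f}")
```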

Extra PTMs and Unimod

@AnthonyOfSeattle I have a lot of files in which the extra modifications are all Unimod modifications. I would like to use your API to compute the Ascore at large scale for a lot of PTMs, but I need some way to pass the other PTMs (e.g. Oxidation) to the algorithm.

Do you have an example of how to do that for Unimod PTMs?

Reading mzIdentML from Comet

I have been having trouble reading mzIdentML files from Comet (v. 2021010). It seems there are some unresolved references in the file that are causing pyteomics' MzIdentML class to read the file super inefficiently. Sadly, this is not something I can fix here, but I want to make note of it. When using Comet, the pepXML output may be best as an input to pyAscore. Or, better yet, just use the TXT file output from Mokapot or Percolator.

Incorrect calculation of score for narrow spectra bins.

Normally, spectra will use 100 m/z bins. However, results on synthetic peptide datasets suggest that narrowing that bin size may help for high res data. This needs to be supported at the score level.

https://github.com/AnthonyOfSeattle/pyAscore/blob/25263db3d95fa1b7bfe8c80dd051c78a6ab012e4/pyascore/ptm_scoring/cpp/Ascore.cpp#L33

tqdm missing as requirement

Add this at some point.