
klamt-lab / autopacmen

Retrieves kcat data and adds protein allocation constraints to stoichiometric metabolic models according to the sMOMENT method

License: Apache License 2.0

MATLAB 2.98% Python 31.08% HTML 65.88% Batchfile 0.03% Shell 0.03%
bioinformatics computational-biology metabolic-models systems-biology

autopacmen's People

Contributors

arendma, paulocracy, voidsailor, yaccos


autopacmen's Issues

Enzyme pool reported as mmol/gDW in print

Hey!

When running the function "get_initial_spreadsheets_with_sbml" (I'm calling the script directly, as I want my pipeline fully automated), there's a somewhat confusing print towards the end:

NOTE: project_name_protein_data.xlsx has as default value for the enzyme pool P 0.095 mmol/gDW.

I was under the impression that the enzyme pool is given in g/gDW, and indeed that is what is reported in the resulting Excel file, so I assume this is just a minor typo and should therefore be an easy fix :)

Cheers!

pip install does not actually install package sources

When running pip install autopacmen-Paulocracy, all dependencies are installed, as well as the package metadata. However, none of the actual scripts in the package are installed. When checking the PyPI index and downloading the sources, it turns out that the sources on PyPI do not include the scripts, only the package metadata. The same problem appears when cloning this repository and using pip to install the package locally.
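One common cause of this symptom, offered here only as a hypothesis about this project's setup.py (which I have not inspected), is a setup() call without a packages argument, so only metadata gets built; find_packages() is the usual remedy:

```python
from setuptools import find_packages

# If setup.py calls setup(...) without packages=, the built sdist/wheel
# contains only metadata. find_packages() discovers the package directories
# so the sources actually ship (hypothetical cause; the real setup.py may differ).
packages = find_packages(where=".")
print(packages)
# In setup.py this would be used as: setup(..., packages=find_packages())
```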

get_reactions_kcat_mapping

Hi @Paulocracy,
I would like to say congratulations on this absolutely fantastic piece of code/paper you published :D

I have, however, run into a problem with the kcat_database_combined.json file that is generated by the create_combined_kcat_database.py script. It creates kcat values for the species Salmonella enterica subsp. enterica serovar Typhimurium A0A0F6B484. This entry is not in the cache/ncbi_taxonomy folder, which causes an error later on when I use the function...

get_reactions_kcat_mapping(sbml_path, project_folder, project_name, organism, kcat_database_path, protein_kcat_database_path)

It runs halfway before raising the error 'OSError: [Errno 22] Invalid argument: C:/file_path/cache/ncbi_taxonomy/Salmonella enterica subsp. enterica serovar Typhimurium A0A0F6B484'.

It is possible to solve and allow complete model construction by manually deleting the 'Salmonella enterica subsp. enterica serovar Typhimurium A0A0F6B484' entries in the kcat_database_combined.json file.

Whilst it is possible to work around this, I thought it would be worth raising for anyone else who may have this problem.

Keep up the good work :D

Add default argument to get_reactions_kcat_mapping

When using get_reactions_kcat_mapping, you may not have any user-defined protein database, so requiring the argument protein_kcat_database_path to have a value is counterintuitive. From inspecting the function, I find that supplying an empty string makes the function ignore the database. I therefore suggest making the empty string the default value of protein_kcat_database_path.
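The proposed change, as a sketch (parameter names taken from the issue; the function body here is a stand-in, not autoPACMEN's code):

```python
def get_reactions_kcat_mapping(sbml_path, project_folder, project_name,
                               organism, kcat_database_path,
                               protein_kcat_database_path=""):
    # An empty string already means "ignore the protein database" inside
    # the real function, so it is a natural default value.
    uses_protein_db = protein_kcat_database_path != ""
    return uses_protein_db

# Callers without a protein database can now simply omit the argument:
print(get_reactions_kcat_mapping("model.xml", "proj/", "proj", "E. coli",
                                 "kcat_database.json"))
```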

Error in data_parse_brenda_textfile.py

In line 45 of data_parse_brenda_textfile.py, there is a bug where the click options type=click.Path(exists=True, dir_okay=True) force the output file to exist, even though the purpose of the command is to create it; running the command against a not-yet-existing output path therefore raises an error. I suggest changing these options to type=click.Path(file_okay=True, dir_okay=True, writable=True), as in data_parse_brenda_json_for_model.py.
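As a sketch (the command and option names here are illustrative, not the script's actual ones), the proposed option would look like:

```python
import click

@click.command()
@click.option("--json-output-path",
              # writable instead of exists=True: the command *creates* this
              # file, so it must not be required to exist beforehand
              type=click.Path(file_okay=True, dir_okay=True, writable=True))
def parse_brenda_textfile_cli(json_output_path):
    click.echo(f"would write parsed BRENDA data to {json_output_path}")
```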

UnboundLocalError in parse_brenda_textfile.py

When calling data_parse_brenda_textfile.py, I get an error due to a missing variable assignment in submodules/parse_brenda_textfile.py:


  File "autopacmen/autopacmen/submodules/parse_brenda_textfile.py", line 170, in parse_brenda_textfile
    word = word.replace("\t", "")
UnboundLocalError: local variable 'word' referenced before assignment
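A hypothetical, minimal reconstruction of the failure mode (the real loop in parse_brenda_textfile.py is more involved): if the loop that binds `word` never executes, the later replace() call raises exactly this UnboundLocalError, so binding `word` before the loop avoids the crash.

```python
def last_word_without_tabs(line: str) -> str:
    # Bind `word` up front: if the generator below yields nothing,
    # `word.replace(...)` would otherwise raise UnboundLocalError.
    word = ""
    for word in (w for w in line.split(" ") if w):
        pass
    return word.replace("\t", "")

print(last_word_without_tabs("Typhimurium A0A0F6B484\t"))  # A0A0F6B484
print(last_word_without_tabs(""))                          # empty string, no crash
```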

cannot import autopacmen-Paulocracy

Hi!

After successfully installing autopacmen-Paulocracy using pip I cannot import the package.

import autopacmen-Paulocracy

produces a syntax error due to the hyphen. Attempting to import the package otherwise, such as

impo = importlib.import_module("autopacmen-Paulocracy")

produces a ModuleNotFoundError.

How does one access the module after a pip install?
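For what it's worth, the hyphenated name is unparseable regardless of what is installed; PyPI distribution names may contain hyphens, but Python import names may not. Assuming the distribution ships a top-level autopacmen package (the repository's package directory), the working statement would be `import autopacmen`:

```python
def is_valid_python(stmt: str) -> bool:
    """Check whether a statement parses as Python at all."""
    try:
        compile(stmt, "<check>", "exec")
        return True
    except SyntaxError:
        return False

# The hyphen already fails at the syntax level, before any module lookup:
print(is_valid_python("import autopacmen-Paulocracy"))  # False
print(is_valid_python("import autopacmen"))             # True
```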

Malformed enzyme stoichiometry spreadsheet

This is an error related to issues #15 and #12. With the most recent versions of autoPACMEN and its dependencies, get_initial_spreadsheets_with_sbml() produces an enzyme stoichiometry spreadsheet which looks like this:

[screenshot: autoPACMEN error]

As you can see, gene annotations consisting of zero or one genes are surrounded by brackets and quotes. This causes problems in create_smoment_model_reaction_wise_with_sbml(), which does not recognize the brackets and quotes. With older package versions (I can't recall the exact numbers) I instead got the expected behavior, which excludes reactions without annotations and writes single-gene annotations without brackets and quotes. I assume this bug is caused by changes in either autoPACMEN or xlsxwriter, but I have not yet had time to investigate thoroughly.

Error in modeling_create_smoment_model.py

Hey @Paulocracy,

I am trying to run the sMOMENT model generation and get a KeyError and a "need to pass list" warning during the call to modeling_create_smoment_model.py:


    /home/miniconda3/envs/apacmen/lib/python3.8/site-packages/cobra/core/group.py:107: UserWarning: need to pass in a list
    warn("need to pass in a list")
    Traceback (most recent call last):
    File "modeling_create_smoment_model.py", line 93, in 
      create_smoment_model_cli()
    File "/home/miniconda3/envs/apacmen/lib/python3.8/site-packages/click/core.py", line 1137, in __call__
      return self.main(*args, **kwargs)
    File "/home/miniconda3/envs/apacmen/lib/python3.8/site-packages/click/core.py", line 1062, in main
      rv = self.invoke(ctx)
    File "/home/miniconda3/envs/apacmen/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
      return ctx.invoke(self.callback, **ctx.params)
    File "/home/miniconda3/envs/apacmen/lib/python3.8/site-packages/click/core.py", line 763, in invoke
      return __callback(*args, **kwargs)
    File "modeling_create_smoment_model.py", line 84, in create_smoment_model_cli
       create_smoment_model_reaction_wise_with_sbml(input_sbml_path, output_sbml_name, project_folder, project_name,
    File "/home/Software/autopacmen/autopacmen/submodules/create_smoment_model_reaction_wise.py", line 337, in create_smoment_model_reaction_wise_with_sbml
      create_smoment_model_reaction_wise(model, output_sbml_name,
    File "/home/Software/autopacmen/autopacmen/submodules/create_smoment_model_reaction_wise.py", line 267, in create_smoment_model_reaction_wise
      number_units = reaction_id_gene_rules_protein_stoichiometry_mapping[
    KeyError: 'A0A2K3DYJ4'

Since this is the first reaction linked to a single gene / homomeric enzyme, I think it has something to do with the parsing of GPR rules. Thanks a lot for designing this tool and for your help!

Procedure for protein mass determination

I am a bit puzzled by the method used to determine protein mass from UniProt (https://github.com/ARB-Lab/autopacmen/blob/69a158003d5bab3f597ec5da727515d250f35a43/autopacmen/submodules/get_protein_mass_mapping.py#L133). First UniProt is queried for the amino acid sequence, and then the sequence is analyzed for molecular mass. However, UniProt can be queried directly for the mass, e.g. https://www.uniprot.org/uniprot/?query=HXKA_YEAST%20OR%20G6PI_YEAST&format=tab&columns=id,mass. Why is this simpler approach not used in autoPACMEN? Beware, though: UniProt outputs the mass with a comma as thousands separator, so you have to write something like float(mass.replace(',', '')) to parse the result.
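The comma issue can be sketched as a one-line parser (the function name is illustrative):

```python
def parse_uniprot_mass(mass_field: str) -> float:
    # UniProt's tabular output formats the mass with a comma as thousands
    # separator (e.g. "53,738" daltons), so strip it before converting.
    return float(mass_field.replace(",", ""))

print(parse_uniprot_mass("53,738"))  # 53738.0
print(parse_uniprot_mass("985"))     # 985.0
```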

Uniprot API change

From my understanding, UniProt's web API has changed since autoPACMEN was created, which is why the UniProt ID -> protein mass mapping no longer works (https://github.com/klamt-lab/autopacmen/blob/cb828391d4cbb17e50ba9752cc974d78775d836d/autopacmen/submodules/get_protein_mass_mapping.py#L116C8-L116C105). However, one of my group members has come up with a suggested solution: replace

uniprot_query_url = f"https://www.uniprot.org/uniprot/?query={query}&format=tab&columns=id,mass"

with

uniprot_query_url = f"https://rest.uniprot.org/uniprotkb/search?query=accession:{query}&format=tsv&fields=accession,mass"

I have tested both approaches in the web browser and can confirm that the old query string no longer works, whereas the newly suggested one does.

Potential bug in k-cat lookup

I have noticed some strange code related to the lookup of kcat values:

if kcat_direction == searched_direction == "forward":
    max_kcats.append(max_kcat)
else:
    max_kcats.append(max_kcat)

I believe this block of code is there to ensure that forward and backward kcat values are not mixed together. Note, however, that the same action is taken regardless of the truth value of kcat_direction == searched_direction == "forward". I suggest the code should have been:

# Ensure that only kcat values for the same direction are used
if kcat_direction == searched_direction:
    max_kcats.append(max_kcat)
However, I ask you to double check this as I might be missing something.

Reactions with isoenzymes are not split correctly if proteomics data is provided

Hello :)
I would like to integrate proteomics data, but it seems that reactions with isoenzymes are not split correctly. I saw this with my own metabolic model, but also in the model provided here, "iJO1366_sMOMENT_2019_06_25_GECKO_ANALOGON.xml". For example, the reaction PFK is split into PFK_GPRSPLIT_1 and PFK_GPRSPLIT_2. Each split reaction should have only one of the isoenzymes in its reactants, but the two reactions are identical: they contain both isoenzymes as well as the protein_pool.

        <listOfReactants>
          <speciesReference species="M_atp_c" stoichiometry="1" constant="true"/>
          <speciesReference species="M_f6p_c" stoichiometry="1" constant="true"/>
          <speciesReference species="M_ENZYME_b3916" stoichiometry="3.77565590812736e-06" constant="true"/>
          <speciesReference species="M_ENZYME_b1723" stoichiometry="3.77565590812736e-06" constant="true"/>
          <speciesReference species="M_prot_pool" stoichiometry="0.000122541206209238" constant="true"/>
        </listOfReactants>
        <listOfProducts>
          <speciesReference species="M_adp_c" stoichiometry="1" constant="true"/>
          <speciesReference species="M_fdp_c" stoichiometry="1" constant="true"/>
          <speciesReference species="M_h_c" stoichiometry="1" constant="true"/>
          <speciesReference species="M_armm_PFK" stoichiometry="1" constant="true"/>
        </listOfProducts>

I have been trying to figure out how to fix it and one problem I found is in the script create_smoment_model_reaction_wise.py.
The function get_model_with_separated_measured_enzyme_reactions() updates the objects reaction_id_gene_rules_mapping and reaction_id_gene_rules_protein_stoichiometry_mapping where I assume it creates new gene rules for these split reactions.
However, later in the script in the main loop, the string "_GPRSPLIT_" is removed from the reaction_id and then gene_rules are retrieved for the original reaction instead of the split reactions.

Another problem seems to be that the protein_pool metabolite is always added, even if proteomics data is available. I think this is because of line 304, reaction.add_metabolites(metabolites): it is always run, even if the reaction already contains the individual enzyme. Would it make sense to add a condition so that this line is not run when all proteomics data is available?
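The ID-stripping mismatch described above can be illustrated with a toy example (the dictionary contents are invented for illustration; this is not autoPACMEN's data structure):

```python
# After splitting, the mapping is keyed by the *split* reaction IDs:
reaction_id_gene_rules = {
    "PFK_GPRSPLIT_1": [["b3916"]],
    "PFK_GPRSPLIT_2": [["b1723"]],
}

split_id = "PFK_GPRSPLIT_1"
# The main loop strips the suffix and looks up the original ID,
# which is no longer a key of the updated mapping:
base_id = split_id.split("_GPRSPLIT_")[0]       # "PFK"
print(reaction_id_gene_rules.get(base_id))      # None -> wrong lookup
# Looking up the split ID instead returns the per-isoenzyme rule:
print(reaction_id_gene_rules.get(split_id))     # [['b3916']]
```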

Thanks in advance :)

bug in BRENDA parsing

There is a bug where organism names of the BRENDA database are not parsed correctly from the .txt file if they contain a tab ("\t").

For example:
organism line:
"Salmonella enterica subsp. enterica serovar Typhimurium A0A0F6B484 \tand A0A0F6B483 UniProt <138>"

result:
"Salmonella enterica subsp. enterica serovar Typhimurium A0A0F6B484 \tand".

Expected result:
"Salmonella enterica subsp. enterica serovar Typhimurium"

Amino acid ambiguities are not handled correctly

When running the function get_protein_mass_mapping_with_sbml, I got the following error: KeyError: 'G8ZSL3' at line 141 in get_protein_mass_mapping.py. The KeyError does not trigger on all queries, but it does on the query with UniProt ID G8ZSL3. After a manual query (https://www.uniprot.org/uniprot/?query=G8ZSL3&format=tab&columns=id,sequence,mass), we see that this accession ID is valid but has two amino acid ambiguities. Hence ProteinAnalysis yields a ValueError, which is handled by progressing to the next iteration of the loop. However, the dictionary uniprot_id_protein_mass_mapping does not get updated with the key 'G8ZSL3', which triggers the error when trying to access the entry later. I suggest you instead implement the solution I proposed in #8 (comment), because even though two amino acids are ambiguous, UniProt can still find a reasonable protein mass. Alternatively, you could remove the ambiguous amino acids from the sequence before feeding it into ProteinAnalysis, which will still give a good estimate, as ambiguous amino acids constitute only small fractions of proteins.
https://github.com/ARB-Lab/autopacmen/blob/69a158003d5bab3f597ec5da727515d250f35a43/autopacmen/submodules/get_protein_mass_mapping.py#L133
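The second suggestion can be sketched as follows (the set of ambiguity codes here is illustrative; check which letters ProteinAnalysis actually rejects before adopting it):

```python
import re

# Illustrative set of non-standard/ambiguous one-letter codes that make
# Biopython's ProteinAnalysis raise a ValueError; removing them changes
# the computed mass only marginally for typical proteins.
AMBIGUOUS_CODES = "BJOUXZ"

def clean_sequence(sequence: str) -> str:
    """Strip ambiguity codes so the mass estimate can proceed."""
    return re.sub(f"[{AMBIGUOUS_CODES}]", "", sequence.upper())

print(clean_sequence("MKTAXYBZLLV"))  # MKTAYLLV
```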

PERFORMANCE: `create_smoment_model_reaction_wise_with_sbml` gets bogged down in recent cobrapy release

This is an issue I think is due to updates in cobra. With cobra version 0.21.0, create_smoment_model_reaction_wise_with_sbml() runs in a reasonable amount of time. However, in an environment with cobra version 0.26.3, the same function consumed a prohibitive amount of time. After drilling into the details, I realized that the time was spent deepcopying metabolic reactions:

forward_reaction = copy.deepcopy(reaction)

From my understanding, the autoPACMEN code has not changed on this point, but cobra has probably changed its procedures for copying reactions, leading to a very slow recursive process. As a side note, I have found cobrapy slow and cumbersome for modifying models and their components. There exists an alternative metabolic modeling package named reframed which I have used to resolve such problems, but due to technical debt, replacing cobra with reframed would take considerable effort.

fraction of unmeasured proteins in model compared to all unmatched proteins

Hi @Paulocracy,

I'd like to note that there might be a misleading description for one of the input values in the "protein_data.xlsx" excel sheet.
The description "Fraction of masses of model-included enzymes in comparison to all enzymes" (first sheet) made me think that this is the sum of all model-included protein masses divided by the total mass of all proteins of the organism.

But according to the GECKO Appendix (section 2.5) it is something like the "fraction of unmeasured proteins of the model compared to all unmatched proteins" (and that's also kind of the term you used in your code).

I hope I'm not confusing anything, but this got me puzzled :D

get_initial_spreadsheets does not correctly create enzyme_stoichiometries.xlsx file

Hello! I just started using the package and am trying to apply it to my model. But I noticed that the gene rules I get in the "...enzyme_stoichiometries.xlsx" file are not interpreted correctly (they still contain some "and"). The problem seems to be the function _gene_rule_as_list().
I tested the function on its own, and it does not work with the example provided in its docstring:
"(b0001 or b0002) and b0003" is returned as ['b0001', 'b0002 and b0003'] and not as [["b0001", "b0002"], "b0003"]

from typing import Any, Dict, List, Union

def _gene_rule_as_list(gene_rule: str) -> List[Any]:
    """Returns a given string gene rule in list form.

    I.e. (b0001 or b0002) and b0003 is returned as
    [["b0001", "b0002"], "b0003"]

    Arguments:
    *gene_rule: str ~ The gene rule which shall be converted into the list form.
    """
    # Gene rules: Only ) or (, (in blocks only and); No ) and (
    gene_rule_blocks = gene_rule.split(" ) or ( ")
    gene_rule_blocks = [x.replace("(", "").replace(")", "") for x in gene_rule_blocks]
    gene_rules_array: List[Union[str, List[str]]] = []
    for block in gene_rule_blocks:
        if " or " in block:
            block_list = block.split(" or ")
            block_list = [x.lstrip().rstrip() for x in block_list]
            gene_rules_array += block_list
        elif " and " in block:
            block_list = block.split(" and ")
            block_list = [x.lstrip().rstrip() for x in block_list]
            gene_rules_array.append(block_list)
        else:  # single enzyme
            gene_rules_array.append(block)
    return gene_rules_array

gene_rule = "(b0001 or b0002) and b0003"
_gene_rule_as_list(gene_rule)

This outputs:
['b0001', 'b0002 and b0003']

I hope I did not accidentally mess something up :) I would appreciate any help with this. Thank you!
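For reference, here is a sketch of a parser that does handle the docstring's own example (assuming no nested parentheses and no mixing of "and"/"or" inside a single parenthesized group; this is not autoPACMEN's code):

```python
from typing import List, Union

def split_top_level(rule: str, sep: str) -> List[str]:
    """Split on `sep`, ignoring occurrences inside parentheses."""
    parts, depth, start, i = [], 0, 0, 0
    while i < len(rule):
        ch = rule[i]
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif depth == 0 and rule.startswith(sep, i):
            parts.append(rule[start:i])
            i += len(sep)
            start = i
            continue
        i += 1
    parts.append(rule[start:])
    return [p.strip() for p in parts]

def gene_rule_as_list(gene_rule: str) -> List[Union[str, List[str]]]:
    """'(b0001 or b0002) and b0003' -> [['b0001', 'b0002'], 'b0003']"""
    result: List[Union[str, List[str]]] = []
    for part in split_top_level(gene_rule, " and "):
        if part.startswith("(") and part.endswith(")"):
            # A parenthesized or-group becomes an inner list of alternatives
            result.append([g.strip() for g in part[1:-1].split(" or ")])
        else:
            result.append(part)
    return result

print(gene_rule_as_list("(b0001 or b0002) and b0003"))
# [['b0001', 'b0002'], 'b0003']
```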

Handling of missing Sabio-RK entries

I got KeyError: '3.6.1.40.5' in line 69 of create_combined_kcat_database.py. Tracing back the error, I observed that searching for EC number 3.6.1.40.5 in SABIO-RK with autopacmen does not give any results at any wildcard level. The last lines of output from create_combined_kcat_database may shed more light on the problem:

Wildcard level 3...
['3.6.*.*.*']
Performing query [{'ECNumber': '3.6.*.*.*', 'Parametertype': 'kcat', 'EnzymeType': 'wildtype'}]...
SABIO-RK API error with query: ((ECNumber:3.6.*.*.* AND Parametertype:kcat AND EnzymeType:wildtype))
Wildcard level 4...
['3.*.*.*.*']
Performing query [{'ECNumber': '3.*.*.*.*', 'Parametertype': 'kcat', 'EnzymeType': 'wildtype'}]...
SABIO-RK API error with query: ((ECNumber:3.*.*.*.* AND Parametertype:kcat AND EnzymeType:wildtype))

SABIO-RK of course has entries for these high wildcard levels, but there might just be too many of them for the API to return any results. This means that even with the wildcard search, some EC numbers may have no obtainable entry. Consequently, the case where a SABIO-RK entry is not available must be handled when building the combined database.
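Defensive handling could look like the following sketch (data and structure invented for illustration; autoPACMEN's actual database layout may differ):

```python
# Toy stand-in for the parsed SABIO-RK results, keyed by EC number:
sabio_rk_kcats = {"1.1.1.1": [120.0, 95.0]}

def kcats_for_ec(ec_number: str):
    # dict.get() instead of [] so EC numbers absent from SABIO-RK
    # (e.g. when all wildcard queries failed) don't raise KeyError.
    kcats = sabio_rk_kcats.get(ec_number)
    if kcats is None:
        # No entry at any wildcard level: return an empty list so the
        # combined database can fall back to BRENDA alone for this EC number.
        return []
    return kcats

print(kcats_for_ec("3.6.1.40.5"))  # []
print(kcats_for_ec("1.1.1.1"))     # [120.0, 95.0]
```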

error running ec_model_2019_06_25_sMOMENT_iJO_CREATION.py

Hi!

After downloading the 'brenda_download.txt' and saving it in the '/ecModel_2019_06_25_input'-folder and running ec_model_2019_06_25_sMOMENT_iJO_CREATION.py, I get the following error message.

Traceback (most recent call last):
  File "C:/Users/cga32/OneDrive/autopacmen-master/autopacmen/ec_model_2019_06_25_sMOMENT_iJO_CREATION.py", line 46, in <module>
    parse_brenda_textfile(brenda_textfile_path, bigg_metabolites_json_folder, json_output_path)
  File "C:\Users\cga32\OneDrive\autopacmen-master\autopacmen\submodules\parse_brenda_textfile.py", line 84, in parse_brenda_textfile
    lines = f.readlines()
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python37_64\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 4133: character maps to <undefined>

I figured I could fix this by adding encoding="utf-8" to the open() call in parse_brenda_textfile.py at line 83.
However, doing this leads to another error:

Traceback (most recent call last):
  File "C:/Users/cga32/OneDrive/autopacmen-master/autopacmen/ec_model_2019_06_25_sMOMENT_iJO_CREATION.py", line 46, in <module>
    parse_brenda_textfile(brenda_textfile_path, bigg_metabolites_json_folder, json_output_path)
  File "C:\Users\cga32\OneDrive\autopacmen-master\autopacmen\submodules\parse_brenda_textfile.py", line 147, in parse_brenda_textfile
    ec_number.lower().split("(transferred to ec")[1].replace(")", "").lstrip()
IndexError: list index out of range

The script should run smoothly, so would anyone have an idea of what is going wrong here? Maybe the encoding should be something other than utf-8, but from what I've understood, the brenda_download.txt file is in UTF-8 format.
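The first fix can be sketched as follows (the function name is illustrative; demonstrated on a throwaway file containing a character outside cp1252). Note this addresses only the UnicodeDecodeError; the subsequent IndexError suggests a separate mismatch between the downloaded file and the parser's expected format.

```python
import os
import tempfile

def read_brenda_lines(path: str):
    # Read the BRENDA flat file explicitly as UTF-8 instead of relying on
    # the platform default (cp1252 on Windows, which fails on byte 0x8d):
    with open(path, "r", encoding="utf-8") as f:
        return f.readlines()

# Demonstration with a throwaway file containing a non-cp1252 character:
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("PR\tProtein entry \u0141\n")
    tmp_path = tmp.name
print(read_brenda_lines(tmp_path))
os.remove(tmp_path)
```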

modeling_create_smoment_model cannot handle 'NoneType' objects

Hello,

This might be a bit early to report an issue; however, I am really interested in constructing an enzyme-constrained model for yeast using autopacmen. Even though I am not very familiar with the Python environment, I was able to follow your manual up to a point (thank you for the clear explanations!). Unfortunately, I've encountered several small errors that you might want to fix, since you present autopacmen as an extension of cobrapy:

(skip bullets for the main question)

  • All scripts with data_parse_ in their names (except data_parse_bigg_metabolites_file) require the output file to already exist in the given path. They overwrite the file, but throw an error about the path if the file is not there in the first place.
  • In the manual, the usage example for data_parse_brenda_textfile.py uses a different script, data_parse_bigg_metabolites_file. This is probably a copy/paste typo.
  • The argument type_of_kcat_selection is missing from the function get_reactions_kcat_mapping() in the script modeling_get_reactions_kcat_mapping. I was able to add the parameter and continue without a problem.
  • data_create_combined_kcat_database asks for a "BRENDA path" input instead of an "output path" if you run the Python script without parameters in the terminal.
  • Similarly, modeling_create_smoment_model asks for an "SBML name" input a second time, instead of "excluded reactions".

This was the point where I could not continue, because I got several errors in modeling_create_smoment_model. As the title says, it throws an error for my model:

File "~/autopacmen/autopacmen/submodules/helper_general.py", line 243, in get_float_cell_value
    cell_value = cell_value.replace(",", ".")
AttributeError: 'NoneType' object has no attribute 'replace'

I obtained this error using Python 3.7.5 on Linux (Ubuntu 18.04.4 LTS). I have tried to modify get_float_cell_value to bypass replace() for NoneType objects; unfortunately, my solutions failed downstream (mostly in the function add_prot_pool_reaction). As I mentioned before, I am not good at Python programming, so my solutions may well be weak. There are several metabolites and enzymes in my model for which no information is available in the databases (the retrieval scripts showed NAs and warnings for them). I believe these should not cause a problem, so I am asking for a solution.
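A possible guard (a sketch, not autoPACMEN's actual fix) would return NaN for empty cells instead of crashing, though, as the report notes, downstream code such as add_prot_pool_reaction would then also need to handle the missing values:

```python
import math

def get_float_cell_value(cell_value):
    # Hypothetical guard: an empty spreadsheet cell arrives as None,
    # so propagate NaN rather than calling None.replace() and crashing.
    if cell_value is None:
        return float("nan")
    if isinstance(cell_value, str):
        # Keep the original behavior for strings with decimal commas
        cell_value = cell_value.replace(",", ".")
    return float(cell_value)

print(get_float_cell_value("1,5"))  # 1.5
print(get_float_cell_value(None))   # nan
```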

Sorry if I am unclear or have gone wrong somewhere. I hope the small problems I reported above help you enhance autopacmen, and that you can provide a generalized solution for my problem soon.

Thank you in advance,
Handan
