zhanglab / psamm
Curation and analysis of metabolic models
Home Page: https://zhanglab.github.io/psamm/
License: GNU General Public License v3.0
Currently the gapfill command will first determine all blocked compounds and then try to construct an extended model where all of the compounds are unblocked. This could be extended so that the user can specify a specific subset of the compounds to unblock.
The following model fails when running FBA. It appears that the 0.001 stoichiometric value for a in the biomass reaction is parsed as a float, while the 0.5 stoichiometric value for a in rxn_1 is parsed as a Decimal. When running FBA, these two values should be added to construct the LP equations, but a float and a Decimal cannot be added. This would be solved if the stoichiometric values for the YAML reaction (biomass in this case) were correctly parsed as Decimal.
biomass: biomass
reactions:
  - id: rxn_1
    equation: (0.5) a => (0.5) c
  - id: biomass
    equation:
      left:
        - id: a
          value: 0.001
      right:
        - id: b
          value: 0.002
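A minimal standalone reproduction of the type mismatch (the coefficient values are taken from the model above; the snippet does not use the PSAMM API):

```python
from decimal import Decimal

# The biomass coefficient is parsed as a float while the rxn_1
# coefficient is parsed as a Decimal; summing them fails.
float_coeff = 0.001
decimal_coeff = Decimal('0.5')

try:
    decimal_coeff + float_coeff
except TypeError as e:
    print('TypeError:', e)
```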
The media key in model.yaml is currently interpreted as a list of separate media (although only the first one is used, and a warning is generated if more than one is included). It has been proposed that instead the entries in the media list should be combined into one single medium. The medium definition could then be split into separate files based on a logical subdivision of the compounds.
Currently the set_offset() method is used with the Cplex solver when setting the LP problem objective. Using the set_offset() method is not strictly necessary to get the correct solution but does allow Cplex to report the correct objective value when a constant term is included in the objective. However, set_offset() is only available starting from Cplex 12.6.2, which means that PSAMM fails to work with older versions of Cplex. We can extend compatibility to older versions of Cplex by only using set_offset() when available.
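A sketch of the feature check, assuming the objective is set through an object that may or may not provide set_offset(); the function and class names below are placeholders, not the actual PSAMM or Cplex API:

```python
def set_objective_offset(objective, offset):
    # Cplex >= 12.6.2 provides set_offset(); older versions do not.
    # Skipping the call on older versions still yields a correct solution;
    # only the reported objective value omits the constant term.
    if hasattr(objective, 'set_offset'):
        objective.set_offset(offset)


class _FakeObjective(object):
    """Stand-in used here only to demonstrate both code paths."""
    def __init__(self):
        self.offset = None

    def set_offset(self, value):
        self.offset = value


obj = _FakeObjective()
set_objective_offset(obj, 5.0)       # calls set_offset()
set_objective_offset(object(), 5.0)  # silently skipped, as on older Cplex
```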
The long term plan is to port to Python 3 but currently we are blocked by some external modules that don't support Python 3 yet:
The fastgapfill command still has some hardcoded parameters: the weights of automatically generated reactions (transport/exchange), the epsilon parameter, and the model compartments. The model compartments are also needed for some other commands. It would be nice if these parameters could be specified through command line options instead of having to edit the Python code manually. This should be done by adding parser arguments in the init_parser() method and using those arguments stored in kwargs in the __call__ method.
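A hedged sketch of what the added arguments could look like; the option names, defaults and help strings are assumptions, not the actual fastgapfill interface:

```python
import argparse

def init_parser(parser):
    # Hypothetical options replacing the hardcoded parameters
    parser.add_argument('--epsilon', type=float, default=1e-5,
                        help='Threshold for a reaction flux to count as non-zero')
    parser.add_argument('--transport-weight', type=float, default=25.0,
                        help='Penalty weight of generated transport reactions')
    parser.add_argument('--exchange-weight', type=float, default=25.0,
                        help='Penalty weight of generated exchange reactions')

parser = argparse.ArgumentParser()
init_parser(parser)
args = parser.parse_args(['--epsilon', '1e-6'])
print(args.epsilon)  # 1e-06
```

In the __call__ method these values would then be read from kwargs (e.g. kwargs['epsilon']) instead of the hardcoded constants.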
Currently the compartment must be specified twice:
extracellular: e
media:
  - compartment: e
    compounds:
      - id: abc
Instead, the media should by default use the extracellular compartment for compounds.
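With the proposed default, the definition above could be reduced to the following (a sketch; assumes the compartment falls back to the extracellular key):

```yaml
extracellular: e
media:
  - compounds:
      - id: abc   # compartment defaults to the extracellular compartment, e
```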
Currently the format detection is implemented by regexp matching on the file path in multiple places in the native module. This should be refactored so that the same rules are applied consistently. In addition, we may want to allow the user to override the format detection by using a value for the format key in the same dictionary as the include key.
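For example, the override could look like this in model.yaml (a sketch; the exact key layout is a proposal, not the current format):

```yaml
reactions:
  - include: reactions/core.tsv
    format: tsv   # overrides detection based on the file extension
```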
We have an implementation of GapFind and GapFill in python as well as the previous implementation in GAMS. Currently the GAMS implementation has been tested on a number of models and we are quite confident that it is working correctly (although it has inherent limitations) but we have not yet tested the python implementation to see if it gives the same result.
The Python implementation should be tested on a couple of larger models to make sure that it gives the same result as the GAMS implementation. In addition, it would be nice to have a couple of small test cases in tests/test_gapfill.py that can be run automatically. These can be modeled on the existing tests in tests/test_fastcore.py.
Hello everybody,
when I try to import SBML files in PSAMM, the terminal does not find the psamm-import command.
(PSAMM) arturo@arturo-HP:~/Modelos/PSAMM_pruebas$ psamm-import sbml --source e_coli_core.xml --dest E_coli_yaml
psamm-import: command not found
There seems to be a performance regression in Cplex 12.6.2 compared to 12.6 with some of the MILP problems that we are using. Running FBA with thermodynamic constraints on the iJO1366 model takes more than 10 minutes with the new Cplex 12.6.2 but only ~12 seconds with 12.6. In comparison, the tFBA on iJO1366 takes ~1:30 minutes with Gurobi 6.0.4. The same performance degradation also happens in PSAMM 0.11 and 0.10.2.
We are using a customized (~8 lines of code) parser in a number of places to parse lists of reactions, or lists of (reaction, value)-tuples. In one case, we are parsing a table containing reaction IDs in the first column and lower bounds in the second column. The optional third column contains the upper bound.
Common to all of these parsers is that comments, starting with '#', are filtered out. In addition, all kinds of whitespace should be acceptable as a column separator (this is so that columns can be lined up nicely). Both of these requirements make the built-in csv module unsuitable, so we need to keep the existing solution, but we can factor out the code so we don't reimplement the same parser a number of times. The following shows one example of this type of parser, parsing the penalty weights for reactions:
for line in kwargs['penalty']:
    line, _, comment = line.partition('#')
    line = line.strip()
    if line == '':
        continue
    rxnid, weight = line.split(None, 1)
    weights[rxnid] = float(weight)
The new parser should at least be able to handle 1) files with only one column (e.g. model specification files) and 2) files with two columns. The format must be specified when the parser is called. It would be nice if it could also handle the case where there are two columns and one optional third column. The new parser can be put in the util module.
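One possible shape for the shared parser (a sketch for the util module; the name and signature are assumptions):

```python
def parse_table(f, required, optional=0):
    """Yield tuples of whitespace-separated fields from each line.

    Comments starting with '#' are stripped and blank lines are skipped.
    Each line must have at least `required` fields; up to `optional`
    additional fields are accepted. Missing optional fields yield None.
    """
    for i, line in enumerate(f):
        line, _, _ = line.partition('#')
        fields = line.split()
        if len(fields) == 0:
            continue
        if not required <= len(fields) <= required + optional:
            raise ValueError('Line {}: expected {} to {} fields, found {}'.format(
                i + 1, required, required + optional, len(fields)))
        fields += [None] * (required + optional - len(fields))
        yield tuple(fields)


# Two required columns (reaction ID, lower bound) and an optional upper bound
lines = ['rxn_1  -10  # glucose uptake', '', 'rxn_2  0  1000']
rows = list(parse_table(lines, 2, optional=1))
# rows == [('rxn_1', '-10', None), ('rxn_2', '0', '1000')]
```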
The medium table format supports dash (-) to indicate that the default value should be used. For consistency, this should be supported in the limits table too.
Currently, the code that is responsible for writing YAML model files is embedded in psamm-import. This means that API users will have to copy this code from psamm-import or reimplement it in order to write new model files or convert existing files. Since the YAML-model reading code is already in the main PSAMM package, it would make sense to also include the YAML-writing code.
Currently it is only possible to use the default value of the epsilon parameter for the gapfind and gapfill functions. Using the default value fails for some models that require very small fluxes to be viable.
Currently the gene associations are interleaved with the reaction database. This makes it hard to use shared reaction databases for different organisms. @keitht547 suggested moving the gene associations from the reaction database into the model specification (list of reactions in the model).
Currently, to fix the flux of a reaction to a specific value, it is necessary to specify a lower and upper bound of that specific value. Instead, we can allow a key fixed as a shorthand for setting both lower and upper bound to the same value. When this is implemented, psamm-import should be changed to use this format.
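In the limits list the shorthand could look like this (a sketch of the proposal; the reaction ID is arbitrary):

```yaml
limits:
  - reaction: rxn_1
    fixed: 10   # shorthand for lower: 10, upper: 10
```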
The minimal SBML parser in the sbml module recently broke because of refactoring in the code it depends on. This is bound to happen since the sbml module does not have any test cases yet. To catch the majority of regressions, write a small test case where an SBML model is loaded using StringIO instead of a file (https://docs.python.org/2/library/stringio.html). The tests should check that the methods of SBMLDatabase work as expected.
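The following standalone snippet illustrates the StringIO approach using only the standard library; a real test would hand the file-like object to SBMLDatabase instead of ElementTree, and the two-species model below is an assumption:

```python
from io import StringIO
import xml.etree.ElementTree as ET

# Minimal SBML level 2 document kept inline so no file on disk is needed
SBML_DOC = '''<?xml version="1.0"?>
<sbml xmlns="http://www.sbml.org/sbml/level2" level="2" version="1">
  <model id="test_model">
    <listOfSpecies>
      <species id="M_a" name="A" compartment="c"/>
      <species id="M_b" name="B" compartment="c"/>
    </listOfSpecies>
  </model>
</sbml>'''

tree = ET.parse(StringIO(SBML_DOC))
ns = '{http://www.sbml.org/sbml/level2}'
species = tree.findall('.//{}species'.format(ns))
print(len(species))  # 2
```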
For fba, robustness, fva and similar commands there is an option called --reaction to select a different reaction to optimize than the biomass reaction specified in model.yaml. The name of this option is not very descriptive. --objective seems to be a better choice.
Other options: --maximize, --optimize, --biomass ...
Produce a warning during simulation (FBA, ...?) when a compound defined in the extracellular space does not have an exchange reaction. This warning would be silenced by adding the compound to the medium definition (possibly with bounds set to lower: 0, upper: 0 if no exchange in or out is desired).
To make a future transition to Python 3 easier, it would be nice to have our Python scripts use the new print function that was introduced in Python 3: http://legacy.python.org/dev/peps/pep-3105/ It can be enabled in Python 2 by adding from __future__ import print_function at the top of the Python file. In addition, we have some cases where the write method of sys.stderr is used directly. These should also be changed to use the new print function, since the print function can take a file object to print to.
A statement like
print 'Two numbers: {} {}'.format(a, b)
should be changed to the function call (after adding from __future__ import print_function at the top of the file)
print('Two numbers: {} {}'.format(a, b))
and a call to the write method on sys.stderr like
sys.stderr.write('Two numbers: {}, {}\n'.format(a, b))
should be changed to (notice that the explicit newline character at the end disappears)
print('Two numbers: {}, {}'.format(a, b), file=sys.stderr)
Currently the Cplex and QSoptex solvers are supported. QSoptex is a special case since it is an exact solver, so this leaves Cplex as the only normal solver and also the only solver to support MILP problems. Cplex is proprietary, and although they give out free academic licenses it would be nice to have support for a free, open source solver. GLPK may be the best option.
Currently the code to load models in the table-based format is somewhat embedded in the metabolicmodel and database modules. This should be split off into a separate module since the internal representation should not depend on the external data format.
The documentation only includes information on the YAML-based medium format.
This would be used by the gap filling commands instead of assuming that the name is e.
Is there an easy way to add the reactions reported by the gapfill or fastgapfill commands to a model? And do you need to add the artificial transporters and exchanges too?
We will probably need automated access to the KEGG reaction information some time in the near future. We are currently able to parse the KEGG reaction equation format, and we are also able to parse the file containing the information record on the compounds.
It would be nice to have a function in the kegg module that similarly parses the reaction records into ReactionEntry objects. The new function should be called parse_reaction_file and should take a file object and return an iterator over ReactionEntry objects. Since the file format is very similar, the code from the compound parser can be reused or even factored out into a common function. The ReactionEntry object should expose all properties through a general interface (like CompoundEntry.__getitem__) but can also include convenience access for other properties (like name, enzymes, formula, etc. in CompoundEntry). Specifically, the reaction pairs should be easily accessible.
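A simplified sketch of such a parser; the ReactionEntry class here is a minimal stand-in rather than the real PSAMM class, and the flat-file handling assumes KEGG's layout of 12-character field labels and '///' record separators:

```python
class ReactionEntry(object):
    """Minimal stand-in exposing properties like CompoundEntry does."""
    def __init__(self, properties):
        self._properties = properties

    def __getitem__(self, key):
        return self._properties[key]

    @property
    def name(self):
        return self._properties.get('name')


def parse_reaction_file(f):
    """Iterate over ReactionEntry objects parsed from a KEGG reaction file."""
    properties, key = {}, None
    for line in f:
        if line.startswith('///'):  # end of record
            if properties:
                yield ReactionEntry(properties)
            properties, key = {}, None
        elif line[:12].strip() != '':  # line introduces a new field label
            key = line[:12].strip().lower()
            properties[key] = line[12:].strip()
        elif key is not None:  # continuation of the previous field
            properties[key] += ' ' + line.strip()
    if properties:
        yield ReactionEntry(properties)


sample = ('ENTRY       R00001                      Reaction\n'
          'NAME        polyphosphate polyphosphohydrolase\n'
          'EQUATION    C00890 + n C00001 <=> (n+1) C02174\n'
          '///\n')
entries = list(parse_reaction_file(sample.splitlines(True)))
print(entries[0].name)  # polyphosphate polyphosphohydrolase
```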
When solving the LP10 problem in Fastcore with GLPK, the objective becomes very small so that GLPK seems to consider it equal to zero. It may be possible to solve this by reformulating LP10 to include the scaling within the problem, i.e. by multiplying all constraints on fluxes by the scaling.
NativeModel currently only receives an exact file name as the parameter. For creating test cases, this is not flexible enough. In addition, it should be able to take a string or file object to parse, or take a dict object and use it directly.
This command should delete a specific gene (or try all genes one-by-one?) and perform FBA on the resulting model.
Currently, the flux limits of the model are not exported to SBML when using the exportsbml command. These limits could be encoded using the COBRA-compatible extension or using the level 3 package fbc.
This issue was discovered when a user tried to use "no" as a compound name. The "no" results in a boolean value from the YAML parser instead of the expected "no" string. In the specific case the issue can be worked around by quoting the compound name with single quotes. There should be an error message and an explanation of the issue when a user tries to use a non-string type as an ID.
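For reference, the workaround looks like this (YAML 1.1 resolves the unquoted form to a boolean):

```yaml
compounds:
  - id: no     # parsed as the boolean false, triggering the error
  - id: 'no'   # single quotes make it the string "no"
```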
To be compatible with existing metabolic modeling software it would be useful to be able to export the format used in model_script to an SBML file. This would allow users to create a model using model_script and later export the data to use the tools from COBRA/COBRApy or to compare the results from model_script with those software packages.
Currently the CommandError exception is available to signal that a command failed. However, this exception causes the argument parser to print out usage information, which is only appropriate if the error was caused by incorrect arguments supplied by the user. In other cases, the command may wish to signal an error that does not cause usage information to be printed. Most commands where errors are possible either raise an exception or let an existing exception bubble up. This accomplishes the goal of exiting the command with an error code but does not provide a good error message to the user. Ideally, the stack trace of the exception should be logged (at "debug" level) and a good error message should be logged at "error" level.
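A sketch of the desired behavior; the run_command wrapper and the command callable are placeholders, not the actual PSAMM command machinery:

```python
import logging

logger = logging.getLogger('psamm.command')

def run_command(run):
    """Run a command callable; log failures instead of showing usage."""
    try:
        run()
    except Exception as e:
        # Full stack trace only at debug level; short message for the user
        logger.debug('Traceback:', exc_info=True)
        logger.error('Command failed: {}'.format(e))
        return 1
    return 0

def failing_command():
    raise ValueError('Unable to solve: model is infeasible')

status = run_command(failing_command)
print(status)  # 1
```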
If the gapfill command fails it may be necessary to run the command with a lower epsilon value. Currently an exception is raised by gapfind/gapfill which simply results in a stack trace being shown to the user. With #73 and #74 it should be possible to provide an error message to the user that explains that the user can try a lower epsilon value.
Currently, FVA, flux minimization and the consistency check functions in the fluxanalysis module use an instantiation of FluxBalanceProblem directly. It would be neat if these functions could accept an optional FluxBalanceProblem instance as a parameter (instead of solver) such that a FluxBalanceTDProblem could be passed and we would automatically have access to the corresponding thermodynamically constrained functions.
It was proposed that additional features should be allowed in the medium table format. Currently the medium table format consists of four columns (two required, two optional), specifying compound ID, compartment, lower bound and upper bound. With this proposal, additional user-defined properties should be allowed after the four existing columns.
The user defined properties should be parsed and be made accessible through the API. For example, with a user defined property "class":
compound  compartment  lower  upper  class
akg       e            0      400    carbon-source
glcD      e            -10    -      carbon-source
This would require that a header be added to the medium table format so that a key can be specified for each property. It would also change the format of the table file to be strictly tab-separated instead of being space-separated as it is now. A new class MediumEntry can be defined so that the additional user-defined properties can be held. The properties can be made available from the NativeModel through parse_medium(), which would iterate over MediumEntry objects instead of tuples.
The code for parsing reactions is currently embedded in the reaction module. The internal representation does not really depend on the external data format, so these two parts can be split up. This will reduce the complexity of the reaction module, especially as the number of reaction parsers can grow in the future.
In a number of places we are raising generic Exceptions when an error is encountered. This is discouraged since these exceptions cannot be explicitly caught, and this can hide errors when the exception is caught.
Go through all instances where we raise a generic Exception and replace it with an instance of a more specific Exception subclass. In some cases the built-in exceptions can be used (e.g. often it is appropriate to use ValueError, IndexError, etc.). If no built-in exception applies, we can create a specific one, e.g.
class FluxBalanceError(Exception):
    '''Raised when an error occurs in solving a flux balance problem'''