dizak / prwlr
Integrating genetic interactions networks and phylogenetic profiles.
Home Page: https://dizak.github.io/prwlr
License: BSD 3-Clause "New" or "Revised" License
Since #93 the build fails for every Python version except 3.4. The failure applies to the apis unit tests. This part of the code was not changed at all. All the tests pass locally.
The following possible causes should be checked:
The first version to be published on PyPI will be v0.0.1.
A pip package should be prepared.
The aim is to test as much as possible. Although the test coverage will increase once the legacy code is finally removed, there are still methods with no tests at all.
At the moment, nosetests --with-coverage --cover-package prowler gives:
Name                      Stmts   Miss  Cover
---------------------------------------------
prowler.py                    4      0   100%
prowler/apis.py             126     58    54%
prowler/databases.py        186     72    61%
prowler/errors.py             8      0   100%
prowler/genome.py           198    183     8%
prowler/interactions.py     134    124     7%
prowler/network.py           34     22    35%
prowler/profiles.py          64     11    83%
prowler/stats.py            367    278    24%
prowler/templater.py         26     17    35%
prowler/utils.py             45     33    27%
---------------------------------------------
TOTAL                      1192    798    33%
----------------------------------------------------------------------
Ran 28 tests in 1.830s
OK
The PePy badge showing the number of downloads should be added to the project website.
Returning just the PSS bins is sometimes too little. When permuting small dataframes, there is no reason to return only that.
One option is to add an argument for returning the full dataframe on each iteration.
Another is to let the user pass a function that is applied to the permuted dataframe before the final value is returned.
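Both options can be combined in one signature. The following is only a minimal sketch, not prwlr code: the function name permute_and_apply, its parameters and the toy dataframe are all hypothetical.

```python
import numpy as np
import pandas as pd


def permute_and_apply(df, column, func=None, n=3, seed=0):
    """Shuffle one column n times and collect one result per iteration.

    Hypothetical helper, not part of prwlr. When func is None the full
    permuted dataframe is returned for each iteration; otherwise the
    value of func(permuted_dataframe) is returned.
    """
    rng = np.random.default_rng(seed)
    results = []
    for _ in range(n):
        permuted = df.copy()
        # Shuffle the values of the chosen column, leaving the index intact.
        permuted[column] = rng.permutation(permuted[column].to_numpy())
        results.append(permuted if func is None else func(permuted))
    return results


# Toy data, made up for illustration only.
df = pd.DataFrame({"ORF": ["a", "b", "c", "d"], "PSS": [0, 1, 2, 3]})
full_frames = permute_and_apply(df, "PSS")
pss_sums = permute_and_apply(df, "PSS", func=lambda d: int(d["PSS"].sum()))
```

With func left at None the caller gets every permuted dataframe back, which is only reasonable for small inputs; passing a function keeps the memory footprint down to one value per iteration.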
There is a fair amount of documentation that refers to the old code, e.g. the docstring of databases.SGA1 contains info about the non-existent Ortho_Interactions.interact_df, an attribute of the current class's precursor.
Stats._permute_profiles should be cleaned up: drop_dups is not needed, and it uses .size instead of __len__.
Stats.permute_profiles should finally be functional.
databases.parse_organism_info should have a way of usage not requiring any awareness of apis.
There are quite a lot of old code pieces that were rewritten and should be removed.
It might mean that calculating PSS should be moved to Stats.
For a convenient way of using Stats (as it uses attributes and returns values just as well), it should initialize without any arguments passed to __init__.
The prowler project should have its small, simple landing page at github pages.
All the columns in the KEGG Orthology dataframe besides ORF_ID and ENTRY seem redundant. Not parsing them should be the default. Stripping down the final interactions dataframe halves the time of each permutation (during the brute-force permutation test).
High-level functions should be brought to the __init__.py file. For instance, one function should be used for getting the profiles (without databases.KEGG initialization). That would make the module easier to use and the imports shorter.
The functions can be:
profilize_organism - get the ORF-Profile dataframe.
read_sga - read an existing Costanzo SGA, v1 or v2 (or v3 in the future).
merge - merge the SGA with the profilized organism. Just pandas.merge with predefined name suffixes and merge_on.
calculate_pss - easily calculate PSS without needing to use apply from pandas.
Each function should evaluate whether the columns are properly named.
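The merge wrapper and the column check described above could look roughly like this. It is a sketch under assumptions: the helper check_columns, the default column names ORF_Q and ORF_ID, the suffixes and the sample rows are made up for illustration and are not the project's actual settings.

```python
import pandas as pd


def check_columns(df, required):
    """Raise ValueError when any required column name is missing."""
    missing = [c for c in required if c not in df.columns]
    if missing:
        raise ValueError("Missing columns: {}".format(", ".join(missing)))


def merge(sga, profiles, left_on="ORF_Q", right_on="ORF_ID",
          suffixes=("_SGA", "_PROF")):
    """Thin wrapper over pandas.merge with predefined (assumed) defaults."""
    check_columns(sga, [left_on])
    check_columns(profiles, [right_on])
    return pd.merge(sga, profiles, left_on=left_on, right_on=right_on,
                    how="inner", suffixes=suffixes)


# Illustrative rows only; the scores and profiles are not real data.
sga = pd.DataFrame({"ORF_Q": ["YAL001C", "YAL002W"], "SCORE": [0.1, -0.2]})
profiles = pd.DataFrame({"ORF_ID": ["YAL001C"], "PROFILE": ["+-+"]})
merged = merge(sga, profiles)
```

Checking the column names before calling pandas.merge turns an obscure KeyError from deep inside pandas into a message that names the missing column.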
The setup.py file should detect the prwlr CLI script.
The top-level function for calculating PSS should utilize numpy.vectorize. It speeds things up, despite the official purpose of numpy.vectorize (according to the docs, it is mainly for convenience), probably due to reduced iteration overhead.
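As a rough illustration of the idea, the sketch below wraps a scalar scoring function with numpy.vectorize and applies it to two dataframe columns. The pss function here simply counts matching positions in two profile strings; it is a simplified assumption, not the project's actual PSS definition, and the column names are hypothetical. Whether this beats pandas apply should be measured on real data.

```python
import numpy as np
import pandas as pd


def pss(profile_a, profile_b):
    """Count positions at which two equal-length profile strings agree.

    Simplified stand-in for a profile similarity score.
    """
    return sum(a == b for a, b in zip(profile_a, profile_b))


# numpy.vectorize calls pss once per pair of elements.
vectorized_pss = np.vectorize(pss)

# Made-up profiles for illustration.
df = pd.DataFrame({"PROF_Q": ["++-", "+--"], "PROF_A": ["++-", "-+-"]})
df["PSS"] = vectorized_pss(df["PROF_Q"].to_numpy(), df["PROF_A"].to_numpy())
```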
The conda venv yml files should be prepared in a minimalistic way, with no versions specified unless needed.
Being stuck with python2.7 is simply a shame. Version-agnostic code would be best, as python2.7 is still in use, though if it turns out to be too difficult to maintain, python3 should be the way to go.
The ProfInt class is small. There is no need for it to hold an attribute instead of returning the result.
A bin directory must be added parallel to the module directory. It can contain a placeholder argparse script.
Code restructuring might be needed for network, just as it was for genome and interactions.
The command calling unittest in the .travis.yml file is repeated.
prowler.profiles.Profile objects are not equal even if created from the same data source.
import prowler as prwl
p1 = prwl.profiles.Profile(list('abcde'), list('abc'))
p2 = prwl.profiles.Profile(list('abcde'), list('abc'))
Expected:
p1 == p2
True
Actual:
p1 == p2
False
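Python compares instances by identity unless a class defines __eq__, which would explain the result above. The class below is a hypothetical stand-in, not the real prowler.profiles.Profile; its attribute names are assumptions. It only shows one way value-based equality could be added.

```python
class Profile:
    """Hypothetical stand-in for prowler.profiles.Profile."""

    def __init__(self, reference, query):
        # Tuples make the stored data immutable and hashable.
        self.reference = tuple(reference)
        self.query = tuple(query)

    def __eq__(self, other):
        if not isinstance(other, Profile):
            return NotImplemented
        return (self.reference, self.query) == (other.reference, other.query)

    def __hash__(self):
        # Defining __eq__ removes the default __hash__, so restore one
        # that is consistent with the equality above.
        return hash((self.reference, self.query))


p1 = Profile(list("abcde"), list("abc"))
p2 = Profile(list("abcde"), list("abc"))
p3 = Profile(list("abcde"), list("ab"))
same = p1 == p2
different = p1 == p3
```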
Network-based tests that require an external host to be available should be skipped unless the host is pingable. Currently, the build fails because Costanzo's supplement sites are temporarily down.
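One possible way to do this with the standard library is sketched below. It checks reachability by opening a TCP connection rather than sending an ICMP ping, which is an assumption about what "pingable" should mean here. The host name and the test class are hypothetical placeholders.

```python
import socket
import unittest


def host_reachable(host, port=80, timeout=3):
    """Return True when a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts.
        return False


class DownloadTests(unittest.TestCase):
    # Hypothetical placeholder; use the host the tests really depend on.
    HOST = "example.invalid"

    @classmethod
    def setUpClass(cls):
        # Skip the whole class instead of failing when the host is down.
        if not host_reachable(cls.HOST):
            raise unittest.SkipTest("{} is not reachable".format(cls.HOST))

    def test_download(self):
        self.assertTrue(host_reachable(self.HOST))
```

Raising unittest.SkipTest in setUpClass defers the check until the tests actually run, so importing the test module never touches the network.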
Parsing the KEGG database works for KEGG Orthology but is still not very reliable. For the other KEGG databases it works poorly or not at all. The main problem is probably finding a proper way of handling multiline fields in regex.
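A possible approach to multiline fields is sketched below, assuming that a field starts with its key at the beginning of a line and that continuation lines start with whitespace. The entry text is a made-up example and parse_field is a hypothetical helper, not the parser used by the project.

```python
import re

# Made-up entry that only mimics the assumed layout.
ENTRY = """\
ENTRY       K00000
NAME        example
GENES       AAA: gene1 gene2
            BBB: gene3
REFERENCE   none
"""


def parse_field(entry, field):
    """Return the lines of one field, continuation lines included."""
    # The first line follows the key; any number of following lines
    # that begin with spaces or tabs belong to the same field.
    pattern = r"^{}[ \t]+(.*(?:\n[ \t]+.*)*)".format(re.escape(field))
    match = re.search(pattern, entry, flags=re.MULTILINE)
    if match is None:
        return []
    return [line.strip() for line in match.group(1).splitlines()]


genes = parse_field(ENTRY, "GENES")
name = parse_field(ENTRY, "NAME")
absent = parse_field(ENTRY, "MODULE")
```

re.MULTILINE makes ^ match at the start of every line, while the dot still stops at newlines, so the continuation lines have to be consumed explicitly by the inner group.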
The name Prowler is already taken. The whole project must be renamed.
Switching to: prowlr. It is free at PyPI, Anaconda Cloud and GitHub.
Before the rename is done, all the pickled files in test_data must be replaced with text ones so that they do not depend on the module name.
What has been done so far:
The highest-level convenience method prwlr.save_network should be implemented in prwlr.core so that the whole network can be saved without losing prwlr.profiles.Profile objects.
There should be a feature for downloading Costanzo's API v2, just as there is for v1.
The query suffix is hard-coded to _Q. It does not manifest as long as the module-level settings are not changed; nevertheless, it is a bad typo.
Line 66 in f62feb8
prowler.stats.Selector should hold proper genetic interaction names (positive DMF is not one) and there should be more of them.
The highest-level convenience method prwlr.read_network should be implemented in prwlr.core so that the whole network can be re-read without losing prwlr.profiles.Profile objects.
prowler.stats.Stats.permute_profiles works, but the load is not sufficiently distributed on large machines.
The prwlr.profiles.Profile.from_string method should be implemented.
An iterable cannot be saved to a flat file directly, but it can be written to and read from a str with the join/split methods. Both are present in Python's standard library and in pandas.Series.str.
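The round trip through a string can be sketched as follows. The separator and the function names to_string and from_string are assumptions made for illustration; the real method would live on the Profile class.

```python
import pandas as pd

# Hypothetical separator; any character absent from the items works.
SEPARATOR = ","


def to_string(items):
    """Serialize an iterable of str into a single string."""
    return SEPARATOR.join(items)


def from_string(text):
    """Rebuild the list of items from its string form."""
    return text.split(SEPARATOR)


single = from_string(to_string(["+", "-", "+"]))

# The same round trip for a whole column, using pandas.Series.str.
profiles = pd.Series([["+", "-", "+"], ["-", "-", "+"]])
as_text = profiles.str.join(SEPARATOR)
restored = as_text.str.split(SEPARATOR)
```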
apis.get_KOs_db_X_ref does not work when used for live download.
It should download temporary files and convert them to pandas.DataFrames, then merge them, drop whatever should be dropped and return a proper pandas.DataFrame.
It throws a KeyError:
208 """
209 def f(i):
--> 210 print("{i} ".format(), flush=True, end='\r')
211 res = rq.get('{}/{}/{}/{}'.format(
212 self.home,
KeyError: 'i'
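The traceback points at a str.format call that uses the named field {i} without supplying a value for it. The snippet below reproduces that behaviour in isolation and shows one possible fix. The function names are hypothetical and the surrounding download code is left out.

```python
def progress_broken(i):
    # Mirrors the failing call: the named field {i} receives no argument.
    return "{i} ".format()


def progress_fixed(i):
    # Passing the value by keyword satisfies the named field.
    return "{i} ".format(i=i)


try:
    progress_broken(3)
    error_key = None
except KeyError as err:
    error_key = err.args[0]

fixed_output = progress_fixed(3)
```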
The highest-level convenience method prwlr.read_profiles should be implemented in prwlr.core, analogous to prwlr.core.read_sga. There should also be prwlr.save_profiles available.
Amend the __doc__ strings. These need to be good enough not only for the CLI but also for publishing as HTML.
apis.py
core.py
databases
errors.py
network.py
profiles.py
stats.py
utils.py
Cannot run the stats.calculate_enrichment function.
Pass two dataframes to stats.calculate_enrichment as described by the function __doc__.
stats.calculate_enrichment should return a dataframe with enrichment scores.
stats.calculate_enrichment gives
AttributeError: ("type object 'Columns' has no attribute '_score'", 'occurred at index 0')
The KEGG modules should be handled, probably as another prowler.apis.get_db_X_ref function. They can be fetched with e.g. http://rest.kegg.jp/link/md/K02030, which returns straight CSV rather than a twisted database entry.
Duplicated phylogenetic profiles should be either dropped or merged.
For some ORFs KEGG Orthology gives more than one KO group ID. It produces more than one phylogenetic profile.
import prowler as prwl
api = prwl.apis.KEGG_API()
api.get_organisms_ids('tmp_org_ids.csv')
api.get_org_db_X_ref('Saccharomyces cerevisiae', 'orthology', out_file_name='tmp_org_KO.csv')
Expected (no duplicated ORF_ID):
api.org_db_X_ref_df[api.org_db_X_ref_df.duplicated(subset=['ORF_ID'])]
ORF_ID | KEGG_ID |
---|---|
Actual:
api.org_db_X_ref_df[api.org_db_X_ref_df.duplicated(subset=['ORF_ID'])]
ORF_ID | KEGG_ID |
---|---|
YBR019C | K01785 |
RDN37-1 | K01982 |
RDN37-2 | K01982 |
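Both options, dropping and merging, can be sketched with plain pandas. The identifiers below are made up for illustration and do not come from KEGG; only the column names follow the output above.

```python
import pandas as pd

# Made-up identifiers: ORF1 is assigned to two KO groups.
df = pd.DataFrame({
    "ORF_ID": ["ORF1", "ORF1", "ORF2"],
    "KEGG_ID": ["K00001", "K00002", "K00003"],
})

# Option 1: keep only the first KO group per ORF.
dropped = df.drop_duplicates(subset=["ORF_ID"], keep="first")

# Option 2: merge the KO groups of each ORF into one list.
merged = (
    df.groupby("ORF_ID", sort=False)["KEGG_ID"]
    .agg(list)
    .reset_index()
)
```

Dropping is simpler but discards information, while merging keeps every KO group and leaves the decision of how to combine the resulting profiles to a later step.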
prowler.stats.permute_profiles does not accept the recent feature of PSS calculation with different distance measures. It should.