hammerlab / cohorts
Utilities for analyzing mutations and neoepitopes in patient cohorts
License: Apache License 2.0
Summarize provenance function fails when comparing an existing cache provenance file vs a non-existing cache provenance file.
AttributeErrorTraceback (most recent call last)
<ipython-input-5-e89ba68b93f3> in <module>()
----> 1 cohort = data.init_cohort(join_with=["ensembl_coverage"])
[ ... some contents omitted ... ]
/mnt/ssd0/env/local/lib/python2.7/site-packages/cohorts/load.pyc in summarize_data_sources(self)
1202 - provenance_file_summary: summary of provenance file contents (see `?cohorts.Cohort.summarize_provenance`)
1203 """
-> 1204 provenance_file_summary = self.summarize_provenance()
1205 dataframe_hash = self.summarize_dataframe()
1206 results = {
/mnt/ssd0/env/local/lib/python2.7/site-packages/cohorts/load.pyc in summarize_provenance(self)
1181 summary_provenance,
1182 left_outer_diff = "In %s but not in %s" % (cache, summary_provenance_name),
-> 1183 right_outer_diff = "In %s but not in %s" % (summary_provenance_name, cache)
1184 )
1185 ## compare provenance across cached items
/mnt/ssd0/env/local/lib/python2.7/site-packages/cohorts/load.pyc in compare_provenance(this_provenance, other_provenance, left_outer_diff, right_outer_diff)
1253 Number of discrepancies (0: None)
1254 """
-> 1255 this_items = set(this_provenance.items())
1256 other_items = set(other_provenance.items())
1257
AttributeError: 'NoneType' object has no attribute 'items'
(This happens when calling init_cohort() on an existing cohort.)
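A minimal guard for this would treat a missing provenance file as an empty dict before diffing. This is a sketch, not the cohorts implementation; the function body here is reconstructed from the docstring fragments in the traceback above:

```python
def compare_provenance(this_provenance, other_provenance,
                       left_outer_diff="In first but not second",
                       right_outer_diff="In second but not first"):
    """Compare two provenance dicts; return the number of discrepancies (0: None)."""
    # Treat a missing (None) provenance file as empty rather than crashing
    # on None.items().
    this_items = set((this_provenance or {}).items())
    other_items = set((other_provenance or {}).items())

    left_only = this_items - other_items
    right_only = other_items - this_items
    for name, version in left_only:
        print("%s: %s==%s" % (left_outer_diff, name, version))
    for name, version in right_only:
        print("%s: %s==%s" % (right_outer_diff, name, version))
    return len(left_only) + len(right_only)
```

With this guard, comparing an existing cache provenance file against a non-existing one reports every package as a discrepancy instead of raising AttributeError.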
When I run nosetests test locally, I see:
tavi@tavi-machine-clone:~/cohorts$ nosetests test
................../home/tavi/cohorts/cohorts/utils.py:136: UserWarning: Warning: strip_column_names (if run) would introduce duplicate names. Reverting column names to the original.
warnings.warn(warn_str)
............
----------------------------------------------------------------------
@jburos is this expected?
Example:
vc = cohort.load_variants()
vc[0].metadata
{}
vc = cohort.load_variants(filter_fn=None)
vc[0].metadata
{Variant(contig='1', ...
We're currently frozen on varcode and isovar, and also have version limits on topiary and mhctools. We need to fix cohorts to work on the latest versions of all these packages.
For final publication, it would be nicer if these were not True/False but Benefit vs. No Benefit.
Described here: #52 (comment)
See @arahuja's comment: #14 (comment)
Look into this warning:
objc[5253]: Class TKApplication is implemented in both /Users/arahuja/anaconda/envs/py34/lib/libtk8.5.dylib and /System/Library/Frameworks/Tk.framework/Versions/8.5/Tk. One of the two will be used. Which one is undefined.
objc[5253]: Class TKMenu is implemented in both /Users/arahuja/anaconda/envs/py34/lib/libtk8.5.dylib and /System/Library/Frameworks/Tk.framework/Versions/8.5/Tk. One of the two will be used. Which one is undefined.
objc[5253]: Class TKContentView is implemented in both /Users/arahuja/anaconda/envs/py34/lib/libtk8.5.dylib and /System/Library/Frameworks/Tk.framework/Versions/8.5/Tk. One of the two will be used. Which one is undefined.
objc[5253]: Class TKWindow is implemented in both /Users/arahuja/anaconda/envs/py34/lib/libtk8.5.dylib and /System/Library/Frameworks/Tk.framework/Versions/8.5/Tk. One of the two will be used. Which one is undefined.
Assuming all the data is >= 0. This will be addressed while addressing #109
@jburos suggested adding environment variables to provenance to keep track of paths, but it just occurred to me that this repository doesn't know about any environment variables. It does know about bam_path_rna, bam_path_dna, etc.
From this comment: #111 (comment) 6eed798#r74260647
This should be fixed to be more clear.
My naive attempt to fix this (per this old commit) failed because most of the test cases for provenance files do not include a file for each patient; i.e., there are either 3 or 4 patients in the test data and only 1 of them has a provenance file.
Perhaps this is a realistic scenario? Currently summarize_data_sources() fails when there isn't a provenance file for each patient. I don't know the correct behavior in this case, so for now I'm postponing the problem.
Ideally, multiple Cohorts could live in the same database.
Note the following scenario:
1. Patient 2 is removed from the variant cache.
2. Patient 2's underlying VCF file path is deleted.
3. load_effects is called.
4. Patient 2's variants don't exist; print("Variants did not exist for patient %s" % patient.id) runs.
5. load_effects is called again, and nothing is printed this time, because the 0 variants were cached by load_effects.
Long story short: errors need to be thrown so that we don't cache an error.
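The fix could look something like this sketch, where a load raises on missing source data instead of caching an empty result. The function and exception names here are hypothetical, not the cohorts API:

```python
class MissingVariantsError(Exception):
    """Raised when a patient's variants cannot be loaded from source."""

def load_variants_for_patient(patient_id, variant_cache, vcf_paths):
    """Load (and cache) variants for one patient, raising rather than
    caching an empty result when the source VCF is missing."""
    if patient_id in variant_cache:
        return variant_cache[patient_id]
    if patient_id not in vcf_paths:
        # Raise instead of printing and caching 0 variants, so a later
        # call can't silently pick up the cached error state.
        raise MissingVariantsError(
            "Variants did not exist for patient %s" % patient_id)
    variants = ["variant-from-%s" % vcf_paths[patient_id]]  # stand-in for VCF parsing
    variant_cache[patient_id] = variants
    return variants
```

The key point is that nothing is written to the cache on the failure path, so step 5 above would raise again rather than stay silent.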
By adding something like this, albeit less hacky, to plot_boolean
:
if type(boolean_col) == FunctionType:
    cols, df = self.as_dataframe([on, boolean_col], **kwargs)
    boolean_col = cols[1]
df = filter_not_null(df, boolean_col)
When using as_dataframe, the additional_data field from patients is merged back in, but this assumes that one of the keys is patient_id.
additional_data_all_patients = defaultdict(list)
for patient in self:
    if patient.additional_data is not None:
        for key, value in patient.additional_data.items():
            additional_data_all_patients[key].append(value)
if len(additional_data_all_patients) > 0:
    df = df.merge(pd.DataFrame(additional_data_all_patients), on="patient_id", how="left")
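One way to remove that assumption is to record each patient's id alongside its additional data, so the merge key is guaranteed rather than assumed. A sketch (the helper name is hypothetical; pd.DataFrame/merge behavior is standard pandas):

```python
from collections import defaultdict
import pandas as pd

def merge_additional_data(df, patients):
    """Merge per-patient additional_data into df, keyed on patient_id."""
    additional = defaultdict(list)
    for patient in patients:
        if patient.additional_data is not None:
            # Supply the merge key ourselves instead of hoping that
            # additional_data happens to contain patient_id.
            additional["patient_id"].append(patient.id)
            for key, value in patient.additional_data.items():
                if key == "patient_id":
                    continue  # already recorded above
                additional[key].append(value)
    if not additional:
        return df
    return df.merge(pd.DataFrame(additional), on="patient_id", how="left")
```

Patients without additional_data simply get NaN in the merged columns, which matches the how="left" semantics of the original snippet.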
To highlight uncertainty in light of censored data. cc @iskandr
@tavinathanson I'm a bit confused about the behavior of the filter_fn arg of Cohort. Should this be a function from FilterableVariant -> bool? Or FilterableEffect?
For example, something like
def qcfilter(filterable_variant):
    somatic_stats = variant_stats_from_variant(filterable_variant.variant,
                                               filterable_variant.variant_metadata)
    ...
works with either, since both have a variant field and variant_metadata, but if we ever look at exclusive properties, load_variants or load_effects would fail.
Via implementing the re-scaling as in tximport
Sometimes I see integer sorting and sometimes I see alphanumeric sorting; at the least, we should be sure that IDs always match up correctly (i.e. the right sample ID with the right BAM IDs).
load_variants returns a dictionary from patient_id to VariantCollection, while load_neoepitopes returns a DataFrame with neoepitopes spanning all patients. This should be more consistent, though I'm not sure what the cleanest solution is. (I'd rather not convert everything to a DataFrame and lose the ability to easily filter VariantCollections natively.)
See #86 (comment)
One solution that we currently have is manually disabling summary printing.
Maybe we don't want this, but perhaps a warning if I do:
cohort.as_dataframe(join_with='non-existant data')
We've often run into trouble with different de-duping mechanisms prior to grouping variants, epitopes, etc. This should be better tested.
To make it more clear what's going on!
Inspired by #96
Seems very easy to accidentally forget the following lines at the beginning of functions:
filter_fn = first_not_none_param([filter_fn, cohort.filter_fn], no_filter)
normalized_per_mb = first_not_none_param([normalized_per_mb, cohort.normalized_per_mb], False)
...and thereby not use the Cohort default.
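One way to make this harder to forget is a decorator that applies the cohort-level defaults to any function taking the cohort as its first argument. A sketch; the decorator name is hypothetical, and first_not_none_param is reimplemented here for self-containment:

```python
def first_not_none_param(params, default):
    """Return the first non-None value in params, else the default."""
    for param in params:
        if param is not None:
            return param
    return default

def with_cohort_defaults(**defaults):
    """Decorator that fills None keyword args from same-named cohort
    attributes, falling back to the given default, e.g.
    @with_cohort_defaults(filter_fn=no_filter, normalized_per_mb=False)."""
    def decorator(func):
        def wrapper(cohort, *args, **kwargs):
            for name, fallback in defaults.items():
                kwargs[name] = first_not_none_param(
                    [kwargs.get(name), getattr(cohort, name, None)], fallback)
            return func(cohort, *args, **kwargs)
        return wrapper
    return decorator
```

This keeps the resolution order (explicit arg, then cohort attribute, then fallback) in one place instead of repeated boilerplate at the top of each function.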
After tagging a release 0.1.0, I get the following from Travis:
HTTPError: 400 Client Error: Invalid version, cannot use PEP 440 local versions on PyPI. for url: https://pypi.python.org/pypi
It appears that cohorts.__version__ is not 0.1.0, but I'm not sure why not.
From @arahuja:
We definitely need a way to not filter, which no_filter solves, but for some reason I'd rather None did that? Does it make sense for filter_fn to have 3 possible args: 'default', a filter fn, or None, where 'default' goes to the cohort? This way the user never needs to know of anything special: either the default happens, or the user specifies a filter function or None. I worry about the discoverability of no_filter, and just None being more natural?
E.g., Cohort(read_only=True) would allow use of a shared cache without fear of overwriting anything in it.
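A sketch of how such a flag could gate cache writes while leaving reads open (the class and exception names are hypothetical, not the cohorts API):

```python
class ReadOnlyCacheError(Exception):
    """Raised on attempts to write to a cache opened read-only."""

class CohortCache:
    def __init__(self, store=None, read_only=False):
        # store stands in for the on-disk cache directory.
        self.store = store if store is not None else {}
        self.read_only = read_only

    def load(self, key):
        # Reads are always allowed, so a shared cache can be reused safely.
        return self.store.get(key)

    def save(self, key, value):
        if self.read_only:
            raise ReadOnlyCacheError(
                "Refusing to write %r: cache is read-only" % key)
        self.store[key] = value
```

Raising loudly on write (rather than silently skipping it) makes misconfiguration obvious when a workflow unexpectedly needs to regenerate a cached result.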
Travis builds on the summarize_provenance branch continue to fail, even after freezing isovar to v0.0.6.
E.g. in build 329:
python2.7:
Collecting scikit-bio>=0.4.2 (from isovar==0.0.6->-r requirements.txt (line 14))
Downloading scikit-bio-0.5.0.zip (8.4MB)
100% |████████████████████████████████| 8.4MB 159kB/s
Complete output from command python setup.py egg_info:
scikit-bio can only be used with Python 3. You are currently running Python 2.
python3.4:
************* Module cohorts.load
E:1067,17: No value for argument 'join_with' in method call (no-value-for-parameter)
E:1067,17: No value for argument 'join_how' in method call (no-value-for-parameter)
Not sure whether either of these is related to the changes in this branch. Posting here for assistance.
@iskandr is changing a few things around. Also, his comment from before:
My only suggestion would be to maybe increase the minimum MAPQ to 1 and if you're ultimately using the counts of neoepitopes with >= 3 spanning reads then this will be very sensitive to the degree of exonic coverage. Maybe normalize by number of reads mapping to exons?
Hitting a bug with verify_cache
def verify_cache(self, cache_names):
    bad_caches = []
    for cache_name in cache_names.values():
        cache_dir = path.join(self.cache_dir, cache_name)
        if path.exists(cache_dir):
            cache_subdirs = set(listdir(cache_dir))
            cache_int_subdirs = set([int(name) for name in cache_subdirs])
            if len(cache_subdirs) != len(cache_int_subdirs):
                bad_caches.append(cache_name)
    if len(bad_caches) > 0:
        raise ValueError("Caches %s have duplicate int/str directories" %
                         str(bad_caches))
/demeter/users/ahujaa01/src/hammerlab/cohorts/cohorts/load.pyc in verify_cache(self, cache_names)
254 if path.exists(cache_dir):
255 cache_subdirs = set(listdir(cache_dir))
--> 256 cache_int_subdirs = set([int(name) for name in cache_subdirs])
257 if len(cache_subdirs) != len(cache_int_subdirs):
258 bad_caches.append(cache_name)
ValueError: invalid literal for int() with base 10: 'TCGA-55-7907'
Is this still needed? It seems to assume that the sample IDs are ints.
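If the check is still needed, it could flag only names that actually look numeric, instead of crashing int() on IDs like 'TCGA-55-7907'. A sketch operating on a plain list of directory names rather than the filesystem:

```python
def find_duplicate_int_str_dirs(cache_subdir_names):
    """Return directory names that collide when interpreted as ints.

    E.g. "1" and "01" both map to int 1, so both are returned;
    non-numeric names such as "TCGA-55-7907" are simply ignored.
    """
    seen = {}
    duplicates = set()
    for name in cache_subdir_names:
        # Only convert names that are purely digits; leave string IDs alone.
        key = int(name) if name.isdigit() else name
        if key in seen and seen[key] != name:
            duplicates.add(name)
            duplicates.add(seen[key])
        seen[key] = name
    return duplicates
```

verify_cache could then raise only when this returns a non-empty set, preserving the original duplicate-detection intent without the int-only assumption.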
See @arahuja's comment: #14 (comment)
Maybe out of scope for this project, but for TCGA data it would be useful to support PFS or OS being set to NaN. Right now, this gives an error because of the PFS <= OS assertion.
Currently, builds against the latest versions of isovar fail. See, for example, build #327.
The error message says:
ImportError: No module named protein_sequence
@tavinathanson mentions this is because the latest isovar API has changed. For now, the requirements.txt file has been frozen at isovar==0.0.6.
Not sure what the right way is to specify git packages in setup.py, but right now pip install . or python setup.py install gives me the following message:
Collecting isovar>=0.0.2 (from cohorts==0.1.0+6.gd3244aa)
Could not find a version that satisfies the requirement isovar>=0.0.2 (from cohorts==0.1.0+6.gd3244aa) (from versions: )
No matching distribution found for isovar>=0.0.2 (from cohorts==0.1.0+6.gd3244aa)
I can work around it by running the following first:
pip install git+git://github.com/hammerlab/isovar
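One mechanism that existed for this was dependency_links in setup.py. A sketch of what that could look like; note that pip only honored dependency_links behind the --process-dependency-links flag, and the mechanism was later deprecated in favor of direct git+ requirements:

```python
from setuptools import setup

setup(
    name="cohorts",
    install_requires=["isovar>=0.0.2"],
    # Tell setuptools where to find a package that isn't on PyPI;
    # the #egg fragment must match the requirement's name and version.
    dependency_links=[
        "git+https://github.com/hammerlab/isovar.git#egg=isovar-0.0.2",
    ],
)
```

In practice, keeping the git+ URL in requirements.txt (as the workaround above does) is often the more reliable route.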
https://github.com/hammerlab/cohorts/blob/master/cohorts/load.py#L541 should call _merge_variant_collections to get the metadata into the expected format, even if no merging is necessary. Once that's done, Patient no longer needs to point to Cohort.
@arahuja based on our conversation yesterday, it seems like filter_fn should default back to None for two reasons (among them: None is looking appropriate for our cohort). Agreed?
For example:
{'dataframe_hash': -312166828261663650,
'provenance_file_summary': {'cached-effects': {u'cohorts': u'0+untagged.260.g71e0082',
u'isovar': u'0.0.6',
u'mhctools': u'0.2.3',
u'numpy': u'1.11.0',
u'pandas': u'0.18.1',
u'pyensembl': u'0.9.3',
u'scipy': u'0.17.0',
u'topiary': u'0.0.21',
u'varcode': u'0.4.14+6.g4bb441f'},
'cached-expressed-neoantigens': {u'cohorts': u'0+untagged.188.g6260531.dirty',
u'isovar': u'0.0.6',
u'mhctools': u'0.2.3',
u'numpy': u'1.11.0',
u'pandas': u'0.18.1',
u'pyensembl': u'0.9.1',
u'scipy': u'0.17.0',
u'topiary': u'0.0.21',
u'varcode': u'0.4.14'},
'cached-isovar-output': {u'cohorts': u'0+untagged.188.g6260531.dirty',
u'isovar': u'0.0.6',
u'mhctools': u'0.2.3',
u'numpy': u'1.11.0',
u'pandas': u'0.18.1',
u'pyensembl': u'0.9.1',
u'scipy': u'0.17.0',
u'topiary': u'0.0.21',
u'varcode': u'0.4.14'},
...
Maybe this could, instead, be something like:
{'dataframe_hash': -312166828261663650,
 'provenance_file_summary': {'isovar': '0.0.6',
                             'numpy': {'cached-isovar-output': '1.11.0',
                                       'cached-neoantigens': '1.12.0'},  # When they disagree
...
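Pivoting the summary from cache -> package -> version into package -> version (collapsing to a single string when all caches agree) could be sketched as:

```python
def pivot_provenance(provenance_by_cache):
    """Invert {cache: {package: version}} into {package: version-or-dict}.

    If every cache agrees on a package's version, keep a single string;
    otherwise keep a {cache: version} dict to surface the disagreement.
    """
    by_package = {}
    for cache_name, packages in provenance_by_cache.items():
        for package, version in packages.items():
            by_package.setdefault(package, {})[cache_name] = version
    summary = {}
    for package, versions in by_package.items():
        unique = set(versions.values())
        summary[package] = versions if len(unique) > 1 else unique.pop()
    return summary
```

This keeps the full per-cache detail only where it carries information, which is exactly the case the reader needs to notice.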
i think this is where you'll need to change something for lung, somehow making this configurable to not load the ensembl_coverage loader
Line 91 in 4718cb0
maybe e.g. if 'MB' column exists use that else join
that'll call load_ensembl_whatever which fails if there's no pageant dir
won’t it first check that the key exists?
oh ok so the key does exist
and then it'll call load
and that will fail with a path
Every cached file can be accompanied by a MANIFEST that lists the software versions that created it.
It's a bit monstrous at the moment.
Collecting scikit-bio>=0.4.2 (from isovar==0.0.6->-r requirements.txt (line 14))
Downloading scikit-bio-0.5.0.zip (8.4MB)
100% |████████████████████████████████| 8.4MB 124kB/s
Complete output from command python setup.py egg_info:
scikit-bio can only be used with Python 3. You are currently running Python 2.
This seems to come from the newest isovar (0.4.21), as this pulled a new version after a version bump yesterday.