hammerlab / cohorts
Utilities for analyzing mutations and neoepitopes in patient cohorts
License: Apache License 2.0
Summarize provenance function fails when comparing an existing cache provenance file vs a non-existing cache provenance file.
AttributeErrorTraceback (most recent call last)
<ipython-input-5-e89ba68b93f3> in <module>()
----> 1 cohort = data.init_cohort(join_with=["ensembl_coverage"])
[ ... some contents omitted ... ]
/mnt/ssd0/env/local/lib/python2.7/site-packages/cohorts/load.pyc in summarize_data_sources(self)
1202 - provenance_file_summary: summary of provenance file contents (see `?cohorts.Cohort.summarize_provenance`)
1203 """
-> 1204 provenance_file_summary = self.summarize_provenance()
1205 dataframe_hash = self.summarize_dataframe()
1206 results = {
/mnt/ssd0/env/local/lib/python2.7/site-packages/cohorts/load.pyc in summarize_provenance(self)
1181 summary_provenance,
1182 left_outer_diff = "In %s but not in %s" % (cache, summary_provenance_name),
-> 1183 right_outer_diff = "In %s but not in %s" % (summary_provenance_name, cache)
1184 )
1185 ## compare provenance across cached items
/mnt/ssd0/env/local/lib/python2.7/site-packages/cohorts/load.pyc in compare_provenance(this_provenance, other_provenance, left_outer_diff, right_outer_diff)
1253 Number of discrepancies (0: None)
1254 """
-> 1255 this_items = set(this_provenance.items())
1256 other_items = set(other_provenance.items())
1257
AttributeError: 'NoneType' object has no attribute 'items'
(This happens when calling init_cohort() on an existing cohort.)
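A minimal guard for this would treat a missing provenance file as an empty dict before diffing. This is a sketch, not the cohorts implementation; the function body here is reconstructed from the docstring fragments in the traceback above:

```python
def compare_provenance(this_provenance, other_provenance,
                       left_outer_diff="In first but not second",
                       right_outer_diff="In second but not first"):
    """Compare two provenance dicts; return the number of discrepancies (0: None)."""
    # Treat a missing (None) provenance file as empty rather than crashing
    # on None.items().
    this_items = set((this_provenance or {}).items())
    other_items = set((other_provenance or {}).items())

    left_only = this_items - other_items
    right_only = other_items - this_items
    for name, version in left_only:
        print("%s: %s==%s" % (left_outer_diff, name, version))
    for name, version in right_only:
        print("%s: %s==%s" % (right_outer_diff, name, version))
    return len(left_only) + len(right_only)
```

With this guard, comparing an existing cache provenance file against a non-existing one reports every package as a discrepancy instead of raising AttributeError.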
When I run nosetests test locally, I see:
tavi@tavi-machine-clone:~/cohorts$ nosetests test
................../home/tavi/cohorts/cohorts/utils.py:136: UserWarning: Warning: strip_column_names (if run) would introduce duplicate names. Reverting column names to the original.
warnings.warn(warn_str)
............
----------------------------------------------------------------------
@jburos is this expected?
Example:
vc = cohort.load_variants()
vc[0].metadata
{}
vc = cohort.load_variants(filter_fn=None)
vc[0].metadata
{Variant(contig='1', ...
We're currently frozen on varcode and isovar, and also have version limits on topiary and mhctools. We need to fix cohorts to work on the latest versions of all these packages.
For final publication, it would be nicer if these were not True/False but Benefit vs. No Benefit.
Described here: #52 (comment)
See @arahuja's comment: #14 (comment)
Look into this warning:
objc[5253]: Class TKApplication is implemented in both /Users/arahuja/anaconda/envs/py34/lib/libtk8.5.dylib and /System/Library/Frameworks/Tk.framework/Versions/8.5/Tk. One of the two will be used. Which one is undefined.
objc[5253]: Class TKMenu is implemented in both /Users/arahuja/anaconda/envs/py34/lib/libtk8.5.dylib and /System/Library/Frameworks/Tk.framework/Versions/8.5/Tk. One of the two will be used. Which one is undefined.
objc[5253]: Class TKContentView is implemented in both /Users/arahuja/anaconda/envs/py34/lib/libtk8.5.dylib and /System/Library/Frameworks/Tk.framework/Versions/8.5/Tk. One of the two will be used. Which one is undefined.
objc[5253]: Class TKWindow is implemented in both /Users/arahuja/anaconda/envs/py34/lib/libtk8.5.dylib and /System/Library/Frameworks/Tk.framework/Versions/8.5/Tk. One of the two will be used. Which one is undefined.
Assuming all the data is >= 0. This will be addressed while addressing #109
@jburos suggested adding environment variables to provenance to keep track of paths, but it just occurred to me that this repository doesn't know about any environment variables. It does know about bam_path_rna, bam_path_dna, etc.
From this comment: #111 (comment) 6eed798#r74260647
This should be fixed to be more clear.
My naive attempt to fix this (per this old commit) failed because most of the test cases for provenance files do not include a file for each patient; i.e., there are either 3 or 4 patients in the test data and only 1 of them has a provenance file.
Perhaps this is a realistic scenario? Currently summarize_data_sources() fails when there isn't a provenance file for each patient. I don't know the correct behavior in this case, so for now I'm postponing the problem.
Ideally, multiple Cohorts could live in the same database.
Note the following scenario:
1. Patient 2 is removed from the variant cache.
2. Patient 2's underlying VCF file path is deleted.
3. load_effects is called.
4. Patient 2's variants don't exist; print("Variants did not exist for patient %s" % patient.id) runs.
5. load_effects is called again, and nothing is printed this time, because the 0 variants were cached by load_effects.
Long story short: errors need to be thrown so that we don't cache an error.
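The fix could look something like this sketch, where a load raises on missing source data instead of caching an empty result. The function and exception names here are hypothetical, not the cohorts API:

```python
class MissingVariantsError(Exception):
    """Raised when a patient's variants cannot be loaded from source."""

def load_variants_for_patient(patient_id, variant_cache, vcf_paths):
    """Load (and cache) variants for one patient, raising rather than
    caching an empty result when the source VCF is missing."""
    if patient_id in variant_cache:
        return variant_cache[patient_id]
    if patient_id not in vcf_paths:
        # Raise instead of printing and caching 0 variants, so a later
        # call can't silently pick up the cached error state.
        raise MissingVariantsError(
            "Variants did not exist for patient %s" % patient_id)
    variants = ["variant-from-%s" % vcf_paths[patient_id]]  # stand-in for VCF parsing
    variant_cache[patient_id] = variants
    return variants
```

The key point is that nothing is written to the cache on the failure path, so step 5 above would raise again rather than stay silent.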
By adding something like this, albeit less hacky, to plot_boolean
:
if type(boolean_col) == FunctionType:
    cols, df = self.as_dataframe([on, boolean_col], **kwargs)
    boolean_col = cols[1]
df = filter_not_null(df, boolean_col)
When using as_dataframe, the additional_data field from patients is merged back in, but this assumes that one of the keys is patient_id.
additional_data_all_patients = defaultdict(list)
for patient in self:
    if patient.additional_data is not None:
        for key, value in patient.additional_data.items():
            additional_data_all_patients[key].append(value)
if len(additional_data_all_patients) > 0:
    df = df.merge(pd.DataFrame(additional_data_all_patients), on="patient_id", how="left")
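One way to remove that assumption is to record each patient's id alongside its additional data, so the merge key is guaranteed rather than assumed. A sketch (the helper name is hypothetical; pd.DataFrame/merge behavior is standard pandas):

```python
from collections import defaultdict
import pandas as pd

def merge_additional_data(df, patients):
    """Merge per-patient additional_data into df, keyed on patient_id."""
    additional = defaultdict(list)
    for patient in patients:
        if patient.additional_data is not None:
            # Supply the merge key ourselves instead of hoping that
            # additional_data happens to contain patient_id.
            additional["patient_id"].append(patient.id)
            for key, value in patient.additional_data.items():
                if key == "patient_id":
                    continue  # already recorded above
                additional[key].append(value)
    if not additional:
        return df
    return df.merge(pd.DataFrame(additional), on="patient_id", how="left")
```

Patients without additional_data simply get NaN in the merged columns, which matches the how="left" semantics of the original snippet.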
To highlight uncertainty in light of censored data. cc @iskandr
@tavinathanson I'm a bit confused about the behavior of the filter_fn arg of Cohort. Should this be a function from FilterableVariant -> bool? Or FilterableEffect?
For example, something like
def qcfilter(filterable_variant):
    somatic_stats = variant_stats_from_variant(filterable_variant.variant,
                                               filterable_variant.variant_metadata)
    ...
works with either, since both have a variant field and variant_metadata, but if we ever look at exclusive properties, load_variants or load_effects would fail.
Via implementing the re-scaling as in tximport
Sometimes I see integer sorting and sometimes I see alphanumeric sorting; at the least, we should be sure that IDs always match up correctly (i.e. the right sample ID with the right BAM IDs).
load_variants returns a dictionary from patient_id to VariantCollection, while load_neoepitopes returns a DataFrame with neoepitopes spanning all patients. This should be more consistent, though I'm not sure what the cleanest solution is. (I'd rather not convert everything to a DataFrame and lose the ability to easily filter VariantCollections natively.)
See #86 (comment)
One solution that we currently have is manually disabling summary printing.
Maybe we don't want this, but perhaps a warning if I do:
cohort.as_dataframe(join_with='non-existant data')
We've often run into trouble with different de-duping mechanisms prior to grouping variants, epitopes, etc. This should be better tested.
To make it more clear what's going on!
Inspired by #96
Seems very easy to accidentally forget the following lines at the beginning of functions:
filter_fn = first_not_none_param([filter_fn, cohort.filter_fn], no_filter)
normalized_per_mb = first_not_none_param([normalized_per_mb, cohort.normalized_per_mb], False)
...and thereby not use the Cohort default.
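One way to make this harder to forget is a decorator that applies the cohort-level defaults to any function taking the cohort as its first argument. A sketch; the decorator name is hypothetical, and first_not_none_param is reimplemented here for self-containment:

```python
def first_not_none_param(params, default):
    """Return the first non-None value in params, else the default."""
    for param in params:
        if param is not None:
            return param
    return default

def with_cohort_defaults(**defaults):
    """Decorator that fills None keyword args from same-named cohort
    attributes, falling back to the given default, e.g.
    @with_cohort_defaults(filter_fn=no_filter, normalized_per_mb=False)."""
    def decorator(func):
        def wrapper(cohort, *args, **kwargs):
            for name, fallback in defaults.items():
                kwargs[name] = first_not_none_param(
                    [kwargs.get(name), getattr(cohort, name, None)], fallback)
            return func(cohort, *args, **kwargs)
        return wrapper
    return decorator
```

This keeps the resolution order (explicit arg, then cohort attribute, then fallback) in one place instead of repeated boilerplate at the top of each function.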
After tagging a release 0.1.0, I get the following from Travis:
HTTPError: 400 Client Error: Invalid version, cannot use PEP 440 local versions on PyPI. for url: https://pypi.python.org/pypi
It appears that cohorts.__version__ is not 0.1.0, but I'm not sure why not.
From @arahuja:
We definitely need a way to not filter, which no_filter solves, but for some reason I'd rather None did that? Does it make sense for filter_fn to have 3 possible args: 'default', a filter fn, or None, where 'default' goes to the cohort? This way the user never needs to know of anything special: either the default happens, or the user specifies a filter function or None. I worry about the discoverability of no_filter, and just None being more natural?
E.g., Cohort(read_only=True) would allow use of a shared cache without fear of overwriting anything in it.
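A sketch of how such a flag could gate cache writes while leaving reads open (the class and exception names are hypothetical, not the cohorts API):

```python
class ReadOnlyCacheError(Exception):
    """Raised on attempts to write to a cache opened read-only."""

class CohortCache:
    def __init__(self, store=None, read_only=False):
        # store stands in for the on-disk cache directory.
        self.store = store if store is not None else {}
        self.read_only = read_only

    def load(self, key):
        # Reads are always allowed, so a shared cache can be reused safely.
        return self.store.get(key)

    def save(self, key, value):
        if self.read_only:
            raise ReadOnlyCacheError(
                "Refusing to write %r: cache is read-only" % key)
        self.store[key] = value
```

Raising loudly on write (rather than silently skipping it) makes misconfiguration obvious when a workflow unexpectedly needs to regenerate a cached result.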
Travis builds on the summarize_provenance branch continue to fail, even after freezing isovar to v0.0.6.
E.g. in build 329:
python2.7:
Collecting scikit-bio>=0.4.2 (from isovar==0.0.6->-r requirements.txt (line 14))
Downloading scikit-bio-0.5.0.zip (8.4MB)
100% |████████████████████████████████| 8.4MB 159kB/s
Complete output from command python setup.py egg_info:
scikit-bio can only be used with Python 3. You are currently running Python 2.
python3.4:
************* Module cohorts.load
E:1067,17: No value for argument 'join_with' in method call (no-value-for-parameter)
E:1067,17: No value for argument 'join_how' in method call (no-value-for-parameter)
Not sure whether either of these is related to the changes in this branch. Posting here for assistance.
@iskandr is changing a few things around. Also, his comment from before:
My only suggestion would be to maybe increase the minimum MAPQ to 1 and if you're ultimately using the counts of neoepitopes with >= 3 spanning reads then this will be very sensitive to the degree of exonic coverage. Maybe normalize by number of reads mapping to exons?
Hitting a bug with verify_cache
def verify_cache(self, cache_names):
    bad_caches = []
    for cache_name in cache_names.values():
        cache_dir = path.join(self.cache_dir, cache_name)
        if path.exists(cache_dir):
            cache_subdirs = set(listdir(cache_dir))
            cache_int_subdirs = set([int(name) for name in cache_subdirs])
            if len(cache_subdirs) != len(cache_int_subdirs):
                bad_caches.append(cache_name)
    if len(bad_caches) > 0:
        raise ValueError("Caches %s have duplicate int/str directories" %
                         str(bad_caches))
/demeter/users/ahujaa01/src/hammerlab/cohorts/cohorts/load.pyc in verify_cache(self, cache_names)
254 if path.exists(cache_dir):
255 cache_subdirs = set(listdir(cache_dir))
--> 256 cache_int_subdirs = set([int(name) for name in cache_subdirs])
257 if len(cache_subdirs) != len(cache_int_subdirs):
258 bad_caches.append(cache_name)
ValueError: invalid literal for int() with base 10: 'TCGA-55-7907'
Is this still needed? It seems to assume that the sample IDs are ints.
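If the check is still needed, it could flag only names that actually look numeric, instead of crashing int() on IDs like 'TCGA-55-7907'. A sketch operating on a plain list of directory names rather than the filesystem:

```python
def find_duplicate_int_str_dirs(cache_subdir_names):
    """Return directory names that collide when interpreted as ints.

    E.g. "1" and "01" both map to int 1, so both are returned;
    non-numeric names such as "TCGA-55-7907" are simply ignored.
    """
    seen = {}
    duplicates = set()
    for name in cache_subdir_names:
        # Only convert names that are purely digits; leave string IDs alone.
        key = int(name) if name.isdigit() else name
        if key in seen and seen[key] != name:
            duplicates.add(name)
            duplicates.add(seen[key])
        seen[key] = name
    return duplicates
```

verify_cache could then raise only when this returns a non-empty set, preserving the original duplicate-detection intent without the int-only assumption.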
See @arahuja's comment: #14 (comment)
Maybe out of scope for this project, but for TCGA data it would be useful to support PFS or OS being set to NaN. Right now, this gives an error because of the PFS <= OS assertion.
Currently, builds against the latest versions of isovar fail. See, for example, build #327.
The error message says:
ImportError: No module named protein_sequence
@tavinathanson mentions this is because the latest isovar API has changed. For now, the requirements.txt file has been frozen at isovar==0.0.6.
Not sure what the right way is to specify git packages in setup.py, but right now pip install . or python setup.py install gives me the following message:
Collecting isovar>=0.0.2 (from cohorts==0.1.0+6.gd3244aa)
Could not find a version that satisfies the requirement isovar>=0.0.2 (from cohorts==0.1.0+6.gd3244aa) (from versions: )
No matching distribution found for isovar>=0.0.2 (from cohorts==0.1.0+6.gd3244aa)
I can work around it by running the following first:
pip install git+git://github.com/hammerlab/isovar
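One mechanism that existed for this was dependency_links in setup.py. A sketch of what that could look like; note that pip only honored dependency_links behind the --process-dependency-links flag, and the mechanism was later deprecated in favor of direct git+ requirements:

```python
from setuptools import setup

setup(
    name="cohorts",
    install_requires=["isovar>=0.0.2"],
    # Tell setuptools where to find a package that isn't on PyPI;
    # the #egg fragment must match the requirement's name and version.
    dependency_links=[
        "git+https://github.com/hammerlab/isovar.git#egg=isovar-0.0.2",
    ],
)
```

In practice, keeping the git+ URL in requirements.txt (as the workaround above does) is often the more reliable route.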
https://github.com/hammerlab/cohorts/blob/master/cohorts/load.py#L541 should call _merge_variant_collections to get the metadata into the expected format, even if no merging is necessary. Once that's done, Patient no longer needs to point to Cohort.
@arahuja based on our conversation yesterday, it seems like filter_fn should default back to None for two reasons (among them: None is looking appropriate for our cohort). Agreed?
For example:
{'dataframe_hash': -312166828261663650,
'provenance_file_summary': {'cached-effects': {u'cohorts': u'0+untagged.260.g71e0082',
u'isovar': u'0.0.6',
u'mhctools': u'0.2.3',
u'numpy': u'1.11.0',
u'pandas': u'0.18.1',
u'pyensembl': u'0.9.3',
u'scipy': u'0.17.0',
u'topiary': u'0.0.21',
u'varcode': u'0.4.14+6.g4bb441f'},
'cached-expressed-neoantigens': {u'cohorts': u'0+untagged.188.g6260531.dirty',
u'isovar': u'0.0.6',
u'mhctools': u'0.2.3',
u'numpy': u'1.11.0',
u'pandas': u'0.18.1',
u'pyensembl': u'0.9.1',
u'scipy': u'0.17.0',
u'topiary': u'0.0.21',
u'varcode': u'0.4.14'},
'cached-isovar-output': {u'cohorts': u'0+untagged.188.g6260531.dirty',
u'isovar': u'0.0.6',
u'mhctools': u'0.2.3',
u'numpy': u'1.11.0',
u'pandas': u'0.18.1',
u'pyensembl': u'0.9.1',
u'scipy': u'0.17.0',
u'topiary': u'0.0.21',
u'varcode': u'0.4.14'},
...
Maybe this could, instead, be something like:
{'dataframe_hash': -312166828261663650,
 'provenance_file_summary': {'isovar': '0.0.6',
                             'numpy': {'cached-isovar-output': '1.11.0',
                                       'cached-neoantigens': '1.12.0'},  # When they disagree
...
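Pivoting the summary from cache -> package -> version into package -> version (collapsing to a single string when all caches agree) could be sketched as:

```python
def pivot_provenance(provenance_by_cache):
    """Invert {cache: {package: version}} into {package: version-or-dict}.

    If every cache agrees on a package's version, keep a single string;
    otherwise keep a {cache: version} dict to surface the disagreement.
    """
    by_package = {}
    for cache_name, packages in provenance_by_cache.items():
        for package, version in packages.items():
            by_package.setdefault(package, {})[cache_name] = version
    summary = {}
    for package, versions in by_package.items():
        unique = set(versions.values())
        summary[package] = versions if len(unique) > 1 else unique.pop()
    return summary
```

This keeps the full per-cache detail only where it carries information, which is exactly the case the reader needs to notice.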
i think this is where you'll need to change something for lung, somehow making this configurable to not load the ensembl_coverage loader
Line 91 in 4718cb0
maybe e.g. if 'MB' column exists use that else join
that'll call load_ensembl_whatever which fails if there's no pageant dir
won’t it first check that the key exists?
oh ok so the key does exist
and then it'll call load
and that will fail with a path
Every cached file can be accompanied by a MANIFEST that lists the software versions that created it.
It's a bit monstrous at the moment.
Collecting scikit-bio>=0.4.2 (from isovar==0.0.6->-r requirements.txt (line 14))
Downloading scikit-bio-0.5.0.zip (8.4MB)
100% |████████████████████████████████| 8.4MB 124kB/s
Complete output from command python setup.py egg_info:
scikit-bio can only be used with Python 3. You are currently running Python 2.
This seems to come from the newest isovar (0.4.21), as this pulled a new version after a version bump yesterday.