
cohorts's People

Contributors

arahuja, armish, e5c, hammer, jburos, tavinathanson

cohorts's Issues

Summarize_provenance fails after a cache has been deleted

The summarize_provenance function fails when comparing an existing cache's provenance file against that of a cache whose provenance file no longer exists.

output

AttributeError Traceback (most recent call last)
<ipython-input-5-e89ba68b93f3> in <module>()
----> 1 cohort = data.init_cohort(join_with=["ensembl_coverage"])

[ ... some contents omitted ... ]

/mnt/ssd0/env/local/lib/python2.7/site-packages/cohorts/load.pyc in summarize_data_sources(self)
   1202         - provenance_file_summary: summary of provenance file contents (see `?cohorts.Cohort.summarize_provenance`)
   1203         """
-> 1204         provenance_file_summary = self.summarize_provenance()
   1205         dataframe_hash = self.summarize_dataframe()
   1206         results = {

/mnt/ssd0/env/local/lib/python2.7/site-packages/cohorts/load.pyc in summarize_provenance(self)
   1181                 summary_provenance,
   1182                 left_outer_diff = "In %s but not in %s" % (cache, summary_provenance_name),
-> 1183                 right_outer_diff = "In %s but not in %s" % (summary_provenance_name, cache)
   1184             )
   1185         ## compare provenance across cached items

/mnt/ssd0/env/local/lib/python2.7/site-packages/cohorts/load.pyc in compare_provenance(this_provenance, other_provenance, left_outer_diff, right_outer_diff)
   1253     Number of discrepancies (0: None)
   1254     """
-> 1255     this_items = set(this_provenance.items())
   1256     other_items = set(other_provenance.items())
   1257 

AttributeError: 'NoneType' object has no attribute 'items'

to replicate

(in an existing cohort)

  1. delete all contents from an existing cache
  2. run init_cohort()
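
One possible fix, as a minimal sketch against the signature shown in the traceback: guard against a missing provenance file before calling .items().

def compare_provenance(this_provenance, other_provenance,
                       left_outer_diff, right_outer_diff):
    # Sketch: a deleted cache has no provenance file, so treat None as
    # "no discrepancies" (return 0) rather than raising AttributeError
    # on .items().
    if this_provenance is None or other_provenance is None:
        return 0
    this_items = set(this_provenance.items())
    other_items = set(other_provenance.items())
    # ... existing diff logic continues here ...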

strip_column_names warning when running tests

When I run nosetests test locally, I see:

tavi@tavi-machine-clone:~/cohorts$ nosetests test
................../home/tavi/cohorts/cohorts/utils.py:136: UserWarning: Warning: strip_column_names (if run) would introduce duplicate names. Reverting column names to the original.
  warnings.warn(warn_str)
............
----------------------------------------------------------------------

@jburos is this expected?

Upgrade to latest APIs

We're currently frozen on varcode and isovar, and we also have version limits on topiary and mhctools. We need to fix cohorts to work with the latest versions of all of these packages.

Convert benefit labels

For the final publication, these labels would be nicer as Benefit vs. No Benefit rather than True/False.
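
A one-line mapping would do it; a sketch, assuming the boolean lives in a column named benefit:

# Map the boolean column to publication-friendly labels.
df["benefit_label"] = df["benefit"].map({True: "Benefit", False: "No Benefit"})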

Warning on import

Look into this warning:

objc[5253]: Class TKApplication is implemented in both /Users/arahuja/anaconda/envs/py34/lib/libtk8.5.dylib and /System/Library/Frameworks/Tk.framework/Versions/8.5/Tk. One of the two will be used. Which one is undefined.
objc[5253]: Class TKMenu is implemented in both /Users/arahuja/anaconda/envs/py34/lib/libtk8.5.dylib and /System/Library/Frameworks/Tk.framework/Versions/8.5/Tk. One of the two will be used. Which one is undefined.
objc[5253]: Class TKContentView is implemented in both /Users/arahuja/anaconda/envs/py34/lib/libtk8.5.dylib and /System/Library/Frameworks/Tk.framework/Versions/8.5/Tk. One of the two will be used. Which one is undefined.
objc[5253]: Class TKWindow is implemented in both /Users/arahuja/anaconda/envs/py34/lib/libtk8.5.dylib and /System/Library/Frameworks/Tk.framework/Versions/8.5/Tk. One of the two will be used. Which one is undefined.

Add paths to provenance

@jburos suggested adding environment variables to provenance to keep track of paths, but it just occurred to me that this repository doesn't know about any environment variables. It does know about bam_path_rna, bam_path_dna, etc.
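
So one option, sketched here with the per-patient attributes mentioned above, is to record those paths in the provenance summary:

def path_provenance(patient):
    # Sketch: the per-patient paths this repository does know about.
    return {
        "bam_path_rna": patient.bam_path_rna,
        "bam_path_dna": patient.bam_path_dna,
    }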

unit tests for summarize_provenance

My naive attempt to fix this (per this old commit) failed because most of the test cases for provenance files do not include a file for each patient. I.e., there are either 3 or 4 patients in the test data, and only 1 of them has a provenance file.

Perhaps this is a realistic scenario? Currently, summarize_data_sources() fails when there isn't a provenance file for each patient. I don't know what the correct behavior should be in this case, so I'm postponing the problem for now.

Caching issues

Note the following scenario:

  • Patient 2 removed from variant cache.
  • Patient 2 underlying VCF file path deleted.
  • load_effects called.
  • Patient 2 doesn't exist; print("Variants did not exist for patient %s" % patient.id)
  • load_effects called again, and nothing printed this time, because the 0 variants were cached in load_effects.

Long story short: errors need to be thrown so that we don't cache an error.
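
A sketch of that direction (the helper and the variants_path attribute are illustrative, not the library's API): fail loudly before anything is written to the cache.

import os

def load_variants_for_patient(patient):
    # Raise instead of returning an empty collection, so a missing VCF
    # never gets cached as "0 variants".
    if not os.path.exists(patient.variants_path):  # illustrative attribute
        raise IOError("Variants did not exist for patient %s" % patient.id)
    # ... load and cache the variants as usual ...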

Allow the plot_boolean column to be a function

By adding something like this, albeit less hacky, to plot_boolean:

from types import FunctionType

if isinstance(boolean_col, FunctionType):
    # Resolve the function to a concrete column via as_dataframe.
    cols, df = self.as_dataframe([on, boolean_col], **kwargs)
    boolean_col = cols[1]
df = filter_not_null(df, boolean_col)

additional data assumes patient id field

When using as_dataframe, the additional_data field from each patient is merged back in, but this assumes that one of the keys is patient_id.

from collections import defaultdict
import pandas as pd

additional_data_all_patients = defaultdict(list)
for patient in self:
    if patient.additional_data is not None:
        for key, value in patient.additional_data.items():
            additional_data_all_patients[key].append(value)

if len(additional_data_all_patients) > 0:
    # Assumes "patient_id" is one of the additional_data keys.
    df = df.merge(pd.DataFrame(additional_data_all_patients),
                  on="patient_id", how="left")

filter_fn default works across variants and effects

@tavinathanson I'm a bit confused about the behavior of the filter_fn arg of Cohort. Should this be a function from FilterableVariant to bool, or from FilterableEffect?

For example, something like:

def qcfilter(filterable_variant):
    somatic_stats = variant_stats_from_variant(
        filterable_variant.variant,
        filterable_variant.variant_metadata)
    ...

works with either, since both have variant and variant_metadata fields, but if a filter ever looks at properties exclusive to one type, load_variants or load_effects would fail.

Possible issue: sample_id as int vs str

Sometimes I see integer sorting and sometimes I see alphanumeric sorting; at the least, we should be sure that IDs always match up correctly (i.e. the right sample ID with the right BAM IDs).

load_* should return consistent types

load_variants returns a dictionary from patient_id to VariantCollection, while load_neoepitopes returns a DataFrame with neoepitopes spanning all patients. This should be more consistent, though I'm not sure what the cleanest solution is. (I'd rather not convert everything to a DataFrame and lose the ability to easily filter VariantCollections natively.)

Error on invalid join_with

Maybe we don't want this, but perhaps a warning should be raised if I do:

cohort.as_dataframe(join_with='non-existent data')
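
A sketch of such a warning, assuming the cohort can enumerate its known loaders (dataframe_loaders is an illustrative name):

import warnings

def check_join_with(cohort, join_with):
    # Warn on join targets the cohort does not know about, rather than
    # silently ignoring them.
    names = [join_with] if isinstance(join_with, str) else list(join_with)
    known = set(cohort.dataframe_loaders)  # illustrative attribute
    for name in names:
        if name not in known:
            warnings.warn("Unknown join_with value: %r" % name)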

Simplify Cohort default

It seems very easy to accidentally forget the following lines at the beginning of functions:

filter_fn = first_not_none_param([filter_fn, cohort.filter_fn], no_filter)
normalized_per_mb = first_not_none_param([normalized_per_mb, cohort.normalized_per_mb], False)

...and thereby not use the Cohort default.
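
One way to make this harder to forget, sketched here on top of the existing first_not_none_param and no_filter helpers, is a decorator that applies the cohort defaults:

import functools

def with_cohort_defaults(func):
    # Fall back to the Cohort-level settings whenever the caller leaves
    # filter_fn / normalized_per_mb unspecified.
    @functools.wraps(func)
    def wrapper(cohort, *args, **kwargs):
        kwargs["filter_fn"] = first_not_none_param(
            [kwargs.get("filter_fn"), cohort.filter_fn], no_filter)
        kwargs["normalized_per_mb"] = first_not_none_param(
            [kwargs.get("normalized_per_mb"), cohort.normalized_per_mb], False)
        return func(cohort, *args, **kwargs)
    return wrapper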

Fix versioneer releasing

After tagging a release 0.1.0, I get the following from Travis:

HTTPError: 400 Client Error: Invalid version, cannot use PEP 440 local versions on PyPI. for url: https://pypi.python.org/pypi

It appears that cohorts.__version__ is not 0.1.0, but I'm not sure why not.

Replace no_filter with None

From @arahuja:

We definitely need a way to not filter, which no_filter solves, but for some reason I'd rather None did that. Does it make sense for filter_fn to have 3 possible args: 'default', a filter fn, or None, where 'default' goes to the cohort? This way the user never needs to know about anything special: either the default happens, or the user specifies a filter function or None. I worry about the discoverability of no_filter; just None seems more natural.
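
A sketch of the three-state resolution described above (hypothetical helper, not the current API):

def resolve_filter_fn(filter_fn, cohort):
    # "default" -> use the cohort-level filter_fn
    # a function -> use it as-is
    # None       -> no filtering at all
    if filter_fn == "default":
        return cohort.filter_fn
    return filter_fn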

Add read-only mode

e.g. Cohort(read_only=True) would allow use of a shared cache without fear of overwriting anything in it.
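
A minimal sketch of the guard; the constructor and write method here are simplified stand-ins, not the real Cohort signature:

class Cohort(object):
    def __init__(self, cache_dir, read_only=False):
        self.cache_dir = cache_dir
        self.read_only = read_only

    def save_to_cache(self, obj, cache_path):
        # Hypothetical write path: a no-op under read_only, so a shared
        # cache can never be modified accidentally.
        if self.read_only:
            return
        # ... actual write logic ...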

travis builds failing with isovar==0.0.6

Travis builds on the summarize_provenance branch continue to fail, even after freezing isovar to v0.0.6.

E.g. in build 329:

python2.7:

Collecting scikit-bio>=0.4.2 (from isovar==0.0.6->-r requirements.txt (line 14))
  Downloading scikit-bio-0.5.0.zip (8.4MB)
    100% |████████████████████████████████| 8.4MB 159kB/s 
    Complete output from command python setup.py egg_info:
    scikit-bio can only be used with Python 3. You are currently running Python 2.

python3.4:

************* Module cohorts.load
E:1067,17: No value for argument 'join_with' in method call (no-value-for-parameter)
E:1067,17: No value for argument 'join_how' in method call (no-value-for-parameter)

Not sure whether either of these is related to the changes in this branch. Posting here for assistance.

Update isovar usage

@iskandr is changing a few things around. Also, his comment from before:

My only suggestion would be to maybe increase the minimum MAPQ to 1. Also, if you're ultimately using the counts of neoepitopes with >= 3 spanning reads, then this will be very sensitive to the degree of exonic coverage. Maybe normalize by the number of reads mapping to exons?

Is verify_cache still needed?

Hitting a bug with verify_cache:

    def verify_cache(self, cache_names):
        bad_caches = []
        for cache_name in cache_names.values():
            cache_dir = path.join(self.cache_dir, cache_name)
            if path.exists(cache_dir):
                cache_subdirs = set(listdir(cache_dir))
                cache_int_subdirs = set([int(name) for name in cache_subdirs])
                if len(cache_subdirs) != len(cache_int_subdirs):
                    bad_caches.append(cache_name)

        if len(bad_caches) > 0:
            raise ValueError("Caches %s have duplicate int/str directories" %
                             str(bad_caches))
/demeter/users/ahujaa01/src/hammerlab/cohorts/cohorts/load.pyc in verify_cache(self, cache_names)
    254             if path.exists(cache_dir):
    255                 cache_subdirs = set(listdir(cache_dir))
--> 256                 cache_int_subdirs = set([int(name) for name in cache_subdirs])
    257                 if len(cache_subdirs) != len(cache_int_subdirs):
    258                     bad_caches.append(cache_name)

ValueError: invalid literal for int() with base 10: 'TCGA-55-7907'

Is this still needed? It seems to assume that the sample IDs are ints.
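
If it is still needed, a sketch of a version tolerant of string sample IDs (same duplicate check, but only numeric names are normalized):

from os import listdir, path

def verify_cache(self, cache_names):
    bad_caches = []
    for cache_name in cache_names.values():
        cache_dir = path.join(self.cache_dir, cache_name)
        if path.exists(cache_dir):
            cache_subdirs = set(listdir(cache_dir))
            # Normalize numeric names ("007" -> "7") to catch int/str
            # duplicates; leave names like "TCGA-55-7907" untouched.
            normalized = set(
                str(int(name)) if name.isdigit() else name
                for name in cache_subdirs)
            if len(cache_subdirs) != len(normalized):
                bad_caches.append(cache_name)
    if len(bad_caches) > 0:
        raise ValueError("Caches %s have duplicate int/str directories" %
                         str(bad_caches))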

Support NA PFS or OS

Maybe this is out of scope for this project, but for TCGA data it would be useful to support PFS or OS being set to NaN. Right now, this raises an error because of the PFS <= OS assertion.
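
A sketch of relaxing the check, assuming it lives wherever PFS and OS are validated:

import pandas as pd

# Only enforce the PFS <= OS invariant when both values are present;
# NaN (as with some TCGA patients) skips the check.
if not (pd.isnull(pfs) or pd.isnull(os)):
    assert pfs <= os, "PFS (%s) must be <= OS (%s)" % (pfs, os)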

builds against latest versions of isovar fails

Currently, builds against the latest versions of isovar fail. See, for example, build #327.

The error message says:

ImportError: No module named protein_sequence

@tavinathanson mentions this is because the latest isovar API has changed.

For now, the requirements.txt file has been frozen at isovar==0.0.6.

Fix isovar git dependency

I'm not sure what the right way is to specify git packages in setup.py, but right now pip install . or python setup.py install gives me the following message:

Collecting isovar>=0.0.2 (from cohorts==0.1.0+6.gd3244aa)
  Could not find a version that satisfies the requirement isovar>=0.0.2 (from cohorts==0.1.0+6.gd3244aa) (from versions: )
No matching distribution found for isovar>=0.0.2 (from cohorts==0.1.0+6.gd3244aa)

I can work around it by running the following first:

pip install git+git://github.com/hammerlab/isovar
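
Another workaround that pip understands is to put the git source in requirements.txt directly (a sketch; pip's handling of VCS dependencies in setup.py has changed across versions):

git+https://github.com/hammerlab/isovar.git#egg=isovar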

Default filter_fn back to None

@arahuja based on our conversation yesterday, seems like filter_fn should default back to None for two reasons:

  • It's a pretty cohort-specific setting.
  • None is looking appropriate for our cohort.

Agreed?

Cohort summaries are a bit verbose

For example:

{'dataframe_hash': -312166828261663650,
 'provenance_file_summary': {'cached-effects': {u'cohorts': u'0+untagged.260.g71e0082',
                                                u'isovar': u'0.0.6',
                                                u'mhctools': u'0.2.3',
                                                u'numpy': u'1.11.0',
                                                u'pandas': u'0.18.1',
                                                u'pyensembl': u'0.9.3',
                                                u'scipy': u'0.17.0',
                                                u'topiary': u'0.0.21',
                                                u'varcode': u'0.4.14+6.g4bb441f'},
                             'cached-expressed-neoantigens': {u'cohorts': u'0+untagged.188.g6260531.dirty',
                                                              u'isovar': u'0.0.6',
                                                              u'mhctools': u'0.2.3',
                                                              u'numpy': u'1.11.0',
                                                              u'pandas': u'0.18.1',
                                                              u'pyensembl': u'0.9.1',
                                                              u'scipy': u'0.17.0',
                                                              u'topiary': u'0.0.21',
                                                              u'varcode': u'0.4.14'},
                             'cached-isovar-output': {u'cohorts': u'0+untagged.188.g6260531.dirty',
                                                      u'isovar': u'0.0.6',
                                                      u'mhctools': u'0.2.3',
                                                      u'numpy': u'1.11.0',
                                                      u'pandas': u'0.18.1',
                                                      u'pyensembl': u'0.9.1',
                                                      u'scipy': u'0.17.0',
                                                      u'topiary': u'0.0.21',
                                                      u'varcode': u'0.4.14'},
...

Maybe this could, instead, be something like:

{'dataframe_hash': -312166828261663650,
 'provenance_file_summary': {'isovar': '0.0.6',
                             'numpy': {'cached-isovar-output': '1.11.0',
                                       'cached-neoantigens': '1.12.0'},  # when they disagree
...
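
A sketch of the proposed compaction (hypothetical helper): report a single value per package when all caches agree, and per-cache detail only on disagreement.

def compact_provenance(provenance_by_cache):
    # provenance_by_cache: {cache_name: {package: version}}
    compact = {}
    packages = set(
        pkg for versions in provenance_by_cache.values() for pkg in versions)
    for pkg in packages:
        versions = {cache: prov[pkg]
                    for cache, prov in provenance_by_cache.items()
                    if pkg in prov}
        unique = set(versions.values())
        # A single value when every cache agrees, per-cache detail otherwise.
        compact[pkg] = unique.pop() if len(unique) == 1 else versions
    return compact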

Allow MB to be specified manually without a pageant dir

@tavinathanson:

i think this is where you'll need to change something for lung, somehow making this configurable to not load the ensembl_coverage loader

patient_to_mb = dict(cohort.as_dataframe(join_with="ensembl_coverage")[["patient_id", "MB"]].to_dict("split")["data"])

maybe e.g. if 'MB' column exists use that else join
that'll call load_ensembl_whatever which fails if there's no pageant dir

@arahuja:

won’t it first check that the key exists?
oh ok so the key does exist
and then it’ll call load
and that will fail with a path
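
A sketch of the fallback being discussed (names are illustrative, following the snippet above):

def patient_to_mb(cohort):
    df = cohort.as_dataframe()
    if "MB" not in df.columns:
        # Only hit the pageant-backed ensembl_coverage loader when MB
        # was not supplied manually; that loader is what fails without
        # a pageant dir.
        df = cohort.as_dataframe(join_with="ensembl_coverage")
    return dict(zip(df["patient_id"], df["MB"]))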

Travis python2.7 build failing

Collecting scikit-bio>=0.4.2 (from isovar==0.0.6->-r requirements.txt (line 14))
  Downloading scikit-bio-0.5.0.zip (8.4MB)
    100% |████████████████████████████████| 8.4MB 124kB/s 
    Complete output from command python setup.py egg_info:
    scikit-bio can only be used with Python 3. You are currently running Python 2.

This seems to come from the newest isovar.

  1. We can (or isovar can) pin this to 0.4.21.
  2. What do you think about pinning isovar as well, since this build pulled a new version after yesterday's version bump?
