anopheles-genomic-surveillance / selection-atlas

Here be dragons.

Home Page: https://anopheles-genomic-surveillance.github.io/selection-atlas/home-page.html

License: MIT License

Python 8.92% Jupyter Notebook 91.08%

selection-atlas's Introduction

selection-atlas

Development docs:

Conda environment management

These instructions assume you have a recent version of mamba installed.

The file requirements.yml has the dependencies required to build the site. To ensure reproducibility we currently also maintain the file environment.yml which contains an export of the solved environment.

To create and activate an environment on your own computer:

mamba env create --force --file environment.yml
mamba activate selection-atlas

To create and activate an environment on datalab-bespin:

mamba env create --force --prefix=${HOME}/envs/selection-atlas --file environment.yml
conda activate ${HOME}/envs/selection-atlas

If you need to add or upgrade a package, edit requirements.yml. Do not edit environment.yml.

To upgrade environment.yml:

mamba env create --force --file requirements.yml
mamba env export -f environment.yml -n selection-atlas-requirements --override-channels --channel conda-forge --channel bioconda
sed -i "s/selection-atlas-requirements/selection-atlas/" environment.yml

Running the workflow

If running on your local system with GCS caching enabled, you'll need to run the build without any parallelisation:

snakemake -c1

If running on Google Cloud and GCS caching is disabled, you can try running with parallelisation, e.g.:

snakemake -c4

selection-atlas's People

Contributors

alimanfoo, sanjaynagi

Forkers

chabbytmd

selection-atlas's Issues

Cohort pages add number of collection sites (locations)

In the preamble text on the cohort pages, insert the number of distinct collection sites (locations). This can be found from distinct values of either the "location" column or the "latitude" and "longitude" columns in the sample metadata.
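
A minimal sketch of the count, assuming the sample metadata is a pandas DataFrame with the "location", "latitude" and "longitude" columns mentioned above. The function name `count_collection_sites` and the example rows are hypothetical.

```python
import pandas as pd


def count_collection_sites(df_samples: pd.DataFrame) -> int:
    """Count distinct collection sites as unique (latitude, longitude) pairs."""
    return len(df_samples[["latitude", "longitude"]].drop_duplicates())


# Hypothetical metadata, for illustration only.
df_samples = pd.DataFrame(
    {
        "location": ["Site A", "Site A", "Site B"],
        "latitude": [5.9, 5.9, 6.1],
        "longitude": [-4.8, -4.8, -4.6],
    }
)

n_sites = count_collection_sites(df_samples)
# Alternative: count distinct location names instead of coordinates.
n_names = df_samples["location"].nunique()
```

The two counts can differ if one named location covers several coordinates (or vice versa), so it may be worth checking which one the preamble text should report.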

Final tidy up for alpha1

Do an editorial pass over all pages, adding or refining some text where needed, removing darts videos, and generally making the site look presentable to demo.

Fix overview plots in chromosome pages

Currently the overview plots showing all the signals on the chromosome pages don't seem to be working.

When we have got them working, do any visual styling and tweaking required so they work OK with the numbers of signals we're getting from the full Ag3.0 build.

Cohort page, suggestion for how to present collection site and date information

Currently on the cohort pages we have a map of collection locations and a bar chart of number of samples by month.

The bar chart is visually a bit odd because it's big but doesn't convey much information. Also we are not showing a breakdown of numbers of samples by location and month.

Suggest to replace the bar chart with a pivot table with collection locations as rows and months as columns, showing the numbers of samples.
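
A rough sketch of the suggested table, assuming the sample metadata is a pandas DataFrame with "location" and "month" columns. The function name and the example rows are hypothetical, and `pd.crosstab` is just one way to build it.

```python
import pandas as pd


def samples_by_location_and_month(df_samples: pd.DataFrame) -> pd.DataFrame:
    """Cross-tabulate sample counts: locations as rows, months as columns."""
    return pd.crosstab(df_samples["location"], df_samples["month"])


# Hypothetical metadata, for illustration only.
df_samples = pd.DataFrame(
    {
        "location": ["Site A", "Site A", "Site B", "Site B", "Site B"],
        "month": [3, 3, 3, 7, 7],
    }
)

table = samples_by_location_and_month(df_samples)
```

Location and month combinations with no samples are filled with 0, so the breakdown by site and month is visible at a glance.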

Dodgy signals

I'm seeing a few dodgy-looking signals, inferred where there is no sign of a peak. E.g.:

image

Initial zoom on home page map

Set the initial zoom on the home page map to zoom out a little, so that all of Africa is visible. Currently some of West Africa gets cut off:

image

"CONTIGS" KeyError in H12 calibration

When trying to build, I'm getting an error from within malariagen_data. I'm not quite sure what's going on.

I am using the new environment, and have tried removing the previous cache and the old build/ folder. The error occurs with both joined contigs (2RL) and unjoined contigs (2R).

CacheMiss                                 Traceback (most recent call last)
File ~/projects/selection-atlas/.snakemake/conda/becd6af5994f9e79c1329307d32adf4a_/lib/python3.10/site-packages/malariagen_data/anopheles.py:5846, in AnophelesDataResource.h12_calibration(self, contig, analysis, sample_query, sample_sets, cohort_size, min_cohort_size, max_cohort_size, window_sizes, random_seed)
   5845 try:
-> 5846     calibration_runs = self.results_cache_get(name=name, params=params)
   5848 except CacheMiss:

File ~/projects/selection-atlas/.snakemake/conda/becd6af5994f9e79c1329307d32adf4a_/lib/python3.10/site-packages/malariagen_data/anopheles.py:853, in AnophelesDataResource.results_cache_get(self, name, params)
    852 if not results_path.exists():
--> 853     raise CacheMiss
    854 results = np.load(results_path)

CacheMiss: 

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
Cell In[13], line 1
----> 1 ag3.plot_h12_calibration(
      2     contig=h12_calibration_contig,
      3     analysis=phasing_analysis,
      4     sample_sets=sample_sets,
      5     sample_query=sample_query,
      6     min_cohort_size=min_cohort_size,
      7     max_cohort_size=max_cohort_size,
      8     window_sizes=window_sizes,
      9 );

File ~/projects/selection-atlas/.snakemake/conda/becd6af5994f9e79c1329307d32adf4a_/lib/python3.10/site-packages/malariagen_data/anopheles.py:5915, in AnophelesDataResource.plot_h12_calibration(self, contig, analysis, sample_query, sample_sets, cohort_size, min_cohort_size, max_cohort_size, window_sizes, random_seed, title, show)
   5889 @doc(
   5890     summary="Plot h12 GWSS calibration data for different window sizes.",
   5891     parameters=dict(
   (...)
   5913 ) -> gplt_params.figure:
   5914     # get H12 values
-> 5915     calibration_runs = self.h12_calibration(
   5916         contig=contig,
   5917         analysis=analysis,
   5918         sample_query=sample_query,
   5919         sample_sets=sample_sets,
   5920         window_sizes=window_sizes,
   5921         cohort_size=cohort_size,
   5922         min_cohort_size=min_cohort_size,
   5923         max_cohort_size=max_cohort_size,
   5924         random_seed=random_seed,
   5925     )
   5927     # compute summaries
   5928     q50 = [np.median(calibration_runs[str(window)]) for window in window_sizes]

File ~/projects/selection-atlas/.snakemake/conda/becd6af5994f9e79c1329307d32adf4a_/lib/python3.10/site-packages/malariagen_data/anopheles.py:5849, in AnophelesDataResource.h12_calibration(self, contig, analysis, sample_query, sample_sets, cohort_size, min_cohort_size, max_cohort_size, window_sizes, random_seed)
   5846     calibration_runs = self.results_cache_get(name=name, params=params)
   5848 except CacheMiss:
-> 5849     calibration_runs = self._h12_calibration(**params)
   5850     self.results_cache_set(name=name, params=params, results=calibration_runs)
   5852 return calibration_runs

File ~/projects/selection-atlas/.snakemake/conda/becd6af5994f9e79c1329307d32adf4a_/lib/python3.10/site-packages/malariagen_data/anopheles.py:5867, in AnophelesDataResource._h12_calibration(self, contig, analysis, sample_query, sample_sets, cohort_size, min_cohort_size, max_cohort_size, window_sizes, random_seed)
   5854 def _h12_calibration(
   5855     self,
   5856     contig,
   (...)
   5865 ):
   5866     # access haplotypes
-> 5867     ds_haps = self.haplotypes(
   5868         region=contig,
   5869         sample_sets=sample_sets,
   5870         sample_query=sample_query,
   5871         analysis=analysis,
   5872         cohort_size=cohort_size,
   5873         min_cohort_size=min_cohort_size,
   5874         max_cohort_size=max_cohort_size,
   5875         random_seed=random_seed,
   5876     )
   5878     gt = allel.GenotypeDaskArray(ds_haps["call_genotype"].data)
   5879     with self._dask_progress(desc="Load haplotypes"):

File ~/projects/selection-atlas/.snakemake/conda/becd6af5994f9e79c1329307d32adf4a_/lib/python3.10/site-packages/malariagen_data/anopheles.py:5717, in AnophelesDataResource.haplotypes(self, region, analysis, sample_sets, sample_query, inline_array, chunks, cohort_size, min_cohort_size, max_cohort_size, random_seed)
   5715 debug("normalise parameters")
   5716 sample_sets = self._prep_sample_sets_param(sample_sets=sample_sets)
-> 5717 resolved_region = self.resolve_region(region)
   5718 del region
   5720 if isinstance(resolved_region, Region):

File ~/projects/selection-atlas/.snakemake/conda/becd6af5994f9e79c1329307d32adf4a_/lib/python3.10/site-packages/malariagen_data/anopheles.py:1736, in AnophelesDataResource.resolve_region(self, region)
   1731 @doc(
   1732     summary="Convert a genome region into a standard data structure.",
   1733     returns="An instance of the `Region` class.",
   1734 )
   1735 def resolve_region(self, region: base_params.region) -> Region:
-> 1736     return resolve_region(self, region)

File ~/projects/selection-atlas/.snakemake/conda/becd6af5994f9e79c1329307d32adf4a_/lib/python3.10/site-packages/malariagen_data/util.py:436, in resolve_region(resource, region)
    433     raise TypeError("The region parameter must be a string or Region object.")
    435 # check if region is a whole contig
--> 436 if region in _valid_contigs(resource):
    437     return Region(region, None, None)
    439 # check if region is a region string providing coordinates

File ~/projects/selection-atlas/.snakemake/conda/becd6af5994f9e79c1329307d32adf4a_/lib/python3.10/site-packages/malariagen_data/util.py:406, in _valid_contigs(resource)
    404 def _valid_contigs(resource):
    405     """Determine which contig identifiers are valid for the given data resource."""
--> 406     valid_contigs = resource.contigs
    407     # allow for optional virtual contigs
    408     valid_contigs += getattr(resource, "virtual_contigs", ())

File ~/projects/selection-atlas/.snakemake/conda/becd6af5994f9e79c1329307d32adf4a_/lib/python3.10/site-packages/malariagen_data/anoph/genome_sequence.py:23, in AnophelesGenomeSequenceData.contigs(self)
     21 @property
     22 def contigs(self) -> Tuple[str, ...]:
---> 23     return tuple(self.config["CONTIGS"])

KeyError: 'CONTIGS'

cc @alimanfoo

Selection alert for Vgsc locus

Create a selection alert page for the Vgsc locus.

As part of this, create any necessary utility functions that will be used on all selection alert pages, e.g., a function to show all selection signals overlapping the locus of interest.
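
One possible shape for such a utility, as a sketch only: it assumes the signals are held in a pandas DataFrame, and the column names ("contig", "span_start", "span_end"), the function name and the coordinates are all hypothetical.

```python
import pandas as pd


def signals_overlapping(
    df_signals: pd.DataFrame, contig: str, start: int, end: int
) -> pd.DataFrame:
    """Return signals on `contig` whose span intersects the closed interval [start, end]."""
    on_contig = df_signals["contig"] == contig
    # Two closed intervals overlap when each starts before the other ends.
    overlaps = (df_signals["span_start"] <= end) & (df_signals["span_end"] >= start)
    return df_signals[on_contig & overlaps]


# Toy data; coordinates are illustrative, not real locus positions.
df_signals = pd.DataFrame(
    {
        "contig": ["2RL", "2RL", "3RL"],
        "span_start": [100, 500, 150],
        "span_end": [200, 600, 250],
    }
)

hits = signals_overlapping(df_signals, contig="2RL", start=180, end=300)
```

Here only the first signal is returned: the second lies outside the locus and the third is on a different contig.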

Cohort page, add overlay of inferred signals to H12 scan plots

In the cohort pages, on the H12 scan plots, overlay the inferred selection signals somehow. E.g., similar to the original selection atlas prototype, could use color shading to indicate focus and span:

image

...perhaps with some stats like delta I in hover text.

Chromosome page, styling tweaks to the plot of signals by chromosome

For the plot of selection signals, suggest a few styling tweaks:

  • Reduce row height, to allow for situations where lots of signals get stacked up at the same location.
  • Add some kind of color legend (we are coloring by taxon)
  • Consider using diamonds instead of rectangles maybe?
  • Currently possible to zoom out beyond end of chromosome...

image

Error in cohort page for CI-LG_Agneby-Tiassa_colu_2012

The CI-LG_Agneby-Tiassa_colu_2012 cohort seems to have a -1 for the month column in the metadata (based on the error at least; I haven't checked it yet). This gives an error when doing pandas datetime stuff on the cohort page.

The error:

cols = pd.MultiIndex.from_tuples(
     13     [("Location", "Name"), ("Location", "Longitude"), ("Location", "Latitude")] + 
---> 14     [("Month", pd.to_datetime(x, format="%m").month_name())

ValueError: time data "-1" doesn't match format "%m", at position 0. You might want to try:
    - passing `format` if your strings have a consistent format;
    - passing `format='ISO8601'` if your strings are all ISO8601 but not necessarily in exactly the same format;
    - passing `format='mixed'`, and the format will be inferred for each element individually. You might want to use `dayfirst` alongside this.
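
One possible workaround, sketched under the assumption that -1 marks a missing month (not yet verified against the metadata): map out-of-range values to a placeholder label before building the column headers. The helper name `month_label` and the "Unknown" label are hypothetical.

```python
import calendar


def month_label(month: int) -> str:
    """Return the month name, or "Unknown" for out-of-range values such as -1."""
    if 1 <= month <= 12:
        # calendar.month_name is English unless the process changes its locale.
        return calendar.month_name[month]
    return "Unknown"


labels = [month_label(m) for m in (-1, 1, 12)]
```

Whether samples with an unknown month should get their own column or be dropped from the table is a separate presentation decision.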

Upgrade malariagen_data to 7.4.0

When it's released, upgrade malariagen_data to 7.4.0 in order to bring in support for joined arm virtual contigs, and IHS and G123.

Error in cohort pages when no signals were found by h12-signal-detection

It seems that the h12-signal-detection notebook saves an empty file with no column names when no signals are found for a given cohort's contig.

This seems to cause an error when running the cohort page notebook for that cohort. For example, in the MZ-I_Morrumbene_gamb_2004_Q1 cohort, we get the following error:

----> 3 df_signals = [
      4     pd.read_csv(here() / "build/h12-signal-detection/" / f"{cohort_id}_{contig}.csv")
      5     for contig in contigs
      6 ]

EmptyDataError: No columns to parse from file
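
A sketch of one way to tolerate the empty files on the reading side, assuming the file really is zero bytes as described. The helper name `read_signals` and the column names are hypothetical, and the real signal files may use different columns.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical column names, for illustration only.
SIGNAL_COLUMNS = ["contig", "span_start", "span_end"]


def read_signals(path: Path) -> pd.DataFrame:
    """Read a signals CSV, returning an empty frame if the file has no header."""
    try:
        return pd.read_csv(path)
    except pd.errors.EmptyDataError:
        return pd.DataFrame(columns=SIGNAL_COLUMNS)


# Demonstrate with a zero-byte file.
with tempfile.TemporaryDirectory() as tmp:
    empty_path = Path(tmp) / "cohort_2RL.csv"
    empty_path.touch()
    df_empty = read_signals(empty_path)
```

The alternative is to fix this at the source, by having the h12-signal-detection notebook always write the header row even when no signals are found, so that downstream `pd.read_csv` calls need no special handling.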

Add favicon

Add a pretty favicon, probably either a mosquito or DNA.

Add workflow rules for running G123 and IHS GWSS

Currently for H12 we have a dedicated notebook and a task in the workflow to run the GWSS for all contigs and cohorts.

However, we don't have equivalent tasks for G123 or IHS.

Later, when building the cohort pages, IHS and G123 scans get run because they are requested when building the plots. This is OK, but it might be nice to have dedicated tasks so we can see when those scans are actually being run during a snakemake build.

Add G123 window size calibration

Currently the site runs G123 GWSS using the window size chosen from H12 calibration. However, the G123 calibration curves generally look a bit different, and indicate that a different window size should be used for G123. Suggest adding a separate G123 window size calibration step to the workflow.

Btw, it's probably safe to assume that if a cohort passes H12 calibration then it will also pass G123 calibration, and therefore filtering of final cohorts may still only need to inspect the results of H12 calibration. Something to consider though, as there could be an edge case where H12 calibration passes but G123 calibration fails, in which case the build would break.
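
If the edge case does turn out to matter, the cohort filter could require both calibrations to pass. A minimal sketch, assuming the set of passing cohort IDs is available for each statistic; the function name and cohort identifiers are hypothetical.

```python
def final_cohorts(h12_passed: set[str], g123_passed: set[str]) -> list[str]:
    """Keep only cohorts that passed both calibrations, in sorted order."""
    return sorted(h12_passed & g123_passed)


# Hypothetical cohort identifiers, for illustration only.
h12_passed = {"cohort_a", "cohort_b"}
g123_passed = {"cohort_a"}

kept = final_cohorts(h12_passed, g123_passed)
# Cohorts that pass H12 but fail G123 are the edge case described above.
dropped = sorted(h12_passed - g123_passed)
```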

Preparatory work in malariagen_data

In order to run GWSS for this project, we'll need some work upstream in the malariagen_data package. Creating this issue to collect ideas for things we'll need in malariagen_data.

  • Ability to run GWSS across whole chromosomes, i.e., 2RL and 3RL. This is particularly useful to see the signals associated with Vgsc, which usually span the join between 2R and 2L. Can be achieved by concatenating data from the two chromosome arms then offsetting coordinates for the left arm. Will need this support at least in all GWSS functions we'll want to run. Raised upstream: malariagen/malariagen-data-python#332
  • Implementation of G123. Raised upstream: malariagen/malariagen-data-python#305
  • Implementation of IHS. Raised upstream: malariagen/malariagen-data-python#310
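
The coordinate offsetting described in the first bullet can be sketched with NumPy as follows. This illustrates the idea only and is not the malariagen_data implementation; the function name and the toy coordinates are hypothetical.

```python
import numpy as np


def join_arm_positions(
    pos_right: np.ndarray, pos_left: np.ndarray, right_arm_length: int
) -> np.ndarray:
    """Concatenate positions from two arms, shifting the left arm by the right arm's length."""
    return np.concatenate([pos_right, pos_left + right_arm_length])


# Toy coordinates, for illustration only; not real contig lengths.
pos_2r = np.array([10, 50, 90])
pos_2l = np.array([5, 20])

pos_2rl = join_arm_positions(pos_2r, pos_2l, right_arm_length=100)
```

Any per-variant arrays (e.g., haplotype calls) would need to be concatenated in the same order so that they stay aligned with the joined positions.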

We probably don't want to use XPEHH or PBS for this project, because of the manual effort required to select suitable comparison populations. However, PBS is coming upstream anyway, via malariagen/malariagen-data-python#295.
