
bioinformatics's People

Contributors: @aneilbaboo, @gcv

bioinformatics's Issues

Add code test of real-world Ancestry file

  1. Find a publicly-available Ancestry file, perhaps by looking at the Personal Genome Project
  2. Add makefile target for downloading the file using wget
  3. Add makefile target called 'test'

The 'test' target should depend on the Ancestry-file target. Download the Ancestry file to a directory called 'test', and exclude that directory via '.gitignore'. The recipe for the 'test' target should run the MAMBA tests.
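Sketched as a Makefile (the download URL is a placeholder and the mamba invocation is an assumption, not a known project convention):

```make
# Sketch only: the URL is a placeholder; remember to add 'test/' to .gitignore.
ANCESTRY_FILE := test/ancestry_sample.txt

$(ANCESTRY_FILE):
	mkdir -p test
	wget -O $@ "https://example.org/public-ancestry-file.txt"

.PHONY: test
test: $(ANCESTRY_FILE)
	mamba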

More on being productive with Make:

Short tutorial on creating reproducible bioinformatic pipelines using Make:
https://bsmith89.github.io/make-bml/

Human-readable intro book on Make:
https://www.oreilly.com/openbook/make3/book/index.csp

GNU Make Manual (i.e., the kitchen sink):
https://www.gnu.org/software/make/manual/

Formalize syntax for describing extended SVN for content authors

Two options that @aneilbaboo and I have discussed:

  1. Use logical operators within SVN strings:

"[A|B];[A|C]"

  2. Limit any SVN to a single locus (or simple variant), and connect them with widget XML syntax:
<gene connect_mode="genotyping_array">
   <variant svn="[A];[A]"/>
   <variant svn="B"/>
   <variant svn="C"/>
</gene>
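A toy interpreter for the option-1 string syntax, to make the semantics concrete. The grammar assumed here (';' separates the two alleles, '[A|B]' means "A or B") is an assumption drawn from the examples above, not a spec:

```python
# Toy parser for strings like "[A|B];[A|C]" or "B".
# Grammar is an assumption: ';' separates alleles, '|' separates
# alternatives, brackets are optional for a single alternative.
def parse_svn(svn: str):
    """'[A|B];[A|C]' -> [{'A', 'B'}, {'A', 'C'}]"""
    alleles = []
    for part in svn.split(";"):
        part = part.strip()
        if part.startswith("[") and part.endswith("]"):
            part = part[1:-1]
        alleles.append(set(part.split("|")))
    return alleles
```

A real implementation would validate the variant names themselves; this only shows how the logical operators could compose.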

Modify Dynamo uploader functionality to use S3-based upload instead

Either Constantine or Aneil.

The bioinformatics repo currently uses the boto API to upload, but it is very slow.

Switch over to using S3 to get data into the VariantCall table
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/EMRforDynamoDB.CopyingData.S3.html#EMRforDynamoDB.CopyingData.S3.UserSpecifiedFormat

It looks like this requires EMR: we create a Hive external table over the S3 data and issue a HiveQL INSERT OVERWRITE statement to import it into DynamoDB.

I think CSV should be OK, but maybe we should use the Hive default delimiter (0x01 instead of a comma).
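A sketch of the Hive route, assuming a comma-delimited S3 file; every table, column, and bucket name here is hypothetical, and the storage-handler/INSERT OVERWRITE pattern follows the AWS docs linked above:

```sql
-- All names hypothetical; pattern per the EMR-for-DynamoDB documentation.
CREATE EXTERNAL TABLE s3_variant_calls (user_id string, rsid string, genotype string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/variant-calls/';

CREATE EXTERNAL TABLE ddb_variant_calls (user_id string, rsid string, genotype string)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
  "dynamodb.table.name" = "VariantCall",
  "dynamodb.column.mapping" = "user_id:userId,rsid:rsId,genotype:genotype"
);

INSERT OVERWRITE TABLE ddb_variant_calls SELECT * FROM s3_variant_calls;
```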

I'm ok with implementing this using the Admin console for now, but ideally, these resources would be created and managed using serverless.

Note: only Hive/TSV/CSV formats are supported out of the box. There's an unsupported example here which shows how to load JSON.

(For the record, I vote for Hive)

Document the Precisely VCF format

Provide references to the standard VCF format documentation, point out the Precisely-specific extensions, and call out the parts that @visheshd & team need to notice.

Need S3 bucket for storing human genome builds

We will have a Makefile for building new docker images for running convert23andMe on 23andMe data files.

Instead of having that Makefile pull data from the 1000 Genomes Project each time, we should cache pre-processed files on AWS to speed things up. It would be great to have an S3 bucket for third-party bioinformatic databases / datasets.

Schedule conversation with Affy Bioinformatics Specialist

  • How do we determine what coverage we have?
  • Do we need to collect other demographic info?
  • Does the Axiom software provide imputation support?
  • Can the Axiom software be used in a batch-oriented fashion, via a CLI or an API?
  • Does the imputation software incorporate hereditary genetics?
  • What is the resolving power of a set of SNPs?
  • What higher-level features is Affy detecting about the genome?
    • How do these relate to GWAS studies?
    • How does a geneticist use these features to filter populations with respect to a SNP?
      E.g., as a geneticist, I've done a study on British people and found a SNP that correlates with a disease. I can offer that finding to consumers, but I need to filter for people of the right background.
    • What information does a geneticist need to capture when describing a SNP?
      • What if they did a GWAS study that associates a SNP with a symptom?

Optimize reference genome file format for adding reference to 3rd party genotype files

convert23andMe and convertAncestry rely on vcftools to convert the third-party flat-file formats into standard VCF. This requires providing a reference human genome file to determine the "wildtype" base at each locus. You can provide a GZIP'ed FASTA, but vcftools also accepts a FASTA file indexed with faidx. Conversion using the GZIP'ed FASTA is very fast; it might be faster still using the indexed format. This issue is an icebox note that we may want to optimize this someday.

Create Releases in repos

Hi @aneilbaboo, could you create the following Releases (or any set of categories that you prefer) for tagging Issues? It would be helpful to be able to indicate that some issues will be tackled after Release 1.

  • Release 0
  • Release 1
  • Release >1

Write a function in the Python package to generate Precisely VCF from 23andMe files

  • Test with several 23andMe files
  • Assume raw files are in s3://{env}-precisely-genetics-raw-23andme
  • Output files go in s3://{env}-precisely-genetics-vcf
    Where env = dev for the time being, but will be stage on staging, prod on prod
  • Files will be named {opaque_id}_{23andMe-fileHash}.txt; carry the filename over to the output bucket (e.g., to s3://{env}-precisely-genetics-vcf/{opaque_id}_{23andme-filehash}.json)
  • Write errors to a file s3://{env}-precisely-genetics-vcf/{opaque_id}_{23andme-filehash}.error
    as a JSON object with a message field describing the error, and possibly other values.
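The naming scheme above can be sketched as a small helper; the bucket names come from this issue, but the function name and the exact suffix mapping are assumptions:

```python
# Hypothetical helper implementing the key-naming convention above.
# Bucket names are from the issue; everything else is an assumption.
import os

VCF_BUCKET = "{env}-precisely-genetics-vcf"

def output_keys(raw_key: str, env: str = "dev"):
    """Map a raw-bucket key like '{opaque_id}_{hash}.txt' (from the
    raw-23andme bucket) to its output and error keys in the VCF bucket."""
    stem, _ = os.path.splitext(raw_key)
    bucket = VCF_BUCKET.format(env=env)
    return (f"s3://{bucket}/{stem}.json", f"s3://{bucket}/{stem}.error")
```

With env = "stage" or "prod" the same function covers the staging and production buckets mentioned above.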

Add test of Ancestry code by shunting output to 23andMe conversion function

Create a test function in MAMBA that uses your code to generate a 23andMe text file, and then feeds that file as input to the 23andMe → VCF function that I've written.

This will require you to use the Docker framework to auto-install the dependencies that are external to Python.

Thank you!

Compare coverage for genes and SNPs in Elizabeth's research on both ThermoFisher PMRA (Akesogen DTC) and Illumina GSA (23andMe v5)

@aneilbaboo commented on Tue Mar 06 2018

ME/CFS curated genes


@taltman commented on Thu Mar 22 2018

@aneilbaboo, I think this is more of a bioinformatic task, now that I've created dependencies that separate out the curation work from the data processing work.


@aneilbaboo commented on Thu Mar 22 2018

@taltman - Please feel free to move this over to bioinformatics if you like.

There's a [Move Issue] button here.

Respec the quality metric in our data format

  • Should not be design-focused
  • Should communicate clearly the different states of the actual data
  • What are all of the states?
    • Remove unnecessary complexity in:
      • 23andMe
      • Affy
      • Illumina
    • Preserve enough information to know things like:
      • poor read
      • no reading
      • quality level
      • imputation
      • boosted
      • laboratory-quality read
  • Look at the GA4GH spec
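One possible shape for these states as an enum; the state names are drawn from the list above, but the enum itself (and its string values) is an assumption, not the spec:

```python
# Hypothetical enumeration of call-quality states from the list above.
from enum import Enum

class CallQuality(Enum):
    NO_CALL = "no_call"            # no reading at all
    POOR_READ = "poor_read"
    IMPUTED = "imputed"
    BOOSTED = "boosted"
    LAB_VERIFIED = "lab_verified"  # laboratory-quality read
```

A flat enum like this keeps the metric data-focused rather than design-focused, and each vendor format (23andMe, Affy, Illumina) would map into it.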

Test translation of specific genotype cases for 23andMe converter

Create one or more input 23andMe files that encode several specific situations.

Test corresponding rows of the imputed VCF:

Expected cases:

  • homozygous mutation
  • heterozygous mutation
  • Y chromosome mutation
  • validate that the reference base is correct for each mutation
    • "Reference base of NC00001.1:g.123123 should be A"
    • "Reference base of NC00003.1:g.55555 should be T"
    • etc.

Unexpected cases:

  • invalid rsId
  • invalid chromosome number
  • invalid chromosome location (e.g., negative number, too large number, string)
  • missing genotype entry
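The unexpected cases above could be exercised against a validator along these lines; the field layout (rsid, chromosome, position, genotype) matches the 23andMe raw format, but the function and its rules are illustrative assumptions, not the converter's actual code:

```python
# Hypothetical per-row validator covering the "unexpected cases" above.
VALID_CHROMS = {str(n) for n in range(1, 23)} | {"X", "Y", "MT"}
VALID_BASES = set("ACGT-")  # '-' marks a no-call in 23andMe files

def validate_row(rsid, chrom, pos, genotype):
    """Return a list of error strings for one 23andMe data row."""
    errors = []
    if not (rsid.startswith("rs") or rsid.startswith("i")):
        errors.append("invalid rsId")
    if chrom not in VALID_CHROMS:
        errors.append("invalid chromosome")
    if not pos.isdigit() or int(pos) <= 0:
        errors.append("invalid chromosome location")
    if not genotype or not set(genotype) <= VALID_BASES:
        errors.append("missing or invalid genotype")
    return errors
```

Each MAMBA test case would feed one crafted row through the converter and assert the corresponding error (or the expected VCF line) comes out.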

Create seed genetics data

@aneilbaboo commented on Wed Feb 07 2018

We’re going to try to stand up a report with sample user data for the current sprint (by Wednesday 21st).

We’ll need files representing a tiny subset of data that the genetics service will produce, for a handful of mock users. The goal is to provide data to exercise the mechanisms for report generation.

  • Example of one User genotype (with 2 genes)
    • User with a particular MTHFR and GRIK3 phenotypes
  • Optional states for relevant fields:
    • variant_types (e.g., wt, C677T, A1298C, ____?) for MTHFR
    • quality types

@visheshd ^


@taltman commented on Tue Feb 13 2018

@aneilbaboo Almost all of the data from Matty's mock-up report for MTHFR has been distilled into a set of YAML and Markdown files in the repo. There's some fine-tuning that can happen in terms of how the data is organized, but it should be enough to start having those discussions and getting code to parse them. I will start documenting the format so that Vishesh & Matty can look them over. Will they be able to access the Wiki for this repo?


@taltman commented on Wed Feb 14 2018

Ah, I misunderstood. Here is the mock data in GA4GH JSON format:
https://github.com/precisely/gene-panel-curation/tree/master/mock-user-data


@aneilbaboo commented on Wed Feb 14 2018

These are the input files to the (Genetics) Analytics Service. We need the output of that service: JSON files suitable for loading into the Genetics Service - the yellow highlighted parts here:
[screenshot: 2018-02-14, 2:10 PM]

See also the Database architecture document: https://docs.google.com/document/d/1E31Oted7_QN7bCbjnJN6k1X6eP9b-rFcI-uV8vjsxbg/edit


@aneilbaboo commented on Wed Feb 14 2018

Specifically, we need to know: what are the SVN variant names for the various states? What are the gene names? How will various situations be represented in the genetics service?

We need JSON that will contain values for the fields in the Genetics model:

[
  {
    "user_data_type_id": "{barcode-id-from-akesogen}",
    "gene": "mthfr",
    "source": "akesogen:genotyping",
    "labAnalysisId": "...",  // equivalent to the variantsetId; identifies a particular reading
    "variant": "....",       // <--- need your help here
    "createdAt": ...,
    "updatedAt": ...
  },
  {
    // ... another gene genotype for the same user
  },
  ... etc.
]

@taltman commented on Wed Feb 14 2018

And here is the first iteration of the report generation JSON input:
https://github.com/precisely/gene-panel-curation/blob/master/mock-user-data/report_input/MTHFR_C677T-WT_A1298C-heterozygous.json

It will be completed in the next two days (including the metadata fields described above), following design discussions tomorrow.


@aneilbaboo commented on Thu Feb 15 2018

Initial thoughts:

  1. Separate out gene annotation from user genetics
    • seed data needs to be input to the Genetics Service (not input to a report)
    • the genetics service will produce the input to a report
  2. Do we need a separate format and service for storing gene annotation information?
  3. Top level structure should represent a user, not a gene
    • Seed data should be a file that the GAS outputs
    • E.g., GAS takes a 23andMe file as input, outputs a file of the user's variants
    • E.g., GAS takes an Akesogen raw genotype input file, outputs a file of the user's variants
      What other info do we need to store?
  4. For SVN, should we use coding indexes rather than genomic indexes?
    • insulates us from shifts in genomic index
    • provides a way to generate a meaningful name for the genotype if a nickname (like C677T) isn't available

@taltman ^

@taltman commented on Wed Feb 21 2018

Just updated the report JSON files per the format discussion with @aneilbaboo last week:
https://github.com/precisely/gene-panel-curation/tree/master/mock-user-data/report_input

@aneilbaboo Regarding additional genes: GRIK3 doesn't have any phenotypes that I am aware of. Perhaps at this point the dev team can issue tickets against this repo for specific use cases, conditions, or unit tests that they need, and I can create the corresponding JSON files for them?
