bioinformatics's Issues
Build Schema mapping of GA4GH JSON to userData DB
Complete 23andMe Uploader Private Beta Tasks
Create S3 buckets with appropriate permissions and policies
We need S3 buckets: one for receiving raw 23andMe data, and one for the generated Precisely GA4GH data.
- Think about a rational naming scheme
- One for 23andMe and one for preciselyGA4GH
- E.g., dev-01-raw23andMe, dev-01-preciselyGA4GH
Obtain three or more 23andMe Files
Add code test of real-world Ancestry file
- Find publicly-available Ancestry file, perhaps looking at the Personal Genomes Project
- Add makefile target for downloading the file using wget
- Add makefile target called 'test'
The makefile target called 'test' will depend on the makefile target for the Ancestry file. You should download the Ancestry file to a directory called 'test'. You will need to exclude the 'test' directory using '.gitignore'. The recipe for the 'test' makefile target should execute the MAMBA tests.
More on being productive with Make:
Short tutorial on creating reproducible bioinformatic pipelines using Make:
https://bsmith89.github.io/make-bml/
Human-readable intro book on Make:
https://www.oreilly.com/openbook/make3/book/index.csp
GNU Make Manual (i.e., the kitchen sink):
https://www.gnu.org/software/make/manual/
Create "push Docker image to staging" script
Write the script in Python, and expose it via a top-level Makefile target.
Need to create extended SVN pattern matcher
Given a pattern encoded by a content author, a pattern matching engine needs to figure out whether the genotype of a patient's sample matches or not.
Should we add gene labels to intronic variants?
Our current BED file used in the convert23andme Python package is based on exons, not the full gene sequence. Do we want to label variants in introns as being associated with genes?
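One way to frame the decision: with exon intervals we only label a variant when it falls inside an exon; with full gene spans, anything in the gene body gets the label. A minimal sketch (not the actual convert23andme code; the interval coordinates below are made up for illustration):

```python
# Hypothetical intervals for illustration only -- not real annotation data.
GENE_SPANS = {"MTHFR": ("chr1", 11845787, 11866160)}
EXONS = {"MTHFR": [("chr1", 11856378, 11856408)]}

def gene_label(chrom, pos, include_introns=False):
    """Return the gene covering (chrom, pos), or None.

    With include_introns=False, only positions inside an exon interval
    (the current BED-based behavior) get a label; with True, any position
    inside the gene span qualifies.
    """
    for gene, (g_chrom, start, end) in GENE_SPANS.items():
        if g_chrom != chrom or not (start <= pos <= end):
            continue
        if include_introns:
            return gene  # anywhere in the gene body counts
        for e_chrom, e_start, e_end in EXONS.get(gene, []):
            if e_chrom == chrom and e_start <= pos <= e_end:
                return gene
    return None
```

The flag makes the trade-off explicit: intronic variants are either silently unlabeled or attributed to the enclosing gene.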
Orchestrate running of Docker repo on AWS ECS
Formalize syntax for describing extended SVN for content authors
Two options that @aneilbaboo and I have discussed:
- Use logical operators within SVN strings:
"[A|B];[A|C]"
- Limit any SVN to a single locus (or simple variant), and connect them by widget XML syntax:
<gene connect_mode="genotyping_array">
<variant svn="[A];[A]"/>
<variant svn="B"/>
<variant svn="C"/>
</gene>
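For the first option, a minimal sketch of how a matching engine might interpret such strings. Two semantics are assumed here, not settled syntax: ';' separates the two alleles of an unordered diploid genotype, and '[X|Y]' lists acceptable alleles for one position:

```python
import itertools
import re

def parse_allele_pattern(p):
    """'[A|B]' -> {'A', 'B'}; a bare 'B' -> {'B'} (assumed grammar)."""
    m = re.fullmatch(r"\[([^\]]+)\]", p)
    return set(m.group(1).split("|")) if m else {p}

def matches(svn, genotype):
    """True if the patient's unordered allele pair satisfies the
    ';'-separated allele patterns of the SVN string."""
    patterns = [parse_allele_pattern(p) for p in svn.split(";")]
    if len(patterns) != len(genotype):
        return False
    # Genotypes are unordered, so try every assignment of alleles to patterns.
    return any(
        all(allele in pat for allele, pat in zip(perm, patterns))
        for perm in itertools.permutations(genotype)
    )
```

Whichever syntax wins, the engine reduces to the same question: does some assignment of the patient's alleles satisfy every per-locus pattern?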
Modify Dynamo uploader functionality to use S3-based upload instead
Either Constantine or Aneil.
The bioinformatics repo currently uses the boto API to upload, but it is very slow.
Switch over to using S3 to get data into the VariantCall table
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/EMRforDynamoDB.CopyingData.S3.html#EMRforDynamoDB.CopyingData.S3.UserSpecifiedFormat
It looks like this requires EMR. We create a Hive external table and issue a HiveQL INSERT OVERWRITE statement to import data from an S3 file.
I think CSV should be OK, but maybe we should use the Hive default format (0x01 instead of comma).
I'm ok with implementing this using the Admin console for now, but ideally, these resources would be created and managed using serverless.
Note: only Hive/TSV/CSV format is supported out of the box. There's an unsupported example here which shows how to load JSON.
(For the record, I vote for Hive)
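If we go with the Hive default format, emitting the Ctrl-A (0x01) field separator is trivial, and it sidesteps CSV quoting/escaping entirely. A sketch (the field names are illustrative, not the actual VariantCall schema):

```python
# Hive's default text SerDe separates fields with Ctrl-A (0x01)
# and rows with newlines.
HIVE_FIELD_SEP = "\x01"

def to_hive_line(row, fields):
    """Serialize one record dict as a Hive-default-format text line.

    Missing fields are emitted as empty strings; no quoting is needed
    because 0x01 never appears in our data.
    """
    return HIVE_FIELD_SEP.join(str(row.get(f, "")) for f in fields) + "\n"
```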
Add ability to handle S3 URIs as input & output
convert23andMe module doesn't support human genome build 36
Some very old 23andMe raw data files reference human genome build 36. Currently the code only supports build 37.
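The build is usually mentioned in the commented header of the raw file, so a first step is just detecting it. A heuristic sketch (header wording varies between 23andMe vintages, so this is an assumption, not a parser for a fixed format):

```python
import re

def detect_build(lines):
    """Scan a 23andMe raw file's leading '#' comment lines for a
    'build NN' mention; return the build number, or None if absent.

    Heuristic: header comments end where the tab-separated data rows begin.
    """
    for line in lines:
        if not line.startswith("#"):
            break
        m = re.search(r"build\s+(\d+)", line, re.IGNORECASE)
        if m:
            return int(m.group(1))
    return None
```

Files reporting build 36 could then be rejected with a clear error (or lifted over) instead of being silently misconverted against the build-37 reference.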
Setup bioinformatics repo Docker developer environment
- Define development flow
- Update README
- Make sure that it reliably installs
Convert file & function comments into PEP docstrings for Ancestry code
Check out the conventions here:
https://www.python.org/dev/peps/pep-0257/
Thanks!
Add a test suite to convert23andMe
Aim for ~80% test coverage
Update repo label controlled vocabulary
Document top-level Makefile targets in README.md
Build Schema mapping of VCF to Genotype userData
Create Gene Information Service (GIS) importer
Add a test suite to convert23andMe
Aim for ~80% test coverage
Document an example of how a content author encodes a gene report
- What does author write?
- What documentation do we need?
- How does author know what to write?
Set up SES to report bioinformatics pipeline errors to user
Failures from the bioinformatics pipeline should be reported to the user with an email:
"Sorry, we're unable to process your file."
Bonus points: mechanism for reporting more detail about the error.
Determine how to perform imputation analysis over 23andMe raw data format
- We will have support from ThermoFisher and AkesoGen for performing imputation over the PMRA, but what is the equivalent support and software tooling that will allow us to do the same for raw 23andMe data?
Document the Precisely VCF format
Provide references to the standard VCF format documentation, along with pointing out Precisely-specific extensions and calling out parts that are important for @visheshd & team to notice.
Fix SVN Wildtype examples
Test bootstrapping Docker image from scratch, and document
Use the Dockerfile to generate a working image on your laptop. Then, document the way to do it on the README.md file. Thanks!
Need S3 bucket for storing human genome builds
We will have a Makefile for building new Docker images for running convert23andMe on 23andMe data files.
Instead of having that Makefile pull data from the 1k Genomes project each time, we should cache pre-processed files on AWS to speed things up. It would be great if we could have an S3 bucket for third-party bioinformatic databases / datasets.
Schedule conversation with Affy Bioinformatics Specialist
- How do we determine what coverage we have?
- Do we need to collect other demographic info?
- Does the Axiom software provide imputation support?
- Can the Axiom software be used in a batch-oriented fashion, via a CLI or an API?
- Does the imputation software incorporate hereditary genetics?
- What is the resolving power of a set of SNPs?
- What higher-level features is Affy detecting about the genome?
- How do these relate to GWAS studies?
- How does a geneticist use these features to filter populations with respect to a SNP?
E.g., as a geneticist, I've done a study on British people and found a SNP that correlates with a disease. I can offer that to consumers, but I need to filter for people of the right background.
- What information does a geneticist need to capture when describing a SNP?
- What if they did a GWAS study that associates a SNP with a symptom?
How do we translate variant information between human genome releases?
Optimize reference genome file format for adding reference to 3rd party genotype files
convert23andMe and convertAncestry rely on vcftools to convert the 3rd-party flat-file format into a standard VCF. This requires providing a reference human genome file to determine the "wildtype" base for a locus. You can provide a GZIP'ed FASTA, but it also accepts a FASTA file indexed with faidx. Conversion using the GZIP'ed FASTA is already fast, but it might be faster still using the indexed format. This issue is a note for the icebox that we might want to optimize this some day.
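Why the indexed format can win: a samtools faidx (.fai) index records, per sequence, the name, length, byte offset, bases per line, and bytes per line, which lets a reader seek straight to any position instead of streaming through the file. The offset arithmetic is simple enough to sketch:

```python
def fai_byte_offset(fai_line, pos0):
    """Byte offset in the FASTA file of 0-based position pos0, computed
    from one .fai index line with the columns:
    NAME  LENGTH  OFFSET  LINEBASES  LINEWIDTH

    Skip whole lines of LINEWIDTH bytes (which include the newline),
    then move within the final line.
    """
    _name, _length, offset, linebases, linewidth = fai_line.split("\t")
    offset, linebases, linewidth = int(offset), int(linebases), int(linewidth)
    return offset + (pos0 // linebases) * linewidth + pos0 % linebases
```

With 60-base lines wrapped at 61 bytes, position 60 lands exactly one 61-byte line past the sequence start.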
Would content author want to specify quality when writing a report?
E.g.,
Would they want to write content for a particular genotype with a poor read?
Make a Docker Container of the convert23AndMe functionality
Create Releases in repos
Hi @aneilbaboo, could you create the following Releases (or any set of categories that you prefer) for tagging Issues? It would be helpful for saying that some issues will be tackled after Release 1.
- Release 0
- Release 1
- Release >1
Build 23andMe File Processor
Add MAMBA testing package as dependency in setup.py
There are a dozen good Python packaging tutorials available online. Let me know if you get stuck at any point. Thanks!
Write a function in the Python package to generate Precisely VCF from 23andMe files
- Test with several 23andMe files
- Assume raw files are in s3://{env}-precisely-genetics-raw-23andme
- Output files go in s3://{env}-precisely-genetics-vcf
  - Where env = dev for the time being, but will be stage on staging and prod on prod
- Files will be named {opaque_id}_{23andMe-fileHash}.txt; copy the filename to the output bucket (e.g., to s3://{env}-precisely-genetics-vcf/{opaque_id}_{23andme-filehash}.json)
- Write errors to a file s3://{env}-precisely-genetics-vcf/{opaque_id}_{23andme-filehash}.error as a JSON object with a message field describing the error, and possibly other values.
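The naming scheme above can be captured in a few helpers so it lives in one place (hypothetical function names, not existing code):

```python
def raw_key(env, opaque_id, file_hash):
    """S3 URI of the uploaded raw 23andMe file."""
    return f"s3://{env}-precisely-genetics-raw-23andme/{opaque_id}_{file_hash}.txt"

def vcf_key(env, opaque_id, file_hash):
    """S3 URI of the converted output, same basename as the input."""
    return f"s3://{env}-precisely-genetics-vcf/{opaque_id}_{file_hash}.json"

def error_key(env, opaque_id, file_hash):
    """S3 URI of the JSON error report for a failed conversion."""
    return f"s3://{env}-precisely-genetics-vcf/{opaque_id}_{file_hash}.error"
```

Centralizing these means the staging/prod env prefixes change in exactly one spot.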
bcftools does not support converting 23andMe indel variants into {B|V}CF format
Current tools from Heng Li & Co. only deal with SNVs. We might want to write some extra code that salvages these variants from the 23andMe data files and appends them to the GA4GH JSON files.
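23andMe raw files code indel genotypes with the letters I (insertion) and D (deletion), e.g. II, DD, DI, so the salvage step is mostly a filter over the tab-separated rows. A sketch of that filter (an assumption about the salvage approach, not written code):

```python
def salvage_indels(lines):
    """Yield (rsid, chrom, pos, genotype) for 23andMe rows whose genotype
    is composed entirely of 'I'/'D' alleles -- the indel calls that an
    SNV-only bcftools conversion drops on the floor.
    """
    for line in lines:
        if line.startswith("#"):
            continue  # skip header comments
        rsid, chrom, pos, genotype = line.rstrip("\n").split("\t")
        if genotype and set(genotype) <= {"I", "D"}:
            yield rsid, chrom, pos, genotype
```

The yielded rows could then be translated separately and appended to the GA4GH JSON output.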
Create Precisely VCF sample files for Vishesh
What is ME/CFS variant coverage in 23andMe, in Affy PM?
Add test of Ancestry code by shunting output to 23andMe conversion function
Create a test function in MAMBA that uses your code to generate a 23andMe text file, and then feeds that file as input to the 23andMe -> VCF function that I've written.
This will require you to use the Docker framework to auto-install the dependencies that are external to Python.
Thank you!
Create Test Script which processes 10 different 23andMe files of different vintages
Compare coverage for genes and SNPs in Elizabeth's research on both ThermoFisher PMRA (Akesogen DTC) and Illumina GSA (23andMe v5)
@aneilbaboo commented on Tue Mar 06 2018
@taltman commented on Thu Mar 22 2018
@aneilbaboo, I think this is more of a bioinformatic task, now that I've created dependencies that separate out the curation work from the data processing work.
@aneilbaboo commented on Thu Mar 22 2018
@taltman - Please feel free to move this over to bioinformatics if you like.
There's a [Move Issue] button here.
Re-spec the quality metric in our data format
- Should not be design-focused
- Should communicate clearly the different states of the actual data
  - What are all of the states?
- Remove unnecessary complexity in
  - 23andMe
  - Affy
  - Illumina
- Preserve enough information to know things like
  - Poor read
  - No reading
  - Quality level
  - Imputation
  - Boosted
  - Laboratory-quality read
- Look at GA4GH spec
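To make the "what are all of the states?" question concrete, here is a candidate controlled vocabulary as an enum (a starting point for discussion, not a final spec):

```python
from enum import Enum

class CallQuality(Enum):
    """Candidate states for the re-specced quality metric.

    Drawn from the list above; names and granularity are up for debate.
    """
    NO_READ = "no_read"                     # locus not assayed / no call
    POOR_READ = "poor_read"                 # assayed, low confidence
    IMPUTED = "imputed"                     # inferred, not directly read
    BOOSTED = "boosted"                     # statistically boosted call
    LAB_QUALITY = "laboratory_quality"      # high-confidence direct read
```

An enum keeps the set closed, so adding a platform-specific state is a deliberate schema change rather than an ad-hoc string.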
Test translation of specific genotype cases for 23andMe converter
Create one or more input 23andMe files which encode several specific situations.
Test corresponding rows of the imputed VCF:
Expected cases:
- homozygous mutation
- heterozygous mutation
- Y chromosome mutation
- validate that the reference base is correct for each mutation
- "Reference base of NC00001.1:g.123123 should be A"
- "Reference base of NC00003.1:g.55555 should be T"
- etc.
Unexpected cases:
- invalid rsId
- invalid chromosome number
- invalid chromosome location (e.g., negative number, too large number, string)
- missing genotype entry
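The unexpected cases above amount to per-row validation, which can be sketched as one function returning the list of problems found. The rsId prefixes and the 250 Mbp position ceiling are assumptions for illustration (chromosome 1 is roughly 249 Mbp):

```python
VALID_CHROMS = {str(i) for i in range(1, 23)} | {"X", "Y", "MT"}

def validate_row(rsid, chrom, pos, genotype):
    """Return a list of problem strings for one 23andMe row (empty = OK)."""
    problems = []
    # 23andMe uses 'rs...' ids plus internal 'i...' ids (assumed check).
    if not rsid.startswith(("rs", "i")):
        problems.append("invalid rsId")
    if chrom not in VALID_CHROMS:
        problems.append("invalid chromosome number")
    # isdigit() rejects negative numbers and strings in one check.
    if not pos.isdigit() or int(pos) <= 0 or int(pos) > 250_000_000:
        problems.append("invalid chromosome location")
    if not genotype:
        problems.append("missing genotype entry")
    return problems
```

Each unexpected-case test then asserts that the matching problem string is reported for a deliberately broken row.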
Create front-end script to dispatch on input file types
Should handle:
- Raw VCF
- 23andMe
- Ancestry
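Dispatch can be done by sniffing the first few lines. The header markers below are typical of real files (VCF's ##fileformat line is standard; the 23andMe and AncestryDNA comment strings vary by vintage, so treat them as assumptions):

```python
def sniff_input_type(first_lines):
    """Guess the input file type from its leading lines."""
    for line in first_lines:
        if line.startswith("##fileformat=VCF"):
            return "vcf"
        if "23andMe" in line:
            return "23andme"
        if "AncestryDNA" in line or "Ancestry.com" in line:
            return "ancestry"
    return "unknown"
```

The front-end script would sniff, then hand off to convert23andMe, convertAncestry, or the VCF path accordingly; "unknown" triggers the error-reporting flow.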
Review DynamoDB table model in web repo
Make sure all of the fields make sense for VCF, and think about indexing.
New variantCall table definition https://github.com/precisely/web/blob/dev/app-backend/src/services/variant-call/models.ts
Create python function for imputation & deliver working docker image
Will need to stage data files on S3 to prevent docker image from becoming too large.
What is the difference between Affy's DTC and Pharmacogenetics chip?
Compare disease areas - what broad disease areas, how many markers in each?
Create seed genetics data
@aneilbaboo commented on Wed Feb 07 2018
We’re going to try to stand up a report with sample user data for the current sprint (by Wednesday 21st).
We’ll need files representing a tiny subset of data that the genetics service will produce, for a handful of mock users. The goal is to provide data to exercise the mechanisms for report generation.
- Example of one User genotype (with 2 genes)
- User with particular MTHFR and GRIK3 phenotypes
- Optional states for relevant fields:
- variant_types (e.g., wt, C677T, A1298C, ____?) for MTHFR
- quality types
@taltman commented on Tue Feb 13 2018
@aneilbaboo Almost all of the data from Matty's mock-up report for MTHFR has been distilled into a set of YAML and Markdown files in the repo. There's some fine-tuning that can happen in terms of how the data is organized, but it should be enough to start having those discussions and getting code to parse them. I will start documenting the format so that Vishesh & Matty can look them over. Will they be able to access the Wiki for this repo?
@taltman commented on Wed Feb 14 2018
Ah, I misunderstood. Here is the mock data in GA4GH JSON format:
https://github.com/precisely/gene-panel-curation/tree/master/mock-user-data
@aneilbaboo commented on Wed Feb 14 2018
These are the input files to the (Genetics) Analytics Service. We need the output of that service: JSON files suitable for loading into the Genetics Service - the yellow highlighted parts here:
See also the Database architecture document: https://docs.google.com/document/d/1E31Oted7_QN7bCbjnJN6k1X6eP9b-rFcI-uV8vjsxbg/edit
@aneilbaboo commented on Wed Feb 14 2018
Specifically, we need to know: what are the SVN variant names for the various states? What are the gene names? How will various situations be represented in the genetics service?
We need JSON that will contain values for the fields in the Genetics model:
[
  {
    "user_data_type_id": "{barcode-id-from-akesogen}",
    "gene": "mthfr",
    "source": "akesogen:genotyping",
    "labAnalysisId": "...", // equivalent to the variantsetId --- identifies a particular reading
    "variant": "....",      // <--- need your help here
    "createdAt": ...,
    "updatedAt": ...
  },
  {
    // ... another gene genotype for the same user
  },
  ... etc
]
@taltman commented on Wed Feb 14 2018
And here is the first iteration of the report generation JSON input:
https://github.com/precisely/gene-panel-curation/blob/master/mock-user-data/report_input/MTHFR_C677T-WT_A1298C-heterozygous.json
It will be completed in the next two days (including the metadata fields described above), following design discussions tomorrow.
@aneilbaboo commented on Thu Feb 15 2018
Initial thoughts:
- Separate out gene annotation from user genetics
- seed data needs to be input to the Genetics Service (not input to a report)
- the genetics service will produce the input to a report
- Do we need a separate format and service for storing gene annotation information?
- Top level structure should represent a user, not a gene
- Seed data should be a file that the GAS outputs
- E.g., GAS takes a 23andMe file as input, outputs a file of the user's variants
- E.g., GAS takes an AkesoGen raw genotype input file, outputs a file of the user's variants
What other info do we need to store?
- For SVN, should we use coding indexes rather than genomic indexes?
- insulates us from shifts in genomic index
- provides a way to generate a meaningful name for the genotype if a nickname (like C677T) isn't available
@taltman ^
@taltman commented on Wed Feb 21 2018
Just updated the report JSON files as per format discussion with @aneilbaboo last week:
https://github.com/precisely/gene-panel-curation/tree/master/mock-user-data/report_input
@aneilbaboo Regarding additional genes: GRIK3 doesn't have any phenotypes that I am aware of. Perhaps at this point the dev team can issue tickets against this repo for specific use cases, conditions, or unit tests that they need, and I can create the corresponding JSON files for them?