bioinformatics's Issues
Build Schema mapping of GA4GH JSON to userData DB
Complete 23andMe Uploader Private Beta Tasks
Create S3 buckets with appropriate permissions and policies
We need S3 buckets: one for receiving raw 23andMe data, and one for the generated Precisely GA4GH data.
- Think about a rational naming scheme
- One for 23andMe and one for preciselyGA4GH
- E.g., dev-01-raw23andMe, dev-01-preciselyGA4GH
Obtain three or more 23andMe Files
Add code test of real-world Ancestry file
- Find publicly-available Ancestry file, perhaps looking at the Personal Genomes Project
- Add makefile target for downloading the file using wget
- Add makefile target called 'test'
The makefile target called 'test' will depend on the makefile target for the Ancestry file. You should download the Ancestry file to a directory called 'test'. You will need to exclude the 'test' directory using '.gitignore'. The recipe for the 'test' makefile target should execute the MAMBA tests.
More on being productive with Make:
Short tutorial on creating reproducible bioinformatic pipelines using Make:
https://bsmith89.github.io/make-bml/
Human-readable intro book on Make:
https://www.oreilly.com/openbook/make3/book/index.csp
GNU Make Manual (i.e., the kitchen sink):
https://www.gnu.org/software/make/manual/
Create "push Docker image to staging" script
Write the script in Python, and expose it via a top-level Makefile target.
Need to create extended SVN pattern matcher
Given a pattern encoded by a content author, a pattern matching engine needs to figure out whether the genotype of a patient's sample matches or not.
Should we add gene labels to intronic variants?
Our current BED file used in the convert23andme Python package is based on exons, not the full gene sequence. Do we want to label variants in introns as being associated with genes?
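One way to frame the decision: with exon intervals we only label a variant when it falls inside an exon; with full gene spans, anything in the gene body gets the label. A minimal sketch (not the actual convert23andme code; the interval coordinates below are made up for illustration):

```python
# Hypothetical intervals for illustration only -- not real annotation data.
GENE_SPANS = {"MTHFR": ("chr1", 11845787, 11866160)}
EXONS = {"MTHFR": [("chr1", 11856378, 11856408)]}

def gene_label(chrom, pos, include_introns=False):
    """Return the gene covering (chrom, pos), or None.

    With include_introns=False, only positions inside an exon interval
    (the current BED-based behavior) get a label; with True, any position
    inside the gene span qualifies.
    """
    for gene, (g_chrom, start, end) in GENE_SPANS.items():
        if g_chrom != chrom or not (start <= pos <= end):
            continue
        if include_introns:
            return gene  # anywhere in the gene body counts
        for e_chrom, e_start, e_end in EXONS.get(gene, []):
            if e_chrom == chrom and e_start <= pos <= e_end:
                return gene
    return None
```

The flag makes the trade-off explicit: intronic variants are either silently unlabeled or attributed to the enclosing gene.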
Orchestrate running of Docker repo on AWS ECS
Formalize syntax for describing extended SVN for content authors
Two options that @aneilbaboo and I have discussed:
- Use logical operators within SVN strings:
"[A|B];[A|C]"
- Limit any SVN to a single locus (or simple variant), and connect them by widget XML syntax:
<gene connect_mode="genotyping_array">
<variant svn="[A];[A]"/>
<variant svn="B"/>
<variant svn="C"/>
</gene>
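For the first option, a minimal sketch of how a matching engine might interpret such strings. Two semantics are assumed here, not settled syntax: ';' separates the two alleles of an unordered diploid genotype, and '[X|Y]' lists acceptable alleles for one position:

```python
import itertools
import re

def parse_allele_pattern(p):
    """'[A|B]' -> {'A', 'B'}; a bare 'B' -> {'B'} (assumed grammar)."""
    m = re.fullmatch(r"\[([^\]]+)\]", p)
    return set(m.group(1).split("|")) if m else {p}

def matches(svn, genotype):
    """True if the patient's unordered allele pair satisfies the
    ';'-separated allele patterns of the SVN string."""
    patterns = [parse_allele_pattern(p) for p in svn.split(";")]
    if len(patterns) != len(genotype):
        return False
    # Genotypes are unordered, so try every assignment of alleles to patterns.
    return any(
        all(allele in pat for allele, pat in zip(perm, patterns))
        for perm in itertools.permutations(genotype)
    )
```

Whichever syntax wins, the engine reduces to the same question: does some assignment of the patient's alleles satisfy every per-locus pattern?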
Modify Dynamo uploader functionality to use S3-based upload instead
Either Constantine or Aneil.
The bioinformatics repo currently uses the boto API to upload, but it is very slow.
Switch over to using S3 to get data into the VariantCall table
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/EMRforDynamoDB.CopyingData.S3.html#EMRforDynamoDB.CopyingData.S3.UserSpecifiedFormat
It looks like this requires EMR. We create a Hive external table and issue a HiveQL INSERT OVERWRITE statement to import data from an S3 file.
I think CSV should be OK, but maybe we should use the Hive default format (0x01 instead of comma).
I'm ok with implementing this using the Admin console for now, but ideally, these resources would be created and managed using serverless.
Note: only Hive/TSV/CSV format is supported out of the box. There's an unsupported example here which shows how to load JSON.
(For the record, I vote for Hive)
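If we go with the Hive default format, emitting the Ctrl-A (0x01) field separator is trivial, and it sidesteps CSV quoting/escaping entirely. A sketch (the field names are illustrative, not the actual VariantCall schema):

```python
# Hive's default text SerDe separates fields with Ctrl-A (0x01)
# and rows with newlines.
HIVE_FIELD_SEP = "\x01"

def to_hive_line(row, fields):
    """Serialize one record dict as a Hive-default-format text line.

    Missing fields are emitted as empty strings; no quoting is needed
    because 0x01 never appears in our data.
    """
    return HIVE_FIELD_SEP.join(str(row.get(f, "")) for f in fields) + "\n"
```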
Add ability to handle S3 URIs as input & output
convert23andMe module doesn't support human genome build 36
Some very old 23andMe raw data files reference human genome build 36. Currently the code only supports build 37.
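The build is usually mentioned in the commented header of the raw file, so a first step is just detecting it. A heuristic sketch (header wording varies between 23andMe vintages, so this is an assumption, not a parser for a fixed format):

```python
import re

def detect_build(lines):
    """Scan a 23andMe raw file's leading '#' comment lines for a
    'build NN' mention; return the build number, or None if absent.

    Heuristic: header comments end where the tab-separated data rows begin.
    """
    for line in lines:
        if not line.startswith("#"):
            break
        m = re.search(r"build\s+(\d+)", line, re.IGNORECASE)
        if m:
            return int(m.group(1))
    return None
```

Files reporting build 36 could then be rejected with a clear error (or lifted over) instead of being silently misconverted against the build-37 reference.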
Setup bioinformatics repo Docker developer environment
- Define development flow
- Update README
- Make sure that it reliably installs
Convert file & function comments into PEP docstrings for Ancestry code
Check out the conventions here:
https://www.python.org/dev/peps/pep-0257/
Thanks!
Add a test suite to convert23andMe
Aim for ~80% test coverage
Update repo label controlled vocabulary
Document top-level Makefile targets in README.md
Build Schema mapping of VCF to Genotype userData
Create Gene Information Service (GIS) importer
Add a test suite to convert23andMe
Aim for ~80% test coverage
Document an example of how a content author encodes a gene report
- What does author write?
- What documentation do we need?
- How does author know what to write?
Set up SES to report bioinformatics pipeline errors to user
Failures from the bioinformatics pipeline should be reported to the user with an email:
"Sorry, we're unable to process your file."
Bonus points: mechanism for reporting more detail about the error.
Determine how to perform imputation analysis over 23andMe raw data format
- We will have support from ThermoFisher and AkesoGen for performing imputation over the PMRA, but what is the equivalent support and software tooling that will allow us to do the same for raw 23andMe data?
Document the Precisely VCF format
Provide references to the standard VCF format documentation, along with pointing out Precisely-specific extensions and calling out parts that are important for @visheshd & team to notice.
Fix SVN Wildtype examples
Test bootstrapping Docker image from scratch, and document
Use the Dockerfile to generate a working image on your laptop. Then, document the way to do it on the README.md file. Thanks!
Need S3 bucket for storing human genome builds
We will have a Makefile for building new Docker images for running convert23andMe on 23andMe data files.
Instead of having that Makefile pull data from the 1k Genomes project each time, we should cache pre-processed files on AWS to speed things up. It would be great if we could have an S3 bucket for third-party bioinformatic databases / datasets.
Schedule conversation with Affy Bioinformatics Specialist
- How do we determine what coverage we have?
- Do we need to collect other demographic info?
- Does the Axiom software provide imputation support?
- Can the Axiom software be used in a batch-oriented fashion, via a CLI or an API?
- Does the imputation software incorporate hereditary genetics?
- What is the resolving power of a set of SNPs?
- What higher-level features is Affy detecting about the genome?
- How do these relate to GWAS studies?
- How does a geneticist use these features to filter populations with respect to a SNP?
E.g., as a geneticist, I've done a study on British people and found a SNP that correlates with a disease. I can offer that to consumers, but I need to filter for people of the right background.
- What information does a geneticist need to capture when describing a SNP?
- What if they did a GWAS study that associates a SNP with a symptom?
How do we translate variant information between human genome releases?
Optimize reference genome file format for adding reference to 3rd party genotype files
convert23andMe and convertAncestry rely on vcftools to convert the 3rd-party flat-file format into a standard VCF. This requires providing a reference human genome file to determine the "wildtype" base for a locus. You can provide a GZIP'ed FASTA, but it also accepts a FASTA file indexed with faidx. Conversion using the GZIP'ed FASTA is already fast, but it might be faster still using the indexed format. This issue is a note for the icebox that we might want to optimize this some day.
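Why the indexed format can win: a samtools faidx (.fai) index records, per sequence, the name, length, byte offset, bases per line, and bytes per line, which lets a reader seek straight to any position instead of streaming through the file. The offset arithmetic is simple enough to sketch:

```python
def fai_byte_offset(fai_line, pos0):
    """Byte offset in the FASTA file of 0-based position pos0, computed
    from one .fai index line with the columns:
    NAME  LENGTH  OFFSET  LINEBASES  LINEWIDTH

    Skip whole lines of LINEWIDTH bytes (which include the newline),
    then move within the final line.
    """
    _name, _length, offset, linebases, linewidth = fai_line.split("\t")
    offset, linebases, linewidth = int(offset), int(linebases), int(linewidth)
    return offset + (pos0 // linebases) * linewidth + pos0 % linebases
```

With 60-base lines wrapped at 61 bytes, position 60 lands exactly one 61-byte line past the sequence start.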
Would content author want to specify quality when writing a report?
E.g.,
Would they want to write content for a particular genotype with a poor read?
Make a Docker Container of the convert23AndMe functionality
Create Releases in repos
Hi @aneilbaboo, could you create the following Releases (or any set of categories that you prefer) for tagging Issues? It would be helpful for saying that some issues will be tackled after Release 1.
- Release 0
- Release 1
- Release >1
Build 23andMe File Processor
Add MAMBA testing package as dependency in setup.py
There are a dozen good Python packaging tutorials available online. Let me know if you get stuck at any point. Thanks!
Write a function in the Python package to generate Precisely VCF from 23andMe files
- Test with several 23andMe files
- Assume raw files are in s3://{env}-precisely-genetics-raw-23andme
- Output files go in s3://{env}-precisely-genetics-vcf
  - Where env = dev for the time being, but will be stage on staging and prod on prod
- Files will be named {opaque_id}_{23andMe-fileHash}.txt; copy the filename to the output bucket (e.g., to s3://{env}-precisely-genetics-vcf/{opaque_id}_{23andme-filehash}.json)
- Write errors to a file s3://{env}-precisely-genetics-vcf/{opaque_id}_{23andme-filehash}.error as a JSON object with a message field describing the error, and possibly other values.
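The naming scheme above can be captured in a few helpers so it lives in one place (hypothetical function names, not existing code):

```python
def raw_key(env, opaque_id, file_hash):
    """S3 URI of the uploaded raw 23andMe file."""
    return f"s3://{env}-precisely-genetics-raw-23andme/{opaque_id}_{file_hash}.txt"

def vcf_key(env, opaque_id, file_hash):
    """S3 URI of the converted output, same basename as the input."""
    return f"s3://{env}-precisely-genetics-vcf/{opaque_id}_{file_hash}.json"

def error_key(env, opaque_id, file_hash):
    """S3 URI of the JSON error report for a failed conversion."""
    return f"s3://{env}-precisely-genetics-vcf/{opaque_id}_{file_hash}.error"
```

Centralizing these means the staging/prod env prefixes change in exactly one spot.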
bcftools does not support converting 23andMe indel variants into {B|V}CF format
Current tools from Heng Li & Co. only deal with SNVs. We might want to write some extra code that salvages these variants from the 23andMe data files and appends them to the GA4GH JSON files.
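23andMe raw files code indel genotypes with the letters I (insertion) and D (deletion), e.g. II, DD, DI, so the salvage step is mostly a filter over the tab-separated rows. A sketch of that filter (an assumption about the salvage approach, not written code):

```python
def salvage_indels(lines):
    """Yield (rsid, chrom, pos, genotype) for 23andMe rows whose genotype
    is composed entirely of 'I'/'D' alleles -- the indel calls that an
    SNV-only bcftools conversion drops on the floor.
    """
    for line in lines:
        if line.startswith("#"):
            continue  # skip header comments
        rsid, chrom, pos, genotype = line.rstrip("\n").split("\t")
        if genotype and set(genotype) <= {"I", "D"}:
            yield rsid, chrom, pos, genotype
```

The yielded rows could then be translated separately and appended to the GA4GH JSON output.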
Create Precisely VCF sample files for Vishesh
What is ME/CFS variant coverage in 23andMe, in Affy PM?
Add test of Ancestry code by shunting output to 23andMe conversion function
Create a test function in MAMBA that uses your code to generate a 23andMe text file, and then feeds that file as input to the 23andMe -> VCF function that I've written.
This will require you to use the Docker framework to auto-install the dependencies that are external to Python.
Thank you!
Create Test Script which processes 10 different 23andMe files of different vintages
Compare coverage for genes and SNPs in Elizabeth's research on both ThermoFisher PMRA (Akesogen DTC) and Illumina GSA (23andMe v5)
@aneilbaboo commented on Tue Mar 06 2018
@taltman commented on Thu Mar 22 2018
@aneilbaboo, I think this is more of a bioinformatic task, now that I've created dependencies that separate out the curation work from the data processing work.
@aneilbaboo commented on Thu Mar 22 2018
@taltman - Please feel free to move this over to bioinformatics if you like.
There's a [Move Issue] button here.
Re-spec the quality metric in our data format
- Should not be design-focused
- Should communicate clearly the different states of the actual data
  - What are all of the states?
- Remove unnecessary complexity in
  - 23andMe
  - Affy
  - Illumina
- Preserve enough information to know things like
  - Poor read
  - No reading
  - Quality level
  - Imputation
  - Boosted
  - Laboratory-quality read
- Look at GA4GH spec
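To make the "what are all of the states?" question concrete, here is a candidate controlled vocabulary as an enum (a starting point for discussion, not a final spec):

```python
from enum import Enum

class CallQuality(Enum):
    """Candidate states for the re-specced quality metric.

    Drawn from the list above; names and granularity are up for debate.
    """
    NO_READ = "no_read"                     # locus not assayed / no call
    POOR_READ = "poor_read"                 # assayed, low confidence
    IMPUTED = "imputed"                     # inferred, not directly read
    BOOSTED = "boosted"                     # statistically boosted call
    LAB_QUALITY = "laboratory_quality"      # high-confidence direct read
```

An enum keeps the set closed, so adding a platform-specific state is a deliberate schema change rather than an ad-hoc string.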
Test translation of specific genotype cases for 23andMe converter
Create one or more input 23andMe files which encode several specific situations.
Test corresponding rows of the imputed VCF:
Expected cases:
- homozygous mutation
- heterozygous mutation
- Y chromosome mutation
- validate that the reference base is correct for each mutation
- "Reference base of NC00001.1:g.123123 should be A"
- "Reference base of NC00003.1:g.55555 should be T"
- etc.
Unexpected cases:
- invalid rsId
- invalid chromosome number
- invalid chromosome location (e.g., negative number, too large number, string)
- missing genotype entry
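The unexpected cases above amount to per-row validation, which can be sketched as one function returning the list of problems found. The rsId prefixes and the 250 Mbp position ceiling are assumptions for illustration (chromosome 1 is roughly 249 Mbp):

```python
VALID_CHROMS = {str(i) for i in range(1, 23)} | {"X", "Y", "MT"}

def validate_row(rsid, chrom, pos, genotype):
    """Return a list of problem strings for one 23andMe row (empty = OK)."""
    problems = []
    # 23andMe uses 'rs...' ids plus internal 'i...' ids (assumed check).
    if not rsid.startswith(("rs", "i")):
        problems.append("invalid rsId")
    if chrom not in VALID_CHROMS:
        problems.append("invalid chromosome number")
    # isdigit() rejects negative numbers and strings in one check.
    if not pos.isdigit() or int(pos) <= 0 or int(pos) > 250_000_000:
        problems.append("invalid chromosome location")
    if not genotype:
        problems.append("missing genotype entry")
    return problems
```

Each unexpected-case test then asserts that the matching problem string is reported for a deliberately broken row.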
Create front-end script to dispatch on input file types
Should handle:
- Raw VCF
- 23andMe
- Ancestry
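Dispatch can be done by sniffing the first few lines. The header markers below are typical of real files (VCF's ##fileformat line is standard; the 23andMe and AncestryDNA comment strings vary by vintage, so treat them as assumptions):

```python
def sniff_input_type(first_lines):
    """Guess the input file type from its leading lines."""
    for line in first_lines:
        if line.startswith("##fileformat=VCF"):
            return "vcf"
        if "23andMe" in line:
            return "23andme"
        if "AncestryDNA" in line or "Ancestry.com" in line:
            return "ancestry"
    return "unknown"
```

The front-end script would sniff, then hand off to convert23andMe, convertAncestry, or the VCF path accordingly; "unknown" triggers the error-reporting flow.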
Review DynamoDB table model in web repo
Make sure all of the fields make sense for VCF, and think about indexing.
New variantCall table definition https://github.com/precisely/web/blob/dev/app-backend/src/services/variant-call/models.ts
Create python function for imputation & deliver working docker image
Will need to stage data files on S3 to prevent docker image from becoming too large.
What is the difference between Affy's DTC and Pharmacogenetics chip?
Compare disease areas - what broad disease areas, how many markers in each?
Create seed genetics data
@aneilbaboo commented on Wed Feb 07 2018
We’re going to try to stand up a report with sample user data for the current sprint (by Wednesday 21st).
We’ll need files representing a tiny subset of data that the genetics service will produce, for a handful of mock users. The goal is to provide data to exercise the mechanisms for report generation.
- Example of one User genotype (with 2 genes)
- User with particular MTHFR and GRIK3 phenotypes
- Optional states for relevant fields:
- variant_types (e.g., wt, C677T, A1298C, ____?) for MTHFR
- quality types
@taltman commented on Tue Feb 13 2018
@aneilbaboo Almost all of the data from Matty's mock-up report for MTHFR has been distilled into a set of YAML and Markdown files in the repo. There's some fine-tuning that can happen in terms of how the data is organized, but it should be enough to start having those discussions and getting code to parse them. I will start documenting the format so that Vishesh & Matty can look them over. Will they be able to access the Wiki for this repo?
@taltman commented on Wed Feb 14 2018
Ah, I misunderstood. Here is the mock data in GA4GH JSON format:
https://github.com/precisely/gene-panel-curation/tree/master/mock-user-data
@aneilbaboo commented on Wed Feb 14 2018
These are the input files to the (Genetics) Analytics Service. We need the output of that service: JSON files suitable for loading into the Genetics Service - the yellow highlighted parts here:
See also the Database architecture document: https://docs.google.com/document/d/1E31Oted7_QN7bCbjnJN6k1X6eP9b-rFcI-uV8vjsxbg/edit
@aneilbaboo commented on Wed Feb 14 2018
Specifically, we need to know: what are the SVN variant names for the various states? What are the gene names? How will various situations be represented in the genetics service?
We need JSON that will contain values for the fields in the Genetics model:
[
  {
    "user_data_type_id": "{barcode-id-from-akesogen}",
    "gene": "mthfr",
    "source": "akesogen:genotyping",
    "labAnalysisId": "...", // equivalent to the variantsetId --- identifies a particular reading
    "variant": "....",      // <--- need your help here
    "createdAt": ...,
    "updatedAt": ...
  },
  {
    // ... another gene genotype for the same user
  },
  ... etc
]
@taltman commented on Wed Feb 14 2018
And here is the first iteration of the report generation JSON input:
https://github.com/precisely/gene-panel-curation/blob/master/mock-user-data/report_input/MTHFR_C677T-WT_A1298C-heterozygous.json
It will be completed in the next two days (including the metadata fields described above), following design discussions tomorrow.
@aneilbaboo commented on Thu Feb 15 2018
Initial thoughts:
- Separate out gene annotation from user genetics
- seed data needs to be input to the Genetics Service (not input to a report)
- the genetics service will produce the input to a report
- Do we need a separate format and service for storing gene annotation information?
- Top level structure should represent a user, not a gene
- Seed data should be a file that the GAS outputs
- E.g., GAS takes a 23andMe file as input, outputs a file of the user's variants
- E.g., GAS takes an AkesoGen raw genotype input file, outputs a file of the user's variants
What other info do we need to store?
- For SVN, should we use coding indexes rather than genomic indexes?
- insulates us from shifts in genomic index
- provides a way to generate a meaningful name for the genotype if a nickname (like C677T) isn't available
@taltman ^
@taltman commented on Wed Feb 21 2018
Just updated the report JSON files as per format discussion with @aneilbaboo last week:
https://github.com/precisely/gene-panel-curation/tree/master/mock-user-data/report_input
@aneilbaboo Regarding additional genes: GRIK3 doesn't have any phenotypes that I am aware of. Perhaps at this point the dev team can issue tickets against this repo for specific use cases, conditions, or unit tests that they need, and I can create the corresponding JSON files for them?