dnanexus / ukb_rap

Access and share reviewed code and Jupyter Notebooks for use on the UK Biobank (UKB) Research Analysis Platform (RAP). Includes resources from DNAnexus webinars, online trainings and workshops.

License: MIT License

Languages: Shell 0.78%, Jupyter Notebook 68.89%, WDL 1.51%, R 0.14%, HTML 28.63%, Dockerfile 0.05%
Topics: gwas, ukbiobank, wdl

ukb_rap's Introduction

DNAnexus

DNAnexus Apps and Scripts

applets

  • binning_step0: BioBin Pipeline
  • biobin_pipeline
  • binning_step1: BioBin Pipeline
  • biobin_pipeline
  • binning_step2: BioBin Pipeline
  • biobin_pipeline
  • binning_step3: BioBin Pipeline
  • biobin_pipeline
  • impute2_group_join: Impute2_group_join
  • This app can be used to merge multiple imputed impute2 files
  • plato_biobin: PLATO BioBin Regression Analysis
  • PLATO_BioBin
  • vcf_batch: VCF Batch effect tester
  • vcf_batch

apps

  • association_result_annotation: Annotate GWAS, PheWAS Associations
  • association_result_annotation
  • biobin:
  • This app runs the latest development build of the rare variant binning tool BioBin.
  • generate_phenotype_matrix: Generate Phenotype Matrix
  • generate_phenotype_matrix
  • genotype_case_control: Generate Case/Control by Genotype
  • App provides case and control number by each genotype
  • impute2: imputation
  • This will perform imputation using Impute2
  • impute2_to_plink: Impute2 To PLINK
  • Convert Impute2 file to PLINK files
  • plato_single_variant: PLATO - Single Variant Analysis
  • This app lets you run single-variant association tests against a single phenotype (GWAS) or multiple phenotypes (PheWAS)
  • rl_sleeper_app: sleeper
  • This app provides some useful tools for working with data in DNAnexus. It is designed to be run from the command line with "dx run --ssh RL_Sleeper_App" in the project that contains the data you want to explore (use "dx select" to switch projects as needed).
  • shapeit2: SHAPEIT2
  • This app performs phasing using SHAPEIT2
  • strand_align: Strand Align
  • Strand Align prior to phasing
  • vcf_annotation_formatter:
  • Extracts and reformats VCF annotations (CLINVAR, dbNSFP, SIFT, SNPEff)
  • QC_apps subfolder:
    • drop_marker_sample: Drop Markers and/or Samples (PLINK)
      • drop_marker_sample
    • drop_relateds: Relatedness Filter (IBD)
      • drop_relateds
    • extract_marker_sample: Drop Markers and/or Samples (PLINK)
      • extract_marker_sample
    • maf_filter: Marker MAF Rate Filter (PLINK)
      • maf_filter
    • marker_call_filter: Marker Call Rate Filter (PLINK)
      • marker_call_filter
    • missing_summary: Missingness Summary (PLINK)
      • Returns missingness rate by sample
    • pca: Principal Component Analysis using SMARTPCA
      • pca
    • sample_call_filter: Sample Call Rate Filter (PLINK)
      • sample_call_filter

scripts

  • cat_vcf.py *
  • download_intervals.py *
  • download_part.py *
  • estimate_size.py *
  • interval_pad.py
    • This reads a BED file from standard input, pads the intervals, sorts them, and then outputs intervals that are guaranteed to be non-overlapping
  • update_applet.sh *
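
The description of interval_pad.py above (pad, sort, emit non-overlapping intervals) can be illustrated with a minimal sketch. This is not the repository's script: the function names parse_bed and pad_and_merge are hypothetical, and it assumes that non-overlap is achieved by merging padded intervals that overlap or abut on the same chromosome. Chromosomes are sorted as plain strings here.

```python
def parse_bed(lines):
    """Yield (chrom, start, end) from BED lines, skipping blank or header lines."""
    for line in lines:
        fields = line.split()
        if len(fields) < 3 or line.startswith("#") or fields[0] in ("track", "browser"):
            continue
        yield fields[0], int(fields[1]), int(fields[2])


def pad_and_merge(intervals, pad):
    """Pad each interval by `pad` bases on both sides, sort, and merge overlaps."""
    # Clamp the padded start at 0 so coordinates never go negative.
    padded = sorted((chrom, max(0, start - pad), end + pad)
                    for chrom, start, end in intervals)
    merged = []
    for chrom, start, end in padded:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps or abuts the previous interval: extend it instead of adding a new one.
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(interval) for interval in merged]
```

To mimic the standard-input behaviour described above, one could call pad_and_merge(parse_bed(sys.stdin), pad) and print each tuple as a tab-separated line.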

sequencing

  • bcftools_view:
    • Calls "bcftools view". Still in experimental stages.
  • calc_ibd:
    • Calculates a pairwise IBD estimate from either VCF or PLINK files using PLINK 1.9.
  • call_bqsr: Base Quality Score Recalibration
  • call_genotypes:
    • Obsolete, do not use; use geno_p instead. Calls GATK GenotypeGVCFs.
  • call_hc:
  • call_vqsr:
  • cat_variants: combine_variants
    • Combines non-overlapping VCF files with the same subjects. A reimplementation of GATK CatVariants (GATK CatVariants available upon request)
  • combine_variants: combine_variants
  • gen_ancestry:
    • Determine Ancestry from PCA. Uses an eigenvector file and training dataset listing known ancestries. Runs QDA to determine posterior ancestries for all samples, even those in the training set.
  • gen_related_todrop:
    • Uses a PLINK IBD file to determine the minimal set of samples to drop in order to generate an unrelated sample set. Uses a minimum vertex cut algorithm of the related samples to get
  • geno_p:
  • merge_gvcfs:
  • plink_merge:
    • Merge PLINK bed/bim/fam files using PLINK 1.9
  • select_variants: VCF QC
  • variant_annotator: VCF QC
  • vcf_annotate: Annotate VCF File
    • Use a variety of tools to annotate a sites-only VCF.
  • vcf_concordance: VCF Concordance
  • vcf_gen_lof:
    • Subset a VCF from vcf_annotate based on the given annotations to get a sites-only VCF of loss-of-function variants.
  • vcf_pca:
    • Uses PLINK 1.9 and eigenstrat 6.0 to calculate principal components from VCF or PLINK bed/bim/fam files.
  • vcf_qc:
  • vcf_query:
    • Calls "bcftools query" to extract annotations from the VCF file. Used in the stripping of files for MEGAbase
  • vcf_sitesonly: VCF QC
    • Generates a sites-only file from full VCF files.
  • vcf_slice: Slice VCF File(s)
    • Return a small section of a VCF file (similar to tabix). For large output, many small regions, or subsetting samples, use subset_vcf instead.
  • vcf_summary: VCF Summary Statistics
    • Generate summary statistics for a VCF file (by sample and by variant)
  • vcf_to_plink:
    • Uses PLINK 1.9 to convert VCF files to PLINK bed/bim/fam files

ukb_rap's People

Contributors

ajlee21, anastazie-dnanexus, kmcgurk, laderast, lorenbuhle, oklempir-cf

ukb_rap's Issues

Running a script over a bunch of vcf files

Hi

I am interested in individual-level WGS vcf files in "/mnt/project/Bulk/DRAGEN WGS/Whole genome variant call files (VCFs) (DRAGEN) [500k release]" for a list of participants in eid.txt.
eid.txt

I want to locate these VCF files, so I ran the command below, but I get an error:

usr@LV19Y7325V dnanexus-upload-agent-1.5.33-osx %
 dx ls "/mnt/project/Bulk/DRAGEN WGS/Whole genome variant call files (VCFs) (DRAGEN) [500k release]" VCF/*gz > tempfile.txt ;
zsh: no matches found: VCF/*gz

I also tried this command and got an error:

usr@LV19Y7325V dnanexus-upload-agent-1.5.33-osx %
 for i in `cat eid.txt`; do dx find data --property eid=$i --folder "/mnt/project/Bulk/DRAGEN WGS/Whole genome variant call files (VCFs) (DRAGEN) [500k release]" ; done
cat: eid.txt: No such file or directory

I also tried this:

usr@LV19Y7325V ~ % dx ls './*.gz*'                         
dxpy.utils.resolver.ResolutionError: Unable to resolve "*.gz*" to a data object or folder name in '/Bulk/DRAGEN WGS/Whole genome variant call files (VCFs) (DRAGEN) [500k release]'

Please could somebody help me with this?
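
The three error messages above point to separate causes. zsh expanded the unquoted VCF/*gz locally before dx saw it. eid.txt was not in the directory the loop was run from. And /mnt/project/... is the path under which a project is mounted inside a RAP job, not a platform folder path (the last error shows the folder itself begins at /Bulk). A minimal sketch of building one `dx find data` call per participant follows. It rests on assumptions: that the file names start with the participant ID followed by an underscore, and that --name, --path, --class and --brief behave as in the installed dx-toolkit (check with `dx find data --help`). The helper names are hypothetical.

```python
import shlex

MOUNT_PREFIX = "/mnt/project"  # mount point of the project inside a RAP job


def platform_folder(path):
    """Strip the /mnt/project mount prefix to get a project-relative folder."""
    if path == MOUNT_PREFIX or path.startswith(MOUNT_PREFIX + "/"):
        path = path[len(MOUNT_PREFIX):]
    return path or "/"


def find_vcf_command(folder, eid, pattern="{eid}_*.vcf.gz"):
    """Build a `dx find data` argument list for one participant ID."""
    return [
        "dx", "find", "data",
        "--class", "file",
        "--name", pattern.format(eid=eid),  # ASSUMPTION: names look like <eid>_....vcf.gz
        "--path", platform_folder(folder),
        "--brief",
    ]


def read_eids(path):
    """Read one participant ID per line, ignoring blank lines."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def commands_for(folder, eids):
    """Return shell-safe command strings, one per participant ID."""
    return [shlex.join(find_vcf_command(folder, eid)) for eid in eids]
```

Passing an argument list to subprocess.run avoids local shell globbing entirely. The strings from commands_for are already quoted for pasting into a terminal.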

Running function "main" of "regenie_test_associations" failed because of AppInternalError

Hi, everyone
This is our first time using UKB_RAP.
We're running regenie for a GWAS. Step 1 completed successfully, but we ran into a problem during step 2 (the log file is below).
Has anyone else had the same problem?

Options in effect:
--step 2
--bgen ukb00000_c21_b0_v1.bgen
--phenoFile Hematologic_malignancy_imputation.phe
--bsize 200
--pThresh 0.05
--test additive
--pred ukb_c1-22_GRCh38_full_analysis_set_plus_decoy_hla_merged_pred.list
--gz
--sample ukb00000_c21_b0_v1.sample
--extract /home/dnanexus/gel_imputed_snps_data_qc_pass_HM.snplist
--covarFile Hematologic_malignancy_imputation.phe
--bt
--spa
--phenoColList Hematologic_malignancy_cc
--covarColList sex,age,ethnic_background,alcohol_status,hypertension,ever_smoked,SBP,BMI,NLR,CRP
--ref-first
--htp ukb00000_c21_b0_v1
--out ukb00000_c21_b0_v1
Association testing mode with fast multithreading using OpenMP

 * bgen : [ukb00000_c21_b0_v1.bgen]
   -summary : bgen file (v1.2 layout, zstd compressed) with 488315 anonymous samples and 4376829 variants with 8-bit encoding.
   -index bgi file [ukb00000_c21_b0_v1.bgen.bgi]
ERROR: BGenError
Traceback (most recent call last):
  File "/home/dnanexus/job-GfKBQ7QJ3ZvYYY5Y9z3QffzY", line 39, in <module>
    exec_code(job_code, _code_filename)
  File "/home/dnanexus/job-GfKBQ7QJ3ZvYYY5Y9z3QffzY", line 25, in exec_code
    exec(compiled_code, globals())
  File "/home/dnanexus/job-GfKBQ7QJ3ZvYYY5Y9z3QffzY.code.py", line 208, in <module>
    dxpy.run()
  File "/usr/local/lib/python3.8/dist-packages/dxpy/utils/exec_utils.py", line 150, in run
    result = ENTRY_POINT_TABLE[job['function']](**job['input'])
  File "/home/dnanexus/job-GfKBQ7QJ3ZvYYY5Y9z3QffzY.code.py", line 103, in main
    docker.run(cmd)
  File "/usr/lib/python3/dist-packages/sugar/dockertools.py", line 347, in run
    self._handle_last_process_cpe(cpe)
  File "/usr/lib/python3/dist-packages/sugar/dockertools.py", line 368, in _handle_last_process_cpe
    raise cpe
  File "/usr/lib/python3/dist-packages/sugar/dockertools.py", line 337, in run
    self._last_process = processing.run(
  File "/usr/lib/python3/dist-packages/sugar/processing/__init__.py", line 122, in run
    processes.block()
  File "/usr/lib/python3/dist-packages/sugar/processing/core.py", line 462, in block
    self.raise_if_error()
  File "/usr/lib/python3/dist-packages/sugar/processing/core.py", line 576, in raise_if_error
    raise CalledProcessError(self.returncode, str(self), output=msg)
subprocess.CalledProcessError: Command 'docker run -e PYTHONUNBUFFERED=s -v /home/dnanexus:/home/dnanexus -w /home/dnanexus --rm dnanexus/regenie:latest bash -c 'regenie --step 2 --bgen ukb00000_c21_b0_v1.bgen --phenoFile Hematologic_malignancy_imputation.phe --bsize 200 --pThresh 0.05 --test additive --pred ukb_c1-22_GRCh38_full_analysis_set_plus_decoy_hla_merged_pred.list --gz --sample ukb00000_c21_b0_v1.sample --extract /home/dnanexus/gel_imputed_snps_data_qc_pass_HM.snplist --covarFile Hematologic_malignancy_imputation.phe --bt --spa --phenoColList Hematologic_malignancy_cc --covarColList sex,age,ethnic_background,alcohol_status,hypertension,ever_smoked,SBP,BMI,NLR,CRP --ref-first --htp ukb00000_c21_b0_v1 --out ukb00000_c21_b0_v1'' returned non-zero exit status 1.

List out of range for Data-Field 2090

Hi,
I have created two cohorts using the Cohort Browser, namely "exome_natd_case" and "exome_natd_control", for the above-mentioned data field. I can load them using dxdata.load_cohort:

case = dxdata.load_cohort("exome_natd_case")  
cont = dxdata.load_cohort("exome_natd_control")  
case
<dxdata.dashboard.base.CohortQuery at 0x7fe64e61c0b8>

But when I try to retrieve the field description, I get the attached error, "list out of range": ![list_out_of_index_dxdata](https://user-images.githubusercontent.com/12833907/154571422-1fbe0b3f-3b03-48e2-99eb-4d15cfcb4baf.png)
Any help on this matter?

Query exceeded timeout [120]

command:
cohort_template <- "dx extract_dataset {dataset} --fields {field_list} -o cohort_data.csv"
cmd <- glue::glue(cohort_template)
system(cmd)

bug:
{'type': 'QueryTimeOut', 'message': 'Query exceeded timeout [120]. Cancelled', 'details': {}, 'time': '05/30/2023, 08:08:18'}
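
One workaround sometimes tried for this kind of timeout is to split a long field list into several smaller `dx extract_dataset` calls and join the resulting CSV files on the participant ID afterwards. The sketch below only builds the commands. It is written in Python rather than the R used above, the helper names are hypothetical, the ID field name participant.eid is an assumption, and smaller queries are not guaranteed to finish within the limit.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def extract_commands(dataset, fields, id_field="participant.eid",
                     chunk_size=20, prefix="cohort_data"):
    """Build one `dx extract_dataset` argument list per chunk of fields."""
    # Keep the ID field out of the chunks; it is prepended to every chunk instead
    # so that the per-chunk CSV files can be joined on it later.
    fields = [f for f in fields if f != id_field]
    commands = []
    for n, chunk in enumerate(chunked(fields, chunk_size)):
        field_list = ",".join([id_field] + list(chunk))
        commands.append(["dx", "extract_dataset", dataset,
                         "--fields", field_list,
                         "-o", f"{prefix}_{n}.csv"])
    return commands
```

Each argument list can be passed to subprocess.run. The chunk size of 20 is an arbitrary starting point, not a documented limit.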

Why not use dxfuse throughout the entire regenie workflow

Thanks for your wonderful work.

I've noticed that performing the regenie workflow using dxfuse is quite convenient and can significantly reduce the time spent on data downloading, especially when running step 2. Why doesn't our example code fully incorporate dxfuse?

Fitting Linear Model

Hi Alex
Is there a reason why the model was not fitted with npx_normalised data?
many thanks
prasad

Json format for WDL workflow

Hi @anastazie-dnanexus

Thanks for all the tutorials on UKB-RAP use

I'm very new to WDL and I was wondering how the JSON file would look for this specific example, in particular for geno_bgen_files/geno_sample_files. Is it a text file that contains the name of one bgen file per line? Can you provide an example of this within the DNAnexus file system?

workflow bgens_qc {
  input {
    Array[File]+ geno_bgen_files
    Array[File]+ geno_sample_files
    Boolean ref_first = true
    File? keep_file
    Array[File] extract_files
    String plink2_options = ""
    String output_prefix
  }

Thanks a lot
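
In a standard WDL inputs JSON, an Array[File] input is a JSON list of file references keyed by the workflow-qualified input name, not a text file with one name per line. The sketch below generates such a file for the bgens_qc inputs quoted above. The dx:// paths and the project ID are placeholders, the helper name is hypothetical, and the exact file-reference syntax should be checked against the dxCompiler documentation.

```python
import json


def bgens_qc_inputs(bgen_files, sample_files, extract_files,
                    output_prefix, keep_file=None):
    """Build the inputs mapping for the bgens_qc workflow shown above."""
    # Array[File]+ in WDL means the array must not be empty.
    if not bgen_files or not sample_files:
        raise ValueError("geno_bgen_files and geno_sample_files need at least one file")
    inputs = {
        "bgens_qc.geno_bgen_files": list(bgen_files),
        "bgens_qc.geno_sample_files": list(sample_files),
        "bgens_qc.extract_files": list(extract_files),
        "bgens_qc.output_prefix": output_prefix,
    }
    if keep_file is not None:  # File? is optional, so omit it when unset
        inputs["bgens_qc.keep_file"] = keep_file
    return inputs


# Placeholder references: replace project-xxxx and the paths with real ones.
example = bgens_qc_inputs(
    bgen_files=["dx://project-xxxx:/path/to/chr21.bgen",
                "dx://project-xxxx:/path/to/chr22.bgen"],
    sample_files=["dx://project-xxxx:/path/to/chr21.sample",
                  "dx://project-xxxx:/path/to/chr22.sample"],
    extract_files=["dx://project-xxxx:/path/to/qc_pass.snplist"],
    output_prefix="bgens_qc_test",
)
print(json.dumps(example, indent=2))
```

ref_first and plink2_options have defaults in the workflow, so they are left out here. They could be added under the keys bgens_qc.ref_first and bgens_qc.plink2_options.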

phenoCol covarCol

Hi,
Thanks for your awesome work.

What will happen if I don't specify the "phenoCol" and "covarCol" parameters? Will all the columns in the "covarFile" and "phenoFile" be treated as covariates/phenotypes?

No return for 'eid'.

Hi, the command below doesn't work:

df_qced['FID'] = df_qced['IID']

I think there's a bug:

Bug:

fields = [fields_for_id(f)[0] for f in field_ids] + [participant.find_field(name='p20160_i0')]

Debugged:

fields = [fields_for_id(f)[0] for f in field_ids] + [participant.find_field(name='p20160_i0')] + [participant.find_field(name='eid')]

Quotes and name problem in partD-step1-regenie.sh

Hey Anastazie,

Really love the code you have provided and your DNAnexus tutorials!

I've noticed an issue in the GWAS code partD-step1-regenie.sh:
I think this line:
-icmd=${run_plink_wes}
should read:
-icmd="${run_regenie_step1}"

Also I noticed you added ";" at the end of the chunks of code for part D only - they're not in the other parts, maybe they don't have an effect but I haven't added them just in case!

Thanks,

Kathryn
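
Why the quotes matter can be shown with a small sketch. It uses Python's shlex.split as a stand-in for bash word splitting of the expanded variable. The app name swiss-army-knife and the regenie command string are illustrative assumptions, not taken from the script.

```python
import shlex


def dx_run_words(command, quoted):
    """Split a `dx run ... -icmd=<command>` line the way a POSIX shell would."""
    value = f'"{command}"' if quoted else command
    # ASSUMPTION: the app name is illustrative only.
    return shlex.split(f"dx run swiss-army-knife -icmd={value}")


run_regenie_step1 = "regenie --step 1 --bsize 1000"  # hypothetical command string

# Unquoted: only the first word stays attached to -icmd=; the remaining
# words become separate arguments to dx run.
print(dx_run_words(run_regenie_step1, quoted=False))

# Quoted: the whole command reaches the app as a single -icmd value.
print(dx_run_words(run_regenie_step1, quoted=True))
```

In the unquoted case dx run would see --step, 1 and so on as its own arguments, which is why the line needs "${run_regenie_step1}" with double quotes.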
