dnanexus / ukb_rap

Access shared, reviewed code and Jupyter Notebooks for use on the UK Biobank (UKBB) Research Analysis Platform. Includes resources from DNAnexus webinars, online trainings and workshops.

License: MIT License

Shell 0.78% Jupyter Notebook 68.89% WDL 1.51% R 0.14% HTML 28.63% Dockerfile 0.05%

gwas ukbiobank wdl

ukb_rap's Introduction

DNAnexus

DNAnexus Apps and Scripts

applets

binning_step0: BioBin Pipeline
biobin_pipeline
binning_step1: BioBin Pipeline
biobin_pipeline
binning_step2: BioBin Pipeline
biobin_pipeline
binning_step3: BioBin Pipeline
biobin_pipeline
impute2_group_join: Impute2_group_join
This app can be used to merge multiple imputed impute2 files
plato_biobin: PLATO BioBin Regression Analysis
PLATO_BioBin
vcf_batch: VCF Batch effect tester
vcf_batch

apps

association_result_annotation: Annotate GWAS, PheWAS Associations
association_result_annotation
biobin:
This app runs the latest development build of the rare variant binning tool BioBin.
generate_phenotype_matrix: Generate Phenotype Matrix
generate_phenotype_matrix
genotype_case_control: Generate Case/Control by Genotype
Provides case and control counts for each genotype
impute2: imputation
This app performs imputation using Impute2
impute2_to_plink: Impute2 To PLINK
Convert Impute2 file to PLINK files
plato_single_variant: PLATO - Single Variant Analysis
This app lets you run single-variant association tests against a single phenotype (GWAS) or multiple phenotypes (PheWAS)
rl_sleeper_app: sleeper
This app provides some useful tools for working with data in DNAnexus. It is designed to be run from the command line with "dx run --ssh RL_Sleeper_App" in the project containing the data you want to explore (use "dx select" to switch projects as needed).
shapeit2: SHAPEIT2
This app performs phasing using SHAPEIT2
strand_align: Strand Align
Strand Align prior to phasing
vcf_annotation_formatter:
Extracts and reformats VCF annotations (CLINVAR, dbNSFP, SIFT, SNPEff)
QC_apps subfolder:
- drop_marker_sample: Drop Markers and/or Samples (PLINK)
  - drop_marker_sample
- drop_relateds: Relatedness Filter (IBD)
  - drop_relateds
- extract_marker_sample: Drop Markers and/or Samples (PLINK)
  - extract_marker_sample
- maf_filter: Marker MAF Rate Filter (PLINK)
  - maf_filter
- marker_call_filter: Marker Call Rate Filter (PLINK)
  - marker_call_filter
- missing_summary: Missingness Summary (PLINK)
  - Returns missingness rate by sample
- pca: Principal Component Analysis using SMARTPCA
  - pca
- sample_call_filter: Sample Call Rate Filter (PLINK)
  - sample_call_filter

scripts

cat_vcf.py *
download_intervals.py *
download_part.py *
estimate_size.py *
interval_pad.py
- Reads a BED file from standard input, pads the intervals, sorts them, and outputs intervals guaranteed to be non-overlapping
update_applet.sh *
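
The pad, sort, and merge behaviour described for interval_pad.py can be sketched as follows. This is a minimal illustration and not the script's actual code: the function names, the treatment of touching intervals as mergeable, and the skipping of header lines are all assumptions.

```python
def read_bed(lines):
    """Yield (chrom, start, end) from BED-formatted lines (hypothetical helper)."""
    for line in lines:
        # Assumption: blank lines and common BED header lines are skipped.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        yield chrom, int(start), int(end)


def pad_and_merge(intervals, pad):
    """Pad each interval by `pad` bases, sort, and merge any that overlap or touch."""
    padded = sorted((c, max(0, s - pad), e + pad) for c, s, e in intervals)
    merged = []
    for chrom, start, end in padded:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Same chromosome and overlapping the previous interval: extend it.
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(m) for m in merged]


example = ["chr1\t100\t200", "chr1\t250\t300", "chr2\t5\t10"]
result = pad_and_merge(read_bed(example), pad=30)
```

To mimic the stdin behaviour, `read_bed(sys.stdin)` could be passed in place of the example list and each resulting tuple printed tab-separated.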

sequencing

bcftools_view:
- Calls "bcftools view". Still in experimental stages.
calc_ibd:
- Calculates a pairwise IBD estimate from either VCF or PLINK files using PLINK 1.9.
call_bqsr: Base Quality Score Recalibration
- Call GATK BaseRecalibrator and return the tables for use in HaplotypeCaller
call_genotypes:
- Obsolete, do not use; use geno_p instead. Calls GATK GenotypeGVCFs.
call_hc:
- Call GATK HaplotypeCaller and return gVCF files
call_vqsr:
- Calls GATK VariantRecalibrator and returns the files needed to apply the recalibration
cat_variants: combine_variants
- Combines non-overlapping VCF files with the same subjects. A reimplementation of GATK CatVariants (GATK CatVariants available upon request)
combine_variants: combine_variants
- Calls GATK CombineVariants to merge VCF files
gen_ancestry:
- Determine Ancestry from PCA. Uses an eigenvector file and training dataset listing known ancestries. Runs QDA to determine posterior ancestries for all samples, even those in the training set.
gen_related_todrop:
- Uses a PLINK IBD file to determine the minimal set of samples to drop in order to generate an unrelated sample set. Uses a minimum vertex cut algorithm of the related samples to get
geno_p:
- Calls GATK GenotypeGVCFs in parallel by chromosome
merge_gvcfs:
- Calls GATK CombineGVCFs
plink_merge:
- Merge PLINK bed/bim/fam files using PLINK 1.9
select_variants: VCF QC
- Calls GATK SelectVariants
variant_annotator: VCF QC
- Calls GATK VariantAnnotator
vcf_annotate: Annotate VCF File
- Use a variety of tools to annotate a sites-only VCF.
vcf_concordance: VCF Concordance
- Generate concordance metrics from VCF file(s) using GATK GenotypeConcordance. Not recommended for large files.
vcf_gen_lof:
- Subset a VCF from vcf_annotate based on the given annotations to get a sites-only VCF of loss-of-function variants.
vcf_pca:
- Uses PLINK 1.9 and eigenstrat 6.0 to calculate principal components from VCF or PLINK bed/bim/fam files.
vcf_qc:
- Calls GATK ApplyRecalibration and GATK VariantFiltration to apply filters to VCF files.
vcf_query:
- Calls "bcftools query" to extract annotations from the VCF file. Used in the stripping of files for MEGAbase
vcf_sitesonly: VCF QC
- Generates a sites-only file from full VCF files.
vcf_slice: Slice VCF File(s)
- Return a small section of a VCF file (similar to tabix). For large output, many small regions, or subsetting samples, use subset_vcf instead.
vcf_summary: VCF Summary Statistics
- Generate summary statistics for a VCF file (by sample and by variant)
vcf_to_plink:
- Uses PLINK 1.9 to convert VCF files to PLINK bed/bim/fam files
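
To illustrate the idea behind gen_related_todrop (choosing samples to drop so that no related pair remains), here is a small sketch. It uses a simple greedy heuristic, not the minimum vertex cut algorithm the app description mentions, so it does not guarantee a minimal set. The function name and the input format (pairs of sample IDs already filtered by an IBD threshold) are assumptions.

```python
from collections import defaultdict


def greedy_samples_to_drop(related_pairs):
    """Greedy heuristic: repeatedly drop the sample with the most remaining relatives."""
    neighbours = defaultdict(set)
    for a, b in related_pairs:
        if a != b:
            neighbours[a].add(b)
            neighbours[b].add(a)

    dropped = []
    while True:
        # Samples that still have at least one relative in the dataset.
        candidates = sorted(s for s in neighbours if neighbours[s])
        if not candidates:
            break
        # Ties are broken alphabetically because max() keeps the first maximum.
        worst = max(candidates, key=lambda s: len(neighbours[s]))
        for other in neighbours.pop(worst):
            neighbours[other].discard(worst)
        dropped.append(worst)
    return dropped


pairs = [("A", "B"), ("A", "C"), ("D", "E")]
to_drop = greedy_samples_to_drop(pairs)
```

With these pairs, dropping "A" resolves two related pairs at once, and one member of the "D"/"E" pair must also go.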


ukb_rap's Issues

No need to liftover to Hg38

On this page:
https://github.com/dnanexus/UKB_RAP/tree/main/end_to_end_gwas_phewas/liftover_plink_beds_tmp
It says:
"In order to perform association testing with the sequncing data on step-2, it is required to get the array genotyping data on the same reference coordinate as the sequencing data."

Whereas it seems it is actually not required but simply desirable:
rgcgithub/regenie#82

Running a script over a bunch of vcf files

I am interested in individual-level WGS vcf files in "/mnt/project/Bulk/DRAGEN WGS/Whole genome variant call files (VCFs) (DRAGEN) [500k release]" for a list of participants in eid.txt.
eid.txt

I want to locate these VCF files, so I ran the commands below, but I get an error:

usr@LV19Y7325V dnanexus-upload-agent-1.5.33-osx %
 dx ls "/mnt/project/Bulk/DRAGEN WGS/Whole genome variant call files (VCFs) (DRAGEN) [500k release]" VCF/*gz > tempfile.txt ;
zsh: no matches found: VCF/*gz

I tried this command and got an error too:

usr@LV19Y7325V dnanexus-upload-agent-1.5.33-osx %
 for i in `cat eid.txt`; do dx find data --property eid=$i --folder "/mnt/project/Bulk/DRAGEN WGS/Whole genome variant call files (VCFs) (DRAGEN) [500k release]" ; done
cat: eid.txt: No such file or directory

Or, I tried this

usr@LV19Y7325V ~ % dx ls './*.gz*'                         
dxpy.utils.resolver.ResolutionError: Unable to resolve "*.gz*" to a data object or folder name in '/Bulk/DRAGEN WGS/Whole genome variant call files (VCFs) (DRAGEN) [500k release]'

Please could somebody help me with this?
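
Two of the errors above are local shell issues rather than platform ones: zsh tries to expand the unquoted `VCF/*gz` itself before `dx` sees it, and `cat: eid.txt` means the file is not in the current working directory. Once a plain listing of the folder is available (for example, the saved output of a quoted `dx ls` call), the matching step can be done offline. The sketch below assumes, purely for illustration, that each file name starts with the participant ID followed by an underscore; the names shown are placeholders, not real UK Biobank file names.

```python
def select_vcfs(file_names, eids, suffix=".vcf.gz"):
    """Return file names whose leading ID (text before the first '_') is in `eids`."""
    wanted = {str(e).strip() for e in eids}
    selected = []
    for name in file_names:
        name = name.strip()
        # Assumption: the participant ID is the prefix before the first underscore.
        eid = name.split("_", 1)[0]
        if eid in wanted and name.endswith(suffix):
            selected.append(name)
    return selected


# Placeholder listing and ID list, standing in for `dx ls` output and eid.txt.
listing = [
    "1000001_XXXXX_0_0.vcf.gz",
    "1000001_XXXXX_0_0.vcf.gz.tbi",
    "1000002_XXXXX_0_0.vcf.gz",
]
matches = select_vcfs(listing, ["1000001"])
```

In practice `listing` and the ID list would be read from files with `open(...)`, using full paths so that the working directory does not matter.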

Running function "main" of "regenie_test_associations" failed because of AppInternalError

Hi everyone,
This is our first time using UKB_RAP.
We're running regenie for GWAS. Step 1 completed successfully, but we encountered a problem during step 2 (the log file is below).
Has anyone had the same problem?

Options in effect:
--step 2
--bgen ukb00000_c21_b0_v1.bgen
--phenoFile Hematologic_malignancy_imputation.phe
--bsize 200
--pThresh 0.05
--test additive
--pred ukb_c1-22_GRCh38_full_analysis_set_plus_decoy_hla_merged_pred.list
--gz
--sample ukb00000_c21_b0_v1.sample
--extract /home/dnanexus/gel_imputed_snps_data_qc_pass_HM.snplist
--covarFile Hematologic_malignancy_imputation.phe
--bt
--spa
--phenoColList Hematologic_malignancy_cc
--covarColList sex,age,ethnic_background,alcohol_status,hypertension,ever_smoked,SBP,BMI,NLR,CRP
--ref-first
--htp ukb00000_c21_b0_v1
--out ukb00000_c21_b0_v1
Association testing mode with fast multithreading using OpenMP

bgen : [ukb00000_c21_b0_v1.bgen]
-summary : bgen file (v1.2 layout, zstd compressed) with 488315 anonymous samples and 4376829 variants with 8-bit encoding.
-index bgi file [ukb00000_c21_b0_v1.bgen.bgi]
ERROR: BGenError
Traceback (most recent call last):
File "/home/dnanexus/job-GfKBQ7QJ3ZvYYY5Y9z3QffzY", line 39, in <module>
exec_code(job_code, _code_filename)
File "/home/dnanexus/job-GfKBQ7QJ3ZvYYY5Y9z3QffzY", line 25, in exec_code
exec(compiled_code, globals())
File "/home/dnanexus/job-GfKBQ7QJ3ZvYYY5Y9z3QffzY.code.py", line 208, in <module>
dxpy.run()
File "/usr/local/lib/python3.8/dist-packages/dxpy/utils/exec_utils.py", line 150, in run
result = ENTRY_POINT_TABLE[job['function']](**job['input'])
File "/home/dnanexus/job-GfKBQ7QJ3ZvYYY5Y9z3QffzY.code.py", line 103, in main
docker.run(cmd)
File "/usr/lib/python3/dist-packages/sugar/dockertools.py", line 347, in run
self._handle_last_process_cpe(cpe)
File "/usr/lib/python3/dist-packages/sugar/dockertools.py", line 368, in _handle_last_process_cpe
raise cpe
File "/usr/lib/python3/dist-packages/sugar/dockertools.py", line 337, in run
self._last_process = processing.run(
File "/usr/lib/python3/dist-packages/sugar/processing/__init__.py", line 122, in run
processes.block()
File "/usr/lib/python3/dist-packages/sugar/processing/core.py", line 462, in block
self.raise_if_error()
File "/usr/lib/python3/dist-packages/sugar/processing/core.py", line 576, in raise_if_error
raise CalledProcessError(self.returncode, str(self), output=msg)
subprocess.CalledProcessError: Command 'docker run -e PYTHONUNBUFFERED=s -v /home/dnanexus:/home/dnanexus -w /home/dnanexus --rm dnanexus/regenie:latest bash -c 'regenie --step 2 --bgen ukb00000_c21_b0_v1.bgen --phenoFile Hematologic_malignancy_imputation.phe --bsize 200 --pThresh 0.05 --test additive --pred ukb_c1-22_GRCh38_full_analysis_set_plus_decoy_hla_merged_pred.list --gz --sample ukb00000_c21_b0_v1.sample --extract /home/dnanexus/gel_imputed_snps_data_qc_pass_HM.snplist --covarFile Hematologic_malignancy_imputation.phe --bt --spa --phenoColList Hematologic_malignancy_cc --covarColList sex,age,ethnic_background,alcohol_status,hypertension,ever_smoked,SBP,BMI,NLR,CRP --ref-first --htp ukb00000_c21_b0_v1 --out ukb00000_c21_b0_v1'' returned non-zero exit status 1.

List out of range for Data-Field 2090

Hi,
I have created two cohorts using the Cohort Browser, namely "exome_natd_case" and "exome_natd_control", for the above-mentioned data field. I can load them using dxdata.load_cohort:

case = dxdata.load_cohort("exome_natd_case")  
cont = dxdata.load_cohort("exome_natd_control")  
case
<dxdata.dashboard.base.CohortQuery at 0x7fe64e61c0b8>

But when I try to retrieve the field description, I get the attached error, "list out of range": ![list_out_of_index_dxdata](https://user-images.githubusercontent.com/12833907/154571422-1fbe0b3f-3b03-48e2-99eb-4d15cfcb4baf.png)
Any help on this matter?

Query exceeded timeout [120]

command:
cohort_template <- "dx extract_dataset {dataset} --fields {field_list} -o cohort_data.csv"
cmd <- glue::glue(cohort_template)
system(cmd)

bug:
{'type': 'QueryTimeOut', 'message': 'Query exceeded timeout [120]. Cancelled', 'details': {}, 'time': '05/30/2023, 08:08:18'}
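
One possible workaround, offered as an untested assumption rather than a confirmed fix, is to split a long field list into smaller requests so that each query does less work. The sketch below only builds the `dx extract_dataset` command lines using the same `--fields` and `-o` options as the command above; the chunk size, output file names, dataset ID, and the `participant.eid` join column are illustrative placeholders.

```python
def build_extract_commands(dataset, fields, chunk_size=20, id_field="participant.eid"):
    """Build one `dx extract_dataset` argument list per chunk of fields."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    # Keep the ID column out of the chunks, then prepend it to every request
    # so the partial CSVs can be joined afterwards.
    others = [f for f in fields if f != id_field]
    commands = []
    for part, start in enumerate(range(0, len(others), chunk_size)):
        chunk = [id_field] + others[start:start + chunk_size]
        commands.append([
            "dx", "extract_dataset", dataset,
            "--fields", ",".join(chunk),
            "-o", f"cohort_data_part{part}.csv",
        ])
    return commands


# Hypothetical dataset ID and field names, for illustration only.
cmds = build_extract_commands(
    "record-XXXX",
    ["participant.p31", "participant.p21022", "participant.p22001"],
    chunk_size=2,
)
```

Each argument list could then be run with `subprocess.run(cmd, check=True)` and the resulting CSVs merged on the ID column.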

Why not use dxfuse throughout the entire regenie workflow

Thanks for your wonderful work.

I've noticed that performing the regenie workflow using dxfuse is quite convenient and can significantly reduce the time spent on data downloading, especially when running step 2. Why doesn't our example code fully incorporate dxfuse?

Fitting Linear Model

Hi Alex
Is there a reason why the model was not fitted with npx_normalised data?
many thanks
prasad

Json format for WDL workflow

Hi @anastazie-dnanexus

Thanks for all the tutorials on UKB-RAP use

I'm very new to WDL and I was wondering what the JSON file would look like for this specific example of geno_bgen_files/geno_sample_files. Is this a text file that contains the name of one bgen file per line? Can you provide an example of this within the DNAnexus file system?

workflow bgens_qc {
  input {
    Array[File]+ geno_bgen_files
    Array[File]+ geno_sample_files
    Boolean ref_first = true
    File? keep_file
    Array[File] extract_files
    String plink2_options = ""
    String output_prefix
  }

Thanks a lot
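
As a rough illustration of the question above: in WDL, an `Array[File]` input is supplied as a JSON array under a `workflow_name.input_name` key, not as a text file with one name per line. The sketch below builds such an inputs document for the `bgens_qc` inputs shown. The `dx://project-...:/path` file references follow the URI style commonly used with dxCompiler, which is an assumption here, and every project ID, path, and prefix is a hypothetical placeholder.

```python
import json


def build_inputs(workflow, bgen_files, sample_files, extract_files, output_prefix):
    """Assemble a WDL inputs dictionary keyed as '<workflow>.<input>'."""
    if not bgen_files or len(bgen_files) != len(sample_files):
        # Array[File]+ must be non-empty; pairing bgen and sample files is assumed.
        raise ValueError("need equal, non-empty lists of bgen and sample files")
    return {
        f"{workflow}.geno_bgen_files": list(bgen_files),
        f"{workflow}.geno_sample_files": list(sample_files),
        f"{workflow}.extract_files": list(extract_files),
        f"{workflow}.output_prefix": output_prefix,
        # ref_first, keep_file and plink2_options are optional or have defaults.
    }


inputs = build_inputs(
    "bgens_qc",
    bgen_files=["dx://project-XXXX:/path/to/chr21.bgen"],      # placeholder
    sample_files=["dx://project-XXXX:/path/to/chr21.sample"],  # placeholder
    extract_files=["dx://project-XXXX:/path/to/chr21.snplist"],  # placeholder
    output_prefix="bgens_qc_out",
)
inputs_json = json.dumps(inputs, indent=2)
```

Writing `inputs_json` to a file gives an inputs document of the general shape WDL runners expect; the exact file-reference syntax should be checked against the tool actually used to compile or run the workflow.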

phenoCol covarCol

Hi,
Thanks for your awesome work.

What will happen if I don't specify the "phenoCol" and "covarCol" parameters? Will all the columns in the "covarFile" and "phenoFile" be treated as covariates/phenotypes?

No return for 'eid'.

Hi, The command below doesn't work:

df_qced['FID'] = df_qced['IID']

I think there's a bug:

Bug:

fields = [fields_for_id(f)[0] for f in field_ids] + [participant.find_field(name='p20160_i0')]

Debugged:

fields = [fields_for_id(f)[0] for f in field_ids] + [participant.find_field(name='p20160_i0')] + [participant.find_field(name='eid')]

Quotes and name problem in partD-step1-regenie.sh

Hey Anastazie,

Really love the code you have provided and your DNAnexus tutorials!

I've noticed an issue in the GWAS code in partD-step1-regenie.sh:
I think this line:
-icmd=${run_plink_wes}
should read:
-icmd="${run_regenie_step1}"

Also I noticed you added ";" at the end of the chunks of code for part D only - they're not in the other parts, maybe they don't have an effect but I haven't added them just in case!
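
The effect of the missing quotes can be shown with Python's `shlex`, which splits a string using POSIX shell quoting rules. This is only an approximation of what the shell does with an unquoted variable, and the command string is a hypothetical stand-in for the real contents of `run_regenie_step1`.

```python
import shlex

# Hypothetical command text; the real variable holds the full regenie call.
run_regenie_step1 = "regenie --step 1 --out fit_out"


def split_like_shell(argument):
    """Split a command-line fragment roughly the way a POSIX shell would."""
    return shlex.split(argument)


# Unquoted expansion: the value is broken at spaces, so only the first word
# stays attached to -icmd= and the rest become separate arguments.
unquoted = split_like_shell(f"-icmd={run_regenie_step1}")

# Quoted expansion: the whole command stays inside a single -icmd= argument.
quoted = split_like_shell(f"-icmd={shlex.quote(run_regenie_step1)}")
```

Here `unquoted` contains five separate words while `quoted` contains one, which is why the quoted form is the one that passes the complete command through.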

Thanks,

Kathryn