
comorment / containers

CoMorMent-Containers

Home Page: https://www.comorment.uio.no

License: GNU General Public License v3.0

MATLAB 0.01% Shell 7.30% Python 24.93% R 43.60% Perl 2.24% Mathematica 0.40% Makefile 0.23% Dockerfile 1.36% Batchfile 0.12% Jupyter Notebook 19.83%
containers gwas polygenic-risk-scores singularity-containers


containers's Issues

merge-regenie missing IDs

The gwas.py merge-regenie command detects missing SNP IDs in the .afreq files as duplicates and throws an error. Would it be possible to specify a command for removing (or replacing with CHR:BP) missing SNP IDs?
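A hedged sketch of what such an option could do, assuming .afreq-style columns named ID, CHR and BP (the column names are assumptions, and this is not the gwas.py implementation):

```python
import pandas as pd

def fill_missing_snp_ids(df, id_col='ID', chr_col='CHR', bp_col='BP'):
    """Replace missing SNP IDs with CHR:BP identifiers so they are no
    longer flagged as duplicates (hypothetical helper; column names are
    assumptions and should match the actual .afreq layout)."""
    missing = df[id_col].isna() | (df[id_col] == '.')
    df = df.copy()
    df.loc[missing, id_col] = (df.loc[missing, chr_col].astype(str)
                               + ':' + df.loc[missing, bp_col].astype(str))
    return df
```

Rows whose ID is empty or the PLINK-style '.' placeholder get a positional identifier instead; everything else is left untouched.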

Pin software versions

Copied from comorment/gwas#29:
In the current Dockerfile recipes and bash installer files, versions of the different tools are usually not pinned. Thus a (re)built container will likely differ from day to day, in particular if packages are installed from channels like conda-forge where updates are frequent.
Ideally, versions should explicitly be pinned in the recipes, e.g., like

FROM buildpack-deps:focal

RUN apt-get update && \
    apt-get install --no-install-recommends -y \
    cmake=3.16.3-1ubuntu1 \
    python3-dev=3.8.2-0ubuntu2 \
    ....
RUN pip install h5py==2.10.0 && \
    pip install git+https://github.com/NeuralEnsemble/parameters@b95bac2bd17f03ce600541e435e270a1e1c5a478#egg=parameters \
    ...
RUN git clone --depth 1 -b v3.1 https://github.com/nest/nest-simulator /usr/src/nest-simulator && \
    # compile
    ...

The above is just taken from another project of mine (complete example: https://github.com/LFPy/LFPykernels/blob/main/Dockerfile).

Version pinning is also a best practice suggested by Dockerfile linting tools like Hadolint (https://hadolint.github.io/hadolint/).

``ld.so`` error when run on SURFSARA login node

I'm getting the following error when running the hello.sif container on the SURFsara login node. However, the error seems to be harmless:

ERROR: ld.so: object '/sara/tools/xalt/xalt/lib64/libxalt_init.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.

METAL analysis question: do we want an NCASE column?

The METAL directives below define variables that need to be accumulated across GWASes for each SNP:

CUSTOMVARIABLE NCASE
CUSTOMVARIABLE NCONTROL

However, the output from Regenie only has N, not NCASE or NCONTROL.
If we want this implemented, we would need Regenie to have additional output columns.

Job 3 should only compile chromosomes together if there are no slurm errors

For the SAIGE analysis, some chromosomes did not finish, yet job 3 still went on and combined all .saige chromosome files, including the partially finished ones.

The gwas.py script should check the end of the corresponding .out files for errors, to verify that the previous step in job 2 finished, before running job 3.
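One possible approach, sketched in Python; the completion marker and error strings are assumptions about the .out log format, not what gwas.py currently does:

```python
def jobs_completed_ok(out_files, done_marker='Analysis finished'):
    """Return True only if every per-chromosome .out file exists, contains
    the completion marker, and has no error lines. The marker strings are
    assumptions about the slurm/SAIGE log format."""
    if not out_files:
        return False
    for fname in out_files:
        try:
            with open(fname) as fh:
                text = fh.read()
        except OSError:
            return False  # missing log => treat the chromosome as failed
        if 'ERROR' in text.upper() or done_marker not in text:
            return False
    return True
```

Job 3 would call this on the job-2 logs and abort (or re-submit the failed chromosomes) instead of merging partial results.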

[screenshot]

Manhattan plot (python_convert)

For some reason the Manhattan plot utility from python_convert only includes about 45,000 of the ~8 million SNPs in my summary statistics, without outputting any warnings or errors. What reference data is used?

It would be great if more information could be added about the usage of the clump utility (sumstats.py). What are the minimum required flags? What are the defaults? What is the reference dataset? To unify these processes across sites as much as possible, we should probably pre-set as many of the parameters as we can.

Intermittent failure in scripts/from_docker_image.sh

Sometimes the scripts/from_docker_image.sh script fails with an error. Here is how I call it:

>sudo make gwas.sif

which in turn triggers the following command

docker build -t gwas -f containers/gwas/Dockerfile . && scripts/convert_docker_image_to_singularity.sh gwas          

The first part of it succeeds, but the second fails:

...
Successfully built 6038cc3d0a1a
Successfully tagged gwas:latest
registry
Using default tag: latest
The push refers to repository [localhost:5000/gwas]
Get http://localhost:5000/v2/: EOF
make: *** [Makefile:4: gwas.sif] Error 1

Running it one more time always solves the problem.
I haven't investigated what the problem is here...

--nThreads in Saige step 2

[screenshot]

--nThreads is invalid for the second step; it only works for the first step of the SAIGE analysis.

Maybe there is another solution.

suggestions for gwas.py

  • Perhaps add the effect allele frequency to the outputted sumstats?
  • The regenie jobs created with the argsfile contain an extra --bt argument that I think can be removed.
  • The sbatch commands do not work on our system; on Bianca you need to specify -A (account/project for computational time) and -p (node/core, together with -n).
  • The 'parallel' command in gwas_real is not recognized; here it's a module that needs to be loaded first (whereas singularity, on the other hand, is not a module).
  • Does regenie have flags for some standard QC, like MAF, missingness, or HWE filtering? Maybe some of these standard QC checks could be built in.
  • Would it be an idea to auto-generate a readme for the output? Just so you can see at a glance which was the reference allele, on which scale the BETA and SE are, etc.

gwas.py reports incorrect sums for variables of type `CONTINUOUS`

We have a phenotype dictionary that looks like this:

[screenshot of the phenotype dictionary]

We supply ind_F3300 with the --pheno argument and the rest of the variables are supplied with --covar.

When we run gwas.py it incorrectly reports that the variables of type CONTINUOUS have no cases, controls, or missing values, even though all individuals have valid values for these variables:

[screenshot of the gwas.py log output]

It seems that the part of the code that reports these sums uses the variable pheno_type, which doesn't change between iterations:

    log.log("extracting phenotypes{}...".format(' and covariates' if join_covar_into_pheno else ''))
    pheno_and_covar_cols = args.pheno + (args.covar if join_covar_into_pheno else [])
    pheno_output = extract_variables(pheno, pheno_and_covar_cols, pheno_dict_map, log)
    for var in pheno_and_covar_cols:
        if pheno_type=='BINARY':
            log.log('variable: {}, cases: {}, controls: {}, missing: {}'.format(var, np.sum(pheno[var]=='1'), np.sum(pheno[var]=='0'), np.sum(pheno[var].isnull())))
        else:
            log.log('variable: {}, missing: {}'.format(var, np.sum(pheno[var].isnull())))

Source: https://github.com/comorment/containers/blob/main/gwas/gwas.py#L742-L749

If I'm not mistaken we can check the type of each variable using the pheno_dict_map, like so:

     log.log("extracting phenotypes{}...".format(' and covariates' if join_covar_into_pheno else ''))
     pheno_and_covar_cols = args.pheno + (args.covar if join_covar_into_pheno else [])
     pheno_output = extract_variables(pheno, pheno_and_covar_cols, pheno_dict_map, log)
     for var in pheno_and_covar_cols:
-        if pheno_type=='BINARY':
+        if pheno_dict_map[var]=='BINARY':
             log.log('variable: {}, cases: {}, controls: {}, missing: {}'.format(var, np.sum(pheno[var]=='1'), np.sum(pheno[var]=='0'), np.sum(pheno[var].isnull())))
         else:
             log.log('variable: {}, missing: {}'.format(var, np.sum(pheno[var].isnull())))

Clarify versioning of the containers

  • add "tags" to github
  • include those "tags" into ".sif" files (e.g. by adding a file inside container showing its version)
  • add "CHANGELOG" file listing what has changed across versions
  • provide a README page in the documentation explaining versioning

singularity/saige.sif is a text file, instead of the actual Singularity image

Before commit c38f807 the file singularity/saige.sif was 702 MB in size; after the commit that file is 295 bytes.

This is the content of that file now:

version https://git-lfs.github.com/spec/v1
<<<<<<< HEAD
oid sha256:8c870154d08604b5eefe2a4635a6ef22c2cf69b4dccb72f12367bda467dffb43
size 736071680
=======
oid sha256:1d8e3762db280395a73eb9bd3a070f6666717f8d0dbc76cb7738e256bf5649da
size 899510272
>>>>>>> 205045b7ae8864036476cf68d358bd0e9ce045c0

It seems like the file has been accidentally committed as a literal LFS pointer text file containing an unresolved merge conflict, instead of the actual Singularity image.

`--config` is not passed on to slurm jobs that run `merge-regenie`, `merge-saige` or `merge-plink2`

Here's the expected behavior:

Given I have an arguments file named "my_test.args" in the current directory
  And my arguments file has the argument "--analysis regenie figures"
  And my arguments file has the argument "--out my_test"
  And I have a config file named "my_test.yaml" in the current directory
 When I run "gwas.py --argsfile my_test.args --config my_test.yaml"
 Then the file "my_test.3.job" is created
  And the file "my_test.3.job" contains the command "gwas merge-regenie"
  And the file "my_test.3.job" contains the argument "--config my_test.yaml"

But currently, when going through this scenario (tested with 5d3a5b4), the last step fails: the argument --config is absent from the job file.

Error in job2 with saige "chunks"

This is the error in the .out file for job 2 in the unreleased SAIGE "chunks" test version.

The problem seems to be the flag for the start of the chunks. Maybe this is an R error? I'm not sure.

[screenshot]

Avoid requirement that FID and IID are the same in gwas.py

Current scripts are developed under the assumption that FID and IID are the same, and only IID is used to identify individuals and link them between the .pheno file and the genetic files. It would be good to design this in a more flexible way.

Add a small unit test setup

Describe the solution you'd like
Set up a small framework (py.test or similar) calling the different containers locally, checking that software installed in the containers returns its version or similar and does not crash (from missing libs, etc.).
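A minimal py.test-style smoke check could look like this; the singularity invocation and version flags are assumptions, and the helper returns None when singularity is absent so a test can skip on plain hosts:

```python
import re
import shutil
import subprocess

def parse_version(output):
    """Extract the first x.y or x.y.z version number from tool output."""
    m = re.search(r'\d+\.\d+(?:\.\d+)?', output)
    return m.group(0) if m else None

def container_tool_version(image, cmd):
    """Run a versioned command (e.g. ['plink2', '--version']) inside a
    .sif image and return the parsed version string, or None when
    singularity is unavailable. Image path and flags are assumptions."""
    if shutil.which('singularity') is None:
        return None
    result = subprocess.run(['singularity', 'exec', image] + cmd,
                            capture_output=True, text=True, check=True)
    return parse_version(result.stdout + result.stderr)
```

A test would then assert `container_tool_version('gwas.sif', ['plink2', '--version'])` is a non-empty string (or skip when it is None), which already catches missing libraries and broken entrypoints.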

Error with lead SNP file, and plotting

There is an error in the loci command. The output from loci is used when plotting.
This is the description of the loci command:
Perform LD-based clumping of summary stats, using a procedure that is similar to FUMA snp2gene functionality.
The plotting seems to expect the files iPSYCH2012_ind_F3300.lead.csv and iPSYCH2012_ind_F3300.indep.csv, but those files do not exist. I guess they would be produced by the loci command.

I tried to re-run and remove the flags for those two files to see if that solves the issue, but it didn't. And I know that chr 2 has variants that pass the significance threshold.

[screenshot]

Support for non-autosomes

It would be good if the script didn't throw an error for non-autosomes, but just filtered them out (or kept them if they have sensible codes).
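A sketch of such a filter, assuming a pandas DataFrame with a CHR column (column name is an assumption, and this is not the actual gwas.py code); 'X', 'Y' or 'MT' codes could be added to the keep-set if they should be retained:

```python
import pandas as pd

# Chromosome codes 1..22; extend with {'X', 'Y', 'MT'} to keep sex/mito.
AUTOSOMES = {str(c) for c in range(1, 23)}

def filter_autosomes(df, chr_col='CHR'):
    """Drop non-autosomal variants instead of raising an error,
    tolerating both '2' and 'chr2' style chromosome codes."""
    chrom = df[chr_col].astype(str).str.replace('^chr', '', regex=True)
    return df[chrom.isin(AUTOSOMES)].copy()
```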

Move matlab-related software to separate github repo

I don't think we have fully established a container with the MATLAB runtime to package software written in MATLAB.
Bayram has done a lot of work on this:

Note that matlabruntime.sif, as well as the pre-built pleiofdr and magicsquare binaries, can be built with a separate dev box. It's also hosted on NREC, but it's separate from the dev box where we build the rest of the containers. This is because the MATLAB runtime environment was pretty tricky to configure and make compatible with Docker/Singularity.

I suggest moving all of this to a separate github repo, which already exists:

  • https://github.com/comorment/matlabruntime
    Docs in this repo can be moved into a src folder to be a bit more hidden, and we'll need cleaner user docs for those users who want to run MATLAB code via the container.

Later we can consider including other tools such as https://github.com/precimed/mostest/ and FEMA (https://github.com/cmig-research-group/cmig_tools) in the MATLAB container. However, I consider this low priority, because MATLAB is quite tricky to squeeze into a container. Perhaps it's best if users just stick to running MATLAB code in their own environment.

So let's move everything MATLAB-related away from the github.com/comorment/containers repo, include it in https://github.com/comorment/matlabruntime, and then put it all on hold and discuss whether or not we want to put more effort into MATLAB-related containers.

New Saige update available.

https://github.com/weizhouUMICH/SAIGE

@ofrei

This update may resolve the issues we are having with saige.

We have released a new version, 1.0.0 (on March 15, 2022). It has substantial computational efficiency improvements for both Step 1 and Step 2 for single-variant and set-based tests. We have created a new program GitHub page, https://github.com/saigegit/SAIGE, with the documentation provided at https://saigegit.github.io/SAIGE-doc/. The program will be maintained there by multiple SAIGE developers. The docker image has been updated. Please feel free to try version 1.0.0 and report issues if any.

Thanks!

git LFS: files that should have been pointers

A shallow git clone of this repo reports that a few files should've been pointers:

 % GIT_LFS_SKIP_SMUDGE=1 /opt/homebrew/bin/git clone --depth 1 git@github.com:comorment/containers.git
Cloning into 'containers'...
remote: Enumerating objects: 1130, done.
remote: Counting objects: 100% (1130/1130), done.
remote: Compressing objects: 100% (1032/1032), done.
remote: Total 1130 (delta 28), reused 1083 (delta 20), pack-reused 0
Receiving objects: 100% (1130/1130), 19.23 MiB | 3.21 MiB/s, done.
Resolving deltas: 100% (28/28), done.
Updating files: 100% (1174/1174), done.
Encountered 5 files that should have been pointers, but weren't:
	usecases/bolt_out/example_3chr.frq
	usecases/bolt_out/example_3chr.log
	usecases/bolt_out/myld.l2.ldscore.gz
	usecases/bolt_out/myld.log
	usecases/saige_out/out_vcf.log

Not a big issue though, as they're all pretty small files:

 24K	usecases/bolt_out/example_3chr.frq
4.0K	usecases/bolt_out/example_3chr.log
4.0K	usecases/bolt_out/myld.l2.ldscore.gz
4.0K	usecases/bolt_out/myld.log
4.0K	usecases/saige_out/out_vcf.log

Edit: Some more info here: https://stackoverflow.com/questions/46704572/git-error-encountered-7-files-that-should-have-been-pointers-but-werent

Dockerfile recipes: Prefer MiniForge over MiniConda

MiniForge (https://github.com/conda-forge/miniforge) is the community-driven version of Conda. We can replace MiniConda with MiniForge in the Dockerfiles, as we're mainly using the conda-forge channel anyway. This also means we can ignore the Anaconda terms of service (https://legal.anaconda.com/policies/en/?name=terms-of-service), just in case.

Edit: We should rather use the Mambaforge variant of Miniforge, as mamba resolves environments much faster than conda.

Trouble merging statistics in gwas.py merge-regenie

Something goes wrong on my end with gwas.py merge-regenie. Both run_regenie1 and run_regenie2 run as expected, but then I get the following error from merge-regenie. It looks like something goes wrong in the join.

jacber@sens2017599-b10:~/nordic_gwas/basic$ $PYTHON ~/gwas.py merge-regenie --maf 0.1 --sumstats out/run_chr@_MDD_broad.regenie --basename out/run_chr@ --out out/run_MDD_broad --chr2use 1,2


  • gwas.py: pipeline for GWAS analysis
  • Version 1.1.0
  • (C) 2021 Oleksandr Frei, Bayram Akdeniz and Alexey A. Shadrin
  • Norwegian Centre for Mental Disorders Research / University of Oslo
  • Centre for Bioinformatics / University of Oslo
  • GNU General Public License v3

Call:
/home/jacber/gwas.py merge-regenie
--maf 0.1
--sumstats out/run_chr@_MDD_broad.regenie
--basename out/run_chr@
--out out/run_MDD_broad
--chr2use 1,2
Beginning analysis at Mon Aug 23 09:19:46 2021 by jacber, host sens2017599-b10.uppmax.uu.se
Traceback (most recent call last):
  File "/home/jacber/gwas.py", line 1908, in <module>
    args.func(args, log)
  File "/home/jacber/gwas.py", line 838, in merge_regenie
    df, info_col = apply_filters(args, df)
  File "/home/jacber/gwas.py", line 760, in apply_filters
    df = pd.merge(df, maf, how='left', on='SNP')
  File "/usr/local/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 89, in merge
    return op.get_result()
  File "/usr/local/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 684, in get_result
    join_index, left_indexer, right_indexer = self._get_join_info()
  File "/usr/local/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 909, in _get_join_info
    (left_indexer, right_indexer) = self._get_join_indexers()
  File "/usr/local/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 887, in _get_join_indexers
    return get_join_indexers(
  File "/usr/local/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 1441, in get_join_indexers
    return join_func(lkey, rkey, count, **kwargs)
  File "pandas/_libs/join.pyx", line 109, in pandas._libs.join.left_outer_join
MemoryError: Unable to allocate 446. GiB for an array with shape (59901284770,) and data type int64
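The 446 GiB allocation suggests duplicated SNP IDs on both sides of the left join, which makes the joined row count explode combinatorially. A hedged sketch of a guard that could be added before the merge (not the actual gwas.py code; column names are assumptions):

```python
import pandas as pd

def merge_maf(df, maf, on='SNP'):
    """Left-join allele frequencies after dropping duplicated SNP IDs
    from the frequency table, so duplicate keys cannot multiply the
    result size (hypothetical helper)."""
    dup = maf.duplicated(subset=on, keep='first')
    if dup.any():
        maf = maf[~dup]  # keep one frequency row per SNP ID
    return pd.merge(df, maf, how='left', on=on)
```

With this guard the output has exactly one row per sumstats row; whether dropping or disambiguating duplicates (e.g. via CHR:BP IDs) is preferable depends on the pipeline.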

Tool for distinguishing loci

Jacob and I were wondering whether a tool exists that does the opposite of genetic correlations, namely identifies loci that are specific to a trait. Say you want to compare summary statistics for bipolar disorder and depressive disorder, and you want to identify loci that are not shared between the two; is there a tool for that? You can do something like this with genomic SEM or by eyeballing circular Manhattan plots, but I could not think of a tool that specifically identifies non-shared variants.

If such a tool exists we would love to have it in the container toolbox.

PRS tools

Could you make sure the following packages are available and fully functional:

Move MAGMA and LAVA into its own github repository

MAGMA software is released as a binary, but it requires fairly large reference data, and for this reason it's best moved into a separate github repository.
The LAVA tool is quite different - it is based on R, and it addresses a different question than MAGMA.
But it needs some of the same reference files as MAGMA. Also, LAVA is developed by the same group as MAGMA. So it's reasonable to include LAVA in the same github repository - but perhaps in a separate .sif file (i.e. magma.sif and lava.sif).
The github repo can be https://github.com/comorment/magma

Include Dockerfile and scripts into comorment/containers repo

Currently we use https://github.com/comorment/gwas to keep all development-related scripts for comorment containers.
The https://github.com/comorment/containers repo is used to release singularity containers (as .sif files), to keep reference data, and for user documentation. This separation is suboptimal, and it makes more sense to include all development-related scripts (Dockerfiles, bash scripts, some dev instructions, etc.) in https://github.com/comorment/containers. However, we should keep those files somewhat hidden from the end user, for example by moving them to a new source folder in the root of this repo. After that, the github.com/comorment/gwas repo can be archived (i.e. kept in case we need the code history, but locked so no further changes can be submitted).

Also, we should change our development model and start using feature & bug-fix branches, using a pull request and code review to integrate changes into the main branch.

Problem with config location

args.config = yaml.safe_load(open(args.config, "r"))

I switched from copying the gwas.py to a personal folder to running gwas.py directly from the repository in the TSD environment which gave rise to the following problem.

If the yaml configuration file is not located in the directory from which gwas.py is executed, it seems that gwas.py won't find it. Maybe it's a good idea to retrieve the path of gwas.py itself to locate the configuration file?
os.path.dirname(os.path.realpath(__file__))

Replace

parent_parser.add_argument('--config', type=str, default="config.yaml", help="file with misc configuration options")

with

configFile = os.path.dirname(os.path.realpath(__file__)) + "/config.yaml"
parent_parser.add_argument('--config', type=str, default=configFile, help='file with misc configuration options')

Since this file seems to be required, add a check below line 986:

containers/gwas/gwas.py

Lines 985 to 986 in 6434e86

if args.out is None:
    raise ValueError('--out is required.')

if not os.path.exists(args.config):
    raise IOError('configuration file "' + os.path.basename(args.config) + '" not found')

I'm not sure if IOError is the appropriate error type, though...

Reading .pheno/ .dict

Please build in some flexibility to deal with variations in reading .pheno/.dict files. Of course it's infeasible (and unnecessary) to handle all possible variations; we need to strike a balance. Just be clear about the restrictions in the documentation.

reading geno files per chromosome

I have a suggestion for tweaking the gwas.py script so that it can write jobs using geno and geno-fit files that are split per chromosome.
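A minimal sketch of how the per-chromosome expansion could work, reusing the '@' placeholder convention that gwas.py already uses in paths like out/run_chr@ (the helper itself is hypothetical):

```python
def expand_chr(pattern, chr2use):
    """Expand an '@' chromosome placeholder into one path per
    chromosome, e.g. for per-chromosome geno/geno-fit files."""
    return [pattern.replace('@', str(c)) for c in chr2use]
```

The job writer could then loop over `expand_chr('geno/chr@.bed', range(1, 23))` instead of assuming a single merged fileset.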

Several errors with Saige (empty GMMATmodelFile)

Hello,

The new version of SAIGE is running, but I encountered (and fixed) some errors, and have now hit a new issue.

First, I got several long-flag errors, which were solved by removing the following flags:
--long flag "numLinesOutput" is invalid
--long flag "IsOutputAFinCaseCtrl" is invalid
--long flag "IsOutputNinCaseCtrl" is invalid

The issue in the screenshot below is more complicated: the GMMATmodelFile appears empty, causing an error that halts SAIGE, and I do not know how to change it. Interestingly, this wasn't an error before the last update.

[screenshot]
