
comorment / containers

CoMorMent-Containers

Home Page: https://www.comorment.uio.no

License: GNU General Public License v3.0

MATLAB 0.01% Shell 7.30% Python 24.93% R 43.60% Perl 2.24% Mathematica 0.40% Makefile 0.23% Dockerfile 1.36% Batchfile 0.12% Jupyter Notebook 19.83%
containers gwas polygenic-risk-scores singularity-containers


containers's Issues

merge-regenie missing IDs

The gwas.py merge-regenie command detects missing SNP IDs in the .afreq files as duplicates and throws an error. Would it be possible to specify a command for removing (or replacing with CHR:BP) missing SNP IDs?
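A hedged sketch of what such an option could do, assuming .afreq-style columns named ID, CHR and BP (the column names are assumptions, and this is not the gwas.py implementation):

```python
import pandas as pd

def fill_missing_snp_ids(df, id_col='ID', chr_col='CHR', bp_col='BP'):
    """Replace missing SNP IDs with CHR:BP identifiers so they are no
    longer flagged as duplicates (hypothetical helper; column names are
    assumptions and should match the actual .afreq layout)."""
    missing = df[id_col].isna() | (df[id_col] == '.')
    df = df.copy()
    df.loc[missing, id_col] = (df.loc[missing, chr_col].astype(str)
                               + ':' + df.loc[missing, bp_col].astype(str))
    return df
```

Rows whose ID is empty or the PLINK-style '.' placeholder get a positional identifier instead; everything else is left untouched.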

Pin software versions

Copied from comorment/gwas#29:
In the current Dockerfile recipes and bash installer files, versions of the different tools are usually not pinned. Thus a (re)built container will likely differ from day to day, in particular if packages are installed from channels like conda-forge where updates are frequent.
Ideally, versions should explicitly be pinned in the recipes, e.g., like

FROM buildpack-deps:focal

RUN apt-get update && \
    apt-get install --no-install-recommends -y \
    cmake=3.16.3-1ubuntu1 \
    python3-dev=3.8.2-0ubuntu2 \
    ....
RUN pip install h5py==2.10.0 && \
    pip install git+https://github.com/NeuralEnsemble/parameters@b95bac2bd17f03ce600541e435e270a1e1c5a478#egg=parameters \
    ...
RUN git clone --depth 1 -b v3.1 https://github.com/nest/nest-simulator /usr/src/nest-simulator && \
    # compile
    ...

The above is just taken from another project of mine (complete example: https://github.com/LFPy/LFPykernels/blob/main/Dockerfile).

Version pinning is also a best practice suggested by Dockerfile linting tools like Hadolint (https://hadolint.github.io/hadolint/).

``ld.so`` error when run on SURFSARA login node

I'm getting the following error when running the hello.sif container on the SURFsara login node. However, the error seems to be harmless:

ERROR: ld.so: object '/sara/tools/xalt/xalt/lib64/libxalt_init.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.

METAL analysis question: do we want an NCASE column?

The METAL directives below define variables that need to be accumulated across GWASes for each SNP:

CUSTOMVARIABLE NCASE
CUSTOMVARIABLE NCONTROL

However, the output from Regenie only has N, not NCASE or NCONTROL.
If we want this implemented, we would need Regenie to have additional output columns.

Job 3 should only compile chromosomes together if there are no slurm errors

For the SAIGE analysis, some chromosomes did not finish, yet job 3 still went on and combined all .saige chromosome files, including the partially finished ones.

The gwas.py script should check the end of the corresponding .out files for errors, to verify that the previous step in job 2 finished, before running job 3.
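One possible approach, sketched in Python; the completion marker and error strings are assumptions about the .out log format, not what gwas.py currently does:

```python
def jobs_completed_ok(out_files, done_marker='Analysis finished'):
    """Return True only if every per-chromosome .out file exists, contains
    the completion marker, and has no error lines. The marker strings are
    assumptions about the slurm/SAIGE log format."""
    if not out_files:
        return False
    for fname in out_files:
        try:
            with open(fname) as fh:
                text = fh.read()
        except OSError:
            return False  # missing log => treat the chromosome as failed
        if 'ERROR' in text.upper() or done_marker not in text:
            return False
    return True
```

Job 3 would call this on the job-2 logs and abort (or re-submit the failed chromosomes) instead of merging partial results.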

[screenshot]

Manhattan plot (python_convert)

For some reason the Manhattan plot utility from python_convert only includes about 45,000 of the ~8 million SNPs in my summary statistics, without outputting any warnings or errors. What reference data is used?

It would be great if more information could be added about the usage of the clump utility (sumstats.py). What are the minimum required flags? What are the defaults? What is the reference dataset? To unify these processes across sites as much as possible, we should probably pre-set as many of the parameters as we can.

Intermittent failure in scripts/from_docker_image.sh

Sometimes the scripts/from_docker_image.sh script fails with an error. Here is how I call it:

>sudo make gwas.sif

which in turn triggers the following command

docker build -t gwas -f containers/gwas/Dockerfile . && scripts/convert_docker_image_to_singularity.sh gwas          

The first part of it succeeds, but the second fails:

...
Successfully built 6038cc3d0a1a
Successfully tagged gwas:latest
registry
Using default tag: latest
The push refers to repository [localhost:5000/gwas]
Get http://localhost:5000/v2/: EOF
make: *** [Makefile:4: gwas.sif] Error 1

Running it one more time always solves the problem.
I haven't investigated what the problem is here...

--nThreads in Saige step 2

[screenshot]

--nThreads is invalid for the second step; it only works for the first step of the SAIGE analysis.

Maybe there is another solution.

suggestions for gwas.py

  • Perhaps add the effect allele frequency to the outputted sumstats?
  • The regenie jobs created with the argsfile contain an extra --bt argument that I think can be removed.
  • The sbatch commands do not work on our system; on Bianca you need to specify -A (account/project for computational time) and -p (node/core, together with -n).
  • The 'parallel' command in gwas_real is not recognized; here it's a module that needs to be loaded first (whereas singularity, on the other hand, is not a module).
  • Does regenie have flags for some standard QC, like MAF, missingness, or HWE filtering? Maybe some of these standard QC checks could be built in.
  • Would it be an idea to auto-generate a readme for the output? Just so you can see at a glance which was the reference allele, on which scale the BETA and SE are, etc.

gwas.py reports incorrect sums for variables of type `CONTINUOUS`

We have a phenotype dictionary that looks like this:

[screenshot of the phenotype dictionary]

We supply ind_F3300 with the --pheno argument and the rest of the variables are supplied with --covar.

When we run gwas.py it incorrectly reports that the variables of type CONTINUOUS have no cases, controls, or missing values, even though all individuals have valid values for these variables:

[screenshot of the gwas.py log output]

It seems that the part of the code that reports these sums uses the variable pheno_type, which doesn't change between iterations:

    log.log("extracting phenotypes{}...".format(' and covariates' if join_covar_into_pheno else ''))
    pheno_and_covar_cols = args.pheno + (args.covar if join_covar_into_pheno else [])
    pheno_output = extract_variables(pheno, pheno_and_covar_cols, pheno_dict_map, log)
    for var in pheno_and_covar_cols:
        if pheno_type=='BINARY':
            log.log('variable: {}, cases: {}, controls: {}, missing: {}'.format(var, np.sum(pheno[var]=='1'), np.sum(pheno[var]=='0'), np.sum(pheno[var].isnull())))
        else:
            log.log('variable: {}, missing: {}'.format(var, np.sum(pheno[var].isnull())))

Source: https://github.com/comorment/containers/blob/main/gwas/gwas.py#L742-L749

If I'm not mistaken we can check the type of each variable using the pheno_dict_map, like so:

     log.log("extracting phenotypes{}...".format(' and covariates' if join_covar_into_pheno else ''))
     pheno_and_covar_cols = args.pheno + (args.covar if join_covar_into_pheno else [])
     pheno_output = extract_variables(pheno, pheno_and_covar_cols, pheno_dict_map, log)
     for var in pheno_and_covar_cols:
-        if pheno_type=='BINARY':
+        if pheno_dict_map[var]=='BINARY':
             log.log('variable: {}, cases: {}, controls: {}, missing: {}'.format(var, np.sum(pheno[var]=='1'), np.sum(pheno[var]=='0'), np.sum(pheno[var].isnull())))
         else:
             log.log('variable: {}, missing: {}'.format(var, np.sum(pheno[var].isnull())))

Clarify versioning of the containers

  • add "tags" to github
  • include those "tags" into ".sif" files (e.g. by adding a file inside container showing its version)
  • add "CHANGELOG" file listing what has changed across versions
  • provide a README page in the documentation explaining versioning

singularity/saige.sif is a text file, instead of the actual Singularity image

Before commit c38f807 the file singularity/saige.sif was 702 MB in size; after the commit that file is 295 bytes.

This is the content of that file now:

version https://git-lfs.github.com/spec/v1
<<<<<<< HEAD
oid sha256:8c870154d08604b5eefe2a4635a6ef22c2cf69b4dccb72f12367bda467dffb43
size 736071680
=======
oid sha256:1d8e3762db280395a73eb9bd3a070f6666717f8d0dbc76cb7738e256bf5649da
size 899510272
>>>>>>> 205045b7ae8864036476cf68d358bd0e9ce045c0

It seems like the file has been accidentally committed as a literal LFS pointer text file containing an unresolved merge conflict, instead of the actual Singularity image.

`--config` is not passed on to slurm jobs that run `merge-regenie`, `merge-saige` or `merge-plink2`

Here's the expected behavior:

Given I have an arguments file named "my_test.args" in the current directory
  And my arguments file has the argument "--analysis regenie figures"
  And my arguments file has the argument "--out my_test"
  And I have a config file named "my_test.yaml" in the current directory
 When I run "gwas.py --argsfile my_test.args --config my_test.yaml"
 Then the file "my_test.3.job" is created
  And the file "my_test.3.job" contains the command "gwas merge-regenie"
  And the file "my_test.3.job" contains the argument "--config my_test.yaml"

But currently, when going through this scenario (tested with 5d3a5b4), the last step fails: the argument --config is absent from the job file.

Error in job2 with saige "chunks"

This is the error in the .out file for job 2 in the unreleased SAIGE "chunks" test version.

The problem seems to be the flag for the start of the chunks. Maybe this is an R error? I'm not sure.

[screenshot]

Avoid requirement that FID and IID are the same in gwas.py

Current scripts are developed under the assumption that FID and IID are the same, and only IID is used to identify individuals and link them between the .pheno file and the genetic files. It would be good to design this in a more flexible way.

Add a small unit test setup

Describe the solution you'd like
Set up a small framework (py.test or similar) calling the different containers locally, checking that software installed in the containers returns its version or similar and does not crash (from missing libs, etc.).
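A minimal py.test-style smoke check could look like this; the singularity invocation and version flags are assumptions, and the helper returns None when singularity is absent so a test can skip on plain hosts:

```python
import re
import shutil
import subprocess

def parse_version(output):
    """Extract the first x.y or x.y.z version number from tool output."""
    m = re.search(r'\d+\.\d+(?:\.\d+)?', output)
    return m.group(0) if m else None

def container_tool_version(image, cmd):
    """Run a versioned command (e.g. ['plink2', '--version']) inside a
    .sif image and return the parsed version string, or None when
    singularity is unavailable. Image path and flags are assumptions."""
    if shutil.which('singularity') is None:
        return None
    result = subprocess.run(['singularity', 'exec', image] + cmd,
                            capture_output=True, text=True, check=True)
    return parse_version(result.stdout + result.stderr)
```

A test would then assert `container_tool_version('gwas.sif', ['plink2', '--version'])` is a non-empty string (or skip when it is None), which already catches missing libraries and broken entrypoints.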

Error with lead SNP file, and plotting

There is an error in the loci command. The output from loci is used when plotting.
This is the description of the loci command:
Perform LD-based clumping of summary stats, using a procedure that is similar to FUMA snp2gene functionality.
The plotting seems to expect the files iPSYCH2012_ind_F3300.lead.csv and iPSYCH2012_ind_F3300.indep.csv, but those files do not exist. I guess they would be produced by the loci command.

I tried to re-run and remove the flags for those two files to see if that solves the issue, but it didn't. And I know that chr 2 has variants that pass the significance threshold.

[screenshot]

Support for non-autosomes

It would be good if the script didn't throw an error for non-autosomes, but just filtered them out (or kept them if they have sensible codes).
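A sketch of such a filter, assuming a pandas DataFrame with a CHR column (column name is an assumption, and this is not the actual gwas.py code); 'X', 'Y' or 'MT' codes could be added to the keep-set if they should be retained:

```python
import pandas as pd

# Chromosome codes 1..22; extend with {'X', 'Y', 'MT'} to keep sex/mito.
AUTOSOMES = {str(c) for c in range(1, 23)}

def filter_autosomes(df, chr_col='CHR'):
    """Drop non-autosomal variants instead of raising an error,
    tolerating both '2' and 'chr2' style chromosome codes."""
    chrom = df[chr_col].astype(str).str.replace('^chr', '', regex=True)
    return df[chrom.isin(AUTOSOMES)].copy()
```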

Move matlab-related software to separate github repo

I don't think we have fully established a container with the MATLAB runtime to package software written in MATLAB.
Bayram has done a lot of work on this:

Note that matlabruntime.sif, as well as the pre-built pleiofdr and magicsquare binaries, can be built with a separate dev box. It's also hosted on NREC, but it's separate from the dev box where we build the rest of the containers. This is because the MATLAB runtime environment was pretty tricky to configure and make compatible with Docker/Singularity.

I suggest moving all of this to a separate github repo, which already exists:

  • https://github.com/comorment/matlabruntime
    Docs in this repo can be moved into a src folder to be a bit more hidden, and we'll need cleaner user docs for those users who want to run MATLAB code via the container.

Later we can consider including other tools such as https://github.com/precimed/mostest/ and FEMA (https://github.com/cmig-research-group/cmig_tools) in the MATLAB container. However, I consider this low priority, because MATLAB is quite tricky to squeeze into a container. Perhaps it's best if users just stick to running MATLAB code in their own environment.

So let's move everything MATLAB-related away from the github.com/comorment/containers repo, include it in https://github.com/comorment/matlabruntime, and then put it all on hold and discuss whether or not we want to put more effort into MATLAB-related containers.

New Saige update available.

https://github.com/weizhouUMICH/SAIGE

@ofrei

This update may resolve the issues we are having with saige.

We have released a new version, 1.0.0 (on March 15, 2022). It has substantial computational efficiency improvements for both Step 1 and Step 2 for single-variant and set-based tests. We have created a new program GitHub page, https://github.com/saigegit/SAIGE, with the documentation provided at https://saigegit.github.io/SAIGE-doc/. The program will be maintained there by multiple SAIGE developers. The docker image has been updated. Please feel free to try version 1.0.0 and report issues if any.

Thanks!

git LFS: files that should have been pointers

A shallow git clone of this repo reports that a few files should've been pointers:

 % GIT_LFS_SKIP_SMUDGE=1 /opt/homebrew/bin/git clone --depth 1 git@github.com:comorment/containers.git
Cloning into 'containers'...
remote: Enumerating objects: 1130, done.
remote: Counting objects: 100% (1130/1130), done.
remote: Compressing objects: 100% (1032/1032), done.
remote: Total 1130 (delta 28), reused 1083 (delta 20), pack-reused 0
Receiving objects: 100% (1130/1130), 19.23 MiB | 3.21 MiB/s, done.
Resolving deltas: 100% (28/28), done.
Updating files: 100% (1174/1174), done.
Encountered 5 files that should have been pointers, but weren't:
	usecases/bolt_out/example_3chr.frq
	usecases/bolt_out/example_3chr.log
	usecases/bolt_out/myld.l2.ldscore.gz
	usecases/bolt_out/myld.log
	usecases/saige_out/out_vcf.log

Not a big issue though, as they're all pretty small files:

 24K	usecases/bolt_out/example_3chr.frq
4.0K	usecases/bolt_out/example_3chr.log
4.0K	usecases/bolt_out/myld.l2.ldscore.gz
4.0K	usecases/bolt_out/myld.log
4.0K	usecases/saige_out/out_vcf.log

Edit: Some more info here: https://stackoverflow.com/questions/46704572/git-error-encountered-7-files-that-should-have-been-pointers-but-werent

Dockerfile recipes: Prefer MiniForge over MiniConda

MiniForge (https://github.com/conda-forge/miniforge) is the community-driven version of Conda. We can replace MiniConda with MiniForge in the Dockerfiles, as we're mainly using the conda-forge channel anyway. This also means we can ignore the Anaconda terms of service (https://legal.anaconda.com/policies/en/?name=terms-of-service), just in case.

Edit: We should rather use the Mambaforge variant of Miniforge, as mamba resolves environments much faster than conda.

Trouble merging statistics in gwas.py merge-regenie

Something goes wrong on my end with gwas.py merge-regenie. Both run_regenie1 and run_regenie2 run as expected, but then I get the following error from merge-regenie. It looks like something goes wrong in the join.

jacber@sens2017599-b10:~/nordic_gwas/basic$ $PYTHON ~/gwas.py merge-regenie --maf 0.1 --sumstats out/run_chr@_MDD_broad.regenie --basename out/run_chr@ --out out/run_MDD_broad --chr2use 1,2


  • gwas.py: pipeline for GWAS analysis
  • Version 1.1.0
  • (C) 2021 Oleksandr Frei, Bayram Akdeniz and Alexey A. Shadrin
  • Norwegian Centre for Mental Disorders Research / University of Oslo
  • Centre for Bioinformatics / University of Oslo
  • GNU General Public License v3

Call:
/home/jacber/gwas.py merge-regenie
--maf 0.1
--sumstats out/run_chr@_MDD_broad.regenie
--basename out/run_chr@
--out out/run_MDD_broad
--chr2use 1,2
Beginning analysis at Mon Aug 23 09:19:46 2021 by jacber, host sens2017599-b10.uppmax.uu.se
Traceback (most recent call last):
  File "/home/jacber/gwas.py", line 1908, in <module>
    args.func(args, log)
  File "/home/jacber/gwas.py", line 838, in merge_regenie
    df, info_col = apply_filters(args, df)
  File "/home/jacber/gwas.py", line 760, in apply_filters
    df = pd.merge(df, maf, how='left', on='SNP')
  File "/usr/local/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 89, in merge
    return op.get_result()
  File "/usr/local/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 684, in get_result
    join_index, left_indexer, right_indexer = self._get_join_info()
  File "/usr/local/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 909, in _get_join_info
    (left_indexer, right_indexer) = self._get_join_indexers()
  File "/usr/local/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 887, in _get_join_indexers
    return get_join_indexers(
  File "/usr/local/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 1441, in get_join_indexers
    return join_func(lkey, rkey, count, **kwargs)
  File "pandas/_libs/join.pyx", line 109, in pandas._libs.join.left_outer_join
MemoryError: Unable to allocate 446. GiB for an array with shape (59901284770,) and data type int64
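The 446 GiB allocation suggests duplicated SNP IDs on both sides of the left join, which makes the joined row count explode combinatorially. A hedged sketch of a guard that could be added before the merge (not the actual gwas.py code; column names are assumptions):

```python
import pandas as pd

def merge_maf(df, maf, on='SNP'):
    """Left-join allele frequencies after dropping duplicated SNP IDs
    from the frequency table, so duplicate keys cannot multiply the
    result size (hypothetical helper)."""
    dup = maf.duplicated(subset=on, keep='first')
    if dup.any():
        maf = maf[~dup]  # keep one frequency row per SNP ID
    return pd.merge(df, maf, how='left', on=on)
```

With this guard the output has exactly one row per sumstats row; whether dropping or disambiguating duplicates (e.g. via CHR:BP IDs) is preferable depends on the pipeline.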

Tool for distinguishing loci

Jacob and I were wondering whether a tool exists that does the opposite of genetic correlations, namely identifies loci that are specific to a trait. Say you want to compare summary statistics for bipolar disorder and depressive disorder, and you want to identify loci that are not shared between the two; is there a tool for that? You can do something like this with genomic SEM or by eyeballing circular Manhattan plots, but I could not think of a tool that specifically identifies non-shared variants.

If such a tool exists we would love to have it in the container toolbox.

PRS tools

Could you make sure the following packages are available and fully functional:

Move MAGMA and LAVA into its own github repository

MAGMA software is released as a binary, but it requires fairly large reference data, and for this reason it's best moved into a separate github repository.
The LAVA tool is quite different - it is based on R, and it addresses a different question than MAGMA.
But it needs some of the same reference files as MAGMA. Also, LAVA is developed by the same group as MAGMA. So it's reasonable to include LAVA in the same github repository - but perhaps in a separate .sif file (i.e. magma.sif and lava.sif).
The github repo can be https://github.com/comorment/magma

Include Dockerfile and scripts into comorment/containers repo

Currently we use https://github.com/comorment/gwas to keep all development-related scripts for comorment containers.
The https://github.com/comorment/containers repo is used to release singularity containers (as .sif files), to keep reference data, and for user documentation. This separation is suboptimal, and it makes more sense to include all development-related scripts (Dockerfiles, bash scripts, some dev instructions, etc.) in https://github.com/comorment/containers. However, we should keep those files somewhat hidden from the end user, for example by moving them to a new source folder in the root of this repo. After that, the github.com/comorment/gwas repo can be archived (i.e. kept in case we need the code history, but locked so no further changes can be submitted).

Also, we should change our development model and start using feature & bug-fix branches, using a pull request and code review to integrate changes into the main branch.

Problem with config location

args.config = yaml.safe_load(open(args.config, "r"))

I switched from copying the gwas.py to a personal folder to running gwas.py directly from the repository in the TSD environment which gave rise to the following problem.

If the yaml configuration file is not located in the directory from which gwas.py is executed, it seems that gwas.py won't find it. Maybe it's a good idea to retrieve the path of gwas.py itself to locate the configuration file?
os.path.dirname(os.path.realpath(__file__))

Replace

parent_parser.add_argument('--config', type=str, default="config.yaml", help="file with misc configuration options")

with

configFile = os.path.dirname(os.path.realpath(__file__)) + "/config.yaml"
parent_parser.add_argument('--config', type=str, default=configFile, help='file with misc configuration options')

Since this file seems to be required, add a check below line 986:

containers/gwas/gwas.py

Lines 985 to 986 in 6434e86

if args.out is None:
    raise ValueError('--out is required.')

if not os.path.exists(args.config):
    raise IOError('configuration file "' + os.path.basename(args.config) + '" not found')

I'm not sure if IOError is the appropriate error type, though...

Reading .pheno/ .dict

Please build in some flexibility to deal with variations in reading .pheno/.dict files. Of course it's infeasible (and unnecessary) to handle all possible variations; we need to strike a balance. Just be clear about the restrictions in the documentation.

reading geno files per chromosome

I have a suggestion for tweaking the gwas.py script so that it can write jobs using geno and geno-fit files that are split per chromosome.
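A minimal sketch of how the per-chromosome expansion could work, reusing the '@' placeholder convention that gwas.py already uses in paths like out/run_chr@ (the helper itself is hypothetical):

```python
def expand_chr(pattern, chr2use):
    """Expand an '@' chromosome placeholder into one path per
    chromosome, e.g. for per-chromosome geno/geno-fit files."""
    return [pattern.replace('@', str(c)) for c in chr2use]
```

The job writer could then loop over `expand_chr('geno/chr@.bed', range(1, 23))` instead of assuming a single merged fileset.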

Several errors with Saige (empty GMMATmodelFile)

Hello,

The new version of SAIGE is running, but I encountered (and fixed) some errors, and have now hit a new issue.

First, I got several long-flag errors, which were solved by removing the following flags:
--long flag "numLinesOutput" is invalid
--long flag "IsOutputAFinCaseCtrl" is invalid
--long flag "IsOutputNinCaseCtrl" is invalid

The issue in the screenshot below is more complicated: the GMMATmodelFile appears empty, causing an error that halts SAIGE, and I do not know how to change it. Interestingly, this wasn't an error before the last update.

[screenshot]
