Coder Social home page Coder Social logo

mancusolab / sushie Goto Github PK

View Code? Open in Web Editor NEW
14.0 4.0 2.0 722 KB

Software to perform multi-ancestry SNP fine-mapping on molecular data

Home Page: https://mancusolab.github.io/sushie

License: MIT License

Python 98.40% R 0.81% Shell 0.79%
bayesian-inference eqtl gwas jax python variational-inference fine-mapping genetics pqtl susie

sushie's Introduction

Documentation-webpage Github License Project generated with PyScaffold

SuShiE🍣

SuShiE (Sum of Shared Single Effect) is a Python software to fine-map causal SNPs, compute prediction weights, and infer effect size correlation for molecular data (e.g., mRNA levels and protein levels etc.) across multiple ancestries.

- We detest usage of our software or scientific outcome to promote racial discrimination.

SuShiE is described in

Improved multi-ancestry fine-mapping identifies cis-regulatory variants underlying molecular traits and disease risk.

Zeyun Lu, Xinran Wang, Matthew Carr, Artem Kim, Steven Gazal, Pejman Mohammadi, Lang Wu, Alexander Gusev, James Pirruccello, Linda Kachuri, Nicholas Mancuso.

Check here for full documentation.

Installation | Example | Notes | Version History | Support | Other Software

Installation

Users can download the latest repository and then use pip:

git clone https://github.com/mancusolab/sushie.git
cd sushie
pip install .

We currently only support Python3.8+.

Before installation, we recommend to create a new environment using conda so that it will not affect the software versions of the other projects.

Get Started with Example

SuShiE software is very easy to use:

cd ./data/
sushie finemap --pheno EUR.pheno AFR.pheno --vcf vcf/EUR.vcf vcf/AFR.vcf --covar EUR.covar AFR.covar --output ./test_result

It can perform:

  • SuShiE: multi-ancestry fine-mapping accounting for ancestral correlation
  • Single-ancestry SuSiE (Sum of Single Effect)
  • Independent SuShiE: multi-ancestry SuShiE without accounting for correlation
  • Meta-SuSiE: single-ancestry SuSiE followed by meta-analysis
  • Mega-SuSiE: single-ancestry SuSiE on row-wise stacked data across ancestries
  • QTL effect size correlation estimation
  • cis-SNP heritability estimation
  • Cross-validation for SuShiE prediction weights
  • Convert prediction results to FUSION format, thus can be used in TWAS

See here for more details on how to use SuShiE.

If you want to use in-software SuShiE inference function, you can use following code as an example:

from sushie.infer import infer_sushie
# Xs is for genotype data, and it should be a list of numpy array whose length is the number of ancestry.
# ys is for phenotype data, and it should also be a list of numpy array whose length is the number of ancestry.
infer_sushie(Xs=X, ys=y)

You can play it with your own ideas!

Notes

  • SuShiE currently only supports continuous phenotype fine-mapping.
  • SuShiE currently only supports fine-mapping on autosomes.
  • SuShiE uses JAX with Just In Time compilation to achieve high-speed computation. However, there are some issues for JAX with Mac M1 chip. To solve this, users need to initiate conda using miniforge, and then install SuShiE using pip in the desired environment.

Version History

Version Description
0.1 Initial Release
0.11 Fix the bug for OLS to compute adjusted r squared.
0.12 Update io.corr function so that report all the correlation results no matter cs is pruned or not.
0.13 Add --keep command to enable user to specify a file that contains the subjects ID SuShiE will perform on. Add --ancestry_index command to enable user to specify a file that contains the ancestry index for fine-mapping. With this, user can input single phenotype, genotype, and covariate file that contains all the subjects across ancestries. Implement padding to increase inference time. Record elbo at each iteration and can access it in the infer.SuShiEResult object. The alphas table now outputs the average purity and KL divergence for each L. Change --kl_threshold to --divergence. Add --maf command to remove SNPs that less than minor allele frequency threshold within each ancestry. Add --max_select command to randomly select maximum number of SNPs to compute purity to avoid unnecessary memory spending. Add a QC function to remove duplicated SNPs.
0.14 Remove KL-Divergence pruning. Enhance command line appearance and improve the output files contents. Fix small bugs on multivariate KL.
0.15 Fix several typos; add a sanity check on reading vcf genotype data by assigning gt_types==Unknown as NA; Add preprint information.
0.16 Add option to remove ambiguous SNPs; fix several bugs and enhance codes quality.

Support

For any questions, comments, bug reporting, and feature requests, please contact Zeyun Lu ([email protected]) and Nicholas Mancuso ([email protected]), and open a new thread in the Issue Tracker.

Other Software

Feel free to use other software developed by Mancuso Lab:

  • MA-FOCUS: a Bayesian fine-mapping framework using TWAS statistics across multiple ancestries to identify the causal genes for complex traits.
  • SuSiE-PCA: a scalable Bayesian variable selection technique for sparse principal component analysis
  • twas_sim: a Python software to simulate TWAS statistics.
  • FactorGo: a scalable variational factor analysis model that learns pleiotropic factors from GWAS summary statistics.
  • HAMSTA: a Python software to estimate heritability explained by local ancestry data from admixture mapping summary statistics.
  • Traceax: a Python library to perform stochastic trace estimation for linear operators.

This project has been set up using PyScaffold 4.1.1. For details and usage information on PyScaffold see https://pyscaffold.org/.

sushie's People

Contributors

quattro avatar zeyunlu avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

sushie's Issues

Cleanup cli

  • Push out numerics into core or new utils file
  • Push out i/o into new io.py file
  • prepend internal functions to have "_" in name to indicate internal

QR decomp for OLS

Super low prio, but we can use QR decomposition as a more numerically stable way to do regression and get SEs.

    Q, R = QR(X, 'reduced')
    Qty = Q.T @ y
    beta = jsp.linalg.solve_triangular(R, Qty)
    df = Q.shape[0] - Q.shape[1]
    residuals = y - Q @ Qty
    sigma = jnp.sqrt(jnp.sum(residuals ** 2, axis=0) / df)
    se = sigma * jnp.sqrt(jnp.diag(jsp.linalg.cho_solve((R, False), jnp.eye(R.shape[0])))

Documentation

Catchall for documentation. Need to update sphinx conf from SuSie-PCA to sushie.

  • Tutorial for molQTL scan / run and output prediction weights for FUSION
  • API

Explore padding for improved JIT

Our current code uses JIT, but is suboptimal due to inconsistent shapes across groups (generally). One workaround could be to pad the smallest groups with 0s to match the largest sample size. This would enable the use of ndarrays throughout, without the need to use loops/lists at the cost of increased memory. In this setting JIT should be much more efficient, but it's unclear how much more efficient compared with our current implementation (or until they allow dynamic shapes for JIT).

  • Padding
  • JAXSParse support

Add unit testing

Catchall for unit testing in infer, core, and other utils that come up

Minor tweak to PIP computation

We can improve the numerical stability of PIP computation by going from

pip = 1 - jnp.exp(jnp.sum(jnp.log1p(-alpha), axis=0))

to

pip = -jnp.expm1(jnp.sum(jnp.log1p(-alpha), axis=0))

It probably won't affect much, and may be overkill, but comes at no additional computational cost, and should see slight benefit when the summed log probabilities are small.

Incorrect sample size reported in sushie.cv.tsv

Hello,

I encountered an issue with the cross-validation output, specifically regarding the reported sample size. It appears that the sample size reported is always that of the last ancestry processed.

You can reproduce the problem using the following test data:

sushie finemap
	--pheno EUR.pheno
	        AFR.pheno
	--plink plink/EUR
	        plink/AFR
	--covar EUR.covar
	        AFR.covar
	--cv
	--her
	--output ./test_result

The test_result.sushie.cv.tsv output looks like this:

ancestry rsq p_value N trait
1 0.4932215908291845 4.133847921724767e-74 639 Trait
2 0.4743709425200131 3.0193789659893506e-91 639 Trait

As shown above, both ancestries have the same reported sample size of 639. However, based on the input files, I would expect the EUR ancestry to have 489 samples.

Xinyu

Cleanup infer

  • implement lax.fori loop to deal with looping over effect updates
  • implement internal Callable to type VarUpdate

Issue with running example data

Hello,

Thanks for developing this amazing software! I followed the instructions to install the package into a conda environment with Python 3.10.14 on a Linux system. The installation seems to be successful, but when I run the "Get Started with Example" command:

cd ./data/
sushie finemap --pheno EUR.pheno AFR.pheno --vcf vcf/EUR.vcf vcf/AFR.vcf --covar EUR.covar AFR.covar --output ./test_result

I encounter the following error message:

===================================
             SuShiE v0.15.post1.dev3+g97cb3ad             
===================================
sushie finemap
        --pheno EUR.pheno
                AFR.pheno
        --vcf vcf/EUR.vcf
              vcf/AFR.vcf
        --covar EUR.covar
                AFR.covar
        --output ./test_result

Starting log...
[2024-06-15 00:27:34 - INFO] Detect phenotypes for Trait from 2 ancestries.
[2024-06-15 00:27:34 - INFO] Detect genotype data in vcf format.
[2024-06-15 00:27:34 - INFO] Detect covariates data.
[2024-06-15 00:27:34 - INFO] Fine-mapping finishes for Trait, and thanks for using our software. For bug reporting, suggestions, and comments, please go to https://github.com/mancusolab/sushie.
Traceback (most recent call last):
  File "/home/xxs410/.conda/envs/sushie/lib/python3.10/site-packages/sushie/cli.py", line 1014, in run_finemap
    rawData = io.read_data(
  File "/home/xxs410/.conda/envs/sushie/lib/python3.10/site-packages/sushie/io.py", line 116, in read_data
    bim, fam, bed = geno_func(geno_paths[idx])
  File "/home/xxs410/.conda/envs/sushie/lib/python3.10/site-packages/sushie/io.py", line 221, in read_vcf
    var.gt_types = jnp.where(var.gt_types == 3, jnp.nan, var.gt_types)
AttributeError: attribute 'gt_types' of 'cyvcf2.cyvcf2.Variant' objects is not writable

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/xxs410/.conda/envs/sushie/bin/sushie", line 8, in <module>
    sys.exit(run_cli())
  File "/home/xxs410/.conda/envs/sushie/lib/python3.10/site-packages/sushie/cli.py", line 1641, in run_cli
    return _main(sys.argv[1:])
  File "/home/xxs410/.conda/envs/sushie/lib/python3.10/site-packages/sushie/cli.py", line 1635, in _main
    args.func(args)
  File "/home/xxs410/.conda/envs/sushie/lib/python3.10/site-packages/sushie/cli.py", line 1053, in run_finemap
    traceback.format_exception(
TypeError: format_exception() got an unexpected keyword argument 'etype'

I believe the issue is specific to the VCF format genotype data, as the error does not occur when I replace --vcf with --plink.

Thanks in advance!

Xinyu

Genotype Inputs

Lets stress test genotype inputs when there is missing data, and a few other sanity checks.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.