daesc,gqi

daesc's Issues

Identifying imprinted genes from reciprocal cross and scRNA-seq data

I am impressed by your tool and would like to leverage its capabilities in my research. Using a reciprocal mouse strain cross, I have generated stem cell lines from two distinct rat strains with characterized SNPs. I have performed scRNA-seq on these cell lines from both reciprocal crosses. I am specifically interested in identifying imprinted genes within this dataset using your DAESC algorithm. Can you please advise on the best approach to integrate my SNP and scRNA-seq data into your tool for this purpose?

Thank you

Specifying the design matrix for differential ASE

Hello DAESC dev team!

First of all, thank you for developing such flexible framework to test for allelic imbalance.

I am trying to incorporate DAESC into my analysis pipeline that tests for differential ASE across multiple individuals accounting for the disease status (condition), spatial location (cortical layers) and cell type.

To begin with, I applied DAESC on a toy dataset from one gene, two conditions, and two individuals per condition. Here, I fit the baseline model (DAESC-BB) specifying the design matrix x as a binary numeric array denoting the condition.

Now, I want to extend this by also taking into account the spatial information and the interaction terms. Since both condition and spatial location as in cortical layers are categorical variables, I tried using model.matrix function to encode the information in one-hot matrix. Here is the structure of the data frame I am working with and how I am invoking the daesc_bb function.

str(cur.df)
'data.frame': 3046 obs. of 8 variables:
$ gene : chr "HES6" "HES6" "HES6" "HES6" ...
$ barcode : chr "CTCTCTAACTGCCTAG" "TCGGCGTACTGCACAA" "GGGCAGGATTTCTGTG" "CGGTTCCGGCTTCTTG" ...
$ allele1_count: num 1 1 1 1 1 1 0 3 0 1 ...
$ allele2_count: num 1 0 0 0 1 1 1 0 1 0 ...
$ total_count : num 2 1 1 1 2 2 1 3 1 1 ...
$ condition : Factor w/ 2 levels "HC","PD": 1 1 1 1 1 1 1 1 1 1 ...
$ sample_id : chr "BN0339" "BN0339" "BN0339" "BN0339" ...
$ layer : Factor w/ 7 levels "Layer 1","Layer 2",..: 7 4 7 4 5 3 6 6 5 6 ...

myformula = ~ cur.df$condition + cur.df$layer + cur.df$condition:cur.df$layer + 0
one_hot = model.matrix(myformula)

str(one_hot)
num [1:3046, 1:14] 1 1 1 1 1 1 1 1 1 1 ...
- attr(, "dimnames")=List of 2
..$ : chr [1:3046] "1" "2" "3" "4" ...
..$ : chr [1:14] "cur.df$conditionHC" "cur.df$conditionPD" "cur.df$layerLayer 2" "cur.df$layerLayer 3" ...
- attr(, "assign")= int [1:14] 1 1 2 2 2 2 2 2 3 3 ...
- attr(*, "contrasts")=List of 2
..$ cur.df$condition: chr "contr.treatment"
..$ cur.df$layer : chr "contr.treatment"

res = daesc_bb(y=cur.df$allele1_count, n=cur.df$total_count, subj=cur.df$sample_id, x=one_hot)

When I run this as is, I get the following error:

fixed-effect model matrix is rank deficient so dropping 1 column / coefficient
cur.df.conditionPD
NA
Error in aod::betabin(cbind(y, n - y) ~ ., random = ~1, data = data.frame(y, :
Initial values for the fixed effects contain at least one missing value.

I think the reason is in the design matrix, where the first two columns have essentially the same information. After dropping the first column in one_hot, it runs without error, but I want to double check whether this is the intended use of this variable when supplying multiple categorical variables.

Finally, I also want to add cell type information as another independent variable. In my case, cell type is not a categorical variable, since the data was generated using Visium. So for each cell type, the input data will be its estimated abundance. If I were to input all (1) condition, (2) layer and (3) cell type into x, how should I structure it?

Thank you!

Skewed p-value distribution

Hello DAESC team,

I have a question about interpreting the result using the DAESC-MIX model. I have tested ~10000 genes for allelic imbalance among three conditions. Then I plotted a histogram of p-values.

Looking at the p-value distribution, it is very skewed, identifying almost all genes as significant.
Can you share your insights on this?

The dataset used here is 10X Visium.

Thank you!

Pre-processing the data from 10xchromium

Hello,

I am now struggling to generate the base quality score table for the bam file generated from cellranger pipeline. As the cellranger mapped the sequencing file to grch38 genome, the vcf file from gatk resource bundle is not compatible. While the vcf files from 1000 genome project are doable, they are splited into different chromosome and each file take up a bit of storage place. Is there any suggestion to get around with this problem or if there's any way to merge the base quality score table generated for each chromosome into one?

gqi / daesc Goto Github PK

daesc's People

Contributors

Stargazers

Watchers

Forkers

daesc's Issues

Identifying imprinted genes from reciprocal cross and scRNA-seq data

Specifying the design matrix for differential ASE

Skewed p-value distribution

Pre-processing the data from 10xchromium

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent