
hyfa's People

Contributors

chaitjo, gamazonlab, rvinas

hyfa's Issues

Questions about the normalized RNA-seq data processing

Hi, I have a couple of questions about the "Normalized bulk transcriptomics" processing described in your paper and the GTEx pipeline, quoted below (a sketch of my current understanding follows the list):

  1. Discard under-represented tissues (n = 5), namely bladder, cervix (ectocervix, endocervix), fallopian tube and kidney (medulla).
  2. Select set of overlapping protein-coding genes across all tissues.
  3. Discard donors with only one collected tissue (n = 4).
  4. Select genes on the basis of expression thresholds of ≥0.1 transcripts per kilobase million in ≥20% of samples and ≥6 reads (unnormalized) in ≥20% of samples.
  5. Normalize read counts across samples using the trimmed mean of M values method.
  6. Apply inverse normal transformation to the expression values for each gene.
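
To make my reading of the pipeline concrete, here is a minimal Python sketch of steps 4 and 6 on a merged genes x samples matrix. The variable names (tpm_df, counts_df, tmm_df) are illustrative, and I am treating step 5 (TMM) as done externally, e.g. with edgeR's calcNormFactors:

import numpy as np
import pandas as pd
from scipy.stats import norm, rankdata

def filter_genes(tpm: pd.DataFrame, counts: pd.DataFrame) -> pd.Index:
    # Step 4: keep genes with TPM >= 0.1 in >= 20% of samples and
    # >= 6 unnormalized reads in >= 20% of samples (genes x samples).
    frac_tpm = (tpm >= 0.1).mean(axis=1)
    frac_reads = (counts >= 6).mean(axis=1)
    return tpm.index[(frac_tpm >= 0.2) & (frac_reads >= 0.2)]

def inverse_normal_transform(values: np.ndarray) -> np.ndarray:
    # Step 6: rank-based inverse normal transform of one gene's
    # expression values across samples.
    ranks = rankdata(values)
    return norm.ppf((ranks - 0.5) / len(values))

# tpm_df, counts_df: hypothetical genes x samples DataFrames.
keep = filter_genes(tpm_df, counts_df)
# tmm_df: hypothetical TMM-normalized expression (step 5, computed
# externally); the transform is then applied gene by gene (step 6).
int_expr = np.apply_along_axis(inverse_normal_transform, 1, tmm_df.loc[keep].values)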

I would greatly appreciate any clarification you could provide on these matters:

  1. In step 1, what is the minimum number of samples a tissue must have to pass the "under-represented tissues" filter?
  2. In step 2, does that mean we should select the protein-coding genes that pass the filtering of steps 3 and 4 in all tissues?
  3. Should we filter genes and normalize expression separately for each tissue and then merge the results, or is it more appropriate to merge all tissue samples first and then perform gene filtering and expression normalization on the merged dataset?
  4. Is it acceptable for the dataset to contain overlapping tissue samples, for example, both Brain and Cerebral cortex expression data from the same individual?

Warm regards,
Mian

AssertionError when running "assert np.allclose(y_test_, y_test)"

Hi, when running the code that compares the different baselines in evaluate_GTEx_v8_normalised.ipynb on my own data, I got an AssertionError at line 142, as shown below:

AssertionError Traceback (most recent call last)
Cell In[32], line 142
140 y_test_pred = out['px_rate'].cpu().numpy() # torch.distributions.normal.Normal(loc=out['px_rate'], scale=out['px_r']).mean.cpu().numpy()
141 y_test_ = d.x_target.cpu().numpy()
--> 142 assert np.allclose(y_test_, y_test)
144 sample_scores = score_fn(y_test, y_test_pred, sample_corr=sample_corr)
146 # Append results

AssertionError:

The target array of aux_test_dataset is not consistent with the target array of the corresponding HypergraphDataset after converting aux_test_dataset into a HypergraphDataset.

After comparing d.target_dynamic['Participant ID'] with aux_test_dataset.adata_target.obs['Participant ID'], and also their expression arrays, I found that the order of 'Participant ID' and of the corresponding expression data has changed. It seems that d.target_dynamic['Participant ID'] is sorted numerically and alphabetically rather than kept in the same order as aux_test_dataset.adata_target.obs['Participant ID']. For example:
print(aux_test_dataset.adata_target.obs['Participant ID'].values)
I got:

['GS12' 'GW133' 'GZ137' 'GW142' 'CT146' 'LBJ18' 'XQN39' 'SLG43' 'QG44' 'XQN75' 'ZGJ176' 'XN9063' 'PZ140']

After the HypergraphDataset conversion and DataLoader:
aux_test_dataset = HypergraphDataset(adata[test_mask], obs_source={'Tissue': source_tissues}, obs_target={'Tissue': [tt]})
aux_test_loader = DataLoader(aux_test_dataset, batch_size=len(aux_test_dataset), collate_fn=collate_fn, shuffle=False)
d = next(iter(aux_test_loader))
print(d.target_dynamic['Participant ID'])
The result changed to:

['CT146' 'GS12' 'GW133' 'GW142' 'GZ137' 'LBJ18' 'PZ140' 'QG44' 'SLG43' 'XN9063' 'XQN39' 'XQN75' 'ZGJ176']

The same is true of the expression matrices (i.e. aux_test_dataset.adata_target.layers['x'].toarray() and d.x_target.cpu().numpy()).

How could I fix this error?
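
Would re-aligning the ground-truth rows to the batch's participant order be a valid workaround? A minimal sketch using the names from the snippets above, assuming y_test follows aux_test_dataset's original participant order and that Participant IDs are unique:

import numpy as np

orig_ids = aux_test_dataset.adata_target.obs['Participant ID'].values
batch_ids = np.asarray(d.target_dynamic['Participant ID'])
# For each participant in the batch, find its row in the original order.
order = np.array([np.where(orig_ids == pid)[0][0] for pid in batch_ids])
# Reorder the ground truth so its rows match d.x_target before comparing.
assert np.allclose(d.x_target.cpu().numpy(), y_test[order])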

Thanks in advance!

Mian

Installation Problem

I'm having trouble installing packages due to version conflicts. Can you suggest the right Python version for compatibility with the required packages?

The difference between testing dataset and validation dataset

Hi, I found that the GTEx bulk RNA-seq donors were divided into three parts (training, validation, and testing donors). I understand the roles of the training and validation subsets, fitting the Hypergraph model and checking its accuracy during training, respectively, but I cannot fully grasp the role of the testing dataset.
Could anyone elaborate on the specific purpose of the testing dataset and how it differs from the validation dataset? Could I simply split the data into training and validation and treat the validation dataset as the testing dataset? A sketch of my current understanding follows.
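
For reference, this is how I currently picture a donor-level three-way split (the fractions are illustrative, not necessarily those used in the paper):

import numpy as np

# all_donor_ids: hypothetical array of unique GTEx donor IDs.
rng = np.random.default_rng(0)
donors = rng.permutation(all_donor_ids)
n = len(donors)
train_donors = donors[:int(0.7 * n)]             # used to fit the model
val_donors = donors[int(0.7 * n):int(0.85 * n)]  # guides early stopping / model selection
test_donors = donors[int(0.85 * n):]             # held out, evaluated only once at the end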

Thanks in advance!
Mian
