
Comments (20)

geoffschieb avatar geoffschieb commented on July 17, 2024

from wot.

yeroslaviz avatar yeroslaviz commented on July 17, 2024

thanks, but where?
I have the Methods S1 part, but I can't find where the gene score calculation is explained.


geoffschieb avatar geoffschieb commented on July 17, 2024


yeroslaviz avatar yeroslaviz commented on July 17, 2024

This looks great and I will definitely test it.
But at a quick glance I noticed that in the examples on the web site, the cell sets are still being calculated from the matrix file and not from the gene scores. Is that correct?


geoffschieb avatar geoffschieb commented on July 17, 2024


yeroslaviz avatar yeroslaviz commented on July 17, 2024

I cannot get wot v1.0.0.
When I run either

git clone https://github.com/broadinstitute/wot.git
cd wot
pip install -e .

or install it via pip install wot,

both commands give me only version 0.5.7.

> Actually the cell sets are just loaded there ... not even computed!

In your example of how to create cell-sets you still use the older command
wot cells_by_gene_set --matrix matrix.txt --gene_sets gene_sets.gmt --out cell_sets.gmt --format gmt --quantile 0.99.


joshua-gould avatar joshua-gould commented on July 17, 2024


yeroslaviz avatar yeroslaviz commented on July 17, 2024

I would still like to know if you can help with this issue. I have calculated the gene scores using the gene_set_scores command and got, for each of my 18 clusters, a txt file with the cell IDs and the corresponding mean_z-score (or mean, or mean_rank) values.

This was followed by the cell-set calculation using the command

wot cells_by_gene_set  --score Output/p2_geneScores_Cluster10.txt --score Output/p2_geneScores_Cluster11.txt --score Output/p2_geneScores_Cluster12.txt --score ... Output/p2_geneScores_Cluster7.txt --score Output/p2_geneScores_Cluster8.txt --score Output/p2_geneScores_Cluster9.txt --out Output/p2_original_cell_sets.gmt

listing all 18 gene score files. I have tried this with all three options: mean_z-score, mean and mean_rank. In all three cases I get a gmt file with 18 clusters, and all the clusters have exactly the same length, although each contains a different list of cells (cell IDs).

I was wondering if this makes any sense.
Can it be, that all cell sets are of the same length?
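
One possible explanation, sketched under assumptions: if each cell set is defined as the cells above a fixed quantile of its score, and every score file covers the same cells, then all sets end up the same size by construction, whatever the score distribution looks like. The snippet below uses synthetic scores and NumPy. The function name is hypothetical and this is not wot's actual implementation.

```python
import numpy as np

def cells_above_quantile(scores, quantile):
    """Return indices of cells whose score is strictly above the given quantile."""
    threshold = np.quantile(scores, quantile)
    return np.flatnonzero(scores > threshold)

rng = np.random.default_rng(0)
n_cells = 5000
# Two synthetic score vectors with very different shapes (assumed data)
unimodal = rng.normal(0.0, 1.0, n_cells)
bimodal = np.concatenate([rng.normal(0.0, 0.5, 4000), rng.normal(5.0, 0.5, 1000)])

set_a = cells_above_quantile(unimodal, 0.99)
set_b = cells_above_quantile(bimodal, 0.99)

# Same number of cells in both sets, but different members
print(len(set_a), len(set_b))
```

With the same quantile applied to every score, only the membership differs between sets, not the size.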

thanks, Assa

P.S. 1
If you think it might be helpful, I can share some of the files with you

P.S. 2
I'm really sorry about bombarding you with questions/problems. I just think the method can be very useful and we do want to use it for our data set (and publication). But we need to try to better understand it and free it from the bugs.


geoffschieb avatar geoffschieb commented on July 17, 2024


yeroslaviz avatar yeroslaviz commented on July 17, 2024

This is good to know, thanks. I thought it calculates it separately for each file. Now I understand why I get a separate file for each cluster :-)

But I still don't get how to "know" what a good cutoff is under these circumstances.
Below are the histograms of two of the clusters. As you can see, they look completely different.

What would be a good threshold for these two cases, for example?

Screenshot 2019-05-31 15 36 42

What I mean is: how would one choose a cutoff that makes sense?
(I couldn't find any mention of it in the paper or methods.)

I can see that there is no good method to automate this step, but it is still confusing. I've tried to understand how you calculated the gene scores for your example data set, but it does not seem to correspond to the gene score values (at least it seems so to me). How did you go about setting this cutoff?

thanks
Assa


geoffschieb avatar geoffschieb commented on July 17, 2024


yeroslaviz avatar yeroslaviz commented on July 17, 2024

Thanks for the answer. But what I would really like to know is why you chose 8000 for cluster2 and are not sure about cluster1. cluster1 can also be seen (in a way) as bimodal.
What does the value 8000 mean here?
I don't have such high values in the data.
The cluster1 data frame looks like this:

id      Cluster1_score
p2_cortex_143_AAACCTGAGGCTAGGT  0.44846073
p2_cortex_143_AAACGGGCAATCGGTT  0.47955644
p2_cortex_143_AAAGTAGCATAAAGGT  0.3165601
p2_cortex_143_AAATGCCGTGTGTGCC  0.14038791
p2_cortex_143_AACCGCGTCGTTGACA  0.26206216
p2_cortex_143_AACGTTGCATTCACTT  0.36493918

I would like to understand how you choose these values, so that I won't need to upload the histogram each time I have a dataset and ask for your opinion. :-)


geoffschieb avatar geoffschieb commented on July 17, 2024


yeroslaviz avatar yeroslaviz commented on July 17, 2024

The x-axis now also looks similar to the table above.
I have run the gene score calculations with all three methods: mean_z_score, mean_rank and mean. With the third option the values are high.
Below are the last two of the three options. On the right-hand side are the mean-calculated gene scores. This explains the differences. Sorry about that.

Screenshot 2019-06-05 14 01 32

But regarding choosing the threshold: do I understand correctly that I should look for a (local) minimum in the histogram of the scores for each of the clusters/gene sets?

But even if I choose the value 8000 for the mean-calculated values or 0.1 for the mean_z_score-calculated values, what do I do with it?

Where do I input this value in the wot cells_by_gene_set command? Is this the quantile parameter? Do I need to calculate at what quantile the value 8000 stands in the list of scores?
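
A minimal sketch of converting a chosen cutoff into a quantile, assuming the quantile is simply the fraction of cells whose score is at or below the cutoff. The function name and the cutoff of 0.3 are hypothetical, the scores are the six cluster1 values listed earlier in the thread, and whether wot interprets its quantile parameter with exactly this convention is an assumption that should be checked.

```python
import numpy as np

def quantile_of_cutoff(scores, cutoff):
    """Fraction of scores that are <= cutoff (the empirical quantile of the cutoff)."""
    scores = np.asarray(scores, dtype=float)
    return float(np.mean(scores <= cutoff))

# Cluster1 scores quoted earlier in the thread
scores = [0.44846073, 0.47955644, 0.3165601, 0.14038791, 0.26206216, 0.36493918]

# Hypothetical cutoff read off a histogram
q = quantile_of_cutoff(scores, 0.3)
print(q)
```

Two of the six scores lie at or below 0.3, so the cutoff sits at roughly the 0.33 quantile of this toy list.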


geoffschieb avatar geoffschieb commented on July 17, 2024


joshua-gould avatar joshua-gould commented on July 17, 2024


yeroslaviz avatar yeroslaviz commented on July 17, 2024

> Yes, that's right: you need to calculate at what quantile the value stands
> in the lists of scores. Then supply the quantile parameter to the command.

Thanks for the reply. But I must admit that this is not really intuitive. As one can see above, the histogram is not always so clear (cluster 1 above).
Can you explain to me what the reason is for taking the minimum?
If that is all there is to it, this step can be automated.
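
As a rough sketch of what such automation could look like, assuming the goal is the deepest valley between two modes of the score histogram: the code below bins the scores, smooths the counts with a moving average, and picks the bin that lies furthest below the peaks on both sides. The function name, bin count and smoothing width are hypothetical choices, the data are synthetic, and this is not the procedure the wot authors describe.

```python
import numpy as np

def deepest_valley(scores, bins=50, smooth=3):
    """Return (cutoff, depth) for the most pronounced valley in the score histogram.

    depth is how far the valley lies below the lower of the highest peaks on
    either side (in smoothed counts). Returns (None, 0.0) if no valley exists.
    """
    counts, edges = np.histogram(scores, bins=bins)
    centers = (edges[:-1] + edges[1:]) / 2
    # Moving-average smoothing to suppress bin-to-bin noise
    kernel = np.ones(smooth) / smooth
    smoothed = np.convolve(counts, kernel, mode="same")
    depths = np.zeros(bins)
    for i in range(1, bins - 1):
        depths[i] = min(smoothed[:i].max(), smoothed[i + 1:].max()) - smoothed[i]
    if depths.max() <= 0:
        return None, 0.0
    # If several bins tie (e.g. an empty gap), take the middle one
    ties = np.flatnonzero(depths == depths.max())
    best_bin = ties[len(ties) // 2]
    return float(centers[best_bin]), float(depths[best_bin])

# Synthetic bimodal scores (assumed data): a large mode near 0, a small one near 5
rng = np.random.default_rng(1)
scores = np.concatenate([rng.normal(0.0, 0.5, 4000), rng.normal(5.0, 0.5, 1000)])
cutoff, depth = deepest_valley(scores)
print(cutoff, depth)
```

For a histogram without a clear second mode the reported depth will be small, which is a hint that the cutoff should still be checked by eye.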

> Another option is to apply filters in Excel to generate your cell sets

What would this filter look like, then?
Sorting the scores in decreasing order and choosing the 0.99 quantile?


yeroslaviz avatar yeroslaviz commented on July 17, 2024

In the newer version (1.0.4) you've modified the script so that it writes the gene scores to a single output file as one big table. This is a much better and more efficient way of handling the data. Thanks for that.

However, the problem I described above now returns. If I have multiple gene sets, how can I calculate separate cell sets for each of them?

> The command selects the top x percent of cells according to the score. You
> are using the same value of x for each score. What you need to do
> differently is to manually look at histograms of these scores and identify
> a different cutoff for each score. Then run the command 12 times (once for
> each score) with a different value of the cutoff.

But this is no longer possible, as the scores are all in one table.

Any ideas on how to do that?
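
One possible workaround outside of wot, sketched with pandas under assumptions: read the combined score table, apply a separate cutoff to each score column, and write the resulting cell sets as GMT text (set name, description, then members, tab-separated). The helper names and the cutoff values are hypothetical. The toy table reuses the first rows and columns of a combined score file in this layout.

```python
import io
import pandas as pd

def cell_sets_from_scores(scores, cutoffs):
    """Map each score column to the ids of cells whose score exceeds that column's cutoff."""
    return {col: scores.index[scores[col] > cut].tolist()
            for col, cut in cutoffs.items()}

def to_gmt(cell_sets):
    """Serialize cell sets as GMT text: name, description, then members, tab-separated."""
    return "".join("\t".join([name, "-"] + members) + "\n"
                   for name, members in cell_sets.items())

# Toy table in the layout of a combined gene score file (one id column, one column per score)
table = io.StringIO(
    "id\tMEF.identity_score\tPluripotency_score\n"
    "p2_cortex_143_AAACCTGAGGCTAGGT\t0.09748856\t-0.15128791\n"
    "p2_cortex_143_AAACGGGCAATCGGTT\t-0.024704605\t-0.16561882\n"
    "p2_cortex_143_AAAGTAGCATAAAGGT\t0.07285158\t-0.03847813\n"
)
scores = pd.read_csv(table, sep="\t", index_col="id")

# Hypothetical cutoffs, one per score column, chosen by inspecting each histogram
cutoffs = {"MEF.identity_score": 0.05, "Pluripotency_score": -0.1}
cell_sets = cell_sets_from_scores(scores, cutoffs)
print(to_gmt(cell_sets), end="")
```

In practice the StringIO object would be replaced by the path to the real score file, and the GMT text written to disk.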


joshua-gould avatar joshua-gould commented on July 17, 2024


yeroslaviz avatar yeroslaviz commented on July 17, 2024

Hello again, sorry to keep insisting, but I would really like to understand more about the method of converting the gene sets into cell sets.

For a better comparison I am using your gene-set file with my expression file (putting aside whether or not it makes biological sense).
I have created the gene scores using this command:

wot gene_set_scores --matrix p2_matrix.transposed.h5ad --method mean_z_score --gene_sets_Ex.gmx --out gene_Scores_EX

and got a gene score file

id	MEF.identity_score	Pluripotency_score	Cell.cycle_score ...
p2_cortex_143_AAACCTGAGGCTAGGT	0.09748856	-0.15128791	-0.09660525 ...
p2_cortex_143_AAACGGGCAATCGGTT	-0.024704605	-0.16561882	-0.21638143 ...
p2_cortex_143_AAAGTAGCATAAAGGT	0.07285158	-0.03847813	-0.1316462 ...
...

With these scores I would like to calculate my cell sets. As you mentioned above, to find the best value for the quantile it would be a good idea to create a histogram of the scores (see below) and look for the local minimum.

Screenshot 2019-07-17 13 59 07

Calculating the quantiles for, e.g., the ER.stress scores (right histogram) gives me the following values:

      80%       85%       90%       95%       99%
0.1202166 0.1641223 0.2210341 0.3048465 0.4497572

I guess in this case I can take the 0.99 value as the quantile, as it can be considered a local minimum, but this would be more difficult for a histogram like the one I posted previously, which shows bimodal behavior.
Is there a "general rule" one can rely on to choose the "correct" value?
