
vicco-group / frrsa

24 stars · 1 watcher · 3 forks · 337 KB

Python package to conduct feature-reweighted representational similarity analysis.

Home Page: https://www.sciencedirect.com/science/article/pii/S105381192200413X

License: GNU Affero General Public License v3.0

Languages: Python 100.00%

Topics: representational-similarity feature-weighting human-vision deep-neural-networks dissimilarity-matrix

frrsa's People

Contributors

hahahannes, philippkaniuth


frrsa's Issues

Add custom warning if the value chosen for outer_k in combination with the number of conditions yields too few (inner or outer) test conditions

Some combinations of outer_k and n_conditions can lead to too few test conditions. Catch those cases, adjust the user-supplied "outer_k" accordingly, and inform the user. Also inform the user that, when n_conditions is large, changing "outer_k" might lead to a much longer run time and that they might want to abort and supply a better value for "outer_k".
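
A minimal sketch of such a check (the function name, the threshold of 3 test conditions, and the adjustment rule are assumptions, not the package's actual logic):

```python
import warnings

def check_outer_k(outer_k, n_conditions, min_test_conditions=3):
    """Warn and adjust outer_k if the outer test folds would hold too few conditions."""
    test_conditions = n_conditions // outer_k
    if test_conditions < min_test_conditions:
        adjusted_k = max(2, n_conditions // min_test_conditions)
        warnings.warn(
            f"outer_k={outer_k} with n_conditions={n_conditions} yields only "
            f"{test_conditions} test condition(s) per outer fold; falling back to "
            f"outer_k={adjusted_k}. Note that this can increase run time considerably."
        )
        return adjusted_k
    return outer_k
```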

Question about the returned "predicted_matrix"

Hello,

I would like to use the predicted matrix returned by your function, but some values are returned as 9999, which means those pairs were not predicted. However, if k-fold cross-validation is applied, every pair of conditions should appear in a test set at some point during cross-validation. How is it possible that some pairs have not been predicted? I am confused; could you explain this?
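
For reference, a small sketch of how the placeholder entries could be located and masked (assuming predicted_matrix is a square NumPy array and 9999 is the fill value for unpredicted pairs, as described above):

```python
import numpy as np

PLACEHOLDER = 9999  # value described above for pairs that were never predicted

def mask_unpredicted(predicted_matrix):
    """Report and mask cells of the predicted matrix that hold the placeholder."""
    unpredicted = np.argwhere(predicted_matrix == PLACEHOLDER)
    print(f"{len(unpredicted)} cell(s) were not predicted")
    return np.ma.masked_equal(predicted_matrix, PLACEHOLDER)
```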

deprecate `splitter == 'kfold'`

Deprecating splitter == 'kfold', i.e. the choice argument of frrsa/helper/data_splitter.py, would simplify solving #22. Possibly comment out the respective code bits first, to wait and see whether anyone wants anything other than splitter='random'.
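
If a softer route than commenting the code out is preferred, a deprecation warning could be emitted first; a minimal sketch (the function signature and structure are assumptions, not the actual data_splitter code):

```python
import warnings

def data_splitter(splitter='random', **kwargs):
    """Sketch of a soft deprecation for the 'kfold' choice."""
    if splitter == 'kfold':
        warnings.warn(
            "splitter='kfold' is deprecated and will be removed in a future "
            "release; use splitter='random' instead.",
            DeprecationWarning,
            stacklevel=2,
        )
    # ... continue with the chosen splitting scheme ...
```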

rename frrsa's parameter `distance` (and possibly add a sibling parameter)

Currently, the name of the parameter distance is a bit of a misnomer: it can be used either to have the squared Euclidean distance computed within each feature (i.e. a dissimilarity) or to compute the feature-specific dot-product (i.e. a similarity measure, not a distance). What would be a good hypernym for "(dis-)similarity", though?
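
For illustration, the two per-feature computations the parameter currently switches between (variable names and shapes are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
predictor = rng.normal(size=(100, 20))  # assumed layout: n_features x n_conditions
i, j = 0, 1                             # two conditions to compare

# Per-feature squared Euclidean distance -> a dissimilarity per feature.
sq_euclidean = (predictor[:, i] - predictor[:, j]) ** 2

# Per-feature dot-product -> a similarity per feature.
dot_product = predictor[:, i] * predictor[:, j]
```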

deprecate classical scores

frrsa returns an object called scores. This object contains the representational correspondence scores between each target and the predictor. At present, it holds scores for both classical and feature-reweighted RSA.

The fact that frrsa outputs classical scores was really just a convenience, to be able to quickly compare reweighted and classical scores. However, there is a multitude of (dis-)similarity measures one could want for the classical case, and many are not (and will not be) supported by frrsa (think cross-validated Mahalanobis distance). It might well be that a user wants a (dis-)similarity measure for the classical case that frrsa will never support because better-suited alternatives exist to compute it; in that case one would throw away the classical score that frrsa outputs anyway.

Further, the computation of classical scores within frrsa leads to other problems (see #32 and #25).

Therefore, frrsa outputting classical scores will be soft-deprecated.

RDM and RSM

I think right now it is a bit tricky to handle RSMs.
E.g. if one uses an RSM as target and chooses pearson as distance, then while performing classical RSA an RDM gets produced and the score will be negative.
I think we discussed the possibility of checking this in the code, for example by checking the diagonal and comparing it with the chosen distance function, but decided that it is the user's responsibility.
In this case, I would propose to maybe rename the possible values for the distance function.

pearson_similarity -> pearson() -> RSM
pearson_distance -> 1 - pearson() -> RDM
sqeuclidean_distance ->  sqeuclidean()  -> RDM
sqeuclidean_similarity -> max value? - sqeuclidean() -> RSM

with regard to #25 it could be named

multi_dim_distance:
pearson_similarity -> pearson() -> RSM
pearson_distance -> 1 - pearson() -> RDM
sqeuclidean_distance ->  sqeuclidean()  -> RDM
sqeuclidean_similarity -> max value? - sqeuclidean() -> RSM

single_dim_distance:
dot_similarity -> dot() -> RSM
dot_distance -> 1 - dot() -> RDM
sqeuclidean_distance ->  sqeuclidean()  -> RDM
sqeuclidean_similarity -> max value? - sqeuclidean() -> RSM

An alternative could be to use a parameter that sets similarity or dissimilarity and then to choose the matching distance/similarity function accordingly.
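
A sketch of how the proposed names could map onto matrix builders (the implementations are illustrative assumptions, limited here to the Pearson pair):

```python
import numpy as np

def pearson_rsm(x):
    """Condition-by-condition Pearson similarity matrix (RSM); x: features x conditions."""
    return np.corrcoef(x, rowvar=False)

def pearson_rdm(x):
    """1 - pearson(), i.e. a dissimilarity matrix (RDM)."""
    return 1 - np.corrcoef(x, rowvar=False)

# Proposed parameter values mapped to the functions that build the matrices.
matrix_builders = {
    'pearson_similarity': pearson_rsm,  # -> RSM
    'pearson_distance': pearson_rdm,    # -> RDM
}
```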

Different cv fold sizes lead to different RSA scores

I'm estimating RSA for models with n=48, p=4431. I used split-half folding with a large number of repetitions (cv=[2, 1000]; inspired by this paper: https://www.sciencedirect.com/science/article/pii/S1053811921004225). However, I received quite a different estimate of the Pearson correlation compared to when I used 5 folds. In my case, the correlation for split-half folding was ~0.01, while for 5 folds it was 0.05 (I repeated this test a few times to make sure the estimates are stable). I don't know if it is relevant, but I also received a ton of warnings:

python3.9/site-packages/scipy/stats/_stats_py.py:4068: PearsonRConstantInputWarning: An input array is constant; the correlation coefficient is not defined.

and one:

python3.9/site-packages/numpy/core/fromnumeric.py:43: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.

In my understanding, the results should not depend on the fold size, so I found this outcome a bit worrisome.
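
For context, the number of condition pairs entering each outer test-fold correlation differs considerably between the two settings; a small worked example (plain combinatorics, assuming test-fold size ≈ n_conditions / outer_k):

```python
from math import comb

n_conditions = 48
for outer_k in (2, 5):
    test_size = n_conditions // outer_k            # conditions per outer test fold
    print(outer_k, test_size, comb(test_size, 2))  # pairs correlated per fold
# outer_k=2: 24 conditions -> 276 pairs per fold
# outer_k=5: 9-10 conditions -> 36-45 pairs per fold
```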

Correlate reweighted predicting matrix with other targets

Motivated by #43, if the aim is to generalize the predictions to a second target:

The user submits two (or several) targets. One of them will be used to fit the statistical model. The others will simply be correlated with the reweighted predicting matrix, essentially generalizing the model. This should likely be done in the function fit_and_score.

This will take the reweighted predicting matrix as it was received when fitting to the target and generalize it to another target. Basically, the question this would answer is: if I reweight my predictor with regard to a target, how well does the reweighted predicting matrix now relate to another target for the same images? Therefore, this does not allow generalizing the model to a matrix with (dis-)similarities from different images (for that, see #47).
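
A minimal sketch of the added evaluation step (the names, the upper-triangle vectorization, and the use of Pearson are assumptions):

```python
import numpy as np
from scipy.stats import pearsonr

def generalize_to_targets(predicted_matrix, extra_targets):
    """Correlate the reweighted predicted matrix with additional target
    matrices for the same conditions (illustrative sketch)."""
    n = predicted_matrix.shape[0]
    triu = np.triu_indices(n, k=1)  # off-diagonal upper triangle
    return [pearsonr(predicted_matrix[triu], target[triu])[0]
            for target in extra_targets]
```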

Clarify returned objects

In the README, clarify how the optionally returned objects predicted_matrix, betas, and predictions are created and should (not) be used.

Get rid of inconsequential numpy warning

Hunt down this warning:

python3.9/site-packages/numpy/core/fromnumeric.py:43: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.

It occurs when parallel is not 1. It doesn't have an effect on results.
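
For reference, a minimal reproduction of that NumPy warning and the fix it suggests (unrelated to frrsa's actual code paths; just the generic pattern):

```python
import numpy as np

ragged = [[1, 2, 3], [4, 5]]          # nested sequences of unequal length

# np.array(ragged)                    # emits VisibleDeprecationWarning (an error in newer NumPy)
arr = np.array(ragged, dtype=object)  # specifying dtype=object avoids the warning
```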

Option to do FR for all subs simultaneously?

Greetings,

It is my understanding from the paper that FRRSA is done on the subject level (each subject gets its own weight vector). However, I was wondering whether it could be tweaked to find an optimal weighting given a set of participant datasets, thus finding an optimal group-level weight vector for a model?

Would reweighting, e.g., fMRI voxel data to the group-average voxel data (as per Fig. 3c) and then running FRRSA on the (subject-level reweighted fMRI voxel data, model RSM) pair achieve this in a roundabout way?

Apply betas to new images

Motivated by #43, if the aim is to generalize the betas to another image set:

The user submits two (or several) targets and correspondingly two (or several) predictors. One of these target-predictor pairs will be used to fit the statistical model as usual.

Then, the betas will be extracted and used to reweight another predictor. This predictor must come from the same system (e.g. if using a DNN, it must be the same DNN module), but it can be for different images. Finally, the reweighted second predictor is evaluated against its corresponding target.

The question this would answer is: if I reweight my predictor with regard to a target, how well does the fitted statistical model (i.e. the betas of the regularized regression) generalize when applied to another predictor with different images?

In contrast to #46, this does allow generalizing the statistical model to a matrix with (dis-)similarities from different images.
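
A rough sketch of the reweighting step, under the assumption that the predicted (dis-)similarities are a weighted sum of feature-specific (dis-)similarities (names, shapes, and the dot-product measure are illustrative; any intercept term is omitted):

```python
import numpy as np

def reweight_new_predictor(new_predictor, betas):
    """new_predictor: n_features x n_conditions activations for the new images.
    betas: one weight per feature, taken from the already fitted model.
    Returns the reweighted predicted (dis-)similarity matrix."""
    n_features, n_conditions = new_predictor.shape
    predicted = np.zeros((n_conditions, n_conditions))
    for f in range(n_features):
        # Feature-specific pairwise dot-product for the new images.
        feature_sim = np.outer(new_predictor[f], new_predictor[f])
        predicted += betas[f] * feature_sim
    return predicted
```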

Improve computation of final betas

Possibly change how the optionally returned final betas are computed to make them more useful for downstream analyses. For that, change frrsa/frrsa/fitting/fitting/final_model so that it does a repeated CV to find the best hyperparameter for the whole dataset. It is unclear how to deal with several hyperparameters, i.e. whether averaging them is actually sensible. The best course of action might also depend on the kind of hyperparameter, i.e. whether one uses fractional ridge regression or the more classical approach as implemented by sklearn (which of the two is used depends on how one sets the parameter nonnegative).

PRs for this issue welcome.
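
A sketch of the "repeated CV on the whole dataset" idea using scikit-learn (the estimator, hyperparameter grid, and data are placeholders, not frrsa's actual fitting setup):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, RepeatedKFold

rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 50)), rng.normal(size=200)  # placeholder data

cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=0)
search = GridSearchCV(Ridge(), {'alpha': np.logspace(-3, 3, 13)}, cv=cv)
search.fit(X, y)
best_alpha = search.best_params_['alpha']  # hyperparameter for fitting the final betas
```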

Make `frrsa` executable for `n_conditions` < 9

Currently, the absolute minimum number of conditions that need to be present is 9 (also see known issues).

A "proper" LOOCV could solve this issue. Currently, the lower bound of 9 conditions exists only because the algorithm expects at least 3 conditions in each test fold (to have more than 1 condition pair in the test fold) in order to correlate the predicted and test dissimilarities. A "proper" LOOCV could deal with only 2 conditions in the test set.

This, however, would entail a major change to the structure of the code base: away from the "correlate predicted with test values directly" approach, towards an "iteratively fill the predicted RDM across all LOOCV runs and finally correlate the whole predicted RDM with the whole test RDM" approach.

This is out-of-scope for the time being.
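
Still, for intuition, the pair counts behind the current bound (plain combinatorics, not code from the package):

```python
from math import comb

# Number of condition pairs available in a test fold that holds k conditions.
for k in (2, 3):
    print(k, comb(k, 2))
# k=2 -> 1 pair: correlating predicted with test values is not possible.
# k=3 -> 3 pairs: the smallest fold for which the correlation can be computed.
```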

Clarify various object names

frrsa in frrsa/fitting/crossvalidation.py optionally returns the object predictions. This object has the hard-coded column names "dissim_target" and "dissim_predicted". In case of fitting representational matrices that contain similarities, these column names are confusing.

Either conditionally name the columns based on whether a similarity or a dissimilarity is predicted, or use a generic name that would cover both cases.
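
A tiny sketch of the conditional-naming option (the boolean flag and the helper function are assumptions):

```python
import pandas as pd

def make_predictions_frame(target_values, predicted_values, similarity=False):
    """Name the columns according to whether similarities or dissimilarities
    were fitted (illustrative sketch)."""
    prefix = 'sim' if similarity else 'dissim'
    return pd.DataFrame({f'{prefix}_target': target_values,
                         f'{prefix}_predicted': predicted_values})
```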

speed up & refactor fitting/scoring.py

Speed up and possibly fuse both functions.

Suggestions:

  • if score_type=='pearson':
    • use np.corrcoef to replace at least one for-loop (in the case of scoring a multioutput target while evaluating alphas: first change the order of the for-loops, then replace the resulting inner for-loop over targets), as sketched below
  • if score_type!='pearson':
    • use some sort of map function
  • use numba or similar for the remaining for-loops
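
A sketch of the np.corrcoef idea for the multioutput case (array names and shapes are assumptions):

```python
import numpy as np

def pearson_scores(y_true, y_pred):
    """Column-wise Pearson correlations between two (n_pairs x n_targets)
    arrays via a single np.corrcoef call instead of a per-target for-loop."""
    n_targets = y_true.shape[1]
    corr = np.corrcoef(y_true, y_pred, rowvar=False)  # (2*n_targets, 2*n_targets)
    # The correlation of target t with its own prediction sits at [t, n_targets + t].
    return corr[np.arange(n_targets), n_targets + np.arange(n_targets)]
```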
