pysustain's Introduction

pySuStaIn

Subtype and Stage Inference, or SuStaIn, is an algorithm for discovery of data-driven groups or "subtypes" in chronic disorders. This repository is the Python implementation of SuStaIn, with the option to describe the subtype progression patterns using either the event-based model, the piecewise linear z-score model or the scored events model.

Acknowledgement

If you use pySuStaIn, please cite the following core papers:

  1. The original SuStaIn paper
  2. The pySuStaIn software paper

Please also cite the corresponding progression pattern model you use:

  1. The piecewise linear z-score model (i.e. ZscoreSustain)
  2. The event-based model (i.e. MixtureSustain), with Gaussian mixture modelling or kernel density estimation.
  3. The scored events model (i.e. OrdinalSustain)

Thanks a lot for supporting this project.

Installation

Install option 1 (for installing the pySuStaIn code in a chosen directory): clone repository, install locally

  1. Clone this repo

  2. Navigate to the main pySuStaIn directory (where you see setup.py, README.txt, LICENSE.txt, and all subfolders), then run:

    pip install .
    

    Alternatively, you can run pip install -e ., where the -e flag allows you to make edits to the code without reinstalling.

Either way, this will install everything listed in requirements.txt, including the awkde package (used for mixture modelling). An error may appear during the installation of awkde, but the installation should then continue and complete successfully. Note that you need pip version 18.1+ for this installation to work.

Install option 2 (for simply using pySuStaIn as a package): direct install from repository

  1. Run the following command to directly install pySuStaIn:

    pip install git+https://github.com/ucl-pond/pySuStaIn
    

Note that you must already have numpy (1.18+) installed to do this. To create a new environment, follow the instructions in the Troubleshooting section below.

Troubleshooting

If the above install breaks, you may have some interfering packages installed. One way around this would be to create a new Anaconda environment that uses Python 3.7+, then activate it and repeat the installation steps above. To do this, download and install Anaconda/Miniconda, then run:

conda create --name sustain_env python=3.7
conda activate sustain_env
conda install numpy

This creates an environment named sustain_env and installs numpy. Then, follow the installation instructions as normal.

Dependencies

Testing

If you want to check that the installation was successful, you can run the end-to-end tests. For this, you will need to navigate to the tests/ subfolder (wherever pySuStaIn has been installed on your system). Then, you can use the following command to run all SuStaIn variants (this may take a bit of time!):

python validation.py -f

For a quicker run (using just MixtureSustain), just use:

python validation.py

instead. Testing of single classes is possible using the -c flag, e.g. python validation.py -c ordinal. To see all options, run python validation.py --help.

Parallelization

  • Added parallelized startpoints

Running different SuStaIn implementations

sustainType can be set to:

  • mixture_GMM : SuStaIn with an event-based model progression pattern, with Gaussian mixture modelling of normal/abnormal.
  • mixture_KDE: SuStaIn with an event-based model progression pattern, with Kernel Density Estimation (KDE) mixture modelling of normal/abnormal.
  • zscore: SuStaIn with a piecewise linear z-score model progression pattern.

See simrun.py for examples of how to run these different implementations.
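
As a quick orientation, here is a minimal sketch of the zscore variant. The constructor argument order follows the workshop notebook and should be treated as an assumption; simrun.py and the tutorial notebook are the authoritative references, and all data and settings below are placeholders.

    import numpy as np
    from pySuStaIn import ZscoreSustain

    # Stand-in z-scored data: subjects x biomarkers (placeholder values)
    data = np.random.randn(500, 3)
    Z_vals = np.array([[1, 2, 3]] * 3)   # z-score events per biomarker
    Z_max = np.array([5.0, 5.0, 5.0])    # maximum z-score per biomarker

    sustain_input = ZscoreSustain(data, Z_vals, Z_max,
                                  ["bm1", "bm2", "bm3"],  # biomarker labels
                                  25,                     # N_startpoints
                                  3,                      # N_S_max (max subtypes)
                                  int(1e4),               # N_iterations_MCMC
                                  "output",               # output folder
                                  "sim",                  # dataset name
                                  False)                  # parallelize start points
    sustain_input.run_sustain_algorithm()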

SuStaIn Tutorial

See the jupyter notebook in the notebooks folder for a tutorial on how to use SuStaIn using simulated data. We also have a set of tutorial videos on YouTube, which you can find here.

Papers

Methods:

Applications:

Funding

This project has received funding from the European Union’s Horizon 2020 Research and Innovation Programme under Grant Agreement 666992. Application of SuStaIn to multiple sclerosis was supported by the International Progressive MS Alliance (IPMSA, award reference number PA-1603-08175).

Quotes

(The authors) have also persuaded me that (SuStaIn is) as clever as e.g. Heiko Braak's brain, (and) can infer longitudinal trajectories based on cross-sectional observations.

  • Anonymous reviewer

pysustain's People

Contributors

ahmedhshahin, armaneshaghi, ayoung11, marestarellas, noxtoby, pawij, sea-shunned


pysustain's Issues

bug in cross-validation when using "select_fold"

Hello!

Parallelizing the cross-validation requires use of the "select_fold" argument. One might, for example, launch 10 instances of cross-validation, one for each of (say) 10 folds, in which case the given fold (say 3) would be passed to the select_fold argument. However, I've run into a few issues with this function.

First, line 218:

    if select_fold:

Many users will pass 0 to get the first fold. However, 0 will not satisfy the if select_fold conditional. There are many ways to fix this, but since the default value of select_fold is [], my janky fix was just:

    if select_fold != []:

The next issue comes in the following lines, 218-220:

    if select_fold:
        test_idxs  = test_idxs[select_fold]
    Nfolds  = len(test_idxs)

I'm not sure what the intention was here, but the result is that Nfolds actually becomes the number of subjects in the test set. So, instead of having the desired 1 fold, you end up with N folds, where N is the number of subjects in the test set.

My solution here requires a few changes. First, lines 218-220 are changed to this:

    if select_fold != []:
        Nfolds = 1
    else:
        Nfolds = len(test_idxs)

Then, in order to disrupt the code as little as possible, I added the following lines under line 226 (I include 226 below for reference):

    for fold in range(Nfolds):
        if select_fold != []:  # or whatever you change line 218 to
            fold = select_fold

Adding these three small changes resulted in the script working without issue for me, though maybe there are more elegant solutions. Thanks for bringing SuStaIn to Python!!

About some installation error

Sorry for taking your time. My installation ended with "ERROR: Could not build wheels for awkde, which is required to install pyproject.toml-based projects", and I guess that is the awkde error you mentioned in the README, so I proceeded to run the test with "python validation.py -f", but another error occurred: "ModuleNotFoundError: No module named 'pySuStaIn'". My environment is Python 3.11, pip 23.1.2 and numpy 1.24.4 on Mac. I have also tried using Anaconda to create a virtual environment to downgrade Python, but the same error occurred.

Is there any possible solution?

Allow for complete model reloading

Background

At present, there are some variables (e.g. ml_f_EM) that are not returned by run_sustain_algorithm but are saved in the pickle files. It doesn't make sense to me that running the model gives you different outputs than loading the pickle file produced by that very run. run_sustain_algorithm should just output what gets saved (and should probably output it as a dict, to avoid having to refer to the order in which things are returned).

Further to this, when trying to load previous results from a pickle file, a lot of the same setup is still required as when running the model in the first place, which can result in a lot of unnecessary boilerplate. There is some code that indicates there was once a desire for this. If anybody knows why this wasn't pursued or if there is some obstacle I haven't noticed please let me know.

Solutions

1. Expand what's pickled and reconstruct model instance

A @classmethod to recreate the model instance from a pickle file, resulting in the same thing as if you were to run the method (with Z_vals etc. bundled in), would simplify the average workflow significantly, and would be a fairly simple change. The main issue with this is that it would probably break people's existing code, and would require the notebook(s) to be updated. It would, however, keep to the current concept (pickle the results/arrays).
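
A minimal sketch of what option 1 could look like; the name from_pickle, the stand-in class, and the attribute handling are all hypothetical, not the current pySuStaIn API:

    import pickle

    class SustainModel:                        # stand-in for e.g. ZscoreSustain
        def __init__(self, **setup_kwargs):
            self.__dict__.update(setup_kwargs)

        @classmethod
        def from_pickle(cls, pickle_path, **setup_kwargs):
            # Rebuild the instance with the same setup as a fresh run, then
            # attach the pickled result arrays (e.g. ml_f_EM) as attributes.
            instance = cls(**setup_kwargs)
            with open(pickle_path, "rb") as f:
                saved = pickle.load(f)
            for key, value in saved.items():
                setattr(instance, key, value)
            return instance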

2. Modify how pySuStaIn pickles and save/load model in its entirety

We could also just pickle/unpickle the model instance itself, following a few changes to enable this. The process should also fix #27 and #41, on top of addressing the above (and simplifying things a lot). I reckon it should be doable without too much bother. Further details/considerations can be found here. This would also be best-served by turning the arrays that are usually pickled into attributes, and so would lead to a lot of small changes through the code.

Practicalities

  • Either approach will very likely break previous code (if people update, that is!). So, this may want to form part of a proper, versioned release.
  • I am happy to work on this, but it will be at weekends etc. (though shouldn't take long).

If core maintainers agree, I'll go ahead with it, but if not then I will leave it.

minor installation issue with sklearn

Hey team,

Just a tiny one here. I did a fresh install of pySuStaIn recently and ran into a minor issue. The requirements.txt file lists "sklearn" but it should be "scikit-learn". This caused some issues with my install, which were resolved by making that small change to the requirements.txt file.

--Jake

selecting the best number of subtypes

Hi pySuStaIn team,
I had a problem with the two metrics used for selecting the best number of subtypes (the average test set log-likelihood for each subtype model, and the CVIC). They are calculated differently, and sometimes they disagree on the best number of subtypes. For example, my cross-validation result is:

Average test set log-likelihood for each subtype model: [-4803.6551647, -4799.39361827, -4802.32005004, -4802.00764201, -4803.9085804]
CVIC for each subtype model: [48239.54786736, 48206.06011098, 48231.46430435, 47821.04024647, 47840.63188379]

In this case, which metric is more reasonable?

Thank you for your reply!

Example code for mixture_KDE

I am in the process of applying the mixture_KDE version of Sustain to an external dataset that contains cross-sectional cognitive test score data for several thousand patients. I have been trying to modify the simrun.py function, but I'm running into a few conceptual and technical roadblocks. For one, we don't have control group data, which to my understanding is acceptable for the mixture_KDE model. Without controls, I'm wondering if the random assignment approach from simrun.py for generating ground_truth_sequences, ground_truth_subtypes, ground_truth_stages_control, and ground_truth_stages_other is the appropriate first step? In the pySustain white paper it says,

Within simrun.py, simulated subjects assigned earliest stages are used as controls and those in latest stages as cases.

I don't quite understand how this transfers to applying SuStaIn to real data.

In my script, after generating the random ground truth sequences etc., I comment out this line:

    data, data_denoised = generate_data_mixture_sustain(ground_truth_subtypes, ground_truth_stages, ground_truth_sequences, sustainType)

and use the numpy array of my own data, which has the exact same shape as what would be generated by the above line of code. However, I am receiving a LinAlgError: Singular matrix error when running this line:

    mixtures = fit_all_kde_models(true_data, labels)

I think having a clearer example script of a mixture_KDE implementation with real data would be very useful in helping me answer some of my questions. Please let me know if there are any resources that you could share with me that might be helpful, or if you could address some of my issues directly.

I can also share my current working script if it would be of any help. Thanks!

Cross-validation for selecting optimal number of clusters

We should code up a script that implements 10-fold CV for selecting the optimal number of subtypes (see the SuStaIn paper).
For each fold:

  1. Calculate model likelihood for test and training sets;
  2. Output/Save this, and optionally also plot.

selecting the best number of subtypes

I found a problem with the two metrics for selecting the best number of subtypes (the average test set log-likelihood for each subtype model, and the CVIC). They are calculated differently, and sometimes they disagree on the best number of subtypes. In this case, which metric is more reasonable?

Enabling PVD Plot Legends

Hi all,

I just wanted to share some code here in case it may help anyone else: I've added a legend to the PVDs which shows the Z_val associated with each colour. It assumes that the Z_vals are the same for all biomarkers when creating the legend labels. I'm aware that there's an open issue on PVD colourbars and am not offering a solution to that problem, but simply a fix in the meantime to show clearly which value maps to which colour (I myself found this quite confusing to parse out when using more than 3 Z_vals as there was no documentation showing the mappings). It would be well paired with explanations I've seen used in papers about what the intensity, etc. of each colour indicates.

This requires an extra import from Matplotlib in ZscoreSustain: import matplotlib.patches as mpatches.

To implement it, add the following to plot_positional_var in ZscoreSustain between lines 660 (ax.set_title(title_i, fontsize=title_font_size)) and 661 (# Tighten up the figure). Further customization (e.g. legend size and location) is possible through the ax.legend arguments.

                ax.set_xlabel(stage_label, fontsize=stage_font_size+2)
                ax.set_title(title_i, fontsize=title_font_size)

                # Add a legend
                # (extra leading dim on the colour mat for RGB reasons; keep
                # only the first N_z colours)
                legend_colour_mat = np.array([[[1, 0, 0], [1, 0, 1], [0, 0, 1], [0.5, 0, 1], [0, 1, 1], [0, 1, 0.5]]])[:, :N_z]

                patches = [mpatches.Patch(color=legend_colour_mat[0][i], label=f"Z_val = {zvalues[i]}") for i in range(len(zvalues))]

                # put those patches as legend-handles into the legend
                ax.legend(handles=patches, loc="best")

            # Tighten up the figure
            #plt.tight_layout()
            fig.tight_layout()

This is an example of the result using the SuStaIn Workshop data:

(figure omitted)


I also found it helpful to visualize the general mapping of Z_vals to colours, especially if legends aren't used:

(figure omitted)

which I generated using:

import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import numpy as np
legend_colour_mat = np.array([[[1, 0, 0], [1, 0, 1], [0, 0, 1], [0.5, 0, 1], [0, 1, 1], [0, 1, 0.5]]])
ax = plt.subplot()
ax.imshow(legend_colour_mat)
ax.set_yticks([])
ax.set_xticks(range(6), ["1st Z_val", "2nd Z_val", "3rd Z_val", "4th Z_val", "5th Z_val", "6th Z_val"])
plt.show()

I hope this will be helpful to some! Thanks 😊

Missing data

I get divide by zero errors relating to model likelihood, which I tracked back to missing data causing problems with max() and min(), etc. Couldn't fix it with numpy.nanmax(), so we probably need to devise a robust method for handling missing data.

ValueError in AbstractSuStaIn

Hi all,

Thanks for your help with my past issue!

I'm now encountering a new error within the AbstractSuStaIn package that seems to relate to the staging portion of the algorithm:

Error Traceback

MCMC Iteration: 100%|██████████| 10000/10000 [00:24<00:00, 414.37it/s]
MCMC Iteration: 100%|██████████| 10000/10000 [00:23<00:00, 422.43it/s]
MCMC Iteration: 100%|██████████| 10000/10000 [00:30<00:00, 325.00it/s]
MCMC Iteration: 100%|██████████| 1000/1000 [00:02<00:00, 462.84it/s]

[pysustain package location]/lib/python3.10/site-packages/pySuStaIn/AbstractSustain.py:556: RuntimeWarning: invalid value encountered in divide
  total_prob_subtype_norm = total_prob_subtype / np.tile(np.sum(total_prob_subtype, 1).reshape(len(total_prob_subtype), 1), (1, N_S))

[pysustain package location]/lib/python3.10/site-packages/pySuStaIn/AbstractSustain.py:557: RuntimeWarning: invalid value encountered in divide
  total_prob_stage_norm = total_prob_stage / np.tile(np.sum(total_prob_stage, 1).reshape(len(total_prob_stage), 1), (1, nStages + 1)) #removed total_prob_subtype

[pysustain package location]/lib/python3.10/site-packages/pySuStaIn/AbstractSustain.py:560: RuntimeWarning: invalid value encountered in divide
  total_prob_subtype_stage_norm = total_prob_subtype_stage / np.tile(np.sum(np.sum(total_prob_subtype_stage, 1, keepdims=True), 2).reshape(nSamples, 1, 1), (1, nStages + 1, N_S))

Traceback (most recent call last):
  File "[notebook].py", line 475, in <module>
    prob_subtype_stage = sustain_input.run_sustain_algorithm()
  File "[pysustain package location]/lib/python3.10/site-packages/pySuStaIn/AbstractSustain.py", line 186, in run_sustain_algorithm
    prob_subtype_stage = self.subtype_and_stage_individuals(self.__sustainData, samples_sequence, samples_f, N_samples) #self.subtype_and_stage_individuals(self.__data, samples_sequence, samples_f, N_samples)
  File "[pysustain package location]/lib/python3.10/site-packages/pySuStaIn/AbstractSustain.py", line 590, in subtype_and_stage_individuals
    this_prob_stage = np.squeeze(prob_subtype_stage[i, :, int(ml_subtype[i])])
ValueError: cannot convert float NaN to integer

This error is happening both locally and on a remote computing cluster. I've already added in an assert that none of the data going into SuStaIn contains NaNs, and ensured that my Z_vals are all integers (I am using Zscore SuStaIn). Do you have any ideas of what may be causing this issue or how to solve it?

As a related question, my research group and I are wondering why negative z-scores are not allowed in SuStaIn, and what the best practice is for handling them. We are currently shifting the z-score distribution to the right to ensure all values are > 0, but this means we lose the interpretability of z = 0, etc. Do you have any advice?
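
For reference, the shift we apply looks like this (an illustrative sketch; zscore_data is a stand-in for our real subjects x biomarkers array):

    import numpy as np

    zscore_data = np.random.randn(100, 5)             # stand-in data
    shifted = zscore_data - zscore_data.min() + 1e-6  # all values now > 0
    # ...at the cost of z = 0 no longer marking the control mean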

Thank you!

Mislabelled subtype numbers in PVDs

I have noticed an edge case where subtypes are labelled differently by the positional variance diagrams and by the model output.

I'm not 100% certain, but I think this is due to the subtype numbering in the PVDs being assigned according to maximum likelihood of the positional variance, whilst I think the ml_subtype number is assigned according to number of individuals per subtype. In rare cases a smaller subtype can have a higher maximum likelihood PVD. 😕

multiple sclerosis

Hello and thank you very much for sharing your work!
I am sorry for the trouble, but I have not been able to find out how to run your model for multiple sclerosis outcome prediction on the pretrained pkl you provide.
Would it be possible for you to provide some explanation of how to do so, or a reference to instructions you have already provided that I am not seeing? Ideally covering everything from the required input, through the inference process, to the expected output.
Thank you enormously in advance.
Lucia

Fix for "rare" divide by zero problem

Dear SuStaIn friends,

I had encountered an error on a number of occasions when using SuStaIn on different datasets. The error itself looked something like this:

Traceback (most recent call last):
  File "/Users/jacobv/SuStaIn_workshop/lib/python3.7/site-packages/multiprocess/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/Users/jacobv/SuStaIn_workshop/lib/python3.7/site-packages/multiprocess/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/Users/jacobv/SuStaIn_workshop/lib/python3.7/site-packages/pathos/helpers/mp_helper.py", line 15, in <lambda>
    func = lambda args: f(*args)
  File "/Users/jacobv/SuStaIn_workshop/lib/python3.7/site-packages/pySuStaIn/AbstractSustain.py", line 709, in _find_ml_iteration
    _                               = self._perform_em(sustainData, seq_init, f_init, rng)
  File "/Users/jacobv/SuStaIn_workshop/lib/python3.7/site-packages/pySuStaIn/AbstractSustain.py", line 864, in _perform_em
    candidate_likelihood            = self._optimise_parameters(sustainData, current_sequence, current_f, rng)
  File "/Users/jacobv/SuStaIn_workshop/lib/python3.7/site-packages/pySuStaIn/ZscoreSustain.py", line 324, in _optimise_parameters
    this_S                      = this_S[0, :]
IndexError: index 0 is out of bounds for axis 0 with size 0.

As it turns out, this is caused by a divide by zero problem during the normalization of p_perm_k. This NaN then propagates forward a bit and doesn't turn up as an error until line 324, as shown. This is not itself necessarily caused by any outlying "bad values" (e.g. NaNs) in the original dataset, so it's quite hard (impossible?) to detect before running SuStaIn and getting the error.

This is apparently a known issue, as the following comment exists on line 333 of ZscoreSustain:

    # adding 1e-250 fixes divide by zero problem that happens rarely

A few lines later, at 335, the "corrected" line occurs:

    p_perm_k_norm = p_perm_k_weighted / np.sum(p_perm_k_weighted + 1e-250, axis=(1, 2), keepdims=True)

However, at least in my case, the offending divide by zero problem occurred earlier. Note that the fix (ln 335) occurs before the error in my traceback (ln 324). Instead, the divide by zero problem occurs for me at line 238, which is incidentally the same calculation:

    p_perm_k_norm = p_perm_k_weighted / np.sum(p_perm_k_weighted, axis=(1,2), keepdims=True)

By once again adding the "corrected" line, the problem is surmounted and I no longer get the error. I'm not sure how rare this issue really is, because this is maybe the third time I've encountered it (on different datasets). Requesting a patch to fix it, pretty please!
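
Concretely, the requested patch is just to apply the same 1e-250 correction at line 238, so that it mirrors line 335:

    p_perm_k_norm = p_perm_k_weighted / np.sum(p_perm_k_weighted + 1e-250, axis=(1, 2), keepdims=True)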

Thanks as always for this incredible software!! <3 <3 <3

Two questions in ZscoreSustain

Dear SuStaIn friends,
I have a few questions I'd like to ask. First, in ZscoreSustain, do the z_vals and z_max values depend on the input data? If so, how should z_vals and z_max be set according to the input data? Second, should the input data contain only the z-scores of the patients, or must it contain both the z-scores of the patients and the z-scores of the healthy control group? Is there any difference between the two input methods, and which works better? Looking forward to your answer, thank you.

Cross validation -- option to control # of threads

Hiya Leon et al.,

I ran into an interesting issue when the sys admin of my cluster reached out about some problematic processes that were instantiated when running a parallelized version of the SuStaIn cross-validation. The wrapper script looked something like this:

from pySuStaIn import ZscoreSustain
import multiprocessing as mp

sustain_input = ZscoreSustain(args)
test_idxs = <a list of lists>

jobs = []
NFolds = 10
for fold in range(NFolds):
    p = mp.Process(target=sustain_input.cross_validate_sustain_model,
                   args=(test_idxs, fold))
    jobs.append(p)
    p.start()

This script was then submitted to the cluster with an .sh script specifying some parameters, such as the number of nodes and cores (in this case I asked for 1 node and 32 cores). However, it seems that individual jobs were themselves starting several other threads/processes. In this sense, they were overriding the specifications on my .sh script. The result was me asking for 32 cores, but having 32^2 threads running on the node. This results in many context switches and inefficient use of the processors on the node.

I admit this is kind of a niche issue, and maybe folks don't care so much about how efficient the code is. But I think it might be surmounted quite easily by adding an argument that lets the user control the internal parallelization to some degree, a la the n_jobs framework in sklearn. As is, the parallel behaviour does not seem to be controllable by the user.
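
In the meantime, one generic workaround (my assumption, not a pySuStaIn feature) is to cap the thread pools of the underlying numerical libraries before NumPy is imported:

    import os

    # These caps must be set before numpy/pySuStaIn are imported to take effect
    os.environ["OMP_NUM_THREADS"] = "1"
    os.environ["OPENBLAS_NUM_THREADS"] = "1"
    os.environ["MKL_NUM_THREADS"] = "1"

    import numpy as np  # imported after the caps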

Forgive me if this isn't clear. Would be happy to provide greater detail!

As always, thanks for making this amazing library!

How to interpret the Positional Variance Diagram

Hi, Thanks for your great work!

I am working on your SuStaInWorkshop notebook tutorial. It plots positional variance diagrams to interpret the subtype progressions.

However, I am a little confused about how to read these diagrams.

  1. What do the different colours mean?
  2. What is the learned progression order for each subtype?
  3. Why, in the demo, does Biomarker 3 become severe quickly in subtype 1, while in subtype 2 it becomes abnormal far later?

(screenshot omitted)

Grateful for your help again.

Data Preparation Pipeline/Code

Hi, pySuStaIn is a great work, thanks for your effort!

I hope to use SuStaIn to subtype patients from our private Alzheimer's Disease structural MRI dataset.

However, I find it hard to implement the data preparation part (i.e. how to obtain z-scores from the raw MRI images).

Could you share your AD MRI data preparation code? or maybe provide some more detailed pipeline instructions of it?

(I checked the instructions provided in your Nat. Comm. paper, but found them too coarse to reproduce.)

Grateful for your help.

TypeError: 'int' object is not iterable in ZscoreSustain

Hi Leon, Peter, others,

I tried running SuStaIn on a sporadic AD dataset and it ran well for around 8 hours or so until it crashed with the following error. It seems like some corner-case scenario that doesn't occur very often. I would really appreciate your help in addressing this issue.

Splitting cluster 1 of 3
 + Resolving 2 cluster problem
 + Finding ML solution from hierarchical initialisation
- ML likelihood is [-4233.34529467]
Splitting cluster 2 of 3
 + Resolving 2 cluster problem
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-1-bce0b47cba4f> in <module>
     65                               dataset_name,False)
     66 
---> 67 sustain_input.run_sustain_algorithm()

~/anaconda3/lib/python3.8/site-packages/pySuStaIn/AbstractSustain.py in run_sustain_algorithm(self)
    142                 ml_sequence_mat_EM, \
    143                 ml_f_mat_EM,        \
--> 144                 ml_likelihood_mat_EM        = self._estimate_ml_sustain_model_nplus1_clusters(self.__sustainData, ml_sequence_prev_EM, ml_f_prev_EM) #self.__estimate_ml_sustain_model_nplus1_clusters(self.__data, ml_sequence_prev_EM, ml_f_prev_EM)
    145 
    146                 seq_init                    = ml_sequence_EM

~/anaconda3/lib/python3.8/site-packages/pySuStaIn/AbstractSustain.py in _estimate_ml_sustain_model_nplus1_clusters(self, sustainData, ml_sequence_prev, ml_f_prev)
    584 
    585                     print(' + Resolving 2 cluster problem')
--> 586                     this_ml_sequence_split, _, _, _, _, _ = self._find_ml_split(sustainData_i)
    587 
    588                     # Use the two subtype model combined with the other subtypes to

~/anaconda3/lib/python3.8/site-packages/pySuStaIn/AbstractSustain.py in _find_ml_split(self, sustainData)
    695 
    696         if ~isinstance(pool_output_list, list):
--> 697             pool_output_list                = list(pool_output_list)
    698 
    699         ml_sequence_mat                     = np.zeros((N_S, sustainData.getNumStages(), self.N_startpoints))

~/anaconda3/lib/python3.8/site-packages/pySuStaIn/AbstractSustain.py in _find_ml_split_iteration(self, sustainData, seed_num)
    740 
    741             temp_seq_init                   = self._initialise_sequence(sustainData)
--> 742             seq_init[s, :], _, _, _, _, _   = self._perform_em(temp_sustainData, temp_seq_init, [1])
    743 
    744         f_init                              = np.array([1.] * N_S) / float(N_S)

~/anaconda3/lib/python3.8/site-packages/pySuStaIn/AbstractSustain.py in _perform_em(self, sustainData, current_sequence, current_f)
    826             candidate_sequence,     \
    827             candidate_f,            \
--> 828             candidate_likelihood            = self._optimise_parameters(sustainData, current_sequence, current_f)
    829 
    830             HAS_converged                   = np.fabs((candidate_likelihood - current_likelihood) / max(candidate_likelihood, current_likelihood)) < 1e-6

~/anaconda3/lib/python3.8/site-packages/pySuStaIn/ZscoreSustain.py in _optimise_parameters(self, sustainData, S_init, f_init)
    237         p_perm_k_weighted                   = p_perm_k * f_val_mat
    238         p_perm_k_norm                       = p_perm_k_weighted / np.sum(p_perm_k_weighted, axis=(1,2), keepdims=True)
--> 239         f_opt                               = (np.squeeze(sum(sum(p_perm_k_norm))) / sum(sum(sum(p_perm_k_norm)))).reshape(N_S, 1, 1)
    240         f_val_mat                           = np.tile(f_opt, (1, N + 1, M))
    241         f_val_mat                           = np.transpose(f_val_mat, (2, 1, 0))

TypeError: 'int' object is not iterable

Thanks in advance.

Vikram

subtype_and_stage_individuals_newData performance issue

I was working on using a pre-trained SuStaIn model to predict subtypes in a new dataset using the function "subtype_and_stage_individuals_newData", and I noticed a couple of issues:

  1. The function does not work for single-patient data, i.e. an input of size 1 x M, where M is the number of input features. I solved the issue temporarily by replicating the patient data to create an input of size 2 x M for estimating the subtypes. But having it work for single-patient data would be useful.
  2. The function's computational time increases non-linearly as the number of patients increases. My test set consisted of N ~ 47,000 patients. An input of size N x M would have taken almost 29 hours to predict (I had to abort after a few hours). Instead, calling the function N times with inputs of size 2 x M took roughly 7 minutes. I measured the time taken to predict for N = 2 to 3000, and the computational time increases nonlinearly.
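
For reference, a sketch of the per-subject workaround from point 2. Here "model", "samples_sequence", "samples_f" and "N_samples" are stand-ins for a trained SuStaIn instance and its MCMC outputs, and X_new stands in for the real N x M test data:

    import numpy as np

    X_new = np.random.randn(10, 5)         # stand-in for the N x M test data
    results = []
    for i in range(X_new.shape[0]):
        pair = np.tile(X_new[i], (2, 1))   # duplicate the row to get a 2 x M input
        out = model.subtype_and_stage_individuals_newData(
            pair, samples_sequence, samples_f, N_samples)
        results.append(out)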

Problem when defining high number of biomarkers

Hello!

I'm trying to run pySuStaIn with my data (1129 observations and 38 biomarkers), but perhaps because of the high number of biomarkers, the algorithm does not move forward (even after 10 hours) past the print "Finding ML solution to 1 cluster problem". By inserting some prints into the code to debug it, I found that the "heavy" code is in _find_ml() in AbstractSustain.py, in particular these lines:

    partial_iter = partial(self._find_ml_iteration, sustainData)
    pool_output_list = self.pool.map(partial_iter, range(self.N_startpoints))
    if ~isinstance(pool_output_list, list):
        pool_output_list = list(pool_output_list)

I think the map is very slow: the execution hangs on list(pool_output_list).
Do you have any idea how to resolve this problem? I also tried generating simulated data (with 1129 observations and 38 biomarkers) but nothing happened.

Thank you in advance.

Idea: SuStaIn constraint with longitudinal measures

SuStaIn is a cool idea for extracting models from cross-sectional data, but one idea I have is: if my data is longitudinal, could I use it to constrain the set of possible progression models?

Can such a feature go into SuStaIn? Does it conceptually make sense?

We have lots of animal longitudinal developmental and interventional data which could be used to test this.

Ordinal Sustain Notebook

Hi,
Is there a notebook for reproducing the ordinal SuStaIn simulations/implementation (as mentioned in the ordinal SuStaIn article)?
The current notebooks don't support ordinal SuStaIn, and I am not sure how to apply it to an external dataset.

IndexError during CV

Hi all,

Wanted to submit a fix for an occasional error I get during CV.
The error is as follows:

Traceback (most recent call last):
  File "", line 176, in <module>
    CVIC, loglike_matrix     = sustain_input.cross_validate_sustain_model(test_idxs)
  File "AbstractSustain.py", line 294, in cross_validate_sustain_model
    sustainData_test                = self.__sustainData.reindex(indx_test)
  File "ZscoreSustain.py", line 54, in reindex
    return ZScoreSustainData(self.data[index,], self.__numStages)
IndexError: arrays used as indices must be of integer (or boolean) type

And my fix is simply to explicitly define indx_test as an array of integers in line 277 of AbstractSustain:

indx_test                       = (test_idxs[fold]).astype(int)

Would be interested to hear any theories as to why this error happens irregularly; this will happen with some models but not others running on identical versions of my notebook. In that section of my notebook, I follow the SuStaIn workshop essentially verbatim:

labels = sustain_data[label_column].values
cv = sklearn.model_selection.StratifiedKFold(n_splits=N_folds, shuffle=True, random_state=3)
cv_it = cv.split(sustain_data, labels)

# SuStaIn currently accepts ragged arrays, which will raise problems in the future.
# We'll have to update this in the future, but this will have to do for now
test_idxs = []
for train, test in cv_it:
    test_idxs.append(test)
test_idxs = np.array(test_idxs,dtype='object')

for i, (train_index, test_index) in enumerate(cv.split(sustain_data, labels)):
    print(f"Fold {i}:")
    print(f"  Train: index={train_index}")
    print(f"  Test:  index={test_index}")
# perform cross-validation and output the cross-validation information criterion and
# log-likelihood on the test set for each subtypes model and fold combination
CVIC, loglike_matrix     = sustain_input.cross_validate_sustain_model(test_idxs)

Thanks 😊

Two questions in ZscoreSustain

Dear SuStaIn friends,
I have a few questions I'd like to ask. First, in ZscoreSustain, do the z_vals values depend on the input data? If so, how should z_vals be set according to the input data? Second, should the input data contain only the z-scores of the patients, or must it contain both the z-scores of the patients and the z-scores of the healthy control group? Is there any difference between the two input methods, and which works better? Looking forward to your answer, thank you.

Parallel CV doesn't work (aka "Why do all my CV jobs run for fold 0 only???")

I was running cross-validation in parallel on a cluster using cross_validate_sustain_model() with argument select_fold set to the CV fold desired for each compute job.

I noticed that all 10 folds were returning results for fold 0 only.

The culprit is line 276, where the loop runs through range(Nfolds) (where Nfolds = len(select_fold)) rather than explicitly through the select_fold array itself.

Will send a PR to fix shortly, but wanted to raise this in case others have the same problem
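
A sketch of the kind of fix the PR will propose (hypothetical until merged; run_fold stands in for the per-fold CV body, and select_fold is assumed to be a list of fold indices):

    # Iterate the requested folds explicitly instead of over range(Nfolds)
    folds_to_run = select_fold if select_fold != [] else list(range(Nfolds))
    for fold in folds_to_run:
        run_fold(fold)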

Adding a colourbar to PVD plots

This is minor in the grand scheme of things, but it might be important when creating clear figures for e.g. publication. I've toyed with a few ideas so thought I'd raise an issue on this before unilaterally merging one option for others to use and save them some time.

The Problem

Adding a colourbar to the PVD for the mixture version is straightforward, as colour intensity equates to the certainty of that position. For the z-score version, however, the colour carries two dimensions of information. The intensity of the colour for a single z-score event equates to certainty (e.g. from pure white to pure red), but the colours also mix when different z-score events overlap (and they mix proportionally to their certainty). For example, if a single stage (for a single biomarker) has 50% certainty for z=1 and z=2, this square will be a 50:50 mix of red and magenta (both of which are themselves at 50% intensity). A single colourbar cannot (to me, at least) capture this.

The question is: if a colourbar is to be added, which information would it be most useful for it to capture?
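
To make the mixing above concrete, here is a purely illustrative computation of that red/magenta example; this is not the actual pySuStaIn blending code, just one way to reproduce the described behaviour:

    import numpy as np

    red     = np.array([1.0, 0.0, 0.0])   # colour for z = 1
    magenta = np.array([1.0, 0.0, 1.0])   # colour for z = 2
    white   = np.array([1.0, 1.0, 1.0])

    certainty = 0.5
    # each event's colour at 50% certainty: a linear blend towards white
    red_50     = certainty * red     + (1 - certainty) * white
    magenta_50 = certainty * magenta + (1 - certainty) * white

    # the cell is then a 50:50 mix of the two half-intensity colours
    cell = 0.5 * red_50 + 0.5 * magenta_50
    print(cell)   # [1.   0.5  0.75] -- neither a pure red nor a pure magenta shade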

Proposals

Here's a few variants I made for this. Other suggestions are welcome.

Simple Colourbar

Gets the point across, but doesn't integrate intensity/certainty or z-score mixing.

Intensity Colourbar

Highlights the difference in intensity, but not z-score mixing.

Mixing Colourbar

Highlights z-score mixing, but not intensity.


The point of this is to add something so others don't need to do this themselves. If no-one feels strongly, I'll just pick one after a week or so to integrate.

IndexError while running the SuStaIn Workshop file

Before running SuStaIn on my own data, I wanted to run it using the workshop file. Everything went smoothly except when plotting the positional variance diagrams (under "Evaluate subtypes" and "Choosing the optimal number of subtypes"). Python then throws the following error at me:

IndexError Traceback (most recent call last)
in
1 N_S_selected = 2
2
----> 3 pySuStaIn.ZscoreSustain._plot_sustain_model(sustain_input,samples_sequence,samples_f,M,subtype_order=(0,1))
4 _ = plt.suptitle('SuStaIn output')
5

~\xxx\pySuStaIn\ZscoreSustain.py in _plot_sustain_model(self, *args, **kwargs)
450
451 def _plot_sustain_model(self, *args, **kwargs):
--> 452 return ZscoreSustain.plot_positional_var(*args, Z_vals=self.Z_vals, **kwargs)
453
454 def subtype_and_stage_individuals_newData(self, data_new, samples_sequence, samples_f, N_samples):

~\xxx\pySuStaIn\notebooks\pySuStaIn\ZscoreSustain.py in plot_positional_var(samples_sequence, samples_f, n_samples, Z_vals, biomarker_labels, ml_f_EM, cval, subtype_order, biomarker_order, title_font_size, stage_font_size, stage_label, stage_rot, stage_interval, label_font_size, label_rot, cmap, biomarker_colours, figsize, separate_subtypes, save_path, save_kwargs)
622 # Shuffle vals according to subtype_order
623 # This defaults to previous method if custom order not given
--> 624 vals = temp_mean_f[subtype_order]
625
626 if n_samples != np.inf:

IndexError: too many indices for array

Could it be that there is something wrong with the dimensions of the array? When I get rid of ", subtype_order=(0,1)" (line 3), I get at least part of the output. The same applies when I want to plot the positional variance diagrams before cross-validation.

Any hint would be very welcome.

Kind regards

Fixing controls in GMM

Hi Neil, Leon, Others

I am trying to use Mixture SuStaIn with fixed controls in the GMM (i.e. without optimizing the controls' Gaussian), and I would like your opinion on whether what I am doing is okay.

In mixture_model/utils/fit_all_gmm_models, instead of calling the fit function, I am trying to call the fit_constrained function in gmm.py.

Is this okay? I see that all the necessary functions are in place for making use of this functionality. Is there a reason you do not provide access to this function through user-controllable options (e.g. an unfixed bug in the fit_constrained function)?

Do let me know.

Cheers,
Vikram

[Question] Can we discover subtypes in a training set, and use the discovered subtypes to subtype subjects of a test set?

Hi SuStaIn team!

I am trying to use SuStaIn with a train / test like approach, in which I have two dataset:

  • the training one, which I want to use to infer the summary subtypes (that is to say: subtype 1 is this sequence of abnormalities, subtype 2 is that sequence, etc.); I do not really worry about the actual subtyping of the individual subjects in this dataset. This is where I would run the run_sustain_algorithm method, if I'm correct.
  • the test one, which I would like to use as follows: given the subtypes discovered on the training set, I want to subtype (and maybe stage, even though I'm less interested in this) these new subjects with respect to those subtypes. Intuitively, going back to the notation of the Young et al. (2018) paper, if (f_c, S_c)_c are the subtypes (and their prevalences) discovered at the training step, then given some new X (the test data), I would want to evaluate P(X | S_c) for each c, and infer the best subtype for each of my new subjects from this mixture of subtypes.
    (equation image omitted)

So it seems to me that this makes sense from a methodological point of view (but I could be mistaken 😅).

Now I don't seem to find exactly how I would proceed to perform this last step, given the output from the first step. I went back to the notebook from the workshop (which I had followed some time ago), and it looks to me like the presented cross_validate_sustain_model mainly focuses on cross-validation metrics, rather than outputting the subtypes of the "test" subjects.

I am sorry if this is treated somewhere that I have missed, and don't hesitate if the question is somewhat unclear, I'm happy to rephrase or go more into details 🙂

Cheers,
Nemo
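
One possible route is the subtype_and_stage_individuals_newData method, whose signature appears in a traceback elsewhere on this page. A usage sketch, with all variable names as stand-ins and the returned values deliberately left unspecified (check the source for the exact return tuple):

    # Hypothetical sketch only -- not a confirmed recipe from the maintainers.
    N_samples = 1000   # number of MCMC samples to evaluate (placeholder)
    outputs = sustain_input.subtype_and_stage_individuals_newData(
        X_test,             # new subjects x biomarkers, prepared like the training data
        samples_sequence,   # MCMC samples from the training run
        samples_f,          # subtype fractions from the training run
        N_samples)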

Question on Using pySuStaIn on ADNI

Hi, Thanks for releasing pySuStaIn !

I am interested in AD pathology and am currently trying to reproduce your subtyping results on the ADNI dataset (i.e. the results reported in your Nat. Comm. 2018 paper).

I am confused about two problems:

  1. Some subjects in the ADNI 3T and 1.5T datasets have longitudinal data (datapoints collected at different times). So, how did you decide which datapoint to use?
  2. Can you provide the positional variance diagrams (containing all brain regions) learned by SuStaIn on the ADNI 3T/1.5T datasets? The qualitative disease progression results in your paper are rather too coarse to understand the detailed pathology.

Grateful for your help 😁

Parallelization fails -- TypeError: cannot pickle '_abc._abc_data' object

Hi pySustain team,

I am running into the following error when setting parallelization = True in ZscoreSustain: "TypeError: cannot pickle '_abc._abc_data' object". This error is raised ~60 times, always tracing to pickle.py or _dill.py within the pySuStaIn environment. This has happened using both Jupyter Notebook and Spyder; however, I can run SuStaIn fine when parallelization is set to False.

Error Traceback:
File "[pysustain package location]/lib/python3.10/site-packages/spyder_kernels/py3compat.py", line 356, in compat_exec
  exec(code, globals, locals)

File [notebook], line 231, in <module>
  prob_subtype_stage  = sustain_input.run_sustain_algorithm()

File "[pysustain package location]/lib/python3.10/site-packages/pySuStaIn/AbstractSustain.py", line 164, in run_sustain_algorithm
  ml_likelihood_mat_EM        = self._estimate_ml_sustain_model_nplus1_clusters(self.__sustainData, ml_sequence_prev_EM, ml_f_prev_EM) #self.__estimate_ml_sustain_model_nplus1_clusters(self.__data, ml_sequence_prev_EM, ml_f_prev_EM)

File "[pysustain package location]/lib/python3.10/site-packages/pySuStaIn/AbstractSustain.py", line 615, in _estimate_ml_sustain_model_nplus1_clusters
  ml_likelihood_mat               = self._find_ml(sustainData)

File "[pysustain package location]/lib/python3.10/site-packages/pySuStaIn/AbstractSustain.py", line 704, in _find_ml
  pool_output_list                    = self.pool.map(partial_iter, seed_sequences.spawn(self.N_startpoints))

File "[pysustain package location]/lib/python3.10/site-packages/pathos/multiprocessing.py", line 135, in map
  return _pool.map(star(f), zip(*args)) # chunksize

File "[pysustain package location]/lib/python3.10/site-packages/multiprocess/pool.py", line 367, in map
  return self._map_async(func, iterable, mapstar, chunksize).get()

File "[pysustain package location]/lib/python3.10/site-packages/multiprocess/pool.py", line 774, in get
  raise self._value

File "[pysustain package location]/lib/python3.10/site-packages/multiprocess/pool.py", line 540, in _handle_tasks
  put(task)

File "[pysustain package location]/lib/python3.10/site-packages/multiprocess/connection.py", line 214, in send
  self._send_bytes(_ForkingPickler.dumps(obj))

File "[pysustain package location]/lib/python3.10/site-packages/multiprocess/reduction.py", line 54, in dumps
  cls(buf, protocol, *args, **kwds).dump(obj)

File "[pysustain package location]/lib/python3.10/site-packages/dill/_dill.py", line 394, in dump
  StockPickler.dump(self, obj)

File "[pysustain package location]/lib/python3.10/pickle.py", line 487, in dump
  self.save(obj)

File "[pysustain package location]/lib/python3.10/site-packages/dill/_dill.py", line 388, in save
  StockPickler.save(self, obj, save_persistent_id)

File "[pysustain package location]/lib/python3.10/pickle.py", line 560, in save
  f(self, obj)  # Call unbound method with explicit self

File "[pysustain package location]/lib/python3.10/pickle.py", line 902, in save_tuple
  save(element)

File "[pysustain package location]/lib/python3.10/site-packages/dill/_dill.py", line 388, in save
  StockPickler.save(self, obj, save_persistent_id)

File "[pysustain package location]/lib/python3.10/pickle.py", line 560, in save
  f(self, obj)  # Call unbound method with explicit self

File "[pysustain package location]/lib/python3.10/pickle.py", line 887, in save_tuple
  save(element)

File "[pysustain package location]/lib/python3.10/site-packages/dill/_dill.py", line 388, in save
  StockPickler.save(self, obj, save_persistent_id)

File "[pysustain package location]/lib/python3.10/pickle.py", line 560, in save
  f(self, obj)  # Call unbound method with explicit self

File "[pysustain package location]/lib/python3.10/pickle.py", line 887, in save_tuple
  save(element)

File "[pysustain package location]/lib/python3.10/site-packages/dill/_dill.py", line 388, in save
  StockPickler.save(self, obj, save_persistent_id)

File "[pysustain package location]/lib/python3.10/pickle.py", line 560, in save
  f(self, obj)  # Call unbound method with explicit self

File "[pysustain package location]/lib/python3.10/site-packages/dill/_dill.py", line 1824, in save_function
  _save_with_postproc(pickler, (_create_function, (

File "[pysustain package location]/lib/python3.10/site-packages/dill/_dill.py", line 1089, in _save_with_postproc
  pickler.save_reduce(*reduction)

File "[pysustain package location]/lib/python3.10/pickle.py", line 692, in save_reduce
  save(args)

File "[pysustain package location]/lib/python3.10/site-packages/dill/_dill.py", line 388, in save
  StockPickler.save(self, obj, save_persistent_id)

File "[pysustain package location]/lib/python3.10/pickle.py", line 560, in save
  f(self, obj)  # Call unbound method with explicit self

File "[pysustain package location]/lib/python3.10/pickle.py", line 887, in save_tuple
  save(element)

File "[pysustain package location]/lib/python3.10/site-packages/dill/_dill.py", line 388, in save
  StockPickler.save(self, obj, save_persistent_id)

File "[pysustain package location]/lib/python3.10/pickle.py", line 603, in save
  self.save_reduce(obj=obj, *rv)

File "[pysustain package location]/lib/python3.10/pickle.py", line 692, in save_reduce
  save(args)

File "[pysustain package location]/lib/python3.10/site-packages/dill/_dill.py", line 388, in save
  StockPickler.save(self, obj, save_persistent_id)

File "[pysustain package location]/lib/python3.10/pickle.py", line 560, in save
  f(self, obj)  # Call unbound method with explicit self

File "[pysustain package location]/lib/python3.10/pickle.py", line 887, in save_tuple
  save(element)

File "[pysustain package location]/lib/python3.10/site-packages/dill/_dill.py", line 388, in save
  StockPickler.save(self, obj, save_persistent_id)

File "[pysustain package location]/lib/python3.10/pickle.py", line 560, in save
  f(self, obj)  # Call unbound method with explicit self

File "[pysustain package location]/lib/python3.10/site-packages/dill/_dill.py", line 1427, in save_instancemethod0
  pickler.save_reduce(MethodType, (obj.__func__, obj.__self__), obj=obj)

File "[pysustain package location]/lib/python3.10/pickle.py", line 692, in save_reduce
  save(args)

File "[pysustain package location]/lib/python3.10/site-packages/dill/_dill.py", line 388, in save
  StockPickler.save(self, obj, save_persistent_id)

File "[pysustain package location]/lib/python3.10/pickle.py", line 560, in save
  f(self, obj)  # Call unbound method with explicit self

File "[pysustain package location]/lib/python3.10/pickle.py", line 887, in save_tuple
  save(element)

File "[pysustain package location]/lib/python3.10/site-packages/dill/_dill.py", line 388, in save
  StockPickler.save(self, obj, save_persistent_id)

File "[pysustain package location]/lib/python3.10/pickle.py", line 560, in save
  f(self, obj)  # Call unbound method with explicit self

File "[pysustain package location]/lib/python3.10/site-packages/dill/_dill.py", line 1824, in save_function
  _save_with_postproc(pickler, (_create_function, (

File "[pysustain package location]/lib/python3.10/site-packages/dill/_dill.py", line 1084, in _save_with_postproc
  pickler._batch_setitems(iter(source.items()))

File "[pysustain package location]/lib/python3.10/pickle.py", line 998, in _batch_setitems
  save(v)

File "[pysustain package location]/lib/python3.10/site-packages/dill/_dill.py", line 388, in save
  StockPickler.save(self, obj, save_persistent_id)

File "[pysustain package location]/lib/python3.10/pickle.py", line 560, in save
  f(self, obj)  # Call unbound method with explicit self

File "[pysustain package location]/lib/python3.10/site-packages/dill/_dill.py", line 1698, in save_type
  _save_with_postproc(pickler, (_create_type, (

File "[pysustain package location]/lib/python3.10/site-packages/dill/_dill.py", line 1070, in _save_with_postproc
  pickler.save_reduce(*reduction, obj=obj)

File "[pysustain package location]/lib/python3.10/pickle.py", line 692, in save_reduce
  save(args)

File "[pysustain package location]/lib/python3.10/site-packages/dill/_dill.py", line 388, in save
  StockPickler.save(self, obj, save_persistent_id)

File "[pysustain package location]/lib/python3.10/pickle.py", line 560, in save
  f(self, obj)  # Call unbound method with explicit self

File "[pysustain package location]/lib/python3.10/pickle.py", line 902, in save_tuple
  save(element)

File "[pysustain package location]/lib/python3.10/site-packages/dill/_dill.py", line 388, in save
  StockPickler.save(self, obj, save_persistent_id)

File "[pysustain package location]/lib/python3.10/pickle.py", line 560, in save
  f(self, obj)  # Call unbound method with explicit self

File "[pysustain package location]/lib/python3.10/site-packages/dill/_dill.py", line 1186, in save_module_dict
  StockPickler.save_dict(pickler, obj)

File "[pysustain package location]/lib/python3.10/pickle.py", line 972, in save_dict
  self._batch_setitems(obj.items())

File "[pysustain package location]/lib/python3.10/pickle.py", line 998, in _batch_setitems
  save(v)

File "[pysustain package location]/lib/python3.10/site-packages/dill/_dill.py", line 388, in save
  StockPickler.save(self, obj, save_persistent_id)

File "[pysustain package location]/lib/python3.10/pickle.py", line 578, in save
  rv = reduce(self.proto)

TypeError: cannot pickle '_abc._abc_data' object
System and package info:
  • MacBook Pro 2021
  • Python 3.10.6

Packages:
alabaster @ file:///home/ktietz/src/ci/alabaster_1611921544520/work
anyio==3.6.2
applaunchservices @ file:///private/var/folders/sy/f16zz6x50xz3113nwtb9bvq00000gp/T/abs_96v71vcny2/croots/recipe/applaunchservices_1661854626389/work
appnope==0.1.3
argon2-cffi==21.3.0
argon2-cffi-bindings==21.2.0
arrow @ file:///private/var/folders/sy/f16zz6x50xz3113nwtb9bvq00000gp/T/abs_bc9ine8jfo/croot/arrow_1666726871970/work
astroid @ file:///private/var/folders/sy/f16zz6x50xz3113nwtb9bvq00000gp/T/abs_9fsa1cbbec/croots/recipe/astroid_1659023133872/work
asttokens==2.0.8
atomicwrites==1.4.0
attrs==22.1.0
autopep8 @ file:///opt/conda/conda-bld/autopep8_1650463822033/work
awkde @ git+https://github.com/noxtoby/awkde.git@1c31e55fe54c0cad80ab423a9605fc9ddfb2614c
Babel==2.10.3
backcall @ file:///home/ktietz/src/ci/backcall_1611930011877/work
beautifulsoup4 @ file:///private/var/folders/sy/f16zz6x50xz3113nwtb9bvq00000gp/T/abs_croot-cdiouih5/beautifulsoup4_1650462164803/work
binaryornot @ file:///tmp/build/80754af9/binaryornot_1617751525010/work
black @ file:///private/var/folders/sy/f16zz6x50xz3113nwtb9bvq00000gp/T/abs_d0nhgmyc3l/croots/recipe/black_1660237813406/work
bleach==5.0.1
brotlipy==0.7.0
certifi @ file:///private/var/folders/sy/f16zz6x50xz3113nwtb9bvq00000gp/T/abs_0ek9yztvu3/croot/certifi_1665076692562/work/certifi
cffi @ file:///private/var/folders/sy/f16zz6x50xz3113nwtb9bvq00000gp/T/abs_62rp5d8fd4/croots/recipe/cffi_1659598655556/work
chardet @ file:///Users/builder/ci_310/chardet_1642531418028/work
charset-normalizer==2.1.1
click @ file:///opt/concourse/worker/volumes/live/2d66025a-4d79-47c4-43be-6220928b6c82/volume/click_1646056610594/work
cloudpickle @ file:///tmp/build/80754af9/cloudpickle_1632508026186/work
colorama @ file:///private/var/folders/sy/f16zz6x50xz3113nwtb9bvq00000gp/T/abs_b8ecd5af-5e60-48b8-80ac-92164ecb9b9bxkf0tkfp/croots/recipe/colorama_1657009097162/work
contourpy==1.0.5
cookiecutter @ file:///opt/conda/conda-bld/cookiecutter_1649151442564/work
cryptography @ file:///private/var/folders/sy/f16zz6x50xz3113nwtb9bvq00000gp/T/abs_3evwafgyg8/croot/cryptography_1665612651044/work
cycler==0.11.0
debugpy==1.6.3
decorator @ file:///opt/conda/conda-bld/decorator_1643638310831/work
defusedxml @ file:///tmp/build/80754af9/defusedxml_1615228127516/work
diff-match-patch @ file:///Users/ktietz/demo/mc3/conda-bld/diff-match-patch_1630511840874/work
dill @ file:///private/var/folders/sy/f16zz6x50xz3113nwtb9bvq00000gp/T/abs_88dxe9g1aq/croot/dill_1667919544494/work
docutils @ file:///private/var/folders/sy/f16zz6x50xz3113nwtb9bvq00000gp/T/abs_10cfb287-0327-45ef-a38e-53dffd30cef1nwpvy20e/croots/recipe/docutils_1657175439973/work
entrypoints @ file:///opt/concourse/worker/volumes/live/5eb4850e-dcbc-41ad-5f22-922bac778f70/volume/entrypoints_1649926457041/work
executing==1.1.1
fastjsonschema @ file:///private/var/folders/sy/f16zz6x50xz3113nwtb9bvq00000gp/T/abs_b5c1gee32t/croots/recipe/python-fastjsonschema_1661368622875/work
flake8 @ file:///opt/conda/conda-bld/flake8_1648129545443/work
fonttools==4.38.0
future==0.18.2
idna @ file:///private/var/folders/sy/f16zz6x50xz3113nwtb9bvq00000gp/T/abs_00jf0h4zbt/croot/idna_1666125573348/work
imagesize @ file:///private/var/folders/sy/f16zz6x50xz3113nwtb9bvq00000gp/T/abs_4a6ed1be-fe30-4d6a-91d4-f867600caa0be5_dxzvt/croots/recipe/imagesize_1657179500955/work
importlib-metadata @ file:///opt/concourse/worker/volumes/live/a8740f82-0523-4b08-5bb5-afa0c929f5e0/volume/importlib-metadata_1648562424930/work
inflection==0.5.1
intervaltree @ file:///Users/ktietz/demo/mc3/conda-bld/intervaltree_1630511889664/work
ipykernel==6.16.2
ipython==8.5.0
ipython-genutils @ file:///tmp/build/80754af9/ipython_genutils_1606773439826/work
isort @ file:///tmp/build/80754af9/isort_1628603791788/work
jedi @ file:///opt/concourse/worker/volumes/live/18b71546-5bde-4add-72d1-7d16b76f0f7a/volume/jedi_1644315243726/work
jellyfish @ file:///opt/concourse/worker/volumes/live/d045b25f-e3af-4008-4edc-a00aeffb8b33/volume/jellyfish_1647962558521/work
Jinja2 @ file:///private/var/folders/sy/f16zz6x50xz3113nwtb9bvq00000gp/T/abs_6adj7x0ejx/croot/jinja2_1666908137966/work
jinja2-time @ file:///opt/conda/conda-bld/jinja2-time_1649251842261/work
joblib==1.2.0
json5==0.9.10
jsonschema @ file:///private/var/folders/sy/f16zz6x50xz3113nwtb9bvq00000gp/T/abs_d832da7jx3/croots/recipe/jsonschema_1663375475386/work
jupyter-server==1.21.0
jupyter_client==7.4.4
jupyter_core @ file:///private/var/folders/sy/f16zz6x50xz3113nwtb9bvq00000gp/T/abs_fc_0us_ta7/croot/jupyter_core_1668084443574/work
jupyterlab==3.5.0
jupyterlab-pygments==0.2.2
jupyterlab_server==2.16.1
jupyterthemes==0.20.0
kde-ebm @ git+https://github.com/ucl-pond/kde_ebm.git@26ee48f7f723a82e4ff740e59b9745aa7def3daa
keyring @ file:///Users/builder/ci_310/keyring_1642616528347/work
kiwisolver==1.4.4
lazy-object-proxy @ file:///Users/builder/ci_310/lazy-object-proxy_1642533824465/work
lesscpy==0.15.1
lxml @ file:///private/var/folders/sy/f16zz6x50xz3113nwtb9bvq00000gp/T/abs_1902c961-4bd2-4871-a3c5-70b7317a6521kpj7nz2o/croots/recipe/lxml_1657545138937/work
MarkupSafe @ file:///private/var/folders/sy/f16zz6x50xz3113nwtb9bvq00000gp/T/abs_d4a9444f-bd4c-4043-b47d-cede33979b0fve7bm42r/croots/recipe/markupsafe_1654597878200/work
matplotlib==3.6.0
matplotlib-inline @ file:///private/var/folders/sy/f16zz6x50xz3113nwtb9bvq00000gp/T/abs_9ddl71oqte/croots/recipe/matplotlib-inline_1662014471815/work
mccabe @ file:///opt/conda/conda-bld/mccabe_1644221741721/work
mistune==2.0.4
multiprocess==0.70.14
mypy-extensions==0.4.3
nbclassic==0.4.5
nbclient==0.7.0
nbconvert==7.2.5
nbformat==5.7.0
nest-asyncio==1.5.6
notebook==6.5.1
notebook_shim==0.2.0
numpy==1.23.4
numpydoc @ file:///private/var/folders/sy/f16zz6x50xz3113nwtb9bvq00000gp/T/abs_adnyzxppoz/croot/numpydoc_1668085907252/work
p2j==1.3.2
packaging @ file:///tmp/build/80754af9/packaging_1637314298585/work
pandas==1.5.1
pandocfilters @ file:///opt/conda/conda-bld/pandocfilters_1643405455980/work
parso @ file:///opt/conda/conda-bld/parso_1641458642106/work
pathos==0.3.0
pathspec @ file:///private/var/folders/sy/f16zz6x50xz3113nwtb9bvq00000gp/T/abs_e2t1r2kdq7/croots/recipe/pathspec_1659627124303/work
pexpect @ file:///tmp/build/80754af9/pexpect_1605563209008/work
pickleshare @ file:///tmp/build/80754af9/pickleshare_1606932040724/work
Pillow==9.2.0
platformdirs @ file:///private/var/folders/sy/f16zz6x50xz3113nwtb9bvq00000gp/T/abs_7fs8_2xgrm/croots/recipe/platformdirs_1662711383474/work
pluggy @ file:///opt/concourse/worker/volumes/live/8277900c-164a-49c8-6f2a-f55c3c0154be/volume/pluggy_1648042581708/work
ply==3.11
pox==0.3.2
poyo @ file:///tmp/build/80754af9/poyo_1617751526755/work
ppft==1.7.6.6
prometheus-client==0.15.0
prompt-toolkit==3.0.31
psutil==5.9.3
ptyprocess @ file:///tmp/build/80754af9/ptyprocess_1609355006118/work/dist/ptyprocess-0.7.0-py2.py3-none-any.whl
pure-eval==0.2.2
pybind11==2.10.0
pycodestyle @ file:///tmp/build/80754af9/pycodestyle_1636635402688/work
pycparser @ file:///tmp/build/80754af9/pycparser_1636541352034/work
pydocstyle @ file:///tmp/build/80754af9/pydocstyle_1621600989141/work
pyflakes @ file:///tmp/build/80754af9/pyflakes_1636644436481/work
Pygments==2.13.0
pylint @ file:///private/var/folders/sy/f16zz6x50xz3113nwtb9bvq00000gp/T/abs_e75_4ydew9/croots/recipe/pylint_1659110352634/work
pyls-spyder==0.4.0
pyobjc-core @ file:///private/var/folders/sy/f16zz6x50xz3113nwtb9bvq00000gp/T/abs_e7giy3a869/croots/recipe/pyobjc-core_1661848172499/work
pyobjc-framework-Cocoa @ file:///private/var/folders/sy/f16zz6x50xz3113nwtb9bvq00000gp/T/abs_4c2umern3y/croots/recipe/pyobjc-framework-cocoa_1661850714385/work
pyobjc-framework-CoreServices @ file:///private/var/folders/sy/f16zz6x50xz3113nwtb9bvq00000gp/T/abs_4717m_ngol/croots/recipe/pyobjc-framework-coreservices_1661853392396/work
pyobjc-framework-FSEvents @ file:///private/var/folders/sy/f16zz6x50xz3113nwtb9bvq00000gp/T/abs_5atkr691rb/croots/recipe/pyobjc-framework-fsevents_1661852390555/work
pyOpenSSL @ file:///opt/conda/conda-bld/pyopenssl_1643788558760/work
pyparsing @ file:///private/var/folders/sy/f16zz6x50xz3113nwtb9bvq00000gp/T/abs_3a17y2delq/croots/recipe/pyparsing_1661452538853/work
PyQt5-sip==12.11.0
pyrsistent==0.18.1
PySocks @ file:///Users/builder/ci_310/pysocks_1642536366386/work
pySuStaIn @ git+https://github.com/ucl-pond/pySuStaIn@564f07617a2a11477a18aec0b24d5d80825b0371
python-dateutil @ file:///tmp/build/80754af9/python-dateutil_1626374649649/work
python-lsp-black @ file:///private/var/folders/sy/f16zz6x50xz3113nwtb9bvq00000gp/T/abs_14xl6hg757/croots/recipe/python-lsp-black_1661852036282/work
python-lsp-jsonrpc==1.0.0
python-lsp-server @ file:///private/var/folders/sy/f16zz6x50xz3113nwtb9bvq00000gp/T/abs_6cu9im5n5w/croots/recipe/python-lsp-server_1661813818984/work
python-slugify @ file:///tmp/build/80754af9/python-slugify_1620405669636/work
pytz==2022.5
PyYAML==6.0
pyzmq==24.0.1
QDarkStyle @ file:///tmp/build/80754af9/qdarkstyle_1617386714626/work
qstylizer @ file:///tmp/build/80754af9/qstylizer_1617713584600/work/dist/qstylizer-0.1.10-py2.py3-none-any.whl
QtAwesome @ file:///tmp/build/80754af9/qtawesome_1637160816833/work
qtconsole @ file:///private/var/folders/sy/f16zz6x50xz3113nwtb9bvq00000gp/T/abs_552cqm7spz/croots/recipe/qtconsole_1662018258355/work
QtPy @ file:///private/var/folders/sy/f16zz6x50xz3113nwtb9bvq00000gp/T/abs_4e5ppuhz0f/croots/recipe/qtpy_1662014536017/work
requests @ file:///private/var/folders/sy/f16zz6x50xz3113nwtb9bvq00000gp/T/abs_516b78ce-034d-4395-b9b5-1d78c2847384qtnol99l/croots/recipe/requests_1657734628886/work
rope @ file:///opt/conda/conda-bld/rope_1643788605236/work
Rtree @ file:///Users/builder/ci_310/rtree_1642537064369/work
scikit-learn==1.1.3
scipy==1.9.3
seaborn==0.12.1
Send2Trash==1.8.0
sip @ file:///private/var/folders/sy/f16zz6x50xz3113nwtb9bvq00000gp/T/abs_88z1zrsfrf/croots/recipe/sip_1659012373083/work
six @ file:///tmp/build/80754af9/six_1644875935023/work
sklearn==0.0
sniffio==1.3.0
snowballstemmer @ file:///tmp/build/80754af9/snowballstemmer_1637937080595/work
sortedcontainers @ file:///tmp/build/80754af9/sortedcontainers_1623949099177/work
soupsieve @ file:///private/var/folders/sy/f16zz6x50xz3113nwtb9bvq00000gp/T/abs_14fb2zs6e3/croot/soupsieve_1666296397588/work
Sphinx @ file:///private/var/folders/sy/f16zz6x50xz3113nwtb9bvq00000gp/T/abs_5d9f8d69-b80c-4ca1-8876-1698c70b1faeqe461tx8/croots/recipe/sphinx_1657784127805/work
sphinxcontrib-applehelp @ file:///home/ktietz/src/ci/sphinxcontrib-applehelp_1611920841464/work
sphinxcontrib-devhelp @ file:///home/ktietz/src/ci/sphinxcontrib-devhelp_1611920923094/work
sphinxcontrib-htmlhelp @ file:///tmp/build/80754af9/sphinxcontrib-htmlhelp_1623945626792/work
sphinxcontrib-jsmath @ file:///home/ktietz/src/ci/sphinxcontrib-jsmath_1611920942228/work
sphinxcontrib-qthelp @ file:///home/ktietz/src/ci/sphinxcontrib-qthelp_1611921055322/work
sphinxcontrib-serializinghtml @ file:///tmp/build/80754af9/sphinxcontrib-serializinghtml_1624451540180/work
spyder @ file:///private/var/folders/sy/f16zz6x50xz3113nwtb9bvq00000gp/T/abs_97gv8v17po/croots/recipe/spyder_1663056808858/work
spyder-kernels @ file:///private/var/folders/sy/f16zz6x50xz3113nwtb9bvq00000gp/T/abs_1c7pyd81si/croots/recipe/spyder-kernels_1662457889999/work
stack-data==0.5.1
tdt==0.5.4
terminado==0.17.0
text-unidecode @ file:///Users/ktietz/demo/mc3/conda-bld/text-unidecode_1629401354553/work
textdistance @ file:///tmp/build/80754af9/textdistance_1612461398012/work
threadpoolctl==3.1.0
three-merge @ file:///tmp/build/80754af9/three-merge_1607553261110/work
tinycss @ file:///tmp/build/80754af9/tinycss_1617713798712/work
tinycss2 @ file:///private/var/folders/sy/f16zz6x50xz3113nwtb9bvq00000gp/T/abs_56dshjmms6/croot/tinycss2_1668168824483/work
toml @ file:///tmp/build/80754af9/toml_1616166611790/work
tomli @ file:///private/var/folders/sy/f16zz6x50xz3113nwtb9bvq00000gp/T/abs_90762ba4-f339-47e8-bd29-416854a59b233d27hku_/croots/recipe/tomli_1657175507767/work
tomlkit @ file:///private/var/folders/sy/f16zz6x50xz3113nwtb9bvq00000gp/T/abs_5fgtm9if1m/croots/recipe/tomlkit_1658946891645/work
tornado @ file:///private/var/folders/sy/f16zz6x50xz3113nwtb9bvq00000gp/T/abs_1fimz6o0gc/croots/recipe/tornado_1662061695695/work
tqdm==4.64.1
traitlets==5.5.0
typing_extensions @ file:///private/var/folders/sy/f16zz6x50xz3113nwtb9bvq00000gp/T/abs_ff5_5nqr6l/croots/recipe/typing_extensions_1659638832447/work
ujson @ file:///private/var/folders/sy/f16zz6x50xz3113nwtb9bvq00000gp/T/abs_cf44fbd5-5db0-48cf-86c4-c8d4e74d1cbbwhgckc99/croots/recipe/ujson_1657544919410/work
Unidecode @ file:///tmp/build/80754af9/unidecode_1614712377438/work
urllib3 @ file:///private/var/folders/sy/f16zz6x50xz3113nwtb9bvq00000gp/T/abs_7f7kb5tudl/croot/urllib3_1666298941688/work
watchdog @ file:///Users/builder/ci_310/watchdog_1642516765439/work
wcwidth @ file:///Users/ktietz/demo/mc3/conda-bld/wcwidth_1629357192024/work
webencodings==0.5.1
websocket-client==1.4.1
whatthepatch @ file:///private/var/folders/sy/f16zz6x50xz3113nwtb9bvq00000gp/T/abs_0aabmq0ph3/croots/recipe/whatthepatch_1661795995892/work
wrapt @ file:///private/var/folders/sy/f16zz6x50xz3113nwtb9bvq00000gp/T/abs_1ade1f68-8354-4db8-830b-ff3072015779vd_2hm7k/croots/recipe/wrapt_1657814407132/work
wurlitzer @ file:///Users/builder/ci_310/wurlitzer_1642539193810/work
yapf @ file:///tmp/build/80754af9/yapf_1615749224965/work
zipp @ file:///private/var/folders/sy/f16zz6x50xz3113nwtb9bvq00000gp/T/abs_b279673d-f037-44c7-8773-c5a6b6f51037d3wfr9cq/croots/recipe/zipp_1652341773612/work

Thank you for your help!
Katrina

Out of bounds issue in: self._optimise_parameters()

Hi all,
After I initialize the ZscoreSustain model, I run into the following index-out-of-bounds exception:

Finding ML solution to 1 cluster problem
/camhpc/mspaths/data/dev/pySuStaIn/pySuStaIn/ZscoreSustain.py:238: RuntimeWarning: invalid value encountered in true_divide
  p_perm_k_norm                       = p_perm_k_weighted / np.sum(p_perm_k_weighted, axis=(1,2), keepdims=True)
Traceback (most recent call last):
  File "try0_mspaths_sustain.py", line 151, in <module>
    sustain_input.run_sustain_algorithm()
  File "/camhpc/mspaths/data/dev/pySuStaIn/pySuStaIn/AbstractSustain.py", line 144, in run_sustain_algorithm
    ml_likelihood_mat_EM        = self._estimate_ml_sustain_model_nplus1_clusters(self.__sustainData, ml_sequence_prev_EM, ml_f_prev_EM) #self.__estimate_ml_sustain_model_nplus1_clusters(self.__data, ml_sequence_prev_EM, ml_f_prev_EM)
  File "/camhpc/mspaths/data/dev/pySuStaIn/pySuStaIn/AbstractSustain.py", line 552, in _estimate_ml_sustain_model_nplus1_clusters
    ml_likelihood_mat               = self._find_ml(sustainData)
  File "/camhpc/mspaths/data/dev/pySuStaIn/pySuStaIn/AbstractSustain.py", line 643, in _find_ml
    pool_output_list                = list(pool_output_list)
  File "/camhpc/mspaths/data/dev/pySuStaIn/pySuStaIn/AbstractSustain.py", line 676, in _find_ml_iteration
    _                               = self._perform_em(sustainData, seq_init, f_init)
  File "/camhpc/mspaths/data/dev/pySuStaIn/pySuStaIn/AbstractSustain.py", line 828, in _perform_em
    candidate_likelihood            = self._optimise_parameters(sustainData, current_sequence, current_f)
  File "/camhpc/mspaths/data/dev/pySuStaIn/pySuStaIn/ZscoreSustain.py", line 304, in _optimise_parameters
    this_S                      = this_S[0, :]
IndexError: index 0 is out of bounds for axis 0 with size 0

My current ZscoreSustain class initialization is as follows:

# Start pySuStaIn
import os
import numpy as np
from pySuStaIn.ZscoreSustain import ZscoreSustain

N = 4  # number of biomarkers
SuStaInLabels = ['Bio1', 'Bio2', 'Bio3', 'Bio4']  # biomarker labels

unt_data = np.vstack((biom_1, biom_2, biom_3, biom_4))  # each biomarker already z-scored
data = np.transpose(unt_data)  # data.shape returns (5123, 4)
Z_vals = np.array([[1, 2, 3]] * N)  # z-score stage thresholds; Z_vals.shape returns (4, 3)
Z_max = np.array([np.max(biom_1), np.max(biom_2),
                  np.max(biom_3), np.max(biom_4)])  # Z_max.shape returns (4,)


# Snippet of my prepared data for SuStaIn:
N_S_gt = 3 # Number of ground truth subtypes
N_startpoints = 10
N_S_max = N_S_gt+1
N_iterations_MCMC = int(1e4)
output_folder = os.path.join(os.path.dirname(__file__), 'rp_mspaths_sim')
if os.path.isdir(output_folder) is False:
    os.mkdir(output_folder)
dataset_name = 'sim'
sustain_input = ZscoreSustain(data,
                              Z_vals,
                              Z_max,
                              SuStaInLabels,
                              N_startpoints,
                              N_S_max,
                              N_iterations_MCMC,
                              output_folder,
                              dataset_name,
                              False)
sustain_input.run_sustain_algorithm()

After spending some time debugging, I believe the issue is related to the following line returning an array of NaNs:
f_opt = (np.squeeze(sum(sum(p_perm_k_norm))) / sum(sum(sum(p_perm_k_norm)))).reshape(N_S, 1, 1)

    def _optimise_parameters(self, sustainData, S_init, f_init):
        # Optimise the parameters of the SuStaIn model

        M                                   = sustainData.getNumSamples()   #data_local.shape[0]
        N_S                                 = S_init.shape[0]
        N                                   = self.stage_zscore.shape[1]

        S_opt                               = S_init.copy()  # have to copy or changes will be passed to S_init
        f_opt                               = np.array(f_init).reshape(N_S, 1, 1)
        f_val_mat                           = np.tile(f_opt, (1, N + 1, M))
        f_val_mat                           = np.transpose(f_val_mat, (2, 1, 0))
        p_perm_k                            = np.zeros((M, N + 1, N_S))

        for s in range(N_S):
            p_perm_k[:, :, s]               = self._calculate_likelihood_stage(sustainData, S_opt[s])

        p_perm_k_weighted                   = p_perm_k * f_val_mat
        p_perm_k_norm                       = p_perm_k_weighted / np.sum(p_perm_k_weighted, axis=(1,2), keepdims=True)
        f_opt                               = (np.squeeze(sum(sum(p_perm_k_norm))) / sum(sum(sum(p_perm_k_norm)))).reshape(N_S, 1, 1)
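        # NOTE: if np.sum(p_perm_k_weighted, axis=(1,2)) is zero for any subject
        # (total likelihood underflow), the division above produces NaNs, so f_opt
        # becomes NaN too -- a plausible route to the empty this_S IndexError above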
        f_val_mat                           = np.tile(f_opt, (1, N + 1, M))
        f_val_mat                           = np.transpose(f_val_mat, (2, 1, 0))
        order_seq                           = np.random.permutation(N_S)  # this will produce different random numbers to Matlab
        # ... (remainder of _optimise_parameters omitted)

This issue occurs at a random iteration, which I have replicated by running my script multiple times.
Any insights on how to fix or troubleshoot this problem?
Thanks in advance
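
One way to probe this failure mode before running SuStaIn is to sanity-check the inputs: subjects whose z-scores sit far outside the [0, Z_max] range can drive every stage likelihood to zero, which produces exactly this kind of NaN cascade. Below is a minimal pre-flight sketch; the helper name and the 1.5 x Z_max threshold are hypothetical choices for illustration, not part of the pySuStaIn API.

import numpy as np

def sanity_check_sustain_inputs(data, Z_vals, Z_max):
    # Hypothetical pre-flight checks for ZscoreSustain inputs
    assert not np.isnan(data).any(), "data contains NaNs"
    assert data.shape[1] == Z_vals.shape[0] == Z_max.shape[0], "biomarker counts disagree"
    # Each biomarker's Z_max should be at least its largest stage threshold,
    # otherwise the final z-score stage can never be reached
    assert (Z_max >= Z_vals.max(axis=1)).all(), "Z_max is below the largest Z_vals threshold"
    # Subjects far beyond Z_max can underflow all stage likelihoods to zero
    n_extreme = int((data > 1.5 * Z_max[None, :]).any(axis=1).sum())
    if n_extreme:
        print(f"warning: {n_extreme} subjects exceed 1.5 x Z_max; consider reviewing or clipping them")

Running sanity_check_sustain_inputs(data, Z_vals, Z_max) before constructing ZscoreSustain rules out the most common input-side causes of a zero total likelihood; if it passes and the error persists, printing np.sum(p_perm_k_weighted, axis=(1,2)) inside _optimise_parameters will show which subjects underflow.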

Selecting the best number of subtypes

I have a question about the two metrics for selecting the best number of subtypes (the average test-set log-likelihood for each subtype model, and the CVIC). They are calculated differently, and sometimes they select different numbers of subtypes as the best. When that happens, which metric is more reasonable?
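
For intuition on why the two criteria can disagree: the average test-set log-likelihood takes the mean of log-likelihoods, whereas CVIC is built from -2 times the log of the likelihood aggregated across samples, summed over folds, so the log is applied at a different point. A hedged sketch of that difference (the shapes and the exact aggregation order are assumptions for illustration, not pySuStaIn's implementation):

import numpy as np

rng = np.random.default_rng(0)
# Hypothetical test-set log-likelihoods: 10 CV folds x 1000 MCMC samples each
fold_loglikes = [rng.normal(-500, 5, size=1000) for _ in range(10)]

# Metric 1: average test-set log-likelihood (mean of logs); higher is better
avg_loglike = np.mean([ll.mean() for ll in fold_loglikes])

def log_mean_exp(ll):
    # Numerically stable log of the mean likelihood across MCMC samples
    m = ll.max()
    return m + np.log(np.mean(np.exp(ll - m)))

# Metric 2: CVIC-style score, -2 * sum over folds of log(mean likelihood); lower is better
cvic = -2.0 * sum(log_mean_exp(ll) for ll in fold_loglikes)

# By Jensen's inequality, log(mean(exp(ll))) >= mean(ll), and the gap grows with
# the spread of ll, so the two metrics can rank subtype models differently.

When they disagree, the CVIC is the criterion reported in the original SuStaIn paper, but it is reasonable to check both and to prefer a subtype count that is stable across folds.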

Mixed Data Types

Thank you for a fantastic resource for modelling disease trajectories. I wanted to ask whether it is possible to model both continuous and ordinal data within the same model?
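
One workaround sometimes used for mixed data (an assumption on my part, not a documented pySuStaIn feature) is to discretise the continuous biomarkers into ordinal scores so that everything can be passed to OrdinalSustain together. A minimal sketch with hypothetical cut-points:

import numpy as np

def continuous_to_scores(x, cutpoints):
    # Map a continuous biomarker to ordinal scores 0..len(cutpoints),
    # using user-chosen cut-points (e.g. quantiles of a control group)
    return np.digitize(x, cutpoints)

x = np.random.default_rng(1).normal(size=100)           # a continuous biomarker
scores = continuous_to_scores(x, cutpoints=[0.5, 1.5])  # values in {0, 1, 2}

Discretising discards information relative to modelling the continuous variables directly, so whether this is acceptable depends on the application.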
