loft-br / xgboost-survival-embeddings Goto Github PK

View Code? Open in Web Editor NEW

317.0 85.0 52.0 10.79 MB

Improving XGBoost survival analysis with embeddings and debiased estimators

Home Page: https://loft-br.github.io/xgboost-survival-embeddings/

License: Apache License 2.0

Makefile 0.47% Python 99.53%

survival-analysis machine-learning

xgboost-survival-embeddings's People

Contributors

Stargazers

Watchers

Forkers

gdmarmerola andregallina alxaline a3digit stjordanis kiminh jingmouren ankitshah009 smfang jest-habi flippercy 1996lixingyu1996 trivialfis andromeda0505 trendingtechnology gabrielgimenez lenamax2355 suyash-bhutara jeanselme gstw7 bardiakh shu-hai longshen931 ducphucnguyen csutjf suryahcu rigocorp jingtian-lelouch jzyjade maikpaixao rumenick muxin0527 zeqiangwangai ziyit dillon-zephyrai tuantx7110 weiliwei666 victorhcm zero506 athospd henriquenakagawa luisperezlombardia felipetufaile yimin-yi empyriumz niharika1000 tcfkaj hungyuchou mpucheniubo jlehtiranta djun

xgboost-survival-embeddings's Issues

Catboost Survivorship Modeling

This isn't really an issue, so much so as a comment/note that Catboost has just added Cox/AFT loss to their package.. It of course doesn't offer the wonderful other functionality of this package, but I've been experimenting with using it alongside XBGSE by fitting a Catboost model, and then pass the fits from that model as a feature to the XGBSE model.

This is helpful in at least two ways: 1) Stacking multiple models is often helpful, so this allows for some performance improvement; and 2) Catboost has extremely strong Categorical feature support, while this is an area where XGboost is relatively lacking. By passing a Catboost fit-feature, this is a way of almost 'sideloading' Categorical features into an XGBSE fit.

I don't have any proposal/suggestion here, apart from potentially flagging this as an option for users in the documentation. If I have time, I may also look into putting together a pull request to let users choose between using an XGboost or a Catboost model as the base model for XGBSE, but this may take more work.

Thanks again for all the work on this package. It's been a huge success for me.

Compare to Actual

Is it possible to provide compare_to_actual function like pysurvival provide?

ref. https://square.github.io/pysurvival/tutorials/maintenance.html

AttributeError: 'XGBSEBootstrapEstimator' object has no attribute 'get_neighbors'

Code sample

xgbse_model = XGBSEStackedWeibull(xgb_params = DEFAULT_PARAMS)
bootstrap_estimator = XGBSEBootstrapEstimator(xgbse_model)
bootstrap_estimator.fit(X_train, y_train, time_bins=TIME_BINS, persist_train = True, index_id = X_train.index)

neighbors = bootstrap_estimator.get_neighbors(
    query_data=above_1,
    index_data=X_train,
    n_neighbors=10
)

print(neighbors)

Problem description

I try to get neighbors of observation to get samples for local explainability and get SHAP values with a bootstrap estimator. But I get this error message: AttributeError: 'XGBSEBootstrapEstimator' object has no attribute 'get_neighbors'.

Expected behavior

Since your documentation states that BaseEstimator is the Base class for all estimators in xgbse, I would expect .get_neigbors to work with all estimators in XGBSE.

Possible solutions

In your documentation, I can see that you never tested .get_neighbors with BootstrapEstimator. That would be a good place to start.

Also, if I have misunderstood something about your BootstrapEstimator please let me know :)

Use MTLR to fit logistic regressions in DebiasedBCE

Out current bottleneck in DebiasedBCE training time is logistic regression training.
We can try to replace scikit learn logistic regression with a custom MTLR module.

The most promising is to use a multi-task logistic regression approach, so we maintain multiple outputs but fit only one model. I think the most straightforward way to do this is via torch, which opens up the possibility of unlocking GPUs across the board. The drawback is to have torch as a dependency to the lib.

lifelines.exceptions.ConvergenceError

When running a XGBSEStackedWeibull, I get a lifelines.exceptions.ConvergenceError with the following message:

lifelines.exceptions.ConvergenceError: Fitting did not converge. Try the following:

0. Are there any lifelines warnings outputted during the `fit`?
1. Inspect your DataFrame: does everything look as expected?
2. Try scaling your duration vector down, i.e. `df[duration_col] = df[duration_col]/100`
3. Is there high-collinearity in the dataset? Try using the variance inflation factor (VIF) to find redundant variables.
4. Try using an alternate minimizer: ``fitter._scipy_fit_method = "SLSQP"``.
5. Trying adding a small penalizer (or changing it, if already present). Example: `WeibullAFTFitter(penalizer=0.01).fit(...)`.
6. Are there any extreme outliers? Try modeling them or dropping them to see if it helps convergence.

Given the pipeline nature of XGBSEStackedWeibull. Are there recommended steps to getting past the convergence error? I.E. Will the lifelines recommendations still hold, or are there other methods I should try?

Which concordance-index is being used?

Hi, I couldn't find in the documentation which concordance-index is being used while calling .metrics.

I guess it's Uno or Harel version of the c-index, right?

Is it possible to pass the monotonic constraints to the fitting function

Instructions

Delete everything between parenthesis (...)
Remove the sections that are not relevant

Describe the feature and the current state.

Currently this package seems to be one of the best ways to fit a xboost model to survival data, the only extension I would like to ask for is that the already existing parameter in xboost to restrain the trees for a monotonic constraint to be passed to the higher level function call.

Will this change a current behavior? How?

I think it only adds more functionality

Additional Information

https://xgboost.readthedocs.io/en/stable/tutorials/monotonic.html

BootstrapEstimator is actually BaggingEstimator

The BootstrapEstimator is actually training n models each with different fraction of the data. The reported CI is not the bootstrapped CI.

As this is bagging and I suggest renaming the class to avoid confusions especially in the academic settings.

feature selection

hi,
I have read the documentaion about xgbse. However I did't find the guidelines for how to use xgbse to do feature selection. Could you privde the related documention or method ?

risk_score and data

您好，感谢您分享这么优秀的工作。我想请教一下，这个模型可以单独输出类似与cox具有的risk_score吗？
此外借助下面两行代码加载数据加载不了，希望您能够上传一部分案例数据，作为参考学习。
from pycox.datasets import metabric
df = metabric.read_df()

Using a Pretrained XGBoost Model

Describe the feature and the current state.

In the XGBSEDebiasedBCE model, there is no way to overwrite the model with an already-trained model. This is useful as XGBSE does not expose some low-level APIs of the XGBoost model, for example, sample/feature weighting. This feature would be useful to add the data and an already trained model and then run the LR on top of that model.

Will this change a current behavior? How?

No. It will only add a method to the XGBSEDebiasedBCE model to replace the 'self.bst' and run the multitask LR task.

Multi-states and competing risk modelling

Hello guys, first of all congratulations for your work :)

Currently I'm facing a multi-state modeling with multiple absorbing states, in the search of some techniques I found this module. As far as I could go into the docs I didn't find any mention to multi-states and competing risks modelling, does my understanding is correct?

If so, do you guys intend to extend the use cases of this module in that direction?

Thanks!

update pandas version requirement

Hi team, when installing this package, it says

DEPRECATION: xgbse 0.2.3 has a non-standard dependency specifier pandas>=1.0.*. pip 23.3 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of xgbse or contact the author to suggest that they release a version with a conforming dependency specifiers. Discussion can be found at pypa/pip#12063

Since pandas has been updated to 2.0+, could we upgrade the required pandas version?

I made a fork and changed the required pandas version ("pandas>=1.0.0"), and it works fine for me (pip installed pandas 2.0.3 while installing xgbse).

Is it possible to recreate the xgboost object from an mlfow experiment

Describe the feature and the current state.

I haven't seen this functionality

Will this change a current behavior? How?

I don't think so

Additional Information

The idea is that after importing the experiment it would only require you to say the model object, it does not matter for me that ci intervals are lost.

How can we get the importance and output trees?

XGBSEDebiasedBCE.plot_importance(xgbse_model)
AttributeError: type object 'XGBSEDebiasedBCE' has no attribute 'plot_importance'

Can you add the plot_importance function in your xgbse?

xgboost 1.4.0+: ValueError: If using all scalar values, you must pass an index

Using xgboost 1.4.0 or 1.4.1, we are now getting an error:
ValueError: If using all scalar values, you must pass an index

No error with 1.3.3

All releases after 1.3.3, we're receiving a ValueError upon XGBSEBootstrapEstimator.fit() call.
Tested in Python 3.7.2 and 3.8.6

Trace:

C:\Users\jacob\Envs\xgbse_gqc_38\lib\site-packages\xgboost\core.py:101: UserWarning: ntree_limit is deprecated, use iteration_range or model slicing instead.
  warnings.warn(
Traceback (most recent call last):
  File "F:/git/gqc/pipe_breaks/script_runner.py", line 111, in <module>
	main()
  File "F:/git/gqc/pipe_breaks/script_runner.py", line 100, in main
	script_module.main()
  File "F:\git\gqc\pipe_breaks\algorithms\xgbse_gqc.py", line 271, in main
	do_extrapolation(X=X, X_valid=X_valid, X_train=X_train, y_train=y_train, main_ids=id_column)
  File "F:\git\gqc\pipe_breaks\algorithms\xgbse_gqc.py", line 122, in do_extrapolation
	bootstrap_estimator, mean, upper_ci, lower_ci = fit_predict_bootstrap_est(
  File "F:\git\gqc\pipe_breaks\algorithms\xgbse_gqc.py", line 208, in fit_predict_bootstrap_est
	bootstrap_estimator.fit(
  File "C:\Users\jacob\Envs\xgbse_gqc_38\lib\site-packages\xgbse\_meta.py", line 57, in fit
	trained_model = self.base_estimator.fit(X_sample, y_sample, **kwargs)
  File "C:\Users\jacob\Envs\xgbse_gqc_38\lib\site-packages\xgbse\_kaplan_neighbors.py", line 407, in fit
	pd.DataFrame({"leaf": leaves})
  File "C:\Users\jacob\Envs\xgbse_gqc_38\lib\site-packages\pandas\core\frame.py", line 467, in __init__
	mgr = init_dict(data, index, columns, dtype=dtype)
  File "C:\Users\jacob\Envs\xgbse_gqc_38\lib\site-packages\pandas\core\internals\construction.py", line 283, in init_dict
	return arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
  File "C:\Users\jacob\Envs\xgbse_gqc_38\lib\site-packages\pandas\core\internals\construction.py", line 78, in arrays_to_mgr
	index = extract_index(arrays)
  File "C:\Users\jacob\Envs\xgbse_gqc_38\lib\site-packages\pandas\core\internals\construction.py", line 387, in extract_index
	raise ValueError("If using all scalar values, you must pass an index")
ValueError: If using all scalar values, you must pass an index

Throwing code block:

def fit_predict_bootstrap_est(base_model, n_estimators, X_train, y_train, X_valid):
    """Instantiate, fit, and predict a bootstrap_estimator."""
    bootstrap_estimator = XGBSEBootstrapEstimator(base_model, n_estimators=n_estimators)
    bootstrap_estimator.fit(
        X_train,
        y_train,
        time_bins=TIME_BINS,
    )

I'm unable to share specific data of the train structures, but their types and shapes follow:
X_train = DataFrame: (2916, 11)
y_train = ndarray: (4916,)
TIME_BINS = np.arange(5, 540, 5)

Requirements:
astor==0.8.1
autograd==1.3
autograd-gamma==0.5.0
backcall==0.2.0
colorama==0.4.4
cycler==0.10.0
decorator==5.0.7
ecos==2.0.7.post1
formulaic==0.2.3
future==0.18.2
interface-meta==1.2.3
ipykernel==5.5.3
ipython==7.22.0
ipython-genutils==0.2.0
jedi==0.18.0
joblib==1.0.1
jupyter-client==6.1.12
jupyter-core==4.7.1
kiwisolver==1.3.1
lifelines==0.25.11
matplotlib==3.3.0
numexpr==2.7.3
numpy==1.20.2
osqp==0.6.2.post0
pandas==1.1.0
parso==0.8.2
pickleshare==0.7.5
Pillow==8.2.0
prompt-toolkit==3.0.18
Pygments==2.8.1
pyparsing==2.4.7
python-dateutil==2.8.1
pytz==2021.1
pywin32==300
pyzmq==22.0.3
qdldl==0.1.5.post0
scikit-learn==0.24.1
scikit-survival==0.15.0.post0
scipy==1.6.2
six==1.15.0
threadpoolctl==2.1.0
toml==0.10.2
tornado==6.1
traitlets==5.0.5
wcwidth==0.2.5
wrapt==1.12.1
xgboost==1.3.3
# xgboost==1.4.1
xgbse==0.2.1

Competing risks/events

If I have competing events and I treat the occurrence of a competing events as censored observations (informative/dependent censoring) will this model still give me an unbiased estimate of survival? In my use case the competing event will be one that precludes the event of interest, so the two events are completely mutually exclusive, I'm interested in event A, but if event B happens, then event A can never happen.

Generally I gather that competing events can work within a Cox framework, but it will give a biased estimate of incidence using a Kaplan Meier approach and XGBSE partially relies on Kaplan Meier.

What is the meaning of events 0 and 1? can you please add this to the docs or point me to the relevant information?

Instructions

Delete everything between parenthesis (...)
Remove the sections that are not relevant

Describe the documentation issue

(explain more about the issue you found, if it's a missing doc or a improvement)

Adding DOI for Citation

I would like to thank you for the hard work you put into developing this package.

I used your work in my research and wanted to cite it in the manuscript. I saw you have a reference manual but unfortunately, it does not have DOI. I was wondering if you could list your package on zendono so that you can have a DOI for xgbse?

This would facilitate referring to your work for many journals.

Question: proper way to save/load XGBSEStackedWeibull model

Hi. First of all, thank you for developing and sharing XGBSE, which has been of great help for me.
I'd like to know how to properly save/load all the key components of a trained XGBSEStackedWeibull model.
I have tried out several saving methods but have failed thus far to retrieve the exact same results from the loaded model.

XGBSEDebiasedBCE n_jobs parameter

I'm curious to understand how the XGBSEDebiasedBCE (and perhaps other modules') n_jobs parameter works. I see it listed here: https://loft-br.github.io/xgboost-survival-embeddings/bce.html as a separate parameter from the xgboost parameters themselves (which has it's own n_jobs parameter).

Does this parameter impact the logistic regression instead of xgboost? In other words, what is being multiprocessed here?

Perhaps relatedly, I've noticed that the xgboost part of XGBSEDebiasedBCE will typically fit quite quickly on my data (~20 seconds if using GPU and stopping after about 150 rounds), but will then pause for 30-40 seconds before the fitting is complete. Is this due to the logistic regression that XGBSEDebiasedBCE is performing after fitting with xgboost and is there any good way to speed this up?

Ability to pass separate training data sets in for xgboost model vs survival model

I'm wondering if you've considered allowing the user to pass in a separate training sets for the xgboost model vs the survival model?

For example, in XGBSEStackedWeibull, the current state is this:

Train xgboost on X_train, y_train
Predict back on X_train using model from (1), resulting in risk scores
Train Weibull AFT model with risk scores from (2) and y_train

I'm proposing this:

Train xgboost on X_train, y_train
Predict risk scores of X_train_2 using model from (1)
Train Weibull AFT model using risk scores from (2) and y_train_2

The rationale for having different datasets used between the models is that it reduces the chance of overfitting. I've found that the risk scores that come out of step 2 are indicating a tighter relationship between risk score and y_train than there actually is, by nature of the fact that we are predicting back on the dataset that the xgboost model was trained on (and then re-relating things to the original outcome variable, y_train).

Thanks for the awesome package

plot_ci function is not defined

Great package!
I'm having difficulties generating survival plots though.

# fitting xgbse model
xgbse_model = XGBSEKaplanNeighbors(n_neighbors=50)
xgbse_model.fit(xgbdata, y)

# predicting
event_probs = xgbse_model.predict(xgbdata)
event_probs.head()

#Survival curves
mean, upper_ci, lower_ci = xgbse_model.predict(xgbdata, return_ci=True)

# plotting CIs
plot_ci(mean, upper_ci, lower_ci)

I get the following error message:

plot_ci(mean, upper_ci, lower_ci)
Traceback (most recent call last):
File "", line 1, in
NameError: name 'plot_ci' is not defined

Everything else is working fine.
Is the function part of a different module/package?

Automate documentation deploy using actions

Time varying survival regression

Hi Team,

Awesome work on this! Really looking forward to trying it out.

I looked through the documentation and I wasn't sure if I am not looking at the right place but does the package allow covariates to change over time, as explained below.
https://lifelines.readthedocs.io/en/latest/Time%20varying%20survival%20regression.html

Best,
Jayant

XGBoost Nrounds Parameter

Thanks again for the wonderful package. I'm wondering what the best way to set the nrounds parameter in the underlying Xgboost model is? I have tried nrounds, n_estimators, and num_boost_round. and gotten the following warning each time:

[08:18:45] WARNING: C:\Users\Administrator\workspace\xgboost-win64_release_1.2.0\src\learner.cc:516: 
Parameters: { num_boost_round } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.

What's the correct interface for this?

pip install xgbse

Code sample

pip install xgbse

Problem description

ERROR: Could not find a version that satisfies the requirement xgbse (from versions: none)
ERROR: No matching distribution found for xgbse

Xgbse.metrics scoring speeds

I've noticed it takes several times longer for xgbse to score the models using the built-in scoring functions (concordance_index/approx_brier_score) than it takes for the models to fit. Is this typical/expected behavior? On my data - approximately 800K rows, the model itself will fit in about 90 seconds, but scoring the model can take 4 minutes or more after.

This additional overhead is significant for the purposes of hyperparameter searching, so I am wondering if there is anything I can do to speed up the scoring (e.g., any parameters to tweak in the scoring functions?).

Question: Predicting probability of survival at arbitrary times

I have a set of observations I would like to predict survival for at varying times. Is there a way to do that with this API? As best as I can tell, we are restricted to a set of time bins that must be uniform over the whole dataset.

Thank you for your assistance!

XGBEmbedKaplanNeighbors appears to now be XGBSEKaplanNeighbors

Hello - thanks for this amazing library. It's truly a life-saver after looking around for a modern survival analysis library which could handle both large data, as well provide the full survival curve for every row. You've built something special here.

In exploring, I discovered that the XGBEmbedKaplanNeighbors import in the documentation should now be a XGBSEKaplanNeighbors import, if I'm reading things correctly. I don't know if any other class names have changed, but wanted to flag to help get more people using this library.

(Unrelatedly, I was unable to install this package in Colab due to a claimed dependency on Python 3.6.9. However, it worked fine on my local machine on 3.8).

Thanks again for the great work. I look forward to exploring the functionality more.

SHAP explanation for XGBSEKaplanTree or bootstrapestimator.

Hi,
Is it possible to use SHAP with XGBSEKaplanTree or bootstrapestimator.
SHAP treeexplainer is not working with them. Permutationexplainer seems to start evaluating but ended up with error "ValueError: max_evals=1785 is too low for the Permutation explainer, it must be at least 2 * num_features + 1 = 1799!"

I am not sure how to fix this error.
If anyone can point me in the right direction, it will be really helpful.
THank you in advance.

Ensembling of predictions

Dear xgbse-team,

what would be the correct way for ensembling of predictions? Let's say that I have 5 StackedWeibull models and would like to ensemble their predictions on a test dataset. Should I average the interval predictions?

Thank you very much.

Fitting results in AttributeError: dlsym(0x7fa409181040, XGDMatrixCreateFromDense): symbol not found

I followed along the coding example from Basic Usage section. Fitting the model resulted in AttributeError: dlsym(0x7fa409181040, XGDMatrixCreateFromDense): symbol not found. I'm using Python 3.8.5 on a MacOS Catalina. I also postet the issue here

Category features support

Only dataframe object is supported for the "fit" method

xgbse_model = XGBSEKaplanTree(PARAMS_TREE)
# X = xgb.DMatrix(X, enable_categorical=True)
xgbse_model.fit(X, y)

but enable_categorical is not set "True" in the source code of xgbse
it gives error :
"ValueError: DataFrame.dtypes for data must be int, float, bool or category. When
categorical type is supplied, DMatrix parameter enable_categorical must
be set to True"

def build_xgb_cox_dmatrix(X, T, E):
    """Builds a XGB DMatrix using specified Data Frame of features (X)
        arrays of times (T) and censors/events (E).

    Args:
        X ([pd.DataFrame, np.array]): Data Frame to be converted to XGBDMatrix format.
        T ([np.array, pd.Series]): Array of times.
        E ([np.array, pd.Series]): Array of censors(False) / events(True).

    Returns:
        (DMatrix): A XGB DMatrix is returned including features and target.
    """

    target = np.where(E, T, -T)

    return xgb.DMatrix(X, label=target)

The last line here does not set "enable_categorical = True". Category features are not supported? Or I just need to change the code here myself.

feature importance

hi，
I want to know how to get feature importance when I use xgbse for survival analysis？ could I use the feature importance to do feature selection？

XGBSE error with sklearn pipeline and GridSearchCV

Code sample

# Your code here

import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_classif
from xgbse import XGBSEKaplanNeighbors
from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
import itertools
import time

[Radiomics_Clinical_GMMC_2022_09_22-CT_der_PET-mix.csv](https://github.com/loft-br/xgboost-survival-embeddings/files/9636594/Radiomics_Clinical_GMMC_2022_09_22-CT_der_PET-mix.csv)

csv_path = "C:/Radiomics_Clinical_GMMC_2022_09_22-CT_der_PET-mix.csv"

df = pd.read_csv(csv_path)
df = df.set_index('cohort', drop=True)
df.index.rename('index', inplace=True)
df.index = df.index.astype(int)

X = df.drop(columns=['PFS_binaer_Progress', 'Ereignis_korrigiert_Update_2021_03', 'DFS_M_ED_Update_2021_03',
                     'Pseudonym', 'train_test_mix', 'SUVmax', 'SEX_SPSS', 
                     'DIAGECOG_komplett_ueber_1', 'DIAALTER', 'PTNM_T_SPSS_korr_grob_7th',
                     'PTNM_N_SPSS_korr', 'STADIUM_GROB_SPSS_7thEdition',
                     'R_Status', 'PTNM_T_SPSS_korr_7th', 'STADIUM_SPSS_7thEdition',
                     'Histo_Subtyp', 'NEOADJ_CHEMO', 'NEOADJ_BESTR', 'ADJ_CHEMO', 'ADJ_BESTR',
                     'ANY_CHEMO', 'ANY_BESTR', 'ASP_high_19_5', 'ASP', 'ASP_high_33_3'])

y_df = pd.DataFrame()
y_df['event'] = df['Ereignis_korrigiert_Update_2021_03'].astype(bool)
y_df['time'] = df['DFS_M_ED_Update_2021_03']

# split dataframe into training+test and validation cohort

X_train_test = X.iloc[X.index.isin([1])]
X_valid = X.iloc[X.index.isin([2])]

y_train_test_df = y_df.iloc[y_df.index.isin([1])]
y_valid_df = y_df.iloc[y_df.index.isin([2])]

s = y_df.dtypes
y_train_test = np.array([tuple(x) for x in y_train_test_df.values], dtype=list(zip(s.index, s)))
y_valid = np.array([tuple(x) for x in y_valid_df.values], dtype=list(zip(s.index, s)))


def score_survival_model(model, X, y):
    prediction = model.predict(X)
    result = concordance_index_censored(y['event'], y['time'], prediction)
    return result[0]


feature_select_dict = {
                        "MIC" : SelectKBest(mutual_info_classif, k=30),                        
                        }


p_grid_dict = {"xgbse" : {"estimator__objective": ["survival:aft"],
                             "estimator__eval_metric": ["aft-nloglik"],
                             "estimator__aft_loss_distribution": ["normal", "logistic"],
                             "estimator__aft_loss_distribution_scale": [0.5, 1.0, 1.5],
                             "estimator__tree_method": ["hist"],
                             "estimator__learning_rate": np.logspace(-2, 0, num=6),
                             "estimator__max_depth": [0, 1, 5, 10, 15, 25, 50],
                             "estimator__booster": ["dart"],
                             "estimator__subsample": [0.1, 0.2, 0.4, 0.6, 0.8, 1.0],
                             "estimator__min_child_weight": [1, 2, 5, 10],
                             "estimator__colsample_bynode": [0.5, 1.0, 2.0]}
                        }


models_dict = {
                "xgbse" : XGBSEKaplanNeighbors(xgb_params=p_grid_dict['xgbse'])
                }


inner_cv = sklearn.model_selection.KFold(n_splits=10, shuffle=True, random_state=1)
outer_cv = sklearn.model_selection.KFold(n_splits=10, shuffle=True, random_state=1)



def model_scores(feature_select_dict, models_dict, p_grid_dict, X_train, y_train, X_valid, y_valid):
    
    # define the scaler and prepare empty dict or dataframe to assemble results and best parameters
    
    scaler = RobustScaler(with_centering = False)
    models_df_dict = dict()
    params_df = pd.DataFrame()    
    
    for outerKey in feature_select_dict:
         
        models = pd.DataFrame()  
        feature_select = feature_select_dict[outerKey]

        for innerKey in models_dict:
  
            # instantiate model
        
            model = models_dict[innerKey]
            p_grid = p_grid_dict[innerKey]
        
        
            # inner loop of nested cross-validation: perform hyperparameter tuning in the training set of the outer loop
        
            t1 = time.time()
            
            pipeline = Pipeline([('scaling', scaler), ('feature_selection', feature_select), ('estimator', model)])
            
            clf_model = sklearn.model_selection.GridSearchCV(estimator=pipeline, 
                                                             scoring=score_survival_model,
                                                             param_grid=p_grid, 
                                                             cv=inner_cv, refit=True) 
            
                        
            # outer loop: train the model with training data and score the perfomance on test sets
            nested_test_score = sklearn.model_selection.cross_val_score(clf_model, scoring=score_survival_model,
                                                                        X=X_train, y=y_train, cv=outer_cv)
        
        
            # calculate AUC from test and validation set
        
            test_mean = nested_test_score.mean()
            test_std = nested_test_score.std()
            
            clf_model_fit = clf_model.fit(X_train, y_train)
            
            clf_model_best_parameters = str(clf_model.best_params_)
        
            valid_score = clf_model.score(X_valid, y_valid)
            
            test_plus_valid = test_mean + valid_score
            
            model_time = (time.time()-t1)
                
                
            # at the end of this nested CV: add results for this model to the models dataframe 
        
            models[innerKey] = [test_mean, test_std, model_time, valid_score, test_plus_valid]
            
            df_list = [outerKey, innerKey, clf_model_best_parameters]
            params = pd.DataFrame(df_list)
            params_df = pd.concat([params_df, params], axis = 1)

        # add model results for different feature_select_dict keys to the dict of model results
        # add best parmaeters to the dataframe
       

        models_df_dict[outerKey] = models 
        
        
    # subsequent to all model calculations: add multiindex, flip (transpose) dataframe and sort by highest "test+valid"
    # finalize the "best parameters" dataframe
    
    multiindex = {(outerKey, innerKey): values for outerKey, innerDict in models_df_dict.items() for innerKey, values in innerDict.items()}

    models_df_dict_multiindex = pd.DataFrame(multiindex)
    models_df_dict_multiindex.index = ['nested_test_mean', 'nested_test_SD', 'time', 'valid', 'test_plus_valid']
    
    models_transpose = models_df_dict_multiindex.transpose()
    models_transpose.index.set_names(['pre', 'model'], inplace=True)
    models_transpose = models_transpose.sort_values(by = ['nested_test_mean'], ascending = False)

    params_df = params_df.T
    params_df.columns = ['feature_select', 'model', 'parameters']
    params_df = params_df.sort_values(by = ['model', 'feature_select'], ascending = [True, True])
    
    return models_transpose, params_df


results, params = model_scores(feature_select_dict, models_dict, p_grid_dict, X_train_test, y_train_test, X_valid, y_valid)
results_ready = results.reset_index(level=['pre', 'model'])
print(results_ready)

Problem description

Use of GridSearchCV is not possible because XGBSE requires hyperparameters to be unique and to be passed during model initiation. Furthermore, parameter vales in the parameter dict need to be without [], while GridSearchCV expects values in [].
XGBSE therefore seems to be incompatible with GridSearchCV.
Furthermore, XGBSE seems to be incompatible with sklearn's pipeline.
If the sklearn pipeline is fitted, the estimator XGBSE receives the X dataframe as a np.array in the last step of the pipeline, which misses an index. This gives a corresponding error because XGBSE fitting seems to require X.index.

Expected behavior

__
It would be expected that XGBSE can be used with GridSearchCV and pipeline.

Possible solutions

__
It would be required that hyperparameters could be defined and that fitting would allow X without an index (np.array).

Underlying xgboost model for calculating feature importances

Xgboost models can output permutation or gain feature importances. Additionally, through packages such as Shap, it's possible to get Shapley values for Xgboost models, which can be highly informative about what features are contributing to survival.

It would be helpful to allow the user to access the underlying Xgboost regressor object in order to calculate/view these feature importance. I don't know if this is already possible in some way, but if not, exposing this to the user would be helpful.

TypeError: predict() got an unexpected keyword argument 'iteration_range'

Code sample

# Your code here

# importing dataset from pycox package
from pycox.datasets import metabric

# importing model and utils from xgbse
from xgbse import XGBSEKaplanNeighbors
from xgbse.converters import convert_to_structured

# getting data
df = metabric.read_df()

# splitting to X, y format
X = df.drop(['duration', 'event'], axis=1)
y = convert_to_structured(df['duration'], df['event'])

# fitting xgbse model
xgbse_model = XGBSEKaplanNeighbors(n_neighbors=50)
xgbse_model.fit(X, y)

Problem description

I tried to run the above example code and i encountered the following error

TypeError Traceback (most recent call last)
in
15 # fitting xgbse model
16 xgbse_model = XGBSEKaplanNeighbors(n_neighbors=50)
---> 17 xgbse_model.fit(X, y)
18
19 # # predicting

~/opt/anaconda3/lib/python3.8/site-packages/xgbse/_kaplan_neighbors.py in fit(self, X, y, num_boost_round, validation_data, early_stopping_rounds, verbose_eval, persist_train, index_id, time_bins)
176
177 # creating nearest neighbor index
--> 178 leaves = self.bst.predict(
179 dtrain, pred_leaf=True, iteration_range=(0, self.bst.best_iteration + 1)
180 )

TypeError: predict() got an unexpected keyword argument 'iteration_range'

AttributeError: `best_iteration` is only defined when early stopping is used - when xgboost>=2

Code sample

Run the below code snippets.

Requirements:

# I used Python 3.8.
pip install xgbse
pip install "xgboost>=2"

# xgbse: 0.2.3
# xgboost: 2.0.0

import xgbse
import numpy as np

model = xgbse.XGBSEDebiasedBCE(lr_params={"max_iter": 10})

np.random.seed(0)
X = np.random.normal(size=(10, 5))
e = np.random.randint(low=0, high=1 + 1, size=(10, 1), dtype=bool)
t = np.random.rand(10, 1)
y = np.array(list(zip(e, t)), dtype={"names": ("e", "t"), "formats": ("bool", "f8")})

model.fit(X, y)

Problem description

An AttributeError is thrown unexpectedly. Since early_stopping_rounds is set as None in fit (which is the default), the user wouldn't expect to see an error related to early stopping.

Traceback (most recent call last):
  File "xgbse_20230926.py", line 15, in <module>
    model.fit(X, y)
  File "<...>/python3.8/site-packages/xgbse/_debiased_bce.py", line 232, in fit
    dtrain, pred_leaf=True, iteration_range=(0, self.bst.best_iteration + 1)
  File "<...>/python3.8/site-packages/xgboost/core.py", line 2602, in best_iteration
    raise AttributeError(
AttributeError: `best_iteration` is only defined when early stopping is used.

Expected behavior

Code executes without throwing the AttributeError.

Possible solutions

For example, add a check whether early stopping was enabled before accessing self.bst.best_iteration (this appears to be used in self.bst.predict(..., iteration_range=(0, self.bst.best_iteration + 1)) in _debiased_bce.py), otherwise use an appropriate default value.

No observation in time bucket

When there's no observation in a time bucket in BCE we get a value error
ValueError: Error: No observations in a time bucket
It would be nice to see in which time bucket was the error

Giving individual weights in the xgbsestackedweibull

Hi, everyone there,

Problem description

-I have a case-cohort data, which need to give each cases and non-cases corresponding weights to meet the disease rate in a natural population.
-Normally, in a AFT model, as in lifelines: WeibullAFTFitter, the 'weight_col' can let me input weights.
-In the constructions of XGBSEStackedWeibull in https://loft-br.github.io/xgboost-survival-embeddings/modules/stacked_weibull.html, the weilbull_params are the same as lifelines: WeibullAFTFitter, but when I use the code below to put weight_name in weibull_params. I got the error. It seems the 'weight_col' in WeibullAFTFitter cannot work in XGBSE.

Code sample

# parameters
xgb_params = {
    "objective": "survival:aft",
    "eval_metric": "aft-nloglik",
    "aft_loss_distribution": "normal",
    "aft_loss_distribution_scale": 0.795,
    "tree_method": "hist",
    "learning_rate": 5e-2,
    "max_depth": 8,
    "booster": "dart",
    "subsample": 0.5,
    "min_child_weight": 50,
    "colsample_bynode": 0.5
}

weibull_params={ 'weight_col':'weight'}

# fitting XGBSE model
xgbse_model = XGBSEStackedWeibull(xgb_params=xgb_params, weibull_params=weibull_params)

Error

self.weibull_aft = WeibullAFTFitter(**self.weibull_params)

TypeError: __init__() got an unexpected keyword argument 'weight_col'

Expected behavior

Got the individual-weighted XGBoost-AFT model

Possible solutions

Should I label the matrix with weights first?
Or the 'scale_pos_weight' in xgboost can be used?

Using best round when applying early stopping

Hi and thanks for a really interesting library!

I have two small questions.

First of all I was thinking about the reasoning behind your choice of XGBoost instead of LightGBM? I am just curious if XGB has any inherent advantages for survival analysis compared to LGB.

Then on to my main question related to early stopping. As far as I can see, when using early stopping, you are not using the best iteration for predictions, but rather the last iteration. Or, in code; instead of:

y_pred = bst.predict(dtrain)

it should be:

y_pred = bst.predict(dtrain, ntree_limit=bst.best_ntree_limit)

Please correct me if I am mistaken since I have not used your library extensively yet :)

Cheers,

Cure Fraction Problem?

What is the "cure fraction" problem described in the documentation for the XGBSEStackedWeibull. To quote:
We also have better extrapolation capabilities, as opposed to the cure fraction problem in `XGBSEKaplanNeighbors` and `XGBSEKaplanTree`.

loft-br / xgboost-survival-embeddings Goto Github PK

xgboost-survival-embeddings's People

Contributors

Stargazers

Watchers

Forkers

xgboost-survival-embeddings's Issues

Code sample

Problem description

Expected behavior

Possible solutions

Instructions

Describe the feature and the current state.

Will this change a current behavior? How?

Additional Information

Describe the feature and the current state.

Will this change a current behavior? How?

Describe the feature and the current state.

Will this change a current behavior? How?

Additional Information

Instructions

Describe the documentation issue

Code sample

Problem description

Code sample

Problem description

Expected behavior

Possible solutions

Code sample

Problem description

Code sample

Problem description

Expected behavior

Possible solutions

Problem description

Code sample

Error

Expected behavior

Possible solutions

Recommend Projects

Recommend Topics

Recommend Org