ida-humancapital / fife

Finite-Interval Forecasting Engine: Machine learning models for discrete-time survival analysis and multivariate time series forecasting

License: Other

Languages: Python 98.62%, Batchfile 1.38%

Topics: machine-learning, survival-analysis, forecasting, time-series-analysis, forecasting-models, competing-risks, multivariate-time-series, time-to-event, panel-analysis

fife's Introduction

The Finite-Interval Forecasting Engine (FIFE) provides machine learning and other models for discrete-time survival analysis and multivariate time series forecasting.

Suppose you have a dataset that looks like this:

ID period feature_1 feature_2 feature_3 ...
0 2019 7.2 A 2AX ...
0 2020 6.4 A 2AX ...
0 2021 6.6 A 1FX ...
0 2022 7.1 A 1FX ...
1 2019 5.3 B 1RM ...
1 2020 5.4 B 1RM ...
2 2020 6.7 A 1FX ...
2 2021 6.9 A 1RM ...
2 2022 6.9 A 1FX ...
3 2020 4.3 B 2AX ...
3 2021 4.1 B 2AX ...
4 2022 7.4 B 1RM ...
... ... ... ... ... ...

The entities with IDs 0, 2, and 4 are observed in the dataset in 2022.

  • What are each of their probabilities of being observed in 2023? 2024? 2025?
  • Given that they will be observed, what will be the value of feature_1? feature_3?
  • Suppose entities can exit the dataset under a variety of circumstances. If entities 0, 2, or 4 exit in a given year, what will their circumstances be?
  • How reliable can we expect these forecasts to be?
  • How do the values of the features inform these forecasts?

FIFE can estimate answers to these questions for any unbalanced panel dataset.
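For instance, identifying which entities survive to the final observed period, the set FIFE forecasts forward, takes one line of pandas. A minimal sketch using the toy panel from the table (feature columns omitted):

```python
import pandas as pd

# Toy unbalanced panel mirroring the table above (feature columns omitted)
panel = pd.DataFrame({
    "ID":     [0, 0, 0, 0, 1, 1, 2, 2, 2, 3, 3, 4],
    "period": [2019, 2020, 2021, 2022, 2019, 2020,
               2020, 2021, 2022, 2020, 2021, 2022],
})

# Entities observed in the final period are the ones to forecast forward
final_period = panel["period"].max()
alive_ids = sorted(panel.loc[panel["period"] == final_period, "ID"].unique())
print(alive_ids)  # [0, 2, 4]
```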

FIFE unifies survival analysis (including competing risks) and multivariate time series analysis. Tools for the former neglect future states of survival; tools for the latter neglect the possibility of discontinuation. Traditional forecasting approaches for each, such as proportional hazards and vector autoregression (VAR), respectively, impose restrictive functional forms that limit forecasting performance. FIFE supports the state-of-the-art approaches for maximizing forecasting performance: gradient-boosted trees (using LightGBM) and neural networks (using Keras).

FIFE is simple to use:

from fife.processors import PanelDataProcessor
from fife.lgb_modelers import LGBSurvivalModeler
import pandas as pd

data_processor = PanelDataProcessor(data=pd.read_csv(path_to_your_data))
data_processor.build_processed_data()

modeler = LGBSurvivalModeler(data=data_processor.data)
modeler.build_model()

forecasts = modeler.forecast()

Want to forecast future states, too? Just replace LGBSurvivalModeler with LGBStateModeler and specify the column you'd like to forecast with the state_col argument.

Want to forecast circumstances of exit ("competing risks")? Try LGBExitModeler with the exit_col argument instead.

Here's a guided example notebook with real data, where we forecast when world leaders will lose power.

You can read the documentation for FIFE here.

fife's People

Contributors

agelder, ajain-ida, cdoswald, dependabot[bot], edjishwang, jamesmckownbishop, jay-dennis, mikeguggis


fife's Issues

Add Cumulative Incidence Function Calculation to the Exit Modeler

A key output desired from the Exit Modeler is the Cumulative Incidence Function (CIF). Currently the CIF must be pieced together from survival and exit forecasts; we show how to do this in the documentation. This calculation should be automatic and is easy to implement. Code for doing so can be found in the feature/CIF branch and will be incorporated into FIFE's Exit Modeler.
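For reference, piecing the CIF together amounts to multiplying each period's unconditional exit probability, S(t-1) - S(t), by the conditional exit-type probabilities and accumulating over horizons. A minimal numpy sketch with made-up numbers (survival and exit_probs are hypothetical placeholders, not FIFE output):

```python
import numpy as np

# Hypothetical survival probabilities S(t) for horizons t = 1..4
survival = np.array([0.9, 0.75, 0.6, 0.5])

# Hypothetical conditional exit-type probabilities P(type | exit in t):
# rows = horizons, columns = exit types
exit_probs = np.array([[0.7, 0.3],
                       [0.6, 0.4],
                       [0.5, 0.5],
                       [0.4, 0.6]])

# Unconditional probability of exiting in each period: S(t-1) - S(t), S(0) = 1
exit_in_period = np.concatenate(([1.0], survival[:-1])) - survival

# Cumulative incidence per type: running sum of exit probability x type share
cif = np.cumsum(exit_in_period[:, None] * exit_probs, axis=0)

# Sanity check: CIF summed over types equals 1 - S(t)
assert np.allclose(cif.sum(axis=1), 1 - survival)
```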

construct_embedding_network in tf_modelers.py

In both TFModeler.build_model() and TFModeler.hyperoptimize(), the call to construct_embedding_network() is meant to pass along arguments supplied in params, but the expression that selects those arguments fails because it does not convert the params keys to lower case before checking membership. Here's an example inside TFModeler.build_model():

self.model = self.construct_embedding_network(
    **{k.lower(): v for k, v in params.items() if k in construction_args}
)

This expression correctly lower-cases the keys of the dict it builds, but tests the original (upper-case) params keys against construction_args, so no custom params, such as DROPOUT_SHARE and DENSE_LAYERS, ever reach construct_embedding_network().

Making this small modification everywhere the pattern occurs should fix it:

self.model = self.construct_embedding_network(
    **{k.lower(): v for k, v in params.items() if k.lower() in construction_args}
)
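The difference is easy to demonstrate in isolation (the construction_args and params values here are stand-ins):

```python
construction_args = {"dropout_share", "dense_layers"}
params = {"DROPOUT_SHARE": 0.25, "DENSE_LAYERS": 2, "BATCH_SIZE": 512}

# Buggy: membership test uses the original upper-case key, so nothing matches
buggy = {k.lower(): v for k, v in params.items() if k in construction_args}

# Fixed: lower-case the key before the membership test
fixed = {k.lower(): v for k, v in params.items() if k.lower() in construction_args}

print(buggy)  # {}
print(fixed)  # {'dropout_share': 0.25, 'dense_layers': 2}
```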

Categorical labels in IFE models

When creating the internal model for IFE, '_label' appears to use the internal labeling from .cat.categories. However, other models (including the Exit Modeler) appear to use the original values. This creates problems when matching the outputs of these models, especially if the original labels are numeric. We need to investigate how categorical variables are labeled, especially as it relates to IFE.

ids["_label"] = ids["_label"].cat.codes

1e40f26
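The gap between internal codes and original values can be seen with a toy categorical series (values here are illustrative):

```python
import pandas as pd

# Numeric labels stored as a categorical column
labels = pd.Series([10, 20, 10, 30], dtype="category")

# .cat.codes yields internal integer positions, not the original values
print(labels.cat.codes.tolist())       # [0, 1, 0, 2]
print(labels.cat.categories.tolist())  # [10, 20, 30]
```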

compute_model_uncertainty in tf_modelers.py

The method compute_model_uncertainty() on line 569 in tf_modelers.py is not working. The branch feature/prediction_intervals was created to address this bug and to further extend FIFE to produce prediction intervals.

Prediction Intervals

  1. Investigate the existing implementation for forecast uncertainty with the TF modeler
  2. Develop prediction intervals for the LGB modelers
    The branch feature/prediction_intervals was created for these purposes.
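One generic recipe for item 2, offered as a sketch rather than a description of what feature/prediction_intervals actually implements, is to train an ensemble of models (e.g., bagged LGB fits) and take empirical quantiles of their forecasts:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-in for an ensemble of survival forecasts:
# rows = ensemble members, columns = forecast horizons
ensemble = rng.uniform(0.4, 0.9, size=(200, 3))

# 95% prediction interval per horizon from empirical quantiles
lower, upper = np.quantile(ensemble, [0.025, 0.975], axis=0)
point = ensemble.mean(axis=0)
assert ((lower <= point) & (point <= upper)).all()
```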

Normalize forecasts from ExitModeler

Forecasts from LGBExitModeler().forecast() do not sum to 1 within each individual-time period, though they are close. A manual normalization cuts the error roughly in half, as shown in the code below. However, a manual normalization could obscure other bugs and make debugging harder in the future.

from fife.lgb_modelers import LGBSurvivalModeler
from fife.lgb_modelers import LGBExitModeler
from fife.processors import PanelDataProcessor
from fife.utils import make_results_reproducible
from pandas import concat, date_range, read_csv, to_datetime, Index, DataFrame
#import pandas as pd
#import numpy as np
from numpy import array, repeat, full, isclose

# import warnings
# warnings.filterwarnings('ignore')

SEED = 9999
make_results_reproducible(SEED)

data = read_csv("https://www.dl.dropboxusercontent.com/s/3tdswu2jfgwp4xw/REIGN_2020_7.csv?dl=0")
data.head()

data["country-leader"] = data["country"] + ": " + data["leader"]
data["year-month"] = data["year"].astype(int).astype(str) + data["month"].astype(int).astype(str).str.zfill(2)
data["year-month"] = to_datetime(data["year-month"], format="%Y%m")
data = concat([data[["country-leader", "year-month"]],
               data.drop(["ccode", "country-leader", "leader", "year-month"],
                         axis=1)],
               axis=1)
data.head()

total_obs = len(data)
data = data.drop_duplicates(["country-leader", "year-month"], keep="first")
# n_duplicates = total_obs - len(data)
# print(f"{n_duplicates} observations with a duplicated identifier pair deleted.")
    
data_processor = PanelDataProcessor(data=data)
data_processor.build_processed_data()
# data_processor.data.head()

mdata = data_processor.data.copy()

# Assign unique ID based off broken panels
gaps = mdata.groupby("country-leader")["_period"].shift() < mdata["_period"] - 1
spells = gaps.groupby(mdata["country-leader"]).cumsum()
mdata['country-leader-spell'] = mdata['country-leader'] + array(spells, dtype = 'str')

## Create exit status for dissolution, change of government type, and same government type
mdata['outcome'] = 'no_exit'

for i in mdata['country-leader-spell'].unique():
    # Boolean index for the unique country-leader-spell observations
    dex1 = mdata['country-leader-spell'] == i
    # Boolean index for the last observation for the country-leader-spell observations
    dex2 = dex1 & (mdata['year-month'] == mdata.loc[dex1,'year-month'].max())
    # Boolean index for the last observation for the country-leader-spell observations unless it is censored
    dex3 = dex2 & (mdata['year-month'] != mdata['year-month'].max())
    # Only assign exit status if not censored
    if dex3.sum() > 0:
        futuredex = (mdata['year-month'] >= mdata.loc[dex3,'year-month'].values[0]) & (mdata['country'] == mdata.loc[dex3,"country"].values[0]) & ~dex1
        if futuredex.sum() == 0:
            mdata.loc[dex2,'outcome'] = "Dissolution" # could alternatively use dex1
        else:
            oldgovernment = mdata.loc[dex3,'government'].values[0]
            futuremdata = mdata[futuredex]
            newgovernment = futuremdata.loc[futuremdata['year-month'] == futuremdata['year-month'].min(),'government'].values[0]
            if oldgovernment == newgovernment:
                mdata.loc[dex2,'outcome'] = "Same_government_type" # could alternatively use dex1
            else:
                mdata.loc[dex2,'outcome'] = "Change_of_government_type" # could alternatively use dex1

# We no longer need country-leader-spell
mdata = mdata.drop('country-leader-spell', axis = 1)

# The outcome variable must be a category type
mdata['outcome'] = mdata['outcome'].astype('category')

# Obtain probability of type of exit conditional on exit
exit_modeler = LGBExitModeler(data=mdata, exit_col = 'outcome')
exit_modeler.build_model(parallelize=False)
exit_modeler_forecasts = exit_modeler.forecast()

# Do the outcomes always sum to 1?
summary1a = DataFrame(index = exit_modeler_forecasts.index.unique(), columns = exit_modeler_forecasts.columns[:-1])
for i in exit_modeler_forecasts.index.unique():
    summary1a.loc[i,:] = exit_modeler_forecasts.loc[i, exit_modeler_forecasts.columns != 'Future outcome'].sum(axis = 0)
(summary1a!=1).sum().sum()/(summary1a.shape[0]*summary1a.shape[1]) # Should be 0
abs(1-summary1a).sum().sum()/(summary1a.shape[0]*summary1a.shape[1]) # should be 0

# However, they are "close"
tmp = summary1a.to_numpy(dtype='float64')
isclose(tmp,full(tmp.shape,1)).sum()/(tmp.shape[0] * tmp.shape[1])

# Slightly improved normalization
exit_modeler_forecasts_normalized = exit_modeler_forecasts.copy()
exit_modeler_forecasts_normalized.iloc[:,:-1] = exit_modeler_forecasts_normalized.iloc[:,:-1].values / DataFrame(repeat(summary1a.values, repeats = mdata['outcome'].nunique(), axis = 0)).values

# Do the outcomes always sum to 1?
summary1a_normalized = DataFrame(index = exit_modeler_forecasts_normalized.index.unique(), columns = exit_modeler_forecasts_normalized.columns[:-1])
for i in exit_modeler_forecasts.index.unique():
    summary1a_normalized.loc[i,:] = exit_modeler_forecasts_normalized.loc[i, exit_modeler_forecasts_normalized.columns != 'Future outcome'].sum(axis = 0)
(summary1a_normalized!=1).sum().sum()/(summary1a_normalized.shape[0]*summary1a_normalized.shape[1]) # Should be 0
abs(1-summary1a_normalized).sum().sum()/(summary1a_normalized.shape[0]*summary1a_normalized.shape[1]) # should be 0
# half as many different than one and absolute error is cut in half
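For what it's worth, the normalization loop above can be vectorized with a groupby; a sketch on a hypothetical miniature of the forecast layout (one row per individual and exit type, one column per horizon):

```python
import pandas as pd

# Hypothetical miniature of the exit-forecast layout:
# one row per individual-and-exit-type pair, one column per horizon
forecasts = pd.DataFrame({
    "ID": ["A", "A", "B", "B"],
    "1-period": [0.32, 0.69, 0.11, 0.88],
    "2-period": [0.40, 0.61, 0.20, 0.79],
})
horizons = ["1-period", "2-period"]

# Divide each horizon column by its within-individual sum so the
# exit-type probabilities sum to 1 for every individual
forecasts[horizons] = forecasts[horizons] / forecasts.groupby("ID")[horizons].transform("sum")
assert (forecasts.groupby("ID")[horizons].sum().round(12) == 1).all().all()
```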

Obtain population of exit types from non-censored individuals

LGBExitModeler currently obtains the population of exit types from the last observation of every individual. However, the Exit Modeler only trains models on non-censored individuals, so if censored individuals carry exit types that differ from those of non-censored individuals, models will be fit including those exit types and produce near-zero probability forecasts for them. The population of exit types should instead be obtained from non-censored individuals only. (IDA internal: JIRA issue FIFEEXTN-14)

[Feature request] multiindex for forecasts from exitmodeler

LGBExitModeler().forecast() currently outputs the exit type as the last column. It would be more intuitive to use a multi-index structure containing the individual identifier and the exit type. This would also leave the entire DataFrame with a single dtype, making manipulation easier.
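A sketch of the requested shape, using a hypothetical two-individual forecast frame:

```python
import pandas as pd

# Flat layout: exit type rides along as the last (object-dtype) column
flat = pd.DataFrame(
    {
        "1-period": [0.3, 0.7, 0.2, 0.8],
        "Future outcome": ["Dissolution", "Same_government_type"] * 2,
    },
    index=pd.Index(["A", "A", "B", "B"], name="ID"),
)

# Requested layout: (ID, exit type) MultiIndex, leaving only float columns
indexed = flat.set_index("Future outcome", append=True)
print(indexed.index.names)  # ['ID', 'Future outcome']
assert (indexed.dtypes == "float64").all()
```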

Does FIFE support prediction with future features?

I started playing with this package recently and I like it a lot. However, I'm not sure I fully understand the predict/forecast methods, in particular, predicting or forecasting with some future feature states (assuming we have dynamic features whose future values we know).

Example: in the REIGN dataset used in the example notebook, some of the continuous features change over time for a given leader. Suppose I trained a model with these features but am more interested in what a leader's survival probability would be if a column, say "irregular", took some particular value.

Is this sort of prediction/forecast possible with FIFE? And if so, how can one incorporate it into the workflow?

Please pardon my ignorance; my understanding of TTE analysis is somewhat scanty.

Thanks.
