ida-humancapital / fife

Finite-Interval Forecasting Engine: Machine learning models for discrete-time survival analysis and multivariate time series forecasting

License: Other

Languages: Python 98.62%, Batchfile 1.38%

Topics: machine-learning, survival-analysis, forecasting, time-series-analysis, forecasting-models, competing-risks, multivariate-time-series, time-to-event, panel-analysis

fife's Introduction

The Finite-Interval Forecasting Engine (FIFE) provides machine learning and other models for discrete-time survival analysis and multivariate time series forecasting.

Suppose you have a dataset that looks like this:

ID period feature_1 feature_2 feature_3 ...
0 2019 7.2 A 2AX ...
0 2020 6.4 A 2AX ...
0 2021 6.6 A 1FX ...
0 2022 7.1 A 1FX ...
1 2019 5.3 B 1RM ...
1 2020 5.4 B 1RM ...
2 2020 6.7 A 1FX ...
2 2021 6.9 A 1RM ...
2 2022 6.9 A 1FX ...
3 2020 4.3 B 2AX ...
3 2021 4.1 B 2AX ...
4 2022 7.4 B 1RM ...
... ... ... ... ... ...

The entities with IDs 0, 2, and 4 are observed in the dataset in 2022.

  • What are each of their probabilities of being observed in 2023? 2024? 2025?
  • Given that they will be observed, what will be the value of feature_1? feature_3?
  • Suppose entities can exit the dataset under a variety of circumstances. If entities 0, 2, or 4 exit in a given year, what will their circumstances be?
  • How reliable can we expect these forecasts to be?
  • How do the values of the features inform these forecasts?

FIFE can estimate answers to these questions for any unbalanced panel dataset.
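For instance, identifying which entities survive to the final observed period, the set FIFE forecasts forward, takes one line of pandas. A minimal sketch using the toy panel from the table (feature columns omitted):

```python
import pandas as pd

# Toy unbalanced panel mirroring the table above (feature columns omitted)
panel = pd.DataFrame({
    "ID":     [0, 0, 0, 0, 1, 1, 2, 2, 2, 3, 3, 4],
    "period": [2019, 2020, 2021, 2022, 2019, 2020,
               2020, 2021, 2022, 2020, 2021, 2022],
})

# Entities observed in the final period are the ones to forecast forward
final_period = panel["period"].max()
alive_ids = sorted(panel.loc[panel["period"] == final_period, "ID"].unique())
print(alive_ids)  # [0, 2, 4]
```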

FIFE unifies survival analysis (including competing risks) and multivariate time series analysis. Tools for the former neglect future states of survival; tools for the latter neglect the possibility of discontinuation. Traditional forecasting approaches for each, such as proportional hazards and vector autoregression (VAR), respectively, impose restrictive functional forms that limit forecasting performance. FIFE supports the state-of-the-art approaches for maximizing forecasting performance: gradient-boosted trees (using LightGBM) and neural networks (using Keras).

FIFE is simple to use:

from fife.processors import PanelDataProcessor
from fife.lgb_modelers import LGBSurvivalModeler
import pandas as pd

data_processor = PanelDataProcessor(data=pd.read_csv(path_to_your_data))
data_processor.build_processed_data()

modeler = LGBSurvivalModeler(data=data_processor.data)
modeler.build_model()

forecasts = modeler.forecast()

Want to forecast future states, too? Just replace LGBSurvivalModeler with LGBStateModeler and specify the column you'd like to forecast with the state_col argument.

Want to forecast circumstances of exit ("competing risks")? Try LGBExitModeler with the exit_col argument instead.

Here's a guided example notebook with real data, where we forecast when world leaders will lose power.

You can read the documentation for FIFE here.

fife's People

Contributors

agelder, ajain-ida, cdoswald, dependabot[bot], edjishwang, jamesmckownbishop, jay-dennis, mikeguggis


fife's Issues

Add Cumulative Incidence Function Calculation to the Exit Modeler

A key output desired from the Exit Modeler is the Cumulative Incidence Function (CIF). Currently the CIF must be pieced together from survival and exit forecasts; we show how to do this in the documentation. This calculation should be automatic and is easy to implement. Code for doing so can be found in the feature/CIF branch and will be incorporated into FIFE's Exit Modeler.
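For reference, piecing the CIF together amounts to multiplying each period's unconditional exit probability, S(t-1) - S(t), by the conditional exit-type probabilities and accumulating over horizons. A minimal numpy sketch with made-up numbers (survival and exit_probs are hypothetical placeholders, not FIFE output):

```python
import numpy as np

# Hypothetical survival probabilities S(t) for horizons t = 1..4
survival = np.array([0.9, 0.75, 0.6, 0.5])

# Hypothetical conditional exit-type probabilities P(type | exit in t):
# rows = horizons, columns = exit types
exit_probs = np.array([[0.7, 0.3],
                       [0.6, 0.4],
                       [0.5, 0.5],
                       [0.4, 0.6]])

# Unconditional probability of exiting in each period: S(t-1) - S(t), S(0) = 1
exit_in_period = np.concatenate(([1.0], survival[:-1])) - survival

# Cumulative incidence per type: running sum of exit probability x type share
cif = np.cumsum(exit_in_period[:, None] * exit_probs, axis=0)

# Sanity check: CIF summed over types equals 1 - S(t)
assert np.allclose(cif.sum(axis=1), 1 - survival)
```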

construct_embedding_network in tf_modelers.py

In both TFModeler.build_model() and TFModeler.hyperoptimize(), the call to construct_embedding_network() is meant to pass along arguments supplied in params, but the expression that selects those arguments fails because it does not convert the params keys to lower case before checking membership. Here's an example inside TFModeler.build_model():

self.model = self.construct_embedding_network(
    **{k.lower(): v for k, v in params.items() if k in construction_args}
)

This expression correctly lower-cases the keys of the dict it builds, but tests the original (upper-case) params keys against construction_args, so no custom params, such as DROPOUT_SHARE and DENSE_LAYERS, ever reach construct_embedding_network().

Making this small modification everywhere the pattern occurs should fix it:

self.model = self.construct_embedding_network(
    **{k.lower(): v for k, v in params.items() if k.lower() in construction_args}
)
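The difference is easy to demonstrate in isolation (the construction_args and params values here are stand-ins):

```python
construction_args = {"dropout_share", "dense_layers"}
params = {"DROPOUT_SHARE": 0.25, "DENSE_LAYERS": 2, "BATCH_SIZE": 512}

# Buggy: membership test uses the original upper-case key, so nothing matches
buggy = {k.lower(): v for k, v in params.items() if k in construction_args}

# Fixed: lower-case the key before the membership test
fixed = {k.lower(): v for k, v in params.items() if k.lower() in construction_args}

print(buggy)  # {}
print(fixed)  # {'dropout_share': 0.25, 'dense_layers': 2}
```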

Categorical labels in IFE models

When creating the internal model for IFE, '_label' appears to use the internal labeling from .cat.categories. However, other models (including the Exit Modeler) appear to use the original values. This creates problems when matching the outputs of these models, especially if the original labels are numeric. We need to investigate how categorical variables are labeled, especially as it relates to IFE.

ids["_label"] = ids["_label"].cat.codes

1e40f26
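The gap between internal codes and original values can be seen with a toy categorical series (values here are illustrative):

```python
import pandas as pd

# Numeric labels stored as a categorical column
labels = pd.Series([10, 20, 10, 30], dtype="category")

# .cat.codes yields internal integer positions, not the original values
print(labels.cat.codes.tolist())       # [0, 1, 0, 2]
print(labels.cat.categories.tolist())  # [10, 20, 30]
```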

compute_model_uncertainty in tf_modelers.py

The method compute_model_uncertainty() on line 569 in tf_modelers.py is not working. The branch feature/prediction_intervals was created to address this bug and to further extend FIFE to produce prediction intervals.

Prediction Intervals

  1. Investigate the existing implementation for forecast uncertainty with the TF modeler
  2. Develop prediction intervals for the LGB modelers
    The branch feature/prediction_intervals was created for these purposes.
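One generic recipe for item 2, offered as a sketch rather than a description of what feature/prediction_intervals actually implements, is to train an ensemble of models (e.g., bagged LGB fits) and take empirical quantiles of their forecasts:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-in for an ensemble of survival forecasts:
# rows = ensemble members, columns = forecast horizons
ensemble = rng.uniform(0.4, 0.9, size=(200, 3))

# 95% prediction interval per horizon from empirical quantiles
lower, upper = np.quantile(ensemble, [0.025, 0.975], axis=0)
point = ensemble.mean(axis=0)
assert ((lower <= point) & (point <= upper)).all()
```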

Normalize forecasts from ExitModeler

Forecasts from LGBExitModeler().forecast() do not sum to 1 within each individual-time period, though they are close. A manual normalization cuts the error roughly in half, as shown in the code below. However, a manual normalization could obscure other bugs and make debugging harder in the future.

from fife.lgb_modelers import LGBSurvivalModeler
from fife.lgb_modelers import LGBExitModeler
from fife.processors import PanelDataProcessor
from fife.utils import make_results_reproducible
from pandas import concat, date_range, read_csv, to_datetime, Index, DataFrame
#import pandas as pd
#import numpy as np
from numpy import array, repeat, full, isclose

# import warnings
# warnings.filterwarnings('ignore')

SEED = 9999
make_results_reproducible(SEED)

data = read_csv("https://www.dl.dropboxusercontent.com/s/3tdswu2jfgwp4xw/REIGN_2020_7.csv?dl=0")
data.head()

data["country-leader"] = data["country"] + ": " + data["leader"]
data["year-month"] = data["year"].astype(int).astype(str) + data["month"].astype(int).astype(str).str.zfill(2)
data["year-month"] = to_datetime(data["year-month"], format="%Y%m")
data = concat([data[["country-leader", "year-month"]],
               data.drop(["ccode", "country-leader", "leader", "year-month"],
                         axis=1)],
               axis=1)
data.head()

total_obs = len(data)
data = data.drop_duplicates(["country-leader", "year-month"], keep="first")
# n_duplicates = total_obs - len(data)
# print(f"{n_duplicates} observations with a duplicated identifier pair deleted.")
    
data_processor = PanelDataProcessor(data=data)
data_processor.build_processed_data()
# data_processor.data.head()

mdata = data_processor.data.copy()

# Assign unique ID based off broken panels
gaps = mdata.groupby("country-leader")["_period"].shift() < mdata["_period"] - 1
spells = gaps.groupby(mdata["country-leader"]).cumsum()
mdata['country-leader-spell'] = mdata['country-leader'] + array(spells, dtype = 'str')

## Create exit status for dissolution, change of government type, and same government type
mdata['outcome'] = 'no_exit'

for i in mdata['country-leader-spell'].unique():
    # Boolean index for the unique country-leader-spell observations
    dex1 = mdata['country-leader-spell'] == i
    # Boolean index for the last observation for the country-leader-spell observations
    dex2 = dex1 & (mdata['year-month'] == mdata.loc[dex1,'year-month'].max())
    # Boolean index for the last observation for the country-leader-spell observations unless it is censored
    dex3 = dex2 & (mdata['year-month'] != mdata['year-month'].max())
    # Only assign exit status if not censored
    if dex3.sum() > 0:
        futuredex = (mdata['year-month'] >= mdata.loc[dex3,'year-month'].values[0]) & (mdata['country'] == mdata.loc[dex3,"country"].values[0]) & ~dex1
        if futuredex.sum() == 0:
            mdata.loc[dex2,'outcome'] = "Dissolution" # could alternatively use dex1
        else:
            oldgovernment = mdata.loc[dex3,'government'].values[0]
            futuremdata = mdata[futuredex]
            newgovernment = futuremdata.loc[futuremdata['year-month'] == futuremdata['year-month'].min(),'government'].values[0]
            if oldgovernment == newgovernment:
                mdata.loc[dex2,'outcome'] = "Same_government_type" # could alternatively use dex1
            else:
                mdata.loc[dex2,'outcome'] = "Change_of_government_type" # could alternatively use dex1

# We no longer need country-leader-spell
mdata = mdata.drop('country-leader-spell', axis = 1)

# The outcome variable must be a category type
mdata['outcome'] = mdata['outcome'].astype('category')

# Obtain probability of type of exit conditional on exit
exit_modeler = LGBExitModeler(data=mdata, exit_col = 'outcome')
exit_modeler.build_model(parallelize=False)
exit_modeler_forecasts = exit_modeler.forecast()

# Do the outcomes always sum to 1?
summary1a = DataFrame(index = exit_modeler_forecasts.index.unique(), columns = exit_modeler_forecasts.columns[:-1])
for i in exit_modeler_forecasts.index.unique():
    summary1a.loc[i,:] = exit_modeler_forecasts.loc[i, exit_modeler_forecasts.columns != 'Future outcome'].sum(axis = 0)
(summary1a!=1).sum().sum()/(summary1a.shape[0]*summary1a.shape[1]) # Should be 0
abs(1-summary1a).sum().sum()/(summary1a.shape[0]*summary1a.shape[1]) # should be 0

# However, they are "close"
tmp = summary1a.to_numpy(dtype='float64')
isclose(tmp,full(tmp.shape,1)).sum()/(tmp.shape[0] * tmp.shape[1])

# Slightly improved normalization
exit_modeler_forecasts_normalized = exit_modeler_forecasts.copy()
exit_modeler_forecasts_normalized.iloc[:,:-1] = exit_modeler_forecasts_normalized.iloc[:,:-1].values / DataFrame(repeat(summary1a.values, repeats = mdata['outcome'].nunique(), axis = 0)).values

# Do the outcomes always sum to 1?
summary1a_normalized = DataFrame(index = exit_modeler_forecasts_normalized.index.unique(), columns = exit_modeler_forecasts_normalized.columns[:-1])
for i in exit_modeler_forecasts.index.unique():
    summary1a_normalized.loc[i,:] = exit_modeler_forecasts_normalized.loc[i, exit_modeler_forecasts_normalized.columns != 'Future outcome'].sum(axis = 0)
(summary1a_normalized!=1).sum().sum()/(summary1a_normalized.shape[0]*summary1a_normalized.shape[1]) # Should be 0
abs(1-summary1a_normalized).sum().sum()/(summary1a_normalized.shape[0]*summary1a_normalized.shape[1]) # should be 0
# half as many different than one and absolute error is cut in half
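For what it's worth, the normalization loop above can be vectorized with a groupby; a sketch on a hypothetical miniature of the forecast layout (one row per individual and exit type, one column per horizon):

```python
import pandas as pd

# Hypothetical miniature of the exit-forecast layout:
# one row per individual-and-exit-type pair, one column per horizon
forecasts = pd.DataFrame({
    "ID": ["A", "A", "B", "B"],
    "1-period": [0.32, 0.69, 0.11, 0.88],
    "2-period": [0.40, 0.61, 0.20, 0.79],
})
horizons = ["1-period", "2-period"]

# Divide each horizon column by its within-individual sum so the
# exit-type probabilities sum to 1 for every individual
forecasts[horizons] = forecasts[horizons] / forecasts.groupby("ID")[horizons].transform("sum")
assert (forecasts.groupby("ID")[horizons].sum().round(12) == 1).all().all()
```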

Obtain population of exit types from non-censored individuals

LGBExitModeler currently obtains the population of exit types from the last observation of every individual. However, the Exit Modeler only trains models on non-censored individuals, so if censored individuals carry exit types that differ from those of non-censored individuals, models will be fit including those exit types and produce near-zero probability forecasts for them. The population of exit types should instead be obtained from non-censored individuals only. (IDA internal: JIRA issue FIFEEXTN-14)

[Feature request] multiindex for forecasts from exitmodeler

LGBExitModeler().forecast() currently outputs the exit type as the last column. It would be more intuitive to use a multi-index structure containing the individual identifier and the exit type. This would also leave the entire DataFrame with a single dtype, making manipulation easier.
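A sketch of the requested shape, using a hypothetical two-individual forecast frame:

```python
import pandas as pd

# Flat layout: exit type rides along as the last (object-dtype) column
flat = pd.DataFrame(
    {
        "1-period": [0.3, 0.7, 0.2, 0.8],
        "Future outcome": ["Dissolution", "Same_government_type"] * 2,
    },
    index=pd.Index(["A", "A", "B", "B"], name="ID"),
)

# Requested layout: (ID, exit type) MultiIndex, leaving only float columns
indexed = flat.set_index("Future outcome", append=True)
print(indexed.index.names)  # ['ID', 'Future outcome']
assert (indexed.dtypes == "float64").all()
```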

Does FIFE support prediction with future features?

I started playing with this package recently and I like it a lot. However, I'm not sure I fully understand the predict/forecast methods, in particular, predicting or forecasting with some future feature states (assuming we have dynamic features whose future values we know).

Example: in the REIGN dataset used in the example notebook, some of the continuous features change over time for a given leader. Suppose I trained a model with these features but am more interested in what a leader's survival probability would be if a column, say "irregular", took some particular value.

Is this sort of prediction/forecast possible with FIFE? And if so, how can one incorporate it into the workflow?

Please pardon my ignorance; my understanding of TTE analysis is somewhat scanty.

Thanks.
