
coltallen / btyd


This project forked from camdavidsonpilon/lifetimes


Buy Till You Die and Customer Lifetime Value statistical models in Python.

Home Page: https://btyd.readthedocs.io/

License: Apache License 2.0

Python 99.94% Makefile 0.06%
bayesian buy-til-you-die customer-lifetime-value data-science python

btyd's Introduction

Welcome to my data science project portfolio! These projects are an ongoing work in progress, and I'm interested in collaborating with others on projects involving Bayesian inference in Python and Stan, so please keep checking back for updates and contact me if you're interested. You can also find me on LinkedIn.


btyd's Issues

Todos and priorities

Hi everyone,

What are the current development priorities, and where can I get involved?

I was involved in the thread in the original repository back in February. I see that quite a few things have happened since then, and I want to understand where I can get involved.

Before the fork, I was working on adding type hinting to lifetimes, and I saw this listed as a to-do in the previous thread. Is it still current?

It would be helpful to have a list of open issues to identify where one can contribute.

Does it still make sense to contribute the type hinting? What else remains relevant?

Tutorial - steps on how to verify/validate the churn probabilities

Thanks for this useful package and for incorporating some helpful functions.

Currently, we are exploring this package to derive some business insights about our customers.

While the expected purchase count and the expected average revenue can be verified using typical sklearn metrics such as MSE and RMSE, I am unable to work out how to use arviz to verify the churn probabilities (probability_alive and probability_alive_upto_time_t), mainly because I am more of an applied data scientist and can't use the arviz package as-is for this problem. I did refer to the post here - #33

But I am not sure how I can do this in a simple, intuitive manner for typical sklearn-oriented data scientists.

Is there a simple tutorial you can share on how to validate and interpret the results? It would really be helpful.
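
One minimal way to sanity-check the churn probabilities without going deep into arviz is to treat the predicted probability of being alive as a score for a binary outcome (did the customer purchase again during the holdout period?) and apply familiar sklearn classification metrics. This is a generic sketch, not an official btyd recipe; it assumes bgf is a fitted BG/NBD fitter and summary_cal_holdout comes from calibration_and_holdout_data:

from sklearn.metrics import roc_auc_score, brier_score_loss

# Probability each customer is still "alive" at the end of the calibration period.
p_alive = bgf.conditional_probability_alive(
    summary_cal_holdout["frequency_cal"],
    summary_cal_holdout["recency_cal"],
    summary_cal_holdout["T_cal"],
)

# Proxy label: did the customer actually purchase during the holdout window?
repeat_in_holdout = (summary_cal_holdout["frequency_holdout"] > 0).astype(int)

# Discrimination and calibration of the probabilities against that proxy.
print("ROC AUC:", roc_auc_score(repeat_in_holdout, p_alive))
print("Brier score:", brier_score_loss(repeat_in_holdout, p_alive))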

Question - How to get ITT or transaction rate of a customer?

I am trying to find out whether we can extract the transaction rate (lambda) of each customer through the btyd package. By ITT, I mean inter-transaction time.

I know the algorithm uses a transaction rate to predict/determine future values. Is it possible to get the transaction rate for each customer through the btyd package (instead of us computing the average gap between transactions ourselves)? I'm not sure whether our average computation and the model's transaction rate are one and the same, but it would be good to verify the value that the package/algorithm outputs.
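
For reference, a minimal pandas sketch of the local computation mentioned above (the average gap in days between a customer's transactions); the customer_id and date column names and the file path are assumptions about the transaction data, not part of the btyd API:

import pandas as pd

transactions = pd.read_csv("transactions.csv", parse_dates=["date"])  # assumed input layout

# Mean inter-transaction time per customer, in days (NaN for single-purchase customers).
avg_itt = (
    transactions.sort_values(["customer_id", "date"])
    .groupby("customer_id")["date"]
    .apply(lambda d: d.diff().dt.days.mean())
    .rename("avg_inter_transaction_time_days")
)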

Missing arguments on BetaGeoFitter.fit

Hi!

My name is Antonio and I tried to use your library. I was following the user guide, and when I tried to use BetaGeoFitter().fit(data) I received the following error:

image

I went to the code and saw that we can pass the values as a NumPy array, so I tried that and another error arose:

image

conflict in monetary_value calculation and definition

The definition of monetary_value in the documentation:
"monetary_value represents the average value of a given customer’s purchases. This is equal to the sum of all a customer’s purchases divided by the total number of purchases. Note that the denominator here is different than the frequency described above."

The summary_data_from_transaction_data function calculates monetary_value as the mean of repeat purchases only, i.e. the denominator equals the frequency rather than the total number of purchases.
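
For comparison, a sketch of what the documented definition would look like when computed directly from a raw transactions table (total spend divided by the total number of purchases, first purchase included). The customer_id, date, and amount column names and the file path are assumptions about the input data, not part of the btyd API:

import pandas as pd

transactions = pd.read_csv("transactions.csv", parse_dates=["date"])  # assumed input layout

# Documented definition: sum of all purchases / total purchase count (first purchase included).
monetary_value_doc = (
    transactions.groupby("customer_id")["amount"]
    .agg(total_spend="sum", n_purchases="count")
    .assign(monetary_value=lambda g: g["total_spend"] / g["n_purchases"])
)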

Python 3.10 support

Can you release a new version of the package with Python 3.10 support?

Issue while fitting BetaGeoModel

Hi,

I am facing the below error while fitting the data with BetaGeoModel.

Error:
ContextualVersionConflict: (scipy 1.7.1 (/databricks/python3/lib/python3.9/site-packages), Requirement.parse('scipy>=1.8.0'), {'arviz'})

I have even tried reinstalling the arviz package and upgrading scipy to version 1.10, but I am still facing the above error. Please help me resolve it.

Support pickle

The fitters' save_model and load_model use dill. Using pickle would be a better option, as it's part of the standard library and doesn't require all save and load use cases to be file-based.
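
A minimal sketch of what a pickle-based alternative could look like (illustrative only; it assumes every attribute of the fitter is picklable, which is not guaranteed for the current dill-based fitters):

import pickle

# Serialize a fitted model to bytes without touching the filesystem ...
model_bytes = pickle.dumps(bgf)          # `bgf` is assumed to be a fitted fitter object
restored = pickle.loads(model_bytes)

# ... or to a file, mirroring the existing file-based API.
with open("bgf_model.pkl", "wb") as f:
    pickle.dump(bgf, f)
with open("bgf_model.pkl", "rb") as f:
    restored_from_file = pickle.load(f)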

Add Continuous Time CLV Calculations

Unrealistically high CLV estimates have been a common complaint about the legacy lifetimes CLV formulations. I believe these discrepancies are due to calculations being made in discrete rather than continuous time.

Follow this link for a primer comparing summation to integration:

https://math.stackexchange.com/questions/2089929/comparing-discrete-sums-and-integrals

Assume in the graphic below that the red line is our customer value function, and integrating over time will give us our total CLV for the customer:

discrete_continuous_time

The bars represent discrete value measurements at regular time intervals. Summing up the area of the bars is essentially what the existing customer_lifetime_value method is doing. However, note the corners of the bars protruding beyond the red line - this will inflate total CLV estimations; the wider the bar, the greater the inflation. This width is fixed to monthly intervals in the current customer_lifetime_value implementation; the freq parameter only reflects the time intervals in which data was aggregated for model training, which is daily for most use cases. In theory training models on weekly or monthly data would improve the accuracy of CLV estimates.
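
In symbols, the gap is roughly that between a discounted sum over discrete periods and a discounted integral over continuous time (a generic illustration of the idea, not the exact formulation used by customer_lifetime_value):

$$
\mathrm{CLV}_{\text{discrete}} = \sum_{t=1}^{T} \frac{\mathbb{E}[v(t)]}{(1+d)^{t}}
\qquad \text{vs.} \qquad
\mathrm{CLV}_{\text{continuous}} = \int_{0}^{\infty} \mathbb{E}[v(t)]\, e^{-\delta t}\, dt
$$

where $\mathbb{E}[v(t)]$ is the expected customer value at time $t$, $d$ is the per-period discount rate, and $\delta$ is its continuously compounded counterpart. When $\mathbb{E}[v(t)]$ is declining, each term in the sum carries the value from the start of its period across the whole period, so the discrete total overshoots the integral; the wider the period, the larger the overshoot.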

Fortunately, continuous-time CLV expressions exist. In addition to accuracy, they also have the advantage of integrating over the total lifetime of the customer rather than a user-specified time period. However, implementation is model-specific.

An expression for the Pareto-NBD model is provided as equation (2) on page 8 of this paper:

http://brucehardie.com/papers/rfm_clv_2005-02-16.pdf

I've also found implementations for the Beta-Geometric/Beta-Binomial model and a few other models that haven't been added yet to btyd, but none for the BG/NBD model. I'll try reaching out to Fader himself on LinkedIn for assistance on this.

Prediction API

Modules: lifetimes.models.__init__.BaseModel class object

Issue: Currently the user API has multiple predictive methods that each require arrays to be passed in individually as arguments, which is clunky for the end user.

Work Summary: Add a self.predict() method with arguments for each of the predictive methods and their respective inputs. Something like this:

def predict(self, method: str, *args) -> arraylike:

    # Map method name strings to the corresponding predictive methods.
    method_dict = {
        'p_alive': self.conditional_probability_alive,
        'n_purchases': self.expected_number_of_purchases_up_to_time,
        }

    array_out = method_dict.get(method)(*args)

    return array_out

Other Comments: method_dict will probably need to be defined outside of this method, as an attribute of the model subclass.
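
Hypothetical usage, assuming a fitted model and RFM arrays in scope (the argument order simply mirrors whatever the wrapped predictive method expects):

# 'p_alive' routes to conditional_probability_alive; 'n_purchases' to expected_number_of_purchases_up_to_time.
p_alive = model.predict('p_alive', frequency, recency, T)
n_purchases = model.predict('n_purchases', t, frequency, recency, T)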

Persist Full InferenceData object as JSON

This is a fairly straightforward task that will go a long way towards improving model functionality and maintainability of the code base.

Modules: lifetimes.models.__init__.BaseModel class object

Issue: An ArviZ InferenceData object is created as a model attribute whenever model.fit() is called. Currently model persistence entails extracting model parameters from this attribute and dumping them into a memory-optimized JSON file. However, once this JSON file is loaded into a model, ArviZ plotting and statistical functions are no longer supported. The pre/post-processing code to format this JSON also adds unnecessary complexity to the BaseModel class and could make future maintenance more difficult. Plus let's be honest, this isn't a 350GB NLP model; reducing a <10 MB InferenceData object down to a <4 MB JSON is not worth the hassle.

Work Summary: Replace the JSON formatting code in _unload_params(), fit(), save_params() and load_params() with ArviZ methods like arviz.InferenceData.to_json() and arviz.from_json().

https://arviz-devs.github.io/arviz/api/data.html

remove_hypers can also be removed as a model class attribute, and I'm not opposed to renaming save_params() and load_params() to save_model() and load_model() either.

Other Comments: JSON is the preferred format for model persistence. Pickle files have their place for the fast reads/writes demanded by online learning and for passing objects between CPU threads, but the added complexity of their implementation just isn't worth it for a model that is only saved and loaded once. They are also a security risk, since malware can be hidden in a pickled file. I could totally see a hacker with prior system access overwriting a .pkl model file with an executable that exfiltrates customer IDs whenever the model is run.

ImportError: DLL load failed while importing _nnls: The specified module could not be found.

I am trying to import btyd 0.1b3 on my Windows 10 system with Python 3.8.8.

But I get the error below:

C:\Users\test\Anaconda3\lib\site-packages\scipy\_distributor_init.py:30: UserWarning: loaded more than 1 DLL from .libs:
C:\Users\test\Anaconda3\lib\site-packages\scipy\.libs\libopenblas.3HBPCJB5BPQGKWVZAVEBXNNJ2Q2G3TUP.gfortran-win_amd64.dll
C:\Users\test\Anaconda3\lib\site-packages\scipy\.libs\libopenblas.PYQHXLVVQ7VESDPUVUADXEVJOBGHJPAY.gfortran-win_amd64.dll
  warnings.warn("loaded more than 1 DLL from .libs:"
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-1-1ea10774841a> in <module>
      9 import math
     10 from math import sqrt
---> 11 import btyd
     12 from btyd import *

~\AppData\Roaming\Python\Python38\site-packages\btyd\__init__.py in <module>
      1 import warnings
      2 
----> 3 from .fitters import BaseFitter
      4 from .fitters.beta_geo_fitter import BetaGeoFitter
      5 from .fitters.beta_geo_beta_binom_fitter import BetaGeoBetaBinomFitter

~\AppData\Roaming\Python\Python38\site-packages\btyd\fitters\__init__.py in <module>
      8 import pandas as pd
      9 from textwrap import dedent
---> 10 from scipy.optimize import minimize
     11 from autograd import value_and_grad, hessian
     12 from ..utils import _save_obj_without_attr, ConvergenceError

~\Anaconda3\lib\site-packages\scipy\optimize\__init__.py in <module>
    409 from ._nonlin import *
    410 from ._slsqp_py import fmin_slsqp
--> 411 from ._nnls import nnls
    412 from ._basinhopping import basinhopping
    413 from ._linprog import linprog, linprog_verbose_callback

ImportError: DLL load failed while importing _nnls: The specified module could not be found.

How can I import it without any error?

CLV value too high in btyd

I am trying to predict the CLTV of a customer over the next 365 days (a 12-month window). I encountered the same issue with the lifetimes package as well (and I see there is an open bug there). The issues can be found here and here

I use 4 years of data as calibration data and 1 year of data as holdout data.

I wrote the piece of code below:

ggf = GammaGammaFitter(penalizer_coef=0.01) # model object
ggf.fit(monetary_cal_df['frequency_cal'], monetary_cal_df['avg_monetary_value_cal']) # model fitting
# Prediction of expected amount of average profit
monetary_cal_df["expct_avg_spend"] = ggf.conditional_expected_average_profit(monetary_cal_df['frequency_cal'], monetary_cal_df['avg_monetary_value_cal'])

monetary_cal_df["cltv_12m"] = ggf.customer_lifetime_value(bgf,
                                   monetary_cal_df['frequency_cal'],
                                   monetary_cal_df['recency_cal'],
                                   monetary_cal_df['T_cal'],
                                   monetary_cal_df['avg_monetary_value_cal'],
                                   time=12,  # 12 month
                                   freq="D",  # frequency of T
                                   discount_rate=0.01)
monetary_cal_df.sort_values("cltv_12m",ascending=False).head()

But the problem is that the CLTV values are too high. For instance, in the screenshot below (don't worry about the column header names in the screenshot; both the expected purchases and the CLV are for 12 months only):

image

When I multiply "expected_avg_spend" by "expected_purchase", the result is more than 100K lower than the CLV.

Can you guide me on whether it is normal to see such huge differences (and have you encountered cases like this in your own work on this project)? How is CLTV calculated? Am I right to think that it should be somewhat close to the product of expected_avg_spend and expected_purchase?

Apart from R2 and RMSE, I also plotted the graph below to check whether the results look okay visually. I guess they do, but is there any red flag I should be aware of when doing assessments like this?

image

Accordingly, a few related questions on the theory of CLTV:

a) I see that we compute CLTV for a specific time horizon like 6 months, 12 months, etc. So when we say Customer Lifetime Value over the next 6, 12, or t months, am I right to interpret this as the amount/revenue we expect from the customer over the next 6, 12, or t months? The keyword "lifetime" doesn't mean the total revenue from the customer during their entire lifetime with us (e.g., a customer could stay with us for 10 years or more, which we can't know in advance, but the CLV we predict is specifically for the next 6/12/t months). Is my understanding right?

b) Can the time parameter in ggf.customer_lifetime_value() accept values only in terms of months, as shown in the code snippet comments? I see that the freq parameter applies to a different column (T), so CLTV can only be computed in monthly intervals. Am I right to understand that?

c) For the GGM, the prerequisite correlation between avg_monetary_cal and frequency_cal is 0.28 in my data. Is 0.28 indicative of a weak enough correlation to proceed further? In all the tutorials online it is very low, such as 0.03 or 0.07, so is 0.28 low enough to proceed?

d) When I compute R2 between monetary_val_cal and expected_avg_spend, 92% of the variance is explained. But when I compute R2 between monetary_val_holdout and expected_avg_spend (which is what we are expected to do, I guess), I get only negative values. I guess it is overfitting. Do you have any pointers on how I can improve the performance? The R2 is around 72% for the expected future transaction count (when I compare it with holdout_frequency); any suggestions on how that can be improved? I already set the penalizer to 0 (as it resulted in a better R2 score) and subset my dataframe for the GGM to include only customers with monetary_value_cal > 0. I filter on monetary_value_cal > 0 because the package gave me an error that there were values equal to 0, so I included that filter criterion. Basically, my BG/NBD model works well, but it is the GGM that is causing the issue.

e) I use 4 years of data as calibration data and 1 year as holdout data. Not all customers are present in both sets; some are present only in calibration and some only in holdout (because they became customers during the holdout period). Anyway, we currently restrict our analysis to people who are present in both sets. Is that the right thing to do? Or should we let the model predict for people who are missing from the holdout set (maybe they left during the holdout period) but were present in calibration? Or should this package be used only for customers who are present in both the calibration and holdout sets?

f) Do you think it would be wise to reduce the dataset size? Maybe this problem doesn't require 5 years of data (maybe the older years just add noise). Would modeling only on the past 2 years be a good route for solving this problem?

Currently, the metrics for monetary_holdout and expct_avg_spending look like the below. I drop NAs because the model returned NaN as the CLV value from the gamma function for rows where frequency_holdout is zero, so I dropped those records.

image

Why monetary_val and monetary_holdout is missing?

@ColtAllen - Why don't we see monetary_value for the calibration and holdout periods when we use the "calibration_and_holdout_data" utility function? You can see that when I use the btyd or lifetimes function (I tried both), it results in the table below (the monetary value is missing). Can you help me understand what is wrong here? If it is not there by design, then how do we compare the monetary values for the calibration and holdout datasets (using the Gamma-Gamma model)?

For example, I am currently able to compare the frequency between the actual holdout and the predicted expected purchases, as shown below

image

But how can we do the same for monetary value (using the Gamma-Gamma model) when we don't have a monetary value column in the calibration_and_holdout_data output? Currently the table looks like the one below. Should I compute it locally?

image

One side question - I see that the duration of the holdout dataset is 365 days. So, to ensure a proper comparison between actual and predicted values, I should also set the time horizon in the model's predict function to 365 days, right? I know we can set different time horizons based on business requirements, but to compare them correctly we need to use 365 days. Am I right?

Big data: pandas limitation

Hi everyone!

Are there any workarounds for working with large datasets? When I load a large dataset in Spark and try to convert it to pandas, the process may throw an error or take a long time.

Thanks! 👍

GammaGammaModel with BetaGeoCovarsFitter

There are errors when predicting CLV with GammaGammaModel using BetaGeoCovarsFitter.

At first, the error is 'BetaGeoCovarsFitter' object has no attribute '_conditional_expected_number_of_purchases_up_to_time'. Did you mean: 'conditional_expected_number_of_purchases_up_to_time'?, which is due to gamma_gamma_model.py calling transaction_prediction_model._conditional_expected_number_of_purchases_up_to_time, while in beta_geo_covar_fitter.py the method is named without the leading underscore.

After adding the underscore in beta_geo_covar_fitter.py, a different error shows: BetaGeoCovarsFitter._conditional_expected_number_of_purchases_up_to_time() missing 2 required positional arguments: 'X_tr' and 'X_do'.

btyd library is dependent on lifetimes but doesn't install it.

I installed btyd using pip install btyd.

It installed with no errors. However, it won't import, as I don't have the lifetimes library installed in that environment. I think the references need to be changed in modified_beta_geo_fitter.py

Error message:

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Input In [1], in <cell line: 4>()
      2 import pandas as pd
      3 import numpy as np
----> 4 import btyd

File /libraries/CLTV/lib/python3.9/site-packages/btyd/__init__.py:6, in <module>
      4 from .fitters.beta_geo_fitter import BetaGeoFitter
      5 from .fitters.beta_geo_beta_binom_fitter import BetaGeoBetaBinomFitter
----> 6 from .fitters.modified_beta_geo_fitter import ModifiedBetaGeoFitter
      7 from .fitters.pareto_nbd_fitter import ParetoNBDFitter
      8 from .fitters.gamma_gamma_fitter import GammaGammaFitter

File /libraries/CLTV/lib/python3.9/site-packages/btyd/fitters/modified_beta_geo_fitter.py:11, in <module>
      8 from autograd.scipy.special import gammaln, beta, gamma
      9 from scipy.special import hyp2f1
---> 11 from lifetimes import BetaGeoFitter
     12 from lifetimes.generate_data import modified_beta_geometric_nbd_model
     15 class ModifiedBetaGeoFitter(BetaGeoFitter):

ModuleNotFoundError: No module named 'lifetimes'

Version: btyd==0.1a1

plot_period_transactions after BetaGeoModel().fit()

I was trying to plot with plot_period_transactions after BetaGeoModel().fit(), but got an error:

AttributeError: 'BetaGeoModel' object has no attribute 'data'. Did you mean: '_idata'?

I tried, but wasn't able to figure out how to plot after using the Model API. Thank you.

Remove Model Persistence

Removing the save_model() and load_model() methods to streamline the new modeling API is something some of you may find controversial, but both methods are literally single-line wrappers for the equivalent functions in ArviZ.

ArviZ is a very powerful library that is automatically installed along with BTYD because it is a transitive dependency of BTYD's core dependency, PyMC. There will be a future PR wrapping an aggregation of useful ArviZ model evaluation plots and metrics, but otherwise I don't want to see BTYD become a thin wrapper for what can be done by simply passing Model.idata into ArviZ.

In the next PR I'll be adding instructions to the documentation on how to persist models with ArviZ:

import btyd
import arviz as az

data_df = btyd.load_cdnow_summary()

bgm = btyd.BetaGeoModel().fit(data_df)

# Save inference data of a fitted model as a JSON:
bgm.idata.to_json('path/to/file.json')

# Load model inference data from a JSON:
bgm._idata = az.from_json('path/to/file.json')

Another advantage of scrapping persistence from the BTYD model API is that ArviZ supports a variety of file formats, so users are no longer restricted to JSON files for their models. Once this project is in beta, I also want to add some model type checks originally intended for load_model(), but those can be handled by the BaseModel._unload_params() method instead.

Fix Plotting Dependencies on Calibration vs. Holdout Purchases

It's been a longstanding issue in the legacy Lifetimes library that the plot_calibration_purchases_vs_holdout_purchases function in plotting.py cannot be used without first running data through calibration_and_holdout_data in utils.py. This is a useful plotting function for model evaluation, but for those of us working with huge volumes of transaction data, utils.py just isn't going to cut it, and the RFM aggregations must be done in a separate service like Spark or Snowflake, which effectively blocks the use of this function.

Replacing the calibration_holdout_matrix argument with separate calibration_matrix and holdout_matrix arguments should be enough to resolve this matter.
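
A minimal sketch of how the plot could then be produced from externally computed matrices (column names follow the calibration_and_holdout_data conventions; the function name, signature, and the assumption that model exposes conditional_expected_number_of_purchases_up_to_time are illustrative of the proposal, not the final API):

import matplotlib.pyplot as plt

def plot_cal_vs_holdout_purchases(model, calibration_matrix, holdout_matrix, duration_holdout):
    # Join on the shared customer-ID index and predict purchases over the holdout horizon.
    df = calibration_matrix.join(holdout_matrix)
    df["predicted"] = model.conditional_expected_number_of_purchases_up_to_time(
        duration_holdout, df["frequency_cal"], df["recency_cal"], df["T_cal"]
    )
    # Average observed vs. predicted holdout purchases, grouped by calibration frequency.
    summary = df.groupby("frequency_cal")[["frequency_holdout", "predicted"]].mean()
    ax = summary.plot()
    ax.set_ylabel("Average purchases in holdout period")
    plt.show()
    return ax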

btyd.fitters.BaseFitter._fit() got multiple values for keyword argument 'bounds'

Hi,

I want to pass custom bounds values to the scipy.optimize.minimize function via GammaGammaFitter.fit(), but I get the following error:
btyd.fitters.BaseFitter._fit() got multiple values for keyword argument 'bounds'.

A small, simplified example to replicate this error would be:

from btyd import BetaGeoFitter
from btyd import GammaGammaFitter

from lifetimes.datasets import load_cdnow_summary_data_with_monetary_value

df_parsed = load_cdnow_summary_data_with_monetary_value()
returning_customers_summary = df_parsed[df_parsed["frequency"] != 0]
model = BetaGeoFitter()
model.fit(
    frequency=df_parsed["frequency"], 
    recency=df_parsed["recency"], 
    T=df_parsed["T"]
)
print(model.summary)

gg_model = GammaGammaFitter(penalizer_coef=1e-2)
gg_model.fit(
    frequency=returning_customers_summary["frequency"],
    monetary_value=returning_customers_summary["monetary_value"],
    bounds=((None, None), (0.00996, None), (None, None))
)

I need this feature to avoid problems with GammaGammaFitter.conditional_expected_average_profit() and GammaGammaFitter.customer_lifetime_value().
For my data I need a higher penalizer_coef, which results in a q smaller than 1, which in turn leads to negative CLV outputs.
When I set q_constraint = True (meaning bounds=((None, None), (0.0, None), (None, None))), I get q=1.0 as a parameter, which leads to NaN values in GammaGammaFitter.conditional_expected_average_profit() and GammaGammaFitter.customer_lifetime_value(), so I want to "force" q to be bigger than 1.0 with custom bounds.

Question - Fit and predict() only using calibration variables?

I currently have code written like below

bgf = BetaGeoFitter(penalizer_coef=0.001) # model object
bgf.fit(summary_cal_holdout['frequency_cal'], summary_cal_holdout['recency_cal'], summary_cal_holdout['T_cal']) # model fitting

# Prediction of expected number of transaction for each customer for one year (365 days)
summary_cal_holdout['expctd_num_of_purch'] = bgf.predict(365, summary_cal_holdout['frequency_cal'], summary_cal_holdout['recency_cal'], summary_cal_holdout['T_cal']) 
summary_cal_holdout.sort_values("expctd_num_of_purch",ascending=False).head()

As you can see, I fit the model using frequency_cal, recency_cal and T_cal (from the calibration dataset).

Then, in the next line (bgf.predict), I again use the same frequency_cal, recency_cal and T_cal.

So, is this the right thing to do? Meaning, do we pass the same variables as input to both the fit and predict methods? I am used to fitting with X_train and predicting with X_test in usual ML models, so I was a bit confused.

Since our objective is to predict frequency_holdout, do we have no option other than to fit and predict using the same calibration variables? So, through fit() we learn the model parameters (through probability distributions), and predict() with a time horizon uses the same input variables along with the time horizon value and the learned parameters to come up with the prediction output?

Sorry if my question seems redundant; I am a bit confused, but all your responses are helping me get some clarity. I appreciate your help.
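
To make the comparison step concrete, a short sketch using the expctd_num_of_purch column created above; this assumes the holdout window is also 365 days, so that the prediction horizon and frequency_holdout are directly comparable:

from sklearn.metrics import mean_squared_error

# Compare calibration-based predictions against what actually happened in the holdout window.
actual = summary_cal_holdout["frequency_holdout"]
predicted = summary_cal_holdout["expctd_num_of_purch"]
print("Holdout RMSE over 365 days:", mean_squared_error(actual, predicted) ** 0.5)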

Add Transitive Dependencies to `setup.cfg`

btyd installs and runs without a hitch when creating a new Python environment, but when installing in managed environments like Databricks or Google Colab, previously installed dependencies like scipy and numpy are not automatically updated, which can make btyd unusable.

For now this is easily remedied by updating scipy and numpy to their latest versions, but for version 0.1b2 I intend to add an install_requires section to setup.cfg to fix this.
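
A sketch of what that section could look like in setup.cfg (the exact package list and version pins below are illustrative assumptions, not the project's published requirements):

[options]
install_requires =
    numpy>=1.22
    scipy>=1.8
    pymc>=4.0
    autograd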

Add Capability to Load Models from CSV Files

Models can presently be saved in either JSON or CSV format, but when attempting to load from CSV files I encounter the following error:

xarray.core.variable.MissingDimensionsError: 'chain' has more than 1-dimension and the same name as one of its dimensions ('chain', 'draw'). xarray disallows such variables because they conflict with the coordinates used to label dimensions.

I would appreciate assistance on this from anyone who is familiar with the arviz or xarray libraries, as this issue could probably be fixed in 1 or 2 lines of code.

ggf.customer_lifetime_value produces NaN clv

Hi there, I successfully fitted a modified beta-geo model as well as a gamma-gamma model. However, when I use the customer_lifetime_value function from the GG model, all CLV values are NaN.

DISCOUNT_a = 0.06                # annual discount rate
LIFE = 12                        # lifetime expected for the customers in months

discount_m = (1 + DISCOUNT_a)**(1/12) - 1     # monthly discount rate

clv = ggf.customer_lifetime_value(
        transaction_prediction_model = bgf,  
        frequency = df_rftv["frequency"],  
        recency = df_rftv["recency"],  
        T = df_rftv["T"],  
        monetary_value = df_rftv["monetary_value"],  
        time = LIFE,    
        freq = "D",                          
        discount_rate = discount_m)

btyd on conda

Hi,

I'd like the ability to install btyd through Anaconda and conda-forge. I was wondering if you are planning to, or are open to the idea of, creating a conda recipe for this. I could also take a shot at it if you'd like.

Thanks and Regards!

Calculation of CLV value returns NaN in some cases

I tried to calculate CLV, but unfortunately I got NaN values for 2/3 of my data.

ggf = GammaGammaFitter(penalizer_coef = 1e-06)
ggf.fit(
    frequency = df_rftv["frequency"],
    monetary_value = df_rftv["monetary_value"],
    weights = None,
    verbose = True,
    tol = 1e-06,
    q_constraint = True)

clv = ggf.customer_lifetime_value(
    transaction_prediction_model = bgf,
    frequency = df_rftv["frequency"],
    recency = df_rftv["recency"],
    T = df_rftv["T"],
    monetary_value = df_rftv["monetary_value"],
    time = LIFE,
    freq = "D",
    discount_rate = discount_m)

df_rftv.insert(0, "CLV", clv)

The results that I got:

Screenshot 2023-01-05 at 18:14:08

Python version: 3.8.12
BTYD version: 0.1b3

Unstable probability of being alive distribution in Pareto/NBD model

Hi there,
I am working on estimating the probability of being alive for a pool of customers using the Pareto/NBD model.
I found the following a bit weird.
The plots show the distribution of my customers' probabilities of being alive. The first is a concentrated distribution, while the second one is smoother. Both have been computed using the Pareto/NBD model with the same customers and the same L2 coefficient. However, due to the optimization process (Nelder-Mead), the parameters of the model are different (a couple of figures below). Nonetheless, even with very similar parameters (see the third figure), there is this change in the shape of the distribution.
Is there any explanation for this?

Screenshot 2023-03-03 at 15 24 16

Screenshot 2023-03-03 at 15 16 02

Screenshot 2023-03-03 at 15 34 39

Confidence intervals description

Hi!

First of all, thanks to @ColtAllen for improving and continuing this amazing library!

After creating a model, we can see confidence intervals, but what is the meaning of r, alpha, a, and b?

e.g.
betageo_model.confidence_intervals_

          lower 95% bound    upper 95% bound
r ?              x                  x
alpha ?          x                  x
a ?              x                  x
b ?              x                  x

Thanks!! 👍

Relying on btyd for business decisions

Thanks for this package; it gives us a lot of interesting information about customer churn, expected purchase orders, revenue, etc.

For our business, it is especially useful to know whether a customer will churn in the next few months (3, 6, or n months), so your newly incorporated method helps us do that.

However, I would like to know whether we can rely on this package to make real-time decisions in a business setting. Meaning, can we trust the outcome from this package for our business operations?

I ask this mainly because it is in beta and you are working on some modifications to the CLTV calculations, etc.

But yes, we really appreciate your help with this new package (which is an improvement over the old lifetimes).

BG/NBD Model in PyMC

Hey! I have recently drafted a BG/NBD model in PyMC; see https://juanitorduz.github.io/bg_nbd_pymc/
In particular, I have also added the possibility of including time-invariant regressors (even though the sampler and/or model parametrization need some improvements). Other models from the lifetimes package can be ported to PyMC in a similar way (because the lifetimes package already has all the log-likelihood functions written in numpy).

Start Here

This post will be actively maintained, so keep checking back, and if there are any specific tasks or suggestions you want to work on for the Beta release and beyond, please let me know.

Rodrigorivera inspired me to create Projects for all outstanding tasks, which are sorted by version release:

BETA RELEASE

Deprecate fitters module and fix known lifetimes bugs:

  • Add pareto_nbd_model.py module
  • Add gamma_gamma_model.py module
  • Add mbgd_model.py module
  • Add bg_bb_model.py module
  • Add bg_nbd_covars_model.py module
  • Docstrings for predictive methods.
  • Remove utils.py plotting dependencies from plotting.py
  • Add matrix plotting method to BaseModel
  • Add capability to load models from CSV files.
  • Raise pymc exception whenever model fails to converge
  • Create model exception class and raise whenever the wrong model type is loaded
  • Type hinting for utils.py, plotting.py, and generate_data.py modules
  • Delete redundant load_cdnow_summary() function

RELEASE CANDIDATE

Polish up documentation and fix any bugs identified while working on beta release(s):

  • Add arviz plotting suggestions and code to docs for model inference
  • Add an explanatory chart to documentation summarizing model characteristics
  • Create and add logo for library
  • CI/CD for auto-updates to project documentation
  • Consider hosting docs on GitHub and emulating documentation style of ArViZ and bambi

FUTURE VERSION RELEASES

Implement additional BTYD Models, enhance plotting functions, documentation and library performance, etc.

  • Add individual customer variant of BetaGeoModel
  • Add individual customer variant of ParetoNBDModel
  • Add individual customer variant of ModBetaGeoModel
  • Add time-invariant covariates to ParetoNBDModel
  • Add Pareto/GGG Model
  • Add PDO Model
  • Add BG/NBD-k and MBG/NBD-k Models
  • Add time-varying covariates to ParetoNBDModel? (Note this model could take days to train and necessitate checkpointing functionality to be added)
  • Suggestions for any other models to add?
  • Add tutorial notebook(s)
  • Add sample_kwargs to BaseModel.fit
  • Fix BetaGeo conditional probability calculation
  • Review viability of modifying hyperprior RVs
  • Enhance existing plotting functions with seaborn
  • RFM score function for quartiles
  • jax sampling via pymc to speed up model fitting times
  • gpu sampling via pymc to speed up model fitting times
  • Comprehensive evaluation of model priors to determine best defaults
  • Refactor utils.py
  • Add code for mlflow model registration to documentation
  • Refactor `BetaGeoModel._conditional_probability_alive` unit test to remove extraneous args.
  • Refactor `BetaGeoModel._conditional_expected_number_of_purchases_up_to_time` unit test to remove extraneous args.


Documentation issue

I see that the GGM model is mentioned in the documentation as shown below, but what is actually available (it auto-populates on the Tab key) is GammaGammaFitter, not GammaGammaModel as shown in the doc. I guess the documentation should be updated; otherwise, the import statement doesn't work.

image

ImportError: cannot import name 'minimize' from 'scipy.optimize' (unknown location)

I am using Python 3.8.8 on Windows 10 with btyd 0.1b3.

While the btyd package is installed successfully, I get the error below when I try to import it in my Jupyter notebook

ImportError                               Traceback (most recent call last)
<ipython-input-1-8ac04f5820c7> in <module>
     10 from math import sqrt
     11 import pycaret
---> 12 import btyd
     13 from btyd import *

~\AppData\Roaming\Python\Python38\site-packages\btyd\__init__.py in <module>
      1 import warnings
      2 
----> 3 from .fitters import BaseFitter
      4 from .fitters.beta_geo_fitter import BetaGeoFitter
      5 from .fitters.beta_geo_beta_binom_fitter import BetaGeoBetaBinomFitter

~\AppData\Roaming\Python\Python38\site-packages\btyd\fitters\__init__.py in <module>
      8 import pandas as pd
      9 from textwrap import dedent
---> 10 from scipy.optimize import minimize
     11 from autograd import value_and_grad, hessian
     12 from ..utils import _save_obj_without_attr, ConvergenceError

ImportError: cannot import name 'minimize' from 'scipy.optimize' (unknown location)

How to import btyd package successfully without any error?

My pip show command returns the output below

pip show scipy

Name: scipy
Version: 1.8.1
Summary: SciPy: Scientific Library for Python
Home-page: https://www.scipy.org
Author: None
Author-email: None
License: BSD
Location: c:\users\test\anaconda3\lib\site-packages
Requires: numpy
Required-by: ziln-cltv, yellowbrick, xgboost, xarray-einstats, umap-learn, scikit-optimize, scikit-learn, salib, pynndescent, pymc, pyLDAvis, pycaret, phik, pandas-profiling, mlxtend, mlflow, missingno, kmodes, imbalanced-learn, ImageHash, hyperopt, gensim, feature-engine, catboost, Boruta, arviz, aesara, aeppl, statsmodels, seaborn, scikit-plot, scikit-image, pyod, optuna, lightgbm, category-encoders

My pip show output for btyd looks like the below

Name: btyd
Version: 0.1b3
Summary: Buy Till You Die and Customer Lifetime Value statistical models in Python.
Home-page: None
Author: Colt Allen
Author-email: None
License: Apache 2.0
Location: c:\users\users\anaconda3\lib\site-packages
Requires: autograd, pymc, numpy
Required-by:

When I downgrade scipy to 1.6.1, I get the warning and error below in my Jupyter notebook

WARNING (aesara.configdefaults): g++ not available, if using conda: `conda install m2w64-toolchain`
WARNING (aesara.configdefaults): g++ not detected!  Aesara will be unable to compile C-implementations and will default to Python. Performance may be severely degraded. To remove this warning, set Aesara flags cxx to an empty string.
---------------------------------------------------------------------------
NoSectionError                            Traceback (most recent call last)
~\AppData\Roaming\Python\Python38\site-packages\aesara\configparser.py in fetch_val_for_key(self, key, delete_key)
    236             try:
--> 237                 return self._aesara_cfg.get(section, option)
    238             except InterpolationError:

~\Anaconda3\lib\configparser.py in get(self, section, option, raw, vars, fallback)
    780         try:
--> 781             d = self._unify_values(section, vars)
    782         except NoSectionError:

~\Anaconda3\lib\configparser.py in _unify_values(self, section, vars)
   1148             if section != self.default_section:
-> 1149                 raise NoSectionError(section) from None
   1150         # Update with the entry specific variables

NoSectionError: No section: 'blas'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
~\AppData\Roaming\Python\Python38\site-packages\aesara\configparser.py in __get__(self, cls, type_, delete_key)
    352             try:
--> 353                 val_str = cls.fetch_val_for_key(self.name, delete_key=delete_key)
    354                 self.is_default = False

~\AppData\Roaming\Python\Python38\site-packages\aesara\configparser.py in fetch_val_for_key(self, key, delete_key)
    240         except (NoOptionError, NoSectionError):
--> 241             raise KeyError(key)
    242 

KeyError: 'blas__ldflags'

During handling of the above exception, another exception occurred:

FileNotFoundError                         Traceback (most recent call last)
<ipython-input-1-8ac04f5820c7> in <module>
     10 from math import sqrt
     11 import pycaret
---> 12 import btyd
     13 from btyd import *

~\Anaconda3\lib\site-packages\btyd\__init__.py in <module>
      8 from .fitters.gamma_gamma_fitter import GammaGammaFitter
      9 from .fitters.beta_geo_covar_fitter import BetaGeoCovarsFitter
---> 10 from .models.beta_geo_model import BetaGeoModel
     11 from .models.mod_beta_geo_model import ModBetaGeoModel
     12 from .models.gamma_gamma_model import GammaGammaModel

~\Anaconda3\lib\site-packages\btyd\models\__init__.py in <module>
     10 import pandas as pd
     11 
---> 12 import pymc as pm
     13 import aesara.tensor as at
     14 import arviz as az

~\AppData\Roaming\Python\Python38\site-packages\pymc\__init__.py in <module>
     45 
     46 
---> 47 __set_compiler_flags()
     48 
     49 from pymc import _version, gp, ode, sampling

~\AppData\Roaming\Python\Python38\site-packages\pymc\__init__.py in __set_compiler_flags()
     28 def __set_compiler_flags():
     29     # Workarounds for Aesara compiler problems on various platforms
---> 30     import aesara
     31 
     32     current = aesara.config.gcc__cxxflags

~\AppData\Roaming\Python\Python38\site-packages\aesara\__init__.py in <module>
    118 
    119 # isort: off
--> 120 from aesara import scalar, tensor
    121 from aesara.compile import (
    122     In,

~\AppData\Roaming\Python\Python38\site-packages\aesara\tensor\__init__.py in <module>
    103 # adds shared-variable constructors
    104 from aesara.tensor import sharedvar  # noqa
--> 105 from aesara.tensor import (  # noqa
    106     blas,
    107     blas_c,

~\AppData\Roaming\Python\Python38\site-packages\aesara\tensor\blas.py in <module>
    160 from aesara.scalar import bool as bool_t
    161 from aesara.tensor import basic as at
--> 162 from aesara.tensor.blas_headers import blas_header_text, blas_header_version
    163 from aesara.tensor.elemwise import DimShuffle, Elemwise
    164 from aesara.tensor.exceptions import NotScalarConstantError

~\AppData\Roaming\Python\Python38\site-packages\aesara\tensor\blas_headers.py in <module>
   1013 
   1014 
-> 1015 if not config.blas__ldflags:
   1016     _logger.warning("Using NumPy C-API based implementation for BLAS functions.")
   1017 

~\AppData\Roaming\Python\Python38\site-packages\aesara\configparser.py in __get__(self, cls, type_, delete_key)
    355             except KeyError:
    356                 if callable(self.default):
--> 357                     val_str = self.default()
    358                 else:
    359                     val_str = self.default

~\AppData\Roaming\Python\Python38\site-packages\aesara\link\c\cmodule.py in default_blas_ldflags()
   2861         if any("mkl" in fl for fl in ret):
   2862             ret.extend(["-lm", "-lm"])
-> 2863         res = try_blas_flag(ret)
   2864         if res:
   2865             if "mkl" in res:

~\AppData\Roaming\Python\Python38\site-packages\aesara\link\c\cmodule.py in try_blas_flag(flags)
   1993     cflags.extend([f"-L{path_wrapper}{d}{path_wrapper}" for d in std_lib_dirs()])
   1994 
-> 1995     res = GCC_compiler.try_compile_tmp(
   1996         test_code, tmp_prefix="try_blas_", flags=cflags, try_run=True
   1997     )

~\AppData\Roaming\Python\Python38\site-packages\aesara\link\c\cmodule.py in try_compile_tmp(cls, src_code, tmp_prefix, flags, try_run, output, comp_args)
   2394             src_code,
   2395             tmp_prefix,
-> 2396             cls.patch_ldflags(flags),
   2397             try_run,
   2398             output,

~\AppData\Roaming\Python\Python38\site-packages\aesara\link\c\cmodule.py in patch_ldflags(flag_list)
   2432         if not libs:
   2433             return flag_list
-> 2434         libs = GCC_compiler.linking_patch(lib_dirs, libs)
   2435         for flag_idx, lib in zip(flag_idxs, libs):
   2436             flag_list[flag_idx] = lib

~\AppData\Roaming\Python\Python38\site-packages\aesara\link\c\cmodule.py in linking_patch(lib_dirs, libs)
   2453                 windows_styled_libs = [
   2454                     fname
-> 2455                     for fname in os.listdir(lib_dir)
   2456                     if not (os.path.isdir(os.path.join(lib_dir, fname)))
   2457                     and fname.split(".")[0] == lib

FileNotFoundError: [WinError 3] The system cannot find the path specified: 'D:\\a\\1\\s\\numpy\\build\\openblas_info'

CLV value is NA

I am trying to compute CLV for the customers in my dataset.

However, I see that for a lot of the customers the CLV value is NaN, as shown below.

Though some of them have zero as the value for "frequency_holdout", shouldn't the CLV be zero as well? Why is it NA? Or is there a mistake in my code?

image

My code looks like below

ggf = GammaGammaFitter(penalizer_coef=0) # model object
ggf.fit(summary_df['frequency_cal'], summary_df['monetary_value_cal']) # model fitting
# Prediction of expected amount of average profit
summary_df["expct_avg_spend"] = ggf.conditional_expected_average_profit(summary_df['frequency_cal'], summary_df['monetary_value_cal'])

summary_df["cltv_5m"] = ggf.customer_lifetime_value(bgf,
                                   summary_df['frequency_cal'],
                                   summary_df['recency_cal'],
                                   summary_df['T_cal'],
                                   summary_df['monetary_value_cal'],
                                   time=5,  # 5 month
                                   freq="D",  # frequency of T
                                   discount_rate=0.01)

summary_df.sort_values("cltv_5m",ascending=False).tail(20)
