
lifetimes's Introduction

Measuring users is hard. Lifetimes makes it easy.


Read me first: Latest on the lifetimes project

👋 This codebase has moved to "maintenance mode". I won't be adding new features or improvements, and I won't be answering issues here (though perhaps the occasional bug fix). Why? I don't use lifetimes anymore, nor do I keep up with the literature around RFM.

A successor project to lifetimes has emerged: PyMC-Lab/PyMC-Marketing. Please check it out!

Introduction

Lifetimes can be used to analyze your users based on a few assumptions:

  1. Users interact with you when they are "alive".
  2. Users under study may "die" after some period of time.

I've quoted "alive" and "die" as these are the most abstract terms: feel free to use your own definition of "alive" and "die" (they are used similarly to "birth" and "death" in survival analysis). Whenever we have individuals repeating occurrences, we can use Lifetimes to help understand user behaviour.

Applications

If this is too abstract, consider these applications:

  • Predicting how often a visitor will return to your website. (Alive = visiting. Die = decided the website wasn't for them)
  • Understanding how frequently a patient may return to a hospital. (Alive = visiting. Die = maybe the patient moved to a new city, or passed away.)
  • Predicting individuals who have churned from an app using only their usage history. (Alive = logins. Die = removed the app)
  • Predicting repeat purchases from a customer. (Alive = actively purchasing. Die = lost interest in your product)
  • Predicting the lifetime value of your customers

Specific Application: Customer Lifetime Value

As emphasized by P. Fader and B. Hardie, understanding and acting on customer lifetime value (CLV) is the most important part of your business's sales efforts. And (apparently) everyone is doing it wrong (Prof. Fader's Video Lecture). Lifetimes is a Python library to calculate CLV for you.

Installation

pip install lifetimes
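
A minimal sketch of typical usage, mirroring the quickstart (the CDNOW sample summary ships with the library; penalizer_coef=0.0 is just the default choice here):

from lifetimes import BetaGeoFitter
from lifetimes.datasets import load_cdnow_summary

# one row per customer: frequency (repeat purchases), recency, T (customer age)
data = load_cdnow_summary()

bgf = BetaGeoFitter(penalizer_coef=0.0)
bgf.fit(data['frequency'], data['recency'], data['T'])
print(bgf)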

Contributing

Please refer to the Contributing Guide before creating any Pull Requests. It will make life easier for everyone.

Documentation and tutorials

Official documentation

Questions? Comments? Requests?

Please create an issue in the lifetimes repository.

Main Articles

  1. Probably the seminal article on non-contractual CLV is Counting Your Customers: Who Are They and What Will They Do Next?, by David C. Schmittlein, Donald G. Morrison and Richard Colombo. Despite being paywalled, it is worth the read. The relevant information will eventually end up in this library's documentation anyway.
  2. The other (more recent) paper is “Counting Your Customers” the Easy Way: An Alternative to the Pareto/NBD Model, by Peter Fader, Bruce Hardie and Ka Lok Lee.

More Information

  1. Roberto Medri did a nice presentation on CLV at Etsy.
  2. Papers, lots of papers.
  3. R implementation is called BTYD (Buy 'Til You Die).
  4. Bruce Hardie's Website, especially his notes, is full of useful and essential explanations, many of which are featured in this library.


lifetimes's Issues

Code in tutorial not working

Thanks Cam for this great package, it is very useful.

I am currently going through the tutorial and I noticed something strange. When executing:

from lifetimes.plotting import plot_period_transactions

plot_period_transactions(bgf)

I get a chart which is significantly different from the one you show on your page (it looks like a worse fit to me):

[attached screenshot: repeat_transactions_weird]

Do you know what the reason could be?

Also, one other minor problem: the pip install gives you a different version of your package, in which the file cdnow_customers_transactions.csv is not present.

Aggregating transactional data neglects first-day purchases

Hello,
I want to point out an issue regarding the data aggregator:

from lifetimes.utils import summary_data_from_transaction_data

If you aggregate data with the days ('D') frequency (which is the default), users who do purchase on the first day, just after the install, end up with zero frequency in the aggregated dataset.
So we are neglecting a possibly large fraction of purchasing users, treating them as non-purchasing.

If you want an example, consider this transactional data:

1,2016-01-01 00:01:44,0.0
2,2016-01-01 00:01:50,0.0
1,2016-01-01 00:01:55,5.0
1,2016-01-01 00:01:58,5.0
2,2016-01-05 00:01:50,1.0

Run the aggregator on them:
from lifetimes.utils import summary_data_from_transaction_data

data = summary_data_from_transaction_data(transaction_data, 'id', 'date',
                                          monetary_value_col='monetary_value',
                                          observation_period_end='2016-04-08', freq='D')
print(data.head())
You'll get:

  frequency  recency     T  monetary_value
id                                          
1         0.0      0.0  98.0             0.0
2         1.0      4.0  98.0             1.0

Which is plain wrong in my opinion.
The purchases that user 1 makes on the first day are completely neglected.
How to fix this?

Cheers
Marco
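
For what it's worth, later releases of the utility expose an include_first_transaction flag (an assumption about the installed version) that at least stops first-day buyers from showing up as zero-frequency. Note that the standard fitters expect repeat frequency, so summaries built this way are mainly useful for reporting:

# count the first purchase as well, so first-day buyers are no longer frequency 0
data = summary_data_from_transaction_data(
    transaction_data, 'id', 'date',
    monetary_value_col='monetary_value',
    observation_period_end='2016-04-08', freq='D',
    include_first_transaction=True)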

Computing P(alive) Using the BG/NBD Model

I found that for all one-time buyers, P(alive) == 1.0, which is rather strange.

And later I found in the paper http://www.brucehardie.com/notes/021/palive_for_BGNBD.pdf that this is a shortcoming of the mathematical formula being used, since the model assumes that death occurs only after a purchase and that customers are alive at the beginning of the observation period. The paper also introduces a variant of the basic BG/NBD model which also estimates the number of customers who are dead at the beginning of the observation period. It would be nice if that variant were included in the library. I'm not yet able to figure out how to find the parameters of the likelihood function; I may spend some time on it, but I'm not sure I can manage since I don't have experience in maximizing likelihood functions.
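
A minimal numeric check of the formula from that note (the parameter values are made up for illustration): for a one-time buyer, frequency counts repeat purchases and is therefore 0, so the death term vanishes entirely.

# BG/NBD P(alive) with hypothetical fitted parameters
r, alpha, a, b = 0.24, 4.41, 0.79, 2.43
frequency, recency, T = 0, 0.0, 38.86  # a one-time buyer: zero repeat purchases

p_alive = 1. / (1 + (frequency > 0) * (a / (b + frequency - 1))
                * ((alpha + T) / (alpha + recency)) ** (r + frequency))
print(p_alive)  # 1.0 -- the (frequency > 0) factor zeroes out the death component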

PyData talk

This PyData talk might be of interest:

Implementing and Training Predictive Customer Lifetime Value Models in Python by Jean-Rene Gauthier, Ben Van Dyke. https://www.youtube.com/watch?v=gx6oHqpRgpY&list=PLGVZCDnMOq0rxoq9Nx0B4tqtr891vaCn7&index=45

and the accompanying notebook: https://github.com/datascienceinc/pydata-seattle-2017/blob/master/lifetime-value/pareto-nbd.ipynb

summary_data_from_transaction_data implicitly requires Transaction Data to be sorted by date

The utility summary_data_from_transaction_data implicitly requires the input DataFrame to be sorted by the Datetime column in ascending order. Either this requirement should be explicitly stated in the documentation, or the utility itself should conduct the sort prior to summarization. Many transactional datasets (from Shopify's API, for example) come sorted in descending order of timestamps by default.

Minimal repro below. Notice the difference in monetary_value field for customer with ID 2.

import pandas as pd
from lifetimes.utils import summary_data_from_transaction_data

# Replicating output from Readme.md under "Example using transactional datasets"
# https://github.com/CamDavidsonPilon/lifetimes#example-using-transactional-datasets
cust = pd.Series([0,1,2,2,2])
dates1 = pd.to_datetime(pd.Series(['2014-03-08 00:00:00',
                  '2014-05-21 00:00:00',
                  '2014-03-14 00:00:00',
                  '2014-04-09 00:00:00',
                  '2014-05-21 00:00:00']))
sales1 = pd.Series([10,20,10,20,25])
transaction_data1 = pd.DataFrame({'date': dates1, 'id': cust, 'sales': sales1})
summary1 = summary_data_from_transaction_data(transaction_data1, 'id', 'date', 'sales', observation_period_end='2014-12-31')
print(summary1)
    frequency  recency      T  monetary_value
id                                           
0         0.0      0.0  298.0             0.0
1         0.0      0.0  224.0             0.0
2         2.0     68.0  292.0            22.5
# Now we just change the order of the 3rd and 5th transactions
dates2 = pd.to_datetime(pd.Series(['2014-03-08 00:00:00',
                  '2014-05-21 00:00:00',
                  '2014-05-21 00:00:00',
                  '2014-04-09 00:00:00',
                  '2014-03-14 00:00:00']))
sales2 = pd.Series([10,20,25,20,10])
transaction_data2 = pd.DataFrame({'date': dates2, 'id': cust, 'sales': sales2})
summary2 = summary_data_from_transaction_data(transaction_data2, 'id', 'date', 'sales', observation_period_end='2014-12-31')
print(summary2)
    frequency  recency      T  monetary_value
id                                           
0         0.0      0.0  298.0             0.0
1         0.0      0.0  224.0             0.0
2         2.0     68.0  292.0            15.0
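
Until the utility sorts internally, a simple workaround is to sort the frame yourself before summarizing (a sketch, continuing the repro above):

# sorting ascending by date restores monetary_value == 22.5 for customer 2
transaction_data2 = transaction_data2.sort_values('date')
summary2 = summary_data_from_transaction_data(transaction_data2, 'id', 'date', 'sales',
                                              observation_period_end='2014-12-31')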

conditional_probability_alive in ModifiedBetaGeoFitter and BetaGeoFitter

It's strange, but there are two different implementations of conditional_probability_alive for ModifiedBetaGeoFitter and BetaGeoFitter, both citing the same reference, Computing P(alive) Using the BG/NBD Model:

class BetaGeoFitter(BaseFitter):

    def conditional_probability_alive(self, frequency, recency, T):
        """
        Compute the probability that a customer with history (frequency, recency, T) is currently
        alive. From http://www.brucehardie.com/notes/021/palive_for_BGNBD.pdf

        Parameters:
            frequency: a scalar: historical frequency of customer.
            recency: a scalar: historical recency of customer.
            T: a scalar: age of the customer.

        Returns: a scalar

        """
        r, alpha, a, b = self._unload_params('r', 'alpha', 'a', 'b')
        return 1. / (1 + (frequency > 0) * (a / (b + frequency - 1)) * ((alpha + T) / (alpha + recency)) ** (r + frequency))

class ModifiedBetaGeoFitter(BetaGeoFitter):

    def conditional_probability_alive(self, frequency, recency, T):
        """
        Compute the probability that a customer with history (frequency, recency, T) is currently
        alive. From http://www.brucehardie.com/notes/021/palive_for_BGNBD.pdf

        Parameters:
            frequency: a scalar: historical frequency of customer.
            recency: a scalar: historical recency of customer.
            T: a scalar: age of the customer.

        Returns: a scalar

        """
        r, alpha, a, b = self._unload_params('r', 'alpha', 'a', 'b')
        return 1. / (1 + (a / (b + frequency)) * ((alpha + T) / (alpha + recency)) ** (r + frequency))


Save models

Hi, is there a quick way to save the models that take a long time to fit? The idea is to just save them and then use them...
Thanks!
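
Plain pickling of the fitted object should work (a sketch, assuming the fitter pickles cleanly; later releases also added save_model()/load_model() helpers on the fitters):

import pickle
from lifetimes import BetaGeoFitter
from lifetimes.datasets import load_cdnow_summary

summary = load_cdnow_summary()
bgf = BetaGeoFitter()
bgf.fit(summary['frequency'], summary['recency'], summary['T'])

with open('bgf.pkl', 'wb') as f:
    pickle.dump(bgf, f)          # persist the fitted model

with open('bgf.pkl', 'rb') as f:
    bgf = pickle.load(f)         # restore it later without refitting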

customer_lifetime_value

Hi @CamDavidsonPilon, thanks so much for this library.

I have a couple of questions about ggf.customer_lifetime_value()

  1. When I used it, it was outputting higher CLV than I expected. After looking at the function, it seems to me that each expected_revenues_period_x contains cumulative transactions predicted up to and including the month rather than just transactions predicted in that month.

(lifetimes/lifetimes/estimation.py:106):

lambda r: (m*transaction_prediction_model.predict(i, r['frequency'], r['recency'], r['T'])/(1+d)**(i/30)),

For my calculation, I changed it to

lambda r: (m*(transaction_prediction_model.predict(i, r['frequency'], r['recency'], r['T'])-transaction_prediction_model.predict(i-30, r['frequency'], r['recency'], r['T']))/(1+d)**(i/30)),
  2. I noticed the default discount rate is 1 and the rate in the tutorial is 0.7. These strike me as high and make me wonder if I'm misunderstanding something. In most literature, I'm seeing annual discount rates between .05 and .15. I understand that the discount rate will vary depending on the situation, but I'm wondering why the default and the example use what seem like high rates to me.

Bug when there are multiple orders in same day

The current functions to obtain summary data from transaction data do not work properly when there is more than one order per day. For example:

import pandas as pd
from lifetimes.utils import summary_data_from_transaction_data

test = [{'id': 1, 'date':'2012-01-06'},
        {'id': 1, 'date':'2012-01-06'}]
test = pd.DataFrame(test)
test['date'] = pd.to_datetime(test['date'])
summary = summary_data_from_transaction_data(test, 'id', 'date', observation_period_end='2012-12-31')
print(summary)


yields

frequency=1, recency=0, T=360, where frequency > recency. This is probably due to commit c05f88b.

To my knowledge, cases where frequency > recency will break the fitting procedure. I think that in the literature, if there is more than one order during the same time unit, the second order is considered a 'top up' and is merged with the first, i.e. the frequency is (number of unique dates) - 1.
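
A sketch of that 'top up' merging as a pre-processing step (assuming one row per order, as in the example above):

# collapse same-day orders into one, so frequency = (number of unique dates) - 1
test = test.drop_duplicates(subset=['id', 'date'])
summary = summary_data_from_transaction_data(test, 'id', 'date',
                                             observation_period_end='2012-12-31')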

Monetary value estimation

Hi @CamDavidsonPilon, I am working on implementing a few monetary value estimations as described in http://www.essec.edu/faculty/showDeclFileRes.do?declId=8555&key=Publication-Content (the Gamma/Gamma submodel and Pareto/Independent). To fit your framework, these models require a few relevant changes (e.g. adding monetary values to transactions/summaries, changing a few fitting methods, etc.). So before doing too much work on the current structure of the library, I wanted to know your take on the idea of adding monetary models to lifetimes.

error conditional_probability_alive_matrix

Hi @CamDavidsonPilon,

thank you for this good work! I just wanted to let you know that there is a small mistake in the following method: ParetoNBDFitter.conditional_probability_alive_matrix().

It seems you switched params (frequency/recency), which results in a transposed matrix representation:

Z[i, j] = self.conditional_probability_alive(recency, frequency, max_recency)

should be :

Z[i, j] = self.conditional_probability_alive(frequency, recency, max_recency)

right?

What is the time period for conditional_expected_number_of_purchases_up_to_time method?

From the source

   """
   Calculate the expected number of repeat purchases up to time t for a randomly choose individual from
    the population, given they have purchase history (frequency, recency, T)

    Parameters:
        t: a scalar or array of times.
        frequency: a scalar: historical frequency of customer.
        recency: a scalar: historical recency of customer.
        T: a scalar: age of the customer.

    Returns: a scalar or array
    """

The docs say t is just one period of time. What is that period?
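
The unit of t is whatever freq the summary was built with: days for freq='D', weeks for freq='W', and so on. A sketch using the bundled CDNOW summary (whose time unit is weeks):

from lifetimes import BetaGeoFitter
from lifetimes.datasets import load_cdnow_summary

summary = load_cdnow_summary()
bgf = BetaGeoFitter()
bgf.fit(summary['frequency'], summary['recency'], summary['T'])

# t = 10 here means "the next 10 weeks", because the summary's unit is weeks
pred = bgf.conditional_expected_number_of_purchases_up_to_time(
    10, summary['frequency'], summary['recency'], summary['T'])
print(pred[:5])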

import error for name _zeta when running quickstart example

Using anaconda Python 3.4, and after having installed with pip, running the first quickstart example,

from lifetimes.datasets import load_cdnow

results in:


ImportError Traceback (most recent call last)
in ()
----> 1 from lifetimes.datasets import load_cdnow

/home/lynd/anaconda3/lib/python3.4/site-packages/lifetimes/__init__.py in ()
----> 1 from .estimation import BetaGeoFitter, ParetoNBDFitter, GammaGammaFitter, ModifiedBetaGeoFitter
2 from .version import __version__
3
4 __all__ = ['BetaGeoFitter', 'ParetoNBDFitter', 'GammaGammaFitter', 'ModifiedBetaGeoFitter']

/home/lynd/anaconda3/lib/python3.4/site-packages/lifetimes/estimation.py in ()
7 from pandas import DataFrame
8
----> 9 from scipy import special
10 from scipy import misc
11

/home/lynd/anaconda3/lib/python3.4/site-packages/scipy/special/__init__.py in ()
636 from ._ufuncs import *
637
--> 638 from .basic import *
639 from . import specfun
640 from . import orthogonal

/home/lynd/anaconda3/lib/python3.4/site-packages/scipy/special/basic.py in ()
13 where, mgrid, sin, place, issubdtype, extract,
14 less, inexact, nan, zeros, atleast_1d, sinc)
---> 15 from ._ufuncs import (ellipkm1, mathieu_a, mathieu_b, iv, jv, gamma,
16 psi, _zeta, hankel1, hankel2, yv, kv, _gammaln,
17 ndtri, errprint, poch, binom, hyp0f1)

ImportError: cannot import name '_zeta'

Installation error

Output from pip install attempt:

$ pip install lifetimes
You are using pip version 7.1.0, however version 8.1.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Collecting lifetimes
  Using cached Lifetimes-0.2.2.1.tar.gz
Requirement already satisfied (use --upgrade to upgrade): numpy in /usr/lib64/python2.6/site-packages (from lifetimes)
Collecting scipy (from lifetimes)
  Using cached scipy-0.18.1.tar.gz
Collecting pandas>=0.19 (from lifetimes)
  Using cached pandas-0.19.0.tar.gz
Collecting python-dateutil (from pandas>=0.19->lifetimes)
  Using cached python_dateutil-2.5.3-py2.py3-none-any.whl
Collecting pytz>=2011k (from pandas>=0.19->lifetimes)
  Using cached pytz-2016.7-py2.py3-none-any.whl
Collecting six>=1.5 (from python-dateutil->pandas>=0.19->lifetimes)
  Using cached six-1.10.0-py2.py3-none-any.whl
Building wheels for collected packages: lifetimes, scipy, pandas
  Running setup.py bdist_wheel for lifetimes
  Complete output from command /usr/bin/python -c "import setuptools;__file__='/tmp/pip-build-DOw4uO/lifetimes/setup.py';exec(compile(open(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" bdist_wheel -d /tmp/tmpOmPS2xpip-wheel-:
  usage: -c [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
     or: -c --help [cmd1 cmd2 ...]
     or: -c --help-commands
     or: -c cmd --help

  error: invalid command 'bdist_wheel'

  ----------------------------------------
  Failed building wheel for lifetimes
  Running setup.py bdist_wheel for scipy
  Complete output from command /usr/bin/python -c "import setuptools;__file__='/tmp/pip-build-DOw4uO/scipy/setup.py';exec(compile(open(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" bdist_wheel -d /tmp/tmp07PPZVpip-wheel-:
  usage: -c [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
     or: -c --help [cmd1 cmd2 ...]
     or: -c --help-commands
     or: -c cmd --help

  error: invalid command 'bdist_wheel'

  ----------------------------------------
  Failed building wheel for scipy
  Running setup.py bdist_wheel for pandas
  Stored in directory: /home/jwhitmore/.cache/pip/wheels/95/de/b6/d0219d3007532bde54b775a18943296a7a31f20980901c37ee
Successfully built pandas
Failed to build lifetimes scipy
Installing collected packages: scipy, six, python-dateutil, pytz, pandas, lifetimes
  Running setup.py install for scipy
    Complete output from command /usr/bin/python -c "import setuptools, tokenize;__file__='/tmp/pip-build-DOw4uO/scipy/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-XZvmqa-record/install-record.txt --single-version-externally-managed --compile:

    Note: if you need reliable uninstall behavior, then install
    with pip instead of using `setup.py install`:

      - `pip install .`       (from a git repo or downloaded source
                               release)
      - `pip install scipy`   (last SciPy release on PyPI)


    lapack_opt_info:
    openblas_lapack_info:
      libraries openblas not found in ['/usr/local/lib64', '/usr/local/lib', '/usr/lib64', '/usr/lib']
      NOT AVAILABLE

    lapack_mkl_info:
    mkl_info:
      libraries mkl,vml,guide not found in ['/usr/local/lib64', '/usr/local/lib', '/usr/lib64', '/usr/lib']
      NOT AVAILABLE

      NOT AVAILABLE

    atlas_3_10_threads_info:
    Setting PTATLAS=ATLAS
      libraries tatlas,tatlas not found in /usr/local/lib64
      libraries lapack_atlas not found in /usr/local/lib64
      libraries tatlas,tatlas not found in /usr/local/lib
      libraries lapack_atlas not found in /usr/local/lib
      libraries tatlas,tatlas not found in /usr/lib64/atlas
      libraries lapack_atlas not found in /usr/lib64/atlas
      libraries tatlas,tatlas not found in /usr/lib64/sse2
      libraries lapack_atlas not found in /usr/lib64/sse2
      libraries tatlas,tatlas not found in /usr/lib64
      libraries lapack_atlas not found in /usr/lib64
      libraries tatlas,tatlas not found in /usr/lib
      libraries lapack_atlas not found in /usr/lib
    <class 'numpy.distutils.system_info.atlas_3_10_threads_info'>
      NOT AVAILABLE

    atlas_3_10_info:
      libraries satlas,satlas not found in /usr/local/lib64
      libraries lapack_atlas not found in /usr/local/lib64
      libraries satlas,satlas not found in /usr/local/lib
      libraries lapack_atlas not found in /usr/local/lib
      libraries satlas,satlas not found in /usr/lib64/atlas
      libraries lapack_atlas not found in /usr/lib64/atlas
      libraries satlas,satlas not found in /usr/lib64/sse2
      libraries lapack_atlas not found in /usr/lib64/sse2
      libraries satlas,satlas not found in /usr/lib64
      libraries lapack_atlas not found in /usr/lib64
      libraries satlas,satlas not found in /usr/lib
      libraries lapack_atlas not found in /usr/lib
    <class 'numpy.distutils.system_info.atlas_3_10_info'>
      NOT AVAILABLE

    atlas_threads_info:
    Setting PTATLAS=ATLAS
      libraries ptf77blas,ptcblas,atlas not found in /usr/local/lib64
      libraries lapack_atlas not found in /usr/local/lib64
      libraries ptf77blas,ptcblas,atlas not found in /usr/local/lib
      libraries lapack_atlas not found in /usr/local/lib
      libraries ptf77blas,ptcblas,atlas not found in /usr/lib64/atlas
      libraries lapack_atlas not found in /usr/lib64/atlas
      libraries ptf77blas,ptcblas,atlas not found in /usr/lib64/sse2
      libraries lapack_atlas not found in /usr/lib64/sse2
      libraries ptf77blas,ptcblas,atlas not found in /usr/lib64
      libraries lapack_atlas not found in /usr/lib64
      libraries ptf77blas,ptcblas,atlas not found in /usr/lib
      libraries lapack_atlas not found in /usr/lib
    <class 'numpy.distutils.system_info.atlas_threads_info'>
      NOT AVAILABLE

    atlas_info:
      libraries f77blas,cblas,atlas not found in /usr/local/lib64
      libraries lapack_atlas not found in /usr/local/lib64
      libraries f77blas,cblas,atlas not found in /usr/local/lib
      libraries lapack_atlas not found in /usr/local/lib
      libraries f77blas,cblas,atlas not found in /usr/lib64/atlas
      libraries lapack_atlas not found in /usr/lib64/atlas
      libraries f77blas,cblas,atlas not found in /usr/lib64/sse2
      libraries lapack_atlas not found in /usr/lib64/sse2
      libraries f77blas,cblas,atlas not found in /usr/lib64
      libraries lapack_atlas not found in /usr/lib64
      libraries f77blas,cblas,atlas not found in /usr/lib
      libraries lapack_atlas not found in /usr/lib
    <class 'numpy.distutils.system_info.atlas_info'>
      NOT AVAILABLE

    /usr/lib64/python2.6/site-packages/numpy/distutils/system_info.py:1552: UserWarning:
        Atlas (http://math-atlas.sourceforge.net/) libraries not found.
        Directories to search for the libraries can be specified in the
        numpy/distutils/site.cfg file (section [atlas]) or by setting
        the ATLAS environment variable.
      warnings.warn(AtlasNotFoundError.__doc__)
    lapack_info:
      libraries lapack not found in ['/usr/local/lib64', '/usr/local/lib', '/usr/lib64', '/usr/lib']
      NOT AVAILABLE

    /usr/lib64/python2.6/site-packages/numpy/distutils/system_info.py:1563: UserWarning:
        Lapack (http://www.netlib.org/lapack/) libraries not found.
        Directories to search for the libraries can be specified in the
        numpy/distutils/site.cfg file (section [lapack]) or by setting
        the LAPACK environment variable.
      warnings.warn(LapackNotFoundError.__doc__)
    lapack_src_info:
      NOT AVAILABLE

    /usr/lib64/python2.6/site-packages/numpy/distutils/system_info.py:1566: UserWarning:
        Lapack (http://www.netlib.org/lapack/) sources not found.
        Directories to search for the sources can be specified in the
        numpy/distutils/site.cfg file (section [lapack_src]) or by setting
        the LAPACK_SRC environment variable.
      warnings.warn(LapackSrcNotFoundError.__doc__)
      NOT AVAILABLE

    Running from scipy source directory.
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-build-DOw4uO/scipy/setup.py", line 415, in <module>
        setup_package()
      File "/tmp/pip-build-DOw4uO/scipy/setup.py", line 411, in setup_package
        setup(**metadata)
      File "/usr/lib64/python2.6/site-packages/numpy/distutils/core.py", line 135, in setup
        config = configuration()
      File "/tmp/pip-build-DOw4uO/scipy/setup.py", line 335, in configuration
        config.add_subpackage('scipy')
      File "/usr/lib64/python2.6/site-packages/numpy/distutils/misc_util.py", line 1002, in add_subpackage
        caller_level = 2)
      File "/usr/lib64/python2.6/site-packages/numpy/distutils/misc_util.py", line 971, in get_subpackage
        caller_level = caller_level + 1)
      File "/usr/lib64/python2.6/site-packages/numpy/distutils/misc_util.py", line 908, in _get_configuration_from_setup_py
        config = setup_module.configuration(*args)
      File "scipy/setup.py", line 15, in configuration
        config.add_subpackage('linalg')
      File "/usr/lib64/python2.6/site-packages/numpy/distutils/misc_util.py", line 1002, in add_subpackage
        caller_level = 2)
      File "/usr/lib64/python2.6/site-packages/numpy/distutils/misc_util.py", line 971, in get_subpackage
        caller_level = caller_level + 1)
      File "/usr/lib64/python2.6/site-packages/numpy/distutils/misc_util.py", line 908, in _get_configuration_from_setup_py
        config = setup_module.configuration(*args)
      File "scipy/linalg/setup.py", line 20, in configuration
        raise NotFoundError('no lapack/blas resources found')
    numpy.distutils.system_info.NotFoundError: no lapack/blas resources found

    ----------------------------------------
Command "/usr/bin/python -c "import setuptools, tokenize;__file__='/tmp/pip-build-DOw4uO/scipy/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-XZvmqa-record/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-build-DOw4uO/scipy```

observation_period_end should default to the max transaction time

In utils.calibration_and_holdout_data(transactions, customer_id_col, datetime_col, calibration_period_end, observation_period_end=datetime.today(), freq='D', datetime_format=None): datetime.today() is evaluated once, when the module is first imported. This can be confusing when dealing with historical data, as the holdout column and evaluation results become dependent on the time the process is run.

It'd be better to default to None and then in the function body do:

if observation_period_end is None:
    observation_period_end = transactions[datetime_col].max()

ModifiedBetaGeoFitter doesn't converge

Hello,
I tried to fit a ModifiedBetaGeo model on some data generated by the model itself.
It doesn't give good results; I tried modifying the initial parameters, but it basically never works.
It seems to have problems fitting the a and b parameters: they always end up close to zero.

Here is the code to reproduce the issue:

import lifetimes.generate_data as gen
from lifetimes import estimation

T = 40
r = 0.24
alpha = 4.41
a = 0.8
b = 2.43

# generate data from the MBG/NBD model itself, then try to recover the parameters
data = gen.modified_beta_geometric_nbd_model(T, r, alpha, a, b, size=1000)
fitter = estimation.ModifiedBetaGeoFitter()
fitter.fit(data['frequency'], data['recency'], data['T'])
print("Library Fit")
print(fitter.params_)

conditional_expected_average_profit by individual all the same?

Hi there,

Again, thanks for the awesome program. When I run the conditional_expected_average_profit function on individual customers, I get the same value for many of them. Below is the result applied to the CDNOW dataset, following the tutorial. Am I misunderstanding how this function works? The description says it will compute for a group of one or more customers...

This method computes the conditional expectation of the average profit per transaction for a group of one or more customers.

Thanks in advance!


id format in function:summary_data_from_transaction_data

Hi,

I am trying to import some data and convert it to the right form using:summary_data_from_transaction_data.

However, it seems that it does not allow IDs in string format (e.g. 8a583481da078d09a2ad3a1e04c1740c).
I believe it would be useful to allow non-numeric IDs, for instance to process customer IDs which have previously been hashed.

Best
David L.

utils.calibration_and_holdout_data() should work with monetary value

There's currently a subtle bug in utils.calibration_and_holdout_data():

calibration_summary_data = summary_data_from_transaction_data(calibration_transactions, customer_id_col, datetime_col, datetime_format, observation_period_end=calibration_period_end, freq=freq)

The fourth positional argument of summary_data_from_transaction_data is actually monetary_value_col, not datetime_format.

The ideal solution would be to add monetary_value_col as a parameter to calibration_and_holdout_data() and pass most parameters to summary_data_from_transaction_data by name to avoid future errors.

potential error in calculating CLV

I found an issue about the calculation of CLV.

In customer_lifetime_value(), the discounted cash flow for each period is calculated as:

df['expected_revenues_period_'+str(i)] = df.apply(
    lambda r: (m*transaction_prediction_model.predict(i, r['frequency'], r['recency'], r['T'])/(1+d)**(i/30)),
    axis=1
    )

in which:
i is the looping counter (in steps of 30),
m is self.conditional_expected_average_profit(),
d is the discount rate,
and transaction_prediction_model.predict() returns conditional_expected_number_of_purchases_up_to_time.

It seems to me that m should be the expected profit for the individual customer rather than over all the training records, and the discounted cash flow for a single period is m * (predict(t) - predict(t-1)) / (1+d)**(i/30).

May I propose to make such change:

df['monetary_value'] = monetary_value

for i in range(30, (time*30)+1, 30):
    df['expected_revenues_period_'+str(i)] = df.apply(
        lambda r: (r['monetary_value'] *
                   (transaction_prediction_model.predict(i, r['frequency'], r['recency'], r['T'])
                    - transaction_prediction_model.predict(i - 30, r['frequency'], r['recency'], r['T']))
                   / (1+d)**(i/30)),
        axis=1
    )

ggf.customer_lifetime_value returns sum of CLV instead of CLV per customer_id

Hi @CamDavidsonPilon, long time no see!

Just wanted to point out an issue in the Quickstart tutorial in the Gamma-gamma model section. When I run the example code for ggf.customer_lifetime_value I get a single value (1992970.28649 to be exact) instead of a column containing one prediction per customer like in the docs. To get an output similar to the docs I needed to do a groupby on customer_id and then apply ggf.customer_lifetime_value to each chunk, which is incredibly slow.

Hope this was helpful, keep up the good work!

Pareto_NBD overflow

When we have customers with frequency > 300, the _A_0 function returns 0, because both denominators, (max_alpha_beta + rec) ** r_s_x and (max_alpha_beta + age) ** r_s_x, overflow to Inf (here r_s_x = r + s + x).
But in _negative_log_likelihood we only need the log of that value, which can be computed with the logsumexp function from scipy.misc.
Suppose we have to calculate log(a/A^r - b/B^r) when r is so large that A^r and B^r overflow to Inf.
In that case we can do the following:

log(A_0) = log(a/A^r - b/B^r) = log(a*B^r - b*A^r) - r*log(A*B)
         = log(exp(log(a) + r*log(B)) - exp(log(b) + r*log(A))) - r*log(A*B)

Now we can use logsumexp on the first term to get the correct result.

The same problem occurs when calculating conditional_probability_alive:
(alpha + T) ** (r + x) * A_0 overflows to Inf * 0 and returns NaN.
In contrast to _A_0, which needs to be fast for the optimization routines, conditional_probability_alive usually executes just once, so we can avoid overflow by using mpmath (shipped with sympy), which allows precise calculations without overflowing.

I would also recommend adding an option for the differential evolution solver from scipy.optimize.

Adding some functions here:

from numpy import array, log, ones
from scipy.special import hyp2f1
from scipy.misc import logsumexp  # lives in scipy.special in newer SciPy releases
import sympy.mpmath as mp         # arbitrary-precision arithmetic

def _log_a_0(r, alpha, s, beta, freq, rec, age):
    # log of the Pareto/NBD A_0 term, kept on the log scale to avoid overflow
    min_ab, max_ab, t = (alpha, beta, r + freq) if alpha < beta else (beta, alpha, s + 1)
    abs_ab = max_ab - min_ab
    rsx = r + s + freq
    p_1, q_1 = hyp2f1(rsx, t, rsx + 1., abs_ab / (max_ab + rec)), (max_ab + rec)
    p_2, q_2 = hyp2f1(rsx, t, rsx + 1., abs_ab / (max_ab + age)), (max_ab + age)
    sign = ones(len(freq))
    return logsumexp([log(p_1) + rsx * log(q_2), log(p_2) + rsx * log(q_1)],
                     axis=0, b=[sign, -sign]) - rsx * log(q_1 * q_2)

def p_alive_precise(self, freq, rec, age):
    # precise P(alive), evaluating the overflowing power with mpmath
    freq, rec, age = check_inputs(freq, rec, age)  # the library's input validator
    r, alpha, s, beta = self.pars
    log_a_0 = self._log_a_0(r, alpha, s, beta, freq, rec, age)

    def precise(freq_, age_, log_a_0_):
        return float(mp.power(alpha + age_, r + freq_) * mp.exp(log_a_0_))

    p = array(list(map(precise, freq, age, log_a_0)))
    return 1. / (1. + (s / (r + s + freq)) * (beta + age) ** s * p)

Summary transaction data round to whole periods not float

Tried to get an RFM matrix from the raw CDNOW transactions sample, but without success with summary_data_from_transaction_data, because to_freq rounds to whole periods. Do you have an idea how to improve that?
I came up with a hacky solution: set freq='D' and then divide recency and T by 7 to get the appropriate numbers, but that doesn't seem like a good way to solve problems like this.

from lifetimes.datasets import load_transaction_data, load_cdnow_summary, load_dataset
from lifetimes.utils import summary_data_from_transaction_data

df_cdnow_summary = load_cdnow_summary()

transactions = load_dataset('CDNOW_sample.txt', header=None, sep='\s+')
transactions.columns = ['id_total', 'id_sample', 'date', 'num_cd_purc', 'total_value']
summary_trans = summary_data_from_transaction_data(transactions, 'id_sample', 'date', datetime_format='%Y%m%d', 
                                   observation_period_end='19970930', freq='W')

df_cdnow_summary.head()
"""
   ID  frequency  recency      T
0   1          2    30.43  38.86
1   2          1     1.71  38.86
2   3          0     0.00  38.86
3   4          0     0.00  38.86
4   5          0     0.00  38.86
"""
summary_trans.head()
"""
           frequency  recency     T
id_sample                          
1                2.0     30.0  39.0
2                1.0      2.0  39.0
3                0.0      0.0  39.0
4                0.0      0.0  39.0
5                0.0      0.0  39.0
"""

Error generating the Frequency/Recency Matrix

Hi,

When trying to generate the Frequency/Recency Matrix I get an error.
I'm using Anaconda with matplotlib 1.5. Any idea how to fix this?

AttributeError                            Traceback (most recent call last)
<ipython-input-4-7b0ad3a728a5> in <module>()
      1 from lifetimes.plotting import plot_frequency_recency_matrix
      2 
----> 3 plot_frequency_recency_matrix(bgf)

//anaconda/lib/python2.7/site-packages/lifetimes/plotting.pyc in plot_frequency_recency_matrix(model, T, max_frequency, max_recency, **kwargs)
    116     # necessary for colorbar to show up
    117     PCM = ax.get_children()[2]
--> 118     plt.colorbar(PCM, ax=ax)
    119 
    120     return ax

//anaconda/lib/python2.7/site-packages/matplotlib/pyplot.py in colorbar(mappable, cax, ax, **kw)
   2235         ax = gca()
   2236 
-> 2237     ret = gcf().colorbar(mappable, cax = cax, ax=ax, **kw)
   2238     return ret
   2239 colorbar.__doc__ = matplotlib.colorbar.colorbar_doc

//anaconda/lib/python2.7/site-packages/matplotlib/figure.py in colorbar(self, mappable, cax, ax, use_gridspec, **kw)
   1593                 cax, kw = cbar.make_axes(ax, **kw)
   1594         cax.hold(True)
-> 1595         cb = cbar.colorbar_factory(cax, mappable, **kw)
   1596 
   1597         self.sca(current_ax)

//anaconda/lib/python2.7/site-packages/matplotlib/colorbar.py in colorbar_factory(cax, mappable, **kwargs)
   1328         cb = ColorbarPatch(cax, mappable, **kwargs)
   1329     else:
-> 1330         cb = Colorbar(cax, mappable, **kwargs)
   1331 
   1332     cid = mappable.callbacksSM.connect('changed', cb.on_mappable_changed)

//anaconda/lib/python2.7/site-packages/matplotlib/colorbar.py in __init__(self, ax, mappable, **kw)
    878         # Ensure the given mappable's norm has appropriate vmin and vmax set
    879         # even if mappable.draw has not yet been called.
--> 880         mappable.autoscale_None()
    881 
    882         self.mappable = mappable

AttributeError: 'Spine' object has no attribute 'autoscale_None'

Test dataset issue

Hi,

I think there may be a bug related to the load_summary_data_with_monetary_value dataset.
When I execute:

from lifetimes.datasets import load_summary_data_with_monetary_value

summary_with_money_value = load_summary_data_with_monetary_value()
summary_with_money_value.head()
returning_customers_summary = summary_with_money_value[summary_with_money_value['frequency']>0]

returning_customers_summary.head()

I obtain:

frequency   recency     T   monetary_value
customer_id  
1   2   30.43   38.86   22.35
2   1   1.71    38.86   11.77
6   7   29.43   38.86   73.74
7   1   5.00    38.86   11.77
9   2   35.71   38.86   25.55

Which is consistent with the content of https://github.com/CamDavidsonPilon/lifetimes/blob/master/lifetimes/datasets/cdnow_customers_transactions.csv

But according to the tutorial I should get:

frequency   recency T   monetary_value
id  
2   1           0       262 44.500000
3   2           90      272 20.353333
4   2           213     273 24.673333
5   7           257     273 32.651250
8   3           183     273 26.447500

How is it possible? Does the tutorial refer to an older version of the dataset?

The parts of the tutorial using this dataset are consequently different from what one obtains repeating the same commands.

PS: Thanks for the amazing work and Happy New Year!

Why do customers with 0 frequency have greater monetary value than others?

Before I create this issue, I should say thank you to all the contributors for this great open source library.

Question below,

Why do the customers with 0 frequency (ids 3, 4, 5, 8, 10 below) have greater monetary value than customers 2, 8, 9?

print(ggf.conditional_expected_average_profit(
    summary_with_money_value['frequency'],
    summary_with_money_value['monetary_value']
).head())
"""
customer_id
1     24.658616
2     18.911480
3     35.171003
4     35.171003
5     35.171003
6     71.462851
7     18.911480
8     35.171003
9     27.282408
10    35.171003
dtype: float64
"""

I understand "how" and "where" this result comes from, but I'm still curious about the "why". Is it intended?

def conditional_expected_average_profit(self, frequency=None, monetary_value=None):
    """..."""
    # ...
    individual_weight = p * frequency / (p * frequency + q - 1)
    population_mean = v * p / (q - 1)
    return (1 - individual_weight) * population_mean + individual_weight * monetary_value
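
A hedged numeric illustration (p, q, v are made-up values in the ballpark of a CDNOW fit): when frequency is 0, individual_weight is 0, so every zero-frequency customer is assigned the population mean v*p/(q - 1), which can easily exceed the conditional estimate for a low-spending repeat buyer.

p, q, v = 6.25, 3.74, 15.45          # hypothetical fitted Gamma-Gamma parameters
frequency, monetary_value = 0, 0.0   # a customer with no repeat purchases

individual_weight = p * frequency / (p * frequency + q - 1)   # == 0 here
population_mean = v * p / (q - 1)
print((1 - individual_weight) * population_mean
      + individual_weight * monetary_value)                   # ~35.2: the population mean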

Optimizer for fit method

I found that the optimizer for _fit was changed from minimize to fmin in utils with that commit. I have some questions:

  • why was the Nelder-Mead algorithm chosen?
  • why was minimize changed to fmin, when the scipy docs note:

The specific optimization method interfaces below in this subsection are not recommended for use in new scripts; all of these methods are accessible via a newer, more consistent interface provided by the functions above

I'm also interested in increasing the training speed for the model (BetaGeoFitter), because in my case I need to wait about 45 minutes to train (~15 million records). On a sample I saw that SLSQP was faster. Could it be used for training the model, or are there any pitfalls with that?
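
Illustrating just the scipy interface point, with rosen as a stand-in objective (nothing lifetimes-specific):

import numpy as np
from scipy.optimize import fmin, minimize, rosen

x0 = np.array([1.3, 0.7, 0.8, 1.9, 1.2])

res_nm = minimize(rosen, x0, method='Nelder-Mead')  # recommended, consistent interface
res_sq = minimize(rosen, x0, method='SLSQP')        # gradient-based, usually far fewer evaluations
xopt = fmin(rosen, x0, disp=False)                  # the legacy interface the commit switched to

print(res_nm.nfev, res_sq.nfev)  # SLSQP typically needs an order of magnitude fewer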

iterative_fitting not really iterative?

Reading through the code of utils._fit, it looks like the assignment to params_init at the end of the loop (https://github.com/CamDavidsonPilon/lifetimes/blob/master/lifetimes/utils.py#L247) doesn't have any effect, because params_init gets overwritten at the beginning of the loop (https://github.com/CamDavidsonPilon/lifetimes/blob/master/lifetimes/utils.py#L240). It looks like line 240 should be outside the loop body.

I'd also give params_init a different name (current_params?) to make the difference between that variable and initial_params clearer.
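
A toy version of the intended warm-start behavior, with rosen standing in for the negative log-likelihood: the random start is drawn once outside the loop, and each pass re-seeds from the previous optimum.

import numpy as np
from scipy.optimize import fmin, rosen

params = np.random.rand(5)  # drawn once, outside the loop
for _ in range(3):
    params = fmin(rosen, params, disp=False)  # warm-start from the last optimum
print(params)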

Pip Install Error

Hi,

I'm trying to pip install, and getting the following error. I also tried it from a fresh conda env with the same results. I can install it with python setup.py install, so maybe the readme isn't available to pip at install time? Though I'm not really sure...

Here's the traceback:

Collecting lifetimes
  Using cached Lifetimes-0.1.1.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 20, in <module>
      File "/private/var/folders/7b/trxdysx57xz95389p87cbhqrxgb3nq/T/pip-build-XwlQNs/lifetimes/setup.py", line 18, in <module>
        long_description=read('README.md'),
      File "/private/var/folders/7b/trxdysx57xz95389p87cbhqrxgb3nq/T/pip-build-XwlQNs/lifetimes/setup.py", line 6, in read
        return open(os.path.join(os.path.dirname(__file__), fname)).read()
    IOError: [Errno 2] No such file or directory: '/private/var/folders/7b/trxdysx57xz95389p87cbhqrxgb3nq/T/pip-build-XwlQNs/lifetimes/README.md'

Here's the conda env:

Current conda install:

             platform : osx-64
        conda version : 3.8.3
  conda-build version : not installed
       python version : 2.7.9.final.0
     requests version : 2.5.1
     root environment : /Users/thauck/bin/miniconda  (writable)
  default environment : /Users/thauck/projects/ltv/env
     envs directories : /Users/thauck/bin/miniconda/envs
        package cache : /Users/thauck/bin/miniconda/pkgs
         channel URLs : http://repo.continuum.io/pkgs/free/osx-64/
                        http://repo.continuum.io/pkgs/pro/osx-64/
          config file : /Users/thauck/.condarc
    is foreign system : False

Transforming transactional data into summary data

Hello,
I'm trying to transform the following transactional data into summary data...

Transactional data:
date id
0 08/03/2014 0
1 21/05/2014 1
2 14/03/2014 2
3 09/04/2014 2
4 21/05/2014 2

To transform transactional data into summary data:
summary = summary_data_from_transaction_data(transaction_data, 'id', 'date', observation_period_end='31/12/2014')

and I get the following matrix:
id frequency recency T
0 0 0 150
1 0 0 224
2 2 174 292

but If I do it manually I get the following one,
id frequency recency T
0 0 0 298
1 0 0 224
2 2 68 292

What am I doing wrong? These are my numbers:
298 = 31/12/2014 - 08/03/2014
224 = 31/12/2014 - 21/05/2014
292 = 31/12/2014 - 14/03/2014
68 = 21/05/2014 - 14/03/2014

Thanks in advance,
Izaskun
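
A likely cause (an assumption, since it depends on how the dates were parsed): 08/03/2014 being read month-first as August 3 gives exactly T=150, and 09/04/2014 read as September 4 gives recency=174. Passing an explicit day-first format should reproduce the manual numbers:

summary = summary_data_from_transaction_data(
    transaction_data, 'id', 'date',
    datetime_format='%d/%m/%Y',
    observation_period_end='2014-12-31')  # an ISO date avoids the same ambiguity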

pip install lifetimes error

Hi,

I cannot pip install the lifetimes module. I get the following error message:

runfile('C:/Users/az/.spyder2-py3/temp.py', wdir='C:/Users/az/.spyder2-py3')
File "C:/Users/az/.spyder2-py3/temp.py", line 12
pip install lifetimes
^
SyntaxError: invalid syntax

Any clue about this?
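
The SyntaxError comes from running pip install inside the Python interpreter (here, a Spyder script): pip is a shell command, not Python syntax. Run it in a terminal, or shell out from Python (a sketch):

import subprocess, sys

# invoke pip for the interpreter currently running
subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'lifetimes'])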

Holdout data util method does not check for observation period end.

Hello Cam, great library, thanks for putting it together. I discovered a bug in the util.calibration_and_holdout_data function. When it creates the holdout aggregate, it does not filter by the observation_period_end parameter, only the calibration_period_end parameter. See line 55 in util.py. The result is that the holdout period is aggregated through the end of the dataset rather than the upper bound set by the parameter.

In util.summary_data_from_transaction_data you have this code at line 102:

transactions = transactions.ix[transactions[datetime_col] <= observation_period_end]

Adding this line to the holdout util function will fix the problem.

plot_probability_alive_matrix give AttributeError: 'Spine' object has no attribute 'autoscale_None'

I'm trying to recreate the example on the main lifetimes page, but using my own data.

I've cleaned my data and run the BetaGeoFitter correctly.

from lifetimes import BetaGeoFitter

bgf = BetaGeoFitter(penalizer_coef=0.0)
bgf.fit(df_merge_2['frequency'], df_merge_2['recency'], df_merge_2['t'])

print(bgf)

>> lifetimes.BetaGeoFitter: fitted with 3560 subjects, r: 0.29, alpha: 0.00, a: 0.66, b: 0.14

However when I run:

from lifetimes.plotting import plot_frequency_recency_matrix
or
from lifetimes.plotting import plot_probability_alive_matrix

I get the same error for both:

AttributeError: 'Spine' object has no attribute 'autoscale_None'

Any idea why this error would occur?

Question - pre-processing assumptions

We're using the BetaGeoFitter to model our RF(T)M data and predict CLTV and survival over the next period, training on the 5 prior periods, and we had some questions about the intended pre-processing.

Can the model consume customers with Tenure=0 or Recency=0 (last purchase on the first day of transactions)? We currently pre-process by calculating Tenure and Recency from the day after the end of the training period, so that Tenure >= 1, Recency >= 1, and Recency <= Tenure. Is this a necessary step, or is the model fit more efficiently if Tenure and Recency are allowed to be 0?

We only include customers who had a transaction in the training period, but is the BGF intended to also model customers with 0 transactions? It would be difficult to assert a forward-looking CLTV, because we apply the average observed spend to the predicted count of visits.

Thanks to all contributors for your work on this library.

summary_data_from_transaction_data() returns different result when monetary_value_col included

Hi, I found that summary_data_from_transaction_data() returns a different result when monetary_value_col is included. Specifically, when there are multiple transactions (of different prices) on the same day, the result will be incorrect.

I chased the code and believe the cause is:
period_transactions = reduce_events_to_period(transactions, *select_columns).reset_index(level=select_columns[1:])

which groups the dataframe by *select_columns; I think monetary_value_col should not be included in that grouping.

So perhaps a modification is
period_transactions = reduce_events_to_period(transactions, [customer_id_col, datetime_col]).reset_index(level=select_columns[1:])

While at the same time, (near the end of the function):
customers['monetary_value'] = period_transactions.groupby(level=customer_id_col)[monetary_value_col].mean()

should be changed to:
customers['monetary_value'] = transactions.groupby(customer_id_col)[monetary_value_col].mean()
