precon's Introduction

precon: Python functions for Price Index production

What is it?

precon is a Python package that provides a suite of speedy, vectorised functions for implementing common methods in the production of price indices. It aims to provide the high-level building blocks for statistical production systems at National Statistical Institutes (NSIs) and other research institutions concerned with creating indices. Developed in-house at the Office for National Statistics (ONS), it aims to become the standard library for price index production. This can only be achieved with help from the community, so all contributions are welcome!

Installation

pip install precon

Use

import precon
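As a quick, hedged sketch of calling into the package (assuming, as in the chain issue further down this page, that precon.chain accepts a DataFrame of fixed-base index values with a datetime index):

    import pandas as pd
    import precon

    # Fixed-base index values with a monthly datetime index.
    unchained = pd.DataFrame(
        {"index_value": [100.0, 100.5274, 100.894, 100.6891]},
        index=pd.date_range("2018-01-01", periods=4, freq="MS"),
    )

    # Chain the fixed-base values into a continuous series.
    chained = precon.chain(unchained)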

API

Many functions in the precon package are designed to work with pandas DataFrames or Series that contain only one type of value, with any categorical or descriptive metadata held in either the index or the columns axis. Each component of a statistical operation or equation usually sits in its own DataFrame, i.e. prices in one frame and weights in another. When dealing with time series data, the functions expect one axis to contain only the datetime index. Where a function accepts more than one input DataFrame, they need to share the same index values so that pandas can align the components to be processed together. Processing values in this matrix format allows the functions to take advantage of powerful pandas/numpy vectorised methods.

The time series period frequencies do not always need to match: if the values in one DataFrame do not change over the period frequency of another, the functions will resample to the finer period frequency and fill the values forward.
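For illustration, a hedged pandas-only sketch of this layout and of the forward-fill alignment described above (the item names, values and weights are made up; none of this is precon API):

    import pandas as pd

    periods = pd.date_range("2020-01-01", periods=6, freq="MS")

    # One value type per DataFrame: prices in one frame, weights in another,
    # with the datetime index on one axis and item metadata on the other.
    prices = pd.DataFrame(
        {"item_a": [100, 101, 103, 102, 104, 105],
         "item_b": [100, 100, 99, 101, 102, 103]},
        index=periods,
    )
    weights = pd.DataFrame(
        {"item_a": [0.6], "item_b": [0.4]},
        index=pd.DatetimeIndex(["2020-01-01"]),
    )

    # Weights only change annually, so align them to the monthly prices by
    # reindexing and filling forward before combining element-wise.
    monthly_weights = weights.reindex(prices.index).ffill()
    weighted_prices = prices.mul(monthly_weights)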

Check the docs for detailed guidance on each function and its parameters.

Features

  • Calculate fixed-base price indices using common index methods.
  • Combine or aggregate lower-level indices to create higher-level indices.
  • Chain fixed-base indices together for a continuous time series.
  • Re-reference indices to start from a different time period.
  • Calculate contributions to higher-level indices from each of the component indices.
  • Impute new base prices over a time series.
  • Uprate values by index movements.
  • Round weight values with adjustment to ensure the sum doesn't change.
  • Produce common sets of statistics quickly with stat compiler functions.

Dependencies

  • pandas / numpy (the functions operate on pandas DataFrames and Series, as described in the API section above)

Contributing to precon

See CONTRIBUTING.rst

precon's People

Contributors

mitches-got-glitches

Forkers

martinr-l

precon's Issues

Add pre-commit hooks for devs

I want to add some pre-commit hooks for developers.

  • Remove whitespace

  • Flake8 linting

  • Check commit message subject length

dropna in jan_adjustment will remove all values in row

adjusted = adjusted.dropna()

The above line in the function means if passing in a dataframe with the following format

date        col1  col2
2019-01-01  101   NaN
...         ...   ...
2019-05-01  104   NaN
2019-06-01  103   100
...         ...   ...
2020-01-01  101   102

(i.e. the col2 time series starts later than col1), then jan_adjustment will drop the entire row for 2019-01-01, and likewise every other row containing a NaN.

Not sure what the correct behaviour is, but anecdotally removing the dropna seems to work well.
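If partial rows should be kept, a gentler option than removing the dropna entirely might be to drop only rows that are missing in every column (a suggestion, not the agreed fix):

    # how="all" drops a row only when every column is NaN, so a series that
    # starts later than the others no longer wipes out earlier periods.
    adjusted = adjusted.dropna(how="all")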

Add a fillna(0) to the weights in the aggregation method to stop Zero Division bug

Still totally unsure whether this will solve the issue a user is experiencing, but in some adapted code the lines were:

    zeros_and_nans = indices.isna() | indices.eq(0)
    weights = weights.mask(zeros_and_nans, 0).fillna(0)

Consider implementing it on its own line with a comment explaining why that fill is necessary. Also find out what edge case it solves and write a test for it.

weights = weights.mask(indices.isna() | indices.eq(0), 0)
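One possible shape for that, with the two concerns split out and commented (a sketch, not the final code):

    # Zero the weight of any component whose index value is missing or zero,
    # so that it contributes nothing to the aggregate.
    zeros_and_nans = indices.isna() | indices.eq(0)
    weights = weights.mask(zeros_and_nans, 0)

    # Separately fill missing weights with 0 so that NaNs cannot propagate
    # into the weight totals used as the denominator, which is the suspected
    # source of the zero-division error.
    weights = weights.fillna(0)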

Documentation

It would be useful to be able to view the docs for this project.

Currently, I think, you have to clone and build them yourself?

A solution would be to use GitHub Pages to serve the docs, as this works well with Sphinx.

Add new aggregation functionality

Add functionality to aggregate for a given MultiIndex level or set of levels, and extend that functionality to enable an aggregation up a hierarchical tree given by a set of MultiIndex levels.

Add tests and ensure docstrings are thorough.
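For reference, a pandas-only sketch of the kind of level-based aggregation described above; the level names, the weighted-mean method and the data are illustrative assumptions rather than the intended precon API:

    import pandas as pd

    # Indices and weights share a two-level column MultiIndex.
    cols = pd.MultiIndex.from_tuples(
        [("01", "01.1"), ("01", "01.2"), ("02", "02.1")],
        names=["division", "class"],
    )
    periods = pd.date_range("2020-01-01", periods=3, freq="MS")
    indices = pd.DataFrame(
        [[100.0, 100.0, 100.0], [101.0, 103.0, 99.0], [102.0, 104.0, 98.0]],
        index=periods, columns=cols,
    )
    weights = pd.DataFrame([[0.3, 0.2, 0.5]] * 3, index=periods, columns=cols)

    # Aggregate up to the "division" level as a weighted mean of components.
    weighted = indices * weights
    division_indices = (
        weighted.T.groupby(level="division").sum().T
        / weights.T.groupby(level="division").sum().T
    )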

Chain function produces incorrect indices if period missing

The chain does not handle missing periods correctly but still produces a result.

import pandas as pd
from pandas import Timestamp
import precon

df_all_periods = pd.DataFrame.from_records([
        (Timestamp('2018-01-01'), 100.000000),
        (Timestamp('2018-02-01'), 100.527400),
        (Timestamp('2018-03-01'), 100.894000),
        (Timestamp('2018-04-01'), 100.689100),
        (Timestamp('2018-05-01'), 102.670400),
        (Timestamp('2018-06-01'), 100.811000),
        (Timestamp('2018-07-01'), 102.632500),
        (Timestamp('2018-08-01'), 103.133200),
        (Timestamp('2018-09-01'), 103.111400),
        (Timestamp('2018-10-01'), 103.417700),
        (Timestamp('2018-11-01'), 103.155800),
        (Timestamp('2018-12-01'), 103.616800),
        (Timestamp('2019-01-01'), 104.246480),
        (Timestamp('2019-02-01'), 101.093900),
        (Timestamp('2019-03-01'), 101.726900),
        (Timestamp('2019-04-01'), 100.478600),  # April 2019 value present
        (Timestamp('2019-05-01'), 100.647800),
        (Timestamp('2019-06-01'), 100.439100),
        (Timestamp('2019-07-01'), 102.181900),
        (Timestamp('2019-08-01'), 100.608800),
        (Timestamp('2019-09-01'), 102.067000),
        (Timestamp('2019-10-01'), 102.418300),
        (Timestamp('2019-11-01'), 102.769600),
        (Timestamp('2019-12-01'), 103.120900),
        (Timestamp('2020-01-01'), 103.519414),
        (Timestamp('2020-02-01'), 100.710500),
    ],
    columns=('period', 'index_value'),
).set_index('period')

df_period_missing = pd.DataFrame.from_records([
        (Timestamp('2018-01-01'), 100.000000),
        (Timestamp('2018-02-01'), 100.527400),
        (Timestamp('2018-03-01'), 100.894000),
        (Timestamp('2018-04-01'), 100.689100),
        (Timestamp('2018-05-01'), 102.670400),
        (Timestamp('2018-06-01'), 100.811000),
        (Timestamp('2018-07-01'), 102.632500),
        (Timestamp('2018-08-01'), 103.133200),
        (Timestamp('2018-09-01'), 103.111400),
        (Timestamp('2018-10-01'), 103.417700),
        (Timestamp('2018-11-01'), 103.155800),
        (Timestamp('2018-12-01'), 103.616800),
        (Timestamp('2019-01-01'), 104.246480),
        (Timestamp('2019-02-01'), 101.093900),
        (Timestamp('2019-03-01'), 101.726900),
        (Timestamp('2019-04-01'), None),  # April 2019 value missing
        (Timestamp('2019-05-01'), 100.647800),
        (Timestamp('2019-06-01'), 100.439100),
        (Timestamp('2019-07-01'), 102.181900),
        (Timestamp('2019-08-01'), 100.608800),
        (Timestamp('2019-09-01'), 102.067000),
        (Timestamp('2019-10-01'), 102.418300),
        (Timestamp('2019-11-01'), 102.769600),
        (Timestamp('2019-12-01'), 103.120900),
        (Timestamp('2020-01-01'), 103.519414),
        (Timestamp('2020-02-01'), 100.710500),
    ],
    columns=('period', 'index_value'),
).set_index('period')

expected = pd.DataFrame.from_records([
        (Timestamp('2018-01-01'), 100.000000),
        (Timestamp('2018-02-01'), 100.527400),
        (Timestamp('2018-03-01'), 100.894000),
        (Timestamp('2018-04-01'), 100.689100),
        (Timestamp('2018-05-01'), 102.670400),
        (Timestamp('2018-06-01'), 100.811000),
        (Timestamp('2018-07-01'), 102.632500),
        (Timestamp('2018-08-01'), 103.133200),
        (Timestamp('2018-09-01'), 103.111400),
        (Timestamp('2018-10-01'), 103.417700),
        (Timestamp('2018-11-01'), 103.155800),
        (Timestamp('2018-12-01'), 103.616800),
        (Timestamp('2019-01-01'), 104.246480),
        (Timestamp('2019-02-01'), 105.386833),
        (Timestamp('2019-03-01'), 106.046713),
        (Timestamp('2019-04-01'), 104.745404),
        (Timestamp('2019-05-01'), 104.921789),
        (Timestamp('2019-06-01'), 104.704227),
        (Timestamp('2019-07-01'), 106.521034),
        (Timestamp('2019-08-01'), 104.881133),
        (Timestamp('2019-09-01'), 106.401255),
        (Timestamp('2019-10-01'), 106.767473),
        (Timestamp('2019-11-01'), 107.133691),
        (Timestamp('2019-12-01'), 107.499909),
        (Timestamp('2020-01-01'), 107.915346),
        (Timestamp('2020-02-01'), 108.682084),
    ],
    columns=('period', 'index_value'),
).set_index('period')

df_all_periods['chained'] = precon.chain(df_all_periods)

df_period_missing['chained'] = precon.chain(df_period_missing)

pd.concat([df_all_periods, df_period_missing, expected], keys=['all_periods', 'period_missing', 'expected'], axis=1)

In the above example, expected is calculated as if all periods were present, but using the equation unlinked index * linked base / 100, so the chained indices after the missing period are not affected. precon.chain doesn't have an issue with the first month, as it uses a backfill after shifting the indices by one period to fill it in.
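To make the linking relation concrete, here is one period worked through by hand; the choice of January as the link month is read off the expected values above and is an assumption, not a statement of precon's internals:

    # chained value = unlinked index * linked base / 100
    unlinked_feb_2019 = 101.093900      # from the input DataFrames above
    linked_base_jan_2019 = 104.246480   # chained value at the January link month
    chained_feb_2019 = unlinked_feb_2019 * linked_base_jan_2019 / 100
    # ~105.3868, matching the 2019-02-01 row of `expected`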

Modify get_base_prices to only fill within year

This might need some generalisation later on, but replace what is there for now. Maybe this function could move to index_methods too? Move index_calculator there as well?

precon/precon/imputation.py, lines 143 to 152 at commit ea185fa:

    def get_base_prices(
        prices: pd.DataFrame,
        base_period: int = 1,
        axis: pd._typing.Axis = 0,
        ffill: bool = True,
    ) -> pd.DataFrame:
        """Returns the prices at the base month in the same shape as prices.
        Default behaviour is to fill forward values, but can be changed to
        return NaN where not base_month by setting ffill=False.
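A hedged sketch of what "only fill within year" could look like, assuming the axis being filled carries a DatetimeIndex (the helper name is made up, not a proposed signature):

    import pandas as pd

    def ffill_within_year(base_prices: pd.DataFrame) -> pd.DataFrame:
        """Forward-fill base prices without crossing a year boundary."""
        # Grouping by calendar year restarts the fill each January, so a
        # December base price cannot leak into the following year.
        return base_prices.groupby(base_prices.index.year).ffill()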

Change to applying _get_adjustments in round_and_adjust function

Change from the following:

    elif isinstance(obj, pd.core.frame.DataFrame):

        # Create an empty DataFrame to fill with adjustments
        adjustments = pd.DataFrame().reindex_like(obj)

        for index, row in iter_method(obj):
            # Create a selector based on the axis
            slice_ = axis_slice(index, axis)

            adjustments.loc[slice_] = _get_adjustments(row, decimals)

to this:

    elif isinstance(obj, pd.core.frame.DataFrame):

        adjustments = obj.apply(_get_adjustments, args=(decimals,), axis=axis)

This should also allow for the removal of:

    iter_dict = {
        0: pd.DataFrame.iterrows,
        1: pd.DataFrame.iteritems,
    }
    iter_method = iter_dict.get(axis)

Slimming the function right down.

While taking care of this, remember to also do the following:

  • Ensure an empty line at EOF
  • Change the isinstance calls so that they use pd.Series / pd.DataFrame rather than the pd.core.series. / pd.core.frame. paths
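Putting those together, a sketch of the slimmed-down branch; _get_adjustments, obj, decimals and axis are the names from the snippets above, and the Series branch is a guess at what precedes the elif shown. Note that args must be a one-element tuple, and that DataFrame.apply's axis convention (0 applies per column, 1 per row) should be checked against the iterrows/iteritems mapping being removed:

    import pandas as pd

    if isinstance(obj, pd.Series):
        adjustments = _get_adjustments(obj, decimals)

    elif isinstance(obj, pd.DataFrame):
        # args must be a tuple, hence the trailing comma.
        adjustments = obj.apply(_get_adjustments, args=(decimals,), axis=axis)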
