precon's Introduction

precon: Python functions for Price Index production

What is it?

precon is a Python package that provides a suite of speedy, vectorised functions for implementing common methods in the production of price indices. It aims to provide the high-level building blocks for statistical production systems at National Statistical Institutes (NSIs) and other research institutions concerned with creating indices. Developed in-house at the Office for National Statistics (ONS), it aims to become the standard library for price index production. This can only be achieved with help from the community, so all contributions are welcome!

Installation

pip install precon

Use

import precon
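As a quick, hedged sketch of calling into the package (assuming, as in the chain issue further down this page, that precon.chain accepts a DataFrame of fixed-base index values with a datetime index):

    import pandas as pd
    import precon

    # Fixed-base index values with a monthly datetime index.
    unchained = pd.DataFrame(
        {"index_value": [100.0, 100.5274, 100.894, 100.6891]},
        index=pd.date_range("2018-01-01", periods=4, freq="MS"),
    )

    # Chain the fixed-base values into a continuous series.
    chained = precon.chain(unchained)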

API

Many functions in the precon package are designed to work with pandas DataFrames or Series that contain only one type of value, with any categorical or descriptive metadata held in either the index or the columns axis. Each component of a statistical operation or equation usually sits in its own DataFrame, i.e. prices in one frame and weights in another. When dealing with time series data, the functions expect one axis to contain only the datetime index. Where a function accepts more than one input DataFrame, they need to share the same index values so that pandas can align the components to be processed together. Processing values in this matrix format allows the functions to take advantage of powerful pandas/numpy vectorised methods.

The time series period frequencies do not always need to match: if the values in one DataFrame do not change over the period frequency of another, the functions will resample to the finer period frequency and fill the values forward.
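For illustration, a hedged pandas-only sketch of this layout and of the forward-fill alignment described above (the item names, values and weights are made up; none of this is precon API):

    import pandas as pd

    periods = pd.date_range("2020-01-01", periods=6, freq="MS")

    # One value type per DataFrame: prices in one frame, weights in another,
    # with the datetime index on one axis and item metadata on the other.
    prices = pd.DataFrame(
        {"item_a": [100, 101, 103, 102, 104, 105],
         "item_b": [100, 100, 99, 101, 102, 103]},
        index=periods,
    )
    weights = pd.DataFrame(
        {"item_a": [0.6], "item_b": [0.4]},
        index=pd.DatetimeIndex(["2020-01-01"]),
    )

    # Weights only change annually, so align them to the monthly prices by
    # reindexing and filling forward before combining element-wise.
    monthly_weights = weights.reindex(prices.index).ffill()
    weighted_prices = prices.mul(monthly_weights)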

Check the docs for detailed guidance on each function and its parameters.

Features

  • Calculate fixed-base price indices using common index methods.
  • Combine or aggregate lower-level indices to create higher-level indices.
  • Chain fixed-base indices together for a continuous time series.
  • Re-reference indices to start from a different time period.
  • Calculate contributions to higher-level indices from each of the component indices.
  • Impute new base prices over a time series.
  • Uprate values by index movements.
  • Round weight values with adjustment to ensure the sum doesn't change.
  • Produce common sets of statistics quickly with stat compiler functions.

Dependencies

  • pandas / numpy (the functions operate on pandas DataFrames and Series, as described in the API section above)

Contributing to precon

See CONTRIBUTING.rst

precon's People

Contributors

mitches-got-glitches

Forkers

martinr-l

precon's Issues

Add pre-commit hooks for devs

I want to add some pre-commit hooks for developers.

  • Remove whitespace

  • Flake8 linting

  • Check commit message subject length

dropna in jan_adjustment will remove all values in row

adjusted = adjusted.dropna()

The above line in the function means if passing in a dataframe with the following format

date        col1  col2
2019-01-01  101   NaN
...         ...   ...
2019-05-01  104   NaN
2019-06-01  103   100
...         ...   ...
2020-01-01  101   102

(i.e. the col2 time series starts later than col1), then jan_adjustment will drop the entire row for 2019-01-01, and likewise every other row containing a NaN.

Not sure what the correct behaviour is, but anecdotally removing the dropna seems to work well.
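If partial rows should be kept, a gentler option than removing the dropna entirely might be to drop only rows that are missing in every column (a suggestion, not the agreed fix):

    # how="all" drops a row only when every column is NaN, so a series that
    # starts later than the others no longer wipes out earlier periods.
    adjusted = adjusted.dropna(how="all")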

Add a fillna(0) to the weights in the aggregation method to stop Zero Division bug

Still totally unsure whether this will solve the issue a user is experiencing, but in some adapted code the lines were:

    zeros_and_nans = indices.isna() | indices.eq(0)
    weights = weights.mask(zeros_and_nans, 0).fillna(0)

Consider implementing it on its own line with a comment explaining why that fill is necessary. Also find out what edge case it solves and write a test for it.

weights = weights.mask(indices.isna() | indices.eq(0), 0)
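One possible shape for that, with the two concerns split out and commented (a sketch, not the final code):

    # Zero the weight of any component whose index value is missing or zero,
    # so that it contributes nothing to the aggregate.
    zeros_and_nans = indices.isna() | indices.eq(0)
    weights = weights.mask(zeros_and_nans, 0)

    # Separately fill missing weights with 0 so that NaNs cannot propagate
    # into the weight totals used as the denominator, which is the suspected
    # source of the zero-division error.
    weights = weights.fillna(0)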

Documentation

It would be useful to be able to view the docs for this project.

Currently, I think, you have to clone and build them yourself?

A solution would be to use GitHub Pages to serve the docs, as this works well with Sphinx.

Add new aggregation functionality

Add functionality to aggregate for a given MultiIndex level or set of levels, and extend that functionality to enable an aggregation up a hierarchical tree given by a set of MultiIndex levels.

Add tests and ensure docstrings are thorough.
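For reference, a pandas-only sketch of the kind of level-based aggregation described above; the level names, the weighted-mean method and the data are illustrative assumptions rather than the intended precon API:

    import pandas as pd

    # Indices and weights share a two-level column MultiIndex.
    cols = pd.MultiIndex.from_tuples(
        [("01", "01.1"), ("01", "01.2"), ("02", "02.1")],
        names=["division", "class"],
    )
    periods = pd.date_range("2020-01-01", periods=3, freq="MS")
    indices = pd.DataFrame(
        [[100.0, 100.0, 100.0], [101.0, 103.0, 99.0], [102.0, 104.0, 98.0]],
        index=periods, columns=cols,
    )
    weights = pd.DataFrame([[0.3, 0.2, 0.5]] * 3, index=periods, columns=cols)

    # Aggregate up to the "division" level as a weighted mean of components.
    weighted = indices * weights
    division_indices = (
        weighted.T.groupby(level="division").sum().T
        / weights.T.groupby(level="division").sum().T
    )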

Chain function produces incorrect indices if period missing

The chain does not handle missing periods correctly but still produces a result.

import pandas as pd
from pandas import Timestamp
import precon

df_all_periods = pd.DataFrame.from_records([
        (Timestamp('2018-01-01'), 100.000000),
        (Timestamp('2018-02-01'), 100.527400),
        (Timestamp('2018-03-01'), 100.894000),
        (Timestamp('2018-04-01'), 100.689100),
        (Timestamp('2018-05-01'), 102.670400),
        (Timestamp('2018-06-01'), 100.811000),
        (Timestamp('2018-07-01'), 102.632500),
        (Timestamp('2018-08-01'), 103.133200),
        (Timestamp('2018-09-01'), 103.111400),
        (Timestamp('2018-10-01'), 103.417700),
        (Timestamp('2018-11-01'), 103.155800),
        (Timestamp('2018-12-01'), 103.616800),
        (Timestamp('2019-01-01'), 104.246480),
        (Timestamp('2019-02-01'), 101.093900),
        (Timestamp('2019-03-01'), 101.726900),
        (Timestamp('2019-04-01'), 100.478600),  # April 2019 value present
        (Timestamp('2019-05-01'), 100.647800),
        (Timestamp('2019-06-01'), 100.439100),
        (Timestamp('2019-07-01'), 102.181900),
        (Timestamp('2019-08-01'), 100.608800),
        (Timestamp('2019-09-01'), 102.067000),
        (Timestamp('2019-10-01'), 102.418300),
        (Timestamp('2019-11-01'), 102.769600),
        (Timestamp('2019-12-01'), 103.120900),
        (Timestamp('2020-01-01'), 103.519414),
        (Timestamp('2020-02-01'), 100.710500),
    ],
    columns=('period', 'index_value'),
).set_index('period')

df_period_missing = pd.DataFrame.from_records([
        (Timestamp('2018-01-01'), 100.000000),
        (Timestamp('2018-02-01'), 100.527400),
        (Timestamp('2018-03-01'), 100.894000),
        (Timestamp('2018-04-01'), 100.689100),
        (Timestamp('2018-05-01'), 102.670400),
        (Timestamp('2018-06-01'), 100.811000),
        (Timestamp('2018-07-01'), 102.632500),
        (Timestamp('2018-08-01'), 103.133200),
        (Timestamp('2018-09-01'), 103.111400),
        (Timestamp('2018-10-01'), 103.417700),
        (Timestamp('2018-11-01'), 103.155800),
        (Timestamp('2018-12-01'), 103.616800),
        (Timestamp('2019-01-01'), 104.246480),
        (Timestamp('2019-02-01'), 101.093900),
        (Timestamp('2019-03-01'), 101.726900),
        (Timestamp('2019-04-01'), None),  # April 2019 value missing
        (Timestamp('2019-05-01'), 100.647800),
        (Timestamp('2019-06-01'), 100.439100),
        (Timestamp('2019-07-01'), 102.181900),
        (Timestamp('2019-08-01'), 100.608800),
        (Timestamp('2019-09-01'), 102.067000),
        (Timestamp('2019-10-01'), 102.418300),
        (Timestamp('2019-11-01'), 102.769600),
        (Timestamp('2019-12-01'), 103.120900),
        (Timestamp('2020-01-01'), 103.519414),
        (Timestamp('2020-02-01'), 100.710500),
    ],
    columns=('period', 'index_value'),
).set_index('period')

expected = pd.DataFrame.from_records([
        (Timestamp('2018-01-01'), 100.000000),
        (Timestamp('2018-02-01'), 100.527400),
        (Timestamp('2018-03-01'), 100.894000),
        (Timestamp('2018-04-01'), 100.689100),
        (Timestamp('2018-05-01'), 102.670400),
        (Timestamp('2018-06-01'), 100.811000),
        (Timestamp('2018-07-01'), 102.632500),
        (Timestamp('2018-08-01'), 103.133200),
        (Timestamp('2018-09-01'), 103.111400),
        (Timestamp('2018-10-01'), 103.417700),
        (Timestamp('2018-11-01'), 103.155800),
        (Timestamp('2018-12-01'), 103.616800),
        (Timestamp('2019-01-01'), 104.246480),
        (Timestamp('2019-02-01'), 105.386833),
        (Timestamp('2019-03-01'), 106.046713),
        (Timestamp('2019-04-01'), 104.745404),
        (Timestamp('2019-05-01'), 104.921789),
        (Timestamp('2019-06-01'), 104.704227),
        (Timestamp('2019-07-01'), 106.521034),
        (Timestamp('2019-08-01'), 104.881133),
        (Timestamp('2019-09-01'), 106.401255),
        (Timestamp('2019-10-01'), 106.767473),
        (Timestamp('2019-11-01'), 107.133691),
        (Timestamp('2019-12-01'), 107.499909),
        (Timestamp('2020-01-01'), 107.915346),
        (Timestamp('2020-02-01'), 108.682084),
    ],
    columns=('period', 'index_value'),
).set_index('period')

df_all_periods['chained'] = precon.chain(df_all_periods)

df_period_missing['chained'] = precon.chain(df_period_missing)

pd.concat([df_all_periods, df_period_missing, expected], keys=['all_periods', 'period_missing', 'expected'], axis=1)

In the above example, expected is calculated as if all periods were present, but using the equation unlinked index * linked base / 100, so the chained indices after the missing period are not affected. precon.chain doesn't have an issue with the first month, as it uses a backfill after shifting the indices by one period to fill it in.
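To make the linking relation concrete, here is one period worked through by hand; the choice of January as the link month is read off the expected values above and is an assumption, not a statement of precon's internals:

    # chained value = unlinked index * linked base / 100
    unlinked_feb_2019 = 101.093900      # from the input DataFrames above
    linked_base_jan_2019 = 104.246480   # chained value at the January link month
    chained_feb_2019 = unlinked_feb_2019 * linked_base_jan_2019 / 100
    # ~105.3868, matching the 2019-02-01 row of `expected`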

Modify get_base_prices to only fill within year

This might need some generalisation later on, but replace what is there for now. Maybe this function could move to index_methods too? Move index_calculator there as well?

precon/precon/imputation.py, lines 143 to 152 at commit ea185fa:

    def get_base_prices(
        prices: pd.DataFrame,
        base_period: int = 1,
        axis: pd._typing.Axis = 0,
        ffill: bool = True,
    ) -> pd.DataFrame:
        """Returns the prices at the base month in the same shape as prices.
        Default behaviour is to fill forward values, but can be changed to
        return NaN where not base_month by setting ffill=False.
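A hedged sketch of what "only fill within year" could look like, assuming the axis being filled carries a DatetimeIndex (the helper name is made up, not a proposed signature):

    import pandas as pd

    def ffill_within_year(base_prices: pd.DataFrame) -> pd.DataFrame:
        """Forward-fill base prices without crossing a year boundary."""
        # Grouping by calendar year restarts the fill each January, so a
        # December base price cannot leak into the following year.
        return base_prices.groupby(base_prices.index.year).ffill()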

Change to applying _get_adjustments in round_and_adjust function

Change from the following:

    elif isinstance(obj, pd.core.frame.DataFrame):

        # Create an empty DataFrame to fill with adjustments
        adjustments = pd.DataFrame().reindex_like(obj)

        for index, row in iter_method(obj):
            # Create a selector based on the axis
            slice_ = axis_slice(index, axis)

            adjustments.loc[slice_] = _get_adjustments(row, decimals)

to this:

    elif isinstance(obj, pd.core.frame.DataFrame):

        adjustments = obj.apply(_get_adjustments, args=(decimals,), axis=axis)

This should also allow for the removal of:

    iter_dict = {
        0: pd.DataFrame.iterrows,
        1: pd.DataFrame.iteritems,
    }
    iter_method = iter_dict.get(axis)

Slimming the function right down.

While taking care of this, remember to also do the following:

  • Ensure an empty line at EOF
  • Change the isinstance calls so that they use pd.Series / pd.DataFrame rather than the pd.core.series. / pd.core.frame. paths
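Putting those together, a sketch of the slimmed-down branch; _get_adjustments, obj, decimals and axis are the names from the snippets above, and the Series branch is a guess at what precedes the elif shown. Note that args must be a one-element tuple, and that DataFrame.apply's axis convention (0 applies per column, 1 per row) should be checked against the iterrows/iteritems mapping being removed:

    import pandas as pd

    if isinstance(obj, pd.Series):
        adjustments = _get_adjustments(obj, decimals)

    elif isinstance(obj, pd.DataFrame):
        # args must be a tuple, hence the trailing comma.
        adjustments = obj.apply(_get_adjustments, args=(decimals,), axis=axis)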
