Coder Social home page Coder Social logo

ajcr / rolling Goto Github PK

View Code? Open in Web Editor NEW
198.0 8.0 9.0 225 KB

Computationally efficient rolling window iterators for Python

License: MIT License

Python 100.00%
rolling-windows algorithm rolling-algorithms python iterator sliding-windows efficient-algorithm rolling-hash-functions

rolling's Introduction

rolling

PyPI version

A collection of computationally efficient rolling window iterators for Python.

Useful arithmetical, logical and statistical operations on rolling windows (including Sum, Min, Max, Mean, Median and more). Both fixed-length and variable-length windows are supported for most operations. Many operations also support "indexed" windows.

To get started, see the Overview section below, or have a look at the some recipes.

Installation

pip install rolling

You can also install from source if you want the very latest development changes:

git clone https://github.com/ajcr/rolling.git
cd rolling/
pip install .

There are no external library dependencies for running this module. If you want to run the unit tests, you'll need to install pytest. Once done, just run pytest from the base directory.

Overview

Consider a sequence of integers:

seq = [3, 1, 4, 1, 5, 9, 2]

Suppose we want to find the maximum value in each window of five consecutive integers:

alt tag

One way to do this would be to use Python's max function and apply it to each consecutive slice of five elements:

>>> [max(seq[i:i+5]) for i in range(len(seq) - (5-1))]
[5, 9, 9]

However, as well as being quite verbose, applying builtin functions (like max and sum) to a window becomes increasingly slow as the window size gets bigger. This is because all values in the window are visited each time the function is invoked, and so the complexity is typically linear (i.e. O(k) where k is the size of the window).

It's clear by looking at the picture above that most of the values remain in the window when it is rolled forward. By keeping track of information about the window and the values that are removed and added, an operation such as finding the maximum value can be completed much more efficiently, often in constant time (i.e. O(1), not dependent on the size of the window).

This library implements efficient ways to perform useful operations on rolling windows:

>>> import rolling              # import library
>>> roll = rolling.Max(seq, 5)  # iterator returning maximum of each window of size 5
>>> list(roll)
[5, 9, 9] 

Note that these time complexity values apply to "fixed" and "variable" window types (not the "indexed" window type which depends on the index values encountered).

Operations

The algorithms implemented so far in this module are summarised below.

The cost of updating the window (rolling it forward) and the memory footprint of the rolling object are given, where k denotes the size of the window.

The 'Builtin' column shows the comparable method that is found in the Python standard library. This method could be applied to the window (at higher computational cost) to get the same result. Note that it may not be equivalent in all cases, for example due to differences in floating point arithmetic.

Arithmetical

Rolling objects to apply common aggregation or measurement operations to the window.

Object Update Memory Description Builtin
Sum O(1) O(k) Sum of the window values sum
Product O(1) O(k) Product of the window values math.prod
Nunique O(1) O(k) Number of unique window values N/A
Min O(1) O(k) Minimum value of window min
MinHeap O(log(k)) O(k) Minimum value (internally uses a heap) min
Max O(1) O(k) Maximum value of window max

Statistical

Rolling objects to apply statistical operations to the window.

Object Update Memory Description Builtin
Mean O(1) O(k) Arithmetic mean of window values statistics.mean
Median O(log k) O(k) Median value of window: O(log k) update if 'skiplist' used statistics.median
Mode O(1) O(k) Set of most frequently appearing values in window statistics.multimode
Var O(1) O(k) Variance of window, with specified degrees of freedom statistics.pvariance
Std O(1) O(k) Standard deviation of window, with specified degrees of freedom statistics.pstdev
Skew O(1) O(k) Skewness of the window N/A
Kurtosis O(1) O(k) Kurtosis of the window N/A

Logical

Rolling objects to apply a logical operation to the window.

Object Update Memory Description Builtin
Any O(1) O(1) True if any value in the window is True in a Boolean context, else False any
All O(1) O(1) True if all values in the window are True in a Boolean context, else False all
Monotonic O(1) O(1) True if all values in the window are monotonic increasing or decreasing N/A
Match O(k) O(k) True if window is equal to a specified target sequence (O(k) update if match, else O(1)) N/A

Miscellaneous

Rolling objects implementing other operations.

Object Update Memory Description Builtin
Apply ? O(k) Applies a specified callable object to the window (thus update complexity is dependent on the callable) N/A
Entropy O(1) O(k) Shannon entropy of the window (fixed-size windows only) N/A
JaccardIndex O(1) O(k+s) Jaccard index (similarity coefficient) of window with a target set (s is size of target set) N/A
PolynomialHash O(1) O(k) Polynomial hash of window N/A

By default, fixed length windows are used in all operations. Variable-length windows can be specified using the window_type argument.

This allows windows smaller than the specified size to be evaluated at the beginning and end of the iterable. For instance, here's the Apply operation being used to apply Python's tuple callable to variable-length windows:

>>> seq = [3, 1, 4, 1, 5, 9, 2]
>>> roll_list = rolling.Apply(seq, 3, operation=tuple, window_type='variable')
>>> list(roll_list)
[(3,),
 (3, 1),
 (3, 1, 4),
 (1, 4, 1),
 (4, 1, 5),
 (1, 5, 9),
 (5, 9, 2),
 (9, 2),
 (2,)]

If values are indexed by a monotoncally-increasing index (e.g. with an integer key, timestamp or datetime) then the indexed window type can be used. The size of the window is the maximum distance between the oldest and newest values (e.g. an integer, or timedelta):

>>> idx = [0, 1, 2, 6, 7, 11, 15]
>>> seq = [3, 1, 4, 1, 5,  9,  2]
>>> roll_list_idx = rolling.Apply(zip(idx, seq), window_size=3, operation=tuple, window_type='indexed')
>>> list(roll_list_idx)
[(3,),
 (3, 1),
 (3, 1, 4),
 (1,),
 (1, 5),
 (9,),
 (2,)]

References and resources

Some rolling algorithms are widely known (e.g. Sum), so I am not sure which source to cite. Some algorithms I made up as I was putting the module together (e.g. Any, All), but these are relatively simple and probably exist elsewhere.

Other rolling algorithms are very cleverly designed by others and I learned a lot by reading their implementations. Here are the main resources that I used:

  • Max and Min are implemented using the Ascending Minima and Descending Maxima algorithms described by Richard Harter here. This algorithm is also used in pandas and bottleneck. My attention was first drawn to this algorithm by Jaime Fernandez del Rio's excellent talk The Secret Life Of Rolling Pandas. The algorithm is also described by Keegan Carruthers-Smith here, along with code examples.

  • Median uses the indexable skiplist approach presented by Raymond Hettinger here.

  • Var and Std use Welford's algorithm. I referred to the rolling variance implementation in pandas as well as an older edit of the Wikipedia page Algorithms for calculating variance.

Discussion and future work

The algorithms implemented by this module are chosen to be efficient in the sense that the cost of computing each new window value scales efficiently with the size of window.

In practice you might find that it is quicker not to use the the 'efficient' algorithm, and instead apply a function directly to the window. This is especially true for very small window sizes where the cost of updating a window is relatively complex. For instance, while the window size k is less than approximately 50, it may quicker to use rolling.Apply(array, k, min) (apply Python's builtin minimum function min) rather than using rolling.Min(array, k).

With this in mind, it might be worth implementing some of the algorithms in this module in more specialised/compiled code to improve performance.

rolling's People

Contributors

ac-freeman avatar ajcr avatar daviddavo avatar davidpratt512 avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

rolling's Issues

Rolling statistics of series of two random variables

  • PROD, rolling mean of X * Y: 1/window * sum(x*y)
  • COV, rolling covariance of X, Y, PROD - mean(X) * mean(Y)
  • CORR, rolling correlation of X, Y, COV / (std(X) * std(Y))
  • SLOPE, rolling estimation of slope of regression: y = inteception + slope * x + \epsilon, COV / var(X)
  • INTERCEPT, rolling estimation of intercept of regression: y = inteception + slope * x + \epsilon, ...
  • BETA, rolling estimation of y = \beta * x + \epsilon

Implement rolling mode

The mode is the most common observation in the window.

Need to decide how to handle cases where there are two or more equally common values - statistics module raises an error, while pandas returns each item.

I probably prefer pandas' approach here.

Make rolling Mean more numerically stable

Rolling Mean is currently implemented as the sum of the window divided by the size of the window.

To give better numerical stability when working with large floating point values, it should use the approach taken in Welford's algorithm (cf. the Rolling Var class).

Performance Comparisons

Hi,

thanks for the good effort you have put in this project already.

May I ask for an enhancement of the documentation?

It would be good to have an overview of the (overall) performance of the different functions.

My use case is that I have a numpy array and want to apply a rolling standard deviation window on one of the columns and put it in another column of this numpy array.

It would be good to compare the effort for this with the time it takes to do this for alternatives (like: use pandas directly).

Thanks in advance.

Add a Suffix Tree implementation

Suffix Tree representing window should support O(1) updates (append tail, delete head).

This would allow implementation of various matching/search algorithms.

Rolling window contains function

Essentially a rolling version of Python's in operator. Return True if the window contains a given string, e.g.:

>>> seq = 'rollrollingndsfw'
>>> r = rolling.Contains(seq, window_size=10, match='rolling')
>>> list(r)
[False, True, True, ...]

Could also be extended to match multiple fixed strings.

Summation of float values is not precise enough in the calculation of variance/std

The rolling summation of float values is not precise enough in the calculation of variance/std.

>>> import rolling
>>> list(rolling.Std([0,1,1,1], 3))
[0.5773502691896258, 7.450580596923828e-09]

The first value is correct. The second value 7.450e-09 is small, but should be 0. It is not close to 0 using the default tolerances in math.isclose() for example.

This is related to #20 which caused the variance to drop below 0 (when it should have been 0).

Add type annotations

Methods/classes should have type annotations for easier integration with other codebases.

'expanding' type: allow all iterator classes to be used as online accumulators

The classes should have a window_type='expanding' mode. I.e. the window grows with each new value that is added from the iterator.

There would be no need to keep track of seen values as nothing is removed from the window.

(Later Note): for some algorithms this may not be possible (e.g. median) or worthwhile (e.g. any, all). I need to think further whether it's worth it.

Upload 0.5.0 to pypi

Hello! Latest version on pypi is 0.4.0 from Mar' 23, could you update it to the latest version?

rolling.Std ValueError: math domain error

Environment: OSX 10.15.7, python 3.7.9, rolling=0.2.0

When I run the following code,

values = [
    138,
    136,
    137,
    137,
    135,
    136,
    135,
    135,
    135,
]
std = rolling.Std(values, window_size=3, window_type='variable')
for _ in values:
    next(std)

I got an error.

.venv/lib/python3.7/site-packages/rolling/stats.py", line 166, in current_value
    return sqrt(self._sslm / (self._obs - self.ddof))
ValueError: math domain error

This is because self._sslm is a negative value.
The value of sslm changed as follows.

0.0
2.0
2.0
0.6666666666666572
2.6666666666666288
1.9999999999999147
0.6666666666664867
0.6666666666664867
-2.2737367544323206e-13

Add ddof parameter for rolling variance and rolling standard deviation

ddof means 'delta degrees of freedom' (cf. NumPy).

Currently the code implicitly assumes ddof=1 (i.e. k - 1 degrees of freedom, sample variance). This should be the default, but the user should be able to set ddof with a keyword argument when a rolling iterator is instantiated if they want to.

Exponential weighted moving window statistics

It seems that pandas has not provided iterators for rolling/ewm functions, and your project is really nice structured. Maybe another base class is needed for iterators of ewm statistics.

median implementation is slow

TLDR; don't use skip lists

I've found that a sorted list implemented with standard list + bisect module will be faster for tracking median than rolling's implementation up to about N < 50,000. For example, for N of 10,000 it's 4x faster. Resources explaining why list + bisect are unbeatable at these sizes: http://www.grantjenks.com/docs/sortedcontainers/implementation.html, http://www.grantjenks.com/docs/sortedcontainers/performance-scale.html

Of course, rolling would want to perform well for N > 50,000 too. So use sortedcontainers.SortedList. Even though it doesn't beat standard list + bisect until about N of 20,000, it's still faster than rolling's implementation for all sizes. For example, for N of 100,000 using SortedList is 2x faster.

Add optional dependency for SortedContainers library

Rolling median uses a basic SortedList implementation (binary search/insert on a Python list).

The SortedList implementation in sortedcontainers is more advanced and will scale better for larger window sizes.

This should be an added as an optional dependency (e.g. pip install rollling[extras]) and rolling median should use the third party sortedcontainers implementation if it's available:

try:
    from sortedcontainers import SortedList
except Import Error:
    from rolling.structures.sorted_list import SortedList

Some minor work is needed to make the SortedList method names compatible with the calls in the rolling median implementation (or vice versa).

Handling of NaN

Short question, is it somehow possible to extend this to handle NaN, like numpy nanmedian?

Ability to add data instead of having to pass generators

It would be very convenient if there was a method such as append() to add series in arbitrary manner instead of having to use generators. In many cases, converting existing code to form generators is time consuming and can become complex task in multithreaded environment.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.