
bitmapist's Introduction

bitmapist

NEW! Try out our new standalone bitmapist-server, which improves memory efficiency 443 times and makes your setup much cheaper to run (and more scalable). It's fully compatible with bitmapist that runs on Redis.

bitmapist: a powerful analytics library for Redis

This Python library makes it possible to implement real-time, highly scalable analytics that can answer the following questions:

  • Has user 123 been online today? This week? This month?
  • Has user 123 performed action "X"?
  • How many users have been active this month? This hour?
  • How many unique users have performed action "X" this week?
  • What percentage of users that were active last week are still active?
  • What percentage of users that were active last month are still active this month?
  • What users performed action "X"?

This library is very easy to use and enables you to create your own reports easily.

Using Redis bitmaps you can store events for millions of users in a very small amount of memory (megabytes). Be careful with huge ids, as these can require much larger amounts of memory. Ids should be in the range [0, 2^32).
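
To make the memory warning concrete, here is a rough sketch of how bitmap size grows with the largest id. It assumes the usual Redis behaviour of growing a bitmap up to its highest set bit; the helper name bitmap_bytes is made up for illustration and is not part of bitmapist.

```python
def bitmap_bytes(max_id):
    """Bytes needed for a bitmap whose highest set bit is at offset max_id."""
    if max_id < 0:
        raise ValueError("ids must be non-negative")
    # Offsets 0..max_id need max_id + 1 bits, rounded up to whole bytes.
    return (max_id + 1 + 7) // 8


# Approximate size of a single bitmap key for a few "largest id" values.
for max_id in (1_000_000, 100_000_000, 2**32 - 1):
    size_mib = bitmap_bytes(max_id) / 2**20
    print(f"largest id {max_id:>13,}: about {size_mib:.2f} MiB per key")
```

Under this estimate a single event key whose largest id is near 2^32 is about 512 MiB, which is why small, dense ids matter.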

Additionally, bitmapist can generate cohort graphs that answer the following:

  • Cohort over user retention
  • What percentage of users that were active in the last [days, weeks, months] are still active?
  • What percentage of users that performed action X also performed action Y (and this over time)
  • And a lot of other things!

If you want to read more about bitmaps please read following:

Installation

Can be installed very easily via:

$ pip install bitmapist

Ports

Examples

Setting things up:

from datetime import datetime, timedelta
from bitmapist import setup_redis, delete_all_events, mark_event,\
                      MonthEvents, WeekEvents, DayEvents, HourEvents,\
                      BitOpAnd, BitOpOr

now = datetime.utcnow()
last_month = datetime.utcnow() - timedelta(days=30)

Mark user 123 as active and as having played a song:

mark_event('active', 123)
mark_event('song:played', 123)

Answer if user 123 has been active this month:

assert 123 in MonthEvents('active', now.year, now.month)
assert 123 in MonthEvents('song:played', now.year, now.month)
assert MonthEvents('active', now.year, now.month).has_events_marked() == True

How many users have been active this week?:

print(len(WeekEvents('active', now.year, now.isocalendar()[1])))
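
One detail worth checking when passing a week number explicitly: near a year boundary, the ISO week returned by isocalendar() can belong to a different ISO year than now.year. The sketch below only illustrates the standard-library behaviour. Whether to pass the ISO year or the calendar year to WeekEvents depends on how the library builds its keys, which is an assumption to verify. The helper name iso_year_and_week is made up for illustration.

```python
from datetime import datetime


def iso_year_and_week(dt):
    """Return the (ISO year, ISO week) pair for a datetime."""
    iso_year, iso_week, _weekday = dt.isocalendar()
    return iso_year, iso_week


dt = datetime(2021, 1, 1)
# The calendar year is 2021, but this date falls in ISO week 53 of 2020.
print(dt.year, iso_year_and_week(dt))
```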

Iterate over all users active this week:

for uid in WeekEvents('active'):
    print(uid)

If you're interested in "current events", you can omit extra now.whatever arguments. Events will be populated with current time automatically.

For example, these two calls are equivalent:

MonthEvents('active') == MonthEvents('active', now.year, now.month)

Additionally, for the sake of uniformity, you can create an event from any datetime object with a from_date static method.

MonthEvents('active').from_date(now) == MonthEvents('active', now.year, now.month)

Get the list of these users (user ids):

print(list(WeekEvents('active', now.year, now.isocalendar()[1])))

There are special methods prev and next returning "sibling" events and allowing you to walk through events in time without any sophisticated iterators. A delta method allows you to "jump" forward or backward for more than one step. Uniform API allows you to use all types of base events (from hour to year) with the same code.

current_month = MonthEvents('active')
prev_month = current_month.prev()
next_month = current_month.next()
year_ago = current_month.delta(-12)

Every event object has period_start and period_end methods to find a time span of the event. This can be useful for caching values when the caching of "events in future" is not desirable:

ev = MonthEvents('active', dt.year, dt.month)
if ev.period_end() < now:
    cache.set('active_users_<...>', len(ev))

Hourly tracking is disabled by default (to save memory). To enable it by default:

import bitmapist
bitmapist.TRACK_HOURLY = True

Additionally, you can supply an extra argument to mark_event to override the default value:

mark_event('active', 123, track_hourly=False)

Unique events

Sometimes the date of the event makes little or no sense, for example, to filter out your premium accounts, or in A/B testing. There is a UniqueEvents model for this purpose. The model creates only one Redis key and doesn't depend on the date.

You can combine unique events with other types of events.

A/B testing example:

from bitmapist import DayEvents, UniqueEvents

active_today = DayEvents('active')
a = UniqueEvents('signup_form:classic')
b = UniqueEvents('signup_form:new')

print("Active users, signed up with classic form", len(active_today & a))
print("Active users, signed up with new form", len(active_today & b))

Generic filter example

def premium_up(uid):
    # called when user promoted to premium
    ...
    mark_unique('premium', uid)


def premium_down(uid):
    # called when user loses the premium status
    ...
    unmark_unique('premium', uid)

active_today = DayEvents('active')
premium = UniqueEvents('premium')

# Add extra Karma for all premium users active today,
# just because today is a special day
for uid in premium & active_today:
    add_extra_karma(uid)

To get the best of both worlds, you can mark a unique event and a regular bitmapist event at the same time.

def premium_up(uid):
    # called when user promoted to premium
    ...
    mark_event('premium', uid, track_unique=True)

Perform bit operations

How many users that have been active last month are still active this month?

active_2_months = BitOpAnd(
    MonthEvents('active', last_month.year, last_month.month),
    MonthEvents('active', now.year, now.month)
)
print(len(active_2_months))

# Is 123 active for 2 months?
assert 123 in active_2_months
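
The same AND-then-count idea gives a retention percentage. The sketch below models bitmaps with plain Python integers so it runs without Redis; to_bitmap and retention_pct are illustrative helpers, not bitmapist functions. With bitmapist the two counts would come from len() on the previous-month event and on the AND result.

```python
def to_bitmap(user_ids):
    """Build an integer bitmap with bit `uid` set for every user id."""
    bitmap = 0
    for uid in user_ids:
        bitmap |= 1 << uid
    return bitmap


def retention_pct(previous, current):
    """Percentage of users in `previous` that are also in `current`."""
    base = bin(previous).count("1")
    if base == 0:
        return 0.0  # nobody was active before, so nothing to retain
    retained = bin(previous & current).count("1")
    return 100.0 * retained / base


last_month_active = to_bitmap([1, 2, 3, 123])
this_month_active = to_bitmap([2, 123, 500])
print(retention_pct(last_month_active, this_month_active))  # 50.0
```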

Alternatively, you can use standard Python syntax for bitwise operations.

last_month_event = MonthEvents('active', last_month.year, last_month.month)
this_month_event = MonthEvents('active', now.year, now.month)
active_two_months = last_month_event & this_month_event

The operators &, |, ^ and ~ are supported.

Work with nested bit operations (imagine what you can do with this ;-))!

active_2_months = BitOpAnd(
    BitOpAnd(
        MonthEvents('active', last_month.year, last_month.month),
        MonthEvents('active', now.year, now.month)
    ),
    MonthEvents('active', now.year, now.month)
)
print(len(active_2_months))
assert 123 in active_2_months

# Delete the temporary AND operation
active_2_months.delete()

Deleting

If you want to permanently remove marked events for any time period you can use the delete() method:

last_month_event = MonthEvents('active', last_month.year, last_month.month)
last_month_event.delete()

If you want to remove all bitmapist events use:

bitmapist.delete_all_events()

When using bit operations (e.g. BitOpAnd) you can (and probably should) delete the results unless you want them cached. There are different ways to go about this:

active_2_months = BitOpAnd(
    MonthEvents('active', last_month.year, last_month.month),
    MonthEvents('active', now.year, now.month)
)
# Delete the temporary AND operation
active_2_months.delete()

# delete all bit operations created in runtime up to this point
bitmapist.delete_runtime_bitop_keys()

# delete all bit operations (slow if you have many millions of keys in Redis)
bitmapist.delete_temporary_bitop_keys()

bitmapist cohort

With bitmapist cohort you can get a form and a table rendering of the data you keep in bitmapist. If this sounds confusing please look at Mixpanel.

Here's a simple example of how to generate a form and a rendering of the data you have inside bitmapist:

from bitmapist import cohort

html_form = cohort.render_html_form(
    action_url='/_Cohort',
    selections1=[ ('Are Active', 'user:active'), ],
    selections2=[ ('Task completed', 'task:complete'), ]
)
print(html_form)

dates_data = cohort.get_dates_data(select1='user:active',
                                   select2='task:complete',
                                   time_group='days')

html_data = cohort.render_html_data(dates_data,
                                    time_group='days')

print(html_data)

# All the arguments should come from the FORM element (html_form)
# but to make things more clear I have filled them in directly

This will render something similar to this:

bitmapist cohort screenshot

Contributing

Please see our guide here

Local Development

We use Poetry for dependency management & packaging. Please see here for setup instructions.

Once you have Poetry installed, you can run the following to install the dependencies in a virtual environment:

poetry install

Testing

To run our tests, you will need a local Redis server installed.

We use pytest for unit tests, which you can run through Poetry with:

poetry run pytest

Releasing new versions

  • Bump version in pyproject.toml
  • Update the CHANGELOG
  • Commit the changes with a commit message "Version X.X.X"
  • Tag the current commit with vX.X.X
  • Create a new release on GitHub named vX.X.X
  • GitHub Actions will publish the new version to PyPI for you

Legal

Copyright: 2012 by Doist Ltd.

License: BSD

bitmapist's People

Contributors

adamchainz, amix, daremon, dcramer, deorus, dependabot[bot], dotsbb, frankv, goncalossilva, imankulov, proxi, tartansandal, timgates42, yurtaev

bitmapist's Issues

Error running cohort demo

Running

from bitmapist import mark_event
from bitmapist import cohort as bitmapist_cohort

mark_event('active', 123)
mark_event('song:add', 123)
mark_event('song:play', 123)

html_form = bitmapist_cohort.render_html_form(
    action_url='/_Cohort',
    selections1=[ ('Are Active', 'active'), ],
    selections2=[ ('Played song', 'song:play'), ],
    time_group='days',
    select1='active',
    select2='song:play'
)

dates_data = bitmapist_cohort.get_dates_data('active','song:play', 'days','default')

html_data = bitmapist_cohort.render_html_data(dates_data, 'days')

I get:

  File "/usr/local/lib/python2.7/dist-packages/bitmapist/cohort/__init__.py", line 191, in get_dates_data
    for i in range(0, date_range):
UnboundLocalError: local variable 'date_range' referenced before assignment

Calling `delete_all_events` when db is empty throws exception

This happens when using bitmapist-server as the backend. Possibly with redis too.

def delete_all_events(system='default'):
    """
    Delete all events from the database.
    """
    cli = get_redis(system)
    keys = cli.keys('trackist_*')  # <- None
    if len(keys) > 0:
        cli.delete(*keys)
  File "../lib/python3.7/site-packages/bitmapist/__init__.py", line 272, in delete_all_events
    if len(keys) > 0:
TypeError: object of type 'NoneType' has no len()
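
One possible guard, sketched under the assumption that the backend may return None instead of an empty list from keys(). The helper delete_keys_matching and the FakeClient stand-in are hypothetical and only illustrate the idea; this is not the library's actual fix.

```python
def delete_keys_matching(cli, pattern="trackist_*"):
    """Delete all keys matching `pattern`; tolerate a None result from keys()."""
    keys = cli.keys(pattern) or []  # treat None the same as "no keys"
    if keys:
        cli.delete(*keys)
    return len(keys)


class FakeClient:
    """Minimal stand-in for a Redis client, used only for this demonstration."""

    def __init__(self, keys_result):
        self.keys_result = keys_result
        self.deleted = []

    def keys(self, pattern):
        return self.keys_result

    def delete(self, *keys):
        self.deleted.extend(keys)


print(delete_keys_matching(FakeClient(None)))             # 0, no exception
print(delete_keys_matching(FakeClient([b"trackist_a"])))  # 1
```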

UniqueEvents query is very slow

I'm trying to store user likes in Redis, using one bitmap per question_id to record which users liked that question. However, iterating over unique events appears to be very slow for this.

In [1]: from bitmapist import mark_unique
In [2]: mark_unique("question_likes:1234", 567463)
In [3]: mark_unique("question_likes:1234", 5637363)
In [4]: mark_unique("question_likes:1234", 7363)
In [5]: mark_unique("question_likes:1234", 731263)
In [6]: mark_unique("question_likes:1234", 731263)
In [7]: from bitmapist import UniqueEvents
In [38]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:uids = UniqueEvents('question_likes:1234').get_uuids()
:t = time()
:for i in uids:
:    print i
:print "elapsed time:", time() - t
:--
7363
567463
731263
5637363
elapsed time: 14.1893291473

I'm running the latest bitmapist against Redis server v=4.0.9 on Linux Mint.

Here is the DEBUG OBJECT output for the key:

127.0.0.1:6379> DEBUG OBJECT trackist_question_likes:1234_u
Value at:0x7f979b6a51e0 refcount:1 encoding:raw serializedlength:8036 lru:12900027 lru_seconds_idle:252

Total Events

Is there currently a way to determine the total number of occurrences of a given event?

Would there be a way to achieve this within the scope of Weeks|Months|YearsEvents()?

should have a method to return an iterable of all set values

I use something like

    def yield_values(self):
        cli = self.redis_client
        s = cli.get(self.redis_key)
        for c in s:
            bits = bin(ord(c))[2:]
            bits = '00000000'[len(bits):] + bits
            for i, b in enumerate(bits):
                v = int(b)
                if v:
                    yield i

    def get_values(self):
        return list(self.yield_values())

on yipit's class based version
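
For comparison, a Python 3 sketch of the same idea that works directly on the bytes value returned by a GET. Note that the snippet above yields the bit position within each byte, whereas this version adds the byte offset so it yields the full bit offset (the user id). It assumes Redis bit ordering, where offset 0 is the most significant bit of the first byte. The name iter_set_bits is illustrative.

```python
def iter_set_bits(raw):
    """Yield the offsets of all set bits in a Redis-style bitmap (bytes)."""
    for byte_index, byte in enumerate(raw):
        if byte == 0:
            continue  # skip empty bytes quickly
        for bit in range(8):
            # Offset 0 within a byte is its most significant bit (0x80).
            if byte & (0x80 >> bit):
                yield byte_index * 8 + bit


print(list(iter_set_bits(bytes([0b10100000, 0b00000001]))))  # [0, 2, 15]
```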

A Compliment and Quandary

A co-worker of mine showed me this library. I think this is an excellent use of bitmaps! I think I see another interesting extension of this bitmapist library and I wanted to get your feedback.

I work for a company that needs an almost real-time recommendation engine. I started thinking about the application of bitmaps into collaborative filtering. Basically in collaborative filtering you need a matrix that is comprised of [users * products]

The idea in this instance would basically be to have a SETBIT users:20160614:1 [product_id] 1 for each user that is representative of what product they like. You would also need to have a SETBIT products:20160614:1 [user_id] 1 this would have an index into each user.

Wait I think I see a problem with this. How would we represent an empty space in the matrix? A cooccurrence matrix has 3 states (Like, Dislike, Unknown). I guess you could probably combine them in some way or maybe my bit operations are rusty.

The main benefits in this would be the storage savings that this could have. When I did a quick calculation based on traffic a site like ours would see in a day it would be about (700 products * 300k users) 26.25MB

I understand that you are probably not interested in having this be part of this repo. I'd mainly like your feedback and advice. Thanks!

Behavior of NOT operator on empty bitmaps

Hello,

I noticed a little quirk on NOT operators for empty bitmaps.

Say for example that a bitmap represents a byte of 0's (0000 0000).
When this byte is negated, it should give back (1111 1111) or a value of 255.
Instead, I am getting back another 0.

On non-empty bitmaps, the NOT operator seems to work fine, only up to the highest flagged bit.

Currently, if some bitmaps are empty, I have to manually flag/mark an event with a large dummy ID in order to make the NOT operator work.

Are there better ways of accomplishing this?

Consider removing "BitOpNot" to avoid misusage.

TLDR;

This is not really a "code bug", but rather a potential misuse issue with the "BitOpNot" operator. "BitOpNot" does exactly what it is supposed to do (which is flipping the bits). However, since we are using bitmapist as a stats tool, we expect it to give us the negation of a set relative to the whole population, which "BitOpNot" does not provide.

Example

Suppose you have a total of 100 users. You use bitmapist to mark active users with the event "active". Assume you have 5 active users and the bitmap data looks like this:

1011011

Now you want to count how many active users you have:

print(len(MonthEvents('active', now.year, now.month)))
# prints 5, the correct number

Now suppose you want to know how many inactive users you have:

print(len(BitOpNot(MonthEvents('active', now.year, now.month))))
# prints 2

Here the negation of the "active user set" gives a size of 2 instead of 95.

The fundamental problem is that the variable length bitmap data only contain information about "who are active", but it does not contain information about the population size.
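
A small sketch of the point above, using Python integers as stand-in bitmaps: negation only gives the expected count once it is restricted to an explicit population. The helper names are illustrative. With bitmapist the analogous approach would presumably be to keep a bitmap of all known users and AND it with the negated set, though that is an assumption rather than something the library documents here.

```python
def popcount(bitmap):
    """Number of set bits in an integer bitmap."""
    return bin(bitmap).count("1")


def count_inactive(active_bitmap, population_size):
    """Count ids in [0, population_size) that are NOT set in active_bitmap."""
    population_mask = (1 << population_size) - 1
    return popcount(~active_bitmap & population_mask)


active = int("1011011", 2)          # 5 active users, as in the example
print(popcount(active))             # 5
print(count_inactive(active, 7))    # 2, negating only the stored bits
print(count_inactive(active, 100))  # 95, negating within the population
```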

Populating the data

How do I populate data into bitmapist so that I can play with your queries?

Do you have a utility to populate domain- or application-specific data, run a set of queries like the ones you listed, and gauge how fast they run?

That will be helpful.

Krishna

ResponseError: bit offset is not an integer or out of range

>>> myid
10204510554222024
>>> mark_event('active', myid)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/home/sardor/.virtualenvs/tarjimonlar/local/lib/python2.7/site-packages/bitmapist/__init__.py", line 173, in mark_event
    client.execute() 
  File "/home/sardor/.virtualenvs/tarjimonlar/local/lib/python2.7/site-packages/redis/client.py", line 2578, in execute
    return execute(conn, stack, raise_on_error)
  File "/home/sardor/.virtualenvs/tarjimonlar/local/lib/python2.7/site-packages/redis/client.py", line 2492, in _execute_transaction
    self.raise_first_error(commands, response)
  File "/home/sardor/.virtualenvs/tarjimonlar/local/lib/python2.7/site-packages/redis/client.py", line 2526, in raise_first_error
    raise r
ResponseError: Command # 1 (SETBIT trackist_active_2015-3 10204510554222024 1) of pipeline caused error: bit offset is not an integer or out of range
>>> 

Can user IDs only be integers?

Apparently the user ID is used as the bit offset, so whenever I mark an event, a 1 is set at the offset given by the user ID. Does this mean user IDs can never be strings?

Solution for large ids by adding O(1) lookup

In your documentation it reads:

Using Redis bitmaps you can store events for millions of users in a very little amount of memory (megabytes). You should be careful about using huge ids (e.g. 2^32 or bigger) as this could require larger amounts of memory.

It might be a potential solution to create a hash table that keeps track of huge ids and maps them back down to smaller indexes.

For example, a user with an id of 192329230202 could be mapped to a smaller index 1 in the bitmap. This would require an O(1) lookup before a `SETBIT`, so it shouldn't affect time performance, but it would require more space on disk.

Steps to implement:

  1. A new user performs a signup event.
  2. Find the users_counter for the current day with GET "users_counter:20160614", which would respond with something like 2.
  3. Add the new User(id:192329230202) to the users_index table and reassign it to User(internal_id:3). This would do something like SET "users_index:20160614:192329230202" 3
  4. Once this has completed successfully, run INCR users_counter:20160614
  5. Run SETBIT "events:search:20160614" 3 1 to set the signup bit at the User(3) index instead of at the end of the bitmap, which would require creating and storing empty bits.

This would most likely add complexity to how you query the data, and you would have to store a reference to look up each user. In my proposal I used a different lookup per day to reset the bitmap indexes every day, but this might be more trouble than just maintaining one large ongoing table.
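
The mapping idea can be sketched in a few lines. This version uses an in-memory dict as a stand-in for the Redis lookup table and counter described in the steps, and hands out indexes starting at 0. DenseIdIndex is a hypothetical name, and a real deployment would need the lookup-and-assign step to be atomic.

```python
class DenseIdIndex:
    """Map arbitrary (possibly huge) external ids to small dense indexes."""

    def __init__(self):
        self._index_by_id = {}

    def index_for(self, external_id):
        """Return the dense index for external_id, assigning one if new."""
        index = self._index_by_id.get(external_id)
        if index is None:
            # The next free index equals the number of ids seen so far.
            index = len(self._index_by_id)
            self._index_by_id[external_id] = index
        return index


ids = DenseIdIndex()
print(ids.index_for(192329230202))       # 0
print(ids.index_for(10204510554222024))  # 1
print(ids.index_for(192329230202))       # 0 again, the mapping is stable
```

The dense index, rather than the original id, would then be used as the bit offset when marking events.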

Let me know your thoughts and whether this is something you are interested in promoting into an enhancement.

add mako to setup.py

ImportError: No module named mako.lookup

Your cohort package requires mako, yet it's not listed in the install_requires.

Storing extra data

Is there a more efficient way to store extra data for scenarios like 'user x replied question y correctly|falsely in z seconds'?

I think implementations such as

mark_event('question:y:x:1/0', z)

would be neither effective nor useful for queries.

Some examples on this would be very helpful.
