This repository provides AWS CDK scripts and sample code that show how to count unique items (e.g., unique visitors) with HyperLogLog in Amazon MemoryDB for Redis.
HyperLogLog (HLL) is a probabilistic data structure that estimates the cardinality of a set. As a probabilistic data structure, HyperLogLog trades perfect accuracy for efficient space utilization.
Counting unique items usually requires an amount of memory proportional to the number of items you want to count, because you need to remember the elements you have already seen in order to avoid counting them multiple times. However, a set of algorithms exists that trades precision for memory: they return an estimate with a standard error, which, in the Redis implementation of HyperLogLog, is less than 1%. The magic of this algorithm is that you no longer need an amount of memory proportional to the number of items counted; instead, you use a constant amount of memory.
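To make the constant-memory idea concrete, here is a minimal, illustrative HyperLogLog in pure Python. It is a sketch for intuition only and is not the Redis implementation: the class name `TinyHyperLogLog`, the SHA-1 hash, and the default of 4096 registers are assumptions chosen for readability.

```python
import hashlib
import math


class TinyHyperLogLog:
    """Illustrative HyperLogLog; memory use depends on p, not on the item count."""

    def __init__(self, p=12):
        self.p = p                      # number of hash bits used as register index
        self.m = 1 << p                 # number of registers (4096 when p=12)
        self.registers = [0] * self.m

    def add(self, item):
        # Take a 64-bit hash of the item (SHA-1 is an arbitrary choice here).
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                    # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)       # bias constant, valid for m >= 128
        z = sum(2.0 ** -r for r in self.registers)
        estimate = alpha * self.m * self.m / z
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            estimate = self.m * math.log(self.m / zeros)
        return round(estimate)


hll = TinyHyperLogLog()
for i in range(50000):
    hll.add(f"visitor-{i}")
print(hll.count())  # an estimate close to 50000, using only 4096 small registers
```

Adding the same visitor twice never changes a register, which is why duplicates do not inflate the estimate. With `m` registers the expected standard error is roughly `1.04 / sqrt(m)`, so more registers buy more accuracy at a fixed, predictable memory cost.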
In this project, we will count unique visitors with HyperLogLog in Amazon MemoryDB for Redis.
The diagram below shows what we are implementing.
The cdk.json file tells the CDK Toolkit how to execute your app.
This project is set up like a standard Python project. The initialization process also creates a virtualenv within this project, stored under the .venv directory. To create the virtualenv it assumes that there is a python3 (or python for Windows) executable in your path with access to the venv package. If for any reason the automatic creation of the virtualenv fails, you can create the virtualenv manually.
To manually create a virtualenv on macOS and Linux:
$ python3 -m venv .venv
After the init process completes and the virtualenv is created, you can use the following step to activate your virtualenv.
$ source .venv/bin/activate
If you are on Windows, activate the virtualenv like this:
% .venv\Scripts\activate.bat
Once the virtualenv is activated, you can install the required dependencies.
(.venv) $ pip install -r requirements.txt
Upload library packages for AWS Lambda Layer to S3
The AWS Lambda function uses the redis-py-cluster Python package to access Amazon MemoryDB for Redis.
In this project, we will create an AWS Lambda Layer that contains the redis-py-cluster Python package.
So, we need to upload the library package to S3 for the AWS Lambda Layer.
You can create the Python package by running the following commands:
$ cat <<EOF > requirements.txt
> redis-py-cluster==2.1.3
> EOF
$ docker run -v "$PWD":/var/task "public.ecr.aws/sam/build-python3.11" /bin/sh -c "pip install -r requirements.txt -t python/lib/python3.11/site-packages/; exit"
$ zip -r redis-py-cluster-lib.zip python > /dev/null
$ aws s3 mb s3://my-bucket-for-lambda-layer-packages
$ aws s3 cp redis-py-cluster-lib.zip s3://my-bucket-for-lambda-layer-packages/var/
ℹ️ How to create a Lambda layer using a simulated Lambda environment with Docker
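Before uploading, it can help to confirm that the archive has the layout Lambda expects for Python layers, where everything sits under a top-level python/ directory. The following is a small sketch of such a check; the function name `layer_zip_problems` is hypothetical and not part of this repository.

```python
import zipfile


def layer_zip_problems(zip_path):
    """Return a list of layout problems found in a Python Lambda layer archive."""
    problems = []
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
    if not names:
        problems.append("archive is empty")
    # Layers are extracted under /opt; for Python, packages are picked up from
    # python/ and python/lib/python3.x/site-packages/ inside the archive.
    outside = [name for name in names if not name.startswith("python/")]
    if outside:
        problems.append("entries outside the top-level python/ directory: "
                        + ", ".join(sorted(outside)[:3]))
    return problems
```

For example, `layer_zip_problems("redis-py-cluster-lib.zip")` should return an empty list for an archive built with the commands above.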
Set up cdk.context.json
Then, before deploying the CloudFormation stacks, set up the CDK context configuration file, cdk.context.json, properly.
For example:
{
  "kinesis_stream_name": "demo-kds",
  "s3_bucket_lambda_layer_lib": "lambda-layer-resources",
  "memorydb_cluster_name": "demo-memdb"
}
The s3_bucket_lambda_layer_lib option should be set to the name of the S3 bucket that contains the Python packages to be registered as an AWS Lambda Layer.
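A quick sanity check of the context file can catch mistakes such as a trailing comma, which strict JSON parsers reject, before they surface as a confusing deployment error. The sketch below assumes that all three keys from the example are required; the function name `parse_cdk_context` is hypothetical.

```python
import json

# Assumption: these are the context keys the stacks read, based on the example.
REQUIRED_KEYS = (
    "kinesis_stream_name",
    "s3_bucket_lambda_layer_lib",
    "memorydb_cluster_name",
)


def parse_cdk_context(text):
    """Parse cdk.context.json content and check that the expected keys are set."""
    try:
        context = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"cdk.context.json is not valid JSON: {exc}") from exc
    if not isinstance(context, dict):
        raise ValueError("cdk.context.json must contain a JSON object")
    missing = [key for key in REQUIRED_KEYS if not context.get(key)]
    if missing:
        raise ValueError("missing or empty context keys: " + ", ".join(missing))
    return context
```

A typical use would be `parse_cdk_context(open("cdk.context.json").read())` run from the project root before `cdk deploy`.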
Bootstrap AWS environment for AWS CDK app
Also, before any AWS CDK app can be deployed, you have to bootstrap your AWS environment to create certain AWS resources that the AWS CDK CLI (Command Line Interface) uses to deploy your AWS CDK app.
Run the cdk bootstrap command to bootstrap the AWS environment.
(.venv) $ cdk bootstrap
At this point you can now synthesize the CloudFormation template for this code.
Let's check all CDK stacks with the cdk list command.
(.venv) $ export CDK_DEFAULT_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
(.venv) $ export CDK_DEFAULT_REGION=$(curl -s 169.254.169.254/latest/dynamic/instance-identity/document | jq -r .region)
(.venv) $ cdk list
MemoryDBVPCStack
MemoryDBAclStack
MemoryDBStack
KinesisDataStreamsStack
LambdaLayersStack
LambdaFunctionStack
BastionHostStack
Then, synthesize the CloudFormation template for this code:
(.venv) $ cdk synth --all
To add additional dependencies, for example other CDK libraries, just add them to your setup.py file and rerun the pip install -r requirements.txt command.
You are now ready to run the cdk deploy command to build the stack shown above.
(.venv) $ cdk deploy --all
- Generate test data.
$ BASTION_HOST_ID=$(aws cloudformation describe-stacks --stack-name BastionHostStack \
  | jq -r '.Stacks[0].Outputs | .[] | select(.OutputKey | endswith("EC2InstanceId")) |.OutputValue')
$ aws ec2-instance-connect ssh --instance-id ${BASTION_HOST_ID} --os-user ec2-user
[ec2-user@ip-172-31-7-186 ~]$ ls -1
gen_fake_data.py
redis-6.2.14
redis-6.2.14.tar.gz
[ec2-user@ip-172-31-7-186 ~]$ python3 gen_fake_data.py --service-name kinesis --stream-name your-kinesis-data-stream-name --max-count 100
- Check Amazon MemoryDB for Redis 5~10 minutes later, and you will see data.
ℹ️ You can find the Amazon MemoryDB for Redis endpoint by running the following command:
aws cloudformation describe-stacks --stack-name MemoryDBStack \
  | jq -r '.Stacks[0].Outputs | .[] | select(.OutputKey | endswith("MemoryDBClusterEndpoint")) |.OutputValue'
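If you prefer to do the same lookup from Python, the jq filter above can be approximated with a few lines that walk the describe-stacks response. This is a sketch: the helper name `find_stack_output` is hypothetical, and unlike the jq filter it returns only the first matching output.

```python
def find_stack_output(describe_stacks_response, key_suffix):
    """Return the value of the first stack output whose key ends with key_suffix."""
    outputs = describe_stacks_response["Stacks"][0].get("Outputs", [])
    for output in outputs:
        if output["OutputKey"].endswith(key_suffix):
            return output["OutputValue"]
    return None  # no output key matched the suffix
```

The input has the same shape as the JSON printed by `aws cloudformation describe-stacks`, so the response from boto3's CloudFormation `describe_stacks(StackName="MemoryDBStack")` call can be passed in directly, for example `find_stack_output(response, "MemoryDBClusterEndpoint")`.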
Let's check the items in Amazon MemoryDB for Redis.
ℹ️ The user name and password of Amazon MemoryDB are stored in AWS Secrets Manager under a name such as MemoryDBSecret-xxxxxxxxxxxx.
$ BASTION_HOST_ID=$(aws cloudformation describe-stacks --stack-name BastionHostStack \
  | jq -r '.Stacks[0].Outputs | .[] | select(.OutputKey | endswith("EC2InstanceId")) |.OutputValue')
$ aws ec2-instance-connect ssh --instance-id ${BASTION_HOST_ID} --os-user ec2-user
[ec2-user@ip-172-31-7-186 ~]$ ls -1
gen_fake_data.py
redis-6.2.14
redis-6.2.14.tar.gz
[ec2-user@ip-172-31-7-186 ~]$ redis-cli -c --tls -h clustercfg.demo-memdb.81kuqj.memorydb.us-east-1.amazonaws.com -p 6379 --user user-name -a your-password
Run the pfcount and ttl Redis commands to find the unique visitor count. For example:
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
demo-memdb-0002-001.demo-memdb.81kuqj.memorydb.us-east-1.amazonaws.com:6379> keys *
1) "uv:site_id=715:20240309"
2) "uv:site_id=283:20240309"
demo-memdb-0002-001.demo-memdb.81kuqj.memorydb.us-east-1.amazonaws.com:6379> pfcount uv:site_id=715:20240309
(integer) 20
demo-memdb-0002-001.demo-memdb.81kuqj.memorydb.us-east-1.amazonaws.com:6379> pfcount uv:site_id=283:20240309
(integer) 23
demo-memdb-0002-001.demo-memdb.81kuqj.memorydb.us-east-1.amazonaws.com:6379> ttl uv:site_id=283:20240309
(integer) 82421
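The keys in this output follow the pattern uv:site_id=<site id>:<yyyymmdd>, and the TTL is a little under one day. As a rough sketch of how a writer could produce such keys, the code below uses the `pfadd` and `expire` methods that redis-py style clients provide. The function names, the `user_id` argument, and the one-day TTL are assumptions for illustration and may differ from the Lambda function in this repository.

```python
from datetime import datetime, timezone

ONE_DAY_SECONDS = 86400  # assumed TTL; the sample output shows a TTL just under one day


def build_uv_key(site_id, event_time):
    """Build a per-site, per-day key such as uv:site_id=715:20240309."""
    return f"uv:site_id={site_id}:{event_time:%Y%m%d}"


def record_visit(client, site_id, user_id, event_time):
    """Add one visitor to the day's HyperLogLog and refresh the key's TTL.

    client is any object exposing redis-py style pfadd() and expire() methods,
    for example a cluster client connected to MemoryDB.
    """
    key = build_uv_key(site_id, event_time)
    client.pfadd(key, user_id)            # duplicates do not change the estimate
    client.expire(key, ONE_DAY_SECONDS)   # assumption: refresh the TTL on every write
    return key


print(build_uv_key(715, datetime(2024, 3, 9, tzinfo=timezone.utc)))
# uv:site_id=715:20240309
```

Keeping one key per site and day means a single `pfcount` call answers "how many unique visitors today", and the expiry keeps old daily counters from accumulating.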
Delete the CloudFormation stacks by running the following command.
(.venv) $ cdk destroy --all
- cdk ls: list all stacks in the app
- cdk synth: emits the synthesized CloudFormation template
- cdk deploy: deploy this stack to your default AWS account/region
- cdk diff: compare deployed stack with current state
- cdk docs: open CDK documentation
Enjoy!
- Redis engine versions - Amazon MemoryDB for Redis
- Connect to a MemoryDB cluster (Linux)
redis-cli -c -h {Primary or Configuration Endpoint} --tls -p 6379 --user {user_name} -a {password}
- How to create a Lambda layer using a simulated Lambda environment with Docker
- Amazon MemoryDB for Redis Immersion Day
- Amazon ElastiCache for Redis Immersion Day
- Redis Commands
- HyperLogLog in Redis
- Probabilistic Data Structures in Redis
- Bloom filters, Cuckoo filters, Count-Min Sketch, Top-K, HyperLogLog
- Using HyperLogLog sketches in Amazon Redshift