
amazon-cloudfront-access-logs-queries's Introduction

Analyzing your Amazon CloudFront access logs at scale

This is a sample implementation for the concepts described in the AWS blog post Analyze your Amazon CloudFront access logs at scale using AWS CloudFormation, Amazon Athena, AWS Glue, AWS Lambda, and Amazon Simple Storage Service (S3).

This application is available in the AWS Serverless Application Repository. You can deploy it to your account from there:

[Launch Stack button]

Overview

The application has two main parts:

  • An S3 bucket <ResourcePrefix>-<AccountId>-cf-access-logs that serves as the log bucket for Amazon CloudFront access logs. As soon as Amazon CloudFront delivers a new access log file, an S3 event triggers the AWS Lambda function moveAccessLogs, which moves the file to an Apache Hive-style prefix (see the sketch after this overview).

    [infrastructure overview diagram]

  • An hourly scheduled AWS Lambda function transformPartition that runs an INSERT INTO query on a single partition per run, taking one hour of data into account. It writes the content of the partition in Apache Parquet format to the <ResourcePrefix>-<AccountId>-cf-access-logs S3 bucket.

    [infrastructure overview diagram]
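
The repository's Lambda sources are the reference; the following is only a minimal Node.js sketch of the renaming idea behind moveAccessLogs, assuming CloudFront's standard log file naming (<distribution-id>.YYYY-MM-DD-HH.<unique-id>.gz) and the default prefixes:

// Minimal sketch only (not the repository's actual moveAccessLogs.js): derive a
// Hive-style partition prefix from the CloudFront log file name and move the object.
// The regular expression and prefixes below are assumptions based on the defaults.
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

// e.g. new/EXXXXXXXXXXXXX.2022-01-20-07.2bd0b06.gz
//  ->  partitioned-gz/year=2022/month=01/day=20/hour=07/EXXXXXXXXXXXXX.2022-01-20-07.2bd0b06.gz
function partitionedKey(key) {
  const file = key.split('/').pop();
  const match = file.match(/\.(\d{4})-(\d{2})-(\d{2})-(\d{2})\./);
  if (!match) return null;
  const [, year, month, day, hour] = match;
  return `partitioned-gz/year=${year}/month=${month}/day=${day}/hour=${hour}/${file}`;
}

exports.handler = async (event) => {
  for (const record of event.Records) { // S3 ObjectCreated notifications for the new/ prefix
    const bucket = record.s3.bucket.name;
    const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, ' '));
    const target = partitionedKey(key);
    if (!target) continue;
    await s3.copyObject({ Bucket: bucket, CopySource: `${bucket}/${key}`, Key: target }).promise();
    await s3.deleteObject({ Bucket: bucket, Key: key }).promise();
  }
};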

FAQs

Q: How can I get started?

Use the Launch Stack button above to start the deployment of the application to your account. The AWS Management Console will guide you through the process. You can override the following parameters during deployment:

  • The NewKeyPrefix (default: new/) is the S3 prefix that is used in the configuration of your Amazon CloudFront distribution for log storage. The AWS Lambda function will move the files from here.
  • The GzKeyPrefix (default: partitioned-gz/) and ParquetKeyPrefix (default: partitioned-parquet/) are the S3 prefixes for partitions that contain gzip or Apache Parquet files.
  • ResourcePrefix (default: myapp) is a prefix that is used for the S3 bucket and the AWS Glue database to prevent naming collisions.

The stack contains a single S3 bucket called <ResourcePrefix>-<AccountId>-cf-access-logs. After the deployment you can modify your existing Amazon CloudFront distribution configuration to deliver access logs to this bucket with the new/ log prefix.
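
If you would rather make this change programmatically than in the console, the following is a hedged sketch using the AWS SDK for JavaScript v2; the distribution ID and account ID are placeholders, and the bucket name assumes the default ResourcePrefix (myapp):

// Hedged sketch (not part of this repository): enable standard logging on an
// existing distribution so that logs land in the stack's bucket under new/.
// The distribution ID and account ID below are placeholders.
const AWS = require('aws-sdk');
const cloudfront = new AWS.CloudFront();

async function pointLogsAtStackBucket(distributionId) {
  // updateDistribution requires the full current config plus its ETag.
  const { DistributionConfig, ETag } = await cloudfront
    .getDistributionConfig({ Id: distributionId })
    .promise();

  DistributionConfig.Logging = {
    Enabled: true,
    IncludeCookies: false,
    Bucket: 'myapp-123456789012-cf-access-logs.s3.amazonaws.com', // <ResourcePrefix>-<AccountId>-cf-access-logs
    Prefix: 'new/'                                                // must match NewKeyPrefix
  };

  return cloudfront
    .updateDistribution({ Id: distributionId, IfMatch: ETag, DistributionConfig })
    .promise();
}

pointLogsAtStackBucket('EXXXXXXXXXXXXX').catch(console.error);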

As soon as Amazon CloudFront delivers new access logs, the files are moved to GzKeyPrefix. After 1-2 hours, they are transformed to files in ParquetKeyPrefix.

You can query your access logs at any time in the Amazon Athena Query editor using the AWS Glue view called combined in the database called <ResourcePrefix>_cf_access_logs_db:

SELECT * FROM cf_access_logs.combined limit 10;
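
Outside the query editor, the view can also be queried programmatically. This is only a sketch using the AWS SDK for JavaScript v2 Athena client; the database name assumes the default ResourcePrefix (myapp), and the output location is a placeholder:

// Sketch only: run a query against the combined view and wait for the result.
// Database and OutputLocation are assumptions based on the default parameters.
const AWS = require('aws-sdk');
const athena = new AWS.Athena();

async function queryCombinedView() {
  const { QueryExecutionId } = await athena.startQueryExecution({
    QueryString: 'SELECT * FROM combined LIMIT 10',
    QueryExecutionContext: { Database: 'myapp_cf_access_logs_db' },
    ResultConfiguration: { OutputLocation: 's3://myapp-123456789012-cf-access-logs/athena-query-results/' }
  }).promise();

  // Poll until the query leaves the QUEUED/RUNNING states.
  let state = 'QUEUED';
  while (state === 'QUEUED' || state === 'RUNNING') {
    await new Promise((resolve) => setTimeout(resolve, 1000));
    const { QueryExecution } = await athena.getQueryExecution({ QueryExecutionId }).promise();
    state = QueryExecution.Status.State;
  }
  if (state !== 'SUCCEEDED') throw new Error(`Query finished in state ${state}`);

  const { ResultSet } = await athena.getQueryResults({ QueryExecutionId }).promise();
  return ResultSet.Rows;
}

queryCombinedView().then((rows) => console.log(rows)).catch(console.error);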

Q: How can I customize and deploy the template?

  1. Fork this GitHub repository.

  2. Clone the forked GitHub repository to your local machine.

  3. Modify the templates.

  4. Install the AWS CLI & AWS Serverless Application Model (SAM) CLI.

  5. Validate your template:

    $ sam validate -t template.yaml
  6. Package the files for deployment with SAM (see the SAM docs for details) to a bucket of your choice. The bucket must be in the region you want to deploy the sample application to:

    $ sam package \
        --template-file template.yaml \
        --output-template-file packaged.yaml \
        --s3-bucket <BUCKET>
  7. Deploy the packaged application to your account:

    $ aws cloudformation deploy \
        --template-file packaged.yaml \
        --stack-name my-stack \
        --capabilities CAPABILITY_IAM

Q: How can I use the sample application for multiple Amazon CloudFront distributions?

If your data does not need to be partitioned by Amazon CloudFront distribution, you can use the same bucket and path (new/) for more than one distribution and query the data by the host column. If you need to speed up the Parquet transformation (it must stay under the 15-minute Lambda limit) or reduce query duration, deploy another AWS CloudFormation stack from the same template for each distribution. The stack name is added to all resource names (e.g. AWS Lambda functions, the S3 bucket) so you can distinguish the stacks in the AWS Management Console.

Q: In which region can I deploy the sample application?

The Launch Stack button above opens the AWS Serverless Application Repository in the US East (N. Virginia) region. You can switch to other regions from there before deployment.

Q: How can I add a new question to this list?

If you found yourself wishing this set of frequently asked questions had an answer for a particular problem, please submit a pull request. The chances are good that others will also benefit from having the answer listed here.

Q: How can I contribute?

See the Contributing Guidelines for details.

License Summary

This sample code is made available under a modified MIT license. See the LICENSE file.

amazon-cloudfront-access-logs-queries's People

Contributors

mokocm, mpitt, steffeng, titanjer, wedneyyuri, yemartin, zxkane


amazon-cloudfront-access-logs-queries's Issues

Expire unused objects in S3

There are unused objects in the bucket that we can remove safely.

  • athena-query-results - This folder stores the Athena query results generated when we transform the CSV files into Parquet files. We don't have a reason to keep these results for more than one day.
  • partitioned-gz - This folder stores the raw gzip files generated by CloudFront. Since we store this data in Parquet, I can't find a reason to keep the raw data forever.

I would like to hear the opinion from other developers before opening a pull request.

CloudFrontAccessLogsBucket:
  Type: 'AWS::S3::Bucket'
  Properties:
    LifecycleConfiguration:
      Rules:
        - Id: ExpireAthenaQueryResults
          Prefix: athena-query-results/
          Status: Enabled
          ExpirationInDays: 1
        - Id: ExpirePartitionedGz
          Prefix: !Ref GzKeyPrefix
          Status: Enabled
          ExpirationInDays: 180
          Transitions:
            - TransitionInDays: 7
              StorageClass: GLACIER

Multiple Amazon CloudFront distributions pointing to the same bucket

Hi @steffeng ! And first, thanks a lot for this. I have had my eyes on deploying this for a while, I am glad to finally have the opportunity!

I have yet another question about using this for multiple distributions (FYI we have about 15 now, going to have about 30).

I checked #8 and #12, but this is different: I don't care about partitioning the data by distribution (like in #8), or about grouping the entire S3 data structure under subfolders (like in #12).

According to the FAQ:

Q: How can I use the sample application for multiple Amazon CloudFront distributions?

Deploy another AWS CloudFormation stack

What about pointing multiple CF distributions to the same bucket's new/? Skimming the code, I did not see any obvious reason why it would be an issue, but the FAQ answer above makes it sound like the only way is to deploy another stack...

What am I missing? Will it work, or will it break something?

support hostheader as a partition to allow multiple sites to be handled by 1 bucket/stack

Since the host header isn't in the log file name, it would need to be part of the log bucket's prefix configured in CloudFront to keep the lambda stateless when converting new log objects to the partition format.

We have lots of sites, and would like to avoid the overhead of setting up a new stack for each one, and also have the flexibility to query multiple sites at once while still maintaining the cost/performance benefits of using partitions.

Trigger createPartition / transformPartition lambda for historical data

Hi,

I have implemented this stack and it is working well. However, I dropped my old CloudFront logs into the new/ directory and they were moved into the partitioned-gz/ directory as expected. I am unsure of the best way to trigger the createPartition/transformPartition Lambdas to process them into the partitioned-parquet/ directory. Only new data is being transformed into Parquet format because those Lambdas work based on the current date/time.

Any ideas welcome!

Combine View with Day Partitioned Table

Currently, I maintain a day-partitioned table into which anything older than a day is inserted. Now I would like to combine the view with the day-partitioned table.

Here is what I am doing.

SELECT *, "$path" as file FROM ${database}.${partitioned_gz_table}
WHERE (concat(year, month, day, hour) >= date_format(date_trunc('hour', ((current_timestamp - INTERVAL '15' MINUTE) - INTERVAL '1' HOUR)), '%Y%m%d%H'))
UNION ALL
SELECT *, "$path" as file FROM ${database}.${partitioned_parquet_table}
WHERE (concat(year, month, day, hour) < date_format(date_trunc('hour', ((current_timestamp - INTERVAL '15' MINUTE) - INTERVAL '1' HOUR)), '%Y%m%d%H'))
UNION ALL
SELECT *, "$path" as file FROM ${database}.${day_partitioned_parquet_table}
WHERE (concat(year, month, day) < date_format(date_trunc('day', ((current_timestamp - INTERVAL '15' MINUTE) - INTERVAL '1' DAY)), '%Y%m%d'))

I am not sure if my query is right for what I am trying to achieve. What is the best way to achieve this?

Parameter for existing bucket

It would be nice to be able to deploy this solution on top of an existing bucket, where a bucket-name parameter would drive the logical switch between creating a new bucket and using an existing one.

Upgrading to the latest version?

What is the best way to upgrade to the latest version? I originally deployed it using the deploy button in the readme but I'd like to upgrade to the node18 version without losing any data.

Using Management Console to Place Log Files Under Root Bucket Assist

I have multiple CloudFront distributions that I would like to use this for and would like to avoid having many Root S3 Buckets.

Is it possible to change a parameter so that all CloudFront logs go to a given bucket, for example a root bucket named "cf-logs"? Is it possible to make this change using the Management Console so that "cf-logs" is the root bucket and "myapp1" is a folder inside that bucket? For example:

/cf-logs/myapp1

I attempted to make this change by manipulating the template.yaml file on the following lines:

6 Instances of:
arn:${AWS::Partition}:s3:::${ResourcePrefix}-${AWS::AccountId}-cf-access-logs

2 instances of:
s3://${ResourcePrefix}-${AWS::AccountId}-cf-access-logs/athena-query-results

and 1 instance of:
BucketName: !Sub "${ResourcePrefix}-${AWS::AccountId}-cf-access-logs"

But unfortunately I'm getting errors when I make these changes.

Is there a way to do this using the Management Console so that I can have the log files go underneath a root bucket?

Thank you for your assistance. This repo is really great and does exactly what I need it to!

field name change between database stack deploys

I created 2 CloudFormation stacks for this a few years ago, and the schema fields that represent multiple words, like requestip and hostheader, are single words as typed here.

In December 2019, the fields were updated to use snake_case to conform to the Amazon CloudFront specs:
c76192a

I created a 3rd stack for new product sites after this, and the schemas don't match. I have a service that runs queries to find bot crawl numbers, and I'm going to have to come up with a way to construct the WHERE clauses differently based on which database is accessed.

Is there a way to safely modify the older 2 stacks to the new snake_case field names?

Is it possible that the log files are moved to incorrect partition?

Since the access log files are delivered to S3 asynchronously, a log file E271AZ5HG504X.2022-01-20-07.2bd0b06.gz may contain access log entries starting from 2022-01-20 08:00:00. If the log file is moved to the partition year=2022/month=01/day=20/hour=07, is it possible that a SQL query with the WHERE clause year='2022' and month='01' and day='20' and hour='08' may miss this part of the data?

Lambda Function moveAccessLogs.js Throwing Error - Logs aren't in Expected Format

I set this up by launching the Stack as described.

When I look at the S3 buckets, comparing the old CloudFront log files to the new CloudFront log files, they look the same. What I mean by "look the same" is that both are in the format distribution-ID.YYYY-MM-DD-HH.unique-ID.gz.

But when I look at the log streams going into CloudWatch I see this format:
2020/10/12/7063690b6a7f417293f736568409fc3e

In Lambda I noticed that "MoveNewAccessLogsFn" is not working. Below is an error I get when I attempt to test:

START RequestId: c16a8df6-2390-4533-aeb9-53e364cf3b1a Version: $LATEST
2020-10-12T22:00:20.488Z c16a8df6-2390-4533-aeb9-53e364cf3b1a ERROR Invoke Error {"errorType":"TypeError","errorMessage":"Cannot read property 'map' of undefined","stack":["TypeError: Cannot read property 'map' of undefined"," at Runtime.exports.handler (/var/task/moveAccessLogs.js:18:31)"," at Runtime.handleOnce (/var/runtime/Runtime.js:66:25)"]}
END RequestId: c16a8df6-2390-4533-aeb9-53e364cf3b1a
REPORT RequestId: c16a8df6-2390-4533-aeb9-53e364cf3b1a Duration: 39.03 ms Billed Duration: 100 ms Memory Size: 128 MB Max Memory Used: 83 MB Init Duration: 618.96 ms

Am I missing some permissions somewhere? Why am I not seeing the log files under a year/month/day structure in S3?

Thanks in advance for your help.
