Produce basic website analytics from server or CDN request logs.
Sieve is a tool for compiling basic website analytics from server request logs rather than using front-end libraries and third-party services. It has two parts: an application for computing the metrics, and a basic web server for viewing the metrics.
When we say "basic", we really mean it! The initial metrics are:
- Visitors based on IP address,
- Referrals based on the `Referer` HTTP header, and
- Top Pages based on the `uri-stem` (i.e., path) of the HTTP request.
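Each of these boils down to counting values of a single log field. As a rough sketch only (not how Sieve actually computes anything), counting unique visitors from downloaded CloudFront logs could look like the pipeline below; the log path and the assumption that `c-ip` is the fifth tab-separated field come from CloudFront's default log format, so treat both as placeholders:

```
# Rough sketch: count distinct client IPs across downloaded CloudFront logs.
# "path/to/logs" and the field number (5 = c-ip) are assumptions, not Sieve's code.
gunzip -c path/to/logs/*.gz | grep -v '^#' | cut -f5 | sort -u | wc -l
```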
Screenshot of the displayed analytics:
Google Analytics (GA) is too complex for me to understand. It's overkill for all of my basic websites. Additionally, there are general privacy concerns with GA.
For static content websites, I just want to know which pages are popular and who is referring traffic. Sieve does this by compiling server request logs into basic metrics.
Who produces the server request logs? In my case, it's AWS CloudFront, but the approach that Sieve takes should be adaptable to any log format. I host my static files on AWS S3 and then front the site with CloudFront. CloudFront writes logs for each request that it receives, and then forwards the request to the S3 bucket to fetch the resource. (In my case, I turn off CloudFront caching, but that shouldn't impact the completeness of the CloudFront request logs.)
So, if you have an S3 site that is fronted by CloudFront, and the CloudFront distribution has access logs enabled, you can use Sieve right out of the box!
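For reference, CloudFront's standard access logs are gzipped, tab-delimited files that begin with a header listing the fields. The excerpt below is abridged and based on the default format, so check your own log files for the exact list; the fields that presumably feed Sieve's three metrics are `c-ip`, `cs(Referer)`, and `cs-uri-stem`:

```
#Version: 1.0
#Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status cs(Referer) cs(User-Agent) ...
```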
- Download and install a JVM.
- Download and install Scala.
- Install the AWS CLI.
- Install Node.
- Download or clone the Sieve Git repository.
- Run:
$ ./make-jar.sh
to create `app/target/sieve.jar`.
The basic usage is to periodically synchronize (i.e., download) your CloudFront server request logs from S3 to your local machine, and then compute the metrics from the logs. After that, you can view the metrics in your web browser.
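The "synchronize" step presumably amounts to an AWS CLI copy along these lines (a sketch only; `compute.sh` performs the actual sync, and the bucket, prefix, and directory names here are placeholders):

```
# Sketch of the sync step: pull new/changed CloudFront log files down from S3.
# Bucket, prefix, and local directory are placeholders; compute.sh does the real work.
aws s3 sync s3://YOUR-LOGS-BUCKET/YOUR-CLOUDFRONT-PREFIX/ local-logs-dir/
```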
To compute analytics, you need a few pieces of information:
$ ./compute.sh <S3-bucket+path> <local-logs-dir> <outfile.json>
Description:
- `S3-bucket+path` - the S3 bucket name where your CloudFront logs are written, followed by the S3 path prefix that your specific CloudFront distribution uses.
- `local-logs-dir` - the local directory where your CloudFront logs will be copied. By default, Sieve only syncs the most recent 3 months of logs from CloudFront.
- `outfile.json` - the JSON file that Sieve will create with the computed analytics. Generally, you should specify a path that writes this file to `web/data/` so that it will be automatically picked up by the server.
Here's an example command line you might use for a website named `my-site.com`:
$ ./compute.sh my-logs-bucket/cloudfront-logs/my-site/ data/my-site/ web/data/my-site.json
Note that you can save that command into a file named `run-my-site.sh` in the repository, and the `.gitignore` file will exclude it. This allows you to keep a separate `run-X.sh` file for each website that you track.
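For example, `run-my-site.sh` could be nothing more than a thin wrapper around the command above (the paths are the same placeholders used in that example):

```
#!/bin/sh
# run-my-site.sh: recompute analytics for my-site.com (placeholder paths from the example above).
./compute.sh my-logs-bucket/cloudfront-logs/my-site/ data/my-site/ web/data/my-site.json
```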
The server loads the `index.html` page, which fetches the file at `web/data/sites.json`. This file is ignored in Git, so you can add an entry for your own `my-site.json` file (computed above).
The format of this file is very basic:
{
  "sites": [
    {"id": "my-site", "name": "My Special Site"},
    {"id": "other-site", "name": "My Other Site"}
  ]
}
The `id` field must match the filename (ignoring the filetype suffix) of your computed `.json` file. You only need to add your site entry to this file once; as long as future computations of your data update the same file, the server will serve the latest version.
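To make the naming rule concrete, a `web/data/` directory matching the `sites.json` above would look something like this (illustrative only):

```
web/data/
  sites.json        # the site list shown above
  my-site.json      # matches "id": "my-site"
  other-site.json   # matches "id": "other-site"
```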
To view the computed analytics for any site with a `.json` file in `web/data/`, run the static server with:
$ ./server.sh
and then navigate to http://localhost:8080/.
You should see a Visitors graph and other metrics. You can toggle between your sites with the dropdown at the top left.
As mentioned already, these metrics are basic. Furthermore, there are some issues, including:
- Visitors are only identified by IP address, but multiple users behind a NAT will share the same IP address.
- Support for excluding certain traffic is limited to string-based IP prefixes, and only the Google crawler bot traffic is blocked by default.
- The "Referrals" and "Top Pages" tables might be noisy. For example, the site itself is usually the top referer. Some entries in these tables should ideally be filtered out.
- ... and probably many more.
Thanks for checking it out.