Load and process Segment logs from S3.
# Install GNU parallel (macOS via Homebrew) and create an isolated Python env.
brew install parallel
python -m venv .venv
source .venv/bin/activate
pip install pyspark

# Configure AWS credentials under a dedicated profile and sync the raw logs.
aws configure --profile lightspeed-telemetry
export AWS_PROFILE=lightspeed-telemetry
export AWS_S3_URL=s3://host/prefix
aws s3 sync "$AWS_S3_URL" data/raw

# Decompress every log file and flatten to one JSON object per line.
# -print0 / -0 keeps file paths with spaces or newlines intact.
find data/raw -type f -print0 | parallel -0 --bar 'gzcat {} | jq -c "."' > data/all.jsonl

# Start the interactive PySpark shell for the analysis snippets below.
pyspark
# Raise Spark's field-truncation limit so the wide nested Segment schema
# prints in full when shown or debugged.
spark.conf.set("spark.sql.debug.maxToStringFields", 1000)
df_all = spark.read.json("./data/all.jsonl")

from pyspark.sql.functions import col

# One alias per side of the self-join (the original chained .alias() twice;
# only the last alias takes effect, so a single call is equivalent).
# NOTE(review): no event filter on this side — completions currently spans
# every event type; confirm the completion event name if one is wanted.
completions = df_all.alias("completions")
feedback = df_all.filter("event == 'inlineSuggestionFeedback'").alias("feedback")

# Rank prompts by feedback volume: self-join on suggestionId, keep rows
# with action == 0 (assumed to mark an accepted suggestion — TODO confirm),
# then count occurrences per prompt, most frequent first.
top = completions.join(
        feedback,
        col("completions.properties.suggestionId") == col("feedback.properties.suggestionId"),
    ) \
    .where(col("feedback.properties.action") == 0) \
    .groupBy(col("completions.properties.request.prompt")) \
    .count() \
    .orderBy("count", ascending=False)
top.show()
from pyspark.sql.functions import col

# Single alias is sufficient (the original chained .alias("completions")
# twice; only the last call takes effect, so this is equivalent).
completions = df_all.alias("completions")

# Look up the prompt that produced one specific suggestion by its id.
completion = completions.where(col("properties.suggestionId") == "9807b2af-0c26-4653-b0c6-97e090e14c82") \
    .select("properties.request.prompt")
completion.show()
from pyspark.sql.functions import col

# Single alias is sufficient (the original chained .alias("completions")
# twice; only the last call takes effect, so this is equivalent).
completions = df_all.alias("completions")

# Inspect the full inferred schema, then drill into the request subtree.
completions.printSchema()
request = completions.select("properties.request")
request.printSchema()
# Alternative: run the same analysis from the bundled Jupyter notebook.
# Replace the <bucket>/<path>/<profile name> placeholders before running.
# NOTE(review): the notebook invocation sets AWS_S3_URI while the shell
# steps earlier export AWS_S3_URL — confirm which name load.ipynb expects.
AWS_S3_URI=s3://<bucket>/segment-logs/<path> \
AWS_PROFILE=<profile name> \
jupyter notebook load.ipynb

# Development setup: fresh virtualenv, dev dependencies, then linters via tox.
python -m venv .venv
source .venv/bin/activate
pip3 install -r requirements-dev.txt
tox -e linters