
Senior Data Engineer

The take-home coding challenge for a Senior Data Engineer should assess the candidate's knowledge of various Data Engineering concepts as well as their programming ability.

Criteria

This test was developed and tracked in the Jira Ticket DA-5786.

It should be completable within 7 days of receipt, assuming the candidate spends a few hours working on it at most.

We will be providing this test and its test data to the candidate as a zip file since the data being used should be relatively small in size. This also lets the candidate develop in whatever way they find comfortable.

Ideally, the candidate should upload their solution to some kind of Git repository, be it public or private, on GitHub, GitLab, Bitbucket, etc.

Attempts can be made in either the Python or Scala programming languages as these are languages used within the team. The submitted code must compile/run and should have at least some level of testing, ideally Unit Tests.

The Challenge

Build an ETL application which processes the provided customer, products and transactions datasets from the local file system and lands the processed datasets back on a file system at a different location.

The application should support the following -

  • Data Cleanup: Remove or quarantine rows that don't adhere to the expected data structure or constraints. Make practical assumptions regarding data structure and constraints.
  • Anonymisation: When given a customer id or email address, the solution should be able to hash the personally identifiable information of that customer, rendering it anonymous within the processed data without compromising the overall shape of the data. Anonymisation requests come in the dataset erasure-requests.json.gz, described below.
  • Operational statistics: Log operational statistics on each raw dataset being processed; a minimal logging sketch follows this list.
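
As one illustration of the operational statistics requirement, a per-dataset summary could be accumulated while processing and emitted with Python's standard logging module. The class and field names below are illustrative, not part of the challenge:

import logging
from dataclasses import dataclass

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("etl")

@dataclass
class DatasetStats:
    dataset: str
    rows_read: int = 0
    rows_valid: int = 0
    rows_quarantined: int = 0

    def log(self) -> None:
        # Emit a single summary line per raw dataset processed.
        logger.info(
            "dataset=%s read=%d valid=%d quarantined=%d",
            self.dataset, self.rows_read, self.rows_valid, self.rows_quarantined,
        )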

Datasets

The datasets include a simple customer dataset, a products dataset, a transactions dataset, and an erasures dataset.

All the datasets come in the format of a single row per line. Each row is in JSON format and the overall files are gzipped.
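
As an illustration, a minimal sketch of reading one of these files is shown below; the helper name read_jsonl_gz is illustrative, and error handling for malformed lines is left out for brevity:

import gzip
import json
from pathlib import Path

def read_jsonl_gz(path: Path):
    """Yield one parsed JSON object per line of a gzipped JSON-lines file."""
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            yield json.loads(line)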

Data will be arriving into the system at multiple intervals per day, as it lands from raw data sources, so the solution should be able to process this data as it arrives.

As this is a coding challenge, the data has been placed into a folder structure to simulate this arrival pattern that looks like the following:

date=2022-01-23/
    hour=0/
        customers.json.gz
        products.json.gz
        transactions.json.gz
        erasure-requests.json.gz
    hour=1/
        customers.json.gz
        transactions.json.gz

Put plainly, there are top-level folders for each date, which contain subfolders for each hour of that day, and within these hourly folders are the individual datasets. Not every hour contains every dataset, as explained in the individual dataset descriptions below.

Note: These paths are using the standard UNIX conventions for directories. Any solution provided should work in such an environment.
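
One way to enumerate the hourly folders under this layout is sketched below; it assumes the data root is a local directory and orders partitions by date and then hour:

from pathlib import Path

def iter_hourly_partitions(data_root: Path):
    """Yield (date, hour, directory) for every hour=* folder under a date=* folder."""
    partitions = []
    for hour_dir in data_root.glob("date=*/hour=*"):
        # Folder names follow the date=YYYY-MM-DD/hour=H convention shown above.
        date = hour_dir.parent.name.split("=", 1)[1]
        hour = int(hour_dir.name.split("=", 1)[1])
        partitions.append((date, hour, hour_dir))
    return sorted(partitions)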

Customer Dataset

The customer dataset is: customers.json.gz

An example entry, formatted for readability, looks like this:

{
  "id": "347984",
  "first_name": "Georgia",
  "last_name": "Lewis",
  "date_of_birth": "2009-09-27",
  "email": "[email protected]",
  "phone_number": "01632 960 972",
  "address": "Studio 99\nMorley tunnel",
  "city": "Alana Ville",
  "country": "United Kingdom",
  "postcode": "E09 9TW",
  "last_change": "2020-03-12",
  "segment": "sports"
}

The following constraints should be true for each entry:

  • The following fields should be populated:
    • id
    • first_name
    • last_name
    • email
  • id should be unique within the whole dataset

Customer data arrives throughout the day alongside transaction data.
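
A sketch of how these customer constraints might be checked is shown below; the quarantine policy (keeping the first occurrence of a duplicate id) is an assumption, not part of the challenge:

REQUIRED_CUSTOMER_FIELDS = ("id", "first_name", "last_name", "email")

def validate_customers(rows):
    """Split customer rows into (valid, quarantined) using the stated constraints."""
    valid, quarantined = [], []
    seen_ids = set()
    for row in rows:
        # Required fields must be present and non-empty; id must be unique.
        missing = [f for f in REQUIRED_CUSTOMER_FIELDS if not row.get(f)]
        if missing or row["id"] in seen_ids:
            quarantined.append(row)
            continue
        seen_ids.add(row["id"])
        valid.append(row)
    return valid, quarantined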

Products Dataset

The products dataset is: products.json.gz

An example entry, formatted for readability, looks like this:

{
  "sku": 23822,
  "name": "PHidyNvZH",
  "price": "25.00",
  "category": "vitamin",
  "popularity": 0.746141024720593
}

The following constraints should hold true for each entry:

  • All fields should be populated
  • popularity should always be a value above 0
  • sku should be unique within the whole dataset
  • price should resolve to a positive amount of currency

Product data arrives at the start of each day before customer or transactions data arrives. It can be seen as updates to the product database.
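
A corresponding sketch for the product constraints, using Decimal so that price strings such as "25.00" are checked as currency amounts; the caller is assumed to add each accepted sku to seen_skus:

from decimal import Decimal, InvalidOperation

PRODUCT_FIELDS = ("sku", "name", "price", "category", "popularity")

def is_valid_product(row: dict, seen_skus: set) -> bool:
    """Return True if the product row satisfies the stated constraints."""
    # All fields populated.
    if any(row.get(f) in (None, "") for f in PRODUCT_FIELDS):
        return False
    # sku unique within the dataset.
    if row["sku"] in seen_skus:
        return False
    # popularity strictly above zero.
    if not isinstance(row["popularity"], (int, float)) or row["popularity"] <= 0:
        return False
    # price must resolve to a positive currency amount.
    try:
        return Decimal(str(row["price"])) > 0
    except InvalidOperation:
        return False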

Transactions Dataset

The transactions dataset is: transactions.json.gz

An example formatted entry looks like this:

{
  "transaction_id": "6a8bb2c0-02f5-467a-8c83-6bb9a8b192b1",
  "transaction_time": "2022-07-01T16:05:08.618160",
  "customer_id": "325795",
  "delivery_address": {
    "address": "275 Nicole fall",
    "postcode": "E90 2FT",
    "city": "Maria Ville",
    "country": "United Kingdom"
  },
  "purchases": {
    "products": [
      {
        "sku": 71227,
        "quanitity": 1,
        "price": "30.98",
        "total": "30.98"
      }
    ],
    "total_cost": "30.98"
  }
}

The following constraints should be true for every entry:

  • The transaction_id field should be unique amongst all transactions
  • The customer_id should refer back to an existing customer in the customer dataset
  • The product sku entries should correspond to existing products in the product dataset
  • total_cost should match the total amount that all products purchased total up to

Transactions data arrives alongside customer data at multiple intervals during the day. For the sake of this challenge, you can consider it arriving once an hour. The transaction_time fields within each dataset may not align with the date or time they arrive on.
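
The total_cost check benefits from exact decimal arithmetic. A sketch of verifying a single transaction is shown below; known_customers and known_skus are assumed to be sets built from the customer and product datasets, and total_cost is assumed to equal the sum of the per-product total fields:

from decimal import Decimal

def is_consistent_transaction(txn: dict, known_customers: set, known_skus: set) -> bool:
    """Check referential integrity and that total_cost matches the product totals."""
    # customer_id must refer to an existing customer.
    if txn["customer_id"] not in known_customers:
        return False
    products = txn["purchases"]["products"]
    # Every sku must exist in the product dataset.
    if any(p["sku"] not in known_skus for p in products):
        return False
    # total_cost must equal the sum of the per-product totals.
    computed = sum(Decimal(p["total"]) for p in products)
    return computed == Decimal(txn["purchases"]["total_cost"])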

Erasure Dataset

The erasure dataset is: erasure-requests.json.gz

An example formatted entry looks like this:

{
  "customer-id": "325795",
  "email": "[email protected]"
}

The following constraints apply to every entry:

  • At least one of the two fields must be populated

Unlike other datasets that arrive multiple times per day, erasure requests are collected over the course of a day and arrive at the start of the next day.
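
A sketch of the anonymisation step is shown below. SHA-256 is an assumed choice of hash (the challenge does not mandate an algorithm), and PII_FIELDS is an assumed list of the customer fields to mask; each value is replaced with a hex digest so the row keeps its overall shape:

import hashlib

# Assumed set of personally identifiable fields from the customer dataset.
PII_FIELDS = ("first_name", "last_name", "email", "phone_number",
              "address", "city", "postcode")

def hash_value(value: str) -> str:
    """Replace a PII value with a deterministic SHA-256 hex digest."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

def anonymise_customer(row: dict) -> dict:
    """Return a copy of a customer row with its PII fields hashed."""
    masked = dict(row)
    for field in PII_FIELDS:
        if masked.get(field):
            masked[field] = hash_value(str(masked[field]))
    return masked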

Note

The project should satisfy the following non-functional requirements:

  • There should be a clearly documented way of demonstrating the project in the README.
  • The project should work on common Linux distributions and/or OSX.
  • If the project requires external platform dependencies, they should be available as a Docker container so the project can be easily tested.
  • The project should have test cases testing all core workflows.
  • Should support processing the data as it arrives. How this can be achieved using the application developed should be documented in the README.
  • It should follow best practices.

There are no restrictions on technologies used beyond using Python or Scala as the primary programming language for the project.

Solution

The solution writes processed data under a processed-data folder. To run it directly with Python:

python3 app.py

A Dockerfile is also included. To run the application with Docker, use the following commands:

docker build -t app .
docker run --rm app

This will create the processed-data folder inside the Docker container.
