
Senior Data Engineer

The take-home coding challenge for a Senior Data Engineer should assess the candidate's knowledge of various Data Engineering concepts as well as their programming ability.

Criteria

This test was developed and tracked in the Jira Ticket DA-5786.

It should be completable within 7 days of receipt, assuming the candidate spends a few hours working on it at most.

We will be providing this test and its test data to the candidate as a zip file since the data being used should be relatively small in size. This also lets the candidate develop in whatever way they find comfortable.

Ideally, the candidate should upload their solution to some kind of Git repository, be it public or private, on GitHub, GitLab, Bitbucket, etc.

Attempts can be made in either the Python or Scala programming languages as these are languages used within the team. The submitted code must compile/run and should have at least some level of testing, ideally Unit Tests.

The Challenge

Build an ETL application which processes the provided customer, products and transactions datasets from the local file system and lands the processed datasets back on a file system at a different location.

The application should support the following -

  • Data Cleanup: Remove or quarantine rows that don't adhere to the expected data structure or constraints. Make practical assumptions regarding data structure and constraints.
  • Anonymisation: When given a customer id or email address, the solution should be able to hash the personally identifiable information of that customer, rendering it anonymous within the processed data without compromising the overall shape of the data. Anonymisation requests come in the dataset erasure-requests.json.gz, described below.
  • Operational statistics: Log operational statistics on each raw dataset being processed; a minimal logging sketch follows this list.
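
As one illustration of the operational statistics requirement, a per-dataset summary could be accumulated while processing and emitted with Python's standard logging module. The class and field names below are illustrative, not part of the challenge:

import logging
from dataclasses import dataclass

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("etl")

@dataclass
class DatasetStats:
    dataset: str
    rows_read: int = 0
    rows_valid: int = 0
    rows_quarantined: int = 0

    def log(self) -> None:
        # Emit a single summary line per raw dataset processed.
        logger.info(
            "dataset=%s read=%d valid=%d quarantined=%d",
            self.dataset, self.rows_read, self.rows_valid, self.rows_quarantined,
        )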

Datasets

The datasets include a simple customer dataset, a products dataset, a transactions dataset, and an erasures dataset.

All the datasets come in the format of a single row per line. Each row is in JSON format and the overall files are gzipped.
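
As an illustration, a minimal sketch of reading one of these files is shown below; the helper name read_jsonl_gz is illustrative, and error handling for malformed lines is left out for brevity:

import gzip
import json
from pathlib import Path

def read_jsonl_gz(path: Path):
    """Yield one parsed JSON object per line of a gzipped JSON-lines file."""
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            yield json.loads(line)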

Data will be arriving into the system at multiple intervals per day, as it lands from raw data sources, so the solution should be able to process this data as it arrives.

As this is a coding challenge, the data has been placed into a folder structure to simulate this arrival pattern that looks like the following:

date=2022-01-23/
    hour=0/
        customers.json.gz
        products.json.gz
        transactions.json.gz
        erasure-requests.json.gz
    hour=1/
        customers.json.gz
        transactions.json.gz

Put plainly, there are top-level folders for each date, which contain subfolders for each hour of that day, and within these hourly folders are the individual datasets. Not every hour contains every dataset, as explained in the individual dataset descriptions below.

Note: These paths are using the standard UNIX conventions for directories. Any solution provided should work in such an environment.
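
One way to enumerate the hourly folders under this layout is sketched below; it assumes the data root is a local directory and orders partitions by date and then hour:

from pathlib import Path

def iter_hourly_partitions(data_root: Path):
    """Yield (date, hour, directory) for every hour=* folder under a date=* folder."""
    partitions = []
    for hour_dir in data_root.glob("date=*/hour=*"):
        # Folder names follow the date=YYYY-MM-DD/hour=H convention shown above.
        date = hour_dir.parent.name.split("=", 1)[1]
        hour = int(hour_dir.name.split("=", 1)[1])
        partitions.append((date, hour, hour_dir))
    return sorted(partitions)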

Customer Dataset

The customer dataset is: customers.json.gz

An example entry, formatted for readability, looks like this:

{
  "id": "347984",
  "first_name": "Georgia",
  "last_name": "Lewis",
  "date_of_birth": "2009-09-27",
  "email": "[email protected]",
  "phone_number": "01632 960 972",
  "address": "Studio 99\nMorley tunnel",
  "city": "Alana Ville",
  "country": "United Kingdom",
  "postcode": "E09 9TW",
  "last_change": "2020-03-12",
  "segment": "sports"
}

The following constraints should be true for each entry:

  • The following fields should be populated:
    • id
    • first_name
    • last_name
    • email
  • id should be unique within the whole dataset

Customer data arrives throughout the day alongside transaction data.
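
A sketch of how these customer constraints might be checked is shown below; the quarantine policy (keeping the first occurrence of a duplicate id) is an assumption, not part of the challenge:

REQUIRED_CUSTOMER_FIELDS = ("id", "first_name", "last_name", "email")

def validate_customers(rows):
    """Split customer rows into (valid, quarantined) using the stated constraints."""
    valid, quarantined = [], []
    seen_ids = set()
    for row in rows:
        # Required fields must be present and non-empty; id must be unique.
        missing = [f for f in REQUIRED_CUSTOMER_FIELDS if not row.get(f)]
        if missing or row["id"] in seen_ids:
            quarantined.append(row)
            continue
        seen_ids.add(row["id"])
        valid.append(row)
    return valid, quarantined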

Products Dataset

The products dataset is: products.json.gz

An example entry, formatted for readability, looks like this:

{
  "sku": 23822,
  "name": "PHidyNvZH",
  "price": "25.00",
  "category": "vitamin",
  "popularity": 0.746141024720593
}

The following constraints should hold true for each entry:

  • All fields should be populated
  • popularity should always be a value above 0
  • sku should be unique within the whole dataset
  • price should resolve to a positive amount of currency

Product data arrives at the start of each day before customer or transactions data arrives. It can be seen as updates to the product database.
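
A corresponding sketch for the product constraints, using Decimal so that price strings such as "25.00" are checked as currency amounts; the caller is assumed to add each accepted sku to seen_skus:

from decimal import Decimal, InvalidOperation

PRODUCT_FIELDS = ("sku", "name", "price", "category", "popularity")

def is_valid_product(row: dict, seen_skus: set) -> bool:
    """Return True if the product row satisfies the stated constraints."""
    # All fields populated.
    if any(row.get(f) in (None, "") for f in PRODUCT_FIELDS):
        return False
    # sku unique within the dataset.
    if row["sku"] in seen_skus:
        return False
    # popularity strictly above zero.
    if not isinstance(row["popularity"], (int, float)) or row["popularity"] <= 0:
        return False
    # price must resolve to a positive currency amount.
    try:
        return Decimal(str(row["price"])) > 0
    except InvalidOperation:
        return False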

Transactions Dataset

The transactions dataset is: transactions.json.gz

An example formatted entry looks like this:

{
  "transaction_id": "6a8bb2c0-02f5-467a-8c83-6bb9a8b192b1",
  "transaction_time": "2022-07-01T16:05:08.618160",
  "customer_id": "325795",
  "delivery_address": {
    "address": "275 Nicole fall",
    "postcode": "E90 2FT",
    "city": "Maria Ville",
    "country": "United Kingdom"
  },
  "purchases": {
    "products": [
      {
        "sku": 71227,
        "quanitity": 1,
        "price": "30.98",
        "total": "30.98"
      }
    ],
    "total_cost": "30.98"
  }
}

The following constraints should be true for every entry:

  • The transaction_id field should be unique amongst all transactions
  • The customer_id should refer back to an existing customer in the customer dataset
  • The product sku entries should correspond to existing products in the product dataset
  • total_cost should match the total amount that all products purchased total up to

Transactions data arrives alongside customer data at multiple intervals during the day. For the sake of this challenge, you can consider it arriving once an hour. The transaction_time fields within each dataset may not align with the date or time they arrive on.
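
The total_cost check benefits from exact decimal arithmetic. A sketch of verifying a single transaction is shown below; known_customers and known_skus are assumed to be sets built from the customer and product datasets, and total_cost is assumed to equal the sum of the per-product total fields:

from decimal import Decimal

def is_consistent_transaction(txn: dict, known_customers: set, known_skus: set) -> bool:
    """Check referential integrity and that total_cost matches the product totals."""
    # customer_id must refer to an existing customer.
    if txn["customer_id"] not in known_customers:
        return False
    products = txn["purchases"]["products"]
    # Every sku must exist in the product dataset.
    if any(p["sku"] not in known_skus for p in products):
        return False
    # total_cost must equal the sum of the per-product totals.
    computed = sum(Decimal(p["total"]) for p in products)
    return computed == Decimal(txn["purchases"]["total_cost"])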

Erasure Dataset

The erasure dataset is: erasure-requests.json.gz

An example formatted entry looks like this:

{
  "customer-id": "325795",
  "email": "[email protected]"
}

The following constraints apply to every entry:

  • At least one of the two fields must be populated

Unlike other datasets that arrive multiple times per day, erasure requests are collected over the course of a day and arrive at the start of the next day.
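
A sketch of the anonymisation step is shown below. SHA-256 is an assumed choice of hash (the challenge does not mandate an algorithm), and PII_FIELDS is an assumed list of the customer fields to mask; each value is replaced with a hex digest so the row keeps its overall shape:

import hashlib

# Assumed set of personally identifiable fields from the customer dataset.
PII_FIELDS = ("first_name", "last_name", "email", "phone_number",
              "address", "city", "postcode")

def hash_value(value: str) -> str:
    """Replace a PII value with a deterministic SHA-256 hex digest."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

def anonymise_customer(row: dict) -> dict:
    """Return a copy of a customer row with its PII fields hashed."""
    masked = dict(row)
    for field in PII_FIELDS:
        if masked.get(field):
            masked[field] = hash_value(str(masked[field]))
    return masked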

Note

The project should satisfy the following non-functional requirements:

  • There should be a clearly documented way of demonstrating the project in the README.
  • The project should work on common Linux distributions and/or OSX.
  • If the project requires external platform dependencies, they should be available as a Docker container so the project can be easily tested.
  • The project should have test cases testing all core workflows.
  • Should support processing the data as it arrives. How this can be achieved using the application developed should be documented in the README.
  • It should follow best practices.

There are no restrictions on technologies used beyond using Python or Scala as the primary programming language for the project.

Solution

The solution writes processed data under a processed-data folder. To run it directly with Python:

python3 app.py

A Dockerfile is also included. To run the application with Docker, use the following commands:

docker build -t app .
docker run --rm app

This will create the processed-data folder inside the Docker container.
