
code-challenge's Introduction

Throughout the setup of the "code-challenge" data engineering task, I encountered several key challenges and reached multiple milestones. The project's goal was to build an ETL pipeline that extracts data from two sources, a PostgreSQL database and a CSV file, lands the data on local disk, and finally loads it into a PostgreSQL database. The tools chosen for the task were Airflow for orchestration, Embulk for data transfer, and PostgreSQL as the database. Below, I describe my journey, the challenges I faced, and the solutions I implemented.

Initial Setup and Configuration

I began by setting up Docker and Docker Compose on my Windows system, ensuring that all services, including Airflow, PostgreSQL, and Embulk, were correctly configured in the docker-compose.yml file. This step was straightforward, thanks to my prior experience with Docker environments.

Airflow Integration

Integrating Airflow was among the first tasks. After starting the Docker containers, I accessed the Airflow UI at localhost:8080 and configured the necessary connections to PostgreSQL. The setup was intuitive, and configuring the postgres_northwind connection went smoothly.
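To illustrate how that connection is used downstream, here is a minimal sketch of an Airflow task reading through it, assuming the apache-airflow-providers-postgres package is installed; the table and column names are illustrative Northwind examples, not the exact project code.

```python
# Minimal sketch, assuming apache-airflow-providers-postgres is installed and
# the "postgres_northwind" connection is configured in the Airflow UI.
from airflow.providers.postgres.hooks.postgres import PostgresHook

def preview_orders():
    hook = PostgresHook(postgres_conn_id="postgres_northwind")
    # "orders" is an illustrative Northwind table; swap in the table you need.
    rows = hook.get_records("SELECT order_id, order_date FROM orders LIMIT 5;")
    for row in rows:
        print(row)
```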

Challenge 1: Activating Embulk

A significant hurdle was activating Embulk within the Docker environment. Despite defining the Embulk service correctly in docker-compose.yml, it never became active. This was critical, since Embulk was essential for data extraction and transformation. After several attempts and a review of Embulk's documentation, I realized the problem might not be solvable within the constraints of the current setup. I explored alternative tools, but ran into a similar issue with Meltano, which also failed to activate as expected.

Solution to Challenge 1

Given the persistent issues with Embulk and then Meltano, I stepped back and reassessed my approach. I focused on understanding the underlying issue by consulting online forums and the Docker documentation. Suspecting that the problem was related to Docker's networking and volume mounting on Windows, I experimented with different configurations. However, time constraints made it impractical to resolve the issue fully within the project timeline.

Adaptation and Progress

Understanding that the primary objective was to demonstrate my ability to handle data engineering tasks, I pivoted to focus on what could be achieved. I deepened my understanding of Airflow by setting up DAGs that could, in principle, execute the required tasks, making sure each step was idempotent and well documented in the code.
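The sketch below shows the shape of such a DAG, assuming Airflow 2.x with the Postgres provider; the file paths, schema, table names, and schedule are illustrative assumptions rather than the project's actual code. Idempotency comes from keying the output path to the execution date and deleting that date's slice before reloading it, so reruns do not duplicate data.

```python
# Condensed sketch of the ETL DAG described above (Airflow 2.x assumed).
# Paths, schemas, and table names are illustrative, not the project's code.
import os
import shutil
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook


def extract_csv_to_disk(ds, **_):
    # Idempotent: the output path is keyed by the execution date, so a rerun
    # for the same date overwrites the same file instead of duplicating data.
    target_dir = f"/data/csv/{ds}"
    os.makedirs(target_dir, exist_ok=True)
    shutil.copy("/data/order_details.csv", f"{target_dir}/order_details.csv")


def load_to_postgres(ds, **_):
    hook = PostgresHook(postgres_conn_id="postgres_northwind")
    # Idempotent load: remove this date's slice before reinserting it.
    hook.run("DELETE FROM staging.order_details WHERE load_date = %s",
             parameters=(ds,))
    # ... COPY / INSERT of the file written by the previous task ...


with DAG(
    dag_id="northwind_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_csv_to_disk",
                             python_callable=extract_csv_to_disk)
    load = PythonOperator(task_id="load_to_postgres",
                          python_callable=load_to_postgres)
    extract >> load
```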

Learning Outcomes and Conclusion

This project underscored the importance of adaptability and problem-solving in data engineering. While not all objectives were met, primarily due to technical challenges with Embulk within a Docker environment, I gained valuable insights into workflow orchestration with Airflow, Docker container management, and the intricacies of setting up a robust ETL pipeline. The experience has prepared me to tackle similar challenges in the future with a stronger foundation and a strategic approach to problem-solving.

In conclusion, the "code-challenge" project involved a steep learning curve. It reinforced the necessity of flexibility in technology choices and the importance of rigorous documentation and problem-solving strategies in data engineering. Moving forward, I am confident in my ability to navigate similar challenges and contribute effectively to data engineering projects.
