Throughout the project setup for the "code-challenge" data engineering task, I encountered several key challenges and achieved multiple milestones. The project's goal was to construct an ETL pipeline that extracts data from two sources, a PostgreSQL database and a CSV file, stages the extracted data on local disk, and finally loads it into a PostgreSQL database. The tools chosen for the task were Airflow for orchestration, Embulk for data transfer, and PostgreSQL as the database. Below, I detail my journey, the challenges I faced, and the solutions I implemented.
Initial Setup and Configuration
I began by setting up Docker and Docker Compose on my Windows system, ensuring that all services, including Airflow, PostgreSQL, and Embulk, were correctly configured in the docker-compose.yml file. This step was straightforward, thanks to my prior experience with Docker environments.
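A quick way to sanity-check that the PostgreSQL service defined in the compose stack is reachable is a one-off connection test like the sketch below. The host, port, database name, and credentials here are assumptions and should match whatever the docker-compose.yml actually defines.

```python
# Minimal smoke test for the PostgreSQL service started by docker-compose.
# Host, port, database, user, and password are placeholders -- substitute the
# values actually defined for the postgres service in docker-compose.yml.
import psycopg2

conn = psycopg2.connect(
    host="localhost",   # or the compose service name when run from another container
    port=5432,
    dbname="northwind",
    user="postgres",
    password="postgres",
)
with conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone()[0])
conn.close()
```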
Airflow Integration
Integrating Airflow was among the initial tasks. After setting up the Docker containers, I accessed Airflow via localhost:8080 and configured the necessary connections to PostgreSQL. The setup was intuitive, and configuring the postgres_northwind connection went smoothly.
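With the connection registered, any task running inside the Airflow environment can reference it by its conn_id. As a minimal sketch, assuming the apache-airflow-providers-postgres package is installed, a quick query through a PostgresHook confirms the connection resolves:

```python
# Minimal check that the postgres_northwind connection resolves and works.
# Run inside the Airflow environment (e.g. from a scratch task or
# `airflow tasks test`); assumes apache-airflow-providers-postgres is installed.
from airflow.providers.postgres.hooks.postgres import PostgresHook

hook = PostgresHook(postgres_conn_id="postgres_northwind")
rows = hook.get_records("SELECT table_name FROM information_schema.tables LIMIT 5")
print(rows)
```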
Challenge 1: Activating Embulk
A significant hurdle was activating Embulk within the Docker environment. Despite correctly declaring the Embulk service in the docker-compose.yml file, it failed to become active. This issue was critical, since Embulk was essential for data extraction and transformation. After several attempts and a review of Embulk's documentation, I realized the problem might not be solvable within the current setup constraints. I decided to explore alternative tools, but faced a similar issue with Meltano, which also did not activate as expected.
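For reference, had the Embulk container come up, one natural way to wire it into the pipeline would have been to trigger it from an Airflow task rather than run it by hand. The sketch below is only that intent, not working code from the project: the DAG id and config path are hypothetical, and in a fully containerised setup the call might instead go through a DockerOperator or a docker compose exec.

```python
# Sketch of how an Embulk job could be triggered from Airflow, had the
# service come up. The DAG id and config path are hypothetical; `embulk run`
# is Embulk's standard CLI entry point for executing a job definition.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="embulk_extract_sketch",
    start_date=datetime(2024, 1, 1),
    schedule=None,   # Airflow 2.4+; use schedule_interval=None on older versions
    catchup=False,
) as dag:
    run_embulk = BashOperator(
        task_id="embulk_extract_northwind",
        bash_command="embulk run /opt/embulk/config/northwind_to_csv.yml",
    )
```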
Solution to Challenge 1
Given the persistent issues with Embulk and then Meltano, I decided to take a step back and reassess my approach. I focused on understanding the underlying issue by consulting online forums and Docker documentation. Realizing that the problem might be related to Docker's networking and volume mounting behaviour on Windows, I experimented with different configurations. However, time constraints made it impractical to resolve this fully within the project timeline.
Adaptation and Progress
Understanding that the primary objective was to showcase my ability to handle data engineering tasks, I pivoted my approach to focus on what could be achieved. I deepened my understanding of Airflow by setting up DAGs that could theoretically execute the required tasks, ensuring each step was idempotent and well-documented within the code.
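A simplified sketch of the kind of DAG described above is shown below. Table names, file paths, and the target connection id (postgres_target) are placeholders; only postgres_northwind corresponds to the connection configured earlier. The truncate-before-load pattern is what keeps the load step idempotent: re-running the DAG overwrites the staged file and reloads the table without duplicating rows.

```python
# Simplified sketch of the extract -> stage -> load DAG described above.
# Table names, file paths, and the target connection id are placeholders;
# only postgres_northwind corresponds to the connection configured earlier.
from datetime import datetime

import pandas as pd
from airflow import DAG
from airflow.decorators import task
from airflow.providers.postgres.hooks.postgres import PostgresHook

STAGING_PATH = "/data/staging/orders.csv"  # hypothetical local-disk staging location

with DAG(
    dag_id="northwind_etl_sketch",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:

    @task
    def extract_to_disk() -> str:
        # Idempotent: re-running simply overwrites the staged file.
        hook = PostgresHook(postgres_conn_id="postgres_northwind")
        df = hook.get_pandas_df("SELECT * FROM orders")
        df.to_csv(STAGING_PATH, index=False)
        return STAGING_PATH

    @task
    def load_to_warehouse(path: str) -> None:
        # Idempotent: the target table is truncated before each load, so
        # repeated runs do not duplicate rows. "postgres_target" is a
        # placeholder connection id for the destination database.
        df = pd.read_csv(path)
        hook = PostgresHook(postgres_conn_id="postgres_target")
        hook.run("TRUNCATE TABLE public.orders_staging")
        hook.insert_rows(
            table="public.orders_staging",
            rows=df.itertuples(index=False, name=None),
            target_fields=list(df.columns),
        )

    load_to_warehouse(extract_to_disk())
```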
Learning Outcomes and Conclusion
This project underscored the importance of adaptability and problem-solving in data engineering. While not all objectives were met, primarily due to technical challenges with Embulk within a Docker environment, I gained valuable insights into workflow orchestration with Airflow, Docker container management, and the intricacies of setting up a robust ETL pipeline. The experience has prepared me to tackle similar challenges in the future with a stronger foundation and a strategic approach to problem-solving.
In conclusion, the "code-challenge" project was a significant learning experience. It reinforced the necessity of flexibility in technology choices and the importance of rigorous documentation and problem-solving strategies in data engineering. Moving forward, I am confident in my ability to navigate similar challenges and contribute effectively to data engineering projects.