Stefen's Projects
Welcome to the Google Analytics 360 Dataset Project! This repository is designed for anyone interested in working with realistic Google Analytics data. Whether you're a data scientist, a student, or a marketing analyst
Extensive tutorials for the Advanced NLP Workshop at the Open Data Science Conference Europe 2020. We will leverage deep learning and deep transfer learning to solve popular tasks in NLP including Classification, Information Retrieval, Sentiment Analysis, Search Engines, Clustering, Paraphrase Mining, Summarization, Language Translation, Q&A systems
A pipeline for updating data between OLTP and OLAP environments
Azure Data Pipeline
We’ll explain Big O notation and use real-world Python examples to illustrate how it applies to various time complexities.
Data Engineering Practice Problems
dbt / Amazon Redshift Demonstration Project
Data Engineering with Apache Spark
This repository contains a collection of bash scripts for common DevOps tasks, such as installing software, setting up environments, and managing resources.
A directory of docker-compose files for quickly starting up infrastructure
Our project is a testament to this need, offering a comprehensive solution that combines modern technologies and architectures to create a powerful document search engine. This engine is not just a tool but a sophisticated ecosystem designed to handle complex data processing and retrieval tasks.
Apartments Data Pipeline using Airflow and Spark.
This project aims to move data from a relational database management system (RDBMS) to the Hadoop Distributed File System (HDFS).
The objective of this guide is to demonstrate how to automate the deployment of a data pipeline on AWS using Terraform. The pipeline will utilize AWS services such as Lambda, Glue, Crawler, Redshift, and
EventMusic Producer is a Dockerized application designed to read data and output them to a Kafka topic, using Avro schemas for data serialization. It integrates seamlessly with Kafka and the Schema Registry to manage the flow of event data linked to music event information.
A real-time flight status data pipeline built with Kafka, Schema Registry, Avro, GraphQL, Postgres, and React.
This script automates fetching emails from a user's Gmail account and storing them in a MongoDB database. The fetched emails are filtered by specific labels such as Promotions, Social, Updates, and Forums. The script is intended to run continuously, checking for new emails every minute.
The main motivation for this mini-project is to get familiar with using Bash Scripting and the AWS CLI to automate command line tasks. This particular repo contains a configuration script that automatically creates an EC2 instance, accesses it via SSH, installs dependencies and hosts a simple Flask application using the image taken from Docker Hub.
The goal is to develop an intuitive platform where users can search for Airbnb apartments based on a target city, budget, and duration of stay, all powered by the intelligent language model, GPT-3.
To provide a deeper understanding of how the modern, open-source data stack consisting of Iceberg, dbt, Trino, and Hive operates within a music streaming platform, let’s delve into the detailed workflow and benefits of each component.
Big data application for multi-source data ingestion
Jenkins Delta pipeline
In the following post, we will learn how to build a data pipeline using a combination of open-source software (OSS), including Debezium, Apache Kafka, and Kafka Connect.