This is the repository for the blog post Data Engineering Project: Stream Edition.
You will need to install:

- docker (make sure to have docker-compose as well)
- pgcli, to connect to our postgres instance
- git, to clone the starter repo
- Optional: tmux
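A quick way to confirm the required tools are installed is to check that each one is on your PATH. This is a minimal sketch; the tool names are taken from the list above, so adjust it if your binaries are named differently (e.g. `docker compose` as a plugin instead of `docker-compose`).

```shell
# Sanity-check that the required tools are installed and on PATH
missing=0
for tool in docker docker-compose pgcli git; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: missing"
    missing=$((missing + 1))
  fi
done
echo "$missing required tool(s) missing"
```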
The data is generated by a data generation script at src/main/scala/com.startdataengineering/ServerLogGenerator.scala.
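As a rough illustration of what the generator produces, a single server-log event could look something like the JSON line below. The field names here are hypothetical placeholders; the real schema lives in ServerLogGenerator.scala.

```shell
# Hypothetical example of one generated server-log event, for illustration only;
# the real field names and values are defined in ServerLogGenerator.scala
event_id=$$                                 # placeholder id (process id here)
event_ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)     # event time in UTC
log_line="{\"event_id\": \"$event_id\", \"timestamp\": \"$event_ts\"}"
echo "$log_line"
```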
Everything is dockerized. Run the commands below from the project directory.
docker-compose up -d # -d means run in detached mode (in the background)
docker ps # display all running containers
Do some manual checks using:
docker exec -t beginner_de_project_stream_kafka_1 kafka-console-consumer.sh --bootstrap-server :9092 --topic server-logs --from-beginning --max-messages 10 # used to check the first 10 messages in the server-logs topic
docker exec -t beginner_de_project_stream_kafka_1 kafka-console-consumer.sh --bootstrap-server :9092 --topic alerts --from-beginning --max-messages 10 # used to check the first 10 messages in the alerts topic
and
pgcli -h localhost -p 5432 -U startdataengineer events
The password is password.
select * from server_log limit 5; -- should match the first 5 from the server-logs topic
select count(*) from server_log; -- should be 100000
\q -- to exit pgcli
When you are done, take down all the running containers from the project directory using:
docker-compose down
website: https://www.startdataengineering.com/
twitter: https://twitter.com/start_data_eng