This is a personal project to develop my data engineering skills and to showcase Reddit discussions of the fifty largest cities across the USA. Work on this project is ongoing and involves the following steps:
- Pull Reddit submissions and comments from the Reddit API via PRAW once per day.
- Design JSON format for data storage.
- Stream JSON files into MongoDB database.
- Design Plotly visualization for data presentation.
- Deploy Python scripts to AWS EC2 instance.
- Deploy AWS DocumentDB cluster.
- Set up EC2 cron job for daily stream of Reddit data.
- Connect EC2 instance to AWS DocumentDB cluster.
- Create Bash script to automatically push Plotly-generated visualizations to GitHub Pages.
- Present HTML Plotly visualization on portfolio website.
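
As a rough illustration of the "design JSON format" step, here is a minimal sketch of how one submission and its comments might be flattened into a single JSON document before being streamed into MongoDB. The field names, the `submission_to_document` and `to_json_lines` helpers, and the sample data are all assumptions made for this example, not the project's actual schema. Plain dicts stand in for PRAW objects so the snippet runs without Reddit credentials; with PRAW, the values would come from attributes such as `submission.id`, `submission.title`, `submission.score`, `submission.created_utc`, and `comment.body`.

```python
import json
from datetime import datetime, timezone


def submission_to_document(city, submission, comments):
    """Flatten one submission and its comments into a JSON-serializable dict.

    `submission` and `comments` are plain dicts in this sketch (hypothetical
    stand-ins for PRAW Submission and Comment objects).
    """
    return {
        # Reusing the Reddit id as _id means a re-pulled submission maps to
        # the same MongoDB document key (an assumed design choice).
        "_id": submission["id"],
        "city": city,
        "title": submission["title"],
        "score": submission["score"],
        # Reddit reports creation time as a Unix timestamp in UTC.
        "created": datetime.fromtimestamp(
            submission["created_utc"], tz=timezone.utc
        ).isoformat(),
        "comments": [
            {"id": c["id"], "body": c["body"], "score": c["score"]}
            for c in comments
        ],
    }


def to_json_lines(documents):
    """Serialize documents as newline-delimited JSON, one document per line."""
    return "\n".join(json.dumps(d, sort_keys=True) for d in documents)


# Hypothetical sample data for illustration only.
sample = submission_to_document(
    "Chicago",
    {"id": "abc123", "title": "Best deep dish?", "score": 42, "created_utc": 0},
    [{"id": "c1", "body": "Try Pequod's.", "score": 7}],
)
print(to_json_lines([sample]))
```

Writing one document per line keeps each daily pull appendable to a file, and the resulting dicts could then be loaded into a collection with a MongoDB driver such as `pymongo`. Since DocumentDB exposes a MongoDB-compatible API, the same document shape should carry over to the AWS deployment, though that is untested here.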