kartik-banga / automated-etl-pipeline-for-playstore-data

An ETL pipeline on AWS for Play Store data, built with Lambda, Glue Crawlers, and Glue ETL Jobs and orchestrated with Step Functions. The pipeline cleans the review data, merges it with the app data, and stores the result in S3 for analysis.

aws-glue aws-glue-crawler aws-lambda aws-s3 aws-step-functions cloud-computing data-analysis data-cleaning data-engineering data-visualization

Automated-ETL-Pipeline-for-Playstore-Data

I recently completed an end-to-end data pipeline project that combines AWS services with Power BI for analytics on Play Store data. The project comprised six phases:

  1. Data Cleaning with AWS Lambda: Used AWS Lambda functions running Python scripts (with Pandas and NumPy) to clean the 'playstore_review' dataset. The cleaned dataset was then stored in an S3 bucket for the subsequent stages of the pipeline.

  2. Metadata Extraction with AWS Glue Crawlers: Used AWS Glue Crawlers to extract metadata from the cleaned review dataset, giving the rest of the pipeline a structured foundation to work from.

  3. ETL with PySpark in AWS Glue Job: Ran a PySpark script within an AWS Glue Job that combines SQL queries issued through 'spark.sql' with Spark functions to perform an inner join between the 'playstore_apps' dataset and the cleaned 'playstore_review' dataset. Besides merging the two datasets, this step applied further SQL-based data cleaning to the result.

  4. Automated Workflow with AWS Step Functions: Used AWS Step Functions to orchestrate and automate the entire pipeline, running the Lambda functions, Glue jobs, and other processes as a single end-to-end workflow.

  5. Storage and Accessibility on S3:

    Stored the final cleaned and merged dataset in an S3 bucket in CSV format, which keeps the storage scalable and the data accessible.

    • Amazon Redshift was not used, primarily because of cost. At the scale and scope of this project, Amazon S3 was the more cost-effective storage option and still met the analytical requirements.

  6. Power BI Analysis for Actionable Insights: Built Power BI dashboards and reports on the final dataset to analyze Play Store apps and support decision-making.
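The cleaning in step 1 could look roughly like the sketch below. It is not the project's actual script: the column names ('App', 'Translated_Review'), the specific cleaning rules, and the event keys read by the handler are all assumptions made for illustration. Pandas and NumPy are not part of the default Lambda Python runtime, so they would have to be supplied separately, for example through a Lambda layer.

```python
import io

import numpy as np
import pandas as pd

# Assumed column names; the source does not list the dataset's columns.
TEXT_COLUMNS = ("App", "Translated_Review")


def clean_reviews(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy of the raw review data (the input is not modified)."""
    cleaned = df.copy()
    for col in TEXT_COLUMNS:
        # Trim surrounding whitespace; non-string values such as NaN pass through.
        stripped = cleaned[col].map(lambda v: v.strip() if isinstance(v, str) else v)
        # Treat empty strings and the literal text "nan" as missing values.
        cleaned[col] = stripped.where(~stripped.isin(["", "nan"]), np.nan)
    # Drop rows missing an app name or review text, then drop exact duplicates.
    cleaned = cleaned.dropna(subset=list(TEXT_COLUMNS)).drop_duplicates()
    return cleaned.reset_index(drop=True)


def lambda_handler(event, context):
    """Hypothetical handler: the event keys used below are illustrative assumptions."""
    import boto3  # imported here so clean_reviews can be exercised without AWS

    s3 = boto3.client("s3")
    body = s3.get_object(Bucket=event["bucket"], Key=event["raw_key"])["Body"].read()
    cleaned = clean_reviews(pd.read_csv(io.BytesIO(body)))
    s3.put_object(
        Bucket=event["bucket"],
        Key=event["clean_key"],
        Body=cleaned.to_csv(index=False).encode("utf-8"),
    )
    return {"rows": len(cleaned), "clean_key": event["clean_key"]}
```

Keeping the cleaning logic in a plain function, separate from the S3 reads and writes, makes it possible to check the rules locally on a small DataFrame before deploying.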
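For step 2, a crawler can be created and started from Python with the boto3 Glue client. The sketch below is one possible way to do it, not the project's configuration: the crawler name, IAM role, database, table, and S3 path are placeholders. The client is passed in as an argument, so in real use it would be `boto3.client("glue")`.

```python
def build_crawler_request(name, role_arn, database, s3_path):
    """Build the keyword arguments for the Glue create_crawler call."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_start_crawler(glue, request):
    """Create the crawler, then start it.

    create_crawler fails if a crawler with this name already exists,
    so this sketch is only suitable for a first-time setup.
    """
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


def describe_columns(glue, database, table):
    """Return (column name, type) pairs the crawler recorded in the Data Catalog."""
    response = glue.get_table(DatabaseName=database, Name=table)
    columns = response["Table"]["StorageDescriptor"]["Columns"]
    return [(col["Name"], col["Type"]) for col in columns]
```

Once the crawler has finished, `describe_columns` reads back the schema it inferred, which is the metadata the later ETL job relies on.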
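The inner join in step 3 can be sketched as follows. The view names match the dataset names above, but the column names and the `Rating IS NOT NULL` filter are assumptions standing in for the project's real cleaning rules. The function expects an existing SparkSession and two DataFrames; in a Glue job the session is typically obtained from the job's `GlueContext`.

```python
# Assumed columns: App, Category, Rating on the apps side;
# App, Translated_Review, Sentiment on the reviews side.
JOIN_SQL = """
SELECT a.App, a.Category, a.Rating, r.Translated_Review, r.Sentiment
FROM playstore_apps AS a
INNER JOIN playstore_review AS r
  ON a.App = r.App
WHERE a.Rating IS NOT NULL
"""


def join_apps_and_reviews(spark, apps_df, reviews_df):
    """Register both DataFrames as temporary views and join them with spark.sql.

    An inner join keeps only apps that have at least one review and
    reviews whose app appears in the apps dataset.
    """
    apps_df.createOrReplaceTempView("playstore_apps")
    reviews_df.createOrReplaceTempView("playstore_review")
    return spark.sql(JOIN_SQL)
```

The returned DataFrame can then be processed further with Spark functions and written to S3 as CSV, as described in step 5.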
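The orchestration in step 4 is defined in Amazon States Language. The sketch below builds one plausible definition as a Python dictionary: clean the reviews with Lambda, start the crawler, poll until it is ready, then run the Glue job. The state names, the polling interval, and the exact sequence are assumptions, since the source does not show the project's state machine.

```python
import json


def build_definition(clean_lambda_arn, crawler_name, glue_job_name, poll_seconds=30):
    """Return an Amazon States Language definition as a plain dictionary."""
    return {
        "Comment": "Clean reviews, crawl the result, then run the Glue ETL job",
        "StartAt": "CleanReviews",
        "States": {
            "CleanReviews": {
                "Type": "Task",
                "Resource": "arn:aws:states:::lambda:invoke",
                "Parameters": {"FunctionName": clean_lambda_arn, "Payload.$": "$"},
                "ResultPath": "$.clean",
                "Next": "StartCrawler",
            },
            "StartCrawler": {
                "Type": "Task",
                "Resource": "arn:aws:states:::aws-sdk:glue:startCrawler",
                "Parameters": {"Name": crawler_name},
                "ResultPath": None,  # serialized as null: discard the result, keep the input
                "Next": "WaitForCrawler",
            },
            "WaitForCrawler": {"Type": "Wait", "Seconds": poll_seconds, "Next": "GetCrawler"},
            "GetCrawler": {
                "Type": "Task",
                "Resource": "arn:aws:states:::aws-sdk:glue:getCrawler",
                "Parameters": {"Name": crawler_name},
                "ResultPath": "$.crawler",
                "Next": "CrawlerReady",
            },
            "CrawlerReady": {
                "Type": "Choice",
                # The crawler reports READY once it is idle again.
                "Choices": [{
                    "Variable": "$.crawler.Crawler.State",
                    "StringEquals": "READY",
                    "Next": "RunEtlJob",
                }],
                "Default": "WaitForCrawler",
            },
            "RunEtlJob": {
                "Type": "Task",
                # The .sync suffix makes the state wait for the job run to finish.
                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                "Parameters": {"JobName": glue_job_name},
                "End": True,
            },
        },
    }


def to_json(definition):
    """Serialize the definition for the Step Functions create_state_machine call."""
    return json.dumps(definition, indent=2)
```

The JSON string from `to_json` is what would be passed as the `definition` argument when creating the state machine. The wait-and-check loop is there because starting a crawler returns immediately, so the workflow has to poll before the ETL job can rely on the catalog.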
