The Seattle Public Library publishes several datasets with millions of rows that we can use for analytics. We want to build a data analytics tool on top of this data source.
This dataset includes monthly snapshots of all of the physical items in the Seattle Public Library’s collection. Consistent monthly data begins with a snapshot taken August 1, 2016, continuing to the present. Additionally, this dataset contains snapshots taken on January 1 in the years 2012, 2013, 2014, and 2016.
This dataset consists of monthly checkout counts by title for all physical and digital items from 2005 to the present. It is, of course, a hefty dataset, with more than 25 million rows. Checkout data comes from multiple current and historical sources. For digital items, the media vendors OverDrive, hoopla, Freegal, and RBDigital provide usage data. For historical physical item checkouts from April 2005 through September 30, 2016, the source is the Legrady artwork data archives. From October 1, 2016, to the present, the source is the Horizon ILS.
This dataset includes a log of all physical item checkouts from the Seattle Public Library. The dataset begins with checkouts that occurred in April 2005. Renewals are not included.
Can we mimic the process described by altexsoft on data engineering and data pipelines?
- Use PySpark and the all-spark-notebook Docker image to complete your investigation.
- Use Streamlit and Docker to build your interactive dashboard.
- Potentially connect to the library's API for automatic updates.
- Use one of the Spark ML models, with a predictive idea of your team's creation, to assist the library or library users.
- Incorporate that predictive model into your dashboard.