This project takes taxi rank location data for Johannesburg, South Africa, and clusters the ranks geographically, so that a service station can be built to serve all the taxi ranks in each cluster.
Pre-requisites
To understand the code and perform the task, a basic knowledge of the following topics is assumed.
- Basic Matplotlib skills for plotting 2-D data clearly.
- Basic understanding of Pandas and how to use it for data manipulation.
- The basic concepts behind clustering algorithms. We will be working with K-Means, DBSCAN and HDBSCAN.
Outline
We will divide the project into 7 parts.
- Exploratory Data Analysis: Do some data cleaning and initial visualizations to get a sense of the data.
- Visualizing Geographical Data: Plot the data onto a geospatial map using the folium library.
- Clustering Strength / Performance Metric: Work with dummy data to understand how clustering (K-Means) works, and explore how the number of clusters influences performance.
- K-Means Clustering: Clustering data using K-Means and evaluating the clusters formed.
- DBSCAN: Density-Based Spatial Clustering of Applications with Noise.
- HDBSCAN: Hierarchical Density-Based Spatial Clustering of Applications with Noise.
- Addressing Outliers: Address the outliers left by HDBSCAN (also called singletons) and see how we can assign them to clusters.
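
The clustering steps in the outline can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: it assumes scikit-learn is installed, uses synthetic `make_blobs` points in place of the taxi rank coordinates, uses the silhouette score as the performance metric, and substitutes DBSCAN for HDBSCAN so that no extra package is needed. The parameter values (`eps`, `min_samples`, the range of `k`) and the nearest-neighbour rule for assigning outliers are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.neighbors import KNeighborsClassifier

# Dummy 2-D data standing in for (longitude, latitude) pairs (assumption).
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)

# 1. K-Means: compare the silhouette score for several cluster counts.
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get)

# 2. DBSCAN: density-based clustering; the label -1 marks noise points.
db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
noise = db_labels == -1

# 3. Assign each noise point to the cluster of its nearest clustered
#    neighbour (one possible way to handle outliers, assumed here).
final_labels = db_labels.copy()
if noise.any() and (~noise).any():
    knn = KNeighborsClassifier(n_neighbors=1)
    knn.fit(X[~noise], db_labels[~noise])
    final_labels[noise] = knn.predict(X[noise])

print("best k by silhouette:", best_k)
print("noise points before / after:", int(noise.sum()), int((final_labels == -1).sum()))
```

For real coordinates, plain Euclidean distance on longitude and latitude is only an approximation; a distance that accounts for the Earth's curvature may be more appropriate over larger areas.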
GitHub does not render the map visualizations. To see the 'Project Final.ipynb' notebook with its geospatial visualizations, open: https://github.com/maha-prathamesh/Clustering-Geolocation-Data-in-Python/blob/main/Project%20Final.ipynb