Country_data_clustering

The objective of this project was the categorisation of world countries using socio-economic and health factors that indicate the overall development of the country.

The dataset was taken from Kaggle: https://www.kaggle.com/datasets/rohan0301/unsupervised-learning-on-country-data.

Data for 167 countries are provided in CSV format (Country_data_clustering), with the following features:

country: Name of the country;

child_mort: Deaths of children under 5 years old per 1000 live births;

exports: Exports of goods and services per capita, given as a percentage of the GDP per capita;

health: Total health spending per capita, given as a percentage of the GDP per capita;

imports: Imports of goods and services per capita, given as a percentage of the GDP per capita;

income: Net income per person;

inflation: The measurement of the annual growth rate of the total GDP;

life_expec: The average number of years a newborn child would live if the current mortality pattern remained the same;

total_fer: The number of children that would be born to each woman if the current age-fertility rates remained the same;

gdpp: The GDP per capita, calculated as the total GDP divided by the total population.

I applied the k-means algorithm in the notebook Country_data_clustering_kmeans.ipynb, and DBSCAN and Birch in the notebook Country_data_clustering_DBSCAN_Birch.ipynb.

First, I imported the libraries and read the dataset. Then I explored the dataset by looking at the main summary statistics and computing the correlation matrix for all the numerical features.
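The exploration step can be sketched as follows. This is a minimal illustration, assuming pandas is used; the DataFrame below is a small synthetic stand-in with made-up values (only three of the feature names are reused), not the Kaggle data, and the notebook may proceed differently.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real CSV (hypothetical values, NOT the Kaggle data).
# With the downloaded file, df would instead come from pd.read_csv(<path to the CSV>).
rng = np.random.default_rng(0)
n = 50
income = rng.uniform(500, 60000, n)
df = pd.DataFrame({
    "country": [f"country_{i}" for i in range(n)],
    "income": income,
    "gdpp": income * 0.8 + rng.normal(0, 1000, n),
    "child_mort": 120 - income / 600 + rng.normal(0, 5, n),
})

# Keep only the numerical columns (drops the country name).
numeric = df.select_dtypes(include="number")

# Main summary statistics: count, mean, std, min, quartiles, max.
summary = numeric.describe()

# Pearson correlation between every pair of numerical features.
corr = numeric.corr()

print(summary.round(2))
print(corr.round(2))
```

In this synthetic example the correlation matrix shows income and gdpp moving together and child_mort moving against both, which is the kind of relationship the matrix is meant to expose.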


I plotted the countries on maps of the world and of Europe, colored by their value for each feature. The interactive plots can be found at the following links:

Afterwards, I drew a violin plot to show the distribution of the values of each feature. I scaled the data and applied the K-means algorithm, plotting the inertia and the silhouette score for each candidate number of clusters:
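A minimal sketch of this step with scikit-learn is given below. The choice of StandardScaler, the range of cluster counts, and the synthetic make_blobs data (167 rows, 9 features, standing in for the real table) are assumptions made for illustration; the notebook may use a different scaler or range.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as the dataset (NOT the real data).
X, _ = make_blobs(n_samples=167, n_features=9, centers=3, random_state=0)

# Scale every feature to zero mean and unit variance (assumed scaler).
X_scaled = StandardScaler().fit_transform(X)

inertias, silhouettes = {}, {}
for k in range(2, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    # Inertia: sum of squared distances of samples to their closest centroid.
    inertias[k] = model.inertia_
    # Silhouette: mean of per-sample scores in [-1, 1]; higher is better.
    silhouettes[k] = silhouette_score(X_scaled, model.labels_)

for k in inertias:
    print(k, round(inertias[k], 1), round(silhouettes[k], 3))
```

Plotting inertias against k gives the elbow curve, and plotting silhouettes against k gives the second criterion discussed here.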

According to the inertia plot, the optimal number of clusters is 4, since the curve has an "elbow" at 4 clusters. The silhouette score is also high at 4 clusters. Nevertheless, I chose 3 clusters, since with 3 the algorithm better isolates the countries that need more help.
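One possible way to check which cluster groups the countries most in need is to compare the per-cluster means of the original features. The sketch below is only an illustration under stated assumptions: the three groups are synthetic, only two feature names are reused, and using the highest mean child mortality as the criterion is a choice made for this example, not necessarily the one used in the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Three hypothetical groups of countries (synthetic values, NOT the real data).
groups = [
    {"income": 2000, "child_mort": 90, "size": 20},
    {"income": 12000, "child_mort": 25, "size": 20},
    {"income": 45000, "child_mort": 5, "size": 20},
]
frames = []
for g in groups:
    frames.append(pd.DataFrame({
        "income": rng.normal(g["income"], g["income"] * 0.1, g["size"]),
        "child_mort": rng.normal(g["child_mort"], 2.0, g["size"]),
    }))
df = pd.concat(frames, ignore_index=True)

# Scale, then fit K-means with the chosen number of clusters (3).
X_scaled = StandardScaler().fit_transform(df)
df["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)

# Per-cluster means in the original units make the clusters interpretable.
profile = df.groupby("cluster").mean()

# Assumed criterion: the cluster with the highest mean child mortality.
neediest = profile["child_mort"].idxmax()

print(profile.round(1))
print("cluster with highest child mortality:", neediest)
```

Cluster labels returned by K-means are arbitrary integers, so the profile table, rather than the label itself, is what identifies each group.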

Next, I made an interactive plot that visualizes the clusters (shown in 3 different colors) more clearly. Both the static and the interactive versions are available below (click on the link below the figure).

Each feature can be restricted to a range of values by clicking on the bar associated with that feature and releasing once the desired range is selected.

Features vs Labels Kmeans: Interactive Plot

Click here to check the interactive plot --> Features vs Labels Kmeans: Interactive Plot

I also plotted the clusters on the globe. Each cluster groups countries that have similar development conditions.

Kmeans: Needed Help Per Country

Click here to check the interactive plot --> Kmeans: Needed Help Per Country

Finally, I drew a matrix of pairwise scatter plots highlighting the 3 clusters, which shows how they are separated in feature space.

Kmeans clustering scatterplots

Click here to download the plot --> Kmeans: scatterplots

DBSCAN and Birch were also applied (see the notebook Country_data_clustering_DBSCAN_Birch.ipynb), with the following results:

DBSCAN: Needed Help Per Country

Birch: clustering scatterplots

Note: The interactive plots and the other graphs shown above for Kmeans are also available for the other algorithms in the notebooks.

It can be observed that DBSCAN labeled a considerable number of countries as outliers, even though different hyperparameters were tested.

Using Birch, the result is similar to that of Kmeans, apart from a few countries that were assigned to different classes.
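The effect of the hyperparameters on the number of outliers can be sketched as below. The data are synthetic (three dense blobs plus scattered points), and the eps values and min_samples=5 are illustrative assumptions, not the settings tested in the notebook. scikit-learn's DBSCAN marks noise points with the label -1.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: three dense groups plus scattered points (NOT the real data).
X, _ = make_blobs(n_samples=150, n_features=2, centers=3,
                  cluster_std=0.5, random_state=0)
rng = np.random.default_rng(0)
X = np.vstack([X, rng.uniform(-15, 15, size=(17, 2))])
X_scaled = StandardScaler().fit_transform(X)

noise_counts = {}
for eps in (0.1, 0.3, 0.5):  # illustrative neighborhood radii
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X_scaled)
    # Points labeled -1 belong to no cluster (outliers / noise).
    noise_counts[eps] = int(np.sum(labels == -1))
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    print(f"eps={eps}: {n_clusters} clusters, {noise_counts[eps]} noise points")
```

For a fixed min_samples, a larger eps can only reduce the number of noise points, so a count of outliers that stays high across several settings suggests the data contain genuinely isolated observations.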
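The agreement between two clusterings can be quantified, for example, with the adjusted Rand index. The sketch below is an assumption-laden illustration: the data are synthetic, Birch is run with its default threshold, and this metric is not stated to be the one used in the notebook.

```python
from sklearn.cluster import Birch, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as the dataset (NOT the real data).
X, _ = make_blobs(n_samples=167, n_features=9, centers=3, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)
birch_labels = Birch(n_clusters=3).fit_predict(X_scaled)

# The adjusted Rand index ignores how labels are numbered:
# 1.0 means identical partitions, values near 0 mean chance-level agreement.
ari = adjusted_rand_score(kmeans_labels, birch_labels)
print(f"adjusted Rand index: {ari:.3f}")
```

Because cluster numbering differs between algorithms, a permutation-invariant score like this is safer than comparing the label arrays element by element.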
