Coder Social home page Coder Social logo

trellixvulnteam / k_means_project_ddzh Goto Github PK

View Code? Open in Web Editor NEW

This project forked from rhynerj/k_means_project

0.0 0.0 0.0 450 KB

An implementation and application of the k-means algorithm. Also includes PCA implentation for dimensionality reduction. Used to analyze a genetic dataset and a psychological dataset.

Python 14.07% Jupyter Notebook 85.93%

k_means_project_ddzh's Introduction

READ ME

K-means Project

This repository stores files for the a project that implements and evaluates the k-means algorithm. It was originally created as the final project in a Mathematics of Data Models course. This repository includes:

  • a Python script with:
    • code for extracting and cleaning a genetic and a numerical dataset
    • principal component analysis to reduce data dimensonality
    • a custom-engineered k-means algorithm to conduct unsupervised machine learning on data
    • code for visualizations of results
  • final report interpreting results
  • a Jupyter Notebook with all code, results, and analysis
  • a file with various tests run during development

K-means.py Imprementation Instructions

Once downloaded, the k-means.py file can be run in the command line. Using argparse, the main function takes in three required arguments, in order: function to run ("elbow" or "cluster"), the name of the dataset to retrieve, and k (k represents the maximum k to try if the elbow function is being called or the k value to use in k-means if the cluster function is being called). The main function also takes in five optional arguments: number of principle components (default: 2), maximum times to loop in k-means (default: 100), x-axis label (default: x), y-axis label (default: y), and title (default: title). The number of principle components is only adjustable with the elbow function, while the axis and title labels are only used with the cluster function.

If the elbow function is called, input data is set to the specified dataset and the clean function is called. The cleaned_data function takes in a dataset name, and uses an if statement to call the retrieval function associated with that dataset. The retrieval function for each dataset downloads that dataset from its online location, extracts the files within, and reads the data file into an array. This file downloading portion was hard-coded for testing purposes, and unfortunately, I ran out of time to fix this, so the command-line arguments only works for the two datasets I worked with. To run the code with another dataset, the cleaned_data function could be updated with another if statement to retrieve data using custom clean functions to parse that data into a numpy array. The min-max scaling, PCA, and k-means portions of the code all accept numpy arrays, so the code should work as long as the data retrieval function reads the data into a numpy array.

In the cleaned_data function, the data is scaled using min-max scaling, and the specific number of principle components is chosen. Under the elbow cluster condition, preprocessed data is passed into elbow_plot function, which runs k-means on the data using k-values ranging from zero to the given maximum k-value. The inertia (SSE) for each value of k is determined (by creating a k-means instance, running the fit function on it, and extracting the SSE value) and k-values and inertias are then plotted on a line graph, which can be used to identify the optimal value of k.

Using this value of k, k-means clustering can then be visualized and assessed using the plot_clusters function. The plot_clusters function is called when “cluster” is passed into the command line. The plot_clusters function calls cleaned_data internally. This was intended to force the number of principle components selected to two so that the data points could be plotted, and I ran out of time to find a better way to do this. The plot_clusters function creates a k-means instance with the specified k and uses that to plot all data points colored by cluster (using a randomly generated unique color) and the centroids. The plot also displays the number of data points included in each cluster in the legend and the SSE at the bottom of the plot.

Due to time constraints, this is the only functionality I was able to successfully implement in the main function. However, the “Preprocessing” and “K-means” sections are very modular and can easily be run on any input data that has been formatted into a numpy array. These sections are further explained below:

Preprocessing

This section includes the min_max_scale and pca functions. The min_max_scale function normalizes the given data using min-max scaling. It is modeled after MinMaxScaler() in sklearn. The pca function conducts PCA on the given data and returns the data reduced to the specific number of components.

K-means

This section includes the k-means algorithm as a class. An instance of k-means is initialized with a specific k, and an optional maximum looping parameter. The default for max_loop is 100. Once a k-means instance has been created, the k-means algorithm is run be calling k-means.fit, which takes in data as a numpy array. It randomly selects k rows in the data array as initial centroids. It calculates the new centroids by calculating the distance from each data point to each centroid and storing that in an array, finding the minimums of that array, and the using those minimums to update each centroid to the mean of the data points assigned to its cluster. The cluster assignments and centroids are both stored as attributes of the k-means instance for later retrieval. Looping continues until the centroids at the start of the loop match the calculated mean centroids or the max number of loops is reached. Once the loop is finished, the inertia (SSE) is calculated and the cluster sizes are determined and both stored as k-means attributes.

k_means_project_ddzh's People

Contributors

rhynerj avatar trellixvulnteam avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.